Sqoop with Oracle



Sqoop with Oracle - Reference data in RDBMS

Sqoop with Oracle - Hadoop for off-line analytics

Sqoop with Oracle - Hadoop for RDBMS archive

Sqoop with Oracle - MapReduce results to RDBMS

SQOOP Details

  • SQOOP import
    • Divide table into ranges using primary key max/min
    • Create mappers for each range
    • Mappers write to multiple HDFS nodes
    • Creates text or sequence files
    • Generates Java class for resulting HDFS file
    • Generates Hive definition and auto-loads into Hive
  • SQOOP export
    • Read files in HDFS directory via MapReduce
    • Bulk parallel insert into database table
  • SQOOP features:
    • Compatible with almost any JDBC enabled database
    • Auto load into Hive
    • HBase support
    • Special handling for database LOBs
    • Job management
    • Cluster configuration (jar file distribution)
    • WHERE clause support
    • Open source, and included in Cloudera distributions
  • SQOOP fast paths & plug ins
    • Invoke mysqldump, mysqlimport for MySQL jobs
    • Similar fast paths for PostgreSQL
    • Extensibility architecture for 3rd parties (like Quest)
    • Teradata, Netezza, etc.
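
The import steps above (divide the table into ranges from the primary key min/max, then create one mapper per range) can be sketched roughly as follows. This is a minimal illustration assuming an integer key, not Sqoop's actual splitter code; the function names, the ORDER_ID column, and the key bounds are hypothetical.

```python
from typing import List, Tuple


def split_ranges(min_key: int, max_key: int, num_mappers: int) -> List[Tuple[int, int]]:
    """Divide the closed interval [min_key, max_key] into contiguous
    closed ranges of near-equal width, at most one per mapper."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_key < min_key:
        raise ValueError("max_key must not be smaller than min_key")

    total = max_key - min_key + 1
    count = min(num_mappers, total)      # never create empty ranges
    base, extra = divmod(total, count)   # the first `extra` ranges get one more key

    ranges = []
    lo = min_key
    for i in range(count):
        hi = lo + base + (1 if i < extra else 0) - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges


def range_predicate(column: str, lo: int, hi: int) -> str:
    """Build the WHERE fragment a single mapper would use for its range."""
    return f"{column} >= {lo} AND {column} <= {hi}"


# Hypothetical table whose key runs from 1 to 1000, imported with 4 mappers.
for lo, hi in split_ranges(1, 1000, 4):
    print(range_predicate("ORDER_ID", lo, hi))
```

Each printed predicate corresponds to the query one mapper would run; the ranges are equal in key width, which says nothing about how many rows each one actually holds.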

    Working with Oracle

  • SQOOP approach is generic and applicable to all RDBMS
  • However, for Oracle it is sub-optimal in some respects:
    • Oracle may parallelize and serialize individual mappers
    • Oracle optimizer may decline to use index range scans
    • Oracle physical storage is often deliberately not in primary key order (reverse key indexes, hash partitioning, etc.)
    • Primary keys are often not evenly distributed
    • Index range scans use single-block random reads
      • vs. faster multi-block table scans
    • Index range scans load into the Oracle buffer cache
      • Pollutes the cache, increasing IO for other users
      • Limited help to SQOOP since rows are only read once
  • Luckily, SQOOP extensibility allows us to add optimizations for specific targets
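
To see why unevenly distributed primary keys hurt a min/max split, the sketch below counts how many rows would land in each equal-width key range. The key set is made up for illustration (a dense block of old rows plus a thin spread of higher keys), and rows_per_mapper is a hypothetical helper, not part of Sqoop or OraOop.

```python
from typing import List


def rows_per_mapper(keys: List[int], num_mappers: int) -> List[int]:
    """Count the rows falling into each of num_mappers equal-width
    key ranges spanning min(keys)..max(keys)."""
    lo, hi = min(keys), max(keys)
    span = hi - lo + 1
    counts = [0] * num_mappers
    for key in keys:
        # Integer arithmetic maps each key to its range index 0..num_mappers-1.
        counts[(key - lo) * num_mappers // span] += 1
    return counts


# Hypothetical data: 900 rows packed into keys 1..900,
# plus 100 rows spread thinly over keys 1000, 2000, ..., 100000.
dense = list(range(1, 901))
sparse = list(range(1_000, 100_001, 1_000))
keys = dense + sparse

print(rows_per_mapper(keys, 4))  # prints [925, 25, 25, 25]
```

With this distribution one mapper reads 925 of the 1000 rows while the other three read 25 each, so the job runs little faster than a single mapper would.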

    Oracle parallelism

    Index range scans

    Oracle Ideal architecture

    SQOOP/OraOop best practices

  • Use sequence files for LOBs, or set --inline-lob-limit
  • Directly control datanodes for widest destination bandwidth
    • Can’t rely on mapred.max.maps.per.node
  • Set number of mappers realistically
  • Disable speculative execution (the OraOop default)
    • Speculative execution leads to duplicate DB reads
  • Set Oracle row fetch size extra high
    • Keeps the mappers streaming to HDFS
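
As a rough sketch of how several of these practices map onto a Sqoop 1 command line, the helper below assembles an import invocation. The connection string, user, paths, table, and the numeric values are placeholders, not recommendations; the speculative-execution property shown is the older Hadoop 1 name (newer Hadoop releases use mapreduce.map.speculative), so check which one your cluster expects. round_trips is a hypothetical helper that only illustrates the fetch-size arithmetic.

```python
import math
import shlex
from typing import List, Optional


def build_import_command(
    connect: str,
    username: str,
    password_file: str,
    table: str,
    split_by: str,
    target_dir: str,
    num_mappers: int = 8,
    fetch_size: int = 5000,
    inline_lob_limit: Optional[int] = None,
) -> List[str]:
    """Assemble a `sqoop import` argument list reflecting the practices above."""
    cmd = [
        "sqoop", "import",
        # Generic Hadoop -D options must precede the tool-specific arguments.
        "-D", "mapred.map.tasks.speculative.execution=false",
        "--connect", connect,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--split-by", split_by,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),   # set realistically for the database
        "--fetch-size", str(fetch_size),     # larger fetches keep mappers streaming
    ]
    if inline_lob_limit is None:
        # No inline limit given: store the output as sequence files.
        cmd.append("--as-sequencefile")
    else:
        cmd += ["--inline-lob-limit", str(inline_lob_limit)]
    return cmd


def round_trips(rows: int, fetch_size: int) -> int:
    """Number of fetch calls needed to read `rows` rows."""
    return math.ceil(rows / fetch_size)


# All values below are placeholders for illustration only.
cmd = build_import_command(
    connect="jdbc:oracle:thin:@//dbhost.example.com:1521/ORCL",
    username="SCOTT",
    password_file="/user/scott/.oracle.pw",
    table="ORDERS",
    split_by="ORDER_ID",
    target_dir="/data/orders",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
print(round_trips(1_000_000, 10), round_trips(1_000_000, 5000))  # prints 100000 200
```

Building the argument list in code rather than a single string keeps quoting explicit; the list can be passed to subprocess.run on a host where the sqoop client is installed.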
