Sqoop is designed to import mainframe datasets into HDFS. To do so, you must specify a mainframe host name in the Sqoop --connect argument.
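For example (a minimal sketch, assuming the import-mainframe tool; the dataset and credential arguments described below are omitted here):

$ sqoop import-mainframe --connect z390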
This will connect to the mainframe host z390 via ftp.
You might need to authenticate against the mainframe host to access it. You can use the --username argument to supply a username to the mainframe.
Sqoop provides a couple of different ways to supply a password to the mainframe, secure and non-secure, which are detailed below.
Secure way of supplying a password to the mainframe:
Save the password in a file in the user's home directory with 400 permissions and specify the path to that file using the --password-file argument; this is the preferred method of entering credentials.
Sqoop will then read the password from the file and pass it to the MapReduce cluster using secure means without exposing the password in the job configuration.
The file containing the password can either be on the Local FS or HDFS.
Example:
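(A sketch, assuming the import-mainframe tool; the username and file path are illustrative.)

$ sqoop import-mainframe --connect z390 --username david \
    --password-file ${user.home}/.password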
Another way of supplying passwords is using the -P argument, which will read a password from a console prompt.
Note
The --password parameter is insecure, as other users may be able to read your password from the command-line arguments via the output of programs such as ps.
The -P argument is the preferred method over using the --password argument. Credentials may still be transferred between nodes of the MapReduce cluster using insecure means.
Sqoop imports data in parallel by making multiple ftp connections to the mainframe to transfer multiple files simultaneously.
You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument.
Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ.
By default, four tasks are used. You can adjust this value to maximize the data transfer rate from the mainframe.
Controlling Distributed Cache
Sqoop will copy the jars in the $SQOOP_HOME/lib folder to the job cache every time a Sqoop job is started.
When launched by Oozie this is unnecessary, since Oozie uses its own Sqoop share lib, which keeps the Sqoop dependencies in the distributed cache.
Oozie localizes the Sqoop dependencies on each worker node only once, during the first Sqoop job, and reuses the jars for subsequent jobs.
Passing the --skip-dist-cache option to the Sqoop command when it is launched by Oozie skips the step in which Sqoop copies its dependencies to the job cache, saving massive I/O.
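For example (a sketch; the host, dataset, and credentials are illustrative):

$ sqoop import-mainframe --connect z390 --dataset EMPLOYEES \
    --username SomeUser -P --skip-dist-cache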
Controlling the Import Process
By default, Sqoop will import all sequential files in a partitioned dataset pds to a directory named pds inside your home directory in HDFS.
For example, if your username is someuser, then the import tool will write to /user/someuser/pds/(files).
You can adjust the parent directory of the import with the --warehouse-dir argument. For example:
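(A sketch; the host and credentials are illustrative.)

$ sqoop import-mainframe --connect z390 --dataset pds \
    --username SomeUser -P \
    --warehouse-dir /shared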
This command would write to a set of files in the /shared/pds/ directory.
You can also explicitly choose the target directory, like so:
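(Again a sketch, with illustrative host, dataset, and credentials.)

$ sqoop import-mainframe --connect z390 --dataset pds \
    --username SomeUser -P \
    --target-dir /dest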
This will import the files into the /dest directory. --target-dir is incompatible with --warehouse-dir.
By default, imports go to a new target location. If the destination directory already exists in HDFS, Sqoop will refuse to import and overwrite that directory’s contents.
File Formats
By default, each record in a dataset is stored as a text record with a newline at the end.
Each record is assumed to contain a single text field with the name DEFAULT_COLUMN.
When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates.
You can also import mainframe records to Sequence, Avro, or Parquet files.
By default, data is not compressed.
You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument.
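For example, a sketch enabling gzip compression (the --compression-codec form would take a Hadoop codec class name instead):

$ sqoop import-mainframe --connect z390 --dataset EMPLOYEES \
    --username SomeUser -P -z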
Since a mainframe record contains only one field, importing to delimited files will not contain any field delimiter. However, the field may be enclosed with an enclosing character or escaped by an escaping character.
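For example, a sketch that wraps the single field in double quotes using the standard output line formatting argument --enclosed-by (host, dataset, and credentials are illustrative):

$ sqoop import-mainframe --connect z390 --dataset EMPLOYEES \
    --username SomeUser -P --enclosed-by '"'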
Input parsing arguments:
Argument                                   Description
--input-enclosed-by <char>                 Sets a required field encloser
--input-escaped-by <char>                  Sets the input escape character
--input-fields-terminated-by <char>        Sets the input field separator
--input-lines-terminated-by <char>         Sets the input end-of-line character
--input-optionally-enclosed-by <char>      Sets a field enclosing character
When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates when doing a delimited-format import.
The delimiters are chosen with arguments such as --fields-terminated-by; this controls both how the data is written to disk and how the generated parse() method reinterprets this data.
The delimiters used by the parse() method can be chosen independently of the output arguments, by using --input-fields-terminated-by, and so on.
This is useful, for example, to generate classes which can parse records created with one set of delimiters, and emit the records to a different set of files using a separate set of delimiters.
Example Invocations
The following examples illustrate how to use the import tool in a variety of situations.
A basic import of all sequential files in a partitioned dataset named EMPLOYEES in the mainframe host z390:
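(A sketch; SomeUser is an illustrative username, and -P will prompt for the password.)

$ sqoop import-mainframe --connect z390 --dataset EMPLOYEES \
    --username SomeUser -P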
Controlling the import parallelism (using 8 parallel tasks):
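(A sketch building on the previous command; -m 8 requests eight map tasks.)

$ sqoop import-mainframe --connect z390 --dataset EMPLOYEES \
    --username SomeUser -P -m 8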