This section describes how to import data from external data sources to MRS.
Generally, you can manually manage data import and export jobs on the Loader UI. To use shell scripts to update and run Loader jobs, you must configure the installed Loader client.
Version 8.1.2.2 is used as an example in the following commands; replace it with the actual version number. Replace *JAR package name* with the name of the driver JAR you uploaded.

```shell
cd ${BIGDATA_HOME}/FusionInsight_Porter_8.1.2.2/install/FusionInsight-Sqoop-1.99.3/FusionInsight-Sqoop-1.99.3/server/webapps/loader/WEB-INF/ext-lib
chown omm:wheel <JAR package name>
chmod 600 <JAR package name>
```
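The steps above can be rehearsed safely before touching a cluster node. The sketch below runs in a scratch directory and uses an assumed JAR name; on a real node you would run it in the ext-lib directory shown above and also execute `chown omm:wheel` on the JAR.

```shell
# Safe, self-contained rehearsal of the permission step above.
mkdir -p /tmp/loader-ext-lib-demo
cd /tmp/loader-ext-lib-demo
JAR="mysql-connector-java-8.0.28.jar"   # assumed driver JAR name
touch "$JAR"                            # stands in for the uploaded JAR
chmod 600 "$JAR"                        # owner read/write only
ls -l "$JAR" | cut -c1-10               # -rw-------
```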
| Connector | Parameter | Description |
|---|---|---|
| generic-jdbc-connector | JDBC Driver Class | Name of the JDBC driver class |
| | JDBC Connection String | JDBC connection string |
| | Username | Username for connecting to the database |
| | Password | Password for connecting to the database |
| | JDBC Connection Properties | JDBC connection properties. Click Add to manually add connection properties. |
| ftp-connector | FTP Server IP Address | IP address of the FTP server |
| | FTP Server Port | Port number of the FTP server |
| | FTP Username | Username for accessing the FTP server |
| | FTP Password | Password for accessing the FTP server |
| | FTP Mode | FTP access mode. The options are ACTIVE and PASSIVE. If this parameter is not set, passive mode is used by default. |
| | FTP Protocol | FTP protocol. If this parameter is not set, the FTP protocol is used by default. |
| | File Name Encoding Type | Encoding format of file names and file paths supported by the FTP server. If this parameter is not set, UTF-8 is used by default. |
| hdfs-connector | - | - |
| oracle-connector | JDBC Connection String | Connection string for connecting to the database |
| | Username | Username for connecting to the database |
| | Password | Password for connecting to the database |
| | Connection Properties | Connection properties. Click Add to manually add connection properties. |
| mysql-fastpath-connector | JDBC Connection String | JDBC connection string |
| | Username | Username for connecting to the database |
| | Password | Password for connecting to the database |
| | Connection Properties | Connection properties. Click Add to manually add connection properties. |
| sftp-connector | SFTP Server IP Address | IP address of the SFTP server |
| | SFTP Server Port | Port number of the SFTP server |
| | SFTP Username | Username for accessing the SFTP server |
| | SFTP Password | Password for accessing the SFTP server |
| | SFTP Public Key | Public key of the SFTP server |
| oracle-partition-connector | JDBC Driver Class | Name of the JDBC driver class |
| | JDBC Connection String | JDBC connection string |
| | Username | Username for connecting to the database |
| | Password | Password for connecting to the database |
| | Connection Properties | Connection properties. Click Add to manually add connection properties. |
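As a hedged illustration only, a generic-jdbc-connector for a MySQL source might be filled in with values like the following. The driver class, host, port, database name, credentials, and property are assumed examples for this sketch, not defaults:

```
JDBC Driver Class:          com.mysql.jdbc.Driver
JDBC Connection String:     jdbc:mysql://192.168.0.10:3306/loader_db
Username:                   loader_user
Password:                   ********
JDBC Connection Properties: connectTimeout=5000
```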
| Source File Type | Parameter | Description |
|---|---|---|
| sftp-connector or ftp-connector | Input Path | Input path or name of the source file on an SFTP server. If multiple SFTP server IP addresses are configured for the connector, you can set this parameter to multiple input paths separated by semicolons (;). Ensure that the number of input paths is the same as the number of SFTP servers configured for the connector. |
| | File Split Type | Whether to split source files by file name or by size. The files obtained after splitting are used as the input files of each map in the MapReduce task for data import. FILE: each map processes one or more complete source files; the same source file cannot be allocated to different maps, and the directory structure of the input path is retained when the data is saved to the output directory. SIZE: each map processes input data of a certain size; a source file can be split across multiple maps, and the number of files saved to the output directory equals the number of maps. The file name format is import_part_xxxx, where xxxx is a unique random number generated by the system. |
| | Filter Type | File filtering criterion. WILDCARD: a wildcard is used for filtering. REGEX: a regular expression is used for filtering. This parameter is used together with Path Filter and File Filter. The default value is WILDCARD. |
| | Path Filter | Wildcard or regular expression for filtering the directories in the input path of the source files; takes effect when Filter Type is set. Input Path itself is not used for filtering. Use commas (,) to separate multiple filter conditions. If the value is empty, directories are not filtered. |
| | File Filter | Wildcard or regular expression for filtering the file names of the source files; takes effect when Filter Type is set. Use commas (,) to separate multiple filter conditions. The value cannot be left blank. |
| | Encoding Type | Source file encoding format, for example, UTF-8. This parameter can be set only for text file import. |
| | Suffix | File name extension added to a source file after it is imported. If this parameter is empty, no extension is added. |
| | Compression | Whether to enable compressed transmission when SFTP is used to export data. true enables compression; false disables it. |
| hdfs-connector | Input Path | Input path of the source files in HDFS |
| | Path Filter | Wildcard for filtering the directories in the input paths of the source files. Input Path itself is not used for filtering. Use commas (,) to separate multiple filter conditions. If the value is empty, directories are not filtered. Regular expression filtering is not supported. |
| | File Filter | Wildcard for filtering the file names of the source files. Use commas (,) to separate multiple filter conditions. The value cannot be left blank. Regular expression filtering is not supported. |
| | Encoding Type | Source file encoding format, for example, UTF-8. This parameter can be set only for text file import. |
| | Suffix | File name extension added to a source file after it is imported. If this parameter is empty, no extension is added. |
| generic-jdbc-connector | Schema Name | Database schema name. This parameter exists in Table name mode. |
| | Table Name | Database table name. This parameter exists in Table name mode. |
| | SQL Statement | SQL statement used by Loader to query the data to be imported in Table SQL statement mode. The statement must include the query condition WHERE ${CONDITIONS}; without it, the statement cannot run properly, for example, select * from TABLE WHERE A>B and ${CONDITIONS}. If Table column names is set, the specified columns replace the columns queried in the SQL statement. This parameter cannot be set when Schema name or Table name is set. |
| | Table Column Names | Table columns whose content is to be imported by Loader. Use commas (,) to separate multiple fields. |
| | Partition Column Name | Database table column used to determine the data to be imported; used for partitioning in a map job. You are advised to use the primary key field. |
| | Nulls in Partition Column | Whether to process records whose partition column value is null. true: such records are processed. false: they are not processed. |
| | Whether to Specify a Partition Column | Whether to specify a partition column. |
| oracle-connector | Table Name | Table name. |
| | Column Name | Column name. |
| | Query Condition | Query condition in an SQL statement. |
| | Splitting Mode | Data splitting mode. The options are ROWID and PARTITION. |
| | Table Partition Name | Name of a table partition. Use commas (,) to separate the names of different partitions. |
| | Data Block Allocation Mode | Allocation method of data blocks after splitting. |
| | Read Size | Amount of data to read each time. |
| mysql-fastpath-connector | Schema Name | Database schema name. |
| | Table Name | Database table name. |
| | Query Condition | Query condition of the specified table. |
| | Partition Column Name | Database table column used to determine the data to be imported; used for partitioning in a map job. You are advised to use the primary key field. |
| | Nulls in Partition Column | Whether to process records whose partition column value is null. true: such records are processed. false: they are not processed. |
| | Whether to Specify a Partition Column | Whether to specify a partition column. |
| oracle-partition-connector | Schema Name | Database schema name. |
| | Table Name | Partition table name. |
| | Query Condition | Query condition in an SQL statement. |
| | Table Column Names | Table columns whose content is to be imported by Loader. Use commas (,) to separate multiple fields. |
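The WHERE ${CONDITIONS} requirement for Table SQL statement mode is easy to miss. As a hedged illustration (this check is not part of Loader), a small shell snippet can verify that a statement contains the literal placeholder before you save the job; the table and column names are assumed examples:

```shell
#!/bin/sh
# Verify that a Table SQL statement contains the literal ${CONDITIONS}
# placeholder required by Loader. Single quotes keep the shell from
# expanding the placeholder.
sql='select * from EMPLOYEE where SALARY > 1000 and ${CONDITIONS}'

case "$sql" in
  *'${CONDITIONS}'*) echo "valid" ;;
  *)                 echo "invalid: missing \${CONDITIONS}" ;;
esac
```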
Check whether the source data values in the data operation job created by Loader can be used directly without conversion; supported conversions include case conversion, cutting, merging, and splitting.
| Type | Description |
|---|---|
| Input Type | |
| Conversion Type | |
| Output Type | |
The edit box allows you to perform the following tasks:
You can also use the shortcut key Del to delete the example.
For details about how to set parameters in the step conversion information, see Operator Help.
If a conversion step is incorrectly configured, the source data cannot be converted and becomes dirty data. The dirty data marking rules are as follows:
| Storage Type | Parameter | Description |
|---|---|---|
| HDFS | File Type | Type of files imported to HDFS. Select a type from the drop-down list. |
| | Compression Format | Compression format of files imported to HDFS. Select a format from the drop-down list. If you select NONE or do not set this parameter, data is not compressed. |
| | Output Directory | Directory for storing data imported into HDFS. |
| | Operation | Action during data import. When all data is imported from the input path to the destination path, the data is first stored in a temporary directory and then copied from the temporary directory to the destination path; after a successful import, the data is deleted from the temporary directory. One of the following actions can be taken when duplicate file names exist during data transfer: |
| | Extractors | Number of maps started at the same time in the MapReduce task of a data configuration operation. This parameter cannot be set when Extractor Size is set. The value must be less than or equal to 3000. |
| | Extractor Size | Size (in MB) of data processed by each map started in the MapReduce task of a data configuration operation. The value must be greater than or equal to 100; the recommended value is 1000. This parameter cannot be set when Extractors is set. When a relational database connector is used, Extractor Size is unavailable and you need to set Extractors. |
| HBASE_BULKLOAD | HBase Instance | HBase service instance that Loader selects from all available HBase service instances in the cluster. If the selected HBase service instance is not added to the cluster, the HBase job cannot run properly. |
| | Clear data before import | Whether to clear data in the original table before importing data. true: the clearing operation is performed. false: it is not performed. If this parameter is not set, the original table is not cleared by default. |
| | Extractors | Number of maps started at the same time in the MapReduce task of a data configuration operation. The value must be less than or equal to 3000. |
| | Extractor Size | HBase does not support this parameter. Set Extractors instead. |
| HBASE_PUTLIST | HBase Instance | HBase service instance that Loader selects from all available HBase service instances in the cluster. If the selected HBase service instance is not added to the cluster, the HBase job cannot run properly. |
| | Extractors | Number of maps started at the same time in the MapReduce task of a data configuration operation. The value must be less than or equal to 3000. |
| | Extractor Size | HBase does not support this parameter. Set Extractors instead. |
| HIVE | Output Directory | Directory for storing data imported into Hive. |
| | Extractors | Number of maps started at the same time in the MapReduce task of a data configuration operation. This parameter cannot be set when Extractor Size is set. The value must be less than or equal to 3000. |
| | Extractor Size | Size (in MB) of data processed by each map started in the MapReduce task of a data configuration operation. The value must be greater than or equal to 100; the recommended value is 1000. This parameter cannot be set when Extractors is set. When a relational database connector is used, Extractor Size is unavailable and you need to set Extractors. |
| SPARK | Output Directory | Only SparkSQL can be used to access Hive data; this parameter specifies the directory for storing data imported into Hive. |
| | Extractors | Number of maps started at the same time in the MapReduce task of a data configuration operation. This parameter cannot be set when Extractor Size is set. The value must be less than or equal to 3000. |
| | Extractor Size | Size (in MB) of data processed by each map started in the MapReduce task of a data configuration operation. The value must be greater than or equal to 100; the recommended value is 1000. This parameter cannot be set when Extractors is set. When a relational database connector is used, Extractor Size is unavailable and you need to set Extractors. |
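The relationship between Extractor Size and the resulting map count can be sketched as rough arithmetic: the number of maps is approximately the total input size divided by Extractor Size, rounded up. The total input size below is an assumed example figure; Loader's actual splitting may differ.

```shell
#!/bin/sh
# Estimate the map count implied by an Extractor Size setting.
total_mb=256000        # assumed total input size: 250 GB, in MB
extractor_size=1000    # recommended value from the table above
# Ceiling division: (a + b - 1) / b
maps=$(( (total_mb + extractor_size - 1) / extractor_size ))
echo "$maps"           # 256, well under the 3000-map limit
```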