SAND CDBMS Tools Reference Guide
Data Loader (ndlm)

 

Importing Data with ndlm


Data Import Overview

When importing data, ndlm reads the import specification file in order to obtain an accurate "picture" of the data to be loaded. The ndlm program then inserts the data into the destination or target table, according to the instructions contained in the specification file.

Note that ndlm will not load data into a table on which PRIMARY KEY, FOREIGN KEY, or UNIQUE constraints have been defined. If the table is new, constraints can be added after the load with an ALTER TABLE...ADD CONSTRAINT statement. In the case of existing constraints, they can be disabled via the ALTER TABLE...DISABLE CONSTRAINTS command, then re-enabled with the ALTER TABLE...ENABLE CONSTRAINTS command when the process is complete. Otherwise, constraints can be dropped with an ALTER TABLE...DROP CONSTRAINT statement before the import.
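
For example, assuming a hypothetical table SALES with a PRIMARY KEY constraint named PK_SALES on column SALES_ID (the statements below are only a sketch; the exact ALTER TABLE syntax is covered in the SAND CDBMS SQL documentation), existing constraints could be handled around a load as follows:

ALTER TABLE SALES DISABLE CONSTRAINTS;
-- perform the ndlm load here
ALTER TABLE SALES ENABLE CONSTRAINTS;

Alternatively, the constraint could be dropped before the load and re-added afterwards:

ALTER TABLE SALES DROP CONSTRAINT PK_SALES;
-- perform the ndlm load here
ALTER TABLE SALES ADD CONSTRAINT PK_SALES PRIMARY KEY (SALES_ID);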

The import specification file contains an import specification script which maps the incoming data format to the table data format. For this reason, the sizes of the fields in the input data should be carefully reviewed, so that incoming fields may be modified if necessary to fit the target table-column requirements.

In most cases, slight differences between input fields and destination columns do not have an impact on the data loaded. As long as the size or width of the target column is greater than or equal to that of the input data, and there is no attempt to insert character data into a numeric column, the data is loaded unchanged. In other situations, data can be altered or rejected during the load process.

Records that are rejected during the load process are written to a file called input-file.BAD, where input-file is the name of the file containing the data being imported (or the name of the first file processed, if the load specifies multiple input files). To prevent data from being rejected during the load procedure, use NDL++ functions (described below) to make the data compatible with the destination columns.

The following steps summarize the procedure for importing data from an ASCII text file into a database:

  1. Create and configure the target database.
  2. Create the SAND CDBMS domains and tables required to receive the imported data. If PRIMARY KEY, FOREIGN KEY, or UNIQUE constraints are required for a table, they must be added (using one or more ALTER TABLE statements) after the import has been performed.
  3. Locate and prepare the ASCII file containing the data to be imported. This may involve exporting data from some other source, for example a DBMS, to an ASCII file.
  4. Create an import specification script mapping fields from the data file to columns in the database table, and specifying any filtering, conversion, or manipulation operations that are to be applied to the data.
  5. Make sure that the database instance has been started using the nserv program, and that no other users are connected; then invoke ndlm, providing all necessary arguments (a sample invocation appears after this list). The user-name argument must be a user that can query and insert into the target database table.
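
For reference, a basic ndlm invocation might resemble the following sketch, in which the connection, instance, user, and specification file names are hypothetical placeholders (see the "Running ndlm" topic for the complete argument syntax):

ndlm myconnection mydb DBA/SECRET sales_import.spec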


Deferred Domain Indexing

In cases where multiple load operations will be performed successively, it can be advantageous to defer the domain indexing phase of each load operation until all loads have completed. By default, domain indexing occurs after every load operation. Since domain index work is not usually cumulative, performing this action only once at the end of several loads, instead of after each individual load, can reduce the overall time spent on load work.

Using the SQL command ALTER SESSION SET DOMINDEX, domain indexing can be disabled until the user decides to reactivate it. When all scheduled loads are finished, domain indexing can be turned back on with the same SQL command. The syntax for the command is the following:

ALTER SESSION SET DOMINDEX = n ;

where n is a nonnegative integer value.


When DOMINDEX is set to 0, domain indexing is turned off. When DOMINDEX is set to 1 (or greater), domain indexing is enabled and begins immediately. Note that only a user with DBA privileges can execute this command.

Domain indexing is required before loaded data can be accessed, so ensure that domain indexing is performed after all loads have completed and before clients connect to the database. In order to make the domain indexing permanent in modes other than Real, a SHUTDOWN IMMEDIATE KEEP CHANGE command will have to be executed against the database after the indexing is enabled.

Also note that turning on domain indexing will cause a COMMIT to be issued internally, so any previous DDL statements in the transaction will be made permanent.

When domain indexing is to be deferred, the following revised steps describe the procedure for importing data into a database:

  1. Create and configure the target database.
  2. Create the SAND CDBMS domains and tables required to receive the imported data. If PRIMARY KEY, FOREIGN KEY, or UNIQUE constraints are required for a table, they must be added (using one or more ALTER TABLE statements) after the import has been performed.
  3. Locate and prepare the ASCII file containing the data to be imported. This may involve exporting data from some other source, for example a DBMS, to an ASCII file.
  4. Create an import specification script mapping fields from the data file to columns in the database table, and specifying any filtering, conversion, or manipulation operations that are to be applied to the data.
  5. Make sure that the database instance has been started using the nserv program. Connect to it with a client program and execute the SQL command ALTER SESSION SET DOMINDEX = 0 to disable domain indexing. Disconnect from the database, and ensure that no other user is connected.
  6. Invoke ndlm, providing all necessary arguments. The user-name argument must be a user that can query and insert into the target database table. Repeat this step for each desired load operation. Other users must not connect to the database and attempt to work with unindexed data between individual loads.
  7. Through a client program, execute the SQL command ALTER SESSION SET DOMINDEX = 1 against the database to activate domain indexing. (A sketch of the complete command sequence appears after this list.)
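
As an illustration, the overall command sequence for two deferred-indexing loads might resemble the following sketch. The connection, instance, user, and file names are hypothetical; the ALTER SESSION commands are issued through a client program, the ndlm commands are run from the operating system prompt, and the SHUTDOWN IMMEDIATE KEEP CHANGE command is needed only to make the indexing permanent in modes other than Real:

ALTER SESSION SET DOMINDEX = 0;

ndlm myconnection mydb DBA/SECRET load1.spec
ndlm myconnection mydb DBA/SECRET load2.spec

ALTER SESSION SET DOMINDEX = 1;
SHUTDOWN IMMEDIATE KEEP CHANGE;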


Creating a SAND Compacted Table

A SAND Compacted Table can also be created using an import specification file. The script is identical to that of a standard data import, except for the presence of the SCTFILE parameter, which enables SCT File creation and specifies the path and name of the resulting SCT File. When the load operation is performed, the data is written to the specified SCT File. By default, the data is also loaded into the database; to bypass the database entirely, set the LOAD parameter to 0, NO, or FALSE.
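
The following fragment is only a schematic sketch of these two parameters, using a hypothetical file path; the actual parameter syntax and placement within the script are described in the "Creating the Import Specification Script" topic:

SCTFILE /data/sct/sales.sct
LOAD NO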

The same steps described in the "Data Import Overview" section above apply to creating an SCT File from source data files. However, the operation must be performed in Parallel Loader mode. Employ the following ndlm invocation syntax:

ndlm -b -k n connection-name instance-name user-name[/user-password] spec-file-name
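
For example, with the placeholder arguments filled in (the connection, instance, user, and specification file names below are hypothetical, and n stands for whatever value is appropriate for the load, as described in the "Running ndlm" topic):

ndlm -b -k n myconnection mydb DBA/SECRET sales_sct.spec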

Note that a single SCT File can contain at most 2,147,483,648 (2^31) rows. If the source data files contain more rows than this, the load operation will have to be split into multiple loads, and multiple SCT Files will have to be created.

It is possible to create an SCT File without an active database by setting the DDL parameter in either the [DCL] section of the nucleus.ini file or in the NUCLEUS environment variable. The DDL parameter specifies the path to a text file containing the SQL DDL commands (CREATE TABLE and CREATE DOMAIN) that fully describe the destination table, thereby removing the need for ndlm to read this information from a running SAND database. Refer to the nucleus.ini file documentation for further details about setting the DDL parameter.
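
As a sketch, assuming the usual parameter = value layout of nucleus.ini (the path and file name below are hypothetical), the entry might look like this:

[DCL]
DDL=/home/sand/ddl/sales_table.ddl

where sales_table.ddl is a text file containing the CREATE DOMAIN and CREATE TABLE statements that describe the destination table.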

Once the load operation has concluded successfully, a new SCT File will be located in the path specified by the SCTFILE parameter.


Loading into Partitioned and Dimension Tables

When a partitioned table is the target of a load operation, the loader will first read the system metadata for the partitioning strategy (round robin, hash, or range) used by the table, and then use that strategy to partition and load the data into the appropriate tables on remote nodes. For instance, if partitioned table p0 has three partitions (p1 on node 1, p2 on node 2, and p3 on node 3) and uses a range partitioning strategy, loading into p0 will result in the records actually being loaded into partitions p1, p2, and p3 according to the range definitions (say, p1 gets key values less than x, p2 gets key values from x to less than y, and p3 gets key values y and greater).

From the end user perspective, no special instructions are required for partitioned table loading. A user simply has to execute ndlm with the partitioned table as the target of the load.

Dimension tables, on the other hand, are not given special treatment by ndlm. Unlike manual INSERT, UPDATE, and DELETE commands, which are propagated to the corresponding table on other nodes, a load operation is not propagated: data loaded into a dimension table is loaded only into that table. If a load operation is performed on a dimension table, the exact same load operation must be performed on each remote copy of the table; otherwise, the dimension tables will be left in an inconsistent state.

For more information about partitioned tables and dimensions, refer to the Massively Parallel Processing (MPP) chapter of the SAND CDBMS Administration Guide.

 
