SAND CDBMS Tools Reference Guide
Data Loader (ndlm)

 

Previous Topic:
Importing Data from an SCT File
Chapter Index
Next Topic:
Performing Database Updates and Deletions Using the EXECSQL Command

 

Importing Data from a CSV File


Ndlm supports the importing of data from standard comma-separated values (CSV) files, which is accomplished using a special type of load specification script.


The CSV Format

A CSV file has the following characteristics (adapted from the RFC-4180 document):

The SAND implementation supports the standard CSV format, but also allows users to specify their own field and record delimiters. Additionally, there is an error-checking option to ensure that the source CSV file(s) contain the correct number of fields as described in the specification script.


The CSVFORMAT Load Specification Script

The new CSVFORMAT load specification script is a variation of the standard load script. The main difference is that the "CSVFORMAT" keyword appears in the Record section, signaling this type of load operation to ndlm, along with optional CSV parameters that describe the structure of the data inside the CSV file(s). Also, unlike a standard load script, the map specification section must map the columns in the database either to the exact same field names supplied in the header of the CSV file, or to the field names generated automatically ("f1", f2", ..., "f4056") in case there is no CSV header.

Otherwise, the CSV specification script is the same as a standard load script. The same script options can be used in both types of script, and the same NDL++ functions can be used to manipulate data as it is loaded.

The CSVFORMAT script has the following structure:

IMPORT @CSV-file-name [@CSV-file-name1 ... @CSV-file-nameN]
{

[ <options> ]

[ LIBRARY @user-library { <function declarations> } ... ]

[ INIT
{
<initialization statements>
} ]

[ TYPE
{

<data type declarations>

} ]

[ VAR
{

<variable declarations>

} ]

RECORD
{
CSVFORMAT (
          [ FIELDNAMESINFILE:x ]
          [ NUMBEROFFIELDS:y ]
          [ FIELDSEPARATOR:'a' ]
          [ RECORDSEPARATOR:'b' ]
          )
}

[ SKIPIF <logical expression> @skipfile-name ... ]

[schema-name.]table-name
{
<map specification>
}

}

 

Each element of the import specification script is described below.

Note:
Comment lines (beginning with // and ending with a line feed, or surrounded by /*... */ ) can be included anywhere in the load specification script, except in the PIC formatting string.

By default, the field, column, and table names in the script will be read in a case-insensitive manner, and so can be specified in any combination of upper- and lowercase letters; to enable case-sensitivity, include the -q option flag in the ndlm invocation.


CSVFORMAT Script Elements


IMPORT

The IMPORT keyword at the beginning of the load specification script instructs ndlm to perform an import operation. Whitespace separates this keyword from the @CSV-file-name(s) that follow.


@CSV-file-name

Specifies the CSV file containing the data to be imported. This file name cannot be longer than 260 characters and must be prefixed by the "at" symbol ( @ ). Include a full path specification if the CSV file is not located in the current directory. Note that if any referenced file or path contains a blank space, it must be enclosed in double quotation marks; that is, "<path with spaces>".

Multiple CSV files (up to a maximum of 16,384) can be specified for a single import operation. Each CSV file can contain a maximum of 2,147,483,648 (231) rows. The formatting of data in each file must be identical. Each file name is preceded by the "at" sign ( @ ), and separated from the other file specifications by one or more spaces, or by a line feed character. Note that when a specified CSV file is not found by ndlm, processing continues with the next specified file that is present.

Wildcard characters may also be used in the input file name specification: '?' represents a single variable character in the file name; '*' represents zero or more variable characters in the file name. All CSV files with names corresponding to the file name mask will be used for the import (up to the maximum of 16,384 files).

Note that wildcard characters can be interpreted differently on Windows servers if 8.3 file name creation is enabled. Refer to the Creating the Import Specification Script section for more information.


options


The standard load options can be specified before the RECORD section:


LIBRARY
The optional LIBRARY section contains a reference to a custom library file and declares one or more functions from the library. By specifying a custom library, a user may use declared functions from the library in NDL++ statements within the script. Each individual library requires its own separate LIBRARY section.


INIT
The optional INIT section is where custom libraries and LOOKUP function instances are initialized.


TYPE
The optional TYPE section contains the data type declarations for any numeric data used in NDL++ statements within the script.


VAR
The optional VAR section contains the declarations of any variables used in NDL++ statements within the script.


RECORD
The RECORD keyword is required in all import specification files; it signifies that the record specification follows.


CSVFORMAT
The RECORD section contains the CSVFORMAT keyword, indicating the type of import operation. Immediately following CSVFORMAT are parentheses ( ), which can be empty, or may contain one or more of the optional CSV parameters, separated by whitespace:

FIELDNAMESINFILE:x
Indicates whether the first line in the CSV file is a header, containing the field names in order from left to right. A value of 1 (or TRUE or YES) means the first line is a header. A value of 0 (or FALSE or NO) means there is no header. By default, the loader assumes the CSV file does not have a header (FIELDNAMESINFILE:0).

If loading from multiple CSV files with FIELDNAMESINFILE enabled, each CSV file must have the same header row, otherwise the load operation will generate an error.

If the SKIP option or ndlm -s is specified when FIELDNAMESINFILE is enabled, the SKIP/-s option is ignored.

Note that the header must use the same delimiters as the other records in the CSV file.

NUMBEROFFIELDS:y
Is a nonnegative integer that specifies the number of fields in the CSV file. This parameter is strictly for error checking purposes. The NUMBEROFFIELDS value, when positive, must correspond to the actual number of fields in the CSV file, or else an error will be returned. If the value is zero (0), the number of fields is not checked. By default, field number checking is disabled (NUMBEROFFIELDS:0).

In the special case where CSV data is loaded from the standard input stream (stdin) and the data does not contain a header, the NUMBEROFFIELDS parameter must be specified, otherwise the load operation will generate an error.

FIELDSEPARATOR:'a'
Specifies the field separator character(s), contained in single quotation marks. By default, fields are separated by a comma (FIELDSEPARATOR:',').

RECORDSEPARATOR:'b'
Specifies the record separator character(s), contained in single quotation marks. By default, records are separated by a carriage return and linefeed (RECORDSEPARATOR:'\r\n').


SKIPIF

The SKIPIF clause specifies conditions under which input records will be skipped and written to a specified file instead of being loaded into the database.


table-name

The table-name argument identifies the table into which the CSV data will be imported. This table must exist in the database named in the ndlm invocation. The table must not have any PRIMARY KEY, FOREIGN KEY, or UNIQUE constraints. If the table does not belong to the default schema of the user authorization specified in the ndlm invocation, table-name must include a qualifying schema-name prefix as follows: schema-name.table-name.

If either the schema or table name contains spaces or other special characters, the name will have to appear between double quotation marks. If both schema and table names need to be delimited in this manner, there must be separate sets of quotation marks for each. The period ( . ) that separates the schema and table names must never be contained between quotation marks, otherwise it will be interpreted as part of an object name.

For example:

Legal

Illegal
"schema 1".table1

schema 1.table1

schema1."table\/1" 'schema 1'.table1
"schema.1"."table.1"

schema1.table"\/"1

"schema1"."table1" "schema1.table1"


map specification
The map specification describes the correspondence between the fields in the input CSV file and the columns in the table identified by table-name. If the CSV file contains a header, the field names from the header must be used in the map specification.

If there is no header, the fields in the CSV file are identified internally as "f1", "f2", "f3", ..., "f4055", "f4056", reading the fields from left to right. These internal field names must be used to identify the fields in the map specification section.

Note that the standard NDL++ functions can be used on the CSV fields.


CSV File Example

IMPORT @imp*20??.csv @temp2.csv
{

    NULL ****

    RECORD
    {
        CSVFORMAT
        (
            FIELDNAMESINFILE:0
            NUMBEROFFIELDS:4
            FIELDSEPARATOR:':'
            RECORDSEPARATOR:'\n'
        )
    }

    schema1.spj
    {
        sn   f2
        pn   f1
        jn   f3
        qty  f4
    }

}

In this example, data is import from called "temp2.csv", as well as from every file whose name begins with "imp", and contains "20" and two variable characters right before the extension. The "NULL ****" option specifies that every instance of "****" in the CSV data is to be treated as a null value.

As noted, the RECORD section of a CSV import script must contain the keyword "CSVFORMAT", optionally followed by one or more of the CSVFORMAT parameters. The FIELDNAMESINFILE option is disabled in this example, so the input CSV file(s) are not expected to contain a header row with the column names. NUMBEROFFIELDS is also included for error-checking only; if the number of fields in the CSV files does not correspond to the NUMBEROFFIELDS value, the load operation will be halted with an error message. The remaining two CSVFORMAT parameters (FIELDSEPARATOR and RECORDSEPARATOR) specify the field separator and record separator characters in the CSV files, which are ":" and "\n" respectively.

The start of the mapping section specifies the table (spj) in the database where the CSV data will be loaded, qualified by its schema location (schema1). In the body of the mapping section, each column that will be loaded into ("sn", "pn", "jn", "qty") is associated with a column in the CSV files. Note that, since the field names are not supplied in a header row, the CSV columns are named automatically: the first column is "f1", the second "f2", the third "f3", and so on. Because the CSV columns are uniquely identified in this manner, they do not have to be loaded in the exact same order in which they appear in the files; hence the second column ("f2") can be referenced first, followed by the first column ("f1"), as in this example.

 

Previous Topic:
Importing Data from an SCT File
Chapter Index
Next Topic:
Performing Database Updates and Deletions Using the EXECSQL Command