SAND CDBMS Tools Guide - SAND Data Loader (ndlm) - Importing Data from a CSV File

SAND CDBMS Tools Reference Guide
Data Loader (ndlm)

Previous Topic:
Importing Data from an SCT File

Chapter Index

Next Topic:
Performing Database Updates and Deletions Using the EXECSQL Command

Importing Data from a CSV File

Ndlm supports the importing of data from standard comma-separated values (CSV) files, which is accomplished using a special type of load specification script.

The CSV Format

A CSV file has the following characteristics (adapted from the RFC-4180 document):

Each record is located on a separate line, delimited by a line break (CR/LF)

The last record in the file may or may not have an ending line break

An optional header may appear on the first line, with the same format as the other records, containing the field names for the data

Within each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma.

Each field may or may not be enclosed in double quotes. If fields are not enclosed in double quotes, then double quotes cannot appear inside the fields.

Fields containing line breaks (CR/LF), double quotes, and commas must be enclosed in double-quotes

If double quotes are used to enclose fields, then a double quote appearing inside a field must be escaped by preceding it with another double quote.

The SAND implementation supports the standard CSV format, but also allows users to specify their own field and record delimiters. Additionally, there is an error-checking option to ensure that the source CSV file(s) contain the correct number of fields as described in the specification script.

The CSVFORMAT Load Specification Script

The new CSVFORMAT load specification script is a variation of the standard load script. The main difference is that the "CSVFORMAT" keyword appears in the Record section, signaling this type of load operation to ndlm, along with optional CSV parameters that describe the structure of the data inside the CSV file(s). Also, unlike a standard load script, the map specification section must map the columns in the database either to the exact same field names supplied in the header of the CSV file, or to the field names generated automatically ("f1", f2", ..., "f4056") in case there is no CSV header.

Otherwise, the CSV specification script is the same as a standard load script. The same script options can be used in both types of script, and the same NDL++ functions can be used to manipulate data as it is loaded.

The CSVFORMAT script has the following structure:

IMPORT @CSV-file-name [@CSV-file-name1 ... @CSV-file-nameN]
{

[ <options> ]

[ LIBRARY @user-library { <function declarations> } ... ]

[ INIT
{
<initialization statements>
} ]

[ TYPE
{
<data type declarations>
} ]

[ VAR
{
<variable declarations>
} ]

RECORD
{
CSVFORMAT (
          [ FIELDNAMESINFILE:x ]
          [ NUMBEROFFIELDS:y ]
          [ FIELDSEPARATOR:'a' ]
          [ RECORDSEPARATOR:'b' ]
          )
}

[ SKIPIF <logical expression> @skipfile-name ... ]

[schema-name.]table-name
{
<map specification>
}

}

Each element of the import specification script is described below.
Note:
Comment lines (beginning with // and ending with a line feed, or surrounded by /*... */ ) can be included anywhere in the load specification script, except in the PIC formatting string.
By default, the field, column, and table names in the script will be read in a case-insensitive manner, and so can be specified in any combination of upper- and lowercase letters; to enable case-sensitivity, include the -q option flag in the ndlm invocation.

CSVFORMAT Script Elements

IMPORT
The IMPORT keyword at the beginning of the load specification script instructs ndlm to perform an import operation. Whitespace separates this keyword from the @CSV-file-name(s) that follow.

@CSV-file-name
Specifies the CSV file containing the data to be imported. This file name cannot be longer than 260 characters and must be prefixed by the "at" symbol ( @ ). Include a full path specification if the CSV file is not located in the current directory. Note that if any referenced file or path contains a blank space, it must be enclosed in double quotation marks; that is, "<path with spaces>".
Multiple CSV files (up to a maximum of 16,384) can be specified for a single import operation. Each CSV file can contain a maximum of 2,147,483,648 (2³¹) rows. The formatting of data in each file must be identical. Each file name is preceded by the "at" sign ( @ ), and separated from the other file specifications by one or more spaces, or by a line feed character. Note that when a specified CSV file is not found by ndlm, processing continues with the next specified file that is present.

Wildcard characters may also be used in the input file name specification: '?' represents a single variable character in the file name; '*' represents zero or more variable characters in the file name. All CSV files with names corresponding to the file name mask will be used for the import (up to the maximum of 16,384 files).

Note that wildcard characters can be interpreted differently on Windows servers if 8.3 file name creation is enabled. Refer to the Creating the Import Specification Script section for more information.

options

The standard load options can be specified before the RECORD section:

IGNORE

SKIP (ignored if FIELDNAMESINFILE is set to 1)

PROCESS

MAXERR

ERRINC

NULL

LIBRARY
The optional LIBRARY section contains a reference to a custom library file and declares one or more functions from the library. By specifying a custom library, a user may use declared functions from the library in NDL++ statements within the script. Each individual library requires its own separate LIBRARY section.

INIT
The optional INIT section is where custom libraries and LOOKUP function instances are initialized.

TYPE
The optional TYPE section contains the data type declarations for any numeric data used in NDL++ statements within the script.

VAR
The optional VAR section contains the declarations of any variables used in NDL++ statements within the script.

RECORD
The RECORD keyword is required in all import specification files; it signifies that the record specification follows.

CSVFORMAT
The RECORD section contains the CSVFORMAT keyword, indicating the type of import operation. Immediately following CSVFORMAT are parentheses ( ), which can be empty, or may contain one or more of the optional CSV parameters, separated by whitespace:

FIELDNAMESINFILE:x
Indicates whether the first line in the CSV file is a header, containing the field names in order from left to right. A value of 1 (or TRUE or YES) means the first line is a header. A value of 0 (or FALSE or NO) means there is no header. By default, the loader assumes the CSV file does not have a header (FIELDNAMESINFILE:0).

If loading from multiple CSV files with FIELDNAMESINFILE enabled, each CSV file must have the same header row, otherwise the load operation will generate an error.

If the SKIP option or ndlm -s is specified when FIELDNAMESINFILE is enabled, the SKIP/-s option is ignored.

Note that the header must use the same delimiters as the other records in the CSV file.

NUMBEROFFIELDS:y
Is a nonnegative integer that specifies the number of fields in the CSV file. This parameter is strictly for error checking purposes. The NUMBEROFFIELDS value, when positive, must correspond to the actual number of fields in the CSV file, or else an error will be returned. If the value is zero (0), the number of fields is not checked. By default, field number checking is disabled (NUMBEROFFIELDS:0).

In the special case where CSV data is loaded from the standard input stream (stdin) and the data does not contain a header, the NUMBEROFFIELDS parameter must be specified, otherwise the load operation will generate an error.

FIELDSEPARATOR:'a'
Specifies the field separator character(s), contained in single quotation marks. By default, fields are separated by a comma (FIELDSEPARATOR:',').

RECORDSEPARATOR:'b'
Specifies the record separator character(s), contained in single quotation marks. By default, records are separated by a carriage return and linefeed (RECORDSEPARATOR:'\r\n').

SKIPIF
The SKIPIF clause specifies conditions under which input records will be skipped and written to a specified file instead of being loaded into the database.

table-name
The table-name argument identifies the table into which the CSV data will be imported. This table must exist in the database named in the ndlm invocation. The table must not have any PRIMARY KEY, FOREIGN KEY, or UNIQUE constraints. If the table does not belong to the default schema of the user authorization specified in the ndlm invocation, table-name must include a qualifying schema-name prefix as follows: schema-name.table-name.

If either the schema or table name contains spaces or other special characters, the name will have to appear between double quotation marks. If both schema and table names need to be delimited in this manner, there must be separate sets of quotation marks for each. The period ( . ) that separates the schema and table names must never be contained between quotation marks, otherwise it will be interpreted as part of an object name.
For example:

Legal
Illegal

"schema 1".table1
schema 1.table1

schema1."table\/1" 'schema 1'.table1

"schema.1"."table.1"
schema1.table"\/"1

"schema1"."table1" "schema1.table1"

map specification
The map specification describes the correspondence between the fields in the input CSV file and the columns in the table identified by table-name. If the CSV file contains a header, the field names from the header must be used in the map specification.
If there is no header, the fields in the CSV file are identified internally as "f1", "f2", "f3", ..., "f4055", "f4056", reading the fields from left to right. These internal field names must be used to identify the fields in the map specification section.
Note that the standard NDL++ functions can be used on the CSV fields.

CSV File Example

IMPORT @imp*20??.csv @temp2.csv
{

    NULL ****

    RECORD
    {
        CSVFORMAT
        (
            FIELDNAMESINFILE:0
            NUMBEROFFIELDS:4
            FIELDSEPARATOR:':'
            RECORDSEPARATOR:'\n'
        )
    }

    schema1.spj
    {
        sn   f2
        pn   f1
        jn   f3
        qty f4
    }

}

In this example, data is import from called "temp2.csv", as well as from every file whose name begins with "imp", and contains "20" and two variable characters right before the extension. The "NULL ****" option specifies that every instance of "****" in the CSV data is to be treated as a null value.
As noted, the RECORD section of a CSV import script must contain the keyword "CSVFORMAT", optionally followed by one or more of the CSVFORMAT parameters. The FIELDNAMESINFILE option is disabled in this example, so the input CSV file(s) are not expected to contain a header row with the column names. NUMBEROFFIELDS is also included for error-checking only; if the number of fields in the CSV files does not correspond to the NUMBEROFFIELDS value, the load operation will be halted with an error message. The remaining two CSVFORMAT parameters (FIELDSEPARATOR and RECORDSEPARATOR) specify the field separator and record separator characters in the CSV files, which are ":" and "\n" respectively.
The start of the mapping section specifies the table (spj) in the database where the CSV data will be loaded, qualified by its schema location (schema1). In the body of the mapping section, each column that will be loaded into ("sn", "pn", "jn", "qty") is associated with a column in the CSV files. Note that, since the field names are not supplied in a header row, the CSV columns are named automatically: the first column is "f1", the second "f2", the third "f3", and so on. Because the CSV columns are uniquely identified in this manner, they do not have to be loaded in the exact same order in which they appear in the files; hence the second column ("f2") can be referenced first, followed by the first column ("f1"), as in this example.

Previous Topic:
Importing Data from an SCT File

Chapter Index

Next Topic:
Performing Database Updates and Deletions Using the EXECSQL Command

Legal	Illegal
"schema 1".table1	schema 1.table1
schema1."table\/1"	'schema 1'.table1
"schema.1"."table.1"	schema1.table"\/"1
"schema1"."table1"	"schema1.table1"

SAND CDBMS Tools Reference Guide Data Loader (ndlm)

SAND CDBMS Tools Reference Guide
Data Loader (ndlm)