SAND CDBMS Documentation Library
What's New in SAND CDBMS

This page briefly describes the differences between SAND Nucleus CDBMS releases 8 and 7 and earlier releases, for the benefit of users who are upgrading to the new release.


What's New in version 8.2!


Highlights of release 8.2.3507

Parquet Support:

SAND Nucleus CDBMS now supports the Hadoop Parquet file format, allowing users to import Parquet files directly into a SAND database via the NDLM loader, export data from a SAND database to Parquet files, and generate DDL and load scripts based on a Parquet file.

See details in the 'PARQUET Support Details' section under 'Notes' below.

Export performance improvements:

SAND's loader export operations have been enhanced to speed up the EXPORT of data from the database to the supported file formats: SCT, Parquet, and flat/CSV files.

Enhanced debugging:

SAND CDBMS includes enhancements to the debugging toolkit that allow more fluid database analysis.

Bug Fixes:

We have resolved a rare engine crash that could result from a very small load into a very small table/column.

Please see the previous version 8 release section for earlier V8 changes in Licensing, Load, Performance and added query support.

Notes:

Windows users:


The SAND Windows installation has the following changes/requirements:


Prerequisites:


PARQUET Support Details:


Via ndlm, SAND users can now IMPORT data from and EXPORT data to Parquet format files.

Using the ndlm loader, import and export scripts take the following form:

IMPORT:


PARQUETIMPORT @filespec [@filespec ...]
{
    tablespec
    {
       colname parquetfield
       ...
    }
}

EXPORT:


PARQUETEXPORT @filespec
{
    queryspec
}

Each name in the select list must be unique and will correspond to a column name in the parquet file.


Script generation based on Parquet file:


ndlm -parquet filespec [tablespec]

which generates two files: tabname.sql and tabname.ndl

Example:

ndlm -parquet /path/schema.table.parquet

yields    /path/table.sql creating table schema.table
and     /path/table.ndl loading @/path/schema.table.parquet
        into schema.table

or

ndlm -parquet /path/parquetfile.parquet public.table

yields     /path/table.sql creating public.table
and     /path/table.ndl loading @/path/parquetfile.parquet
        into public.table

Export/Import script Examples:


PARQUETEXPORT @lineitem.parquet
{
    select * from lineitem;
}

PARQUETIMPORT @lineitem.parquet
{
    lineitem
       { l_orderkey l_orderkey
       l_partkey l_partkey
       l_suppkey l_suppkey
       l_linenumber l_linenumber
       l_quantity l_quantity
       l_extendedprice l_extendedprice
       l_discount l_discount
       l_tax l_tax
       l_returnflag l_returnflag
       l_linestatus l_linestatus
       l_shipdate l_shipdate
       l_commitdate l_commitdate
       l_receiptdate l_receiptdate
       l_shipinstruct l_shipinstruct
       l_shipmode l_shipmode
       l_comment l_comment
    }
}

Or

PARQUETEXPORT @lineitem.parquet
{
    select L_ORDERKEY F01,
    L_PARTKEY F02,
    L_SUPPKEY F03,
    L_LINENUMBER F04,
    L_QUANTITY F05,
    L_EXTENDEDPRICE F06,
    L_DISCOUNT F07,
    L_TAX F08,
    L_RETURNFLAG F09,
    L_LINESTATUS F10,
    L_SHIPDATE F11,
    L_COMMITDATE F12,
    L_RECEIPTDATE F13,
    L_SHIPINSTRUCT F14,
    L_SHIPMODE F15,
    L_COMMENT F16
    from lineitem where L_LINENUMBER is not null and L_SHIPDATE > '1889-01-01';
}

PARQUETIMPORT @lineitem.parquet
{
    PUBLIC.LINEITEM
    {
       l_orderkey F01
       l_partkey F02
       l_suppkey F03
       l_linenumber F04
       l_quantity F05
       l_extendedprice F06
       l_discount F07
       l_tax F08
       l_returnflag F09
       l_linestatus F10
       l_shipdate F11
       l_commitdate F12
       l_receiptdate F13
       l_shipinstruct F14
       l_shipmode F15
       l_comment F16
    }
}

ndlm Command Line Options with Parquet Export:

    -f nnn

    Where nnn indicates the size in megabytes of each rowset inside the parquet file. This is the chunk size processed at a time; it defaults to 1000.

    -compress n

   Where n can range from 0 to 6 and indicates the compression type to use. The default compression is Snappy.

       0 - No Compression
       1 - Snappy
       2 - Gzip
       3 - Lzo
       4 - Brotli
       5 - Lz4
       6 - Zstd

    -forceindex nnn

Where nnn is the maximum number of rows, in millions, that will go into a single parquet file. The default is 200.

When exporting a result set with more than -forceindex nnn million rows, a new parquet file is started for each interval. The file names get a sequence number (_nnn) appended before the extension. This facilitates forcing indexing at intervals during imports, since the current parquet import logic asynchronously processes entire parquet files in parallel. For example, with the TPCH 100 GB lineitem table, which has ~600 million rows, the export/import scripts would be:

PARQUETEXPORT @lineitem.parquet
{
    select * from lineitem;
}

PARQUETIMPORT
    @lineitem_001.parquet
    @lineitem_002.parquet
    @lineitem_003.parquet
{
    lineitem
    { l_orderkey l_orderkey
       l_partkey l_partkey
       l_suppkey l_suppkey
       l_linenumber l_linenumber
       l_quantity l_quantity
       l_extendedprice l_extendedprice
       l_discount l_discount
       l_tax l_tax
       l_returnflag l_returnflag
       l_linestatus l_linestatus
       l_shipdate l_shipdate
       l_commitdate l_commitdate
       l_receiptdate l_receiptdate
       l_shipinstruct l_shipinstruct
       l_shipmode l_shipmode
       l_comment l_comment
    }
}


What's new in versions prior to 8.2.3507:


Licensing:


SAND V8.1 comes with a default 'Free Demo' mode, in which a database can be run without a license as long as the maximum size of the database remains within the range of 16-20 GB of data. After this limit is passed, any new data added will fail to be saved via COMMIT until the size of the database is decreased or a larger license is purchased. This allows new users to explore and experience the Nucleus database, and to purchase a license should they need to expand the database beyond the demo limits.

New output on the server's start-up screen now shows the used/free pages, and the page size of the database:

      Example: Database has 11903 pages with 7363 free and page size 1048576

The error for when the demo size limit is exceeded is:

      Error state 40001. Transaction error - a possible conflict forced a rollback
      database size exceeds license limit

Also note that MPP (Massively Parallel Processing) is disabled in Free Demo mode. Queries between federated nodes will return the following error:

Example:

     SELECT MIN(qty) FROM spj GROUP BY pn ORDER BY 1 FETCH FIRST 4 ONLY;
      Error state 62400. Non-existent connection
      Connection: <#> Requires a License


Performance:

V8's emphasis is on query performance improvements, especially on large data sets.
Internal testing and benchmarking on large data sets have shown up to a 3x improvement compared to previous releases.

Additionally, new default load functionality has been introduced: the Column-at-a-Time (CaaT) load, which can significantly improve load times. It replaces the previous default mode, 'Two-Phase' or 'Serial' loading, with a multi-threaded, column-oriented approach.

While the changes made to achieve these improvements are not visible to the user, they involve:

SQL:

- Exploiting additional opportunities for parallel execution (multi-threading).
- Query optimization improvements in numerous areas.
- Improvements to expression evaluation.
- Caching of internal objects has been greatly streamlined.
- Reduced calls to low-level information functions.
- Making low-level engine operations more efficient.
- Supporting the use of Join Indexes.
(Additional details at the bottom of the 8.1 section of this readme.)

LOAD:

The CaaT (Column-at-a-Time) load mode replaces the default two-phase, or serial load, behavior of the ndlm loader utility. This new load paradigm concentrates on processing parallel threads of column-centric data from the flat file to stream into the database engine, rather than reading in and passing through chunks of rows.

The new CaaT load process should be used with the -j option to specify the number of read threads to use to parse the load file data. The best performance is normally achieved when the -j option is set to the number of processors available to the loader on the load client computer.

The first stage of the CaaT load is the "Striping" stage where the flat file is turned from rows into columns. After striping the client opens up a socket connection for each column of data and starts sending the data asynchronously to the server. For very large data files, column indexing should be performed at a regular interval. The -forceindex command line option is used to control the column indexing frequency. The default is every 200 million rows and is configurable from 1 to 1000. This parameter is also useful when limited temporary file space is available for striping since it controls the size of the temporary files. The number of connections that are opened during the asynchronous phase of processing is controlled by the CPU Count at the server as well as the maximum user connections allowed at the server.


New Functionality:


In addition to performance gains, V8 includes new features and functions available to the user, such as:


CaaT Load options:


Two new command-line flags are available in ndlm. The first allows the user to revert to the older default load behavior, and the second to tweak the periodic indexing of data during a CaaT load:

     -twophase = force no Column at a Time (CaaT) loading
     -forceindex nnn = force CaaT indexing at nnn million rows (1-1000/default 200)

Additionally, server-side commands have been made available to administer resources related to CaaT loading:

      ALTER SESSION SET CPUCOUNT = <n>;

When the server starts up the CPU count is set to the number of processors at the server computer. This count can also be changed with the above SESSION command.

      ALTER SESSION SET LOADCOUNT = <n>;

In addition to the CPUCOUNT, the server memory resources for loading can be controlled with the LOADCOUNT parameter. The default is 16, i.e. 16 million column elements per chunk, to be divided among the number of loading threads. The minimum is 1 and the maximum is 32.
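
For example, a session preparing a CaaT load on an 8-processor server might issue the following (the values shown are illustrative only):

      ALTER SESSION SET CPUCOUNT = 8;
      ALTER SESSION SET LOADCOUNT = 24;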


Expanded SQL support:


- New SQL functions now included in V8 are:

GREATEST() and LEAST() Functions
FIRST_DAY() and LAST_DAY() of Month Functions
INITCAP() Function
LPAD() and RPAD() Character Functions
LAG() and LEAD() OLAP Functions

Synonyms:

TO_DATE() for DATE()
TO_CHAR() for CHAR()
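
As a brief illustration, the following query combines several of the new functions (the table and column names are hypothetical, and the functions are assumed to follow their usual Oracle-style semantics):

SELECT GREATEST(q1_sales, q2_sales, q3_sales, q4_sales) AS best_quarter,
       INITCAP(region_name)                             AS region,
       LPAD(region_code, 6, '0')                        AS padded_code,
       LAST_DAY(period_start)                           AS period_end
FROM regional_sales;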

- V8 now supports more Oracle-type syntax and functions, including:

Oracle (+) Outer Join Notation is now supported
DECODE() Function
NVL() Function
TRUNC() Function, Single-Parameter

Additional Oracle Synonyms for Nucleus Special Constants are included:

SYSDATE for CURRENT DATE
SYSTIME for CURRENT TIME
SYSTIMESTAMP for CURRENT TIMESTAMP
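
A sketch of the Oracle-style syntax in a single query (customers and orders are hypothetical tables):

SELECT c.name,
       NVL(o.total, 0)                                        AS order_total,
       DECODE(o.status, 'O', 'Open', 'C', 'Closed', 'Other')  AS status_text
FROM   customers c, orders o
WHERE  c.cust_id = o.cust_id (+);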

- Common Table Expressions (CTEs) using the WITH clause have been implemented.

Currently, only non-recursive CTEs are supported.
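
A minimal non-recursive CTE, using the lineitem table referenced elsewhere in this readme:

WITH part_totals AS (
    SELECT l_partkey, SUM(l_quantity) AS total_qty
    FROM lineitem
    GROUP BY l_partkey
)
SELECT l_partkey, total_qty
FROM part_totals
WHERE total_qty > 1000;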


Additional OLAP support:


- Support for a limited set of OLAP Aggregate functions was added.

The supported non-reporting OLAP aggregate functions are:

COUNT(*)
COUNT(DISTINCT <column>)
MIN(<column>)
MAX(<column>)
SUM(<column>)
SUM(DISTINCT <column>)
AVG(<column>)
AVG(DISTINCT <column>)

- OLAP "Reporting" Aggregate Functions:

These are aggregate functions used in an OVER clause with no ORDER BY clause.
The aggregate functions supported are:

AVG(), COUNT(), MAX(), MIN(), SUM(), MEANDEV(), MEDIAN(), STDDEV(),
SUMSQ(), VARIANCE(), CENTMOM2(), CENTMOM3(), CENTMOM4(), KURTOSIS(),
SKEWNESS(), SUMCUBE(), SUMQUAD() and ONEOF().
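
For example, a reporting aggregate attaches a per-partition value to every row (again using the lineitem table for illustration):

SELECT l_returnflag,
       l_extendedprice,
       SUM(l_extendedprice) OVER (PARTITION BY l_returnflag) AS flag_total,
       AVG(l_extendedprice) OVER (PARTITION BY l_returnflag) AS flag_avg
FROM lineitem;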


Expanded CASTing support, including:


- DATE Casting of a Character String Based on a Picture

The DATE() function now takes up to two arguments:
the string to be cast and a picture defining the date elements.

DATE( <string>, <picture> )

- CHAR() Casting of Durations.

Previously, the CHAR() function would return an error if it was used to cast
a duration to a character string. This is now allowed.

CHAR( <datetime> - <datetime> )

- INTEGER() Casting of Time Intervals

- Expanded CAST() Function for Existing Casts

We now support the syntax of:

CAST ( <value-expression> AS <data type> )
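
An illustrative combination of the expanded casts (the ship_date_text character column and the 'YYYY-MM-DD' picture string are assumptions made for the sake of the example; the remaining columns come from the lineitem examples above):

SELECT DATE(ship_date_text, 'YYYY-MM-DD')  AS ship_date,
       CAST(l_quantity AS INTEGER)         AS qty,
       CHAR(l_receiptdate - l_shipdate)    AS transit_time
FROM lineitem;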


DOMAIN Queries and Aggregation:


- V8 now has the ability to directly query Nucleus DOMAINS:

Examples:

SELECT DOMAIN VALUES FROM DOMAIN <schema>.<domain>

SELECTs the values stored in the named domain.

SELECT DOMAIN COUNT FROM DOMAIN <schema>.<domain>

SELECTs the number of values stored in the named domain.

SELECT DOMAIN MAX[(n)] FROM DOMAIN <schema>.<domain>

SELECTs the maximum value (or the n maximum values) stored in the named domain.

SELECT DOMAIN MIN[(n)] FROM DOMAIN <schema>.<domain>

SELECTs the minimum value (or the n minimum values) stored in the named domain.
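
For example, assuming a domain named PUBLIC.D_SHIPMODE exists:

SELECT DOMAIN VALUES FROM DOMAIN PUBLIC.D_SHIPMODE;
SELECT DOMAIN COUNT  FROM DOMAIN PUBLIC.D_SHIPMODE;
SELECT DOMAIN MAX(5) FROM DOMAIN PUBLIC.D_SHIPMODE;
SELECT DOMAIN MIN    FROM DOMAIN PUBLIC.D_SHIPMODE;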


Explain Plans:


The ability to produce more user-friendly Query EXPLAIN Plans is now available.
EXPLAIN Plan output can be toggled on or off via the following Admin commands:

ALTER SESSION SET EXPLAIN=1|0;
ALTER SESSION SET EXPLAINONLY=1|0;

Both commands display the final QEP (Query Execution Plan) after all optimizations
have been applied. This is the query that will be executed by Nucleus.

EXPLAIN displays the QEP then executes the query.

EXPLAINONLY displays the QEP and returns control to the user without executing the query.

The QEP display includes the elements of the query and their order of execution, along with estimated row counts and cost to execute.

If EXPLAIN is used, the query execution is followed by a display of the execution plan with timing and I/O information gathered during execution.
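
For example, to display the plan for a query without executing it, and then turn the display off again (using the lineitem table from the examples above):

ALTER SESSION SET EXPLAINONLY=1;
SELECT COUNT(*) FROM lineitem WHERE l_shipdate > '1995-01-01';
ALTER SESSION SET EXPLAINONLY=0;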


Improved Handling of Host Variables:


Allow for Host Variables and Constants in Date Casting
Allow for Concatenation of Host Variables and Constants
Addition of RANGE Processing for Host Variables and Constants


Additional detail of internal improvements impacting performance:



1. Exploiting additional opportunities for parallel execution (multi-threading):
a. Aggregations and expressions are multi-threaded both vertically (multiple operations) and horizontally (large columns are partitioned).
b. UNIONs are executed in parallel.
c. Some filtering is now executed in parallel.
d. Global resources used by threads have been replaced with local resources to reduce contention.
e. Execution of disjoint JOINs and Semijoins in bushy execution trees has been made parallel.
f. Added parallel subordinate execution to GROUP BY and ORDER BY.
g. Added parallel execution of disjoint subqueries in predicates.
2. Query optimization is improved in numerous areas:
a. Bushy JOIN trees can now be produced, allowing more optimization opportunities.
b. More opportunities to turn regular JOINs into Star Joins and Semijoins are discovered and used.
c. GROUP BY can now be pushed over or under JOINs.
d. More opportunities to push filters over or under JOINs are discovered and used.
e. Some queries are internally rewritten into forms that Nucleus can optimize and execute more efficiently.
f. Reuse SUM() (divided by COUNT()) when SUM() and AVG() of the same column appear in a query.
g. Support Multi-Column Star Join.
h. Split multi-key joins when possible.
i. Improve ordering of predicate operations.
j. Improve performance of Multi-Key JOINs.
k. Short circuit query execution when returning empty sets.
l. Hot-rod single value MAX() and MIN() queries and COUNT(*)-only queries.
m. Improve performance of queries with ORDER BY and FETCH FIRST n ONLY by using FETCH to limit data sets.
n. GROUP BY with large datasets has been pipelined.
o. After completing the above optimizations, the costing of operations was adjusted to account for these improvements.
3. Improvements to expression evaluation:
a. Redundant and unnecessary sub-expressions are combined or ignored.
b. Redundant (and commutative redundant) predicates in queries are removed.
c. Expressions are normalized and simplified.
d. Expressions are rewritten to find and exploit JOIN opportunities.
e. Algebraic strength reduction is used to simplify expressions.
f. Common subexpressions in queries are recognized and executed once only.
g. Redundant predicates created by swapping the columns and operators are now recognized and removed.
4. Caching of internal objects has been greatly streamlined.
a. Column comparisons are cached and reused when possible.
5. Calls to low-level information functions have been reduced.
6. Low-level operations have been made more efficient.
7. Support use of Join Indexes (MATERIALIZED JOINs and PRIMARY KEY->FOREIGN KEYs) to improve query performance.


What's New in version 7.1

SAND CDBMS release 7.1 incorporates the following new features and improvements:


Massively Parallel Processing (MPP)

SAND CDBMS v7 supports a form of Massively Parallel Processing (MPP) where table data is partitioned among multiple computers in a network to take advantage of simultaneous intra-query processing. Intra-query results are received from each worker node, combined at the head node, and full query results are transparently returned to end users.

One of the main components of SAND MPP is the "partitioned table", which is a special table created on the head node that does not contain data, but rather points via local linked tables to the remote tables that store the partitioned data. End users query this table as they would any standard table, and under the covers the query is redirected to remote nodes for parallel execution.

A partitioned table is created using the new CREATE PARTITION TABLE command, which defines the partitioning method (hash, round robin, or range) and the structure of the partitioned table. When the CREATE PARTITION TABLE command is issued, the remote partitions must already exist and have compatible structures.

Once a partitioned table exists, the SAND Data Loader (ndlm) can load new records into it from files or ODBC data sources, using standard IMPORT jobs. When a load operation is executed against a partitioned table on the head node, the data is actually pushed to the remote tables associated with the partitioned table.

The contents of a partitioned table can be changed via DELETE and UPDATE statements, as well as INSERT...SELECT statements that query other partitioned and/or dimension tables (but not direct INSERT via a value list at this time). The structure of a partitioned table can also be changed using ALTER TABLE statements.

To delete a partitioned table, simply execute the standard DROP TABLE command with the partitioned table specified.

To delete a partitioned table and its related database objects, including remote tables, add the CASCADE keyword to the end of the DROP TABLE command. This command will remove the following objects:

  • the specified partitioned table
  • each linked table that was specified in the partitioned table creation statement
  • the remote tables (partitions) associated with the link tables
  • the default partition of each user that had the partitioned table set as default
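
For example, either of the following could be issued against a hypothetical partitioned table named sales_part:

DROP TABLE sales_part;            -- removes only the partitioned table on the head node
DROP TABLE sales_part CASCADE;    -- also removes the linked tables and remote partitions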

There is an MPP mode that is required for some MPP activities. A user automatically enters MPP mode if that user has a default partitioned table, which is defined via the ALTER AUTHORIZATION...SET DEFAULT PARTITION command. When in MPP mode, certain MPP-related commands are executed in the context of the default partitioned table (summarized in a table in the "MPP Mode" section of the SAND CDBMS Administration Guide).

In addition to partitioned tables, SAND MPP also supports "dimension tables", which are special tables created on the head node that are automatically replicated and maintained across the different nodes of the system. In data warehousing terms, this kind of table functions as a "dimension" table in relation to a partitioned "fact" table, if a star or snowflake schema model is being used.

Once a dimension table is created and replicated across nodes, all changes (via INSERT, UPDATE, DELETE, ALTER TABLE) to this table on the head node are propagated to the other nodes, whether or not in MPP mode. Similarly, removing the dimension table from the head node with the DROP TABLE command will cause the same table to be removed from the remote nodes. An existing dimension table can be converted to a standard table using the ALTER TABLE...DROP DIMENSION command (the corresponding remote tables are not deleted by this action).

In MPP mode, all new tables created through the CREATE TABLE command are dimension tables. To create a non-dimension table in MPP mode, the CREATE LOCAL TABLE syntax must be used. Existing tables can also be converted to dimension tables using the ALTER TABLE...ADD DIMENSION command (in MPP mode) or the ALTER TABLE...DIMENSION OF... command (outside of MPP mode), but this will only copy the table structure to the remote nodes, not the table contents. The REFRESH DIMENSION TABLE command can be used at any time to propagate the contents of a dimension table on the head node to each of the corresponding remote dimension tables, replacing existing records in the remote dimension tables if they are not empty.
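
A sketch of the difference in MPP mode (the table names are hypothetical, and the exact argument form of REFRESH DIMENSION TABLE should be confirmed in the SQL Reference Guide):

CREATE TABLE d_region (r_id INTEGER, r_name CHAR(25));    -- dimension table, replicated to all nodes
CREATE LOCAL TABLE work_tmp (id INTEGER);                 -- ordinary table, head node only
REFRESH DIMENSION TABLE d_region;                         -- re-propagate head-node contents to the remote copies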

A special domain, called a "distributed domain", is automatically replicated across the different nodes of the system. This type of domain is used in the field definitions of dimension tables. Any new domain created in MPP mode is a distributed domain, unless the CREATE LOCAL DOMAIN syntax is used. Outside of MPP mode, the CREATE DOMAIN...DIMENSION OF... command is used to create a distributed domain.

For more details about SAND MPP, refer to the Massively Parallel Processing (MPP) chapter of the SAND CDBMS Administration Guide and the descriptions of the related SQL commands in the SAND CDBMS SQL Reference Guide.


New SQL Command for Remote Database Execution

The new EXECUTE command sends an embedded SQL command to a remote SAND database for execution. The syntax is as follows:

EXECUTE :: <SQL command> :: WITH CONNECTION <connection>;

The remote database is accessed using information contained in the connection object specified by the EXECUTE command, so this connection object must have been defined previously via the CREATE CONNECTION command. The embedded SQL statement is delimited by double colons ("::"), and must not include a terminating semicolon (";") or be contained between quotation marks.

Note that the embedded SQL statement is a "pass-through" command. There is no local validation of this SQL before it is sent; it is executed as is at the remote database.
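
For example, assuming a connection object named remote_db has already been created via CREATE CONNECTION (the embedded statement is purely illustrative):

EXECUTE :: DELETE FROM t1 WHERE id > 100 :: WITH CONNECTION remote_db;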


DEC / DECIMAL / NUMERIC Data Type Precision Increased to 50

The maximum number of decimal places supported by the DEC/DECIMAL/NUMERIC data types has been increased from 31 to 50.

The nucleus.ini CONNECTION section parameter IntegerDivisionScale, which sets the scale for the results of integer division, has changed as a result. The maximum value has changed from 31 to 50. For large IntegerDivisionScale values, the maximum scale for A divided by B has been changed to the following:

  • If A is SMALLINT, then an IntegerDivisionScale value of 45 to 50 yields a result with a maximum of 45 decimal places.
  • If A is INTEGER, then an IntegerDivisionScale value of 40 to 50 yields a result with a maximum of 40 decimal places.
  • If A is UNSIGNED, then an IntegerDivisionScale value of 31 to 50 yields a result with a maximum of 31 decimal places.
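
For illustration, a column can now be declared with more than 31 decimal places (the table name and the exact precision/scale combination shown are illustrative assumptions):

CREATE TABLE rates (conversion_factor DEC(50, 45));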


Change to FLOAT Data Type Precision

The meaning of the FLOAT data type argument, when defining a column or domain, has changed from bits of precision to significant decimal digits of precision. The valid precision values are now 1 to 50, inclusive. For example, the following defines a floating point column that can store up to 30 significant digits:

CREATE TABLE T1 (C1 FLOAT(30));

As a consequence, the SQL FLOAT( ) cast function now includes an optional extra parameter that limits the precision of the returned value to the specified number of decimal places:

FLOAT ( value-expression [ , precision ] )
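
For example, to limit a computed value to four decimal places (column names taken from the lineitem examples above):

SELECT FLOAT(l_extendedprice / l_quantity, 4) AS unit_price
FROM lineitem;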


New SAND JDBC Driver

The Nucleus JDBC Driver has been renamed the "SAND JDBC Driver", and the associated jar file name is now "sandj.jar" (previously "nucleus.jar"). The new JDBC driver no longer requires the Nucleus ODBC Driver (sando.dll) to function, but requires a later version of Java (v1.6 or higher) than previously (v1.3.1).

The connection string for the SAND JDBC Driver has the following structure:

jdbc:sand://[username:password@]hostname:port/database

Refer to the readme_jdbc(.txt) file in the installation directory for more information about the SAND JDBC Driver.


New Octopus Parameters

The following new optional ODDB parameter can be specified in the DATABASE sections of the nucleus.ini file:

ODDBPreferredNode=<comma-separated list of nodes>
Specifies a prioritized list of nodes where the ODDB instance should start. Octopus will attempt to start the database as an ODDB instance on the first node in the ODDBPreferredNode list. If that attempt fails, another attempt will be made to start the instance on the next node in the list. And so on down the list until the ODDB instance successfully starts on a node, or the list of nodes is exhausted (resulting in an error condition). The asterisk wildcard character (*) can be used in the list to indicate "any available node".
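
For example, the following entry (node names are placeholders) tells Octopus to try nodeA first, then nodeB, then any available node:

ODDBPreferredNode=nodeA,nodeB,*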


The following new optional parameters can be specified in OCTOPUS sections of the nucleus.ini file:

Recover={TRUE | FALSE}
When set to TRUE in Public Update mode, Octopus will restart the master engine in recovery mode after an abnormal termination, which will allow a restart from the last successful COMMIT WORK.

TMPAutoDel={TRUE | FALSE}
(On-Demand Database mode only) Forces OctoEng processes to automatically delete (TRUE) or retain (FALSE) the temporary Delta File.


Analytics Parameters Ported to Octopus

All of the LDAP parameters can now be set for Octopus instances, in OCTOPUS sections of the nucleus.ini file:

  • LDAPBASEDN
  • LDAPBINDINGS
  • LDAPCNALIAS
  • LDAPDOMAIN
  • LDAPHOST
  • LDAPMODE
  • LDAPPATH
  • LDAPROLE
  • LDAPSERVICEPASSWORD
  • LDAPSERVICEUSER

The following parameters can now be set for Octopus instances, in OCTOPUS or SUBCLASS sections of the nucleus.ini file:


Virtual Memory Statistics Displayed on Server Startup (UNIX Only)

In UNIX, nserv's virtual memory information (VM limit, current usage, and usage as a percentage of the limit) is now displayed at startup, similar to the following:

STARTUP: VMEM: 1152 MB limit with 149 MB in-use (13%)