SAND CDBMS SQL Reference Guide
Value Expressions

 

Previous Topic:
Aggregate Functions
Chapter Index
Next Topic:
Cast Functions

 

Functions


OLAP Functions

OLAP (OnLine Analytical Processing) functions are tools for analyzing and presenting multidimensional data. The OLAP functions supported by SAND CDBMS can be used for various forms of data ranking and row numbering. In all cases, the input data can be partitioned by field(s), so that the function applies separately to each partitioned set. If there is no explicit partitioning, the whole set of input rows is considered one single partition.

The following OLAP functions are supported:

CUME_DIST()
Gives the cumulative distribution of rows in a partition, based on the ordering specified in the OVER clause, with the percentile ranking of each row expressed as a floating point value ranging from 0 to 1. In ascending order, this value is calculated as the number of rows ranked the same or lower than the current row (including the current row), divided by the total number of rows in the partition. In descending order, the value is the number of rows ranked higher than or equal to the current row, divided by the total number of rows in the partition.

This function takes no parameters, so empty parentheses must be specified immediately after the function name.

DENSE_RANK()
Gives the numerical rank for each row in a partition, based on the ordering specified in the OVER clause. The ranking starts with 1 for the highest rank, and increments by 1 for each lower ranking. If two or more rows are tied in the ordering, they will all share the same rank value.

The only difference between the DENSE_RANK() and RANK() functions is in how ranking ties are handled. With DENSE_RANK() there is no gap in the numerical ranking sequence following tied ranks (for example, 1, 2, 2, 2, 3), whereas there is a sequence gap with RANK() (for example, 1, 2, 2, 2, 5).

This function takes no parameters, so empty parentheses must be specified immediately after the function name.

NTILE(integer)
Gives the tiled rank for each row in a partition, based on the ordering specified in the OVER clause. A positive integer, representing the total number of tiled ranks, must be provided for this function. For instance, NTILE(4) means that the input data will be ranked 1-4, with the top 25% ordered rows in the first rank, the next 25% in the second rank, the next 25% in the third rank, and the bottom 25% in the fourth rank.

In cases where the specified NTILE number does not divide evenly into the number of input rows, the number of rows in each tiled rank is calculated as

CEILING( <number of rows in the partition>/<NTILE number> )

except for the lowest rank, which will have one or more less rows than the higher ranks.

If the specified NTILE value exceeds the number of elements in the partition, each row will be ranked with a unique number, resembling the behavior of the ROW_NUMBER() function.

PERCENT_RANK()
Gives a floating point ranking value between 0 and 1 for each row in a partition, based on the ordering specified in the OVER clause. The ranking value for a row is calculated as <the row's rank in the partition> – 1, divided by <the total number of rows in the partition> – 1. The first row is always ranked 0, and the last row is always ranked 1. If there is a single row in the partition, it has a rank of 0.

This function takes no parameters, so empty parentheses must be specified immediately after the function name.

RANK()
Gives the numerical rank for each row in a partition, based on the ordering specified in the OVER clause. The ranking starts with 1 for the highest rank, and increments by 1 for each lower ranking, except when there are ties in rank. If two or more rows are tied in the ordering, they will all share the same rank value, and the next rank number will be the previous rank number + the number of rows tied for the previous rank.

The only difference between the DENSE_RANK() and RANK() functions is in how ranking ties are handled. With DENSE_RANK() there is no gap in the numerical ranking sequence following tied ranks (for example, 1, 2, 2, 2, 3), whereas there is a sequence gap with RANK() (for example, 1, 2, 2, 2, 5).

This function takes no parameters, so empty parentheses must be specified immediately after the function name.

ROW_NUMBER()
Gives a number for each row in the partition, based on the ordering specified in the OVER clause. The rows start with 1 and increment by 1 for each subsequent row, regardless of ties in the ordering.

This function takes no parameters, so empty parentheses must be specified immediately after the function name.


Syntax


NTILE (integer)

The NTILE function requires a positive integer as a parameter. This value sets the total number of tiled ranks that will be included in the results. So, for example, NTILE (100) will rank the results 1-100 for each partition, with approximately n/100 rows in each rank (where n is the total number of rows in the partition).

PARTITION BY column name
The optional PARTITION BY column list specifies one or more fields on which to split the input data. Each row that shares a common value for the partition field will be considered part of the same partition, and all OLAP functions included in the statement will be applied separately for each partition. Multiple partition fields can be specified, separated by commas. When there are multiple partition fields, rows must have the same value in each field to be part of the same partition.

If the PARTITION BY clause is omitted, the entire set of input rows is considered a single partition.

ORDER BY expression
The ORDER BY clause defines the way the input data is ordered within each partition for the OLAP function. The value expression that sets the order can consist of any combination of column names, functions, constants, and/or scalar subqueries.
Multiple ordering expressions can be specified, separated by commas.

Note that this ORDER BY clause is independent of any other ORDER BY clause appearing in the SQL statement, and does not necessarily reflect the way the result set will be ordered.

ASC | DESC
The results of the preceding ORDER BY
can be sorted in ascending order (the default) or descending order by specifying the ASC or DESC keywords, respectively.

NULLS FIRST | NULLS LAST
The results of the preceding ORDER BY can have null values sorted before non-null values or after (the default) by specifying the NULLS FIRST or NULLS LAST option, respectively.

Note that the sorting of nulls is independent of the ASC/DESC option, so NULLS LAST will put the null values at the bottom, and NULLS FIRST will put the null values at the beginning, regardless of whether the non-null values are in ascending or descending order.


Examples

Consider this table called Sales:

TDATE PART QTY TSALES CREDIT
2008-01-04 nail 610 68.95
Y
2008-01-04 rivet 143 40.56
N
2008-01-07 nail 756 85.34
Y
2008-01-07 screw 94 12.79
N
2008-01-07 nut 478 27.02
N
2008-01-09 nut 350 19.80
N
2008-01-09 rivet 85 ****
Y
2008-01-10 nail 917 103.55
Y
2008-01-10 bolt 412 65.12
Y
2008-01-10 screw 169 22.93
N
2008-01-14 rivet 184 52.00
Y

where **** represents a null value.


The following SELECT statement on the Sales table returns the cumulative distribution of unique values in the Part field:

SELECT Part AS part,
       SUM(TSales) AS total_sales,
       CUME_DIST() OVER ( ORDER BY SUM(TSales) ) AS cume_dist
  FROM Sales
  GROUP BY Part
  ORDER BY SUM(TSales) ;

Output:

part     total_sales  cume_dist             
-------  -----------  ----------------------
screw          35.72    2.00000000000000e-01
nut            46.82    4.00000000000000e-01
bolt           65.12    6.00000000000000e-01
rivet          92.56    8.00000000000000e-01
nail          257.84    1.00000000000000e+00

Ordering by the sum of TSales (ascending, which is the default), the CUME_DIST() function shows where each part lies in the distribution of sales by part. In this case, screws have the lowest total sales, such that this part is in the bottom 20% of the cumulative distribution, which also means that 80% of the other parts have higher total sales. Next is the nut, which is in the bottom 40% of total sales by part, followed by the bolt at 60% and the rivet at 80% of the cumulative distribution. Finally, the nail is at 100% of the cumulative distribution, meaning 0% (that is, none) of the other parts have more total sales than the nail.


The next example includes both the NTILE() and PERCENT_RANK() functions:

SELECT TDate AS date,
       SUM(TSales) AS total_sales,
       NTILE(3) OVER ( ORDER BY SUM(TSales) DESC ) AS ntile,
       PERCENT_RANK() OVER ( ORDER BY SUM(TSales) ) AS percent_rank
   FROM Sales
   GROUP BY TDate
   ORDER BY SUM(TSales) ;

Output:

date        total_sales   ntile   percent_rank         
----------  ------------  -----  ----------------------
2008-01-09        19.80       3    0.00000000000000e+00
2008-01-14        52.00       2    2.50000000000000e-01
2008-01-04       109.51       2    5.00000000000000e-01
2008-01-07       125.15       1    7.50000000000000e-01
2008-01-10       191.60       1    1.00000000000000e+00

Here, we are grouping total sales by date, and presenting the information in ascending order of sales per day. The NTILE() function serves to rank the rows in descending order of sales per day, according to a "tertile" (1 to 3) ranking because 3 was specified as the parameter for NTILE(). So in the ntile field, the lowest total sales is given the lowest rank (3), the next two total sales are given the middle rank (2), and the two highest sales totals are given the top rank (1).

Also in the above example, the PERCENT_RANK() function is based on the ascending order of total sales, so it displays for each row in floating point form what percentage of the other rows (excluding the current one) have lower total sales. The first row has the lowest total sales (19.80), so 0% of the other rows are ranked below it. For the next row (with 52.00 in total sales), one out of the four other rows is below it in total sales, so the percent rank is 25%. The percent rank for the next rows are calculated in the same manner. The row with the highest total sales (191.60) has all four of the other rows below it, so the percent rank for that last row is 100%.


The following example illustrates the difference between the RANK() and DENSE_RANK() functions:

SELECT Part AS part,
       SUM(Qty) AS qty,
       RANK() OVER ( ORDER BY SUM(qty) DESC ) AS rank,
       DENSE_RANK() OVER ( ORDER BY SUM(qty) DESC ) AS dense_rank
   FROM Sales
   GROUP BY part
   ORDER BY rank ;

Output:

part    qty    rank  dense_rank
------  -----  ----  ----------
nail     2283     1           1
nut       828     2           2
rivet     412     3           3
bolt      412     3           3
screw     263     5           4

Both of the OLAP functions in the above example rank the summed qty values per part in descending order. So the highest qty value is given a rank of 1, the next highest a rank of 2, and so on. The difference arises after we have a tie in the qty values. Both of the functions assign a rank of 3 to the tied rows, but the subsequent row is given a rank of 5 by RANK() and a rank of 4 by DENSE_RANK(). No matter how many ties in the ranking there are, DENSE_RANK() will never skip a number in the numerical sequence; whereas RANK(), after ties in the ranking, will assign to the next row a number equal to the tied rank plus the number of rows tied for that rank (in this case, a tied rank of 3 + 2 tied rows = 5).


The final example shows the use of the ROW_NUMBER() function with partitioning and ordering that places null values at the beginning:

SELECT ROW_NUMBER() OVER (PARTITION BY Credit ORDER BY TSales DESC NULLS FIRST) AS row,
       TDate AS date,
       Part AS part,
       TSales AS sales,
       Credit AS credit
   FROM Sales
   ORDER BY credit, row ;

Output:

row  date        part     sales    credit
---  ----------  -------  -------  ------
  1  2008-01-04  rivet      40.56   N
  2  2008-01-07  nut        27.02   N
  3  2008-01-10  screw      22.93   N
  4  2008-01-09  nut        19.80   N
  5  2008-01-07  screw      12.79   N
  1  2008-01-09  rivet       ****   Y
  2  2008-01-10  nail      103.55   Y
  3  2008-01-07  nail       85.34   Y
  4  2008-01-04  nail       68.95   Y
  5  2008-01-10  bolt       65.12   Y
  6  2008-01-14  rivet      52.00   Y

In the above example, the ROW_NUMBER() function operates separately on the data sets created by partitioning the Credit field. Since there are two unique values for Credit ('N' and 'Y'), there are two partitions. For each partition, the row numbers (starting at 1 and incrementing by 1) are assigned to the rows in descending order of sales.

In addition, the ROW_NUMBER() ordering specifies the NULLS FIRST option, which means that any row with a null value in the TSales column will be placed at the start of the partition, rather than at the end, which is the default. Hence, the first row in the second partition is assigned a row number of 1 because its sales (TSales) value is null (represented by ****).

 

Previous Topic:
Aggregate Functions
Chapter Index
Next Topic:
Cast Functions