SQL is a very capable language, and there are very few questions that it cannot answer. I find that I can come up with some convoluted SQL query to answer virtually any question you could ask of the data. However, the performance of some of these queries is not what it should be, nor is the query itself easy to write in the first place. Some of the things that are hard to do in straight SQL are actually very commonly requested operations, including:
- Calculate a running total - Show the cumulative salary within a department row by row, with each row including a summation of the prior rows' salary.
- Find percentages within a group - Show the percentage of the total salary paid to an individual in a certain department. Take their salary and divide it by the sum of the salary in the department.
- Top-N queries - Find the top N highest-paid people or the top N sales by region.
- Compute a moving average - Average the current row's value and the previous N rows' values together.
- Perform ranking queries - Show the relative rank of an individual's salary within their department.
Both the "Top-N" and ranking queries could perhaps be implemented by simply returning a row number for each result row (after sorting). The row number can then be used to calculate position-based information. It would be similar to Oracle's "rownum" pseudo-column, but without its limitations.
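As a rough sketch of the row-number idea, the snippet below numbers rows within each department and keeps the first N. It uses Python's built-in sqlite3 module only to keep the example self-contained (this assumes the bundled SQLite is version 3.25 or later, which supports window functions); the emp table, its columns and the sample rows are made up for illustration.

```python
import sqlite3

# Hypothetical sample data (name, dept, salary); not taken from the article.
ROWS = [
    ("alice", "sales", 500),
    ("bob", "sales", 300),
    ("carol", "sales", 400),
    ("dave", "it", 700),
    ("erin", "it", 650),
    ("frank", "it", 600),
]


def top_n_per_dept(n):
    """Return the n highest-paid people in each department."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table emp (name text, dept text, salary integer)")
    conn.executemany("insert into emp values (?, ?, ?)", ROWS)
    sql = """
        select dept, name, salary, rn
        from (
            -- number the rows inside each department, highest salary first
            select dept, name, salary,
                   row_number() over (partition by dept
                                      order by salary desc) as rn
            from emp
        )
        where rn <= ?
        order by dept, rn
    """
    result = conn.execute(sql, (n,)).fetchall()
    conn.close()
    return result


top2 = top_n_per_dept(2)
for row in top2:
    print(row)
```

The filter on rn has to sit in an outer query because the row number is only computed after the inner query's rows have been selected.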
Analytic functions are designed to address these issues. They add extensions to the SQL language that not only make these operations easier to code; they also make them faster than could be achieved with the pure SQL approach. These extensions have since been incorporated into the SQL standard (SQL:2003), where they are known as window functions.
The syntax of the analytic function is rather straightforward in appearance, but looks can be deceiving. It starts with:
FUNCTION_NAME(<argument>,<argument>,...) OVER (<Partition-Clause> <Order-by-Clause> <Windowing-Clause>)
- The PARTITION BY clause logically breaks a single result set into N groups, according to the criteria set by the partition expressions. The words 'partition' and 'group' are used synonymously.
- The ORDER BY clause specifies how the data is sorted within each group (partition).
- The windowing clause gives us a way to define a sliding or anchored window of data within a group, on which the analytic function will operate.
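To make the three clauses concrete, here is a small sketch that puts an anchored window (a running total) next to a sliding window (a two-row moving average) over the same partition and ordering. It is run through Python's sqlite3 module purely for convenience, assuming SQLite 3.25 or later; the readings table and its values are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: grp is the partition key, seq gives the order in a group.
conn.execute("create table readings (grp text, seq integer, val integer)")
conn.executemany(
    "insert into readings values (?, ?, ?)",
    [("a", 1, 10), ("a", 2, 20), ("a", 3, 30), ("a", 4, 40),
     ("b", 1, 5), ("b", 2, 15)],
)

sql = """
    select grp, seq, val,
           -- anchored window: from the first row of the group to this row
           sum(val) over (partition by grp order by seq
                          rows between unbounded preceding and current row)
               as running_total,
           -- sliding window: the previous row and this row
           avg(val) over (partition by grp order by seq
                          rows between 1 preceding and current row)
               as moving_avg
    from readings
    order by grp, seq
"""
window_rows = conn.execute(sql).fetchall()
conn.close()

for row in window_rows:
    print(row)
```

Both windows restart at the first row of each group, because the partition clause is applied before the windowing clause.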
Example:
This example shows how to use the analytic function SUM to perform a cumulative sum. First, we fill some values into a table. The table is very simple and consists of the fields dt and xy only. Note that it is possible to insert multiple rows for a given date, which is exactly what I do here. What I am interested in is the cumulative sum for each day in the table. That is, if I have three entries for the same date, for example 3, 4 and 5, I don't want the sum to be 3+4+5 for each row, but 3 for the first row, 3+4 for the second row and 3+4+5 for the third row.
create table sum_example ( dt date, xy number );
insert into sum_example values (to_date('27.08.1970','DD.MM.YYYY'),4);
insert into sum_example values (to_date('02.09.1970','DD.MM.YYYY'),1);
insert into sum_example values (to_date('09.09.1970','DD.MM.YYYY'),5);
insert into sum_example values (to_date('26.08.1970','DD.MM.YYYY'),3);
insert into sum_example values (to_date('28.08.1970','DD.MM.YYYY'),4);
insert into sum_example values (to_date('26.08.1970','DD.MM.YYYY'),6);
insert into sum_example values (to_date('29.08.1970','DD.MM.YYYY'),9);
insert into sum_example values (to_date('30.08.1970','DD.MM.YYYY'),2);
insert into sum_example values (to_date('12.09.1970','DD.MM.YYYY'),7);
insert into sum_example values (to_date('23.08.1970','DD.MM.YYYY'),2);
insert into sum_example values (to_date('27.08.1970','DD.MM.YYYY'),5);
insert into sum_example values (to_date('09.09.1970','DD.MM.YYYY'),9);
insert into sum_example values (to_date('01.09.1970','DD.MM.YYYY'),3);
insert into sum_example values (to_date('07.09.1970','DD.MM.YYYY'),1);
insert into sum_example values (to_date('12.09.1970','DD.MM.YYYY'),4);
insert into sum_example values (to_date('03.09.1970','DD.MM.YYYY'),5);
insert into sum_example values (to_date('03.09.1970','DD.MM.YYYY'),8);
insert into sum_example values (to_date('07.09.1970','DD.MM.YYYY'),7);
insert into sum_example values (to_date('04.09.1970','DD.MM.YYYY'),8);
insert into sum_example values (to_date('09.09.1970','DD.MM.YYYY'),1);
insert into sum_example values (to_date('29.08.1970','DD.MM.YYYY'),3);
insert into sum_example values (to_date('30.08.1970','DD.MM.YYYY'),7);
insert into sum_example values (to_date('24.08.1970','DD.MM.YYYY'),7);
insert into sum_example values (to_date('07.09.1970','DD.MM.YYYY'),9);
insert into sum_example values (to_date('26.08.1970','DD.MM.YYYY'),2);
insert into sum_example values (to_date('09.09.1970','DD.MM.YYYY'),8);
select dt, sum(xy) over (partition by trunc(dt) order by dt rows between unbounded preceding and current row) s, xy from sum_example;
drop table sum_example;
The analytic function used here is:
sum(xy) over (partition by trunc(dt) order by dt rows between unbounded preceding and current row)
The select statement returns the following (columns: dt, s, xy):
23.08.70 | 2 | 2 |
24.08.70 | 7 | 7 |
26.08.70 | 3 | 3 |
26.08.70 | 5 | 2 |
26.08.70 | 11 | 6 |
27.08.70 | 4 | 4 |
27.08.70 | 9 | 5 |
28.08.70 | 4 | 4 |
29.08.70 | 9 | 9 |
29.08.70 | 12 | 3 |
30.08.70 | 2 | 2 |
30.08.70 | 9 | 7 |
01.09.70 | 3 | 3 |
02.09.70 | 1 | 1 |
03.09.70 | 5 | 5 |
03.09.70 | 13 | 8 |
04.09.70 | 8 | 8 |
07.09.70 | 1 | 1 |
07.09.70 | 8 | 7 |
07.09.70 | 17 | 9 |
09.09.70 | 5 | 5 |
09.09.70 | 14 | 9 |
09.09.70 | 15 | 1 |
09.09.70 | 23 | 8 |
12.09.70 | 7 | 7 |
12.09.70 | 11 | 4 |
The third column corresponds to xy (the values inserted with the insert into ... statements above). The interesting column is the second. For example, on the 26th of August 1970, the first row for that date is 3 (equals xy), the second is 5 (equals xy + 3), and the third is 11 (equals xy + 3 + 2, that is, xy plus the running total of 5 from the row before).
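One caveat about this example: several rows share the same dt, so ordering by dt alone does not fix the order of rows within a day. The values for 26.08.70 were inserted as 3, 6, 2 but appear above as 3, 2, 6, so the intermediate sums could differ from run to run. The sketch below adds a tie-breaking column to make the running total deterministic. The seq column is a hypothetical addition, the dates are stored as ISO text, and Python's sqlite3 module (assuming SQLite 3.25 or later) stands in for Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# seq is an assumed extra column that records the intended order of the rows.
conn.execute("create table sum_example (seq integer, dt text, xy integer)")
conn.executemany(
    "insert into sum_example values (?, ?, ?)",
    [
        (1, "1970-08-26", 3),
        (2, "1970-08-26", 2),
        (3, "1970-08-26", 6),
        (4, "1970-08-27", 4),
        (5, "1970-08-27", 5),
    ],
)

sql = """
    select dt,
           -- order by seq inside each day so ties on dt are resolved
           sum(xy) over (partition by dt order by seq
                         rows between unbounded preceding and current row) as s,
           xy
    from sum_example
    order by dt, seq
"""
running = conn.execute(sql).fetchall()
conn.close()

for row in running:
    print(row)
```

With the tie-breaker in place, the rows for these two days always come out with the running totals 3, 5, 11 and 4, 9, matching the listing above.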
A list of analytic functions we could find in Oracle Express 10:
- AVG (<distinct|all> expression ) Used to compute an average of an expression within a group and window. Distinct may be used to find the average of the values in a group after duplicates have been removed.
- CORR (expression, expression) Returns the coefficient of correlation of a pair of expressions that return numbers. It is shorthand for: COVAR_POP(expr1, expr2) / (STDDEV_POP(expr1) * STDDEV_POP(expr2)). Statistically speaking, a correlation is the strength of an association between variables. An association between variables means that the value of one variable can be predicted, to some extent, by the value of the other. The correlation coefficient gives the strength of the association by returning a number between -1 (strong inverse correlation) and 1 (strong correlation). A value of 0 would indicate no correlation.
- COUNT (<distinct> <*> <expression>) This will count occurrences within a group. If you specify * or some non-null constant, count will count all rows. If you specify an expression, count returns the count of non-null evaluations of expression. You may use the DISTINCT modifier to count occurrences of rows in a group after duplicates have been removed.
- COVAR_POP (expression, expression) This returns the population covariance of a pair of expressions that return numbers.
- COVAR_SAMP (expression, expression) This returns the sample covariance of a pair of expressions that return numbers.
- CUME_DIST This computes the relative position of a row in a group. CUME_DIST will always return a number greater than 0 and less than or equal to 1. This number represents the 'position' of the row in the group of N rows. In a group of three rows, for example, the cumulative distribution values returned would be 1/3, 2/3, and 3/3.
- DENSE_RANK This function computes the relative rank of each row returned from a query with respect to the other rows, based on the values of the expressions in the ORDER BY clause. The data within a group is sorted by the ORDER BY clause and then a numeric ranking is assigned to each row in turn starting with 1 and continuing on up. The rank is incremented every time the values of the ORDER BY expressions change. Rows with equal values receive the same rank (nulls are considered equal in this comparison). A dense rank returns a ranking number without any gaps. This is in comparison to RANK below.
- FIRST_VALUE This simply returns the first value from a group.
- LAG (expression, <offset>, <default>) LAG gives you access to other rows in a result set without doing a self-join. In effect, it allows you to treat the cursor as if it were an array. You can reference rows that come before the current row in a given group. This allows you to select 'the previous rows' from a group along with the current row. See LEAD for how to get 'the next rows'. Offset is a positive integer that defaults to 1 (the previous row). Default is the value to be returned if the index is out of range of the window (for the first row in a group, the default will be returned).
- LAST_VALUE This simply returns the last value from a group.
- LEAD (expression, <offset>, <default>) LEAD is the opposite of LAG. Whereas LAG gives you access to a row preceding yours in a group, LEAD gives you access to a row that comes after your row. Offset is a positive integer that defaults to 1 (the next row). Default is the value to be returned if the index is out of range of the window (for the last row in a group, the default will be returned).
- MAX(expression) Finds the maximum value of expression within a window of a group.
- MIN(expression) Finds the minimum value of expression within a window of a group.
- NTILE (expression) Divides a group into 'value of expression' buckets. For example, if expression = 4, then each row in the group would be assigned a number from 1 to 4, putting it into a percentile. If the group had 20 rows in it, the first 5 would be assigned 1, the next 5 would be assigned 2 and so on. In the event the cardinality of the group is not evenly divisible by the expression, the rows are distributed such that no percentile has more than 1 row more than any other percentile in that group, and the lowest percentiles are the ones that will have 'extra' rows. For example, using expression = 4 again and the number of rows = 21, percentile = 1 will have 6 rows, percentile = 2 will have 5, and so on.
- PERCENT_RANK This is similar to the CUME_DIST (cumulative distribution) function. For a given row in a group, it calculates the rank of that row minus 1, divided by 1 less than the number of rows being evaluated in the group. This function will always return values from 0 to 1 inclusive.
- RANK This function computes the relative rank of each row returned from a query with respect to the other rows, based on the values of the expressions in the ORDER BY clause. The data within a group is sorted by the ORDER BY clause and then a numeric ranking is assigned to each row in turn starting with 1 and continuing on up. Rows with the same values of the ORDER BY expressions receive the same rank; however, if two rows do receive the same rank the rank numbers will subsequently 'skip'. If two rows are number 1, there will be no number 2 - rank will assign the value of 3 to the next row in the group. This is in contrast to DENSE_RANK, which does not skip values.
- RATIO_TO_REPORT (expression) This function computes the value of expression / (sum(expression)) over the group. This gives you the percentage of the total the current row contributes to the sum(expression).
- REGR_xxxxxxx (expression, expression) These linear regression functions fit an ordinary-least-squares regression line to a pair of expressions. There are different regression functions available for use.
- ROW_NUMBER Returns the offset of a row in an ordered group. Can be used to sequentially number rows, ordered by certain criteria.
- STDDEV (expression) Computes the standard deviation of the current row with respect to the group.
- STDDEV_POP (expression) This function computes the population standard deviation and returns the square root of the population variance. Its return value is the same as the square root of the VAR_POP function.
- STDDEV_SAMP (expression) This function computes the cumulative sample standard deviation and returns the square root of the sample variance. This function returns the same value as the square root of the VAR_SAMP function would.
- SUM(expression) This function computes the cumulative sum of expression in a group.
- VAR_POP (expression) This function returns the population variance of a non-null set of numbers (nulls are ignored). VAR_POP function makes the following calculation for us: (SUM(expr*expr) - SUM(expr)*SUM(expr) / COUNT(expr)) / COUNT(expr)
- VAR_SAMP (expression) This function returns the sample variance of a non-null set of numbers (nulls in the set are ignored). This function makes the following calculation for us: (SUM(expr*expr) - SUM(expr)*SUM(expr) / COUNT(expr)) / (COUNT(expr) - 1)
- VARIANCE (expression) This function returns the variance of expression.
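The difference between RANK, DENSE_RANK, ROW_NUMBER and NTILE described in the list is easiest to see side by side. The following is a small sketch on invented data (the staff table and its rows are assumptions), again run through Python's sqlite3 module and assuming SQLite 3.25 or later. Two people share the top salary, so RANK skips a number while DENSE_RANK does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: ann and ben are tied on salary.
conn.execute("create table staff (name text, salary integer)")
conn.executemany(
    "insert into staff values (?, ?)",
    [("ann", 100), ("ben", 100), ("cat", 80), ("dan", 70)],
)

sql = """
    select name, salary,
           rank()       over (order by salary desc)       as rnk,
           dense_rank() over (order by salary desc)       as dense_rnk,
           -- name is added as a tie-breaker so the numbering is repeatable
           row_number() over (order by salary desc, name) as row_num,
           ntile(2)     over (order by salary desc, name) as bucket
    from staff
    order by salary desc, name
"""
ranked = conn.execute(sql).fetchall()
conn.close()

for row in ranked:
    print(row)
```

Here cat gets rank 3 but dense rank 2, and NTILE(2) places the first two rows in bucket 1 and the last two in bucket 2.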
The following functions are often used to improve performance: SUM, RANK, DENSE_RANK, LAG, LEAD, FIRST_VALUE, LAST_VALUE.
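As an illustration of how LAG, LEAD, FIRST_VALUE and LAST_VALUE read neighbouring rows without a self-join, here is a minimal sketch. The daily table and its values are made up, and Python's sqlite3 module (assuming SQLite 3.25 or later) is used only to make it runnable. LAST_VALUE is given an explicit window that spans the whole result, because with the default window it would only see rows up to the current one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table of one value per day.
conn.execute("create table daily (day_no integer, val integer)")
conn.executemany(
    "insert into daily values (?, ?)",
    [(1, 10), (2, 15), (3, 12)],
)

sql = """
    select day_no, val,
           -- previous row's value, or 0 when there is no previous row
           lag(val, 1, 0)   over (order by day_no) as prev_val,
           -- next row's value, or NULL when there is no next row
           lead(val)        over (order by day_no) as next_val,
           first_value(val) over (order by day_no) as first_val,
           last_value(val)  over (order by day_no
                                  rows between unbounded preceding
                                           and unbounded following) as last_val
    from daily
    order by day_no
"""
neighbours = conn.execute(sql).fetchall()
conn.close()

for row in neighbours:
    print(row)
```

A self-join version of the same query would need one extra join per neighbouring row, which is the kind of work these functions avoid.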
A few of these functions on SQLite are supported through Perl's SQLite::More module.