SQL Window Functions Cheat Sheet


As software engineers, we're always on the lookout for the best tools and tricks to help us work more efficiently. SQL window functions provide data professionals with a number of really useful and powerful features. Window functions are a part of advanced modern SQL, and knowing them will definitely widen any SQL user's horizons. We've made this SQL Window Functions cheat sheet to help you with that.

Window functions can be employed in almost every part of a business and have a wide range of applications. They're typically used for data analysis and in-depth company intelligence. Thus, understanding SQL window functions is critical in the data world.

The SQL window functions cheat sheet below includes window function syntax, a list of window functions, examples, and much more.




BASIC - SQL Window Functions Cheat Sheet



Window Function Basics

A window function calculates over a set of table rows, and also uses the information within individual rows when required. This is in contrast to the type of calculation that an aggregate function can perform, which only returns a single value for multiple rows.

Unlike conventional aggregate functions, however, using a window function does not result in the rows being aggregated into a single output row; instead, the rows keep their individual identities. Under the hood, the window function can access more than just the current row of the query result.


Why use Window Functions?

The biggest advantage of window functions is that it enables users to work with both aggregate and non-aggregate values simultaneously. This is possible because the rows are not all collapsed together.

Moreover, because of the simplicity of window functions, they help reduce the complexity of your SQL queries, making them easier to maintain in the long run.

They can also help with performance issues by eliminating the need for doing self-join or cross-join which are computationally intensive.


Logical Order of Operations in SQL

Before we begin, it's crucial to remember that window functions rank sixth among SQL operations in terms of their priority.

This is significant because window functions are allowed in SELECT and ORDER BY clauses based on this logical order, but not in FROM, WHERE, GROUP BY, or HAVING clauses.

It's worth noting that if you truly require it in a WHERE or GROUP BY clause, you can use a subquery or a WITH query to get around this restriction.

Following is the logical order:

  1. FROM, JOIN
  2. WHERE
  3. GROUP BY
  4. aggregate functions
  5. HAVING
  6. window functions
  7. SELECT
  8. DISTINCT
  9. UNION/INTERSECT/EXCEPT
  10. ORDER BY
  11. OFFSET
  12. LIMIT/FETCH/TOP

Syntax

The following snippet shows the generic syntax for a window function in the SELECT clause:

window_function ( [ ALL ] expression )
OVER ( [ PARTITION BY partition_list ] [ ORDER BY order_list] )

Arguments

Window_function

The window_function is used to specify the name of the window function the user may want to use; for example, sum, avg, or row_number.

ALL

ALL is an optional keyword that can be used. When you select ALL, all values, including duplicates, will be counted. In the window function, the DISTINCT function is not supported.

expression

The column or expression on which the functions are applied. To put it another way, the name of the column for which an aggregated value is required. For instance, we may want to apply an expression over a column with the order amount so that we can see the total number of orders received.

OVER

The OVER keyword is used to specify the window clauses for aggregate functions.

PARTITION BY partition_list

Defines the window for window functions (the group of rows on which the window function runs). After the PARTITION BY clause, we must supply a field or list of fields for the partition. As is customary, a comma must be used to separate several fields. If PARTITION BY is not given, by default, the table will be grouped as a whole, and values will be aggregated.

ORDER BY order_list

It is used to sort the rows within each partition. If the ORDER BY function is not specified, by default it uses the entire table.




INTERMEDIATE - SQL Window Functions Cheat Sheet



Window Frame

A window frame is a group of rows that are related to the current row in some way. Within each partition, the window frame is evaluated independently. The below figure illustrates this.

SQL Window Functions cheat sheet

Code:

ROWS | RANGE | GROUPS BETWEEN lower_bound AND upper_bound

Bounds

The SQL bounds may belong to any of the five different categories:

  • UNBOUNDED FOLLOWING
  • UNBOUNDED PRECEDING
  • n FOLLOWING
  • n PRECEDING
  • CURRENT ROW

Note: lower_bound should come BEFORE the upper_bound.

See details of these bounds in the table below:

Abbreviation Meaning
UNBOUNDED PRECEDING Between UNBOUNDED PRECEDING and CURRENT ROW
n PRECEDING Between n PRECEDING and CURRENT ROW
CURRENT ROW Between CURRENT ROW and CURRENT ROW
n FOLLOWING Between CURRENT ROW and n FOLLOWING
UNBOUNDED FOLLOWING Between CURRENT ROW and UNBOUNDED FOLLOWING


Types of Window Functions

There are four main types of window functions:

  1. Aggregate (operate on a set of values to return a single scalar value)
  2. Ranking (ranking rows in various ways)
  3. Analytic/Value (doing value comparison)
  4. Distribution (distribution analysis by returning a scalar between 0 and 1)

All of them are discussed at great length in our advanced cheat sheet.




ADVANCED - SQL Window Functions Cheat Sheet



Aggregate Functions

These functions can be used to calculate various aggregations within each window or partition, such as average, total # of rows, maximum or lowest values, or total sum. We have performed some of these functions on the products database shown here to illustrate this.

ProductID Product Name Units Price
1 Syrup 5 20
2 Shampoo 10 15
3 Soap 4 12
4 Hair gel 14 19
5 Cream 25 17

AVG()

The AVG function calculates the sum of all the values in the window frame, averages the value, and finally returns the average value.

Take a look at the following example query:

SELECT AVG(Price) from Products;

The AVG function sums up all the prices in the Price column of the Products table, then divides it by the total number of elements in the Table. This yields the average of the prices in the entire table i.e., 16.6.


COUNT()

The COUNT function counts the rows within the window frame and returns this count.

The SQL query below will return the number of products in the Products table.

SELECT COUNT(ProductID) FROM Products;

Since the ProductID for each product is distinct, counting all ProductID's will give us the total number of products in the Products table i.e. 5.


MAX()

The MAX function returns the maximum value within the window frame.

Let's Take a look at an example SQL query for the MAX function:

SELECT MAX(Price) FROM Products;

Running the SQL command above yields the maximum price from the Products table i.e., 20. This can be helpful when doing comparisons in or between tables.


MIN()

The MIN function returns the minimum value within the window frame.

Let's take a look at an example SQL query for the MIN function:


SELECT MIN(ProductID) FROM Products;

Similar to the MAX function, running the SQL command above yields the minimum price from the Products table i.e., 12.


SUM()

The SUM function adds up each value in the window frame and then returns it.

Here is an example query of the SUM function:

SELECT SUM(Units) FROM Products;

The SQL query sums up all the elements in the Units column of the Products table. This yields a value of 58, denoting the total units of all products in the database.

Note: The Aggregate functions do not require the ORDER BY clause. The functions only accept window frame definitions like ROWS, RANGE, and GROUPS.



Ranking Functions

These functions are useful for determining the rank of rows within a partition. For the following functions, we will consider a table "Products" with the following attributes:

Products

  1. ProductID
  2. ProductName
  3. Price
  4. ManufaturingDate

ROW_NUMBER()

The function assigns a unique number for each row within the partition, with different numbers for tied values. Let us take a look at an example:

SELECT 
   ROW_NUMBER() OVER (
	ORDER BY Price
   ) ProductID,
   ProductName, 
   Price
FROM 
   Products;

Running the above function will yield a table consisting of a row number assigned to each element in the table in ascending order of price. This can be changed to descending order by specifying it alongside the ORDER BY function. For example, we get

ROW_NUMBER ProductID ProductName Price
1 55 Soap 12
2 42 Shampoo 15
3 63 Cream 17

RANK()

The function assigns a ranking within the individual groups. Moreover, the Rank() function skips duplicate ranks gaps and the same ranking for tied values. For example,

SELECT
	ProductID,
	ProductName,
	Price,
	RANK () OVER ( 
		ORDER BY Price DESC
	) price_rank 
FROM
	Products;

This will yield a table with ranks assigned based on the Price of each product.

ProductID ProductName Price price_rank
55 Soap 12 1
42 Shampoo 15 2
63 Lipstick 15 2

Adding another product with a price, for example, 17 will cause it to have a rank of 4. Rank 3 is skipped due to a tie in rank 2.


DENSE_RANK()

Similar to the RANK function, the DENSE_RANK function also assigns a ranking within the partition, with no gaps and the same ranking for tied values. For example,

SELECT
	ProductID,
	ProductName,
	Price,
	DENSE_RANK () OVER ( 
		ORDER BY Price DESC
	) price_rank 
FROM
	Products;

This will yield a table with ranks assigned based on the Price of each product.

ProductID ProductName Price price_rank
55 Soap 12 1
42 Shampoo 15 2
63 Lipstick 15 2

The only difference between the RANK() and DENSE_RANK() function is that DENSE_RANK() does not skip ranks. So adding another product with a price of 17 will cause it to have a rank of 3. It is not skipped.

Note: ORDER_BY is required for RANK() and DENSE_RANK(), but not for ROW_NUMBER(). Window frame definition is not accepted by ranking functions (ROWS, RANGE, GROUPS).



Analytic Functions

These functions are also referred to as value functions. They allow users to make a meaningful comparison between values from the partition's previous or next rows, or the partition's initial or last value. Moreover, these functions are excellent for pulling values from other rows that could be relevant to a report.

For example, the LAG and LEAD functions can be used to create a column that can pull values from other rows. LAG can return values from prior rows, while LEAD can only return data from subsequent rows. When working with time-series data and calculating differences over time, comparing prior or subsequent rows can be extremely useful.


LEAD()

It takes three parameters: expression, offset, and default.

  1. expression returns the value for the row offset rows after the current row;
  2. offset is an optional parameter and specifies the number of rows from the current row from where the query should access the data.
  3. default is another optional parameter. It is used to return a default value if the offset goes beyond the scope of the partition

Let us take a look at an example:

Consider this table, called "Sales" where each month has corresponding earnings:

Month NetSales
1 5000
2 10000
3 7000

Let us run the following query:

SELECT 
	Month,
	NetSales,
	LEAD(net_sales,1) OVER (
		ORDER BY Month
	) next_month_sales
FROM 
	Sales;

The query will result in the following table:

Month NetSales next_month_sales
1 5000 10000
2 10000 7000
3 7000 NULL

As you can see the query returns a new table with another column named "next_month_sales" that have the sales for the next month. Note that the next_month_sales of Month 3 is NULL as the original table does not contain any values for the 4th month.


LAG()

It takes three parameters: expression, offset, and default. It returns the value for the row offset rows before the current; offset and default are optional; default values: offset = 1, default = NULL. Using the example in the LEAD() function lets us run the following example query:

SELECT 
	Month,
	NetSales,
	LAG(net_sales,1) OVER (
		ORDER BY Month
	) prev_month_sales
FROM 
	Sales;

Running the query gives us the following result:

Month NetSales prev_month_sales
1 5000 NULL
2 10000 5000
3 7000 10000

NTILE()

The NTILE function takes a parameter, n, and divides rows within a partition as equally as possible into n groups, assigning a group number to each row. Let us take a look at an example.

Consider the following table named "NetSales" which shows the sales each month:

Month Sales
January 500
February 1000
March 1200
April 400
May 100
June 2000

Now, let's run the following example query:

SELECT
	Month, 
	Sales,
	NTILE(2) OVER(
		ORDER BY Sales DESC
	) net_sales_group
FROM 
	NetSales;

The query groups the months based on the sales each month. This yield the following results:

Month Sales net_sales_group
June 2000 1
March 1200 1
February 1000 1
January 500 2
April 400 2
May 100 2

FIRST_VALUE()

The FIRST_VALUE as the name suggests returns the first value based on some defined heuristic. The function only takes an expression as a parameter and returns the value for the first row within the window frame. Let us look at an example table called "Employees" with the following attributes:

Employees

  1. EmpolyeeID
  2. Name
  3. Salary

Let's create another table with the employee with the highest salary in a third column. For this, we run the following example query:

SELECT 
    Name,
    Salary
    FIRST_VALUE(Name) OVER(
        ORDER BY Salary DESC
    ) HighestSalary
FROM 
    Employees;

This yields the following example table:

Name Salary HighestSalary
John Doe 5000 Jane Doe
Jane Doe 9000 Jane Doe
Jannet Doe 2000 Jane Doe

LAST_VALUE()

The LAST_VALUE function is quite similar to the FIRST_VALUE function and returns the last values based on some heuristic. Like the FIRST_VALUE, the function only takes an expression as a parameter and returns the first row's value within the window frame.

Let's take a look at the same table we used in the FIRST_VALUE function and run the following example query:

SELECT 
    Name,
    Salary
    FIRST_VALUE(Name) OVER(
        ORDER BY Salary DESC
    ) LowestSalary
FROM 
    Employees;

This yields a similar result as the FIRST_VALUE function, however, with the employee with the lowest salary in the last column:

Name Salary LowestSalary
John Doe 5000 Jannet Doe
Jane Doe 9000 Jannet Doe
Jannet Doe 2000 Jannet Doe

NTH_VALUE()

The function takes two parameters: expression and n, and returns the value for the n-th row within the window frame. It is important to note that n must be an integer. For example, similar to the previous functions we can run the following example query to get the employee with the second highest salary:

SELECT 
    Name,
    Salary
    NTH_VALUE(Name, 2) OVER(
        ORDER BY Salary DESC
    ) SecondHighest
FROM 
    Employees;

This yields the following table:

Name Salary SecondHighest
John Doe 5000 John Doe
Jane Doe 9000 John Doe
Jannet Doe 2000 John Doe

Note:

  • The NTILE(), LEAD(), and LAG() functions need an ORDER BY clause. They do not accept other window frame definitions like ROWS, RANGE, GROUPS.
  • FIRST_VALUE(), LAST_VALUE(), and NTH_VALUE() do not need an ORDER BY clause and can accept window frame definitions like ROWS, RANGE, GROUPS.


Distribution Functions

We'll use a table called "Staff" with the following properties to explain the distribution functions.

Staff

  1. StaffID
  2. FullName
  3. NetSales
  4. Year

The Staff table contains the total sales obtained by each staff member based on year.


PERCENT_RANK()

Returns the percentile ranking number of a row. This is a value in [0, 1] interval:

(rank-1) / (total number of rows - 1)

Using the table above we can look at an example SQL query:

SELECT 
    FullName,
    NetSales, 
    PERCENT_RANK() OVER (
        ORDER BY NetSales DESC
    ) percent_rank
FROM 
    Staff
WHERE 
    YEAR = 2016;

The query above calculates the sales percentile of each sales staff in 2016. Here is an example output:

FullName NetSales percent_rank
John Doe 100 0
Jane Doe 400 0.5
Jannet Doe 500 1

CUME_DIST()

The CUME_DIST function calculates the cumulative distribution of a value within a set of values. This is a value in the [0, 1] range. Let us look at an example:

SELECT 
    FullName,
    NetSales, 
    CUME_DIST() OVER (
        ORDER BY NetSales DESC
    ) cume_dist
FROM 
    Staff
WHERE 
    YEAR = 2016;

Running the SQL query above yields the cumulative distribution of each staff member based on their NetSale value. Here is an example output:

FullName NetSales percent_rank
John Doe 100 0.1
Jane Doe 400 0.5
Jannet Doe 500 1

Note: ORDER BY is required for distribution functions. They refuse to accept the definition of a window frame. (ROWS, RANGE, GROUPS).



Quick Recap

The window function can access more than just the current row of the query result behind the scenes. As a result, SQL window functions allow you to focus on a specific area of the result set rather than the complete set.

Window functions are analogous to the GROUP BY clause's aggregation. Rows are not merged into a single row; instead, each row preserves its own identity. A window function may, thus, return a single result for each row.

We hope that this SQL Window Functions cheat sheet was instructive and helpful in revising and understanding these functions, and will undoubtedly help you further your knowledge of SQL Windows Functions.