SQL Window Functions Cheat Sheet
As software engineers, we're always on the lookout for the best tools and tricks to help us work more efficiently. SQL window functions provide data professionals with a number of really useful and powerful features. Window functions are a part of advanced modern SQL, and knowing them will definitely widen any SQL user's horizons. We've made this SQL Window Functions cheat sheet to help you with that.
Window functions can be employed in almost every part of a business and have a wide range of applications. They're typically used for data analysis and in-depth company intelligence. Thus, understanding SQL window functions is critical in the data world.
The SQL window functions cheat sheet below includes window function syntax, a list of window functions, examples, and much more.
BASIC - SQL Window Functions Cheat Sheet
Window Function Basics
A window function calculates over a set of table rows, and also uses the information within individual rows when required. This is in contrast to the type of calculation that an aggregate function can perform, which only returns a single value for multiple rows.
Unlike conventional aggregate functions, however, using a window function does not result in the rows being aggregated into a single output row; instead, the rows keep their individual identities. Under the hood, the window function can access more than just the current row of the query result.
Why use Window Functions?
The biggest advantage of window functions is that it enables users to work with both aggregate and non-aggregate values simultaneously. This is possible because the rows are not all collapsed together.
Moreover, because of the simplicity of window functions, they help reduce the complexity of your SQL queries, making them easier to maintain in the long run.
They can also help with performance issues by eliminating the need for doing self-join or cross-join which are computationally intensive.
Logical Order of Operations in SQL
Before we begin, it's crucial to remember that window functions rank sixth among SQL operations in terms of their priority.
This is significant because window functions are allowed in SELECT and ORDER BY clauses based on this logical order, but not in FROM, WHERE, GROUP BY, or HAVING clauses.
It's worth noting that if you truly require it in a WHERE or GROUP BY clause, you can use a subquery or a WITH query to get around this restriction.
Following is the logical order:
- FROM, JOIN
- GROUP BY
- aggregate functions
- window functions
- ORDER BY
The following snippet shows the generic syntax for a window function in the SELECT clause:
window_function ( [ ALL ] expression ) OVER ( [ PARTITION BY partition_list ] [ ORDER BY order_list] )
The window_function is used to specify the name of the window function the user may want to use; for example, sum, avg, or row_number.
ALL is an optional keyword that can be used. When you select ALL, all values, including duplicates, will be counted. In the window function, the DISTINCT function is not supported.
The column or expression on which the functions are applied. To put it another way, the name of the column for which an aggregated value is required. For instance, we may want to apply an expression over a column with the order amount so that we can see the total number of orders received.
The OVER keyword is used to specify the window clauses for aggregate functions.
PARTITION BY partition_list
Defines the window for window functions (the group of rows on which the window function runs). After the PARTITION BY clause, we must supply a field or list of fields for the partition. As is customary, a comma must be used to separate several fields. If PARTITION BY is not given, by default, the table will be grouped as a whole, and values will be aggregated.
ORDER BY order_list
It is used to sort the rows within each partition. If the ORDER BY function is not specified, by default it uses the entire table.
INTERMEDIATE - SQL Window Functions Cheat Sheet
A window frame is a group of rows that are related to the current row in some way. Within each partition, the window frame is evaluated independently. The below figure illustrates this.
ROWS | RANGE | GROUPS BETWEEN lower_bound AND upper_bound
The SQL bounds may belong to any of the five different categories:
- UNBOUNDED FOLLOWING
- UNBOUNDED PRECEDING
- n FOLLOWING
- n PRECEDING
- CURRENT ROW
Note: lower_bound should come BEFORE the upper_bound.
See details of these bounds in the table below:
|UNBOUNDED PRECEDING||Between UNBOUNDED PRECEDING and CURRENT ROW|
|n PRECEDING||Between n PRECEDING and CURRENT ROW|
|CURRENT ROW||Between CURRENT ROW and CURRENT ROW|
|n FOLLOWING||Between CURRENT ROW and n FOLLOWING|
|UNBOUNDED FOLLOWING||Between CURRENT ROW and UNBOUNDED FOLLOWING|
Types of Window Functions
There are four main types of window functions:
- Aggregate (operate on a set of values to return a single scalar value)
- Ranking (ranking rows in various ways)
- Analytic/Value (doing value comparison)
- Distribution (distribution analysis by returning a scalar between 0 and 1)
All of them are discussed at great length in our advanced cheat sheet.
ADVANCED - SQL Window Functions Cheat Sheet
These functions can be used to calculate various aggregations within each window or partition, such as average, total # of rows, maximum or lowest values, or total sum. We have performed some of these functions on the products database shown here to illustrate this.
The AVG function calculates the sum of all the values in the window frame, averages the value, and finally returns the average value.
Take a look at the following example query:
SELECT AVG(Price) from Products;
The AVG function sums up all the prices in the Price column of the Products table, then divides it by the total number of elements in the Table. This yields the average of the prices in the entire table i.e., 16.6.
The COUNT function counts the rows within the window frame and returns this count.
The SQL query below will return the number of products in the Products table.
SELECT COUNT(ProductID) FROM Products;
Since the ProductID for each product is distinct, counting all ProductID's will give us the total number of products in the Products table i.e. 5.
The MAX function returns the maximum value within the window frame.
Let's Take a look at an example SQL query for the MAX function:
SELECT MAX(Price) FROM Products;
Running the SQL command above yields the maximum price from the Products table i.e., 20. This can be helpful when doing comparisons in or between tables.
The MIN function returns the minimum value within the window frame.
Let's take a look at an example SQL query for the MIN function:
SELECT MIN(ProductID) FROM Products;
Similar to the MAX function, running the SQL command above yields the minimum price from the Products table i.e., 12.
The SUM function adds up each value in the window frame and then returns it.
Here is an example query of the SUM function:
SELECT SUM(Units) FROM Products;
The SQL query sums up all the elements in the Units column of the Products table. This yields a value of 58, denoting the total units of all products in the database.
Note: The Aggregate functions do not require the ORDER BY clause. The functions only accept window frame definitions like ROWS, RANGE, and GROUPS.
These functions are useful for determining the rank of rows within a partition. For the following functions, we will consider a table "Products" with the following attributes:
The function assigns a unique number for each row within the partition, with different numbers for tied values. Let us take a look at an example:
SELECT ROW_NUMBER() OVER ( ORDER BY Price ) ProductID, ProductName, Price FROM Products;
Running the above function will yield a table consisting of a row number assigned to each element in the table in ascending order of price. This can be changed to descending order by specifying it alongside the ORDER BY function. For example, we get
The function assigns a ranking within the individual groups. Moreover, the Rank() function skips duplicate ranks gaps and the same ranking for tied values. For example,
SELECT ProductID, ProductName, Price, RANK () OVER ( ORDER BY Price DESC ) price_rank FROM Products;
This will yield a table with ranks assigned based on the Price of each product.
Adding another product with a price, for example, 17 will cause it to have a rank of 4. Rank 3 is skipped due to a tie in rank 2.
Similar to the RANK function, the DENSE_RANK function also assigns a ranking within the partition, with no gaps and the same ranking for tied values. For example,
SELECT ProductID, ProductName, Price, DENSE_RANK () OVER ( ORDER BY Price DESC ) price_rank FROM Products;
This will yield a table with ranks assigned based on the Price of each product.
The only difference between the RANK() and DENSE_RANK() function is that DENSE_RANK() does not skip ranks. So adding another product with a price of 17 will cause it to have a rank of 3. It is not skipped.
Note: ORDER_BY is required for RANK() and DENSE_RANK(), but not for ROW_NUMBER(). Window frame definition is not accepted by ranking functions (ROWS, RANGE, GROUPS).
These functions are also referred to as value functions. They allow users to make a meaningful comparison between values from the partition's previous or next rows, or the partition's initial or last value. Moreover, these functions are excellent for pulling values from other rows that could be relevant to a report.
For example, the LAG and LEAD functions can be used to create a column that can pull values from other rows. LAG can return values from prior rows, while LEAD can only return data from subsequent rows. When working with time-series data and calculating differences over time, comparing prior or subsequent rows can be extremely useful.
It takes three parameters: expression, offset, and default.
- expression returns the value for the row offset rows after the current row;
- offset is an optional parameter and specifies the number of rows from the current row from where the query should access the data.
- default is another optional parameter. It is used to return a default value if the offset goes beyond the scope of the partition
Let us take a look at an example:
Consider this table, called "Sales" where each month has corresponding earnings:
Let us run the following query:
SELECT Month, NetSales, LEAD(net_sales,1) OVER ( ORDER BY Month ) next_month_sales FROM Sales;
The query will result in the following table:
As you can see the query returns a new table with another column named "next_month_sales" that have the sales for the next month. Note that the next_month_sales of Month 3 is NULL as the original table does not contain any values for the 4th month.
It takes three parameters: expression, offset, and default. It returns the value for the row offset rows before the current; offset and default are optional; default values: offset = 1, default = NULL. Using the example in the LEAD() function lets us run the following example query:
SELECT Month, NetSales, LAG(net_sales,1) OVER ( ORDER BY Month ) prev_month_sales FROM Sales;
Running the query gives us the following result:
The NTILE function takes a parameter, n, and divides rows within a partition as equally as possible into n groups, assigning a group number to each row. Let us take a look at an example.
Consider the following table named "NetSales" which shows the sales each month:
Now, let's run the following example query:
SELECT Month, Sales, NTILE(2) OVER( ORDER BY Sales DESC ) net_sales_group FROM NetSales;
The query groups the months based on the sales each month. This yield the following results:
The FIRST_VALUE as the name suggests returns the first value based on some defined heuristic. The function only takes an expression as a parameter and returns the value for the first row within the window frame. Let us look at an example table called "Employees" with the following attributes:
Let's create another table with the employee with the highest salary in a third column. For this, we run the following example query:
SELECT Name, Salary FIRST_VALUE(Name) OVER( ORDER BY Salary DESC ) HighestSalary FROM Employees;
This yields the following example table:
|John Doe||5000||Jane Doe|
|Jane Doe||9000||Jane Doe|
|Jannet Doe||2000||Jane Doe|
The LAST_VALUE function is quite similar to the FIRST_VALUE function and returns the last values based on some heuristic. Like the FIRST_VALUE, the function only takes an expression as a parameter and returns the first row's value within the window frame.
Let's take a look at the same table we used in the FIRST_VALUE function and run the following example query:
SELECT Name, Salary FIRST_VALUE(Name) OVER( ORDER BY Salary DESC ) LowestSalary FROM Employees;
This yields a similar result as the FIRST_VALUE function, however, with the employee with the lowest salary in the last column:
|John Doe||5000||Jannet Doe|
|Jane Doe||9000||Jannet Doe|
|Jannet Doe||2000||Jannet Doe|
The function takes two parameters: expression and n, and returns the value for the n-th row within the window frame. It is important to note that n must be an integer. For example, similar to the previous functions we can run the following example query to get the employee with the second highest salary:
SELECT Name, Salary NTH_VALUE(Name, 2) OVER( ORDER BY Salary DESC ) SecondHighest FROM Employees;
This yields the following table:
|John Doe||5000||John Doe|
|Jane Doe||9000||John Doe|
|Jannet Doe||2000||John Doe|
- The NTILE(), LEAD(), and LAG() functions need an ORDER BY clause. They do not accept other window frame definitions like ROWS, RANGE, GROUPS.
- FIRST_VALUE(), LAST_VALUE(), and NTH_VALUE() do not need an ORDER BY clause and can accept window frame definitions like ROWS, RANGE, GROUPS.
We'll use a table called "Staff" with the following properties to explain the distribution functions.
The Staff table contains the total sales obtained by each staff member based on year.
Returns the percentile ranking number of a row. This is a value in [0, 1] interval:
(rank-1) / (total number of rows - 1)
Using the table above we can look at an example SQL query:
SELECT FullName, NetSales, PERCENT_RANK() OVER ( ORDER BY NetSales DESC ) percent_rank FROM Staff WHERE YEAR = 2016;
The query above calculates the sales percentile of each sales staff in 2016. Here is an example output:
The CUME_DIST function calculates the cumulative distribution of a value within a set of values. This is a value in the [0, 1] range. Let us look at an example:
SELECT FullName, NetSales, CUME_DIST() OVER ( ORDER BY NetSales DESC ) cume_dist FROM Staff WHERE YEAR = 2016;
Running the SQL query above yields the cumulative distribution of each staff member based on their NetSale value. Here is an example output:
Note: ORDER BY is required for distribution functions. They refuse to accept the definition of a window frame. (ROWS, RANGE, GROUPS).
The window function can access more than just the current row of the query result behind the scenes. As a result, SQL window functions allow you to focus on a specific area of the result set rather than the complete set.
Window functions are analogous to the GROUP BY clause's aggregation. Rows are not merged into a single row; instead, each row preserves its own identity. A window function may, thus, return a single result for each row.
We hope that this SQL Window Functions cheat sheet was instructive and helpful in revising and understanding these functions, and will undoubtedly help you further your knowledge of SQL Windows Functions.