Current Year, Previous Year, The Missing Dates Problem, and LAG
/When working with data that has date gaps, we need to have different ways of filling in the missing dates. This could be something like previous revenue where you want to see what a customer purchased last year but they don't have a sale in the current year. Or maybe you want to carry a value forward over a certain period where you are missing dates, like a rate value over a holiday. I want to go over several different scenarios and ways to solve these problems. We will look at issues with LAG when dealing with missing years, single missing year vs. multiple missing years, and the different methods we can use to fill in the gaps.
The first problem we will look at is the previous year issue. We want to see revenue totals and previous years revenue totals for customers even if they don't have a sale in the current year. For simplicity, our sample data only contains a single customer, but the solutions here have proved performant across larger data sets. We have the customer, sales year, trimester, and revenue. You will see that the customer has no sales in 2013 or 2017 so we can address the missing years. Sales are mixed across different trimesters in different years to provide all the different previous year scenarios.
DROP TABLE IF EXISTS #Sales; CREATE TABLE #Sales ( CustomerNumber int ,SaleYear int ,Trimester int ,Revenue decimal(7,2) ); INSERT INTO #Sales SELECT 333, 2012, 1, 25.00 UNION ALL SELECT 333, 2012, 1, 44.00 UNION ALL SELECT 333, 2012, 3, 18.00 UNION ALL SELECT 333, 2012, 3, 12.00 UNION ALL SELECT 333, 2014, 1, 18.00 UNION ALL SELECT 333, 2014, 1, 24.00 UNION ALL SELECT 333, 2014, 3, 15.00 UNION ALL SELECT 333, 2014, 3, 24.00 UNION ALL SELECT 333, 2015, 1, 56.00 UNION ALL SELECT 333, 2015, 2, 6.00 UNION ALL SELECT 333, 2015, 3, 31.00 UNION ALL SELECT 333, 2016, 1, 43.00 UNION ALL SELECT 333, 2016, 1, 11.00 UNION ALL SELECT 333, 2016, 2, 36.00 UNION ALL SELECT 333, 2016, 3, 31.00 UNION ALL SELECT 333, 2018, 1, 29.00 UNION ALL SELECT 333, 2018, 1, 33.00 UNION ALL SELECT 333, 2018, 2, 17.00 UNION ALL SELECT 333, 2018, 3, 16.00 ;
In our first example we look at how to get previous year, year over year, without addressing the missing year using the self-join method. Because our customer has multiple sales in the same year and trimester we summarize the data in a CTE before doing our self-join for previous year.
-- Get total revenue by Customer, SaleYear, and Trimester WITH CTE AS ( SELECT CustomerNumber ,SaleYear ,Trimester ,SUM(Revenue) AS Revenue FROM #Sales GROUP BY CustomerNumber, SaleYear, Trimester ) -- Then use self join to get revenue for previous year SELECT c.CustomerNumber ,c.SaleYear ,c.Trimester ,c.Revenue AS CurrentRevenue ,ISNULL(p.Revenue, 0) AS PreviousRevenue FROM CTE c LEFT JOIN CTE p ON c.CustomerNumber = p.CustomerNumber AND c.SaleYear = p.SaleYear + 1 AND c.Trimester = p.Trimester ORDER BY c.Trimester, c.SaleYear;
In the query above we are self-joining the CTE; if we didn't have to summarize the data we could exclude the CTE and just self join to the table (personal note, I find CTE's more readable for queries like this). The key here is in the join condition c.SaleYear = p.SaleYear + 1. How this works is, if c.SaleYear = 2016 and p.SaleYear = 2015, then if we add 1 to p.SaleYear that equals 2016 and so the join condition evaluates to true, which effectively gives us the previous year record if that record exists. This solution returns the correct value for previous year when that year is missing but does not return the actual missing year (we will get to that shortly). Next let's take a look at the LAG function.
Anytime we start talking about previous value it's common to think about the LAG function, and while LAG works great with contiguous date periods, it falls apart on data sets that are missing dates. Take for example:
WITH CTE AS ( SELECT CustomerNumber ,SaleYear ,Trimester ,SUM(Revenue) AS Revenue FROM #Sales GROUP BY CustomerNumber, SaleYear, Trimester ) SELECT CustomerNumber ,SaleYear ,Trimester ,Revenue ,LAG(Revenue, 1, 0) OVER (PARTITION BY CustomerNumber, Trimester ORDER BY Trimester, SaleYear) AS PreviousRevenue FROM CTE c ORDER BY c.Trimester, c.SaleYear
Where dates are contiguous it provides the correct result, but in the case of 2012 to 2014 it is showing 2012 as the previous revenue for 2014 and that is not what we are looking for. If you look at the previous revenue from 2014 to 2016 the previous year data is correct, but again it breaks down between 2016 and 2018. In order to use LAG in this scenario we would need to manufacture all possible years and trimesters for each customer. We will look at that in a minute but first lets look at a better way to fill in the missing years for the previous year problem.
In the following example we are again using a self-join, but this time a FULL OUTER JOIN to create records for missing years where there was a sale in the previous year but not the current year. Remember, we have no sales in 2013 but we still want to see previous year sales from 2012 on a line for 2013 regardless if there are no sales in 2013.
WITH CTE AS ( SELECT CustomerNumber ,SaleYear ,Trimester ,SUM(Revenue) AS Revenue FROM #Sales GROUP BY CustomerNumber, SaleYear, Trimester ) -- Then use self join to get revenue for previous year SELECT COALESCE(c.CustomerNumber, p.CustomerNumber) AS CustomerNumber ,COALESCE(c.SaleYear, p.SaleYear + 1) AS SaleYear ,COALESCE(c.Trimester, p.Trimester) AS Trimester ,ISNULL(c.Revenue, 0) AS CurrentRevenue ,ISNULL(p.Revenue, 0) AS PreviousRevenue FROM CTE c FULL OUTER JOIN CTE p ON c.CustomerNumber = p.CustomerNumber AND c.SaleYear = p.SaleYear + 1 AND c.Trimester = p.Trimester ORDER BY COALESCE(c.Trimester, p.Trimester), COALESCE(c.SaleYear, p.SaleYear);
If you look at the data you will see we now have records for the missing years or trimesters, and we have rows for both 2013 and 2017. This uses the same thinking as the first self-join we looked at with a few differences. First, we are using a FULL OUTER JOIN instead of the LEFT JOIN. The years are filled in in the SELECT using the COALESCE on SaleYear and you can see we are also using the SaleYear + 1 here, this will always give us the missing year. We use COALESCE for the customer and trimester to complete the data set. A simple way to get previous year and fill in missing years.
What if we wanted to do something like this using LAG? Again, we would need to manufacture every possible year and trimester for each customer so we could provide a contiguous date range to LAG on. There are a few different ways to do this, with a date table, a numbers tables, or recursive CTE; we are going to demonstrate this with recursive CTE's because it doesn't require the use of those other tables if you don't have them already, those other options will be more performant in most scenarios.
First we need to get all possible years starting from the first sale year, next we need to get all possible trimesters, then get our revenue totals, and finally, LAG on the previous year.
-- All possible SaleYears WITH rCTEyr AS ( SELECT DISTINCT CustomerNumber ,MIN(SaleYear) OVER (PARTITION BY CustomerNumber) AS SaleYear ,CAST(0.00 AS decimal(7,2)) AS Revenue FROM #Sales UNION ALL SELECT CustomerNumber ,SaleYear + 1 ,Revenue FROM rCTEyr WHERE SaleYear <= 2017 ), -- All possible Trimesters rCTEQtr AS ( SELECT CustomerNumber ,SaleYear ,1 AS Trimester ,Revenue FROM rCTEyr UNION ALL SELECT CustomerNumber ,SaleYear ,Trimester + 1 ,Revenue FROM rCTEQtr WHERE Trimester < 3), -- Get our revenue totals CTEfinal AS ( SELECT st.CustomerNumber ,st.SaleYear ,st.Trimester ,ISNULL(SUM(s.Revenue), 0) AS Revenue FROM rCTEQtr st LEFT JOIN #Sales s ON st.CustomerNumber = s.CustomerNumber AND s.SaleYear = st.SaleYear AND s.Trimester = st.Trimester GROUP BY st.CustomerNumber, st.SaleYear, st.Trimester ) SELECT DISTINCT CustomerNumber ,SaleYear ,Trimester ,Revenue ,LAG(Revenue, 1, 0) OVER (PARTITION BY CustomerNumber, Trimester ORDER BY Trimester, SaleYear) AS PreviousRevenue FROM CTEfinal ORDER BY Trimester, SaleYear
This creates a record for every possible combination of year and trimester which is useful, but maybe not so much for the previous year problem. You'll see we have the correct previous years revenue as well as the missing years now, but when you compare this with the simplicity of the self-join, the self-join is a faster, more efficient solution to this problem. This brings us to our second scenario, multiple missing dates or years.
Imagine you have auto finance rates that change day to day. Rates don't change on the holidays or weekends or special events. You have a report that shows the rate for every day in a given period. We created several gaps in our data set below, gaps which span multiple days. We need to carry our rate forward for each missing day to show the rate didn't change for that day. This is the missing dates issue, we need to roll the value forward each missing day until we hit the next valid record. We are going to use a different data set this time to illustrate this.
DROP TABLE IF EXISTS #Test; CREATE TABLE #Test ( RateDate date ,RateValue decimal(5,4) ) INSERT INTO #Test SELECT '12/29/16', 1.2266 UNION ALL SELECT '12/30/16', 1.2345 UNION ALL SELECT '01/03/17', 1.2240 UNION ALL SELECT '01/06/17', 1.1902;
You will see that in the table above we have values for the 29th, 30th, 3rd, and the 6th but we are missing the dates in between. We need to carry the last rate value forward for each of the missing days. Let's image we have a report and our report accepts 2 parameters, the start date and the end date. We could generate all the days and then join them to the actual records in the table to effectively create missing dates and roll forward rate values. This is another example where you could use a date table, numbers table, or recursive CTE, we are using the CTE for convenience.
DECLARE @StartDate date = '12/29/16' ,@EndDate date = '01/06/17'; WITH rCTE AS ( SELECT @StartDate AS RateDate UNION ALL SELECT DATEADD(DAY, 1, RateDate) FROM rCTE WHERE RateDate <= @EndDate ) SELECT r.RateDate ,z.RateValue FROM rCTE r CROSS APPLY ( SELECT TOP 1 RateValue FROM #Test t WHERE t.RateDate <= r.RateDate ORDER BY t.RateDate DESC ) AS z
The CROSS APPLY works great here and allows us to carry our rates forward for each day. The CROSS APPLY is the most performant of the different ways to solve this problem. I did want to include one more option because it's a clever way to solve this using ROWS UNBOUNDED, but might not be as performant as the CROSS APPLY.
DECLARE @StartDate date = '12/29/16' ,@EndDate date = '01/06/17'; -- Get missing dates WITH rCTE AS ( SELECT @StartDate AS RateDate UNION ALL SELECT DATEADD(DAY, 1, RateDate) FROM rCTE WHERE RateDate < @EndDate ), -- Join missing dates to result C1 AS ( SELECT c.RateDate ,RateValue FROM rCTE c LEFT JOIN #Test t ON c.RateDate = t.RateDate ), -- Create date groups for your ranges C2 AS ( SELECT RateDate ,RateValue ,MAX(CASE WHEN RateValue IS NOT NULL THEN RateDate END) OVER(ORDER BY RateDate ROWS UNBOUNDED PRECEDING) AS DateGroup FROM C1 ) SELECT RateDate ,MIN(RateValue) OVER(PARTITION BY DateGroup) AS grp2 FROM C2
In summary, there are many different ways to solve the missing dates problem. It's about knowing when to use what patterns for which problems and ensuring those solutions are performant. The queries here are a great start and give you some options as you begin to explore solutions around the missing dates problems.