There are many ways developers can write code to handle sparsely populated data and fill the gaps. One method that is sometimes overlooked is the Partitioned Outer Join. The name might suggest it only works on data physically partitioned in a table, but it actually relates to logically partitioning the data. This is different from partition-wise joins, which is another concept entirely.
Let's jump right into an example to explain things better.
WITH TEST_DATA AS
(SELECT 'A' AS THING_ID,
TO_DATE('01-JAN-2020', 'DD-MON-YYYY') AS START_DATE,
TO_DATE('29-FEB-2020', 'DD-MON-YYYY') AS END_DATE,
20 AS DATA_VALUE
FROM DUAL
UNION ALL
SELECT 'B' AS THING_ID,
TO_DATE('01-APR-2020', 'DD-MON-YYYY') AS START_DATE,
TO_DATE('31-MAY-2020', 'DD-MON-YYYY') AS END_DATE,
30 AS DATA_VALUE
FROM DUAL
UNION ALL
SELECT 'C' AS THING_ID,
TO_DATE('01-NOV-2020', 'DD-MON-YYYY') AS START_DATE,
TO_DATE('31-DEC-2020', 'DD-MON-YYYY') AS END_DATE,
40 AS DATA_VALUE
FROM DUAL
UNION ALL
SELECT 'D' AS THING_ID,
TO_DATE('01-JAN-2020', 'DD-MON-YYYY') AS START_DATE,
TO_DATE('31-DEC-2020', 'DD-MON-YYYY') AS END_DATE,
50 AS DATA_VALUE
FROM DUAL)
SELECT *
FROM TEST_DATA
ORDER BY 1;
This is our sample set of data which has 4 widgets with a unique THING_ID and a date range that their DATA_VALUE is applicable for.
For 2020, A's values cover January and February, B's values cover April and May, C's values cover November and December, and D's values cover the entire year.
Our goal is to display all months of 2020 for all four THING_IDs. If a THING_ID is applicable for a month, show the DATA_VALUE; otherwise, leave it as NULL. However, we need all THING_IDs to appear for every month.
This need is exactly what Partitioned Outer Joins are for so now we will build on this test data to create our "calendar" that will represent the time span for the entire set.
....
MIN_MAX_DATES AS
(SELECT MIN(START_DATE) AS MIN_START_DATE,
MAX(END_DATE) AS MAX_END_DATE
FROM TEST_DATA)
We need to know the upper and lower bounds of the data, so I have added a Common Table Expression (CTE) to get the MIN and MAX dates. We can then use these dates to generate a calendar limited to these boundaries.
......
CALENDAR AS
(SELECT ADD_MONTHS(MMD.MIN_START_DATE, ROWNUM - 1) AS DT
FROM MIN_MAX_DATES MMD
CONNECT BY LEVEL <= ROUND(MONTHS_BETWEEN(MMD.MAX_END_DATE, MMD.MIN_START_DATE)))
So far, so good. Now, we want to ensure that all the THING_IDs are returned for all these months. If I only use a RIGHT OUTER JOIN approach, I don't get the desired results (previous CTEs have been hidden for brevity; the full SQL will be shown later).
WITH TEST_DATA AS
(....),
MIN_MAX_DATES AS
(....),
CALENDAR AS
(....)
SELECT X.DT,
D.THING_ID,
D.DATA_VALUE
FROM TEST_DATA D
RIGHT OUTER JOIN (SELECT C.DT
FROM CALENDAR C) X
ON (X.DT BETWEEN D.START_DATE AND D.END_DATE)
ORDER BY 1,
2
You can see that A only appears for January and February, B for April and May, C for November and December, and D appears for all months. The trick now is to add the magic PARTITION BY syntax and let SQL handle the rest.
WITH TEST_DATA AS
(....),
MIN_MAX_DATES AS
(....),
CALENDAR AS
(....)
SELECT X.DT,
D.THING_ID,
D.DATA_VALUE
FROM TEST_DATA D PARTITION BY(D.THING_ID)
RIGHT OUTER JOIN (SELECT C.DT
FROM CALENDAR C) X
ON (X.DT BETWEEN D.START_DATE AND D.END_DATE)
ORDER BY 1,
2
Notice in the FROM clause we have added PARTITION BY(D.THING_ID). This applies all the distinct THING_IDs to the RIGHT OUTER JOIN, and voilà, we get our desired dataset returned.
All months now display all THING_IDs and their DATA_VALUE where applicable. Here is the complete SQL for reference:
WITH TEST_DATA AS
(SELECT 'A' AS THING_ID,
TO_DATE('01-JAN-2020', 'DD-MON-YYYY') AS START_DATE,
TO_DATE('29-FEB-2020', 'DD-MON-YYYY') AS END_DATE,
20 AS DATA_VALUE
FROM DUAL
UNION ALL
SELECT 'B' AS THING_ID,
TO_DATE('01-APR-2020', 'DD-MON-YYYY') AS START_DATE,
TO_DATE('31-MAY-2020', 'DD-MON-YYYY') AS END_DATE,
30 AS DATA_VALUE
FROM DUAL
UNION ALL
SELECT 'C' AS THING_ID,
TO_DATE('01-NOV-2020', 'DD-MON-YYYY') AS START_DATE,
TO_DATE('31-DEC-2020', 'DD-MON-YYYY') AS END_DATE,
40 AS DATA_VALUE
FROM DUAL
UNION ALL
SELECT 'D' AS THING_ID,
TO_DATE('01-JAN-2020', 'DD-MON-YYYY') AS START_DATE,
TO_DATE('31-DEC-2020', 'DD-MON-YYYY') AS END_DATE,
50 AS DATA_VALUE
FROM DUAL),
MIN_MAX_DATES AS
(SELECT MIN(START_DATE) AS MIN_START_DATE,
MAX(END_DATE) AS MAX_END_DATE
FROM TEST_DATA),
CALENDAR AS
(SELECT ADD_MONTHS(MMD.MIN_START_DATE, ROWNUM - 1) AS DT
FROM MIN_MAX_DATES MMD
CONNECT BY LEVEL <= ROUND(MONTHS_BETWEEN(MMD.MAX_END_DATE, MMD.MIN_START_DATE)))
SELECT X.DT,
D.THING_ID,
D.DATA_VALUE
FROM TEST_DATA D PARTITION BY(D.THING_ID)
RIGHT OUTER JOIN (SELECT C.DT
FROM CALENDAR C) X
ON (X.DT BETWEEN D.START_DATE AND D.END_DATE)
ORDER BY 1,
2
Thanks for taking the time to read this blog. Hopefully, it will now be part of your developer toolkit to solve similar problems.