Mind the Gaps - Partitioned Outer Joins

There are many ways developers can write code to handle sparsely populated data and fill the gaps. One method that is sometimes overlooked is the Partitioned Outer Join. The name might suggest it only works on data physically partitioned in a table, but it actually relates to logically partitioning the data. This is different from partition-wise joins, which is another concept entirely.

Let's jump right into an example to explain things better.

WITH TEST_DATA AS
 (SELECT 'A' AS THING_ID,
         TO_DATE('01-JAN-2020', 'DD-MON-YYYY') AS START_DATE,
         TO_DATE('29-FEB-2020', 'DD-MON-YYYY') AS END_DATE,
         20 AS DATA_VALUE
  FROM   DUAL
  UNION ALL
  SELECT 'B' AS THING_ID,
         TO_DATE('01-APR-2020', 'DD-MON-YYYY') AS START_DATE,
         TO_DATE('31-MAY-2020', 'DD-MON-YYYY') AS END_DATE,
         30 AS DATA_VALUE
  FROM   DUAL
  UNION ALL
  SELECT 'C' AS THING_ID,
         TO_DATE('01-NOV-2020', 'DD-MON-YYYY') AS START_DATE,
         TO_DATE('31-DEC-2020', 'DD-MON-YYYY') AS END_DATE,
         40 AS DATA_VALUE
  FROM   DUAL
  UNION ALL
  SELECT 'D' AS THING_ID,
         TO_DATE('01-JAN-2020', 'DD-MON-YYYY') AS START_DATE,
         TO_DATE('31-DEC-2020', 'DD-MON-YYYY') AS END_DATE,
         50 AS DATA_VALUE
  FROM   DUAL)
SELECT *
FROM   TEST_DATA
ORDER  BY 1;

This is our sample set of data which has 4 widgets with a unique THING_ID and a date range that their DATA_VALUE is applicable for.

For 2020, A's values cover January and February, B's values cover April and May, C's values cover November and December, and D's values cover the entire year.

Our goal is to display all months of 2020 for all four THING_IDs. If a THING_ID is applicable for a month, show the DATA_VALUE; otherwise, leave it as NULL. However, we need all THING_IDs to appear for every month.

This need is exactly what Partitioned Outer Joins are for so now we will build on this test data to create our "calendar" that will represent the time span for the entire set.

....
MIN_MAX_DATES AS
 (SELECT MIN(START_DATE) AS MIN_START_DATE,
         MAX(END_DATE) AS MAX_END_DATE
  FROM   TEST_DATA)

We need to know the upper and lower bounds of the data, so I have added a Common Table Expression (CTE) to get the MIN and MAX dates. We can then use these dates to generate a calendar limited to these boundaries.

......
CALENDAR AS
 (SELECT ADD_MONTHS(MMD.MIN_START_DATE, ROWNUM - 1) AS DT
  FROM   MIN_MAX_DATES MMD
  CONNECT BY LEVEL <= ROUND(MONTHS_BETWEEN(MMD.MAX_END_DATE, MMD.MIN_START_DATE)))

So far, so good. Now, we want to ensure that all the THING_IDs are returned for all these months. If I only use a RIGHT OUTER JOIN approach, I don't get the desired results (previous CTEs have been hidden for brevity; the full SQL will be shown later).

WITH TEST_DATA AS
 (....),
MIN_MAX_DATES AS
 (....),
CALENDAR AS
 (....)
SELECT X.DT,
       D.THING_ID,
       D.DATA_VALUE
FROM   TEST_DATA D
RIGHT  OUTER JOIN (SELECT C.DT
                   FROM   CALENDAR C) X
ON     (X.DT BETWEEN D.START_DATE AND D.END_DATE)
ORDER  BY 1,
          2

You can see that A only appears for January and February, B for April and May, C for November and December, and D appears for all months. The trick now is to add the magic PARTITION BY syntax and let SQL handle the rest.

WITH TEST_DATA AS
 (....),
MIN_MAX_DATES AS
 (....),
CALENDAR AS
 (....)
SELECT X.DT,
       D.THING_ID,
       D.DATA_VALUE
FROM   TEST_DATA D PARTITION BY(D.THING_ID)
RIGHT  OUTER JOIN (SELECT C.DT
                   FROM   CALENDAR C) X
ON     (X.DT BETWEEN D.START_DATE AND D.END_DATE)
ORDER  BY 1,
          2

Notice in the FROM clause we have added PARTITION BY(D.THING_ID). This applies all the distinct THING_IDs to the RIGHT OUTER JOIN, and voilà, we get our desired dataset returned.

All months now display all THING_IDs and their DATA_VALUE where applicable. Here is the complete SQL for reference:

WITH TEST_DATA AS
 (SELECT 'A' AS THING_ID,
         TO_DATE('01-JAN-2020', 'DD-MON-YYYY') AS START_DATE,
         TO_DATE('29-FEB-2020', 'DD-MON-YYYY') AS END_DATE,
         20 AS DATA_VALUE
  FROM   DUAL
  UNION ALL
  SELECT 'B' AS THING_ID,
         TO_DATE('01-APR-2020', 'DD-MON-YYYY') AS START_DATE,
         TO_DATE('31-MAY-2020', 'DD-MON-YYYY') AS END_DATE,
         30 AS DATA_VALUE
  FROM   DUAL
  UNION ALL
  SELECT 'C' AS THING_ID,
         TO_DATE('01-NOV-2020', 'DD-MON-YYYY') AS START_DATE,
         TO_DATE('31-DEC-2020', 'DD-MON-YYYY') AS END_DATE,
         40 AS DATA_VALUE
  FROM   DUAL
  UNION ALL
  SELECT 'D' AS THING_ID,
         TO_DATE('01-JAN-2020', 'DD-MON-YYYY') AS START_DATE,
         TO_DATE('31-DEC-2020', 'DD-MON-YYYY') AS END_DATE,
         50 AS DATA_VALUE
  FROM   DUAL),
MIN_MAX_DATES AS
 (SELECT MIN(START_DATE) AS MIN_START_DATE,
         MAX(END_DATE) AS MAX_END_DATE
  FROM   TEST_DATA),
CALENDAR AS
 (SELECT ADD_MONTHS(MMD.MIN_START_DATE, ROWNUM - 1) AS DT
  FROM   MIN_MAX_DATES MMD
  CONNECT BY LEVEL <= ROUND(MONTHS_BETWEEN(MMD.MAX_END_DATE, MMD.MIN_START_DATE)))
SELECT X.DT,
       D.THING_ID,
       D.DATA_VALUE
FROM   TEST_DATA D PARTITION BY(D.THING_ID)
RIGHT  OUTER JOIN (SELECT C.DT
                   FROM   CALENDAR C) X
ON     (X.DT BETWEEN D.START_DATE AND D.END_DATE)
ORDER  BY 1,
          2

Thanks for taking the time to read this blog. Hopefully, it will now be part of your developer toolkit to solve similar problems.