Handling EAV-data: Timelines

As promised in my previous post about working with EAV-data this post would about handling multiple timelines (or periods for that matter).
While this blog is specifically written for an EAV-environment, the technique explained in this post can be useful for other situations as well.

The problem I’m discussing in this article is about working with multiple, overlapping periods of time and the way they affect the outcome of PIVOT statements.
Because there is so little to find about this subject, I decided to make a write-up of the solution me and my colleagues came up with.

The Problem

Before diving into the technique I would like to position the problem, so first of all I’d like start off with a simple example of what’s going wrong.
Take a look at the following tables:

Table: Entity

EntityID Entity
1 Employee

Table: Attribute

AttributeID Attribute
100 Name
101 Birthday
102 Nationality
103 Marital Status

Table: EAV_Data

ID EntityID AttributeID Value ValidFrom ValidTo
1234 1 100 John 01-01-2000 31-12-2999
1234 1 101 12-12-1970 01-01-2000 31-12-2999
1234 1 102 English 01-01-2000 02-02-2009
1234 1 102 Dutch 02-02-2009 31-12-2999
1234 1 103 Single 01-01-2000 03-03-2010
1234 1 103 Married 03-03-2010 31-12-2999

They have a striking similarity with the tables I’ve used in my previous post, but this time the EAV_Data table is a bit more complex due to the added ValidFrom and ValidTo columns.
Because of these two columns, it is now possible to have multiple values  for different periods of time.

Meet John.
John is an Englishman who chose the green wetlands of Holland to make his new home. He liked it so much that he even acquired the Dutch nationality at some point.
His actions did not go unnoticed and John conquered the heart of the girl of his dreams, who he married a year later.

See underneath for a graphic representation of the situation:

image

Now let’s say we ignore the time-axe and just start pivoting, just like we did last time:

   1: SELECT * FROM   

   2: (  

   3:     SELECT ID  

   4:     , ValidFrom

   5:     , ValidTo 

   6:     , [100] AS Name   

   7:     , [101] AS Birthday   

   8:     , [102] AS Nationality   

   9:     , [103] AS Marital_Status   

  10:     FROM     

  11:     (       SELECT ID, EntityID, AttributeID, Value, ValidFrom, ValidTo 

  12:             FROM EAV_Data     

  13:     ) p  

  14:     PIVOT     

  15:     (     

  16:         MAX (Value)  

  17:         FOR AttributeID IN ([100], [101], [102], [103])     

  18:     ) AS pvt  

  19: ) as result

This query yields the following result:

ID ValidFrom ValidTo Name Birthday Nationality Marital_Status
1234 2000-01-01 2009-02-02 NULL NULL English NULL
1234 2000-01-01 2010-03-03 NULL NULL NULL Single
1234 2000-01-01 2999-12-31 John 12-12-1970 NULL NULL
1234 2009-02-02 2999-12-31 NULL NULL Dutch NULL
1234 2010-03-03 2999-12-31 NULL NULL NULL Married


This result introduces a few new problems.

First of all, the periods overlap each other.  This is a bad thing, because now I don’t have one version the truth.

Second of all, most columns are blank (NULL). This means that in order to get a correct selection we have to combine the right periods AND merge the right values to get a complete, correct visualization of the record on a specific moment.

But what is causing this behavior in the first place? Because the PIVOT operator now ‘rotates’ the data around the ID, ValidFrom and ValidTo columns, data is only presented in a row if it matches the exact same period. This is incorrect behavior, because the name (John) is valid during the entire period and therefore should be present in every row.

What we need, is what I’d like to call an Effective Timeline.

The Solution

An effective timeline is a timeline that consist out of multiple consecutive periods, rather than multiple parallel/overlapping periods.

This way, a specific moment belongs to one period and one period only.

If we merge all the periods present in our example and remove the duplicates we get something like this:

image

See how this graph only has one line for every period?

But how do you achieve this programmatically?

In my project we came up with the following solution;

  • We simply gather all unique start-and enddates of every period in one big collection.
  • Next, we sort this collection from oldest to newest, ignoring the fact if the current record is a start- or enddate.
  • Now we have a temporary collection with all possible dates, conviently ordered by date.
  • Finally we join the collection with itself (one record shifted), combine all dates and effectively create a whole new set of consecutive periods.

See the sample UDF below for clarity:

   1: CREATE FUNCTION [dbo].[fn_EffectiveTimelines]

   2: (

   3: )

   4: RETURNS @l_ResultTable TABLE

   5: (

   6:  

   7:          ID INT

   8:         , ValidFrom DATETIME

   9:         , ValidTo DATETIME

  10: )

  11: AS

  12: BEGIN    

  13:  

  14:     WITH EffectiveTimelines AS

  15:     (

  16:         SELECT ID, Date, Type, ROW_NUMBER() OVER (ORDER BY  ID, Date) as Number 

  17:         FROM 

  18:         (

  19:             SELECT DISTINCT ID, Date, Type

  20:             FROM

  21:             (

  22:                 SELECT ID

  23:                     , ValidFrom AS Date

  24:                     , 1 as Type

  25:                 FROM EAV_Data

  26:                 UNION

  27:                 SELECT ID

  28:                     , ValidTo AS Date

  29:                     , 0 as Type

  30:                 FROM EAV_Data

  31:             ) as TimeSet

  32:         ) as TimeSetOrdered

  33:     )

  34:  

  35:     INSERT INTO @l_ResultTable

  36:     SELECT

  37:     t1.ID,

  38:     CASE

  39:         WHEN t1.Type = 0 THEN DATEADD(dd, 1, t1.Date)

  40:         ELSE t1.Date 

  41:     END

  42:     ,

  43:     CASE

  44:         WHEN t2.Type = 1 THEN DATEADD(dd, -1, t2.Date)

  45:         ELSE t2.Date 

  46:     END

  47:     FROM EffectiveTimelines t1

  48:     INNER JOIN EffectiveTimelines t2

  49:     ON t1.Number + 1 = t2.Number

  50:     AND t1.ID = t2.ID

  51:     AND DATEADD(dd, 1, t1.Date) < t2.Date

  52:&#160; 

  53:     RETURN;

  54: END


Running this function will give the following result:

ID ValidFrom ValidTo
1234 2000-01-01 2009-02-02
1234 2009-02-02 2010-03-03
1234 2010-03-03 2999-12-31


Now that we have an unigue, effective timeline, we can now join the rest of the data to get a complete overview. See the following example:

   1: SELECT * FROM   

   2: (  

   3:     SELECT ID  

   4:     , ValidFrom

   5:     , ValidTo 

   6:     , [100] AS Name   

   7:     , [101] AS Birthday   

   8:     , [102] AS Nationality   

   9:     , [103] AS Marital_Status   

  10:     FROM     

  11:     (       SELECT eav.ID, eav.EntityID, eav.AttributeID, eav.Value, t.ValidFrom, t.ValidTo 

  12:             FROM EAV_Data eav

  13:             INNER JOIN fn_EffectiveTimelines() t

  14:             ON eav.ID = t.ID

  15:             AND eav.ValidFrom < t.ValidTo

  16:             AND eav.ValidTo > t.ValidFrom  

  17:     ) p  

  18:     PIVOT     

  19:     (     

  20:         MAX (Value)  

  21:         FOR AttributeID IN ([100], [101], [102], [103])     

  22:     ) AS pvt  

  23: ) as result


In this step the data is assigned to every period is has overlap with. Because some values are longer valid than others, it can happen that one value is assigned to multiple periods, as shown in the graph below:

image&#160;

Because the periods are now neatly aligned, the PIVOT statement will come up with the following result:

ID ValidFrom ValidTo Name Birthday Nationality Marital_Status
1234 2000-01-01 2009-02-02 John 12-12-1970 English Single
1234 2009-02-02 2010-03-03 John 12-12-1970 Dutch Single
1234 2010-03-03 2999-12-31 John 12-12-1970 Dutch Married

This results is a more accurate representation of the data. The NULL fields are gone and the data is now correctly divided amongst the periods.

With this technique you can pivot data with multiple timelines with little effort, it doesn’t matter if you have 3 periods or 3000: this technique will always calculate the effective timeline (albeit a bit slower ;)).

Conclusion

By making use of effective timelines working with EAV is a bit less complex.

And like I said in the beginning of this blog, calculating an effective timeline by merging periods can be useful in other situations as well. For example, if you have multiple periods that have overlap and you need to count the unique days over all periods, having a effective timeline can really speed up the process.