Handling EAV-data: Timelines
As promised in my previous post about working with EAV-data this post would about handling multiple timelines (or periods for that matter).
While this blog is specifically written for an EAV-environment, the technique explained in this post can be useful for other situations as well.
The problem I’m discussing in this article is about working with multiple, overlapping periods of time and the way they affect the outcome of PIVOT statements.
Because there is so little to find about this subject, I decided to make a write-up of the solution me and my colleagues came up with.
The Problem
Before diving into the technique I would like to position the problem, so first of all I’d like start off with a simple example of what’s going wrong.
Take a look at the following tables:
Table: Entity
EntityID | Entity |
1 | Employee |
Table: Attribute
AttributeID | Attribute |
100 | Name |
101 | Birthday |
102 | Nationality |
103 | Marital Status |
Table: EAV_Data
ID | EntityID | AttributeID | Value | ValidFrom | ValidTo |
1234 | 1 | 100 | John | 01-01-2000 | 31-12-2999 |
1234 | 1 | 101 | 12-12-1970 | 01-01-2000 | 31-12-2999 |
1234 | 1 | 102 | English | 01-01-2000 | 02-02-2009 |
1234 | 1 | 102 | Dutch | 02-02-2009 | 31-12-2999 |
1234 | 1 | 103 | Single | 01-01-2000 | 03-03-2010 |
1234 | 1 | 103 | Married | 03-03-2010 | 31-12-2999 |
They have a striking similarity with the tables I’ve used in my previous post, but this time the EAV_Data table is a bit more complex due to the added ValidFrom and ValidTo columns.
Because of these two columns, it is now possible to have multiple values  for different periods of time.
Meet John.
John is an Englishman who chose the green wetlands of Holland to make his new home. He liked it so much that he even acquired the Dutch nationality at some point.
His actions did not go unnoticed and John conquered the heart of the girl of his dreams, who he married a year later.
See underneath for a graphic representation of the situation:
Now let’s say we ignore the time-axe and just start pivoting, just like we did last time:
1: SELECT * FROM
2: (
3: SELECT ID
4: , ValidFrom
5: , ValidTo
6: , [100] AS Name
7: , [101] AS Birthday
8: , [102] AS Nationality
9: , [103] AS Marital_Status
10: FROM
11: ( SELECT ID, EntityID, AttributeID, Value, ValidFrom, ValidTo
12: FROM EAV_Data
13: ) p
14: PIVOT
15: (
16: MAX (Value)
17: FOR AttributeID IN ([100], [101], [102], [103])
18: ) AS pvt
19: ) as result
This query yields the following result:
ID | ValidFrom | ValidTo | Name | Birthday | Nationality | Marital_Status |
1234 | 2000-01-01 | 2009-02-02 | NULL | NULL | English | NULL |
1234 | 2000-01-01 | 2010-03-03 | NULL | NULL | NULL | Single |
1234 | 2000-01-01 | 2999-12-31 | John | 12-12-1970 | NULL | NULL |
1234 | 2009-02-02 | 2999-12-31 | NULL | NULL | Dutch | NULL |
1234 | 2010-03-03 | 2999-12-31 | NULL | NULL | NULL | Married |
This result introduces a few new problems.
First of all, the periods overlap each other.  This is a bad thing, because now I don’t have one version the truth.
Second of all, most columns are blank (NULL). This means that in order to get a correct selection we have to combine the right periods AND merge the right values to get a complete, correct visualization of the record on a specific moment.
But what is causing this behavior in the first place? Because the PIVOT operator now ‘rotates’ the data around the ID, ValidFrom and ValidTo columns, data is only presented in a row if it matches the exact same period. This is incorrect behavior, because the name (John) is valid during the entire period and therefore should be present in every row.
What we need, is what I’d like to call an Effective Timeline.
The Solution
An effective timeline is a timeline that consist out of multiple consecutive periods, rather than multiple parallel/overlapping periods.
This way, a specific moment belongs to one period and one period only.
If we merge all the periods present in our example and remove the duplicates we get something like this:
See how this graph only has one line for every period?
But how do you achieve this programmatically?
In my project we came up with the following solution;
- We simply gather all unique start-and enddates of every period in one big collection.
- Next, we sort this collection from oldest to newest, ignoring the fact if the current record is a start- or enddate.
- Now we have a temporary collection with all possible dates, conviently ordered by date.
- Finally we join the collection with itself (one record shifted), combine all dates and effectively create a whole new set of consecutive periods.
See the sample UDF below for clarity:
1: CREATE FUNCTION [dbo].[fn_EffectiveTimelines]
2: (
3: )
4: RETURNS @l_ResultTable TABLE
5: (
6: 
7: ID INT
8: , ValidFrom DATETIME
9: , ValidTo DATETIME
10: )
11: AS
12: BEGIN
13: 
14: WITH EffectiveTimelines AS
15: (
16: SELECT ID, Date, Type, ROW_NUMBER() OVER (ORDER BY ID, Date) as Number
17: FROM
18: (
19: SELECT DISTINCT ID, Date, Type
20: FROM
21: (
22: SELECT ID
23: , ValidFrom AS Date
24: , 1 as Type
25: FROM EAV_Data
26: UNION
27: SELECT ID
28: , ValidTo AS Date
29: , 0 as Type
30: FROM EAV_Data
31: ) as TimeSet
32: ) as TimeSetOrdered
33: )
34: 
35: INSERT INTO @l_ResultTable
36: SELECT
37: t1.ID,
38: CASE
39: WHEN t1.Type = 0 THEN DATEADD(dd, 1, t1.Date)
40: ELSE t1.Date
41: END
42: ,
43: CASE
44: WHEN t2.Type = 1 THEN DATEADD(dd, -1, t2.Date)
45: ELSE t2.Date
46: END
47: FROM EffectiveTimelines t1
48: INNER JOIN EffectiveTimelines t2
49: ON t1.Number + 1 = t2.Number
50: AND t1.ID = t2.ID
51: AND DATEADD(dd, 1, t1.Date) < t2.Date
52: 
53: RETURN;
54: END
Running this function will give the following result:
ID | ValidFrom | ValidTo |
1234 | 2000-01-01 | 2009-02-02 |
1234 | 2009-02-02 | 2010-03-03 |
1234 | 2010-03-03 | 2999-12-31 |
Now that we have an unigue, effective timeline, we can now join the rest of the data to get a complete overview. See the following example:
1: SELECT * FROM
2: (
3: SELECT ID
4: , ValidFrom
5: , ValidTo
6: , [100] AS Name
7: , [101] AS Birthday
8: , [102] AS Nationality
9: , [103] AS Marital_Status
10: FROM
11: ( SELECT eav.ID, eav.EntityID, eav.AttributeID, eav.Value, t.ValidFrom, t.ValidTo
12: FROM EAV_Data eav
13: INNER JOIN fn_EffectiveTimelines() t
14: ON eav.ID = t.ID
15: AND eav.ValidFrom < t.ValidTo
16: AND eav.ValidTo > t.ValidFrom
17: ) p
18: PIVOT
19: (
20: MAX (Value)
21: FOR AttributeID IN ([100], [101], [102], [103])
22: ) AS pvt
23: ) as result
In this step the data is assigned to every period is has overlap with. Because some values are longer valid than others, it can happen that one value is assigned to multiple periods, as shown in the graph below:
Because the periods are now neatly aligned, the PIVOT statement will come up with the following result:
ID | ValidFrom | ValidTo | Name | Birthday | Nationality | Marital_Status |
1234 | 2000-01-01 | 2009-02-02 | John | 12-12-1970 | English | Single |
1234 | 2009-02-02 | 2010-03-03 | John | 12-12-1970 | Dutch | Single |
1234 | 2010-03-03 | 2999-12-31 | John | 12-12-1970 | Dutch | Married |
This results is a more accurate representation of the data. The NULL fields are gone and the data is now correctly divided amongst the periods.
With this technique you can pivot data with multiple timelines with little effort, it doesn’t matter if you have 3 periods or 3000: this technique will always calculate the effective timeline (albeit a bit slower ;)).
Conclusion
By making use of effective timelines working with EAV is a bit less complex.
And like I said in the beginning of this blog, calculating an effective timeline by merging periods can be useful in other situations as well. For example, if you have multiple periods that have overlap and you need to count the unique days over all periods, having a effective timeline can really speed up the process.