Handling EAV-data: Timelines

As promised in my previous post about working with EAV-data this post would about handling multiple timelines (or periods for that matter).
While this blog is specifically written for an EAV-environment, the technique explained in this post can be useful for other situations as well.

The problem I’m discussing in this article is about working with multiple, overlapping periods of time and the way they affect the outcome of PIVOT statements.
Because there is so little to find about this subject, I decided to make a write-up of the solution me and my colleagues came up with.

The Problem

Before diving into the technique I would like to position the problem, so first of all I’d like start off with a simple example of what’s going wrong.
Take a look at the following tables:

Table: Entity

EntityID	Entity
1	Employee

Table: Attribute

AttributeID	Attribute
100	Name
101	Birthday
102	Nationality
103	Marital Status

Table: EAV_Data

ID	EntityID	AttributeID	Value	ValidFrom	ValidTo
1234	1	100	John	01-01-2000	31-12-2999
1234	1	101	12-12-1970	01-01-2000	31-12-2999
1234	1	102	English	01-01-2000	02-02-2009
1234	1	102	Dutch	02-02-2009	31-12-2999
1234	1	103	Single	01-01-2000	03-03-2010
1234	1	103	Married	03-03-2010	31-12-2999

They have a striking similarity with the tables I’ve used in my previous post, but this time the EAV_Data table is a bit more complex due to the added ValidFrom and ValidTo columns.
Because of these two columns, it is now possible to have multiple values  for different periods of time.

Meet John.
John is an Englishman who chose the green wetlands of Holland to make his new home. He liked it so much that he even acquired the Dutch nationality at some point.
His actions did not go unnoticed and John conquered the heart of the girl of his dreams, who he married a year later.

See underneath for a graphic representation of the situation:

Now let’s say we ignore the time-axe and just start pivoting, just like we did last time:

   1: SELECT * FROM

   2: (

   3:     SELECT ID

   4:     , ValidFrom

   5:     , ValidTo

   6:     , [100] AS Name

   7:     , [101] AS Birthday

   8:     , [102] AS Nationality

   9:     , [103] AS Marital_Status

  10:     FROM

  11:     (       SELECT ID, EntityID, AttributeID, Value, ValidFrom, ValidTo

  12:             FROM EAV_Data

  13:     ) p

  14:     PIVOT

  15:     (

  16:         MAX (Value)

  17:         FOR AttributeID IN ([100], [101], [102], [103])

  18:     ) AS pvt

  19: ) as result

This query yields the following result:

ID	ValidFrom	ValidTo	Name	Birthday	Nationality	Marital_Status
1234	2000-01-01	2009-02-02	NULL	NULL	English	NULL
1234	2000-01-01	2010-03-03	NULL	NULL	NULL	Single
1234	2000-01-01	2999-12-31	John	12-12-1970	NULL	NULL
1234	2009-02-02	2999-12-31	NULL	NULL	Dutch	NULL
1234	2010-03-03	2999-12-31	NULL	NULL	NULL	Married

This result introduces a few new problems.

First of all, the periods overlap each other.  This is a bad thing, because now I don’t have one version the truth.

Second of all, most columns are blank (NULL). This means that in order to get a correct selection we have to combine the right periods AND merge the right values to get a complete, correct visualization of the record on a specific moment.

But what is causing this behavior in the first place? Because the PIVOT operator now ‘rotates’ the data around the ID, ValidFrom and ValidTo columns, data is only presented in a row if it matches the exact same period. This is incorrect behavior, because the name (John) is valid during the entire period and therefore should be present in every row.

What we need, is what I’d like to call an Effective Timeline.

The Solution

An effective timeline is a timeline that consist out of multiple consecutive periods, rather than multiple parallel/overlapping periods.

This way, a specific moment belongs to one period and one period only.

If we merge all the periods present in our example and remove the duplicates we get something like this:

See how this graph only has one line for every period?

But how do you achieve this programmatically?

In my project we came up with the following solution;

We simply gather all unique start-and enddates of every period in one big collection.
Next, we sort this collection from oldest to newest, ignoring the fact if the current record is a start- or enddate.
Now we have a temporary collection with all possible dates, conviently ordered by date.
Finally we join the collection with itself (one record shifted), combine all dates and effectively create a whole new set of consecutive periods.

See the sample UDF below for clarity:

   1: CREATE FUNCTION [dbo].[fn_EffectiveTimelines]

   2: (

   3: )

   4: RETURNS @l_ResultTable TABLE

   5: (

   6:&#160;

   7:          ID INT

   8:         , ValidFrom DATETIME

   9:         , ValidTo DATETIME

  10: )

  11: AS

  12: BEGIN

  13:&#160;

  14:     WITH EffectiveTimelines AS

  15:     (

  16:         SELECT ID, Date, Type, ROW_NUMBER() OVER (ORDER BY  ID, Date) as Number

  17:         FROM

  18:         (

  19:             SELECT DISTINCT ID, Date, Type

  20:             FROM

  21:             (

  22:                 SELECT ID

  23:                     , ValidFrom AS Date

  24:                     , 1 as Type

  25:                 FROM EAV_Data

  26:                 UNION

  27:                 SELECT ID

  28:                     , ValidTo AS Date

  29:                     , 0 as Type

  30:                 FROM EAV_Data

  31:             ) as TimeSet

  32:         ) as TimeSetOrdered

  33:     )

  34:&#160;

  35:     INSERT INTO @l_ResultTable

  36:     SELECT

  37:     t1.ID,

  38:     CASE

  39:         WHEN t1.Type = 0 THEN DATEADD(dd, 1, t1.Date)

  40:         ELSE t1.Date

  41:     END

  42:     ,

  43:     CASE

  44:         WHEN t2.Type = 1 THEN DATEADD(dd, -1, t2.Date)

  45:         ELSE t2.Date

  46:     END

  47:     FROM EffectiveTimelines t1

  48:     INNER JOIN EffectiveTimelines t2

  49:     ON t1.Number + 1 = t2.Number

  50:     AND t1.ID = t2.ID

  51:     AND DATEADD(dd, 1, t1.Date) < t2.Date

  52:&#160;

  53:     RETURN;

  54: END

Running this function will give the following result:

ID	ValidFrom	ValidTo
1234	2000-01-01	2009-02-02
1234	2009-02-02	2010-03-03
1234	2010-03-03	2999-12-31

Now that we have an unigue, effective timeline, we can now join the rest of the data to get a complete overview. See the following example:

   1: SELECT * FROM

   2: (

   3:     SELECT ID

   4:     , ValidFrom

   5:     , ValidTo

   6:     , [100] AS Name

   7:     , [101] AS Birthday

   8:     , [102] AS Nationality

   9:     , [103] AS Marital_Status

  10:     FROM

  11:     (       SELECT eav.ID, eav.EntityID, eav.AttributeID, eav.Value, t.ValidFrom, t.ValidTo

  12:             FROM EAV_Data eav

  13:             INNER JOIN fn_EffectiveTimelines() t

  14:             ON eav.ID = t.ID

  15:             AND eav.ValidFrom < t.ValidTo

  16:             AND eav.ValidTo > t.ValidFrom

  17:     ) p

  18:     PIVOT

  19:     (

  20:         MAX (Value)

  21:         FOR AttributeID IN ([100], [101], [102], [103])

  22:     ) AS pvt

  23: ) as result

In this step the data is assigned to every period is has overlap with. Because some values are longer valid than others, it can happen that one value is assigned to multiple periods, as shown in the graph below:

Because the periods are now neatly aligned, the PIVOT statement will come up with the following result:

ID	ValidFrom	ValidTo	Name	Birthday	Nationality	Marital_Status
1234	2000-01-01	2009-02-02	John	12-12-1970	English	Single
1234	2009-02-02	2010-03-03	John	12-12-1970	Dutch	Single
1234	2010-03-03	2999-12-31	John	12-12-1970	Dutch	Married

This results is a more accurate representation of the data. The NULL fields are gone and the data is now correctly divided amongst the periods.

With this technique you can pivot data with multiple timelines with little effort, it doesn’t matter if you have 3 periods or 3000: this technique will always calculate the effective timeline (albeit a bit slower ;)).

Conclusion

By making use of effective timelines working with EAV is a bit less complex.

And like I said in the beginning of this blog, calculating an effective timeline by merging periods can be useful in other situations as well. For example, if you have multiple periods that have overlap and you need to count the unique days over all periods, having a effective timeline can really speed up the process.

Geïnteresseerd in meer blogs?

Bekijk alle blogs

Datum van publicatie

21 februari 2010

Update: 10 september 2024

Leestijd

5 min

Auteur

Mark Streutker

Navigeer naar

Wijzig taal

Handling EAV-data: Timelines

Geïnteresseerd in meer blogs?

Datum van publicatie

Leestijd

Auteur

Tags

Handling EAV-data: Timelines

Deel dit artikel

Geïnteresseerd in meer blogs?

Datum van publicatie

Leestijd

Auteur

Tags

Gerelateerde blogs

Microsoft versterkt koppositie in Gartner's Magic Quadrant voor BI

Over collations, codepages en Unicode

Power BI for Office 365

Cleansing data in Data Explorer