Speeding things up in SSIS, literally

Anyone who works with SSIS on a regular basis knows that there are many roads that lead to Rome.
A datasource, for example, can be configured in a few different ways.
One possibility is by using a parameterized query to retrieve data from a database, but if you decide to go with this technique you must realize that there might be a performance penalty attached to it.

The SQL Server optimizer has a hard time picking the right execution plan when there are variables (parameters) involved, because it has no way to determine the selectivity of the predicate on beforehand.
In this case a generic execution plan will be generated, which isn’t necessarily the best plan for the job.

To demonstrate this behavior I have created the following example, based on the AdventureWorks database.

image

It’s a simple package, with one dataflow task and an OLE DB Source with the following query:

   1: SELECT 

   2:     contact.ContactID, 

   3:     contact.FirstName, 

   4:     contact.LastName 

   5: FROM        HumanResources.Employee employee

   6: INNER JOIN  HumanResources.Employee manager

   7:         ON  employee.ManagerID = manager.EmployeeID

   8: INNER JOIN  Person.Contact contact

   9:         ON  employee.ContactID = contact.ContactID

  10:     WHERE   manager.EmployeeID = ?

I have created a Dummy Destination with a Custom Script Task, since I’m only interested in the output from the Datasource and there is no direct need to actually do something with this data.

The local variable named “ManagerID” has a fixed value of 200, which is going to be assigned to the parameter.

The manager with ManagerID 200 has two employees working for him, so this query should return two records.

image

Notice the ‘?’ at the end of the statement, this will be our parameter.

image

 

With a profiler running on the background, the package is firing the next set of statements to the AdventureWorks database:

   1: declare @p1 int

   2: set @p1=0

   3: exec sp_prepare @p1 output,N'@P1 int',N'SELECT 

   4:     contact.ContactID, 

   5:     contact.FirstName, 

   6:     contact.LastName 

   7: FROM        HumanResources.Employee employee

   8: INNER JOIN  HumanResources.Employee manager

   9:     ON   employee.ManagerID = manager.EmployeeID

  10: INNER JOIN  Person.Contact contact

  11:         ON  employee.ContactID = contact.ContactID

  12:      WHERE  manager.EmployeeID = @P1',1

  13: select @p1

  14: go

  15: exec sp_execute 1,200

  16: go

  17: exec sp_unprepare 1

  18: go

Note that the value 200 is passed as a parameter.

Now that we’ve got our hands on the SQL statement we can replay the situation on the database itself, running this query yields the following result:

image

Based on the information that was available, the optimizer decided to include a Clustered Index Scan on the Employee table to select ManagerID 200.

If we take this step a little further, we notice that the “Actual Number of Rows” is 2 (which is correct), while the “Estimated Number of Rows” is close to 6 (which is a bit off).

Because we used a parameter (a variable factor) in our query, the optimizer couldn’t tell on beforehand how many rows were going to be returned.

This was solved by building a plan based on average numbers.

In case you were wondering where the number 6 is coming from, this is the average number of employees the managers have:

   1: SELECT AVG(NumberOfEmployees)

   2: FROM

   3: (

   4:     SELECT manager.EmployeeID, COUNT(*) as NumberOfEmployees FROM HumanResources.Employee employee

   5:     INNER JOIN HumanResources.Employee manager

   6:     ON employee.ManagerID = manager.EmployeeID

   7:     GROUP BY manager.EmployeeID

   8: ) sub

This plan was generated under the assumption that the number of rows returned was going to be close to 6, but it might not have been the best plan for the number of rows we actually got back.

By rewriting the query to use literals instead of parameters we can help the optimizer to make a better estimate to generate a plan.

To illustrate this, I have altered the package by adding an extra variable (conviently named “Query”). This variable has EvaluateAsExpression set to true, which enables us to write an expression in this variable. I have entered the following string as expression:

   1: "SELECT 

   2:     contact.ContactID, 

   3:     contact.FirstName, 

   4:     contact.LastName 

   5: FROM        HumanResources.Employee employee

   6: INNER JOIN  HumanResources.Employee manager

   7:         ON  employee.ManagerID = manager.EmployeeID

   8: INNER JOIN  Person.Contact contact

   9:         ON  employee.ContactID = contact.ContactID

  10:      WHERE  manager.EmployeeID = " + (DT_WSTR, 3)  @[User::ManagerID]

Right after that I have used this variable as the source for my Datasource: 
image

After running the package the following SQL is generated:

   1: declare @p1 int

   2: set @p1=0

   3: exec sp_prepare @p1 output,NULL,N'SELECT 

   4:     contact.ContactID, 

   5:     contact.FirstName, 

   6:     contact.LastName 

   7: FROM        HumanResources.Employee employee

   8: INNER JOIN  HumanResources.Employee manager

   9:         ON  employee.ManagerID = manager.EmployeeID

  10: INNER JOIN  Person.Contact contact

  11:         ON  employee.ContactID = contact.ContactID

  12:      WHERE  manager.EmployeeID = 200',1

  13: select @p1

  14: go

  15: exec sp_execute 1

  16: go

  17: exec sp_unprepare 1

  18: go

 

Note that the value 200 is now included as a literal, rather than a parameter.

Once again, if we run this against the AdventureWorks database we get the following plan:

image

While one might argue that this was basically the same query, the execution plans are entirely different as you can see. The Clustered Index Scan is now gone and an Index Seek and Key Lookup have taken it’s place.

When we check the details on the filter once more we see that the estimated and actual rowcount are now both 2.

Because our query was completely based on literals, the optimizer was able to generate a better plan for our statement.

When both queries are run right after another, it becomes clear that the second solution (the expression) is a lot faster than the parameterized query:

image

One final note, an expression won’t always give a guaranteed performance improvement.

For example, if a query always returns the same number of results (because the filtered column is unique) a parameterized query might result in the same plan as a literal query.

Nonetheless, it is always good to know that even if you use the same query, using parameters or literals can influence the performance of a package.