Speeding things up in SSIS, literally
Anyone who works with SSIS on a regular basis knows that there are many roads that lead to Rome.
A datasource, for example, can be configured in a few different ways.
One possibility is by using a parameterized query to retrieve data from a database, but if you decide to go with this technique you must realize that there might be a performance penalty attached to it.
The SQL Server optimizer has a hard time picking the right execution plan when there are variables (parameters) involved, because it has no way to determine the selectivity of the predicate on beforehand.
In this case a generic execution plan will be generated, which isn’t necessarily the best plan for the job.
To demonstrate this behavior I have created the following example, based on the AdventureWorks database.
It’s a simple package, with one dataflow task and an OLE DB Source with the following query:
1: SELECT
2: contact.ContactID,
3: contact.FirstName,
4: contact.LastName
5: FROM HumanResources.Employee employee
6: INNER JOIN HumanResources.Employee manager
7: ON employee.ManagerID = manager.EmployeeID
8: INNER JOIN Person.Contact contact
9: ON employee.ContactID = contact.ContactID
10: WHERE manager.EmployeeID = ?
I have created a Dummy Destination with a Custom Script Task, since I’m only interested in the output from the Datasource and there is no direct need to actually do something with this data.
The local variable named “ManagerID” has a fixed value of 200, which is going to be assigned to the parameter.
The manager with ManagerID 200 has two employees working for him, so this query should return two records.
Notice the ‘?’ at the end of the statement, this will be our parameter.
With a profiler running on the background, the package is firing the next set of statements to the AdventureWorks database:
1: declare @p1 int
2: set @p1=0
3: exec sp_prepare @p1 output,N'@P1 int',N'SELECT
4: contact.ContactID,
5: contact.FirstName,
6: contact.LastName
7: FROM HumanResources.Employee employee
8: INNER JOIN HumanResources.Employee manager
9: ON employee.ManagerID = manager.EmployeeID
10: INNER JOIN Person.Contact contact
11: ON employee.ContactID = contact.ContactID
12: WHERE manager.EmployeeID = @P1',1
13: select @p1
14: go
15: exec sp_execute 1,200
16: go
17: exec sp_unprepare 1
18: go
Note that the value 200 is passed as a parameter.
Now that we’ve got our hands on the SQL statement we can replay the situation on the database itself, running this query yields the following result:
Based on the information that was available, the optimizer decided to include a Clustered Index Scan on the Employee table to select ManagerID 200.
If we take this step a little further, we notice that the “Actual Number of Rows” is 2 (which is correct), while the “Estimated Number of Rows” is close to 6 (which is a bit off).
Because we used a parameter (a variable factor) in our query, the optimizer couldn’t tell on beforehand how many rows were going to be returned.
This was solved by building a plan based on average numbers.
In case you were wondering where the number 6 is coming from, this is the average number of employees the managers have:
1: SELECT AVG(NumberOfEmployees)
2: FROM
3: (
4: SELECT manager.EmployeeID, COUNT(*) as NumberOfEmployees FROM HumanResources.Employee employee
5: INNER JOIN HumanResources.Employee manager
6: ON employee.ManagerID = manager.EmployeeID
7: GROUP BY manager.EmployeeID
8: ) sub
This plan was generated under the assumption that the number of rows returned was going to be close to 6, but it might not have been the best plan for the number of rows we actually got back.
By rewriting the query to use literals instead of parameters we can help the optimizer to make a better estimate to generate a plan.
To illustrate this, I have altered the package by adding an extra variable (conviently named “Query”). This variable has EvaluateAsExpression set to true, which enables us to write an expression in this variable. I have entered the following string as expression:
1: "SELECT
2: contact.ContactID,
3: contact.FirstName,
4: contact.LastName
5: FROM HumanResources.Employee employee
6: INNER JOIN HumanResources.Employee manager
7: ON employee.ManagerID = manager.EmployeeID
8: INNER JOIN Person.Contact contact
9: ON employee.ContactID = contact.ContactID
10: WHERE manager.EmployeeID = " + (DT_WSTR, 3) @[User::ManagerID]
Right after that I have used this variable as the source for my Datasource:
After running the package the following SQL is generated:
1: declare @p1 int
2: set @p1=0
3: exec sp_prepare @p1 output,NULL,N'SELECT
4: contact.ContactID,
5: contact.FirstName,
6: contact.LastName
7: FROM HumanResources.Employee employee
8: INNER JOIN HumanResources.Employee manager
9: ON employee.ManagerID = manager.EmployeeID
10: INNER JOIN Person.Contact contact
11: ON employee.ContactID = contact.ContactID
12: WHERE manager.EmployeeID = 200',1
13: select @p1
14: go
15: exec sp_execute 1
16: go
17: exec sp_unprepare 1
18: go
Note that the value 200 is now included as a literal, rather than a parameter.
Once again, if we run this against the AdventureWorks database we get the following plan:
While one might argue that this was basically the same query, the execution plans are entirely different as you can see. The Clustered Index Scan is now gone and an Index Seek and Key Lookup have taken it’s place.
When we check the details on the filter once more we see that the estimated and actual rowcount are now both 2.
Because our query was completely based on literals, the optimizer was able to generate a better plan for our statement.
When both queries are run right after another, it becomes clear that the second solution (the expression) is a lot faster than the parameterized query:
One final note, an expression won’t always give a guaranteed performance improvement.
For example, if a query always returns the same number of results (because the filtered column is unique) a parameterized query might result in the same plan as a literal query.
Nonetheless, it is always good to know that even if you use the same query, using parameters or literals can influence the performance of a package.