WF persistence, OwnershipTimeout and multiple runtime instances: how to avoid scheduling the same workflow instance twice
In one of the projects I work on as an architect, we ran into an interesting issue related to WF and the WCF messages we receive. Each time we restarted our workflow host, we would see the same workflow instance scheduled twice, in two different workflow runtimes. We discovered this had to do with the persistence points we injected to ensure messages would be processed. So what we had is basically the following workflow:
We have external messages coming in from the outside world; we ensure the message is persisted in the workflow, and then we call another internal web service that handles part of the process. Now what is interesting about this construct is that the first time you run it, you see the following behavior:
AppDomain.CurrentDomain.FriendlyName : Workflow Instance ID : Message
8dc76321-1-129203865283528469: 6271b6f5-962c-41fa-877e-e74f8f74b511: Workflow1:1
8dc76321-1-129203865283528469: 6271b6f5-962c-41fa-877e-e74f8f74b511: Workflow1:2
8dc76321-1-129203865283528469: 6271b6f5-962c-41fa-877e-e74f8f74b511: Workflow1:1
8dc76321-1-129203865283528469: 1f5d7117-86bc-43d0-bc30-eb9c6b006b38: Workflow2:1
8dc76321-1-129203865283528469: 6271b6f5-962c-41fa-877e-e74f8f74b511: Workflow1:3
Now at first this looks like a bug, but it gets more interesting when you dive into the behavior of Windows Workflow Foundation and the way it handles incoming message receive activities. When a message arrives, the receive activity activates the workflow type it is part of. Because we enforced persistence within the sequence, a persistence record is written to the persistence store. When the second message arrives, the WF infrastructure initiates a new workflow runtime for the new workflow type, "Workflow2". Because a new workflow runtime is started, persistence is initialized as well, and there it looks for any instance that is in the running state and has no owner, so that the runtime can pick up that instance and restart it.
So let's first start by showing the configuration I used for the service behavior:
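A minimal sketch of such a behavior configuration (the database name `WorkflowPersistenceStore` and the connection string are illustrative; adjust them to your environment):

```xml
<system.serviceModel>
  <behaviors>
    <serviceBehaviors>
      <behavior name="WorkflowServiceBehavior">
        <workflowRuntime>
          <services>
            <!-- Default persistence configuration: no OwnershipTimeoutSeconds,
                 so the service runs in its non-locking mode. -->
            <add type="System.Workflow.Runtime.Hosting.SqlWorkflowPersistenceService, System.Workflow.Runtime, Version=3.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35"
                 connectionString="Data Source=.;Initial Catalog=WorkflowPersistenceStore;Integrated Security=True"
                 UnloadOnIdle="true" />
          </services>
        </workflowRuntime>
      </behavior>
    </serviceBehaviors>
  </behaviors>
</system.serviceModel>
```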
So what happens when you use this “default” configuration of the persistence services?
Both services will use the same persistence store, but they don't claim ownership of an instance while they run. The reason for this is that the workflow runtime does not want to incur a locking performance penalty when a single workflow runtime uses a single store. In a farm scenario, however, you need a configuration where the persistence store is aware of multiple workflow runtime instances and uses a locking mechanism. The SqlWorkflowPersistenceService class, which is responsible for the persistence implementation, has multiple constructors that initiate different behavior. When you configure persistence as shown above, you get the constructor that initiates the non-locking version of the persistence service. To create an instance that is locking-aware, you need to configure the OwnershipTimeoutSeconds property. This interval specifies how long the ownership claim of a workflow runtime is honored; once it expires, the lock is treated as "ignore the lock, the instance probably died" and the instance is picked up and scheduled for execution.
Once you set this property, the calls to the stored procedures change in such a way that each runtime instance gets its own instance ID. This ID is used for locking.
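When hosting the runtime yourself, the same effect can be had in code via the constructor overload that takes an ownership duration; a sketch, assuming a `workflowRuntime` and `connectionString` you have already set up:

```csharp
// Locking-aware persistence: passing an instanceOwnershipDuration makes the
// service claim instances under this runtime's own ID while they execute.
SqlWorkflowPersistenceService persistence = new SqlWorkflowPersistenceService(
    connectionString,
    true,                      // unload workflows from memory when they go idle
    TimeSpan.FromSeconds(10),  // instanceOwnershipDuration (OwnershipTimeoutSeconds)
    TimeSpan.FromSeconds(5));  // loadingInterval: how often to poll for expired locks

workflowRuntime.AddService(persistence);
```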
So the configuration for a workflow service hosted in IIS would look as follows:
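A sketch of the same behavior configuration, now with the locking-aware settings (the values for OwnershipTimeoutSeconds and LoadIntervalSeconds here are illustrative, not recommendations):

```xml
<system.serviceModel>
  <behaviors>
    <serviceBehaviors>
      <behavior name="WorkflowServiceBehavior">
        <workflowRuntime>
          <services>
            <!-- Setting OwnershipTimeoutSeconds selects the locking-aware
                 constructor, so each runtime claims the instances it runs. -->
            <add type="System.Workflow.Runtime.Hosting.SqlWorkflowPersistenceService, System.Workflow.Runtime, Version=3.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35"
                 connectionString="Data Source=.;Initial Catalog=WorkflowPersistenceStore;Integrated Security=True"
                 UnloadOnIdle="true"
                 OwnershipTimeoutSeconds="10"
                 LoadIntervalSeconds="5" />
          </services>
        </workflowRuntime>
      </behavior>
    </serviceBehaviors>
  </behaviors>
</system.serviceModel>
```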
One thing I was not aware of is that a new workflow runtime instance is created for each workflow type, which is what causes this problem when you have multiple workflow services in the same workflow hosting environment. So for the two workflow types in the scenario described, there will be two instances of the workflow runtime.
Now the only remaining question is: what value should I specify for OwnershipTimeoutSeconds?
The value must be big enough to allow a workflow instance episode (from idle to running, back to idle and persisted again) to execute fully. If the ownership timeout is too short, you will experience multiple instances of the same workflow executing in different workflow runtime instances (like the issue we had). The maximum value for this property is the time you allow between a workflow runtime failing and another runtime picking up its work. For the sample workflows in this post, 10 seconds is more than enough, but in more complex scenarios you need to figure out the maximum time spent in an episode and, for example, the recovery time allowed by your non-functional requirements. One way to discover the minimum value is to plug in the tracking service as well and track the time spent between workflow tracking events of type Persisted. This way you can see the intervals between persistence points and set the timeout to a value higher than the maximum interval found.
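The measurement above can be sketched against the SQL tracking store; a hedged example, assuming the SqlTrackingService schema is installed and `workflowInstanceId` identifies an instance you have already run:

```csharp
// Sketch: compute the gaps between Persisted events for one instance,
// using the SqlTrackingQuery API from System.Workflow.Runtime.Tracking.
SqlTrackingQuery query = new SqlTrackingQuery(connectionString);
SqlTrackingWorkflowInstance instance;
if (query.TryGetWorkflow(workflowInstanceId, out instance))
{
    DateTime? previousPersist = null;
    foreach (WorkflowTrackingRecord record in instance.WorkflowEvents)
    {
        if (record.TrackingWorkflowEvent == TrackingWorkflowEvent.Persisted)
        {
            if (previousPersist.HasValue)
                Console.WriteLine("Interval between persists: {0}",
                    record.EventDateTime - previousPersist.Value);
            previousPersist = record.EventDateTime;
        }
    }
}
```

The largest interval you observe across representative runs gives you a floor for OwnershipTimeoutSeconds.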
Hope this helps when you run into the same issue.
Cheers,
Marcel
Follow my new blog on http://fluentbytes.com