Race-condition in NServiceBus Sagas

A while ago I wrote a post about an Advanced Saga-Mapping component for NServiceBus, based on RavenDB’s query mechanism. After toying around with it for a bit (we really like the option to map on headers), I ran into some issues. Or rather, one issue in particular: on occasion my saga data would not be persisted yet when a second message for that saga arrived. Most peculiarly, this also happened when the second message was the result of a message sent by the saga itself (e.g. an event sent by a handler that was running because of a command sent by the saga).

Race-condition

After some thorough research I came to the conclusion that the problem lies with RavenDB and the Distributed Transaction Coordinator (DTC). When a saga is done handling its initial message, the NServiceBus framework commits the transaction. Sadly, this is where the issue originates: because the transaction interacts with both MSMQ and RavenDB, it is a distributed transaction. When you commit a distributed transaction, the DTC will at some point tell you the commit has succeeded (by returning from your method call). At that point the DTC has determined that no participant in the transaction needs to roll back, and as such the commit will go through. However, the data isn’t actually committed yet. When the DTC’s transaction completes, it has told all participating systems to commit their part of the transaction, without waiting for them to actually do so. While this still ensures the data will be committed, it also introduces a race-condition: NServiceBus might attempt to find the saga data before RavenDB has finished committing it.
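To make the timing concrete, here is a minimal sketch in Python. It is a toy simulation, not NServiceBus, RavenDB or DTC code: the `Participant` and `Coordinator` classes and all their methods are made up for illustration, and rollback handling is left out. It only shows how a commit call can return successfully while the data is not yet visible to a reader.

```python
class Participant:
    """Hypothetical stand-in for a resource manager such as a document database."""

    def __init__(self):
        self.committed = {}  # documents visible to readers
        self.prepared = {}   # tx_id -> writes that are promised but not yet visible

    def prepare(self, tx_id, writes):
        # Phase one: remember the writes and vote "yes".
        self.prepared[tx_id] = dict(writes)
        return True

    def apply_commit(self, tx_id):
        # Phase two: make the prepared writes visible.
        self.committed.update(self.prepared.pop(tx_id))

    def load(self, key):
        return self.committed.get(key)


class Coordinator:
    """Hypothetical stand-in for the DTC; phase-two delivery is deferred on purpose."""

    def __init__(self):
        self.undelivered = []  # (participant, tx_id) notifications still to be sent

    def commit(self, tx_id, work):
        # work is a list of (participant, writes) pairs.
        if not all(p.prepare(tx_id, writes) for p, writes in work):
            raise RuntimeError("a participant voted to roll back")
        # The outcome is decided, so return to the caller without waiting
        # for the participants to apply their part of the transaction.
        self.undelivered.extend((p, tx_id) for p, _ in work)

    def deliver_commits(self):
        while self.undelivered:
            participant, tx_id = self.undelivered.pop(0)
            participant.apply_commit(tx_id)


raven = Participant()
dtc = Coordinator()

dtc.commit("tx-1", [(raven, {"sagas/1": {"order": "A-1"}})])
too_early = raven.load("sagas/1")  # None: the commit "succeeded", the data is not there yet

dtc.deliver_commits()
later = raven.load("sagas/1")      # now the saga data is visible
```

The window between `commit` returning and `deliver_commits` running is exactly where a second message for the saga can arrive and find nothing.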

RavenDB: Query vs. Load

Fortunately, RavenDB is aware of this issue (see http://ravendb.net/docs/faq/working-with-dtc) and provides a work-around for it. You can tell RavenDB to wait until the pending transaction commits by setting an advanced property on your session. While this solves the problem if you attempt to load the data from the database, it doesn’t work when you attempt to query the database.

RavenDB offers two ways to retrieve data: you can run a query with the Query method, or load the data using the Load method. There is a catch, of course: if you wish to use the Load method, you need to provide it with the key that identifies your data. In my case I’d like to map on multiple fields (both a property and a header of my message), so I need to either combine all of those in the key of the document or use a query to check multiple fields.

So, while loading might solve my problem, it also means I always need the key of the document I want. Querying, on the other hand, is more flexible but doesn’t wait for the pending transaction to commit. While the query can be set to wait for non-stale results, RavenDB does not consider its indexes stale when new data isn’t committed yet. As a result, the query will only return documents that already exist. (If any of those documents were altered in the transaction being committed, RavenDB would let us know, but that is not helpful in this scenario.)
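The difference between the two access paths can be sketched as follows. Again this is a Python simulation under my own assumptions, not the RavenDB client API: `ToyStore`, `begin_commit`, `load` and `query` are hypothetical names, and the delayed commit is faked with a timer. The point is only that a key-based load can tell that its document is involved in a pending commit and wait for it, while a query over existing documents has nothing to wait on.

```python
import threading


class ToyStore:
    """Hypothetical store: loads can wait for a pending commit, queries cannot."""

    def __init__(self):
        self._docs = {}      # committed documents
        self._pending = {}   # key -> Event that is set once the commit lands
        self._lock = threading.Lock()

    def begin_commit(self, key, doc, delay):
        # The transaction is already decided; the document shows up after `delay` seconds.
        done = threading.Event()
        with self._lock:
            self._pending[key] = done

        def finish():
            with self._lock:
                self._docs[key] = doc
                del self._pending[key]
            done.set()

        threading.Timer(delay, finish).start()

    def load(self, key, wait_for_pending=True, timeout=5.0):
        with self._lock:
            done = self._pending.get(key)
        if done is not None and wait_for_pending:
            done.wait(timeout)  # the key tells us which commit to wait for
        with self._lock:
            return self._docs.get(key)

    def query(self, predicate):
        # Only sees committed documents; a document that does not exist yet
        # cannot make the result look stale.
        with self._lock:
            return [doc for doc in self._docs.values() if predicate(doc)]


def is_my_saga(doc):
    return doc["order"] == "A-1" and doc["tenant"] == "t1"


store = ToyStore()
store.begin_commit("sagas/1", {"order": "A-1", "tenant": "t1"}, delay=0.2)

queried_early = store.query(is_my_saga)  # empty: nothing committed yet
loaded = store.load("sagas/1")           # blocks until the commit has landed
queried_late = store.query(is_my_saga)   # now contains the saga data
```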

Unique

So how does the default NServiceBus Saga implementation work around this issue? Well, have you ever wondered why it is so important to flag one of your saga data’s properties as unique (using the UniqueAttribute)? When saga data has a unique property, NServiceBus creates an additional document in the RavenDB store, which can be identified by a hash of the unique property. This allows NServiceBus to first load this additional document by calculating its key, and then use the data in that document to load the actual saga data. For timeouts it’s even easier: they contain the SagaId, so the saga data can be loaded by its key directly.
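The idea behind the extra document can be sketched like this. This is a Python illustration of the technique only: the key format, the choice of SHA-256 and the `ToySagaPersister` class are my own assumptions and do not reflect how NServiceBus actually names or hashes its documents.

```python
import hashlib


def unique_doc_key(saga_type, property_name, value):
    # Assumed key scheme: a deterministic key derived from a hash of the unique value.
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return f"{saga_type}/{property_name}/{digest}"


class ToySagaPersister:
    """Hypothetical persister that only ever reads documents by key."""

    def __init__(self):
        self.docs = {}  # stands in for key-addressed documents in the store

    def save(self, saga_type, saga_id, saga_data, unique_property):
        saga_key = f"sagas/{saga_id}"
        self.docs[saga_key] = saga_data
        # The additional document maps the unique value to the saga's key.
        lookup_key = unique_doc_key(saga_type, unique_property, saga_data[unique_property])
        self.docs[lookup_key] = {"saga_key": saga_key}

    def find_by_unique(self, saga_type, property_name, value):
        # Two loads by key, no query involved.
        lookup = self.docs.get(unique_doc_key(saga_type, property_name, value))
        if lookup is None:
            return None
        return self.docs.get(lookup["saga_key"])

    def find_by_id(self, saga_id):
        # Timeouts carry the saga id, so a single load by key is enough.
        return self.docs.get(f"sagas/{saga_id}")


persister = ToySagaPersister()
persister.save("OrderSaga", "42", {"OrderId": "A-1", "State": "Started"}, "OrderId")

by_unique = persister.find_by_unique("OrderSaga", "OrderId", "A-1")
by_id = persister.find_by_id("42")
```

Because both lookups are loads by key, they can benefit from waiting on a pending transaction, which a query cannot.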

So, what now?

While NServiceBus solves the race-condition in the case of timeouts and messages mapped on the saga data’s unique property, it remains an issue when you want to map a second message on another property. In my case I have the messages that come from business services mapped on logical keys, combined with a header to provide multi-tenancy, while all messages resulting from actions set out by the saga return based on a generated correlation id (no header required, since the id is unique across all tenants). This forces me to use RavenDB’s Query method for at least one of those scenarios, as described in my previous blog post.

I did think of some solutions for this issue:
Concatenate all fields you wish to map on into a single property, both on the saga data and on every mapped message. While I can accept this solution for commands, I deem it undesirable to add properties to events just because they might be handled by a specific saga.
Concatenate all fields on the saga into a single (unique) property and allow a Func<> to be provided in the mapping to concatenate the fields of the message. This avoids the undesired concatenated properties on events and ensures all mapping logic lies with the saga. Still, it’s not the neatest way to go…
Add a mapping in RavenDB, much like the one used by NServiceBus’s UniqueAttribute, to allow mapping on multiple properties. For the developer writing the saga this is an elegant solution, as they will hardly see any of it. For me, on the other hand, it means not only a custom implementation to find sagas, but also one to persist them! Additionally, I am not quite sure how far I can go here without needing to acquire a separate RavenDB license.
Implement a retry mechanism for saga resolving. While forcing NServiceBus to retry finding the saga if it doesn’t find it is not an actual solution, it greatly reduces the fallout of the race-condition.
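
The last option, retrying the lookup, could look roughly like the sketch below. It is written in Python as a generic illustration and is not tied to any NServiceBus extension point: `find_saga_with_retry` and its parameters are hypothetical, and the attempt count and back-off are arbitrary choices. As noted in the list, this only narrows the window; it cannot tell "not committed yet" apart from "does not exist".

```python
import time


def find_saga_with_retry(find, attempts=5, delay=0.05, sleep=time.sleep):
    """Call `find` until it returns something or the attempts run out.

    `find` is any zero-argument callable that returns the saga data or None,
    for example a wrapper around a query on multiple fields.
    """
    for attempt in range(attempts):
        saga = find()
        if saga is not None:
            return saga
        if attempt < attempts - 1:
            # Back off a little longer each time to give the commit time to land.
            sleep(delay * (2 ** attempt))
    return None


# Usage with a fake finder that only succeeds on the third call.
calls = {"count": 0}


def flaky_find():
    calls["count"] += 1
    return {"order": "A-1"} if calls["count"] >= 3 else None


waits = []
found = find_saga_with_retry(flaky_find, attempts=5, delay=1.0, sleep=waits.append)
```

Passing `sleep` in as a parameter keeps the helper easy to test without actually waiting.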

Right now, I don’t have a real (fitting) solution for this issue. As such, I would like to have your input on this matter. If you have any ideas that could contribute to a solution, please let me know! You can leave comments on this blog, respond on the NServiceBus community forums or contact me via Twitter (@MarkTaling).