Atomic operations on several transactionless external systems
Say you have an application connecting 3 different external systems. You need to update something in all 3. In case of a failure, you need to roll back the operations. This is not a hard thing to implement, but say operation 3 fails, and when rolling back, the rollback for operation 1 fails! Now the first external system is in an invalid state...
I'm thinking a possible solution is to shut down the application and force a manual fix of the external system, but then again... It might already have used this information (and perhaps that's why it failed), or we might not have sufficient access. Or it might not even be a good way to roll back the action!
Are there some good ways of handling such cases?
EDIT: Some application details:
It's a multi-user web application. Most of the work is done with scheduled jobs (through Quartz.Net), so most operations run in their own thread. Some user actions should trigger jobs that update several systems, though. The external systems are somewhat unstable.
I was thinking of changing the application to use the Command and Unit of Work patterns.
Two-Phase Commit (2PC) might be suitable here.
The first phase is getting the various databases to agree that they are willing to go ahead with the commit. In your example, database 1 won't proceed with the write until it is sure that all three databases have reported that the transaction will be possible.
Compare this with the process you are describing, which is an "optimistic" approach: Database 1 assumes the transaction will go through until it learns otherwise, and is then forced to roll back.
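As a rough illustration of the two phases, here is a minimal in-memory sketch. The `Participant` class and its `prepare`/`commit`/`abort` methods are hypothetical names, not part of any real library, and the sketch assumes each external system can actually offer a "prepare" step, which transactionless systems may not.

```python
class Participant:
    """Hypothetical wrapper around one external system."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # stands in for a real feasibility check
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote on whether the change could be applied.
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit

    def commit(self):
        # Phase 2, success path: make the prepared change permanent.
        self.state = "committed"

    def abort(self):
        # Phase 2, failure path: discard the prepared change.
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if every participant committed, False if all were aborted."""
    # Phase 1: collect a vote from every participant; nothing is written yet.
    votes = [p.prepare() for p in participants]

    # Phase 2: commit only if the vote was unanimous, otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

The sketch ignores the hard part (a participant or the coordinator failing between the two phases), so treat it as a picture of the message flow rather than a reliable implementation.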
Would you like to explain further how the rollback of operation 1 could fail?
The state it is aiming to get to is one that it has been in before, so it should be logically consistent. There might be transient issues like network failure, but it might be the case that the best way to deal with that is to retry until the problem goes away.
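A minimal sketch of such a retry loop follows. The function name `retry_rollback` and its parameters are made up for illustration, and it assumes transient failures surface as `ConnectionError`. Unlike "retry until the problem goes away", it gives up after a fixed number of attempts so a job thread is not blocked forever.

```python
import time


def retry_rollback(rollback, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call rollback() until it succeeds or max_attempts is exhausted.

    Returns True on success, False if every attempt raised ConnectionError.
    """
    for attempt in range(max_attempts):
        try:
            rollback()
            return True
        except ConnectionError:
            # Assumed to be transient; any other exception propagates.
            if attempt < max_attempts - 1:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** attempt)
    return False
```

If this returns `False`, the rollback still has not happened, so the caller would need to record that somewhere and alert a human.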
If the problem is that subsequent transactions have locked or changed the data in the meantime, then you have a much larger problem - your transactions are not atomic, and rolling them back may cause the output of other transactions to become invalid.
Depending on the size of the application (single user vs. enterprise), shutting down the application might be a bad idea.
First of all, I'd suggest saving the initial state of the information being changed in the 3 external apps to storage local to your own app. That means you can at least determine what the rollback state is supposed to be should your app crash/the rollback fail/etc. Once the transaction has successfully committed you can then delete this data.
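One way this local storage could look is sketched below, assuming a single JSON file is enough and that the original values are JSON-serializable. `SnapshotJournal` and its method names are hypothetical, chosen only for this example.

```python
import json
import os


class SnapshotJournal:
    """Keeps pre-change state per transaction in a local JSON file."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def _save(self, data):
        # Write to a temp file and rename it over the journal, which reduces
        # the chance of a crash leaving a half-written file behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(data, f)
        os.replace(tmp, self.path)

    def record(self, tx_id, system, original):
        """Save the original state of one system before changing it."""
        data = self._load()
        data.setdefault(tx_id, {})[system] = original
        self._save(data)

    def pending(self):
        """Transactions that never finished; each maps system -> original state."""
        return self._load()

    def clear(self, tx_id):
        """Drop the snapshot once the transaction has committed everywhere."""
        data = self._load()
        data.pop(tx_id, None)
        self._save(data)
```

On startup, anything still returned by `pending()` is a transaction that was interrupted, and the stored values say what each system should be rolled back to.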
What to do when one of the operations fails depends on the functionality of the 3 external systems. Let's assume that one of these systems holds employee data. Shutting down the application simply because one employee's address is wrong due to a failed transaction is overkill. It's much better to simply check the failed transaction log (i.e. the local storage to which you saved the initial states of the 3 external apps) whenever an employee's data is accessed. If that employee data is flagged as invalid, throw an error indicating that the record is in an invalid state and cannot be retrieved.
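The per-record check could be as small as the sketch below. `RecordInvalidError`, `get_employee`, and the `fetch` callback are hypothetical names, and the failed transaction log is assumed to be available as a set of affected employee IDs.

```python
class RecordInvalidError(Exception):
    """Raised when a record was left in an unknown state by a failed transaction."""


def get_employee(employee_id, failed_ids, fetch):
    """Return the employee record unless a failed transaction touched it.

    failed_ids: IDs taken from the local failed transaction log.
    fetch: callable that reads the record from the external system.
    """
    if employee_id in failed_ids:
        raise RecordInvalidError(
            f"employee {employee_id} is in an invalid state and cannot be retrieved"
        )
    return fetch(employee_id)
```

Only the flagged records become unavailable; everything else in the application keeps working.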
However, if the entire external system will be thrown into disarray by a failed transaction, then yes - there's nothing you can do here but shut down your app until the problem is fixed.
Oddthinking's answer is a good one, but limited because it is very difficult to actually reliably do a 2PC. This has been known in the distributed computing community for quite a while, though lots of people try their best to just ignore it.
If you're interested in delving deeper into this area, the Paxos consensus algorithm is a good place to start. And be aware that this is a surprisingly difficult problem, precisely because of both the problems you allude to and the fact that it's actually impossible to build a truly reliable messaging system that can deliver a message in a bounded amount of time. (To understand why that's true, consider that someone with a backhoe might wipe out all the network links between the various communicating parties…)
I suspect the real fix is to design the architecture of the overall system, and how you roll out changes across it, so that a loss of communications in one area is not catastrophic. This might or might not be easy to do, depending on the exact details.