Time for a bit of revision. What is Two-phase Commit (2PC)?
Firstly, there’s lots of in formation out there on 2PC including Wkikpedia and MSDN. Those articles will go into much more detail about 2PC than I will here. This post is really just an aide-mémoire.
In a nutshell:
“It is a distributed algorithm that coordinates all the processes that participate in a distributed atomic transaction on whether to commit or abort (roll back) the transaction (it is a specialized type of consensus protocol). The protocol achieves its goal even in many cases of temporary system failure (involving either process, network node, communication, etc. failures), and is thus widely utilized.” [1]
Essentially 2PC provides a mechanism for tasks to be executed across separate systems as a single atomic distributed transaction. For example you might want to make updates to separate databases on different servers - with each update running in its own transaction - and have the whole process run as a single distributed transaction. If an error occurs during any of the component transactions then all of the transactions should be aborted (rolled back).
Note that 2PC does not have to apply to database transactions. A step in the process could mean executing a program.
“The term transaction (or any of its derivatives, such as transactional), might be misleading. In many cases, the term transaction describes a single program executing on a mainframe computer that does not use the 2PC protocol. In other cases, however, it is used to denote an operation that is carried out by multiple programs on multiple computers that are using the 2PC protocol.” [2]
There will be 2 basic actors to 2PC: a coordinating process that manages the distributed transaction, and participating processes (participants, cohorts, or workers).
The 2PC protocol calls for 2 phases (see reference [1] for full details):
- Commit-request phase (or Voting phase)
- The coordinator sends an instruction to all cohorts to undertake their part of the distributed transaction and waits until it has received a reply from all cohorts.
- The cohorts execute the transaction up to the point where they will be asked to commit. They each write an entry to their undo log and an entry to their redo log.
- Each cohort replies with an agreement message (cohort votes Yes to commit), if the cohort's actions succeeded, or an abort message (cohort votes No, not to commit), if the cohort experiences a failure that will make it impossible to commit.
- Commit phase (or Completion phase)
- Success
- If the coordinator received an agreement message from all cohorts during the commit-request phase:
- The coordinator sends a commit message to all the cohorts.
- Each cohort completes the operation, and releases all the locks and resources held during the transaction.
- Each cohort sends an acknowledgment to the coordinator.
- The coordinator completes the transaction when all acknowledgments have been received.
- Failure
- If any cohort votes No during the commit-request phase (or the coordinator's timeout expires):
- The coordinator sends a rollback message to all the cohorts.
- Each cohort undoes the transaction using the undo log, and releases the resources and locks held during the transaction.
- Each cohort sends an acknowledgement to the coordinator.
- The coordinator undoes the transaction when all acknowledgements have been received.
The key point is that the cohorts do their work up to the point that they need to commit their transactions. They then vote on whether or not to commit. If all cohorts vote “Yes” then the coordinator tells all cohorts to commit. If any cohort votes “No” then the coordinator tells all cohorts to abort (rollback).
An interesting scenario to be considered is what happens if a cohort crashes having already voted but before it receives or processes the coordinator’s instruction to commit. The trick here is that the distributed transaction is not committed until all cohorts have acknowledged that they have committed. The coordinator will instruct the crashed cohort to commit when it becomes available again. To make this kind of scenario work the cohorts need to use logging to keep track of what steps they have taken (e.g. the database transaction log):
“To accommodate recovery from failure (automatic in most cases) the protocol's participants use logging of the protocol's states. Log records, which are typically slow to generate but survive failures, are used by the protocol's recovery procedures.” [1]
Disadvantages
The two-phase commit protocol is a blocking protocol. So,
- Resources may be locked by cohorts while waiting for an instruction from the coordinator to commit or abort
- If the coordinator fails permanently, some cohorts will never resolve their transactions
References
[1] Two-phase commit protocol (Wikipedia)
[2] Two-Phase Commit (MSDN)