Multi-version concurrency control for ZODB
Multi-version concurrency control (MVCC) is a standard technique for avoiding conflicts between reads and writes of the same object. MVCC guarantees that each transaction sees a consistent view of the database by reading non-current data for objects modified by concurrent transactions.
ZODB will use an optimistic variant of MVCC that provides execution-time consistency. (PostgreSQL uses locks -- a pessimistic scheme.) Instead of raising a ReadConflictError to signal a consistency problem, ZODB will automatically read non-current data that provides consistency.
The goal of implementing MVCC for ZODB is to improve performance by reducing the number of transactions that are retried because of conflicts. It will also eliminate read conflicts, which complicate the ZODB programming model. (Basically, any operation on a persistent object could raise an exception if it is unfortunate enough to cause a read conflict.)
Read conflicts occur when a database cannot provide the application with a consistent view of data. Specifically, ZODB always reads the most recent revision of an object. If the most recent revision of an object was written after the transaction started and the current transaction has read other objects, we can't guarantee consistency.
Read conflicts cause transactions to abort and retry all the work previously done. Long-running transactions are more likely to suffer read conflicts, because they are more likely to run concurrently with other transactions. Since long-running transactions are costly to restart, MVCC should be a significant benefit.
One alternative is to allow the current transaction to read the old revision of the object. This approach allows clients to read stale data in return for improved performance.
The read-old-revision approach will work if conflict resolution can safely be performed on the object, or if the object is read but not written. Note that you need to think harder about consistency to make sure this is safe. Example: What happens when conflict resolution is performed on a BTree that reads the old state of an internal BTree node? I suspect this is safe, but it needs a bit more thought.
If we read in the past, we still need to provide a consistent view of the database. Once we start reading non-current data, we need to make sure that all future reads are consistent with past reads.
There are two primary implementation issues. The database connection must detect potential read conflicts and request a revision of an object that avoids the conflict. The storage must provide an API for the client to request an earlier revision.
The connection knows about the modification times of all the objects that have already been read. When it detects a read conflict, it can request a revision of an object as of the most recent time of any object read by the current transaction. This means that the connection isn't asking for a specific revision; it's providing a bound and asking the storage to provide the right answer.
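The bound computation described above can be sketched as follows. This is a minimal illustration, not the real Connection code: the class and attribute names (ConnectionSketch, _read_tids, _invalidated) are invented, and the storage is assumed to expose the load() and loadNonCurrent() methods discussed in this document.

```python
# Hypothetical sketch: the connection tracks the commit tid of every
# revision it has read, and hands the largest one to the storage as a
# bound when it must read non-current data.  The storage, not the
# connection, picks the right revision under that bound.

class ConnectionSketch:
    def __init__(self, storage):
        self._storage = storage
        self._read_tids = {}       # oid -> tid of the revision we read
        self._invalidated = set()  # oids invalidated since the snapshot began

    def load(self, oid):
        if oid in self._invalidated and self._read_tids:
            # The current revision would be inconsistent with what we
            # have already read; ask for the revision that was current
            # as of the most recent tid among objects already read.
            bound = max(self._read_tids.values())
            data, serial, start, end = self._storage.loadNonCurrent(oid, bound)
        else:
            data, serial = self._storage.load(oid)
        self._read_tids[oid] = serial
        return data
```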
Some entries in the object cache must be flushed after an MVCC transaction. Any non-current object must be removed from the cache, because the next transaction that uses the cache expects all entries to be current. It is straightforward to keep a list of non-current objects and remove them on commit or abort.
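A sketch of that flush-on-boundary bookkeeping, with invented names; the real object cache is considerably more involved.

```python
# Hypothetical sketch: cache entries loaded as non-current revisions are
# remembered and dropped at the transaction boundary, so the next
# transaction sees only current data.

class MVCCCache:
    def __init__(self):
        self._data = {}
        self._non_current = set()

    def set(self, oid, obj, current=True):
        self._data[oid] = obj
        if not current:
            self._non_current.add(oid)

    def after_transaction(self):
        # Called on both commit and abort.
        for oid in self._non_current:
            self._data.pop(oid, None)
        self._non_current.clear()
```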
The existing storage API isn't sufficient for this purpose. The storage API has load() and loadSerial() methods. load() takes an oid and a version and returns the current revision. loadSerial() takes an oid and a serial number and returns that revision. The history() method returns information about each revision of a particular object, including the serialno. So history() could be used to find the serial number and loadSerial() to load the object. This solution is inefficient because it requires two storage calls and because it prevents the storage from optimizing the new behavior.
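The inefficient two-call fallback might look like this sketch. The exact shape of history() records varies by storage, so the 'serial' key and the newest-first ordering are assumptions.

```python
# Sketch of the two-call fallback using only the existing storage API:
# scan history() for the newest revision older than `tid`, then fetch
# that revision with loadSerial().

def load_before(storage, oid, tid):
    """Return the revision of `oid` that was current before `tid`."""
    # history() is assumed to return records newest-first, each carrying
    # the revision's serial number.
    for rec in storage.history(oid):
        if rec['serial'] < tid:
            return storage.loadSerial(oid, rec['serial'])
    raise KeyError(oid)
```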
If we extend the storage API, we need to identify which storages will support it and how much effort we will put into an efficient implementation. A storage needs to support multiple revisions for this to work at all, so FileStorage and BDBStorage are the only realistic candidates. Since almost all large sites use ZEO, an efficient ZEO implementation is also crucial for success.
It would be complicated to support MVCC and versions. Instead, we propose to ignore versions. If a connection is using a version, it will not be able to use MVCC. (We could always implement it later if there was a need.)
The proposed extension is a loadNonCurrent() method to the storage API. (In theory, it could be optional and BaseStorage could provide an implementation using history() and loadSerial().)
loadNonCurrent(oid, tid) -> data, serialno, start_tid, end_tid

Return the revision of object `oid` that was current before transaction `tid` committed. Returns a 4-tuple containing the data and serialno of the revision and two transaction ids; the ids are of the transaction that wrote the revision and the transaction that wrote the next revision. For simplicity, loadNonCurrent() only handles non-version data. It raises KeyError if the object is not found.
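To make the semantics concrete, here is a toy in-memory model of the proposed method. The class name and revision layout are invented; a real storage would read records from disk rather than from a dict.

```python
import bisect

# Toy in-memory model of the proposed loadNonCurrent() semantics:
# revisions are kept per-oid, sorted by the tid that wrote them.

class RevisionStore:
    def __init__(self):
        self._revs = {}  # oid -> sorted list of (start_tid, data)

    def store(self, oid, tid, data):
        self._revs.setdefault(oid, []).append((tid, data))

    def loadNonCurrent(self, oid, tid):
        """Return (data, serialno, start_tid, end_tid) for the revision
        of `oid` that was current before transaction `tid` committed."""
        revs = self._revs.get(oid)
        if not revs:
            raise KeyError(oid)
        starts = [t for t, _ in revs]
        # Index of the last revision written strictly before `tid`.
        i = bisect.bisect_left(starts, tid) - 1
        if i < 0:
            raise KeyError(oid)  # nothing was current before `tid`
        start, data = revs[i]
        # end_tid is the tid of the next revision, or None if this
        # revision is still current.
        end = revs[i + 1][0] if i + 1 < len(revs) else None
        return data, start, start, end
```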
The two transaction ids in the return value are needed by the ZEO client cache. ZEO needs to know the entire lifetime of the revision so that it can handle loadNonCurrent() calls from cached data without contacting the server. Since ZEO is the only client that needs this information, it would be possible to have a second method that only returned the data and serialno. Since BDBStorage and FileStorage provide the transaction ids without any extra cost or complexity, it seems a little simpler to have only one method.
Note that the ordering issues in MVCC depend only on transaction ids, not timestamps, and a storage must generate transaction ids in monotonically increasing order. If multiple storages are involved in a single transaction, each will generate its own transaction ids. As a result, MVCC does not offer cross-storage consistency, because transaction ids from different storages cannot be meaningfully compared.
A transaction normally reads the most current revision of an object. In MVCC, the first invalidation received by the connection sets an upper bound on the range of transactions that can be considered part of a consistent snapshot. When an object to be loaded is in the invalidated set, the connection must request a non-current revision, passing the upper bound as the second argument.
This means that every storage must pass a transaction id along with the set of oids when it delivers an invalidation message, which requires a change in the storage, database, and connection APIs.
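A sketch of the connection-side bookkeeping for such tid-carrying invalidations; the class and method names are illustrative.

```python
# Hypothetical sketch: each invalidation message carries the committing
# transaction's id along with the invalidated oids.  The first such tid
# seen in a transaction fixes the snapshot's upper bound.

class SnapshotTracker:
    def __init__(self):
        self.bound = None        # tid upper bound for this snapshot
        self.invalidated = set()

    def invalidate(self, tid, oids):
        if self.bound is None:
            self.bound = tid     # first invalidation fixes the snapshot
        self.invalidated.update(oids)

    def reset(self):
        # Called at the next transaction boundary.
        self.bound = None
        self.invalidated.clear()
```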
We believe that an important factor in ZEO's scalability is the client cache. The ZEO client should be able to honor loadNonCurrent() requests out of its client cache when possible. This will require a major overhaul of the cache implementation and file format, so that it can cache multiple revisions of an object.
The cache should support an operation like loadNonCurrent() that returns a revision that was current as of a specific transaction. The cache must be able to store multiple revisions of an object, including the transaction range for which the revision is valid. We must also create a cache eviction policy that deals with multiple revisions in a useful way; an old revision may be referenced more recently than a current revision, but it's not clear which revision future accesses will require.
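A cache lookup keyed by validity range might look like this sketch. The entry layout (start_tid, end_tid, data) and the open and closed ends of the range are assumptions, chosen to match the loadNonCurrent() semantics above.

```python
# Hypothetical sketch: each cached revision records the transaction
# range during which it was current, so a loadNonCurrent(oid, tid) can
# be answered locally when some cached range covers `tid`.

def cache_lookup(entries, tid):
    """entries: list of (start_tid, end_tid, data); end_tid of None
    means the revision is still current.  Return the data that was
    current before `tid`, or None on a cache miss."""
    for start, end, data in entries:
        if start < tid and (end is None or tid <= end):
            return data
    return None  # miss: must ask the server
```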
The cache will need to be rewritten from scratch to accommodate the new requirement to store multiple revisions. We may want to change the way the cache manages its storage to achieve better hit rates. Guido did some trace analysis showing that a buddy cache scheme would perform better than the current scheme, which effectively evicts half of the objects on each cache flip. A cache based on BerkeleyDB would be easier to implement, because low-level details would be handled by BerkeleyDB, but there is uncertainty about the reliability of Python+BerkeleyDB.
FileStorage keeps an index of the location of the most recent revision of each object. Each data record has a pointer to the previous data record for the object. It is possible to follow these pointers to find the revision requested. It may also be possible to keep a cache of the locations of recently modified objects, assuming that recently modified objects will be involved in MVCC.
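The back-pointer walk can be sketched as follows. The record layout here is invented, standing in for FileStorage's on-disk data records.

```python
# Hypothetical sketch of following FileStorage-style back pointers:
# each data record stores the position of the previous revision of the
# same object, so older revisions are found by walking the chain from
# the index entry for the current revision.

def find_revision(records, index, oid, tid):
    """records: pos -> (tid, prev_pos, data); index: oid -> pos of the
    most recent record.  Walk back until a record older than `tid`."""
    pos = index[oid]
    while pos is not None:
        rec_tid, prev_pos, data = records[pos]
        if rec_tid < tid:
            return data
        pos = prev_pos
    raise KeyError(oid)  # no revision was current before `tid`
```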
The implementation for BDBFullStorage should be simple. There is a BTree that stores the serialno and tid of every existing revision of an object. A regular load() fetches the last serialno+tid and uses it to look up the actual data in another table. A loadNonCurrent() will search the first index to find the appropriate serialno+tid.
Since the goal of MVCC is to improve performance, we should have some clear metrics that can be used to gauge our progress.
For example, on a workload we care about, how common are conflict errors? We have anecdotal evidence that they are common in some scenarios. Example: Feed processing for a media site.
We expect that most MVCC requests will be for very recent revisions of objects -- often just the last one. We should look at specific applications and verify whether this is correct. The implementation may look different if we expect to see requests for many objects.
From: jbelmonte
Date: 2003/09/26 13:50 EST
Subject: eliminating read conflicts is *the* goal
Your goal is not clear. You have a heading "Goal: reduce read conflicts". Then you say in that section that the goal is to improve performance by reducing transaction retries. Then you state that read conflicts can be eliminated entirely. Later in "Benchmarks and measurement" you say again that the goal is to improve performance.
The correct goal is to eliminate read conflicts. There are always cases in which a system will fail without MVCC. For example, in a frequent-write application or under a heavy load, an application's maximum transaction retry count could be exceeded. Improved performance by avoiding transaction retries is a side effect. The real benefit is to relieve applications from considering read conflicts.
In other words, improving performance will not affect your ANSI level, while eliminating read conflicts will.

From: jeremy
Date: 2003/11/04 15:20 EST
Subject: eliminating read conflicts is a goal

The goal is to improve performance by eliminating read conflicts. The previous comments about read conflicts above are a little unclear. A read conflict is part of the mechanism that provides execution-time consistency: a transaction that gets a read conflict is restarted so that it does not see inconsistent state. Read conflicts and performance are closely linked. An application that gets a read conflict will need to retry the transaction, which hurts performance.
From: jbelmonte
Date: 2003/11/25 17:39 EST
Subject: goal

Yes, I did have a misunderstanding about consistency guarantees in ZODB when I wrote the previous comment. However, my opinion that your goal is not well stated in the article still stands: don't have a heading called "Goal: reduce read conflicts". Reducing read conflicts is not your goal, and in any case you aim to eliminate them, not reduce them.