Architecture For Second Prototype
There are four customer requirements that drive the architecture of the second prototype.
- If the primary server fails, Standby should automatically fail over to a replica.
- When failover occurs, ZEO clients must be redirected from the old primary to the new primary.
- In the absence of failures, the system can be reconfigured to swap a replica and the primary.
- The replicas can be used as read-only ZEO servers.
- There can be different groups of replicas within a single Standby storage system. Scenario: Standby storage servers run in three geographically separate locations. One location is primary: the primary server runs there and one or more replicas runs there keeping a full copy of the primary server. If the primary fails, it fails over to one of these replicas. Remote locations also have replicas, but these replicas do not take over as the primary if the primary fails.
- A replica that fails because of a partition should continue to operate as a read-only server and attempt to reconnect when communication becomes possible.
There are a bunch of loose ends to resolve to make this ready for serious testing.
- The Spread wrapper needs to be fleshed out. It doesn't handle issues like transitional membership messages.
- Test and write code to handle membership transitions. I haven't given this much thought, because I haven't had an environment where such changes can be tested.
- Figure out how to configure Spread to run without a single point of failure.
Grab-bag of issues raised by customer requirements:
- Integrate Spread with asyncore in some reasonable way. The Spread library was not designed to run asynchronously. (The library appears to be designed for a multi-threaded environment, which is a reasonable thing.) It does expose a poll() method, but it only checks to see if some data is available on the socket; it can't easily guarantee that enough data is available to prevent the library from blocking.
- We need to update ZODB storages to support recovery. The
changes are fairly simple, but we need to decide how to package
them. I presume they should go into some future release of
ZODB, even though they won't be as useful without Standby.
Minimal set of features appears to be: lastTransaction() method, extended iterator() that supports start and stop arguments.
- The ZEO reconfiguration layer may be somewhat complicated.
Each Standby installation will have one primary and one or more
replicas. Each ZEO client can be configured to use the primary
for read-write operations or a replica for read-only
operations.
How is this system administered?
If a new replica becomes available, the new config info should be propagated to clients. Otherwise, we'd need to restart each ZEO client to make it see that change.
The ZEO clients need to figure out what the new primary will be. (One possibility is try to connecting to each of the replicas until it finds one :-(.)
If a replica becomes the primary, can it still service read-only requests? How hard is it to maintain those connections will switching modes?
- The original Standby design said it was okay if a replica was a
few transactions behind. The primary would send out updates to
all the replicas, but would continue without waiting for
acknowledgement. If a catastrophic failure occurred, it was
considered acceptable for a small possible amount of lost data.
Is this still the case?
If there is automatic failover, it's possible that the system fails over to a replica that is not completely up-to-date. This breaks all sorts of expectations. The clients will think some set of transactions were committed and have cached data for them, but the new primary won't know anything about those transactions. Not sure how significant the problem is. It may be sufficient to flush cached copies of lost data.
- Implementation question: Are the new ZEO features implementable without finishing the ZEO2 project? It's primarily a question of pain. ZEO2 is much easier to extend, but it's not clear how hard it will be to make it production-ready.
- Fast recovery mode. Normal recovery copies all lost transactions. Fast recovery just copies the object revisions needed to have the most recent revision of every object; it leaves holes that would prevent undo. (Not sure how this would interact with Berkeley's refcounting stuff.)
Quick list of tasks:
Task Days
-------------------------+---------
Finish prototyp 1 | 1
Finish spread wrapper | 3
Spread configuration | 1
ZEO reconfiguration | 4
Read-only replica | 1
Administration | 2 (make it possible for people to
| administer the system)
Fast recovery | 2
Automatic failover | 2
Evaluation on testbed | 4
-------------------------+---------
Total | 20
This list is pretty rough because the plans are pretty vague for many tasks. It also assumes that bugs will be shallow and/or easy to debug. The work on the first prototype suggests that sometimes it's quite hard to get the end-cases right; one example is being robust in the face of failures that are hard to produce in a testing environment.
- bwarsaw (Dec 18, 2001 11:55 am; Comment #1)
- Re: Berkeley refcounting. Refcounts in Berkeley storage are all internal, so if the storage doesn't know anything about earlier transactions, they shouldn't interfere with reference counting. I think the important thing is that there won't be any dangling references; i.e. references to objects (in a pickle) that aren't stored in the database. If I understand the fast recovery correctly, this should be guaranteed.