home contents changes options help subscribe edit (external edit)

Standby Storage Proposal

This proposal was implemented by the commercial product, ZRS.

Contact

Jim Fulton jim@zope.com

Problem

The storage for a ZODB application is a single point of failure that limits the availability of the system. A catastrophic failure of the machine running the storage for an application can cause a prolonged outage and loss of data. Before the application can be restarted, the storage must be restored from the failed machine. If the storage is damaged, it must be restored from backups, which will not include recently committed changes.

This proposal aims to increase the availability of ZODB-based applications by reducing the mean time to repair in the case of catastrophic storage failures. Storage updates are replicated to one or more write-only replicas that can be put into service if the primary storage fails.

Approach

Our approach is to keep one or more write-only replicas of the storage available for use when the primary fails. The primary server is used by ZODB for all operations. (In a system using ZEO, the primary server is the ZEO storage server.) The primary server broadcasts all transactions to the replicas, which perform the transactions locally. In the event of a failure, each replica should have a complete, consistent copy of the storage that can be brought online as a new primary.

Replication occurs at the level of the storage interface. When a transaction commits, it makes several storage API calls, including store(), tpc_vote(), and tpc_finish(). Each storage participating in a replication group can run a different storage implementation, e.g. FileStorage, bsddb3Storage (known as Berkeley storage), or OracleStorage?.

There is a separate ZEOReplicatedStorage? proposal that uses replication to addresses a different facet of availability. That proposal proposes to use quorum replication to allow the storage to continue operation despite the failure of a minority of replicas. We believe that the approach here is simpler, though it has less effect on availability. The simplicity of the approach means it is attractive to implement first.

This approach poses several sub-problems:

  1. CommunicationPrimitives -- how is broadcast to slaves achieved
  2. BasicUpdateOperations -- replicating store(), tpc_finish(), etc.
  3. AdvancedUpdateOperations -- undo, pack
  4. FailureDetection? -- detecting when components fail, e.g. slaves
  5. RecoveryPage? -- restoring system to normal function after failure
  6. AdministrationPage -- managing system and notifying operator of status

Customer Requirements

Customer requirements will drive the next prototype implementation of Standby storage. In summary, the requirements are:

  • automatic reconfiguration when primary fails
  • ZEO changes to support automatic failover
  • support for changing master from one-site to other without failure
  • use replicas as read-only ZEO servers

These issues are discussed more in ArchitectureForSecondPrototype

Scope

Scope is the distinguishing feature of this proposal. Many of the hard and interesting problems are deferred to ZEOReplicatedStorage?. This proposal focuses on a narrow problem -- improving speed of recovery when a storage server fails catastrophically.

This proposal does not address the use of replication to support partially disconnected operations.

The proposal does not say anything about security or performance. This is intentional for the elaboration phase of the process. These issues may be addressed in later phases.

Risks

The proposal glosses over hard problems like recovery and failure detection. The prototype implementation may show these problems to be even harder than they appear. The risk is limited because the first phase of development should be completed quickly and provide a basis for evaluating the viability of the effort.

Related Work

ReplicatedFileStorage? was implemented by Toby Dickenson. It supports replication of FileStorage at the file level. It replaces built-in file objects and filesystem operations like rename() and remove() with versions that replicate the operations on files on remote machines. It uses multiple asynchat connections to replicas to communicate updates. It's not clear to me what guarantees ReplicatedFileStorage? provides for durability of updates or what performance it offers.

Oracle databases support a variety of replication solutions. The Data Guard product introduced in Oracle 9i is probably the most similar approach. The Data Guard keeps a physical copy of the database on a remote machine. As transactions are commited, the redo logs are shipped to the remote site.

MySQL has one-way replication starting with version 3.23.15. It sounds like this strategy is quite similar to the standby approach we have, as it uses a master server and a slave that replays logs. This replication approach allows the slave to perform read-only operations.

Deliverables

The proposal will be implemented in two phases. The first phase will deliver a fleshed out design and a prototype implementation. The second phase will deliver a full implementation and documentation. During the second phase of development, we will pay attention to performance testing and full-fledged integration tests. Unit tests will be developed during phase 1.

  • ElaborationPhase?

    Status

    • failure detection incomplete
    • testing incomplete

      It is important that testing be done so that future development can proceed without tedious manual test runs.

  • Phase 2 Implementation Plan (Construction)

    TBD

    Phase 2 will include a public release of the Spread wrapper in some form.


jim (Sep 13, 2001 7:06 pm; Comment #1)
This looks really good. A few comments.
  • This isn't Zope specific.
  • On message format, I assume that we can base this on the ZEO (v2) protocols you've been working on.
  • I agree that there is a special need to split messages that exceed 100K in length, but why do we have to worry about lost messages. Can't we rely on spread for reliable deliver. I guess you're just saying that we have to check.
  • I suggest that packs should not be replicated. That is, masters and slaves can be packed independently. Some slaves might even be packless.
  • Undo doesn't need special treatment if we only support transactional undo, so let's only support transactional undo. :)
  • The name for phase 1 is "elaboration" and the name for phase 2 is "construction". ;)
paul (Sep 17, 2001 11:30 am; Comment #2)
There's two things that I think you can add to scope. First, state that security (authenticating and/or encrypting the communications) is not in scope. People can get that through other means. Second, let people know this isn't intended to be an offline replication model like Notes or Exchange. It's clear enough for people that take the time to read it, but not everyone will, so it might save some grief to simply call that out more clearly.

Here's a question about scope: should performance be in scope? The primary goal on this proposal is survivability. At what point would the tradeoff in performance be severe enough that it gets in scope? What is the impact of adding the third or tenth replica, and should we even care?

jeremy (Sep 24, 2001 3:44 pm; Comment #3)
 > paul (Sep 17, 2001 11:30 am; Comment #2) --
 > Here's a question about scope: should performance be in scope? The primary
 > goal on this proposal is survivability. At what point would the
 > tradeoff in performance be severe enough that it gets in scope? What
 > is the impact of adding the third or tenth replica, and should we
 > even care?
 

There are performance goals, but they haven't been stated clearly. In part, this is because we have no reasonable and standard way to measure how storage affects ZODB application performance. The solution proposed here should not have a big effect on performance. (That is, the solution should be deemed a failure if it has a big effect.) I don't anticipate the number of replicas having much effect on performance, although the locations of the replicas could.

bwarsaw (Dec 18, 2001 12:04 pm; Comment #4)
We're banking quite a bit on the Spread toolkit. As we've seen w/ Berkeley storage there are numerous hidden risks that we need to try to minimize. What if Spread isn't mature enough to handle the task? What if there's something in particular about our application that tickles unsupported (or poorly supported) features of Spread? What if we need changes in Spread to support Standby? I don't think we can fully appreciate risks that we will encounter as we go forward with this task, but we should have contingency plans in case Spread isn't up to the task.
SmartZope? (Jan 22, 2002 5:12 pm; Comment #5)
At first I would focus on the failure detection mechanism, because the ability of this meachanism would give an hint about the necessary changes and needed abilities of the recovery mechanism. - What type of failures can occur? (disk crash, mounted filesystem on remote server -> server shutdown, disk full) - On what level will the error probably be raised? How can we catch it? - What time scales we are talking about? -> how will the depending processes behave? (Connections to databases like MySQL, Oracle etc.) Not every failure will cause at once a exception but may take a while...


comments:

Replication, considering the MySQL case --mtb, 2004/05/10 16:58 EST reply
MySQL is able to do replication somewhat more easily because of its binary transaction logging system. If the storage backend were to support the recording of transactions in a file or series of files, replication clients could pick up the transactions much as MySQL does and replay them in their own database systems. This would actually allow for online replication; multiple live Zope servers using the same data, with only one server allowing writes.



subject:
  ( 11 subscribers )