Status: IsDraftProposal

* A simpler version of this proposal was implemented at PyCon 2005. See ZODB/tests/multidb.txt for a tutorial doctest. *

Author

JimFulton

Problem

We sometimes want to use data from multiple databases. There are a number of reasons for this, including:

  • Spreading load across multiple database servers
  • Supporting multiple database management policies. For example, we might choose never to pack a content database but to frequently pack a database containing index data that changes frequently and can be computed from content.
  • Sharing data across organizations.

Using data from multiple databases requires managing multiple database connections. The use of multiple databases is often controlled in application code and driven from a "primary" connection: the application has an open connection, and it wants to open secondary connections and keep them open as long as it has the primary connection open. Managing the secondary connections can be tricky. We went through this with Zope database "mounting". Currently, the code to manage the secondary connections is in Zope and is sometimes (as in, currently ;) broken by ZODB changes. ZODB needs to provide low-level support that allows applications like Zope mounting to be implemented in a robust way.

For a long time, we've wanted to allow cross-database object references. Currently, an object reference consists of an object id (and possibly a cached object class, as an optimization). To allow cross-database references, we'd need to store a database name as well as an object id. When dereferencing such a reference, we'd need to get a database connection for the identified database. It's important that we always use the same database connection to make object references work properly. This adds an additional burden to connection management. It isn't enough to close secondary connections when a primary connection is closed. We need to ensure that the same secondary connection is used whenever a secondary connection is needed by a primary connection.

Proposal

I propose to define a new ZODB framework to make using multiple databases easier. A goal is to make this framework as simple as possible.

We define a multi-database. A multi-database is a collection of databases. It manages a mapping from database names to database objects.

ZODB DB objects will grow databases and __name__ attributes. If a database is part of a multi-database, then its databases attribute will be set to the multi-database's mapping from database names to databases and its __name__ attribute will be set to its name within the multi-database. All databases within a multi-database share the same mapping.
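
As a rough illustration of the shared mapping, here is a minimal sketch. ToyDB is a hypothetical stand-in invented for this example, not the real ZODB DB class, and the duplicate-name check is an assumption rather than something the proposal specifies.

```python
class ToyDB:
    """Hypothetical stand-in for a ZODB DB object (not the real API)."""

    def __init__(self, name, databases=None):
        # Every database in one multi-database holds the SAME mapping object.
        self.databases = {} if databases is None else databases
        self.__name__ = name
        if name in self.databases:
            # Assumed policy: names must be unique within a multi-database.
            raise ValueError("duplicate database name: %r" % name)
        self.databases[name] = self


shared = {}
content = ToyDB("content", shared)
index = ToyDB("index", shared)
```

Because both objects hold the same dictionary, either one can be used to find the other by name.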

Connections will grow a qualified_get method:

    def qualified_get(database_name, oid):
        """Get an object for a database name and object id
        """

Applications that want to look up an object in a particular database can walk up to any connection to a database in the multi-database and call qualified_get to get the object, given the database name and the object id.

The ZODB configuration schema will grow a multi-database section type to define multi-databases as a table of named databases.

Eventually (but maybe not in ZODB 3.3), it will be possible to store direct references from objects in one database to objects in another database within the same multi-database. Note, however, that changing database ids within the multi-database would break these references. These references will be managed as persistent weak references. They will not prevent referenced objects from being garbage collected. There are some details of the behavior of cross-database references that need to be worked out. I would prefer to do that in a separate proposal.

Implementation notes

Connections will grow an internal _connections attribute, which is a mapping from database name to connection. It will be initialized to None. We will refer to this mapping as a connection group, as it will hold information about a group of related connections. This mapping will be shared by all connections in the group.

For illustration purposes, here's a sample implementation of 'qualified_get':

    def qualified_get(self, database_name, oid):

        # _connections starts out as None, so the first lookup raises
        # AttributeError and the connection group is created here.
        try:
            connection = self._connections.get(database_name)
        except AttributeError:
            self._connections = {self._db.__name__: self}
            connection = self._connections.get(database_name)

        if connection is None:
            # No connection to this database in the group yet; open one.
            try:
                db = self._db.databases.get(database_name)
            except AttributeError:
                raise SingleDatabaseError("Not a multi database")
            if db is None:
                raise UnknownDatabaseError(database_name)
            connection = db.open()
            # Share the group mapping with the new connection.
            connection._connections = self._connections
            self._connections[database_name] = connection

        return connection.get(oid)

Here's a description of what's going on when qualified_get is called. We'll refer to the connection that qualified_get is called on as the primary connection. We'll call the connection corresponding to the given name the secondary connection. (Note that the primary and secondary connections could be the same.)

When qualified_get is called, the primary connection will:

  • Get a secondary connection corresponding to the given database name.
    • If the primary connection's _connections mapping hasn't been initialized yet, it will be created with a single key, the primary connection's database name, mapped to the primary connection.
    • If there isn't an entry in the connections map for the given database name, then an entry is created by opening a new secondary database connection.
  • Call get on the secondary connection to get the object.

Note that this is symmetric. Any connection can be a primary or secondary connection as far as qualified_get is concerned.
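
The symmetry can be seen in a small self-contained model. This is only a sketch: ToyDB and ToyConnection are hypothetical classes made up for the example, a plain dictionary stands in for storage, and a KeyError replaces the error classes used in the sample implementation.

```python
class ToyDB:
    """Hypothetical database: a name, the shared mapping, and some objects."""

    def __init__(self, name, databases, objects):
        self.__name__ = name
        self.databases = databases
        self.objects = objects  # oid -> object; stands in for real storage
        databases[name] = self

    def open(self):
        return ToyConnection(self)


class ToyConnection:
    def __init__(self, db):
        self._db = db
        self._connections = None  # connection group, created on first use

    def get(self, oid):
        return self._db.objects[oid]

    def qualified_get(self, database_name, oid):
        if self._connections is None:
            self._connections = {self._db.__name__: self}
        connection = self._connections.get(database_name)
        if connection is None:
            db = self._db.databases.get(database_name)
            if db is None:
                raise KeyError(database_name)
            connection = db.open()
            connection._connections = self._connections  # join the group
            self._connections[database_name] = connection
        return connection.get(oid)


databases = {}
content = ToyDB("content", databases, {1: "a document"})
index = ToyDB("index", databases, {7: "an index entry"})

primary = content.open()
entry = primary.qualified_get("index", 7)
secondary = primary._connections["index"]
# Asking the secondary for "content" leads back to the original primary.
document = secondary.qualified_get("content", 1)
```

Repeated calls reuse the connection already recorded in the group, which is what keeps object references consistent.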

From the point of view of an application, there will be a specific primary connection. This is the connection that is returned from a call to DB.open and this is the connection that will ultimately be closed by the application.

Note that secondary connections will never be closed directly by the application. They are under the control of primary connections. As proposed here, this isn't enforced, but will be a natural consequence of normal usage. Perhaps we should make this more explicit by actually defining a SecondaryConnection class that prevents closing and a separate DB method to create secondary connections.
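
One way such enforcement might look is sketched below. Everything here is hypothetical: the class names, the open_secondary method, and the _release hook are invented for illustration and are not part of ZODB or of this proposal.

```python
class ToyConnection:
    """Hypothetical minimal connection; only closing is modeled."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class ToySecondaryConnection(ToyConnection):
    """Refuses a direct close(); only its primary may release it."""

    def close(self):
        raise RuntimeError("close the primary connection instead")

    def _release(self):
        # Hypothetical hook for the primary connection to call.
        ToyConnection.close(self)


class ToyDB:
    def open(self):
        return ToyConnection()

    def open_secondary(self):
        # Hypothetical name for the separate DB method.
        return ToySecondaryConnection()


db = ToyDB()
secondary = db.open_secondary()
try:
    secondary.close()
except RuntimeError:
    refused = True
else:
    refused = False
```

The trade-off is an extra class and method in exchange for catching accidental closes early.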

A number of methods will be modified to delegate to other connections in a connection group. (Looking at the connection class, we'll probably do some refactoring as we do this.) For example, quite often we need to call _incrgc on connection caches to perform incremental garbage collection. We'll define a _incrgc method on connections:

    def _incrgc(self):
        if self._connections:
            for connection in self._connections.values():
                connection._cache.incrgc()
        else:
            self._cache.incrgc()

I won't go into more detail here. I think that this gives an idea of what will change.

Impact on Zope

Currently, Zope has a configuration mechanism for mapping Zope object-file-system paths to database objects. This would need to be changed in one (or both) of two ways:

Author's note: I don't understand how this configuration currently works, so I'm waving my hands a bit here.

  1. Change the mount-point configuration to use database names defined by a ZODB multi-database configuration, at least optionally. In other words, break the current configuration into two parts. The first part is provided by the new (proposed) ZODB multi-database definition. The second part is a Zope-provided definition of mounts, defined in terms of ZODB-defined databases.

    This approach (separating database definition from mounting information) has a number of advantages:

    • It's closer to the inspiration for the current configuration scheme: fstab. Unix's fstab defines how already defined file devices get mounted. Something else actually defines the devices.
    • It's more flexible. Databases can be defined for other applications beyond mounting. Conceivably, databases could be mounted in multiple locations, or databases can be unmounted without losing their definitions.
  2. Modify the existing configuration to generate a multi-database configuration internally. We may need to allow the current configuration style to be used for backward compatibility.

Zope's mounting code can be greatly simplified. Zope will no longer need to manage database connections. When traversing a mount point, the mount point object in the first database will simply use qualified_get to get the root object in the second database and then traverse that to the mount point, as it does now.
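
A mount point built this way might look roughly like the following sketch. ToyMountPoint and FakeConnection are hypothetical and only model the traversal; this is not Zope's mounting code. ROOT_OID is assumed here to be eight zero bytes.

```python
ROOT_OID = b"\0" * 8  # assumed root object id, for illustration only


class ToyMountPoint:
    """Hypothetical mount point: a database name plus a path from its root."""

    def __init__(self, database_name, path):
        self.database_name = database_name
        self.path = path

    def resolve(self, connection):
        # Any connection in the multi-database can be used here.
        obj = connection.qualified_get(self.database_name, ROOT_OID)
        for name in self.path:
            obj = obj[name]
        return obj


class FakeConnection:
    """Stand-in offering only qualified_get, backed by a dictionary."""

    def __init__(self, objects):
        self.objects = objects  # (database name, oid) -> object

    def qualified_get(self, database_name, oid):
        return self.objects[(database_name, oid)]


connection = FakeConnection(
    {("catalog", ROOT_OID): {"site": {"indexes": "mounted object"}}}
)
mount = ToyMountPoint("catalog", ["site", "indexes"])
mounted = mount.resolve(connection)
```

The mount point holds no connection of its own; it relies entirely on the connection it is traversed from.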

Notes

Perhaps the database __name__, when set, should be used for the connection sort key.

This proposal depends on SimplifyConnectionManagement


