Database generations
Status: IsProposal
Author
Problem
A database reflects an application schema. In a relational database, this is reflected directly in a table structure. In a ZODB-based database, the schema is reflected in the structures of the saved pickles.
Over time, an application's schema will change to reflect evolving requirements and design decisions. As the application evolves, its schema may become inconsistent with the schema stored in the database. In the case of ZODB databases, the application schema becomes inconsistent with the structure of stored objects. This inconsistency has to be resolved in some manner. There are a number of ways to do this, including:
- Discard existing databases.
This approach is really only (barely) practical during development. Even then, it can be inconvenient.
- Include the old schema as a subset of the new schema.
In this approach, the application must support old object structures as well as new ones. Python provides a number of facilities to ease this approach, making it fairly straightforward for minor schema changes. For example, addition of new object attributes can often be accommodated by providing default values or descriptors in a class that provide values on behalf of old objects.
- Provide run-time data conversion in custom __setstate__ methods.
Here we provide custom __setstate__ methods that recognize old state structures and convert them to new structures when an object is loaded from the database. The conversion is done every time the object is loaded from the database. The converted state isn't stored in the database unless the object is modified for some other reason and saved. This can be expensive and requires keeping code in the implementation to handle this conversion.
A variation on this above approach is to arrange for the object to be stored when it it is converted. This is problematic, because it can cause a transaction that would otherwise be a read transaction to become a write transaction, which is undesirable as a matter of principle. This approach can render a read-only database unusable.
- Provide conversion code that runs either when a database is opened or explicitly under the control of the user.
Options 2 and 3 have the significant disadvantage that they increase application complexity. Over time, the application would collect a mass of backward-compatibility code that would add significantly to application complexity.
Option 4 (providing conversion scripts) is the best approach in the long run. Unfortunately, we lack a good infrastructure for running such scripts. For Zope 3, we've used database-opened event subscribers to automate running scripts when a database is opened. This has worked somewhat well, but the approach has some problems:
- Conversion is always run automatically (implicitly) when the database is opened. This has worked well at this stage in Zope 3's development. It won't work so well with ZEO in heavy use. In a ZEO environment, you need better control over database conversions to prevent race conditions or inconsistent conversions across database clients.
- Determining whether a conversion is needed is ad hoc. The conversion scripts are run every time a database is opened. It is up to the conversion scripts to decide how much work to do depending on what they find in the database.
- As with run-time conversion, it is potentially necessary to run
every existing conversion script if a database is very old. If
we want to support old databases, we may need to keep conversion
scripts around for a long time. We want to manage these
scripts in a way that separates them from the main application
code to keep the application code clean.
In my opinion, supporting old databases is important. People who deploy Zope need flexibility over when they upgrade software. We should endeavor not to force our customers to upgrade their software for our convenience.
- It is difficult to assess the state of a database to determine needed upgrades. The manager of an application should have tools to help them to decide a database's status and determine whether conversion is needed.
Proposal
I propose to add the notion of schema generations. A schema generation is essentially an identifier for an application schema. It consists of an application identifier and an integer generation number. When an application evolves its schema, it will increment its schema generation. Application identifiers are necessary for systems like Zope that support multiple applications simultaneously In addition, we will introduce the application schema manager:
class ISchemaManager(zope.interface.Interface):
"""Manage schema evolution for an application."""
minimum_generation = zope.interface.Attribute(
"Minimum supported schema generation")
generation = zope.interface.Attribute(
"Current schema generation")
def evolve(context, generation):
"""Evolve a database to the given schema generation.
The database should be assumed to be at the schema
generation one less than the given generation
argument. In other words, the evolve method is only
required to make one evolutionary step.
The context argument has a connection attribute,
providing a database connection to be used to change
the database. A context argument is passed rather than
a connection to make it possible to provide additional
information later, if it becomes necessary.
"""
Each application that wants to support schema evolution will need to provide a schema manager and register it using an application identifier. In Zope 3, schema managers will be registered as utilities, using the application identifier as the utility name. For example, Zope will register a schema manager for the base Zope software using the identifier, "zope.app". An add-on package will use a different identifier, based on the package's underlying Python package name.
Minimum and current generation numbers provide a means for an application to provide its customers flexibility over database evolution. Database evolution need not be implicit (automatic). For example, suppose I'm using a wiki package, SupeWiki. Perhaps I've previously installed it and I have the current database schema for SupeWiki, number 3. Now, suppose I get a new version of SupeWiki that has a current generation of 6 but a minimum generation of 2. (I didn't bother installing previous releases of SupeWiki that had current generations of 4 and 5.) For now, I can use the new SupeWiki release without upgrading my database schema. That's a good thing, because I'm running ZEO and don't enable automatic migration on startup. Rather, I can go to the process-management UI and see a page that:
- Lists the defined application schema.
- For each schema, shows the application minimum and current schema and shows the current database schema for that application.
- For every application schema for which the database is not current, a button is provided for updating the database to the current schema.
- If any part of the database isn't current for any schemas, then a button is provided at the bottom of the page to update all of the schemas.
With this mechanism, customers have control over when updates are done. Note that this represents a combination of the approaches listed earlier. To support old schemas, the application will need to use many of the tricks described in options 2 and 3. But support for old schemas can be phased out in an orderly fashion over time.
When evolving a database schema, if an error occurs, we will apply the updates we can. In the example above, suppose there was an error while evolving the database from SupeWiki generation 5 to generation 6. We will still perform the updates to generation 5.
We will provide software for managing evolution. We'll provide a class that implements ISchemaManager. This schema manager will provide support for generation definition and management of evolution scripts via modules:
class SchemaManager(min_generation=0, generation=0, package_name=None):
"""Implement ISchemaManager.
The given generation must be greater than or equal to the
minimum generation. Both generations must be non-negative
integers.
If the generation is non-zero, a package name must be
provided. The package name names a package containing
modules named "evolven", where n is a generation number to
evolve to. Each module must provide an evolve function:
def evolve(context)
The evolve function in module "evolven" is responsible for
migrating a database from generation n-1 to generation n.
The evolve functions must not commit or abort top-level
transactions, although they can use subtransactions.
"""
Applications can use this class to provide their own ISchemaManager implementations. Of course, they are free to use alternative implementations.
In addition, we'll provide software to a database's generations. Tis software will store curent database generations in a item of the database root. The software will manage update of database generations based on site-configured policies. In Zope 3, this will take the following form:
- Schema managers will be registered as named utilities.
- There will be a database-opened subscriber that managed database
generations. It will scan the schema managers and make sure that a
database generation is recorded for all of them, If there is no
recorded database generation, then the schema managers current
generation will be assumed. In the initial version of this
subscriber, we'll automatically update the database to the curent
generation for each application.
Later, we'll provide an alternate subscriber that:
- Doesn't apply any updates on startup
- Issues warnings if generations aren't current.
- Generates errors is the database isn't at a minimum generation.
Then we'll provide a UI in the process-management area of the site to show generation status and apply updates, as described earlier.
A possible variation is to automatically select one or the orther os thse strategoes depending on whether ZEO is being used.
Databases can only be evolved from known schema generations. For this reason, it's important that applications provide schema managers for generation 0. This will cause the initial generation to be recorded, so that evolution can be performed later. This is an issue for Zope itself. To benefit from this mechanism, it's important that databases are updated to a version of Zope with a schema manager as soon as possible. It's unlikely that existing old Zope 3 databases will be readily upgradeable to a state with a known generation. Once the system is in place, we may decide only to support databases created after the system was put in place. That is, we may want to pick a point in the near future where we make a "clean start". At this point, we will remove existing backward-compatibility hacks and rely on the generation system to provide orderly migration for the future.
Risks
This system is new. It should be viewed as experimental. Hopefully, it is flexible enough to accomodate changes based on early experience and evolving needs.
variation on mechanism 3 --k_vertigo, 2004/04/18 03:06 EST reply
in addition to __setstate__ methods at additional cost applications
can use __getstate__ methods to transparently perform upgrades, useful
implementations of this are generally backed by some additional
framework based on storing version numbers on instance creation, and
upon load look for appropriate upgrade method/function to take the
instance to the latest class/app version (possibly chaining upgrade
methods).
simplifying schema managers... --k_vertigo, 2004/04/18 03:29 EST reply
perhaps there should be additional framework beyond the base schema
managers, ie. a DatabaseSchemaManager? which would be responsibile as a
container/collection of application schema managers, and would serve
the purpose of providing uniform inspection to application schemas and
their state. there is some open question here, not answered by the
interface of how to determine the current generation of the
application data.
- JimFulton's response
- Such a framework was implicit in the proposal. I've added some text to make it more explicit.
additionally i think that basically handing the schema manager an open database connection and telling it to perform upgrade puts an awful lot of responsiblity and work on the schema manager. because each must not only implement db walking but also instance upgrading. i think it might be possible to have it reasonably simplified by having some of the required database walking/trawling functionality located in the DataSchemaManager?, such that upon loading a instance during its db walk, it could identify the class and based on the class python path, find the most appropriate app schema manager to handle it by matching python paths, iff the instance needs upgrading. hmm.. perhaps a better alternative is to provide a standard set of lazy walking classes/functions for use by schema managers.
- JimFulton
- In the fix-up scripts I've written, the "walking"
approaches have been highly application specific. I don't see a
sane general approach that would have helped me.
After a number of evolution scripts hve been written, I'm sure some patterns will emerge and we can identify useful libraris that make writing evolution scripts easier.
One such library is likely to be some sort of database iterator. I think we'll need something like this. As you suggest at the end, we can provide this as a library.
Unique application identifiers? --klm, 2004/04/19 10:48 EST reply
This seems like yet another reason that we need to make it easy for applications to have unique names. Is this already provided for, or are there plans? The obvious thing is to use the java convention where an application is prefixed by a reversed domain name of the authority for the application. (Domain names are generally retained with an organization's assets, including applications.) Could be nice to have the Products section of the control panel (is there an analog in zope3?) divided into com, org, net, gov, etc., and then subdivided into com.zope, com.infrae, com.plone, etc.
Unique application identifiers? --jim, 2004/04/19 17:30 EST reply
This seems like yet another reason that we need to make it easy for applications to have unique names. Is this already provided for, or are there plans?
Yes. Zope 3 has a notion of "identifiers", which may be either:
- Dotted python package names
- URIs?
Both of these come from well-structured namespaces.
Normally, an application will use it's package name as it's identifier.
evolve_%05d.py ? --bwarsaw, 2004/04/22 15:07 EST reply
minor nit: i'm not crazy about evolveN.py. it makes directory listings less useful. how about evolve_%05d.py? okay, maybe 5 is too big, so pick something > 1.
