An alternative object caching strategy for ZODB: object classification
| Authors: | Christian Theune <ct@gocept.com> |
|---|---|
| Status: | Draft |
Current state
Every ZODB connection maintains a cache that keeps objects in memory to avoid unnecessary IO work.
All objects that are loaded within a transaction are put into the cache and are evicted from the cache at certain points (e.g. transaction boundaries) when cache minimization is performed. The currently implemented strategy for evicting objects is least recently used (LRU).
The size of the object cache is limited by the number of objects. This size is configurable per database.
Problems
Large applications have different usage patterns on objects so that the LRU strategy does not always fit those patterns effectively. Examples are:
- Applications with a variety of smaller and larger objects. The classical example of files stored as pdata constructs defeats the LRU caching strategy and potentially evicts heavily used (or even all) objects from the cache just to allow loading many smaller, infrequently used objects.
- Catalogs: Light-weight objects like brains might benefit from a higher priority of staying cached in comparison to other objects.
Historical solutions
To allow different cache settings, developers resorted to splitting their database into multiple smaller databases. Each database is then responsible for maintaining its own cache for the objects stored in it. An example of this is creating one database for catalogs and another for the actual application objects.
Creating separate databases for separating caches is a non-ideal solution as the storage layout becomes dependent on the caching strategies and blocks the use of separate databases for other, potentially orthogonal, uses.
Proposed solution
I propose to:
- modify the ZODB to allow different cache implementations to be used, and
- implement a cache that performs an object classification using a configurable strategy
Modify the ZODB to support different cache implementations
The current ZODB connection class hard-codes the instantiation of the object cache, referring to the cPickleCache.PickleCache class.
To make this more flexible, we need to:
- Describe the cache interface formally.
- Allow the cache class to be passed to the DB class through the constructor.
- Allow the cache class to be passed to the Connection class through the constructor.
- Extend the ZConfig support so that a cache implementation can be specified through config files. The proposed syntax is:
    <zodb>
      <somestorage>
        ...
      </somestorage>
      <somecache>
      </somecache>
    </zodb>

For this we need to define a new datatype Cache that can optionally be specified once in a database section. If no cache is specified, the normal PickleCache is used.
The configuration of the pickle cache would be:
    <zodb>
      <somestorage>
        ...
      </somestorage>
      <picklecache>
        size 5000
      </picklecache>
    </zodb>

The existing spelling for specifying the cache size would be used in the case that no cache section is given.
Note: For backwards compatibility, all values are defined with defaults that reflect the current implementation.
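The constructor changes listed above could be sketched roughly as follows. This is a simplified model, not ZODB's actual code: `SimpleCache`, the `cache_class` parameter, the `(jar, size)` cache signature and the default of 400 are all assumptions made for illustration.

```python
class SimpleCache:
    """Stand-in for the default cache (hypothetical, not the real PickleCache)."""

    def __init__(self, jar, size):
        self.jar = jar    # the connection that owns this cache
        self.size = size  # target number of objects
        self.data = {}


class Connection:
    """Simplified connection that receives its cache class as a parameter."""

    def __init__(self, db, cache_size=400, cache_class=SimpleCache):
        self.db = db
        # The cache is built from the class passed in instead of a
        # hard-coded one.
        self.cache = cache_class(self, cache_size)


class DB:
    """Simplified database that hands its cache class to new connections."""

    def __init__(self, storage, cache_size=400, cache_class=SimpleCache):
        self.storage = storage
        self.cache_size = cache_size
        self.cache_class = cache_class

    def open(self):
        # Every connection opened from this database gets the same kind of cache.
        return Connection(self, self.cache_size, self.cache_class)
```

With defaults like these, existing callers that do not pass a cache class keep the current behaviour, which is the backwards-compatibility property the note above asks for.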
Implement an object classification cache
The existing implementation of the object cache stores all objects that are loaded up to a given maximum number of objects. The cache size may exceed this size temporarily until cache minimisation is performed.
To support application-specific usage patterns of objects and to optimize loading and evicting objects, I propose to implement a cache that uses a configurable strategy to classify objects and maintains a separate PickleCache for each resulting class.
Storing objects in the cache
- When an object is stored, we first classify the object.
- If the class name is None, the object is not cached.
- Otherwise, we maintain a mapping of class names to PickleCaches and store the object in the PickleCache for the associated class. If a class is used for the first time, we create a new PickleCache for it.
Minimizing the cache
When cache minimization is triggered, the PickleCache of every class is minimized.
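The storing and minimizing steps described above can be illustrated with a small model. It is only a sketch under stated assumptions: `LRUCache` is a toy count-limited cache standing in for PickleCache, and the names `ClassifyingCache`, `classify`, `default_size` and `sizes` are hypothetical. The `get` method is also an assumption, since the proposal does not say how lookups find the right per-class cache.

```python
from collections import OrderedDict

_missing = object()  # sentinel so that a cached None value is still a hit


class LRUCache:
    """Toy count-limited LRU cache (assumed stand-in for PickleCache)."""

    def __init__(self, size):
        self.size = size
        self.data = OrderedDict()

    def __setitem__(self, oid, obj):
        self.data[oid] = obj
        self.data.move_to_end(oid)  # most recently used entries go last

    def get(self, oid, default=None):
        if oid in self.data:
            self.data.move_to_end(oid)
            return self.data[oid]
        return default

    def minimize(self):
        # Evict least recently used entries until the target size is met.
        while len(self.data) > self.size:
            self.data.popitem(last=False)


class ClassifyingCache:
    """Routes each object to a per-class cache chosen by a classifier."""

    def __init__(self, classify, default_size=5000, sizes=None):
        self.classify = classify      # callable: object -> class name or None
        self.default_size = default_size
        self.sizes = dict(sizes or {})
        self.caches = {}              # class name -> LRUCache

    def __setitem__(self, oid, obj):
        name = self.classify(obj)
        if name is None:
            return                    # objects without a class are not cached
        cache = self.caches.get(name)
        if cache is None:
            # First use of this class: create its cache on demand.
            size = self.sizes.get(name, self.default_size)
            cache = self.caches[name] = LRUCache(size)
        cache[oid] = obj

    def get(self, oid, default=None):
        # The class of an oid is not known up front, so every cache is asked.
        for cache in self.caches.values():
            obj = cache.get(oid, _missing)
            if obj is not _missing:
                return obj
        return default

    def minimize(self):
        for cache in self.caches.values():
            cache.minimize()
```

Asking every per-class cache on lookup keeps the sketch short. An implementation with many classes would probably want an additional oid-to-class index to avoid the linear scan.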
The object classification API
Application developers can implement different classifications using the IClassification interface:

    import zope.interface

    class IClassification(zope.interface.Interface):

        def classify(obj):
            """Return the name of a cache for this object.

            Return None if the object should not be cached.
            """

Configuration
The classification cache has the following configuration parameters:
- The classification implementation
- The default size of PickleCaches (the default for the default is 5000)
- The sizes for specified classes
In ZConfig the cache can be configured like this:

    %import classifyingcache

    <zodb>
      <filestorage>
        ...
      </filestorage>
      <classifyingcache>
        strategy zodb.classifyingcache.default.DefaultStrategy
        default-size 5000
        <classes>
          brains 20000
          content 20000
          others 5000
        </classes>
      </classifyingcache>
    </zodb>
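A strategy that produces the class names used in the configuration above might look like the following sketch. `TypeNameStrategy` and the `Brain`, `Document` and `Pdata` stand-in types are hypothetical and are not the `DefaultStrategy` named in the configuration. A real implementation would presumably also declare that it provides IClassification through zope.interface, which is left out here to keep the example dependency-free.

```python
class TypeNameStrategy:
    """Hypothetical strategy: classify objects by their Python type name."""

    def __init__(self, mapping, uncached=(), fallback="others"):
        self.mapping = dict(mapping)         # type name -> cache class name
        self.uncached = frozenset(uncached)  # type names that are never cached
        self.fallback = fallback             # class for all remaining objects

    def classify(self, obj):
        type_name = type(obj).__name__
        if type_name in self.uncached:
            return None                      # None means "do not cache"
        return self.mapping.get(type_name, self.fallback)


# Stand-in types for illustration only.
class Brain:
    pass


class Document:
    pass


class Pdata:
    pass


strategy = TypeNameStrategy(
    {"Brain": "brains", "Document": "content"},
    uncached={"Pdata"},
)
```

Matching on type names keeps classification cheap, which matters because `classify` runs for every object that is stored in the cache.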
Risks
- Storing objects in the caches and loading them again requires more computational overhead. Complex classification strategies might become unforeseen bottlenecks.
- Increasing complexity might make the system more fragile.
TODO
- Determine package name for the classifying cache. Optionally think about making this a core functionality.
- Provide default classification implementations that fit a few regularly used patterns: a Zope 2 database, a Zope 2/CMF database, ... ?
- Prepare an analysis strategy to use a live site to compare the caches' efficiency. The analysis should result in a comparison of the number of loads that are performed using the different caches. (Using cacti or some other graphing tool; isn't there some information source within Zope for this anyway?)
Future enhancements
- Allow the classes to use different cache implementations themselves, e.g. one that uses an eviction strategy other than LRU.
- Provide a monitoring tool to show the status of the cache in detail for every class (objects in class, size of class), maybe including the details of how many objects of which Python class are in each cache class. (Possibly this is just an extension to getCacheDetail and getCacheExtremeDetail.)
- Provide a monitoring tool that logs object loads, stores and releases. (Stating the time, the action, the oid, the object class and the cache class). (Eviction might be pretty hard as it happens invisibly for us.)
Sensible defaults --tseaver, Mon, 20 Aug 2007 17:43:01 +0000
- Pdata objects should never be cached, although the root "file" objects can be.
- BTree-related objects (buckets, sets, trees) and brains objects should be cached in the same class.
Configuration coherence --tseaver, Mon, 20 Aug 2007 17:49:59 +0000
There needs to be an easy way to specify classifications in the same place that the cache size is set. Either:
Define the classification model using ZConfig, something like:

    <classifyingcache>
      <class "catalog">
      </class>
      <class "pdata">
... --dmaurer, Tue, 21 Aug 2007 18:41:59 +0000
There are two competing proposals for a new interface between persistent and caching: (part of) MemorySizeLimitedCache and DecouplePersistenceDatabaseAndCache.
I have implemented the first one in our (local) Zope and used it to implement a common cache for all ZODB databases (we mount hundreds of databases, and giving each of them a dedicated cache is very inefficient). I still have a SIGSEGV during finalization when the common cache is enabled, but I expect that the problem is in the common cache implementation and not in the cache/persistent interface.
My "common cache" use case is probably not covered by your proposal. The cache constructor needs additional parameters to learn about a common cache to be used (from a parent connection).
... --dmaurer, Tue, 21 Aug 2007 18:53:06 +0000
I expect that "should not be cached" is difficult to realize. The cache has two tasks: to increase access efficiency by caching heavily used objects, and to ensure referential transparency. The second task requires that all objects referenced from the application be in the cache, maybe ghosted. Of course, we cannot ghost an object that should not be cached as soon as it is loaded, because it was loaded for some purpose. We can work around this if we combine loading with "Used" (see DecouplePersistenceDatabaseAndCache). Then, the corresponding "Unused" can be used to ghost objects again that should not be cached. However, we must be aware that any attribute access will then cause a load followed by a ghosting, so such objects will be very expensive ...