Author
JimFulton
Status
IsDraftProposal
Problem
The current persistence architecture suffers from a number of problems:
- The persistence, cache, and database frameworks are very tightly coupled. This is particularly acute for the persistence and cache frameworks, whose implementations are so intertwined that it is extremely difficult to innovate in cache designs, as has been pointed out in MemorySizeLimitedCache.
- It is impossible to mix the Persistent base class with types that define C data structures, because Python cannot combine two C base types that each define data in C. This has prevented or complicated C implementation of a number of types (e.g. in zope.interface).
- The persistence API intrudes into application APIs, requiring name-space tricks ("_p_" and "_v_" prefixes) to avoid name conflicts. This isn't a very serious problem, but experience with proxy frameworks suggests that functional frameworks can be employed to avoid name-space pollution.
Fundamentally, the persistence framework is about events. Databases and caches need to be notified of certain events:
- object access
- object change
- object destruction
- C code use
When C code accesses a persistent object, it notifies the persistence system of its use and non-use to prevent an object's data from disappearing unexpectedly.
- Conversion of objects to ghosts.
Of course, the framework also has to provide facilities for object creation and re-creation, and state access and manipulation.
Python already contains a framework for observing object destruction, the weakref framework. Objects that support weakrefs, as most objects do, allow any number of observers to be registered and notified when an object is destroyed. We could leverage this framework to track other events too.
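As a minimal sketch of the mechanism just described, the standard weakref module lets several independent observers watch one object, and each is called back when the object is destroyed. The Document class and the observer names are hypothetical, and the immediate callback on del relies on CPython reference counting.

```python
import gc
import weakref


class Document(object):
    """Hypothetical application class; plain instances support weakrefs."""


destroyed = []  # records which observers were notified


def make_observer(name):
    # Each observer is an ordinary callback that receives the dead weakref.
    def callback(wref):
        destroyed.append(name)
    return callback


doc = Document()
# Any number of observers may be registered on the same object.
cache_ref = weakref.ref(doc, make_observer("cache"))
database_ref = weakref.ref(doc, make_observer("database"))

del doc        # drop the only strong reference
gc.collect()   # not needed on CPython, but harmless elsewhere
```

Both callbacks fire without Document knowing anything about its observers, which is the property the proposal wants to reuse for the other persistence events.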
Proposal
I propose to leverage weakrefs to track persistence-relevant events and to provide access to persistent information. Cache implementations will define persistent observers. These observers will be weakref objects, instances of subclasses of weakref.ref. Persistent observers will take over responsibility for storing persistence-relevant data for an object.
The responsibilities of the Persistent base class will be radically reduced:
- Persistent will no longer implement management of persistent data.
- Persistent will no longer manage persistent-object state changes.
- Its responsibilities will be mostly limited to notifying persistent observers of persistence-relevant events.
The Persistent base class will continue to provide methods _p_deactivate and _p_invalidate to allow subclasses to override these, when necessary.
There will be a PersistentObserver base class that will act as a marker. The Persistent implementation will find an object's persistent observer (if any) by searching its weak references for an instance of PersistentObserver.
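The lookup could work roughly as follows. This is only a sketch: the helper name _find_observer and the Account class are made up for illustration, and a real implementation would presumably do the search in C.

```python
import weakref


class PersistentObserver(weakref.ref):
    """Marker base class; a cache would subclass it (assumed layout)."""


class Persistent(object):
    """Illustrative mix-in; the helper name below is hypothetical."""

    def _find_observer(self):
        # Search the object's weak references for the marker type.
        for r in weakref.getweakrefs(self):
            if isinstance(r, PersistentObserver):
                return r
        return None


class Account(Persistent):
    pass


account = Account()
plain_ref = weakref.ref(account)          # an unrelated weak reference
before = account._find_observer()         # no observer registered yet
observer = PersistentObserver(account)    # the cache attaches its observer
found = account._find_observer()
```

Unrelated weak references are skipped because they are not instances of the marker class, so ordinary weakref users and the persistence system can coexist on one object.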
Persistent observers will manage persistent data including:
- object id
- object's data manager
- object serial
- object use count
- object state
- cache-defined data
Persistent observers will provide an API to be used by persistent objects to notify observers of events. We will shamelessly use a hack to make this go fast. :) Persistent observers will implement mapping __setitem__ taking an event object as the key and ignoring the value. The key will be required to be one of a standard set of events:
- Accessed
- Changed
- Used
- Unused
- Ghostified
The Used event indicates that an object is being used. An object that is being used must not be deactivated or invalidated. An object may be used multiple times, so the persistent observer has to keep a use count. A Used event increments the use count and an Unused event decrements it. (Note that the use count replaces the internal sticky state for which there are probably latent bugs that should be dealt with by a separate proposal.)
The Ghostified event indicates that the object's state has been released and that it has entered the ghost state. This is necessary because persistent objects are responsible for becoming ghosts and can control whether to become a ghost. It is, therefore, the persistent object's responsibility to notify its observer when it actually becomes a ghost.
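The event protocol and the use count might look like the sketch below. It assumes events are passed as plain strings (the proposal only names them), and UseCountingObserver, Record and may_deactivate are hypothetical names. The point of the item-assignment hack is presumably that C code can deliver an event through the mapping slot without an attribute lookup.

```python
import weakref


class UseCountingObserver(weakref.ref):
    """Hypothetical observer that tracks only the use count and ghost flag."""

    def __init__(self, obj):
        super().__init__(obj)
        self.used = 0
        self.ghost = False

    def __setitem__(self, event, ignored):
        # The event travels as the key; the assigned value is ignored.
        if event == "Used":
            self.used += 1
        elif event == "Unused":
            self.used -= 1
        elif event == "Ghostified":
            self.ghost = True

    def may_deactivate(self):
        # An object in use must not be deactivated or invalidated.
        return self.used == 0


class Record(object):
    """Hypothetical persistent object."""


record = Record()
observer = UseCountingObserver(record)

observer["Used"] = None      # first user, e.g. a C extension
observer["Used"] = None      # second, nested user
observer["Unused"] = None
protected = not observer.may_deactivate()   # one user remains

observer["Unused"] = None
released = observer.may_deactivate()        # no users remain
```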
Persistent observers extend the weakref API with the following attributes and method:
- oid
- The persistent object id. This attribute should only be assigned by the data manager.
- serial
- The object serial, which is None for a ghost. This attribute should only be assigned by the data manager.
- manager
- The data manager. This is a read-only attribute.
- used
- The object's use count. This is a read-only attribute.
- state
- The object's persistent state, which must be one of:
- persistent.GhostState
- The object is a ghost. This variable will have the value None.
- persistent.ChangedState
- The object has been modified. The value will have a true boolean value.
- persistent.SavedState
- The object is not a ghost and hasn't been modified. The data-manager's "accessed" method should be called if the object has been accessed. The value will have a false boolean value.
- persistent.ReadState
- The object is not a ghost and hasn't been modified. The data-manager's "accessed" method should not be called if the object has been accessed. The value will have a false boolean value.
The state attribute will be mostly controlled by the data manager. The exception is that the observer will set the state to the ghost state in response to a Ghostified event.
- __setitem__(event, ignored)
- Notify the observer of an event.
If the event is Changed and the object is in the saved or read state, then call the register method on the data manager, passing the persistent object.
If the event is Accessed and the object is in the saved state, then call the accessed method on the data manager, passing the persistent object.
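The notification rules above can be exercised with a small sketch. It assumes string event names and uses a hypothetical RecordingManager in place of a real data manager. The saved and read states are modelled as two distinct objects with a false boolean value so that they can be told apart.

```python
import weakref


class _Falsy(object):
    """Distinct state marker with a false boolean value."""

    def __init__(self, name):
        self.name = name

    def __bool__(self):
        return False

    def __repr__(self):
        return self.name


GhostState = None
ChangedState = True
SavedState = _Falsy("SavedState")
ReadState = _Falsy("ReadState")


class RecordingManager(object):
    """Hypothetical data manager that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def register(self, obj):
        self.calls.append("register")

    def accessed(self, obj):
        self.calls.append("accessed")


class Observer(weakref.ref):
    def __setitem__(self, event, ignored):
        if event == "Changed" and self.state in (SavedState, ReadState):
            self.manager.register(self())
        elif event == "Accessed" and self.state is SavedState:
            self.manager.accessed(self())
        elif event == "Ghostified":
            self.state = GhostState


class Item(object):
    """Hypothetical persistent object."""


item = Item()
manager = RecordingManager()
observer = Observer(item)
observer.manager = manager      # assigned by the data manager
observer.state = SavedState

observer["Accessed"] = None     # saved state: accessed() is called
observer["Changed"] = None      # saved state: register() is called
observer.state = ReadState
observer["Accessed"] = None     # read state: no call
observer["Ghostified"] = None   # observer moves itself to the ghost state
```

A real data manager would presumably also move the object to the changed state inside register; the recording stand-in leaves the state alone so that each rule can be seen in isolation.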
The persistent attributes _p_jar, _p_oid, _p_changed, _p_serial, and _p_mtime will still be supported but will be deprecated for a long deprecation period.
The persistent module will provide some new API functions for getting at persistence-relevant information:
- oid(object)
- Return an object's object id, if any.
- manager(object)
- Return an object's data manager, if any.
- state(object)
- Return an object's persistent state.
- observer(object)
- Return an object's persistent observer, if any.
- changed(object)
- If an object is in the saved or read state, move it to the modified state, or otherwise, do nothing. This function will be the preferred way to tell the persistence system that an object has changed in cases where the persistence system cannot detect a change automatically.
- unchanged(object)
- If an object is in the changed state, move it to the saved state, or otherwise, do nothing. This function will be used in those very rare situations in which the persistence system would determine that an object has changed when it should not.
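A rough sketch of how changed and unchanged might be used follows. It assumes the functions set the observer's state directly, which the proposal does not spell out, and the Inventory class is hypothetical.

```python
import weakref


class _Falsy(object):
    """Distinct state marker with a false boolean value."""

    def __bool__(self):
        return False


ChangedState = True
SavedState = _Falsy()
ReadState = _Falsy()


class PersistentObserver(weakref.ref):
    def __init__(self, obj):
        super().__init__(obj)
        self.state = SavedState


def observer(obj):
    """Return the object's persistent observer, if any."""
    for r in weakref.getweakrefs(obj):
        if isinstance(r, PersistentObserver):
            return r
    return None


def changed(obj):
    po = observer(obj)
    if po is not None and po.state in (SavedState, ReadState):
        po.state = ChangedState


def unchanged(obj):
    po = observer(obj)
    if po is not None and po.state is ChangedState:
        po.state = SavedState


class Inventory(object):
    """Hypothetical persistent object holding a plain, non-persistent dict."""

    def __init__(self):
        self.items = {}


inventory = Inventory()
po = PersistentObserver(inventory)

# Mutating the dict in place sets no attribute on the inventory itself,
# so the persistence system has nothing to detect.
inventory.items["widget"] = 3
changed(inventory)
state_after_change = po.state

unchanged(inventory)
state_after_revert = po.state
```

Because the functions live in the persistent module rather than on the object, they add no names to the application's own API.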
Data managers will have a method, mtime, that returns an object's modification time.
Why weakrefs?
Most Python objects already support weakrefs. This allows us to associate persistence-related data with an object without affecting its data structure. This makes the Persistent base class a pure mix-in class that can be combined with other custom base classes. This will allow us to more easily make types persistent and will allow us to stop special casing persistent classes.
The existing persistent-object cache is very similar to a weak-value dictionary, but avoids the overhead of weak references by making persistent objects responsible for calling back to the object cache when an object is destroyed. This code is fairly complex and brittle. By leveraging the existing weakref framework, we'll be able to greatly simplify the cache implementation and increase its reliability.
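For comparison, here is how little code a weak-value cache needs when it is built on the standard library. This is a sketch only: the oid string and the Customer class are invented, and a real cache would also have to manage ghosts and size limits.

```python
import gc
import weakref


class Customer(object):
    """Hypothetical persistent object."""

    def __init__(self, name):
        self.name = name


# Maps oid -> live object without keeping the object alive.
cache = weakref.WeakValueDictionary()

alice = Customer("alice")
cache["oid-1"] = alice
cached_while_alive = "oid-1" in cache

del alice       # the application drops its last reference
gc.collect()    # not needed on CPython, but harmless elsewhere
cached_after_release = "oid-1" in cache
```

The entry disappears on its own once the object is destroyed, so the object never has to call back into the cache.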
Risks
The proposed change will increase the per-object memory usage because of the introduction of observer objects. The increase is the per-object overhead, about 5 words (20 bytes on a 32-bit platform).
[I don't understand this calculation - there is an additional overhead of an entire PersistentObserver instance, so I think I'm missing something.] -JeffRush
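One way to get a feel for the overhead on a given interpreter is to measure it. The sketch below makes no claim about the actual numbers, which depend on the Python version and platform, and ObserverRef is a hypothetical stand-in for a real observer class.

```python
import sys
import weakref


class Target(object):
    """Hypothetical persistent object."""


class ObserverRef(weakref.ref):
    """Hypothetical observer carrying the attributes the proposal lists."""

    def __init__(self, obj):
        super().__init__(obj)
        self.oid = None
        self.serial = None
        self.manager = None
        self.used = 0
        self.state = None


target = Target()
plain = weakref.ref(target)
observer = ObserverRef(target)

plain_size = sys.getsizeof(plain)
# getsizeof does not follow references, so add the attribute dictionary.
observer_size = sys.getsizeof(observer) + sys.getsizeof(vars(observer))

print("plain weakref: %d bytes" % plain_size)
print("observer with attributes: %d bytes" % observer_size)
```

A pure-Python observer pays for an attribute dictionary on top of the weak reference itself; a C implementation storing the fields inline would presumably be smaller.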
Implementation Status
Implementation sketch, by JeffRush, for discussion.
from weakref import ref, getweakrefs


class _UnmodifiedState(object):
    """A distinct state marker with a false boolean value."""

    def __init__(self, name):
        self.name = name

    def __bool__(self):
        return False

    def __repr__(self):
        return self.name


GhostState = None    # The object is a ghost
ChangedState = True  # The object has been modified

# The object is not a ghost and has not been modified.
# The data-manager's "accessed" method should be called
# if the object has been accessed.
SavedState = _UnmodifiedState("SavedState")

# The object is not a ghost and has not been modified.
# The data-manager's "accessed" method should _not_ be
# called if the object has been accessed.
ReadState = _UnmodifiedState("ReadState")


class DataManager:

    def __init__(self):
        pass

    def register(self, pobj):
        pass

    def accessed(self, pobj):
        pass

    def mtime(self, pobj):
        return 0


class PersistentCache:
    """
    Maintains a collection of weak references to persistent objects.
    """

    class PersistentObserver(ref):
        """
        One entry in a cache.

        There is one observer but multiple events, for persistence.
        """

        def __new__(cls, obj):
            # weakref.ref binds its callback in __new__, so pass it here.
            return ref.__new__(cls, obj, cls.cb)

        def __init__(self, obj):
            ref.__init__(self, obj, type(self).cb)
            self.used = 0         # should be read-only
            # assigned by the data manager
            self.oid = None       # persistent object id
            self.serial = None    # object serial or None for a ghost
            self.manager = None   # data manager (r/o attribute)
            self.state = ReadState

        def cb(self):
            # Called with the (now dead) reference when the object dies.
            print("PersistentObserver called back by %r" % (self,))

        def __setitem__(self, name, value):
            # The key names a persistence event; the value is ignored.
            if name == "Accessed":
                if self.state is SavedState:
                    self.manager.accessed(self())
            elif name == "Changed":
                if self.state in (SavedState, ReadState):
                    self.manager.register(self())
            elif name == "Used":
                self.used += 1
                print("used bumped up to %d" % self.used)
            elif name == "Unused":
                self.used -= 1
                print("used bumped down to %d" % self.used)
            elif name == "Ghostified":
                self.state = GhostState

    def __init__(self):
        self.oids = {}

    def create_object(self, oid):
        """
        oid -> classname + state pickle
        """
        obj = X()
        obj.__dict__.update({})
        self.oids[oid] = self.PersistentObserver(obj)
        return obj

    def recreate_object(self, oid):
        """
        oid -> classname + state pickle
        """
        obj = X.__new__(X)  # recreation must not call __init__ again
        obj.__dict__.update({})
        self.oids[oid] = self.PersistentObserver(obj)
        return obj


pc = PersistentCache()


class Persistent:
    "notify persistence observers of persistence-relevant events"

    def __init__(self, *args, **kwargs):
        """
        The __init__ method is only called the first time an object is
        created and not on each subsequent recreation.
        """
        pc.oids[1] = pc.PersistentObserver(self)

    def find_persistence_observer(self):
        for r in getweakrefs(self):
            if isinstance(r, PersistentCache.PersistentObserver):
                return r
        return None

    # Deprecated
    _p_jar = property(lambda self: observer(self).manager)
    _p_oid = property(lambda self: observer(self).oid)
    _p_changed = property(lambda self: observer(self).state is ChangedState)
    _p_serial = property(lambda self: observer(self).serial)
    _p_mtime = property(lambda self: observer(self).manager.mtime(self))


def oid(pobj):
    return observer(pobj).oid


def manager(pobj):
    return observer(pobj).manager


def state(pobj):
    return observer(pobj).state


def observer(pobj):
    # Look up the existing observer rather than creating a new one.
    return pobj.find_persistence_observer()


def changed(pobj):
    """
    If the object is in the saved or read state, move it to the modified
    state. Else, do nothing.

    This function is the preferred way to tell the persistence system that
    an object has changed in cases where the persistence system cannot
    detect a change automatically.
    """
    po = observer(pobj)
    if po.state in (SavedState, ReadState):
        po.state = ChangedState


def unchanged(pobj):
    """
    If the object is in the changed state, move it to the saved state.
    Else, do nothing.

    This function is used in those very rare situations in which the
    persistence system would determine that an object has changed when it
    should not.
    """
    po = observer(pobj)
    if po.state is ChangedState:
        po.state = SavedState


class X(Persistent):

    def __del__(self):
        print("X.__del__ invoked")


def use_it(pobj):
    """
    Indicate that this persistent object is being used and must not be
    deactivated or invalidated. An object may be used multiple times
    so maintain a usecount.
    """
    po = pobj.find_persistence_observer()
    po['Used'] = 1


def unuse_it(pobj):
    po = pobj.find_persistence_observer()
    po['Unused'] = 1


#pc = PersistentCache()
#x = pc.instantiate(1)

x = X()
print(x.find_persistence_observer())
use_it(x)
unuse_it(x)
del x

import gc
gc.collect()
... --dmaurer, Sat, 18 Aug 2007 16:01:15 +0000
This proposal adds significant magic (magical use of the "weakref" slot) and overhead (besides the listed space overhead, we have time overhead for each attribute access caused by traversing the weakref list and calling issubclass on each element in this list).
This might be justified if the proposal really did drastically simplify the implementation of persistent C extensions. But I doubt that this is the case: a C extension whose instances might be persisted must still be fully persistence aware and call unghostify, accessed, use/unuse appropriately UNLESS it accesses its data via standard Python means only (rather than directly on C level) or is extremely simple (does not have any methods accessing its data). True: the C extension would no longer need to worry about prepending the persistent object header (instead of the Python object header), but this is a very small part of the total complexity in the implementation of persistent C extensions.
... --dmaurer, Sun, 19 Aug 2007 12:52:43 +0000
The proposal suggests handing persistent state management over to
persistent observers (which, I suppose, are considered part of the caching
subsystem) and to communicate with it via events. In this case, the event set
is incomplete. We need at least one additional event, e.g. "Saved", to
bring the state from "Changed" (or "Read" or "Ghostified") into the state
"Saved" (note that "Save" and "Saved" are probably inadequate names --
as objects with ghost states are usually "saved" as well).
The proposal states that ghostification is a Persistent responsibility.
I am a bit surprised as most ghostifications are triggered by the
caching subsystem but understand that the persistent class may want
to customize ghostification. If "ghostification" is not a
caching responsibility, then "unghostification" should not be one either
(it typically is performed by the DataManager). In this case,
the legality of persistent observer events depends on the persistent
state. Notifying Changed in a Ghost state would be illegal as
the persistent observer does not know how to unghostify the object.
Of course, it could call a DataManager method, but then this should
be stated in the description (as is done for the register and accessed
calls).
We need to specify some state for new objects. I expect them to be
in the Changed state.
... --dmaurer, Sun, 19 Aug 2007 13:07:14 +0000
There is some overlap between the Used and Accessed events.
Why is Used/Unused not sufficient, why do we need Accessed in addition?
I recognize that Used is paired with Unused and that Accessed
could in principle be called multiple times within a Used/Unused brace -- but
do we really need to care about that? As soon as Used is called, the object
is effectively protected against deactivation, and Accessed cache activities
only become relevant from the Unused on. If we called
DataManager.accessed(obj) on Used events (if the object is in
state Saved) and took the accessing cache actions on Unused, we could
abandon the Accessed event.
... --dmaurer, Sun, 19 Aug 2007 13:12:03 +0000
The proposal states that an object cannot be invalidated when its use
count is nonzero. In my view, it is a bug when a requested invalidation
cannot be performed (for whatever reason), as this is likely to introduce
a cache inconsistency. Therefore, I suggest raising an exception
when the invalidation is prevented by a nonzero use count.
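The suggestion could be sketched as follows. The exception name, the invalidate method and the boolean stand-ins for the persistent states are all hypothetical; the proposal itself does not define where invalidation is triggered.

```python
import weakref


class InvalidationBlockedError(Exception):
    """Hypothetical error raised when an object in use cannot be invalidated."""


class Observer(weakref.ref):
    def __init__(self, obj):
        super().__init__(obj)
        self.used = 0
        self.state = False          # stand-in for the saved state

    def __setitem__(self, event, ignored):
        if event == "Used":
            self.used += 1
        elif event == "Unused":
            self.used -= 1

    def invalidate(self):
        # Refuse loudly instead of silently skipping the invalidation.
        if self.used:
            raise InvalidationBlockedError(
                "use count is %d, cannot invalidate" % self.used)
        self.state = None           # stand-in for the ghost state


class Page(object):
    """Hypothetical persistent object."""


page = Page()
observer = Observer(page)

observer["Used"] = None
try:
    observer.invalidate()
    blocked = False
except InvalidationBlockedError:
    blocked = True

observer["Unused"] = None
observer.invalidate()               # succeeds once the object is unused
```

Raising makes the potential cache inconsistency visible to the caller, who can then retry after the matching Unused event or abort the transaction.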