FileStorage overview
FileStorage is the most widely used ZODB storage. It's a relatively simple design, and is generally the fastest storage implementation.
All data is saved in a single .fs file. A bit simplified, a .fs file is a sequence of
transaction records. Each transaction record has a transaction header, followed by a
sequence of data records. There's one data record for each object modified by the
transaction. Each data record has a data header, followed by a serialized form ("pickle")
of the object's new state. A FileStorage grows only by appending, so transactions appear
in order, from oldest to newest.
To speed lookups, a .fs.index file is derived from a .fs file. The .index
file maps an object id to the offset within the .fs file at which the data record
for the current revision of the object appears. If an .index file doesn't exist
when a .fs file is opened, a .index file is generated automatically.
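The relationship between an append-only file and its derived index can be sketched in a few lines. This is only an illustration: the record layout used here (8-byte oid, 4-byte length, payload) is invented for the example and is not the real .fs format, and the names append_record and build_index are hypothetical.

```python
import io
import struct

# Toy layout, NOT the real .fs format: 8-byte oid, 4-byte big-endian length, payload.
HEADER = struct.Struct(">8sI")

def append_record(f, oid, payload):
    """Append one data record at the end of the file and return its offset."""
    f.seek(0, io.SEEK_END)
    offset = f.tell()
    f.write(HEADER.pack(oid, len(payload)) + payload)
    return offset

def build_index(f):
    """Scan from oldest to newest; a later record for an oid replaces the earlier one."""
    index = {}
    f.seek(0)
    while True:
        offset = f.tell()
        header = f.read(HEADER.size)
        if len(header) < HEADER.size:
            break
        oid, length = HEADER.unpack(header)
        f.seek(length, io.SEEK_CUR)  # skip the payload; only offsets are indexed
        index[oid] = offset
    return index

log = io.BytesIO()
oid_a = (1).to_bytes(8, "big")
oid_b = (2).to_bytes(8, "big")
append_record(log, oid_a, b"state-1")
append_record(log, oid_b, b"other")
latest_a = append_record(log, oid_a, b"state-2")
index = build_index(log)
```

Because the index can always be rebuilt by scanning the data file, losing it costs time but not data.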
It's generally true that once a byte is written to a .fs file, it's never
overwritten: changes to a .fs file are made only by appending new data. Because
this strategy is so simple, it's very robust, and naturally supports undo (old
data is never erased). However, this also means that a FileStorage grows
without bound, and explicit packing is necessary to remove old revisions of objects
that are no longer needed. Packing doesn't modify a FileStorage in-place either.
Packing writes a new .fs file, and renames files at the end to replace the original
.fs file with the packed one.
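The "write a new file, then rename" strategy described above can be sketched as follows. This is a toy, assuming one record per line and a caller-supplied keep predicate; the function name pack_by_copy and the ".pack" suffix are made up for the example and say nothing about how FileStorage names its temporary files.

```python
import os
import tempfile

def pack_by_copy(path, keep):
    """Copy surviving records to a new file, then rename it over the original."""
    tmp_path = path + ".pack"
    with open(path, "rb") as src, open(tmp_path, "wb") as dst:
        for record in src:          # toy format: one record per line
            if keep(record):
                dst.write(record)
        dst.flush()
        os.fsync(dst.fileno())      # make sure the new file is on disk first
    os.replace(tmp_path, path)      # the original is untouched until this point

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "toy.log")
    with open(path, "wb") as f:
        f.write(b"old rev 1\nnew rev 2\nold rev 3\nnew rev 4\n")
    pack_by_copy(path, keep=lambda rec: not rec.startswith(b"old"))
    with open(path, "rb") as f:
        packed = f.read()
    leftover = os.path.exists(path + ".pack")
```

If anything fails before the rename, the original file is still intact, which is the point of not packing in place.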
There's one exception to the rule that bytes in a FileStorage are never overwritten. When a new transaction is committed to a FileStorage, a special status byte is first written to record that a transaction record append has started. When the append is complete, this status byte (near the start of the new transaction record) is overwritten with a value indicating that the commit completed successfully. If, for example, the computer crashes before the append is complete, the next time the FileStorage is opened the status byte still has its initial "append started but didn't complete" value, and the FileStorage is then truncated, to remove the incomplete append.
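The status-byte protocol can be illustrated with a toy log. Everything here is an assumption made for the sketch: the record layout (status byte, 4-byte length, payload), the two status values, and the function names are invented, and the real transaction header holds more than this.

```python
import io
import struct

IN_PROGRESS, DONE = b"c", b" "   # hypothetical status values for this toy
LENGTH = struct.Struct(">I")
HEAD_SIZE = 1 + LENGTH.size

def begin_append(f, payload):
    """Append a record marked as in progress and return its starting offset."""
    f.seek(0, io.SEEK_END)
    start = f.tell()
    f.write(IN_PROGRESS + LENGTH.pack(len(payload)) + payload)
    return start

def finish_append(f, start):
    """Overwrite the status byte: the only in-place write in the whole scheme."""
    f.seek(start)
    f.write(DONE)

def recover(f):
    """Truncate at the first record that is incomplete or not marked done."""
    f.seek(0, io.SEEK_END)
    size = f.tell()
    pos = 0
    while pos < size:
        f.seek(pos)
        head = f.read(HEAD_SIZE)
        if len(head) < HEAD_SIZE or head[:1] != DONE:
            break
        (length,) = LENGTH.unpack(head[1:])
        if pos + HEAD_SIZE + length > size:
            break
        pos += HEAD_SIZE + length
    f.truncate(pos)
    return pos

f = io.BytesIO()
first = begin_append(f, b"txn-1")
finish_append(f, first)
begin_append(f, b"txn-2")        # simulated crash: finish_append never runs
size_after_recovery = recover(f)
```

After recovery only the first, fully committed record remains.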
Best FileStorage practice
- Don't try to access a .fs file over a networked filesystem. FileStorage files experience heavy I/O activity, and networked filesystems introduce their own layers of bugs.
- Establish a good backup procedure, and follow it religiously. The intended way to do FileStorage backup and restore is explained later on this page.
- Packing is not an error recovery procedure. The purpose of packing is to reduce database size (by removing old revisions of objects), not to recover from errors or to prevent them. If you're having problems, packing can make them worse, at least by obscuring the original nature of the damage by throwing away original evidence. See later sections for FileStorage diagnostic and recovery tools.
- When packing, pack to a time in the past, not to "right now". There are pathological cases where packing to the current instant can delete an object from the database that's in use by a current transaction. In a rare subset of those cases (and arguably an application error when this occurs), that can leave the database with a dangling reference (an object in the database that refers to another persistent object that no longer exists in the database).
- Because a FileStorage grows one transaction at a time, and transactions
can be small, under some filesystem implementations a FileStorage
can become badly fragmented quickly. For example, I've observed bad
FileStorage fragmentation under NTFS (the high-end native Windows
filesystem). Fragmentation can grossly increase the time it takes
to seek to an object's data record in the file, so can grossly increase
object load times. So if you're using a filesystem that's prone to
fragmentation, defragment your .fs files regularly. (For Windows NT/2000/XP, Contig.exe from www.sysinternals.com is a free command-line utility that can defragment individual files.)
- Use ZEO. ZEO is a wonderful thing. Keeping your database on a ZEO server isolates it from hardware and software problems on your application machine. Even running ZEO on the same machine, but in a different process, protects the FileStorage code from most software problems in the application process.
- The Zope Replication Services (ZRS) product replicates a live FileStorage on a transaction-by-transaction basis. Even if the original FileStorage is on a machine with a bad disk, ZRS does not work by copying bytes, so the ZRS secondary replica(s) are extremely unlikely to experience the same kind of corruption.
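The advice above about packing to a time in the past can be wrapped in a small helper. This is a sketch under an assumption: db is any object with a pack(t) method that takes a cutoff in seconds since the epoch, which is how ZODB's DB.pack is generally documented; check the signature in your version. The helper name pack_to_past and the RecordingDB stand-in are hypothetical.

```python
import time

def pack_to_past(db, days=1.0, now=None):
    """Pack `db` to a cutoff `days` before `now` (seconds since the epoch)."""
    if days <= 0:
        raise ValueError("pack to a time in the past, not to the current instant")
    if now is None:
        now = time.time()
    cutoff = now - days * 86400
    db.pack(cutoff)   # assumed: pack(t) removes revisions no longer needed as of t
    return cutoff

class RecordingDB:
    """Stand-in used only to exercise the helper without a real database."""
    def __init__(self):
        self.packed_to = None

    def pack(self, t):
        self.packed_to = t

fake = RecordingDB()
cutoff = pack_to_past(fake, days=1, now=1_000_000.0)
```

Refusing a zero or negative age makes the "right now" case an explicit error rather than a silent default.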
Backing up FileStorage files
It's important to back up FileStorage files (.fs files), for the same
reasons it's important to back up all critical files. The intended way to
back up an .fs file is with the oddly named repozo.py script. This
performs efficient incremental backups against a live FileStorage (there's
no need to bring the app down), and is aware that a FileStorage is in
a temporarily non-sane state during the time a transaction is in the
process of committing. Doing a raw file copy, or rsync, isn't aware of
the latter problem, and should not be used on a live FileStorage file; a
filesystem-level copy is fine if the FileStorage file isn't open for write
access, but repozo still has an advantage then in knowing how to do
incremental backup.
repozo is easy to use, but supports many options, which may be confusing
at first. All the options are explained in the module docstring. Here we'll
just use a set of typical options.
You first need to create a directory to store backups. I'll call it backup here.
This directory should be created for repozo's exclusive use, and you must create
a distinct backup directory for each distinct FileStorage you want to back up.
To create a backup of /foo/Data.fs, storing backup files in directory /foo/backup:
repozo -BvzQ -r /foo/backup -f /foo/Data.fs
-B tells repozo to do a backup. -v causes it to display messages about
what it's doing. -z causes the backup files to be compressed, using gzip.
-Q is an optimization using md5 checksums to skip large amounts of I/O; there
is a vanishingly small chance that -Q will cause repozo to do a wrong thing,
and you can omit -Q if that bothers you, at the cost of more I/O and longer
runtime. -r and -f specify the backup directory and FileStorage to back up,
respectively.
Data.fs can be in active use when you run repozo. repozo makes a read-only
connection to the FileStorage, and backs up to the point of the most recent
fully committed transaction at the time this connection is made.
repozo will make either a full backup or an incremental backup. You can force
a full backup with the -F flag. Else repozo does a full backup only if
necessary. For example, a full backup is necessary if this is the first time
a backup has been made, or if the FileStorage has been packed since the last
time a backup was made.
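If backups are driven from a script (for example, from cron), it can help to build the command line in one place. The sketch below only assembles an argument list from the flags described above (-B, -v, -z, -Q, -F, -r, -f); the function name repozo_backup_argv is hypothetical, and the executable name "repozo" is assumed to be on the PATH.

```python
def repozo_backup_argv(repo_dir, fs_path, full=False, verbose=True,
                       compress=True, quick=True):
    """Build the argument list for a repozo backup run."""
    flags = "-B"
    if verbose:
        flags += "v"
    if compress:
        flags += "z"
    if quick:
        flags += "Q"
    argv = ["repozo", flags]
    if full:
        argv.append("-F")   # force a full backup
    argv += ["-r", repo_dir, "-f", fs_path]
    return argv

nightly = repozo_backup_argv("/foo/backup", "/foo/Data.fs")
weekly = repozo_backup_argv("/foo/backup", "/foo/Data.fs", full=True)
```

The resulting list can be handed to subprocess.run(argv, check=True) so that a failed backup raises an exception instead of passing unnoticed.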
In the backup directory, repozo creates data files with names of the form
YYYY-MM-DD-HH-MM-SS.$ext, where $ext is fsz for a compressed full backup,
deltafsz for a compressed incremental backup, fs for an uncompressed full
backup, or deltafs for an uncompressed incremental backup. It also creates,
or appends to, a .dat file, which is an index containing metadata about the
data files. The YYYY-MM-DD-HH-MM-SS part records the UTC (not local) time
at which the backup was made.
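The naming scheme above is regular enough to parse. The following sketch classifies a file name from a backup directory using only the rules stated here; the function name parse_backup_name is hypothetical, and anything that doesn't match (such as the .dat index) is reported as None.

```python
from datetime import datetime, timezone

# extension -> (kind of backup, compressed?)
KINDS = {
    "fsz": ("full", True),
    "deltafsz": ("incremental", True),
    "fs": ("full", False),
    "deltafs": ("incremental", False),
}

def parse_backup_name(name):
    """Return (utc_time, kind, compressed), or None for files that aren't data files."""
    stem, dot, ext = name.rpartition(".")
    if not dot or ext not in KINDS:
        return None
    try:
        when = datetime.strptime(stem, "%Y-%m-%d-%H-%M-%S")
    except ValueError:
        return None
    kind, compressed = KINDS[ext]
    # The timestamp in the name is UTC, not local time.
    return when.replace(tzinfo=timezone.utc), kind, compressed

parsed = parse_backup_name("2004-05-25-14-10-41.deltafsz")
ignored = parse_backup_name("2004-05-25-14-10-41.dat")
```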
repozo is also used to recreate an .fs file from the backup files:
repozo -Rv -r backup -D YYYY-MM-DD-HH-MM-SS -o Copy.fs
-D is optional, and specifies a UTC (not local) time; by default, current
time is used. If specified, the hour, minute, and second parts are optional.
repozo recreates the originally backed-up FileStorage, to the state it had
at the most recent backup at or before this time. The -o option specifies
an output file path, the name of the reconstructed FileStorage. In the
example, the recreated FileStorage is Copy.fs in the current directory.
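To make "the most recent backup at or before this time" concrete, here is a rough sketch of how a set of backup files might be chosen for a given -D value. It is not repozo's code. Two assumptions are made: omitted hour, minute, and second parts are treated as zero, and a restore needs the latest full backup at or before the time plus any later incrementals up to that time. The function names are hypothetical.

```python
from datetime import datetime

def parse_when(text):
    """Parse YYYY-MM-DD[-HH[-MM[-SS]]]; omitted parts are assumed to be zero."""
    parts = [int(p) for p in text.split("-")]
    if not 3 <= len(parts) <= 6:
        raise ValueError("expected YYYY-MM-DD[-HH[-MM[-SS]]]: %r" % text)
    return datetime(*parts)

def files_for_restore(names, when):
    """Latest full backup at or before `when`, plus later incrementals up to `when`."""
    chosen = []
    for name in sorted(names, reverse=True):   # fixed-width names sort by time
        stem, _, ext = name.rpartition(".")
        if ext not in ("fs", "fsz", "deltafs", "deltafsz"):
            continue
        try:
            made = datetime.strptime(stem, "%Y-%m-%d-%H-%M-%S")
        except ValueError:
            continue
        if made > when:
            continue
        chosen.append(name)
        if ext in ("fs", "fsz"):
            break                              # reached the full backup
    else:
        return []                              # no full backup at or before `when`
    return chosen[::-1]                        # oldest first

names = [
    "2004-05-24-02-00-00.fsz",
    "2004-05-24-02-00-00.dat",
    "2004-05-25-02-00-00.deltafsz",
    "2004-05-26-02-00-00.deltafsz",
]
selected = files_for_restore(names, parse_when("2004-05-26"))
```

With the date given as 2004-05-26 and no time of day, the incremental made at 02:00 on that day falls after the cutoff and is left out.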
repozo -BQ is fast, usually taking time proportional to the growth in
the FileStorage since the last time it was run. It's a good idea to
make backups daily; incremental repozo backups are made quickly enough that
you may wish to run them more frequently.
fstest verifies low-level file integrity
fstest.py does low-level checks that the structure of a FileStorage is correct.
It verifies that the transaction and data headers are well-formed, and mutually
consistent. It does not examine object pickles, so it cannot detect corruption in
pickles, nor can it verify that inter-object references are legitimate:
fstest [-v[v]] Data.fs
Without flags, fstest normally produces no output. If it finds a problem, it
prints information about the first problem found, then exits.
You can run fstest against a live FileStorage, but if a transaction is in the
process of being committed when fstest reaches the end of the file, you're likely
to see a spurious error message of the form:
Data.fs truncated possibly because of damaged records at 5981372
This is because the tail end of a FileStorage is temporarily in a non-sane state
while new data is being appended to it, and fstest isn't aware of this.
With -v, a line of output per transaction is also printed, giving the transaction's
"tid" (transaction identifier) and the starting offset of the transaction record
within the file. With -vv, a line of output per data record is also printed, giving the
object's "oid" (persistent object identifier) and the starting offset of the data
record; output for all data records within a transaction appears before the line
of output for the transaction record in this case.
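Because data-record lines come before the line for their transaction, post-processing -vv output means buffering records until the transaction line arrives. The sketch below assumes lines shaped like the fstest sample shown in the fsdump section of this page ("OFFSET: object oid 0x... #N" and "OFFSET: transaction tid 0x... #N"); other versions may print something different, and the function name is hypothetical.

```python
import re

LINE = re.compile(r"^(\d+): (object oid|transaction tid) 0x([0-9a-f]+) #(\d+)$")

def group_fstest_output(lines):
    """Attach each run of data-record lines to the transaction line that follows it."""
    transactions, pending = [], []
    for line in lines:
        m = LINE.match(line.strip())
        if m is None:
            continue                      # ignore anything unrecognized
        offset, kind, ident = int(m.group(1)), m.group(2), m.group(3)
        if kind == "object oid":
            pending.append((ident, offset))
        else:
            transactions.append({"tid": ident, "offset": offset, "records": pending})
            pending = []
    return transactions

sample = [
    "1620: object oid 0x0000000000000000 #0",
    "1854: object oid 0x000000000000000a #1",
    "1583: transaction tid 0x035557d2b17e7bbb #4",
]
result = group_fstest_output(sample)
```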
fsrefs checks object sanity
fsrefs.py checks object sanity by trying to load
the current revision of every object O in the database, and
also verifies that every object directly reachable from each such O exists in the database.
It's hard to explain exactly what it does because it relies on undocumented
features in Python's cPickle module: many of the crucial steps of loading an object are
taken, but application objects aren't actually created. This saves a lot of time,
and allows fsrefs to be run even if the code implementing the object classes isn't
available.
The command line is very simple:
fsrefs [-v] Data.fs
A read-only connection to the specified FileStorage is made, but it is not recommended
to run fsrefs against a live FileStorage. Because a live FileStorage is mutating while
fsrefs runs, it's not possible for fsrefs to get a wholly consistent view of the
database across the entire time fsrefs is running; spurious error messages may result.
A useful tactic is to make a backup with repozo (see above; this should go quickly
if you're routinely making incremental repozo backups), and run fsrefs against a
temporary FileStorage recreated by repozo from its then-current backup files.
fsrefs doesn't normally produce any output. If an object fails to load, the oid of
the object is given in a message saying so, and if -v was specified then the traceback
corresponding to the load failure is also displayed (this is the only effect of the -v
flag).
Precisely which other errors are detected depends on the version of ZODB. Here I'll
describe ZODB 3.4's behavior; see the fsrefs module docstring for details specific
to your version.
Three other kinds of errors are also detected, when an object O loads OK, and directly refers to a persistent object P but there's a problem with P:
- If P doesn't exist in the database, a message saying so is displayed. The unsatisfiable reference to P is often called a "dangling reference"; P is called "missing" in the error output.
- If the current state of the database is such that P's creation has been undone, then P can't be loaded either. This is also a kind of dangling reference, but is identified as "object creation was undone".
- If P can't be loaded (but does exist in the database), a message saying that O refers to an object that can't be loaded is displayed.
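The first kind of error, a dangling reference, is easy to picture with a toy model. The sketch below is not how fsrefs works internally; it simply treats the database as a dict mapping each oid to the oids it refers to directly, and reports references to oids that are absent. The names find_dangling and toy_db are made up for the example.

```python
def find_dangling(references):
    """references maps each oid to the list of oids it refers to directly."""
    problems = []
    for oid, targets in sorted(references.items()):
        for target in targets:
            if target not in references:
                problems.append((oid, target))   # oid refers to a missing object
    return problems

# Object 1 refers to object 3, which doesn't exist in this toy database.
toy_db = {0: [1, 2], 1: [3], 2: []}
dangling = find_dangling(toy_db)
```

Only direct references are checked, one hop from each object, which mirrors the "directly reachable" wording above.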
fsrefs also (indirectly) checks that the .index file is sane, because fsrefs
uses the index to get its idea of what constitutes "all the objects in the database".
Note these limitations: because fsrefs only looks at the current revision of objects,
it does not attempt to load objects in versions, or non-current revisions of objects;
therefore fsrefs cannot find problems in versions or in non-current revisions.
fsdump creates a human-readable rendering
Like fstest -vv, fsdump produces a line of output for each object in each
transaction. Their purposes are quite different, though. fstest verifies the
low-level well-formedness of the file, but fsdump assumes the FileStorage is
in good shape, and prints more human-oriented information (for example, the
times associated with transactions, and the names of object classes):
fsdump Data.fs
There are no flags or optional arguments. Here's sample output from fsdump for
a single transaction:
Trans #00004 tid=035557d2b17e7bbb time=2004-05-25 14:10:41.600174 offset=1620
status='p' user= description=Added ZGlobals
data #00000 oid=0000000000000000 class=Persistence.PersistentMapping
data #00001 oid=000000000000000a class=BTrees.OOBTree.OOBTree
For contrast, this is fstest -vv output for the same transaction:
1620: object oid 0x0000000000000000 #0
1854: object oid 0x000000000000000a #1
1583: transaction tid 0x035557d2b17e7bbb #4
Overall, the fsdump output is more informative, but skips low-level information like
the file offsets of individual data records -- if you're not tracking down a file
corruption problem, the offsets aren't really that interesting. There are some peculiar
differences too, primarily that the offset given by fsdump for a transaction is actually
the offset of the first data record within the transaction, and that fstest produces
output for a transaction record after the output for that transaction's data records.
Note that since fsdump assumes its input FileStorage is in good shape, it
skips most sanity checks, and may produce silly output if the database is
damaged.
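The tid and time fields in the fsdump sample above appear to be two renderings of the same value. The decoding below is an assumption based on how ZODB timestamps are commonly described (the high 32 bits count minutes in a calendar with 31-day months starting at 1900, and the low 32 bits hold the seconds scaled by 2**32/60); this page doesn't state it, but it does reproduce the sample's time from the sample's tid. The function name decode_tid is hypothetical.

```python
def decode_tid(tid_hex):
    """Split a 16-hex-digit tid into (year, month, day, hour, minute, seconds)."""
    value = int(tid_hex, 16)
    minutes_part, fraction = value >> 32, value & 0xFFFFFFFF
    rest, minute = divmod(minutes_part, 60)
    rest, hour = divmod(rest, 24)
    rest, day0 = divmod(rest, 31)        # assumed: every month counted as 31 days
    year0, month0 = divmod(rest, 12)
    seconds = fraction * 60 / 2**32      # assumed: seconds scaled into 32 bits
    return year0 + 1900, month0 + 1, day0 + 1, hour, minute, seconds

decoded = decode_tid("035557d2b17e7bbb")
```

For the sample tid this yields 2004-05-25 14:10 with roughly 41.6 seconds, matching the time printed by fsdump.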
New in ZODB 3.4: data records also display their size, in bytes. For example, where before you might have seen:
data #00003 oid=0000000000000011 class=Products.ZCatalog.Catalog.Catalog
data #00004 oid=0000000000000018 class=BTrees.Length.Length
under ZODB 3.4 you might see:
data #00003 oid=0000000000000011 size=595 class=Products.ZCatalog.Catalog.Catalog
data #00004 oid=0000000000000018 size=38 class=BTrees.Length.Length
fsrecover attempts to repair FileStorage damage
XXX fsrecover.py
FileStorage corruption
Most people who use FileStorage never experience corruption. Those who do seem to see corruption regularly. No general cause is known, or even suspected, and it's usually difficult to track down a specific cause. Hardware or system software failure can, of course, corrupt any database, and some specific cases of this have been diagnosed: bad memory chip, bad disk, bad disk driver, a RAID controller that fails only under heavy load, a rogue C extension doing wild stores and corrupting system I/O buffers as a result. There are no known ways the current releases of ZODB or ZEO can cause FileStorage corruption.
By "corruption", I mean what corruption conventionally means for any file:
the .fs and/or .fs.index file is damaged at the byte level, as if
someone had overwritten some region (or regions) with nonsense bytes. Of
course this can be a disaster when it occurs. Visible symptoms may include:
- FileStorage.py raises CorruptedDataError. The exception detail in this case often reveals that a string of 42 NUL (0) bytes was read for a data record header. In such a case, investigation usually shows that a contiguous region of the FileStorage has been overwritten by NUL bytes, starting and stopping in the middle of otherwise-undamaged transactions.
- FileStorage.py passes on this exception from Python's struct module:
  error: unpack str size does not match format
  A cause for this may be that the .fs and .fs.index files have gotten out of synch, perhaps that the .fs file has gotten packed, but the old (pre-pack) .fs.index file is still being used. It may help to delete the .fs.index file in that case (when a FileStorage is opened, it will recreate an .fs.index file if one doesn't already exist).
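When investigating the first symptom, a quick way to look for regions overwritten by NUL bytes is to scan a copy of the file for long NUL runs. This is a rough diagnostic sketch, not one of the ZODB tools: the function name and the threshold of 42 (borrowed from the header size mentioned above) are choices made for the example, and a long run of NULs can occasionally be legitimate data inside a pickle.

```python
import re

def find_nul_runs(data, min_length=42):
    """Return (offset, length) for each run of at least min_length NUL bytes."""
    pattern = re.compile(rb"\x00{%d,}" % min_length)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]

# 50 NULs in the middle are reported; the short run of 8 at the end is not.
data = b"good" + b"\x00" * 50 + b"more" + b"\x00" * 8
runs = find_nul_runs(data)
```

For a large file, read it in chunks or use mmap rather than loading it whole, and always work on a copy so the original evidence is preserved.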
Because no specific, generally applicable cause for FileStorage corruption is known, it's hard to give advice to those who experience it. Because so many low-level system problems can be at fault, it's a good idea, when possible, to try running your app on a different machine. In our experience, this usually makes the problem go away -- but, of course, then you'll never determine the true cause of the problem on the original machine.
If you can't diagnose or fix a case of recurring FileStorage corruption, follow best practice, and prepare for frequent recovery.