
FileStorage overview

FileStorage is the most widely used ZODB storage. It's a relatively simple design, and is generally the fastest storage implementation.

All data is saved in a single .fs file. A bit simplified, a .fs file is a sequence of transaction records. Each transaction record has a transaction header, followed by a sequence of data records. There's one data record for each object modified by the transaction. Each data record has a data header, followed by a serialized form ("pickle") of the object's new state. A FileStorage grows only by appending, so transactions appear in order, from oldest to newest.
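
To make the nesting concrete, here is a minimal sketch of an append-only log with the same shape: transaction records, each holding one data record per modified object. The byte layout (`TXN_HEADER`, `DATA_HEADER`) and the helper names are invented for illustration; this is not the real .fs format.

```python
import pickle
import struct

# Toy layout (NOT the real .fs format):
#   transaction header: 8-byte tid, 4-byte count of data records
#   data header:        8-byte oid, 4-byte pickle length, then the pickle
TXN_HEADER = struct.Struct(">QI")
DATA_HEADER = struct.Struct(">QI")


def append_transaction(log, tid, changes):
    """Append one transaction record with one data record per changed object."""
    log += TXN_HEADER.pack(tid, len(changes))
    for oid, state in changes.items():
        blob = pickle.dumps(state)
        log += DATA_HEADER.pack(oid, len(blob))
        log += blob


def iter_transactions(log):
    """Yield (tid, {oid: state}) in file order, i.e. oldest first."""
    pos = 0
    while pos < len(log):
        tid, count = TXN_HEADER.unpack_from(log, pos)
        pos += TXN_HEADER.size
        records = {}
        for _ in range(count):
            oid, size = DATA_HEADER.unpack_from(log, pos)
            pos += DATA_HEADER.size
            records[oid] = pickle.loads(bytes(log[pos:pos + size]))
            pos += size
        yield tid, records


log = bytearray()
append_transaction(log, 1, {0: {"name": "root"}})
append_transaction(log, 2, {0: {"name": "root", "n": 1}, 10: [1, 2, 3]})
history = list(iter_transactions(log))
```

Note that object 0 appears twice in the log, once per transaction that modified it; nothing is rewritten in place.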

To speed lookups, a .fs.index file is derived from a .fs file. The .index file maps an object id to the offset within the .fs file at which the data record for the current revision of the object appears. If an .index file doesn't exist when a .fs file is opened, a .index file is generated automatically.
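
The idea can be sketched with a toy log, assuming a made-up record layout (8-byte oid, 4-byte length, payload) rather than the real .fs one. Because the index is derived entirely from the log, it can always be rebuilt by scanning from the start; later records for the same oid simply replace earlier entries.

```python
import struct

# Toy data record (NOT the real .fs layout): 8-byte oid, 4-byte length, payload.
HEADER = struct.Struct(">QI")


def append_record(log, oid, payload):
    """Append a data record and return the offset at which it starts."""
    offset = len(log)
    log += HEADER.pack(oid, len(payload)) + payload
    return offset


def rebuild_index(log):
    """Scan the whole log, mapping each oid to the offset of its newest record."""
    index = {}
    pos = 0
    while pos < len(log):
        oid, size = HEADER.unpack_from(log, pos)
        index[oid] = pos  # a later record for the same oid wins
        pos += HEADER.size + size
    return index


def load_current(log, index, oid):
    """Jump straight to the current revision instead of scanning."""
    pos = index[oid]
    _, size = HEADER.unpack_from(log, pos)
    start = pos + HEADER.size
    return bytes(log[start:start + size])


log = bytearray()
append_record(log, 1, b"first revision")
append_record(log, 2, b"other object")
latest = append_record(log, 1, b"second revision")
index = rebuild_index(log)
```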

It's generally true that once a byte is written to a .fs file, it's never overwritten: changes to a .fs file are made only by appending new data. Because this strategy is so simple, it's very robust, and naturally supports undo (old data is never erased). However, this also means that a FileStorage grows without bound, and explicit packing is necessary to remove old revisions of objects that are no longer needed. Packing doesn't modify a FileStorage in-place either. Packing writes a new .fs file, and renames files at the end to replace the original .fs file with the packed one.

There's one exception to the rule that bytes in a FileStorage are never overwritten. When a new transaction is committed to a FileStorage, a special status byte is first written to record that a transaction record append has started. When the append is complete, this status byte (near the start of the new transaction record) is overwritten with a value indicating that the commit completed successfully. If, for example, the computer crashes before the append is complete, the next time the FileStorage is opened the status byte still has its initial "append started but didn't complete" value, and the FileStorage is then truncated, to remove the incomplete append.
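
Here is a rough sketch of that protocol on a toy file. The record layout, the two status values, and the function names are all illustrative assumptions; only the sequence of steps (append with an "in progress" status, overwrite the status when done, truncate on open if the status was never flipped) comes from the description above.

```python
import os
import struct
import tempfile

# Toy record (NOT the real .fs layout): 1 status byte, 4-byte length, payload.
IN_PROGRESS = b"c"  # illustrative value: append started but not finished
COMMITTED = b" "    # illustrative value: append completed
HEADER = struct.Struct(">cI")


def commit(path, payload, crash_before_finish=False):
    """Append a record, then flip its status byte once the append is complete."""
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        start = f.tell()
        f.write(HEADER.pack(IN_PROGRESS, len(payload)) + payload)
        f.flush()
        if crash_before_finish:
            return  # simulated crash: the status byte keeps its initial value
        f.seek(start)
        f.write(COMMITTED)  # the only in-place overwrite


def open_and_recover(path):
    """Return committed payloads, truncating the file at an unfinished append."""
    payloads = []
    with open(path, "r+b") as f:
        data = f.read()
        pos = 0
        while pos < len(data):
            status, size = HEADER.unpack_from(data, pos)
            if status != COMMITTED:
                f.truncate(pos)  # drop the incomplete append
                break
            body = pos + HEADER.size
            payloads.append(data[body:body + size])
            pos = body + size
    return payloads


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.fs")
    open(path, "wb").close()
    commit(path, b"txn 1")
    commit(path, b"txn 2", crash_before_finish=True)
    recovered = open_and_recover(path)
    size_after = os.path.getsize(path)
```

After recovery the file holds only the first record, exactly as if the second append had never started.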

Best FileStorage practice

  • Don't try to access a .fs file over a networked filesystem. FileStorage files experience heavy I/O activity, and networked filesystems introduce their own layers of bugs.
  • Establish a good backup procedure, and follow it religiously. The intended way to do FileStorage backup and restore is explained later on this page.
  • Packing is not an error recovery procedure. The purpose of packing is to reduce database size (by removing old revisions of objects), not to recover from errors or to prevent them. If you're having problems, packing can make them worse, at least by obscuring the original nature of the damage by throwing away original evidence. See later sections for FileStorage diagnostic and recovery tools.
  • When packing, pack to a time in the past, not to "right now". There are pathological cases where packing to the current instant can delete an object from the database that's in use by a current transaction. In a rare subset of those cases (and arguably an application error when this occurs), that can leave the database with a dangling reference (an object in the database that refers to another persistent object that no longer exists in the database).
  • Because a FileStorage grows one transaction at a time, and transactions can be small, under some filesystem implementations a FileStorage can become badly fragmented quickly. For example, I've observed bad FileStorage fragmentation under NTFS (the high-end native Windows filesystem). Fragmentation can grossly increase the time it takes to seek to an object's data record in the file, so can grossly increase object load times. So if you're using a filesystem that's prone to fragmentation, defragment your .fs files regularly. (For Windows NT/2000/XP, Contig.exe from www.sysinternals.com is a free command-line utility that can defragment individual files.)
  • Use ZEO. ZEO is a wonderful thing. Keeping your database on a ZEO server isolates it from hardware and software problems on your application machine. Even running ZEO on the same machine, but in a different process, protects the FileStorage code from most software problems in the application process.
  • The Zope Replication Services (ZRS) product replicates a live FileStorage on a transaction-by-transaction basis. Even if the original FileStorage is on a machine with a bad disk, ZRS does not work by copying bytes, so the ZRS secondary replica(s) are extremely unlikely to experience the same kind of corruption.
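
The advice about packing to a time in the past can be wrapped in a small helper. This is a sketch that assumes a database object with a pack(t) method taking seconds since the epoch, which is how I understand ZODB's DB.pack to behave; check your version before relying on it. `pack_to_past` and `FakeDB` are names made up for this example, and `FakeDB` only stands in for a real database so the call can be shown.

```python
import time


def pack_to_past(db, days=1.0, now=None):
    """Pack `db` to a cutoff `days` before now and return the cutoff used.

    Assumes `db.pack(t)` accepts a time in seconds since the epoch.
    """
    if days <= 0:
        raise ValueError("pack to a time in the past, not to 'right now'")
    current = time.time() if now is None else now
    cutoff = current - days * 86400
    db.pack(cutoff)
    return cutoff


class FakeDB:
    """Stand-in used only to show the call; not part of ZODB."""

    def __init__(self):
        self.packed_to = None

    def pack(self, t):
        self.packed_to = t


fake = FakeDB()
cutoff = pack_to_past(fake, days=1, now=1_000_000.0)
```

Refusing a zero or negative `days` makes the "right now" mistake an immediate error instead of a rare dangling reference later.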

Backing up FileStorage files

It's important to back up FileStorage files (.fs files), for the same reasons it's important to back up all critical files. The intended way to back up an .fs file is with the oddly named repozo.py script. This performs efficient incremental backups against a live FileStorage (there's no need to bring the app down), and is aware that a FileStorage is in a temporarily non-sane state during the time a transaction is in the process of committing. Doing a raw file copy, or rsync, isn't aware of the latter problem, and should not be used on a live FileStorage file; a filesystem-level copy is fine if the FileStorage file isn't open for write access, but repozo still has an advantage then in knowing how to do incremental backup.

repozo is easy to use, but supports many options, which may be confusing at first. All the options are explained in the module docstring. Here we'll just use a set of typical options.

You first need to create a directory to store backups. I'll call it backup here. This directory should be created for repozo's exclusive use, and you must create a distinct backup directory for each distinct FileStorage you want to back up.

To create a backup of Data.fs, storing backup files in directory backup:

        repozo -BvzQ -r /foo/backup -f /foo/Data.fs

-B tells repozo to do a backup. -v causes it to display messages about what it's doing. -z causes the backup files to be compressed, using gzip. -Q is an optimization using md5 checksums to skip large amounts of I/O; there is a vanishingly small chance that -Q will cause repozo to do a wrong thing, and you can omit -Q if that bothers you, at the cost of more I/O and longer runtime. -r and -f specify the backup directory and FileStorage to back up, respectively.

Data.fs can be in active use when you run repozo. repozo makes a read-only connection to the FileStorage, and backs up to the point of the most recent fully committed transaction at the time this connection is made.

repozo will make either a full backup or an incremental backup. You can force a full backup with the -F flag. Else repozo does a full backup only if necessary. For example, a full backup is necessary if this is the first time a backup has been made, or if the FileStorage has been packed since the last time a backup was made.
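
If you drive backups from a Python script (for example, from a scheduled job), the flags above can be assembled like this. It's a sketch: the function names are made up, it assumes a `repozo` executable is on the PATH, and it assumes repozo reports failure through a non-zero exit status. Only the flags described above are used.

```python
import subprocess


def repozo_backup_argv(repo_dir, fs_path, *, verbose=True, compress=True,
                       quick=True, force_full=False, program="repozo"):
    """Build the argument list for a repozo backup run."""
    argv = [program, "-B"]
    if verbose:
        argv.append("-v")
    if compress:
        argv.append("-z")
    if quick:
        argv.append("-Q")
    if force_full:
        argv.append("-F")
    argv += ["-r", repo_dir, "-f", fs_path]
    return argv


def run_backup(repo_dir, fs_path, **options):
    """Run repozo; raises CalledProcessError if it exits non-zero."""
    subprocess.run(repozo_backup_argv(repo_dir, fs_path, **options), check=True)


argv = repozo_backup_argv("/foo/backup", "/foo/Data.fs")
full_argv = repozo_backup_argv("/foo/backup", "/foo/Data.fs", force_full=True)
```

Passing the arguments as a list avoids shell quoting problems with unusual paths.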

In the backup directory, repozo creates data files with names of the form YYYY-MM-DD-HH-MM-SS.$ext, where $ext is fsz for a compressed full backup, deltafsz for a compressed incremental backup, fs for an uncompressed full backup, or deltafs for an uncompressed incremental backup. It also creates, or appends to, a .dat file, which is an index containing metadata about the data files. The YYYY-MM-DD-HH-MM-SS part records the UTC (not local) time at which the backup was made.
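
When inspecting a backup directory from a script, the naming scheme can be decoded as follows. This is a small sketch based only on the description above; `parse_backup_name` is a made-up helper, not part of repozo.

```python
from datetime import datetime, timezone
import os

# Extension -> (is_full_backup, is_compressed), as listed in the text above.
KINDS = {
    ".fs": (True, False),
    ".fsz": (True, True),
    ".deltafs": (False, False),
    ".deltafsz": (False, True),
}


def parse_backup_name(filename):
    """Return (utc_datetime, is_full, is_compressed), or None for other files."""
    stem, ext = os.path.splitext(os.path.basename(filename))
    if ext not in KINDS:
        return None
    try:
        when = datetime.strptime(stem, "%Y-%m-%d-%H-%M-%S")
    except ValueError:
        return None
    is_full, is_compressed = KINDS[ext]
    # The timestamp in the name is UTC, not local time.
    return when.replace(tzinfo=timezone.utc), is_full, is_compressed


parsed = parse_backup_name("backup/2004-05-25-14-10-41.deltafsz")
```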

repozo is also used to recreate an .fs file from the backup files:

        repozo -Rv -r backup -D YYYY-MM-DD-HH-MM-SS -o Copy.fs

-D is optional, and specifies a UTC (not local) time; by default, current time is used. If specified, the hour, minute, and second parts are optional. repozo recreates the originally backed-up FileStorage, to the state it had at the most recent backup at or before this time. The -o option specifies an output file path, the name of the reconstructed FileStorage. In the example, the recreated FileStorage is Copy.fs in the current directory.
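
To see which backup files matter for a given -D time, here is a sketch of the selection rule as I understand it: take the most recent full backup at or before the requested time, plus every incremental made after it, up to that time. This illustrates the rule; it is not repozo's own code. `files_for_restore` is a hypothetical helper, and it expects a complete YYYY-MM-DD-HH-MM-SS string rather than the shortened forms -D accepts.

```python
BACKUP_EXTS = (".fs", ".fsz", ".deltafs", ".deltafsz")
FULL_EXTS = (".fs", ".fsz")


def files_for_restore(names, when):
    """Return the backup files needed to rebuild the state as of `when`.

    `names` are bare file names from the backup directory; `when` is a
    'YYYY-MM-DD-HH-MM-SS' UTC string. Fixed-width timestamps sort
    correctly as plain strings.
    """
    candidates = sorted(
        n for n in names
        if n.endswith(BACKUP_EXTS) and n.split(".", 1)[0] <= when
    )
    # Walk backwards to the newest full backup; keep it and what follows.
    for i in range(len(candidates) - 1, -1, -1):
        if candidates[i].endswith(FULL_EXTS):
            return candidates[i:]
    return []


names = [
    "2004-05-20-00-00-00.fsz",
    "2004-05-21-00-00-00.deltafsz",
    "2004-05-22-00-00-00.fsz",
    "2004-05-23-00-00-00.deltafsz",
    "2004-05-24-00-00-00.deltafsz",
]
needed = files_for_restore(names, "2004-05-23-12-00-00")
```

In this example the restore needs only the full backup from the 22nd and the incremental from the 23rd; the backup from the 24th is later than the requested time and is ignored.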

repozo -BQ is fast, usually taking time proportional to the growth in the FileStorage since the last time it was run. It's a good idea to make backups daily; incremental repozo backups are made quickly enough that you may wish to run them more frequently.

fstest verifies low-level file integrity

fstest.py does low-level checks that the structure of a FileStorage is correct. It verifies that the transaction and data headers are well-formed, and mutually consistent. It does not examine object pickles, so it cannot detect corruption in pickles, nor can it verify that inter-object references are legitimate:

        fstest [-v[v]] Data.fs

Without flags, fstest normally produces no output. If it finds a problem, it prints information about the first problem found, then exits.

You can run fstest against a live FileStorage, but if a transaction is in the process of being committed when fstest reaches the end of the file, you're likely to see a spurious error message of the form:

        Data.fs truncated possibly because of damaged records at 5981372

This is because the tail end of a FileStorage is temporarily in a non-sane state while new data is being appended to it, and fstest isn't aware of this.

With -v, a line of output per transaction is also printed, giving the transaction's "tid" (transaction identifier) and the starting offset of the transaction record within the file. With -vv, a line of output per data record is also printed, giving the object's "oid" (persistent object identifier) and the starting offset of the data record; output for all data records within a transaction appears before the line of output for the transaction record in this case.
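
Because the data record lines come before their transaction line, a script reading -vv output has to buffer them. Here is a sketch that assumes the lines look like the sample fstest -vv output shown in the fsdump section of this page (`offset: object oid 0x... #n` and `offset: transaction tid 0x... #n`); other ZODB versions may print something different.

```python
import re

LINE = re.compile(
    r"^\s*(?P<offset>\d+): (?P<kind>object oid|transaction tid) "
    r"0x(?P<ident>[0-9a-fA-F]+) #(?P<seq>\d+)\s*$"
)


def group_fstest_output(lines):
    """Return [(tid, txn_offset, [(oid, data_offset), ...]), ...]."""
    transactions = []
    pending = []  # data records seen since the last transaction line
    for line in lines:
        m = LINE.match(line)
        if m is None:
            continue  # ignore anything that isn't a -v/-vv record line
        offset = int(m["offset"])
        if m["kind"] == "object oid":
            pending.append((m["ident"], offset))
        else:
            # The transaction line closes the group of records before it.
            transactions.append((m["ident"], offset, pending))
            pending = []
    return transactions


sample = [
    "1620: object oid 0x0000000000000000 #0",
    "1854: object oid 0x000000000000000a #1",
    "1583: transaction tid 0x035557d2b17e7bbb #4",
]
grouped = group_fstest_output(sample)
```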

fsrefs checks object sanity

fsrefs.py checks object sanity by trying to load the current revision of every object O in the database, and also verifies that every object directly reachable from each such O exists in the database.

It's hard to explain exactly what it does because it relies on undocumented features in Python's cPickle module: many of the crucial steps of loading an object are taken, but application objects aren't actually created. This saves a lot of time, and allows fsrefs to be run even if the code implementing the object classes isn't available.

The command line is very simple:

        fsrefs [-v] Data.fs

A read-only connection to the specified FileStorage is made, but it is not recommended to run fsrefs against a live FileStorage. Because a live FileStorage is mutating while fsrefs runs, it's not possible for fsrefs to get a wholly consistent view of the database across the entire time fsrefs is running; spurious error messages may result. A useful tactic is to make a backup with repozo (see above; this should go quickly if you're routinely making incremental repozo backups), and run fsrefs against a temporary FileStorage recreated by repozo from its then-current backup files.

fsrefs doesn't normally produce any output. If an object fails to load, the oid of the object is given in a message saying so, and if -v was specified then the traceback corresponding to the load failure is also displayed (this is the only effect of the -v flag).

Precisely which other errors are detected depends on the version of ZODB. Here I'll describe ZODB 3.4's behavior; see the fsrefs module docstring for details specific to your version.

Three other kinds of errors are also detected. Each arises when an object O loads OK and directly refers to a persistent object P, but there's a problem with P:

  • If P doesn't exist in the database, a message saying so is displayed. The unsatisfiable reference to P is often called a "dangling reference"; P is called "missing" in the error output.
  • If the current state of the database is such that P's creation has been undone, then P can't be loaded either. This is also a kind of dangling reference, but is identified as "object creation was undone".
  • If P can't be loaded (but does exist in the database), a message saying that O refers to an object that can't be loaded is displayed.
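
The first kind of error, the "missing" object, comes down to a simple check once you know which oids exist and which oids each object refers to directly. Here's a toy sketch of that check on a plain dictionary. It doesn't touch pickles or a real FileStorage, and `find_dangling` is a made-up name.

```python
def find_dangling(refs):
    """Return sorted (referrer_oid, missing_oid) pairs.

    `refs` maps every oid present in the database to the list of oids
    that object refers to directly.
    """
    problems = []
    for oid, targets in refs.items():
        for target in targets:
            if target not in refs:
                problems.append((oid, target))
    return sorted(problems)


toy_db = {
    0x00: [0x0A, 0x0B],  # the root refers to two objects
    0x0A: [0x0C],        # 0x0c is not in the database: a dangling reference
    0x0B: [],
}
dangling = find_dangling(toy_db)
```

Only direct references are examined, so each problem is reported against the object that actually holds the bad reference.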

fsrefs also (indirectly) checks that the .index file is sane, because fsrefs uses the index to get its idea of what constitutes "all the objects in the database".

Note these limitations: because fsrefs only looks at the current revision of objects, it does not attempt to load objects in versions, or non-current revisions of objects; therefore fsrefs cannot find problems in versions or in non-current revisions.

fsdump creates a human-readable rendering

Like fstest -vv, fsdump produces a line of output for each object in each transaction. Their purposes are quite different, though. fstest verifies the low-level well-formedness of the file, but fsdump assumes the FileStorage is in good shape, and prints more human-oriented information (for example, the times associated with transactions, and the names of object classes):

        fsdump Data.fs

There are no flags or optional arguments. Here's sample output from fsdump for a single transaction:

        Trans #00004 tid=035557d2b17e7bbb time=2004-05-25 14:10:41.600174 offset=1620
            status='p' user= description=Added ZGlobals
          data #00000 oid=0000000000000000 class=Persistence.PersistentMapping 
          data #00001 oid=000000000000000a class=BTrees.OOBTree.OOBTree

For contrast, this is fstest -vv output for the same transaction:

        1620: object oid 0x0000000000000000 #0
        1854: object oid 0x000000000000000a #1
        1583: transaction tid 0x035557d2b17e7bbb #4 

Overall, the fsdump output is more informative, but skips low-level information like the file offsets of individual data records -- if you're not tracking down a file corruption problem, the offsets aren't really that interesting. There are some peculiar differences too, primarily that the offset given by fsdump for a transaction is actually the offset of the first data record within the transaction, and that fstest produces output for a transaction record after the output for that transaction's data records.

Note that since fsdump assumes its input FileStorage is in good shape, it skips most sanity checks, and may produce silly output if the database is damaged.
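
The time that fsdump prints is derived from the tid itself. As far as I know, a tid is an 8-byte timestamp: the high 4 bytes pack year, month, day, hour and minute, and the low 4 bytes hold the seconds as a fraction of a minute. Treat that layout as an assumption on my part. This sketch applies it to the sample tid above, and it does reproduce the time fsdump displayed. `decode_tid` is a name made up for the example.

```python
def decode_tid(tid_hex):
    """Decode a 16-hex-digit tid into (year, month, day, hour, minute, second)."""
    raw = bytes.fromhex(tid_hex)
    if len(raw) != 8:
        raise ValueError("a tid is 8 bytes (16 hex digits)")
    high = int.from_bytes(raw[:4], "big")  # packed calendar fields
    low = int.from_bytes(raw[4:], "big")   # fraction of a minute
    minute = high % 60
    hour = high // 60 % 24
    day = high // (60 * 24) % 31 + 1
    month = high // (60 * 24 * 31) % 12 + 1
    year = high // (60 * 24 * 31 * 12) + 1900
    second = low * 60.0 / 2**32
    return year, month, day, hour, minute, second


decoded = decode_tid("035557d2b17e7bbb")
```

One consequence of this encoding is that tids sort in time order, which matches the oldest-to-newest order of transactions in the file.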

New in ZODB 3.4: data records also display their size, in bytes. For example, where before you might have seen:

        data #00003 oid=0000000000000011 class=Products.ZCatalog.Catalog.Catalog
        data #00004 oid=0000000000000018 class=BTrees.Length.Length

under ZODB 3.4 you might see:

        data #00003 oid=0000000000000011 size=595 class=Products.ZCatalog.Catalog.Catalog
        data #00004 oid=0000000000000018 size=38 class=BTrees.Length.Length
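
With sizes in the output, it's easy to see which classes account for the bytes in a FileStorage. Here's a sketch that assumes data lines look exactly like the samples above; real fsdump output may carry extra fields that this pattern doesn't expect, and lines that don't match are skipped. `bytes_per_class` is a made-up helper.

```python
import re
from collections import Counter

DATA_LINE = re.compile(
    r"^\s*data #(?P<seq>\d+) oid=(?P<oid>[0-9a-f]+)"
    r"(?: size=(?P<size>\d+))? class=(?P<cls>\S+)\s*$"
)


def bytes_per_class(lines):
    """Sum data record sizes by class; lines without size= are skipped."""
    totals = Counter()
    for line in lines:
        m = DATA_LINE.match(line)
        if m and m["size"] is not None:
            totals[m["cls"]] += int(m["size"])
    return totals


sample = [
    "  data #00003 oid=0000000000000011 size=595 class=Products.ZCatalog.Catalog.Catalog",
    "  data #00004 oid=0000000000000018 size=38 class=BTrees.Length.Length",
    "  data #00005 oid=0000000000000018 class=BTrees.Length.Length",  # pre-3.4 style
]
totals = bytes_per_class(sample)
```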

fsrecover attempts to repair FileStorage damage

XXX fsrecover.py

FileStorage corruption

Most people who use FileStorage never experience corruption. Those who do seem to see corruption regularly. No general cause is known, or even suspected, and it's usually difficult to track down a specific cause. Hardware or system software failure can, of course, corrupt any database, and some specific cases of this have been diagnosed: bad memory chip, bad disk, bad disk driver, a RAID controller that fails only under heavy load, a rogue C extension doing wild stores and corrupting system I/O buffers as a result. There are no known ways the current releases of ZODB or ZEO can cause FileStorage corruption.

By "corruption", I mean what corruption conventionally means for any file: the .fs and/or .fs.index file is damaged at the byte level, as if someone had overwritten some region (or regions) with nonsense bytes. Of course this can be a disaster when it occurs. Visible symptoms may include:

  • FileStorage.py raises CorruptedDataError. The exception detail in this case often reveals that a string of 42 NUL (0) bytes was read for a data record header. In such a case, investigation usually shows that a contiguous region of the FileStorage has been overwritten by NUL bytes, starting and stopping in the middle of otherwise-undamaged transactions.
  • FileStorage.py passes on this exception from Python's struct module:

    error: unpack str size does not match format

    A cause for this may be that the .fs and .fs.index files have gotten out of sync, perhaps because the .fs file was packed but the old (pre-pack) .fs.index file is still being used. It may help to delete the .fs.index file in that case (when a FileStorage is opened, it will recreate an .fs.index file if one doesn't already exist).
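
For the first symptom, a quick way to look for a NUL-overwritten region is to scan for long runs of zero bytes. This sketch works on a bytes object already in memory. The threshold of 42 is borrowed from the header size mentioned above, and `find_nul_runs` is a made-up name. A hit is only a hint, since legitimate data can contain runs of NUL bytes too, and for a large .fs file you'd want to scan in chunks rather than read it all at once.

```python
import re


def find_nul_runs(data, min_length=42):
    """Return (offset, length) for each run of at least `min_length` NUL bytes."""
    pattern = re.compile(rb"\x00{%d,}" % min_length)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


# 100 ordinary bytes, a 64-byte NUL region, then a short run that is ignored.
data = b"A" * 100 + b"\x00" * 64 + b"B" * 10 + b"\x00" * 8 + b"C" * 5
runs = find_nul_runs(data)
```

The offsets can then be compared with transaction offsets from fstest -v to see which transactions the damaged region touches.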

Because no specific, generally applicable cause for FileStorage corruption is known, it's hard to give advice to those who experience it. Because so many low-level system problems can be at fault, it's a good idea, when possible, to try running your app on a different machine. In our experience, this usually makes the problem go away -- but, of course, then you'll never determine the true cause of the problem on the original machine.

If you can't diagnose or fix a case of recurring FileStorage corruption, follow best practice, and prepare for frequent recovery.


comments:

where to run the repozo command? --sajith, Tue, 09 Dec 2008 09:09:33 -0500
Hi, where do I run this command?


... --dinesh, Tue, 14 Jul 2009 01:36:51 -0400
Where is the answer to the above question? I am also having the same question.

pkg_resources.DistributionNotFound: ZODB3==3.9.3 --danielt, Mon, 25 Jan 2010 17:40:45 -0500
When I tried to back up my Data.fs with repozo, I got the following error: pkg_resources.DistributionNotFound: ZODB3==3.9.3. What can I do to fix this?
