The production server ran for five months without ever flushing the database. After an OS upgrade, Hexarc was killed (not shut down), which forced a database recovery of five months' worth of changes!

It all worked, but we ran into a few scary moments:

  1. During recovery, memory usage grew without bound, probably because we couldn't collect garbage. We need to either pause recovery to collect garbage or (more usefully) switch to ref-counted garbage collection.
  2. Recovery took at least 30 minutes, but the boot process of Cryptosaur tries to open the database and times out after 30 seconds. This prevented the system from booting. At minimum we should increase the time-out. [Fortunately, shutting down the service flushed the database properly, so the next time we started, everything was fine.]
  3. And, of course, the ultimate cause of all this is not flushing the database periodically.
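
For item 2, one way to think about the longer time-out is a boot step that polls until the database reports ready, with the deadline as a parameter instead of a fixed 30 seconds. The following is only a sketch in Python, not Hexarc or Cryptosaur code; `wait_for_database`, `is_ready`, and the default values are hypothetical names chosen for illustration.

```python
import time

def wait_for_database(is_ready, timeout_s=3600.0, poll_s=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s elapses.

    is_ready is a hypothetical callable standing in for "the database
    finished opening (including any recovery)". Returns True if the
    database became ready, False if the deadline passed first.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

With a time-out sized for the worst recovery we have seen (at least 30 minutes here), the boot process would wait out a long recovery instead of giving up.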

See also: Recovery file should have a back up. The really scary part is that if the primary volume had failed, we would have lost five months of data. At minimum, we need to flush the database daily, if not hourly.
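
A periodic flush could be as simple as a background timer that calls the flush routine at a fixed interval and once more on clean shutdown. The sketch below is in Python and is not the actual fix; `PeriodicFlusher` and the `flush` callback are hypothetical stand-ins for whatever Hexarc uses to flush the database.

```python
import threading

class PeriodicFlusher:
    """Calls flush() every interval_s seconds on a background thread."""

    def __init__(self, flush, interval_s=3600.0):
        self._flush = flush            # hypothetical flush callback
        self._interval_s = interval_s  # hourly by default
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        # Signal the worker, wait for it to exit, then flush one last time
        # so a clean shutdown never leaves unflushed changes behind.
        self._stop.set()
        self._thread.join()
        self._flush()

    def _run(self):
        # Event.wait returns False on time-out and True once stop() is called.
        while not self._stop.wait(self._interval_s):
            self._flush()
```

With an hourly interval, a kill like the one above would leave at most about an hour of changes to recover rather than five months.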

george moromisato 21 Oct 2021:

Deployed to hexarc.com.