Post by Oswald Buddenhagen
Post by Yuri D'Elia
As for truncation, this might still happen if the file is not fsynced
explicitly at critical transaction points (including before fclose).
you're not getting truncation, but data corruption, as that's what
appending a number of null bytes is. there is _no_ standard that permits
this without an interim system crash, fsync or not.
Actually, without an fsync ***anything*** goes. In particular, if you
append to a file, and the system allocates a new block, it's fair game
for the file system to attach a block to the disk, but mark the block
as uninitialized, so that reads of that block return zeros.
That's not technically data corruption. All of the data up to the
last fsync is safe. What happens after the last fsync is up in the
air. The behavior I described is what XFS will do.
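
To make "the last fsync" concrete, here is a minimal sketch of an append that is only treated as done once fsync() has returned. The helper name `append_durably` is made up for illustration. Python is used only for brevity; `os.fsync` wraps the fsync() call discussed here.

```python
import os


def append_durably(path: str, data: bytes) -> None:
    """Append data and return only after fsync() has completed."""
    with open(path, "ab") as f:
        f.write(data)
        f.flush()             # move Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to commit it to stable storage
    # Closing the file does not by itself imply an fsync, which is why the
    # fsync above comes before the close.
```

Anything appended without the fsync step falls into the "up in the air" category after a crash.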
With ext4, we use delayed allocation, but the way we do data=ordered
is that we flush the data blocks *before* we do the commit, so in
practice it shouldn't be happening with ext4. However, we reserve the
right to switch how we do things in the future to be more like XFS,
since there are some performance advantages for not forcing out the
data block, but just marking the block as uninitialized and then
marking the block as initialized after the writeback completes.
If you mount with the data=writeback flag, then we don't force out
data blocks before we do a commit (which gives a performance
advantage, which is why some users might choose to use it), but it
means that it's possible for stale data (the previous contents of the
data block) to become revealed after a crash. But (and this is
important) it's completely legal as far as the POSIX standard is
concerned.
So if you care about this, I would strongly recommend that you include
a CRC of the contents of the transaction blocks in the commit record.
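
As a rough sketch of what such a commit record could look like: the layout below (a magic number followed by a CRC-32 of the transaction blocks) and the names `COMMIT_MAGIC`, `make_commit_record` and `commit_matches` are assumptions made for illustration, not the format of any particular program.

```python
import struct
import zlib

COMMIT_MAGIC = 0x434D4954      # arbitrary marker, an assumption of this sketch
COMMIT = struct.Struct("<II")  # (magic, CRC-32 of the transaction blocks)


def make_commit_record(blocks: bytes) -> bytes:
    """Build a commit record that covers the given transaction blocks."""
    return COMMIT.pack(COMMIT_MAGIC, zlib.crc32(blocks))


def commit_matches(blocks: bytes, record: bytes) -> bool:
    """True only if the record is well formed and its CRC matches the blocks."""
    if len(record) != COMMIT.size:
        return False
    magic, crc = COMMIT.unpack(record)
    return magic == COMMIT_MAGIC and crc == zlib.crc32(blocks)


blocks = b"debit A 10; credit B 10"
record = make_commit_record(blocks)
print(commit_matches(blocks, record))                 # blocks intact
print(commit_matches(b"\x00" * len(blocks), record))  # blocks read back as zeros
print(commit_matches(blocks, b"\x00" * COMMIT.size))  # commit record never landed
```

The magic number is there so that a commit record which reads back as all zeros is rejected outright, rather than relying on the CRC field alone.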
Also note that technically speaking, although fsync() guarantees that
after it returns, everything written is committed to stable store, it
does not guarantee the *order* in which data will be committed to
stable store before the fsync() completes. So if you want to be
technically correct, what you need to do is either (a) write the
transaction blocks, fsync, then write the commit record, and then
fsync a second time, or (b) write the transaction blocks, and write
the commit block with a CRC, and then fsync --- and then on the
replay, check the CRC in the commit block, and if the CRC does not
check out, discard the last transaction since it wasn't fully
committed to stable store before the crash.
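
The two options can be sketched roughly as follows. The on-disk layout (a length prefix, the payload, then a commit record holding a magic number and a CRC-32) and all of the names are assumptions made for this illustration. Option (a) strictly needs no CRC, but the sketch keeps one so that a single replay routine can read both.

```python
import os
import struct
import zlib

COMMIT_MAGIC = 0x434D4954      # arbitrary marker chosen for this sketch
LEN = struct.Struct("<I")      # payload length prefix
COMMIT = struct.Struct("<II")  # (magic, CRC-32 of the payload)


def write_all(fd, data):
    """os.write may write fewer bytes than asked for, so loop until done."""
    view = memoryview(data)
    while view:
        view = view[os.write(fd, view):]


def append_two_fsyncs(fd, payload):
    """Option (a): the commit record is written only once the data is durable."""
    write_all(fd, LEN.pack(len(payload)) + payload)
    os.fsync(fd)
    write_all(fd, COMMIT.pack(COMMIT_MAGIC, zlib.crc32(payload)))
    os.fsync(fd)


def append_one_fsync(fd, payload):
    """Option (b): a single fsync; replay relies on the CRC to spot a torn write."""
    write_all(fd, LEN.pack(len(payload)) + payload
              + COMMIT.pack(COMMIT_MAGIC, zlib.crc32(payload)))
    os.fsync(fd)


def replay(path):
    """Return the payloads of all fully committed transactions, in order."""
    with open(path, "rb") as f:
        log = f.read()
    done, pos = [], 0
    while pos + LEN.size <= len(log):
        (n,) = LEN.unpack_from(log, pos)
        start = pos + LEN.size
        end = start + n
        if end + COMMIT.size > len(log):
            break  # truncated tail, or a garbage length field
        payload = log[start:end]
        magic, crc = COMMIT.unpack_from(log, end)
        if magic != COMMIT_MAGIC or crc != zlib.crc32(payload):
            break  # not fully committed: discard this transaction and stop
        done.append(payload)
        pos = end + COMMIT.size
    return done
```

A caller would open the log with something like `os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)`, append transactions with either function, and run `replay(path)` after a restart. Option (a) costs a second fsync per transaction; option (b) trades that for the CRC check at replay time.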
(Yes, storage is hard. It's hard because users
insist on extreme performance, and so POSIX guarantees are fairly
loose. They have to be, or everyday performance would be horrific.
What this does mean is that if you want transaction / atomic
guarantees, you have all of the low-level tools, but it's up to the
application programmer or the database library implementor to use
those tools correctly.)
Best regards,
- Ted