Incomplete journal entries

Post by Yuri D'Elia
I'm running mbsync as part of a cron job under timeout(1).
Error: incomplete journal entry at /path.journal:2
where the journal file looks simply truncated (probably not flushed).
00000000: 320a 0000 0000 0000 0000 0000 0000 0000 2...............
00000010: 0000 0000 0000 0000 0000 0000 0000 ..............
I remember asking in the list how robust the journal mechanism, but it
doesn't seem reliable enough for mbsync to be SIGTERM'ed/interrupted.

that doesn't make sense to me. a signal may not alter what gets written
to the file, only how much of it is written. zero-padding may only
happen as a response to internal inconsistency, which a sane kernel
positively will not introduce itself just because something interrupted
a userspace process.
maybe you can strace your cron job to see what is happening.

Post by Yuri D'Elia
This is compounded by the fact that I cannot easily distinguish a
network error (that I want to ignore - as it's generally transient) from
an mbsync's internal error such as this.

i'm collecting error messages and grepping them. that way i filter out
all the silly noise from outlook.
getting proper exit code handling is kinda tough, because mbsync can
perform multiple independent jobs - do you report the best or the worst?
i guess it would be possible to write a precise report file with one
line per mailbox. however, undoubtedly the best solution would be
letting mbsync handle transient errors and the timing itself.

Yuri D'Elia

2016-10-27 20:15:27 UTC

Noticed this didn't get posted on the list, so reposting...

Post by Yuri D'Elia
where the journal file looks simply truncated (probably not flushed).
00000000: 320a 0000 0000 0000 0000 0000 0000 0000 2...............
00000010: 0000 0000 0000 0000 0000 0000 0000 ..............
I remember asking in the list how robust the journal mechanism, but it
doesn't seem reliable enough for mbsync to be SIGTERM'ed/interrupted.

It's hard to reproduce at will. The incomplete journal is consistently
the INBOX folder of an outlook365 server. But this is because this is
also the slowest folder to sync, and thus most likely to be interrupted.

As for truncation, this might still happen if the file is not fsynced
explicitly at critical transaction points (including before fclose).

I didn't look at the code yet, so I'm just reporting what I'm seeing and
guessing at this point.

Post by Oswald Buddenhagen
i'm collecting error messages and grepping them. that way i filter out
all the silly noise from outlook.
getting proper exit code handling is kinda tough, because mbsync can
perform multiple independent jobs - do you report the best or the worst?
i guess it would be possible to write a precise report file with one
line per mailbox. however, undoubtedly the best solution would be
letting mbsync handle transient errors and the timing itself.

I would actually be pretty happy to have two (or more) ORed flags for
the exit status as many tools already do, in order to distinguish the
broad categories.

For example. !=0 is any problem, but then the status&0x0F range could
mark just transient error classes (such as connection/timeout errors).
As an end-user, I can decide what to do.

My plan would be allow for up to three transient errors to pass without
notice, or any non-transient error to be notified immediately. Easily
done in *sh. Logs are a PITA to parse, and technically I could already
do this by parsing the regular output (which I'd like to avoid).

Oswald Buddenhagen

2016-10-28 08:04:12 UTC

Post by Yuri D'Elia
It's hard to reproduce at will. The incomplete journal is consistently
the INBOX folder of an outlook365 server. But this is because this is
also the slowest folder to sync, and thus most likely to be interrupted.

you wouldn't have gotten it several times already if it was hard. it
seems to me that it would be sufficient to interrupt it while it is
fetching the message list.

Post by Yuri D'Elia
As for truncation, this might still happen if the file is not fsynced
explicitly at critical transaction points (including before fclose).

you're not getting truncation, but data corruption, as that's what
appending a number of null bytes is. thers is _no_ standard that permits
this without an interim system crash, fsync or not.

Post by Yuri D'Elia
I would actually be pretty happy to have two (or more) ORed flags for
the exit status as many tools already do, in order to distinguish the
broad categories.

that's doable, but there are still some grey zones in the
interpretation. for example, dns lookup failure is generally a permanent
error - except when it's actually caused by a network failure. in
principle there are separate error codes for that, yet somehow i'm
regularly getting NXDOMAIN in this case. certainly something botched in
the setup of the dns proxy on my host or router, but still real.

Theodore Ts'o

2016-10-28 14:59:42 UTC

Post by Yuri D'Elia
As for truncation, this might still happen if the file is not fsynced
explicitly at critical transaction points (including before fclose).

you're not getting truncation, but data corruption, as that's what
appending a number of null bytes is. thers is _no_ standard that permits
this without an interim system crash, fsync or not.

Actually, without an fsync ***anything*** goes. In particular, if you
append to a file, and the system allocates a new block, it's fair game
for the file system to attach a block to the disk, but mark the block
the as uninitalized, so that reads to that block results in zeros.
That's not technically data corruption. All of the data up to the
last fsync is safe. What happens after the last fsync is up in the
air. The behavior I described is what XFS will do.

With ext4, we use delayed allocation, but the way we do data=ordered
is that we flush the data blocks *before* we do the commit, so in
practice it shouldn't be happening with ext4. However, we reserve the
right to switch how we do things in the future to be more like XFS,
since there are some performance advantages for not forcing out the
data block, but just marking the block as uninitalized and then
marking the block as initialized after the writeback completes.

If you mount with the data=writeback flag, then we don't force out
data blocks before we do a commit (which gives a performance
advantage, which is why some users might choose to use it), but it
means that it's possible for stale data (the previous contents of the
data block) to become revealed after a crash. But (and this is
important) it's completelly legal as far as the POSIX standard is
concerned.

So if you care about this, I would strongly recommend that you include
a CRC of the contents of the transaction blocks in the commit record.

Also note that technically speaking, although fsync() guarantees that
after it returns, everything written is committed to stable store, it
does not guarantee about the *order* that data will be commited to
stable store before the fsync() completes. So if you want to be
technically correct, what you need to do is either (a) write the
transaction blocks, fsync, then write the commit record, and then
fsync a second time, or (b) write the transaction blocks, and write
the commit block with a CRC, and then fsync --- and then on the
replay, check the CRC in the commit block, and if the CRC does not
check out, discard the last transaction since it wasn't fully
committed to stable store before the crash.

(Yes, storage is hard. The reason why it's hard is because users
insist on extreme performance, and so POSIX guarantes are fairly
loose. They have to be, or every day performance would be horrific.
What this does mean is that if you want transaction / atomic
guarantees, you have all of the low-level tools, but it's up to the
application programmer or the database library implementor to use
those tools corretly.)

Best regards,

- Ted

Evgeniy Berdnikov

2016-10-28 16:21:58 UTC

Post by Yuri D'Elia
As for truncation, this might still happen if the file is not fsynced
explicitly at critical transaction points (including before fclose).

you're not getting truncation, but data corruption, as that's what
appending a number of null bytes is. thers is _no_ standard that permits
this without an interim system crash, fsync or not.

Userspace process generally DO NOT read raw blocks from disk.
It uses kernel's buffer space, so buffered data should be consistent
regardless of disk contents.

Oswald is right, process completion (by exit() or interrupt by signal)
can only truncate data, but can not corrupt. Only system crash can
lead to data corruption.

--
Eugene Berdnikov

Theodore Ts'o

2016-10-28 18:36:42 UTC

Post by Evgeniy Berdnikov
Userspace process generally DO NOT read raw blocks from disk.
It uses kernel's buffer space, so buffered data should be consistent
regardless of disk contents.
Oswald is right, process completion (by exit() or interrupt by signal)
can only truncate data, but can not corrupt. Only system crash can
lead to data corruption.

I could be confused, but I thought the original bug report involved
the journal containing unexpected zero bytes after a system crash?

- Ted

Oswald Buddenhagen

2016-10-28 20:22:13 UTC

Post by Evgeniy Berdnikov
Oswald is right, process completion (by exit() or interrupt by signal)
can only truncate data, but can not corrupt. Only system crash can
lead to data corruption.

I could be confused, but I thought the original bug report involved
the journal containing unexpected zero bytes after a system crash?

it didn't.

Evgeniy Berdnikov

2016-10-28 20:31:49 UTC

...

Post by Evgeniy Berdnikov
Oswald is right, process completion (by exit() or interrupt by signal)
can only truncate data, but can not corrupt. Only system crash can
lead to data corruption.

I could be confused, but I thought the original bug report involved
the journal containing unexpected zero bytes after a system crash?

Original mail stated that isync is started as cron job under "timeout"
utility (designed to spawn a command and then terminate it by signal
if command run too much time). Sometimes journal file gets truncated
records with trailing zeros. Looks like a result of system crash, yes,
but crashes were not mentioned.

--
Eugene Berdnikov

Yuri D'Elia

2016-10-29 21:29:08 UTC

I could be confused, but I thought the original bug report involved
the journal containing unexpected zero bytes after a system crash?

It can also happen (this is a laptop, and when the battery goes the fs
is not synced), and from this instance it can be the case.

But it's too frequent to be always the case. I'll collect one broken
journal making sure the system didn't go down for some reason.

Evgeniy Berdnikov

2016-10-30 00:29:56 UTC

Post by Theodore Ts'o
I could be confused, but I thought the original bug report involved
the journal containing unexpected zero bytes after a system crash?

It can also happen (this is a laptop, and when the battery goes the fs
is not synced), and from this instance it can be the case.
But it's too frequent to be always the case. I'll collect one broken
journal making sure the system didn't go down for some reason.

I propose to run strace in your cron job, to clarify whether corrupted
data is written by isync, i.e. change command to something like

strace -s128 -o /var/log/`date +%d%H%M%S`/isync.strace \
-e open,read,write -f timeout ... isync ...

Note that in crontab(5) each % should be escaped as \% and
complete command should be written on a single line.

--
Eugene Berdnikov

Evgeniy Berdnikov

2016-12-23 14:18:44 UTC

Hello.

Post by Yuri D'Elia
As for truncation, this might still happen if the file is not fsynced
explicitly at critical transaction points (including before fclose).

you're not getting truncation, but data corruption, as that's what
appending a number of null bytes is. thers is _no_ standard that permits
this without an interim system crash, fsync or not.

Userspace process generally DO NOT read raw blocks from disk.
It uses kernel's buffer space, so buffered data should be consistent
regardless of disk contents.
Oswald is right, process completion (by exit() or interrupt by signal)
can only truncate data, but can not corrupt. Only system crash can
lead to data corruption.

Oswald, I got the same Mbsync error without any crash and/or reboot,
after several monthes of normal run. Mbsync was invoked in a loop
"while sleep 30 ; do timeout 15s mbsync -a -V ; done".
Any concurrency with other mbsync process is completely ruled out.

Error message was iether "mailformed journal entry" or "incomplete
journal entry", I do not remember exactly. The broken journal file
is 33 bytes length:

% hd .mbsyncstate.journal
00000000 32 0a 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |2...............|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000021

After deletion of this file Mbsync continued to run as usual.

I use Mbsync is from "isync-1.2.1-2" Debian package (Linux, 32-bit).

--
Eugene Berdnikov

Yuri D'Elia

2016-12-24 14:18:46 UTC

Post by Evgeniy Berdnikov
Oswald is right, process completion (by exit() or interrupt by signal)
can only truncate data, but can not corrupt. Only system crash can
lead to data corruption.

Oswald, I got the same Mbsync error without any crash and/or reboot,
after several monthes of normal run. Mbsync was invoked in a loop
"while sleep 30 ; do timeout 15s mbsync -a -V ; done".
Any concurrency with other mbsync process is completely ruled out.
Error message was iether "mailformed journal entry" or "incomplete
journal entry", I do not remember exactly. The broken journal file
% hd .mbsyncstate.journal
00000000 32 0a 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |2...............|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000021
After deletion of this file Mbsync continued to run as usual.

I'm glad mine is not the only case. I couldn't rule out other scenarios
completely so far.

Oswald Buddenhagen

2016-12-25 23:48:20 UTC

Post by Evgeniy Berdnikov
Oswald is right, process completion (by exit() or interrupt by signal)
can only truncate data, but can not corrupt. Only system crash can
lead to data corruption.

Oswald, I got the same Mbsync error without any crash and/or reboot,
after several monthes of normal run. Mbsync was invoked in a loop
"while sleep 30 ; do timeout 15s mbsync -a -V ; done".
Any concurrency with other mbsync process is completely ruled out.
Error message was iether "mailformed journal entry" or "incomplete
journal entry", I do not remember exactly. The broken journal file
% hd .mbsyncstate.journal
00000000 32 0a 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |2...............|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000021
After deletion of this file Mbsync continued to run as usual.

I'm glad mine is not the only case. I couldn't rule out other scenarios
completely so far.

even after looking at the code several more times, i plain can't see how
this would be possible. it seems that *someting* is pre-allocating space
in the file, but is getting interrupted before writing the actual data.
alternatively, it's actually appending nulls for some bizarre reason.
ltrace and strace of the process could help for starters - if something
fishy is done from user space, this should become immediately obvious.

Yuri D'Elia

2016-12-26 12:06:48 UTC

Post by Oswald Buddenhagen
even after looking at the code several more times, i plain can't see how
this would be possible. it seems that *someting* is pre-allocating space
in the file, but is getting interrupted before writing the actual data.
alternatively, it's actually appending nulls for some bizarre reason.

This looks like an initial open/creat, where a write was interrupted
with no flush. The nulls should either be a multiple size of stdbuf (if
libc regular IO is used) or the default block alloc of the underlying
fs (depending on where the write was interrupted).

The nulls are there because the space was allocated, but nobody wrote
into it yet.

Post by Oswald Buddenhagen
ltrace and strace of the process could help for starters - if something
fishy is done from user space, this should become immediately obvious.

I had to stop logging at some point, as I couldn't reproduce it when
running under strace (probably due to the added latencies).

Evgeniy Berdnikov

2016-12-26 12:43:37 UTC

In my case broken journal file has 33 bytes: "2", "\n" and 31 NULLs.
But file allocation should be multiple of 512 bytes, and I suspect the
defult libc buffer size is much higher than 31 and multiple of 2.

Post by Yuri D'Elia
into it yet.

This assumption is quite natural, but it conradicts at least two facts:
1. small data record sizes and 2. the model of file operation (unflushed
data can appear in a file only on crash/reboot or severe kernel bug).

Post by Oswald Buddenhagen
ltrace and strace of the process could help for starters - if something
fishy is done from user space, this should become immediately obvious.

I had to stop logging at some point, as I couldn't reproduce it when
running under strace (probably due to the added latencies).

No doubts, this bug is rare, so it's very hard to reproduce it.

I've got damage when internet route to server was very slow and unstable
(high %loss and frequent breaks). Probably bug is triggered on processing
of network timeout condition. It may be some write() to wrong fd and/or
with erroneous pointer to data.

--
Eugene Berdnikov

Evgeniy Berdnikov

2017-05-29 09:57:23 UTC

Hi folks,

I've got the broken journal again (mbsync 1.2.1 from Debian distribution).

# ls -al
total 300
drwx--x--x 5 berd zp 4096 May 29 12:44 .
drwxrwsr-x 5 root mail 4096 Jan 11 2015 ..
-rw-r--r-- 1 berd zp 12444 May 27 20:06 .mbsyncstate
-rw-r--r-- 1 berd zp 33 May 27 20:07 .mbsyncstate.journal
-rw-r--r-- 1 berd zp 12444 May 27 20:07 .mbsyncstate.new
-rw------- 1 berd zp 18 May 27 20:02 .uidvalidity
drwx--x--x 2 berd zp 118784 May 26 15:19 cur
drwx--x--x 2 berd zp 126976 May 27 20:02 new
drwx--x--x 2 berd zp 4096 May 27 20:02 tmp

# hd .mbsyncstate.journal
00000000 32 0a 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |2...............|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000021

# stat .mbsyncstate.journal
File: .mbsyncstate.journal
Size: 33 Blocks: 8 IO Block: 4096 regular file
Device: 808h/2056d Inode: 146871 Links: 1
Access: (0644/-rw-r--r--) Uid: (24826/ berd) Gid: ( 1307/ zp)
Access: 2017-05-29 12:34:11.859035473 +0300
Modify: 2017-05-27 20:07:12.235838762 +0300
Change: 2017-05-27 20:07:12.235838762 +0300
Birth: -

Files .mbsyncstate and .mbsyncstate.new are identical. In this case
reason of the crash was power fail 27.05 20:07, causing host to reboot.
Thank you for attention.

Post by Yuri D'Elia
into it yet.

1. small data record sizes and 2. the model of file operation (unflushed
data can appear in a file only on crash/reboot or severe kernel bug).