I'm facing a strange race condition in my bash program. I tried to reproduce it with a simple demo program but, as with most timing-related races, I couldn't.
Here's an abstracted version of the program that DOES NOT reproduce the issue, but let me explain anyway:
# Abstracted version of the original program
# that is NOT able to demo the race.
#
function foo() {
    local instance=$1
    # [A lot of logic here -
    # all foreground commands, nothing in the background.]
    echo "$instance: test" > "/tmp/foo.$instance.log"
    echo "Instance $instance ended"
}
# Launch the process in background...
#
echo "Launching instance 1"
foo 1 &
# ... and wait for it to complete.
#
echo "Waiting..."
wait
echo "Waiting... done. (wait exited with: $?)"
# This ls command ALWAYS fails in the real
# program in the 1st while-iteration, complaining about
# missing files, but works in the 2nd iteration!
#
# It always works in the very 1st while-iteration of the
# abstracted version.
#
while ! ls -l /tmp/foo.*; do
    :
done
In my original program (and NOT in the above abstracted version), I do see Waiting... done. (wait exited with: 0) on stdout, just as I do in the above version. Yet ls -l always fails in the very first while-loop iteration of the original, but always works in the first iteration of the abstracted version.
Also, the ls command fails even though the Instance 1 ended message has already appeared on stdout. The output is:
$ ./myProgram
Launching instance 1
Waiting...
Waiting... done. (wait exited with: 0)
Instance 1 ended
ls: cannot access '/tmp/foo.*': No such file or directory
/tmp/foo.1
$
I noticed that the while loop can be safely done away with if I put a sleep 1 right before ls in my original program, like so:
# This too works in the original program:
sleep 1
ls -l /tmp/foo.*
Question: Why isn't wait working as expected in my original program? Any suggestions to at least help troubleshoot the problem?
I'm using bash 4.4.19 on Ubuntu 18.04.
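One troubleshooting step that might narrow this down is to wait on the specific PID rather than using a bare wait, and to print which jobs the waiting shell actually knows about. The following is only a minimal sketch of that idea, not the original program: the log_dir, pid, and status variables are hypothetical names introduced here, and a scratch directory from mktemp stands in for /tmp.

```shell
#!/usr/bin/env bash
# Troubleshooting sketch (assumed names, not the original program):
# wait on the exact background PID and inspect the shell's job table.

log_dir=$(mktemp -d)    # assumption: scratch directory instead of /tmp

foo() {
    local instance=$1
    echo "$instance: test" > "$log_dir/foo.$instance.log"
    echo "Instance $instance ended"
}

echo "Launching instance 1"
foo 1 &
pid=$!                  # PID of the background job just started

echo "Background PID: $pid"
echo "Jobs known to this shell:"
jobs -p                 # if this prints nothing, a bare wait has nothing to wait for

echo "Waiting..."
wait "$pid"             # fails with 127 if $pid is not a child of this shell
status=$?
echo "Waiting... done. (wait $pid exited with: $status)"

ls -l "$log_dir"/foo.*
```

If wait "$pid" reports a non-zero status, or jobs -p prints nothing in the original program, that would suggest the shell calling wait is not the one that launched the background function (for example, because the launch happens inside a subshell or pipeline).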
EDIT: I also just verified that the call to wait in the original, failing program exits with a status code of 0.
EDIT 2: Shouldn't the Instance 1 ended message appear BEFORE Waiting... done. (wait exited with: 0)? Could this be a 'flushing problem' with the OS's disk buffer/cache when dealing with background processes in bash?
EDIT 3: If, instead of the while loop or sleep 1 hacks, I issue a sync command, then, voila, it works! But why should I have to do a sync in one program but not the other?