I'm facing a strange race condition in my bash program. I tried to reproduce it with a simple demo program but, as is true of most attempts to demonstrate timing-related races, I couldn't.

Here's an abstracted version of the program that DOES NOT duplicate the issue, but let me still explain:

# Abstracted version of the original program
# that is NOT able to demo the race.
#
function foo() {
    local instance=$1

    # [A lot of logic here -
    #  all foreground commands, nothing in the background.]

    echo "$instance: test" > /tmp/foo.$instance.log        
    echo "Instance $instance ended"
}

# Launch the process in background...
#
echo "Launching instance 1"
foo 1 &

# ... and wait for it to complete.
#
echo "Waiting..."
wait
echo "Waiting... done.  (wait exited with: $?)"

# This ls command ALWAYS fails in the real
# program in the 1st while-iteration, complaining about 
# missing files, but works in the 2nd iteration!
#
# It always works in the very 1st while-iteration of the
# abstracted version.
#
while ! ls -l /tmp/foo.*; do
    :
done

In my original program (and NOT in the above abstracted version), I do see Waiting... done. (wait exited with: 0) on stdout, just as I see in the above version. Yet, the ls -l always fails in the original, but always works in the above abstracted version in the very first while loop iteration.

Also, the ls command fails despite seeing the Instance 1 ended message on stdout. The output is:

$ ./myProgram
Launching instance 1
Waiting...
Waiting... done. (wait exited with: 0)
Instance 1 ended
ls: cannot access '/tmp/foo.*': No such file or directory
/tmp/foo.1
$

I noticed that the while loop can be safely done away with if I put a sleep 1 right before ls in my original program, like so:

# This too works in the original program:
sleep 1
ls -l /tmp/foo.*

Question: Why isn't wait working as expected in my original program? Any suggestions to at least help troubleshoot the problem?

I'm using bash 4.4.19 on Ubuntu 18.04.

EDIT: I just also verified that the call to wait in the original, failing program is exiting with a status code of 0.

EDIT 2: Shouldn't the Instance 1 ended message appear BEFORE Waiting... done. (wait exited with: 0)? Could this be a 'flushing problem' with the OS's disk buffer/cache when dealing with background processes in bash?

EDIT 3: If, instead of the while loop or sleep 1 hacks, I issue a sync command, then, voila, it works! But why should I have to do a sync in one program but not the other?

1 Answer

I noticed that the following three hacks all work, but I'm not quite sure why:

Hack 1

while ! ls -l /tmp/foo.*; do
    :
done

Hack 2

sleep 1
ls -l /tmp/foo.*

Hack 3

sync
ls -l /tmp/foo.*

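Could this be a 'flushing problem' with the OS's disk buffer/cache, especially when dealing with background processes, and especially in bash? In other words, the call to wait seems to return before the disk cache is flushed (or before the OS realizes on its own that it should flush the cache and finishes doing so).

EDIT: Thanks to @Jon, whose guess was very close and got me thinking in the right direction, and to @chepner for the age-old advice of tweaking the program bit by bit.

The real problem: I was starting foo not directly, as shown in the inaccurate abstracted version in my original question, but via another function, launchThread, which, after doing some bookkeeping, would also run foo 1 & in its body. And the call to launchThread was itself suffixed with an &! So my wait was really waiting on launchThread and not on foo! The sleep, the sync and the while loop were just buying more time for foo to complete, which is why introducing them worked. The following is a more accurate demonstration of the problem, even though you may or may not be able to reproduce it on your own system (due to scheduling/timing differences across systems):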

#!/bin/bash -u

function now() {
    date +'%Y-%m-%d %H:%M:%S'
}

function log() {
    echo "$(now) - $*" >> "$logDir/log" # Line 1
}

function foo() {
    local msg=$1
    log "$msg"
    echo "  foo ended"
}

function launchThread() {
    local f=$1
    shift
    "$f" "$@" &  # Line 2
}

logDir=/tmp/log

/bin/rm -rf "$logDir"
mkdir -p "$logDir"

echo "Launching foo..."
launchThread foo 'message abc' &  # Line 3

echo "Waiting for foo to finish..."
wait
echo "Waiting for foo to finish... done. (wait exited with: $?)"

ls "$logDir"/log*

Output of the above buggy program:

Launching foo...
Waiting for foo to finish...
Waiting for foo to finish... done. (wait exited with: 0)
  foo ended
ls: cannot access '/tmp/log/log*': No such file or directory
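
Since the race above depends on timing, here is a minimal sketch that should make the same effect show up every time. It assumes a 1-second sleep as a stand-in for slow work; the names slowJob, launcher, fileState and tmpDir are made up for this illustration and are not part of my real program. The point is that wait only covers direct children of the shell that calls it: the backgrounded launcher subshell exits as soon as it has forked slowJob, so wait returns while slowJob is still running.

```shell
#!/bin/bash
# Hypothetical demo, not the original program.
tmpDir=$(mktemp -d)

slowJob() {
    sleep 1                  # stand-in for work that takes a while
    touch "$tmpDir/done"
}

launcher() {
    slowJob &                # becomes a grandchild of the main shell
}

fileState() {
    if [[ -e "$tmpDir/done" ]]; then echo exists; else echo missing; fi
}

launcher &                   # direct child: exits right after forking slowJob
wait                         # returns once the launcher subshell is gone
afterWait=$(fileState)
echo "right after wait: file $afterWait"

sleep 2                      # give slowJob time to finish
afterSleep=$(fileState)
echo "after sleeping: file $afterSleep"

rm -rf "$tmpDir"
```

On my understanding of bash, the first message should report the file as missing and the second as existing, which mirrors the behaviour of the ls in the buggy program.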

If I remove the & from EITHER Line 2 OR Line 3, the program works correctly, with the following output:

Launching foo...
Waiting for foo to finish...
  foo ended
Waiting for foo to finish... done. (wait exited with: 0)
/tmp/log/log

The program also works correctly if I remove $(now) from Line 1.
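
For anyone hitting the same thing, one way to keep a launcher function and still have wait do its job might look like the sketch below. It is an assumption-laden illustration rather than my real code: lastPid, waitStatus and logContents are hypothetical names, the sleep 1 is a stand-in for slow work, and a mktemp directory replaces /tmp/log. The launcher is called in the foreground, so the job it backgrounds is a direct child of the main shell, and its PID (from $!) can be passed to wait.

```shell
#!/bin/bash
# Hypothetical sketch of a launcher whose job the main shell can wait on.
logDir=$(mktemp -d)

foo() {
    sleep 1                          # stand-in for slow work
    echo "$1" >> "$logDir/log"
}

launchThread() {
    "$@" &                           # backgrounded in the calling shell
    lastPid=$!                       # remember the PID of that job
}

launchThread foo 'message abc'       # note: no trailing & here
wait "$lastPid"                      # waits for foo itself
waitStatus=$?

logContents=$(cat "$logDir/log")
echo "wait exited with: $waitStatus"
echo "log contains: $logContents"

rm -rf "$logDir"
```

Waiting on an explicit PID also makes wait return that job's exit status, which a bare wait does not give you.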

Answered 2018-07-12T04:25:18.010