4

我目前正在阅读 Advanced Bash-Scripting Guide 并发现以下内容:

   # Generate binary choice, that is, "true" or "false" value.
   BINARY=2
   T=1
   number=$RANDOM

   let "number %= $BINARY"
   #  Note that    let "number >>= 14"    gives a better random distribution
   #+ (right shifts out everything except last binary digit).
   if [ "$number" -eq $T ]
   then
       echo "TRUE"
   else
       echo "FALSE"
   fi  

   echo

为什么建议取第 15 位而不是第 1 位?使用二元决策的几次运行表明两者之间没有显着差异。

// 更新 因为有人问我是如何计算分布的,所以我们开始吧。我生成了几个 $RANDOM 数字,取了每个数字的第 15 位和第 1 位,并创建了两个二进制序列。之后,我遍历了这些序列,检查了 1 和 0 的链(运行),计算了最大长度序列将生成多少次运行(供参考)并将所有内容打印到一个混乱的表中。这是所有荣耀的代码(对不起,脏代码......):

#! /bin/bash
COUNT=10000
RUN=1

# generate 2 sequences based on the same $RANDOM numbers
# seq1 = modulo 2, seq2 = bitshift 14
while [ $RUN -le $COUNT ]
do
    number=$RANDOM 
    let 'var1=number%2'
    var2=$number 
    let 'var2 >>= 14'
    seq1="${seq1}${var1}"
    seq2="${seq2}${var2}"
    (( RUN+=1 ))
done

# loop through sequences and check for chains of 1 and 0 (runs)
length=${#seq1}
prevSym=${seq1:0:1}
currRun="${prevSym}"
for (( i=1; i<length; i++ )); do
    currSym=${seq1:$i:1}
    if (( currSym==prevSym )); then
        currRun="${currRun}${currSym}"
        (( i!=length-1 )) && continue
        (( runStat1[${#currRun}]++ ))               #case: ends with run length > 1
        break
    fi
    (( runStat1[${#currRun}]++ ))
    (( prevSym=currSym ))
    (( i==length-1 )) && (( runStat1[1]++ ))             #case: ends with run length = 1
    currRun="${currSym}"
done

length=${#seq2}
prevSym=${seq2:0:1}
currRun="${prevSym}"
for (( i=1; i<length; i++ )); do
    currSym=${seq2:$i:1}
    if (( currSym==prevSym )); then
        currRun="${currRun}${currSym}"
        (( i!=length-1 )) && continue
        (( runStat2[${#currRun}]++ ))               #case: ends with run length > 1
        break
    fi
    (( runStat2[${#currRun}]++ ))
    (( prevSym=currSym ))
    (( i==length-1 )) && (( runStat2[1]++ ))             #case: ends with run length = 1
    currRun="${currSym}"
done

# print results and expected frequency
# number of expected runs with runlength k:
# 1/2**k if k<n, 1/2**(k-1) if k=n  
# $RANDOM generates random numbers in the range 0 to 32768 thus n=15
n=15
echo -e "Length L of run | # of runs with %2 | # of runs with >>14 | # of runs with MLS (calculated)\n "
echo -e "L\t|%2\t|>>14\t|MLS"
echo -e "-----------------------------------\n"
sorted="${!runStat1[*]} ${!runStat2[*]}" 
sorted=$(echo $sorted | tr ' ' '\n' | sort -n | uniq)
for a in $sorted; do
    k=${a}
    (( ${a}==${n} )) && (( k=a-1 ))
    prob=$(awk -v k=${a} -v c=${COUNT} 'BEGIN { print (((1/2)**k)*c)/k}')
    echo -e "${a} \t| ${runStat1[$a]} \t| ${runStat2[$a]} \t| ${prob} "
done

运行它会打印出以下内容:

Length L of run | # of runs with %2 | # of runs with >>14 | # of runs with MLS (calculated)
L   |%2 |>>14   |MLS
-----------------------------------

1   | 2495  | 2450  | 5000 
2   | 1219  | 1212  | 1250 
3   | 638   | 621   | 416.667 
4   | 300   | 329   | 156.25 
5   | 162   | 166   | 62.5 
6   | 75    | 81    | 26.0417 
7   | 46    | 34    | 11.1607 
8   | 23    | 26    | 4.88281 
9   | 13    | 7     | 2.17014 
10  | 2     | 6     | 0.976562 
11  | 1     | 1     | 0.443892 
13  | 3     |   | 0.0939002 
15  |   | 2     | 0.0203451 
21  |   | 1     | 0.000227065 

这使我得出的结论是,不出所料,并且在所有 bash 参考资料中也提到,$RANDOM 是一个可怕的随机性来源......但“number >>= 14”也没有比“number %=”更好的随机分布2" 用于二元选择。

......或者我在这个巨大的愚蠢计算中犯了一个巨大的错误。你告诉我。

4

1 回答 1

2

使用高位的建议是因为许多随机数生成器是作为线性同余生成器实现的,这会在低位中产生较差的随机性。

例如,以下 RNG 实现曾经很常见。(我相信它是在 C89 标准中作为示例给出的。)

unsigned old_rand() {
  next = next * 1103515245 + 12345;
  return next;
}

现在看看这会产生什么样的数字。

2140733074   // even
3902869603   // odd
4012135520   // even
2255314201   // odd
3913576926   // even
2626310079   // odd
4159329932   // even
1903014357   // odd

位 1 根本不是随机的。

正如这个漂亮的图形演示所示,即使是更高质量的 LCG,如 Java 中使用的 LCG,也会受到这种影响。所以不要相信未知 RNG 的低位。

于 2013-04-24T00:11:09.440 回答