bash - How do I match numbers on one list (e.g. 2 and 3) with an approximate sum on another list (e.g. 5)?

Question

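I am trying to match some audio files with some written passages of text.

I started with a single audio file of a person reading a typed passage. I then split the audio file at each period of silence, using sox, and similarly split the typed text so that each sentence is on its own line.

However, the audio splits did not land neatly on every period (full stop); they happened wherever the speaker paused. I need to create a list of which audio files correspond to which typed sentences, e.g.: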

0001.wav This is a sentence.
0002.wav This is another sentence.

Note that sometimes 2 or more audio files correspond to one sentence, e.g.:

0001.wav ("This is a") + 0002.wav ("sentence.") = "This is a sentence."

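To help match the text, I used software to count the syllables in the audio and to count the syllables in the typed text.

I have two files with this data. The first, "sentences.txt", is a list of all the sentences in the text, one per line, each preceded by its syllable count, e.g.: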

5 This is a sentence.
7 This is another sentence.
8 This is yet another sentence.
9 This is still yet another sentence.

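I can strip out the sentence text with awk -F" " '{ print $1 }' sentences.txt to get this syllables_in_text.txt:

5
7
8
9

The second file, syllables_in_audio.txt, lists the audio files in the same order, with roughly the same syllable counts. Sometimes a count is slightly lower than the actual number in the text, because the syllable-counting software is not perfect: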

0001.wav 3
0002.wav 2
0003.wav 4
0004.wav 5
0005.wav 7
0006.wav 3
0007.wav 2
0008.wav 3

How can I print a list of audio files ("output.txt") so that the audio file names appear on the same line as the corresponding sentence in "sentences.txt", e.g.:

0001.wav 0002.wav
0003.wav 0004.wav
0005.wav
0006.wav 0007.wav 0008.wav

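Below is a table of the two files, showing how they line up when placed side by side. Files "0001.wav" and "0002.wav" are both needed to make up the sentence "This is a sentence." These file names are listed on line 1 of "output.txt", while the corresponding sentence is written as text on line 1 of "sentences.txt":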

Contents of "output.txt":    | Contents of "sentences.txt":
0001.wav 0002.wav            | 5 This is a sentence.
0003.wav 0004.wav            | 7 This is another sentence.
0005.wav                     | 8 This is yet another sentence.
0006.wav 0007.wav 0008.wav   | 9 This is still yet another sentence.
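
For reference, a side-by-side view like the table above could be produced with paste once "output.txt" exists. This is only a sketch: it assumes both files have the same number of lines, and side_by_side is a hypothetical helper name, not something from the question.

```shell
#!/bin/sh
# Sketch: print two files next to each other, separated by " | ".
# side_by_side is a hypothetical helper name.
# Usage: side_by_side output.txt sentences.txt
side_by_side() {
    # paste joins line N of both files with a tab;
    # awk then pads the left column to a fixed width of 28 characters.
    paste "$1" "$2" | awk -F '\t' '{ printf "%-28s | %s\n", $1, $2 }'
}
```

The column width of 28 is arbitrary and only needs to be at least as wide as the longest line of file names.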

score 1 · Accepted Answer

You can create an awk script along the following lines. Pseudocode:

BEGIN {
        init counter=1
        read your first file (syllables_in_text.txt) with getline till the end (while...)
            store its value in firstfile[counter]
            counter++
        # when you have finished reading your first file
        init another_counter=1 and running_total=0
        read your second file (syllables_in_audio.txt) with getline till the end (while...)
            if running_total + $2 (second col from your file) <= firstfile[another_counter]
                 running_total += $2
                 append $1 like o[another_counter]=o[another_counter] " " $1
            else
                 another_counter++
                 running_total = $2
                 append $1 like o[another_counter]=o[another_counter] " " $1
        finally loop over the o array after sorting it by index
            print its contents after removing the leading space
}
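
The pseudocode above can be turned into a runnable sketch. This is one possible reading rather than the answerer's exact script. It uses the common two-file awk idiom (NR == FNR) instead of getline, and, as an assumption, it replaces the plain <= test with a "does adding this clip bring the running total closer to the target" test, because in the question's sample data the audio counts sometimes overshoot the text count (0003.wav + 0004.wav give 9 against 7). The name match_audio is hypothetical.

```shell
#!/bin/sh
# Sketch: greedy "closest sum" grouping of audio clips per sentence.
# match_audio is a hypothetical helper name, not from the original answer.
# Usage: match_audio syllables_in_text.txt syllables_in_audio.txt > output.txt
match_audio() {
    awk '
    function abs(x) { return x < 0 ? -x : x }
    BEGIN { s = 1 }
    # First file: one target syllable count per sentence.
    NR == FNR { target[++n] = $1; next }
    # Second file: "<name> <count>" per audio clip.
    {
        # Move on to the next sentence when adding this clip would take the
        # running total further from the current target than leaving it out.
        # A sentence always receives at least one clip, and the last
        # sentence collects every remaining clip.
        if (s < n && line[s] != "" && abs(sum + $2 - target[s]) > abs(sum - target[s])) {
            s++
            sum = 0
        }
        sum += $2
        line[s] = (line[s] == "" ? "" : line[s] " ") $1
    }
    # Print one line of file names per sentence, in sentence order.
    END { for (i = 1; i <= n; i++) print line[i] }
    ' "$1" "$2"
}

# Demo with the counts from the question.
tmp=$(mktemp -d)
printf '%s\n' 5 7 8 9 > "$tmp/syllables_in_text.txt"
printf '%s %s\n' 0001.wav 3 0002.wav 2 0003.wav 4 0004.wav 5 \
                 0005.wav 7 0006.wav 3 0007.wav 2 0008.wav 3 > "$tmp/syllables_in_audio.txt"
match_audio "$tmp/syllables_in_text.txt" "$tmp/syllables_in_audio.txt"
rm -r "$tmp"
```

On the question's counts this prints the four groups the asker wants (0001+0002, 0003+0004, 0005, 0006+0007+0008). Being greedy, it can still go wrong when a miscounted clip makes the wrong choice look closer, so the output is best treated as a first pass to check by ear.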

But there is also another solution...

score 1 · Accepted Answer

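Can you explain how (2 and 3) are supposed to match (5) on the other list?

I made up a sample to start with; please correct me if it is wrong.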

$ cat sentences.txt
5 This is a sentence.
7 This is another sentence.
8 This is yet another sentence.
9 This is still yet another sentence.

$ cat syllables_in_audio.txt
0001.wav 5
0002.wav 5
0003.wav 7
0004.wav 7
0005.wav 8
0006.wav 9
0007.wav 9
0008.wav 9

With that, you should be able to run this awk command to get the output:

awk 'NR==FNR{a[$1]=$0;k[++n]=$1;next}{b[$2]=b[$2]==""?$1:b[$2] FS $1}END{for(j=1;j<=n;j++) printf "%-40s|%s\n", b[k[j]], a[k[j]]}' sentences.txt syllables_in_audio.txt

Result

0001.wav 0002.wav                       |5 This is a sentence.
0003.wav 0004.wav                       |7 This is another sentence.
0005.wav                                |8 This is yet another sentence.
0006.wav 0007.wav 0008.wav              |9 This is still yet another sentence.
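
Note that the command above groups files whose count is exactly equal to a sentence's count, and it keys sentences by that count. That fits the made-up sample, but not the approximate counts in the question, where 3 and 2 have to be matched against 5, nor a text in which two sentences have the same syllable count. One possible way to handle that case, offered as a sketch rather than as part of this answer, is to compare running totals: each clip is assigned to the sentence whose span of the running syllable total contains the clip's midpoint. The name align_by_totals is hypothetical, and the approach assumes the counts in the two files drift apart only slightly over the whole recording.

```shell
#!/bin/sh
# Sketch: align audio clips to sentences by comparing running totals.
# align_by_totals is a hypothetical helper name.
# Usage: align_by_totals sentences.txt syllables_in_audio.txt > output.txt
align_by_totals() {
    awk '
    BEGIN { s = 1 }
    # First file: sentence lines that start with a syllable count.
    # end[i] is the running total up to and including sentence i.
    NR == FNR { total += $1; end[++n] = total; next }
    # Second file: "<name> <count>" per audio clip.
    {
        # Midpoint of this clip on the running-total axis of the audio.
        mid = seen + $2 / 2
        seen += $2
        # Advance to the first sentence whose running total reaches the
        # midpoint; the last sentence takes whatever is left over.
        while (s < n && mid > end[s]) s++
        line[s] = (line[s] == "" ? "" : line[s] " ") $1
    }
    # Print one line of file names per sentence, in sentence order.
    END { for (i = 1; i <= n; i++) print line[i] }
    ' "$1" "$2"
}
```

With the question's original data (sentence counts 5, 7, 8, 9 and clip counts 3, 2, 4, 5, 7, 3, 2, 3) the running totals both end at 29, and the clips fall into the groups 0001+0002, 0003+0004, 0005 and 0006+0007+0008.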
