bash - Conditional find/replace using awk

Question

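I want to solve a common but very specific problem: because of OCR errors, many subtitle files contain the character "I" (capital i) where there should be an "l" (lowercase L).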

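My plan of attack is:

Process the file word by word.
Pass each word to the hunspell spell checker ("echo the-word | hunspell -l" produces no output at all if the word is valid, and prints the word if it is not).
If it is a bad word and contains capital I's, replace them with lowercase l's and try again. If it is now a valid word, replace the original word.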

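I could of course tokenize and rebuild the whole file in a script, but before I go down that road, I'd like to know whether awk and/or sed can do this kind of conditional operation at the word level.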

Any other suggested approaches would also be very welcome!

score 2 · Accepted Answer

You really don't need anything more than bash:

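# IFS= and -r keep leading whitespace and backslashes in the line intact
while IFS= read -r line; do
  # split into words without triggering glob expansion on * or ?
  read -ra words <<< "$line"
  for ((i = 0; i < ${#words[@]}; i++)); do
    word=${words[i]}
    if [[ -n $(hunspell -l <<< "$word") ]]; then
      # hunspell had some output, so the word is not in the dictionary
      tmp=${word//I/l}
      if [[ $tmp != "$word" ]] && [[ -z $(hunspell -l <<< "$tmp") ]]; then
        # no output for new word, therefore it's a dictionary word
        words[i]=$tmp
      fi
    fi
  done
  # print the new line (runs of whitespace collapse to single spaces)
  printf '%s\n' "${words[*]}"
done < filename > filename.new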

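That said, it seems to make more sense to pass the whole file to hunspell once and parse its output, rather than starting a hunspell process for every word.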
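As a rough sketch of that idea, the function below makes two hunspell calls in total instead of one per word. The name ocr_fix_pairs and the tab-separated output format are made up for illustration, and it assumes that "hunspell -l" prints one rejected word per line, as described in the question.

```bash
# Print "bad<TAB>good" pairs for words in FILE that hunspell rejects but
# accepts once every capital I is replaced by a lowercase l.
# Assumption: hunspell is installed with a dictionary for the subtitle language.
# The body runs in a subshell, so its variables do not leak to the caller.
ocr_fix_pairs() (
  file=$1
  # Pass 1: one hunspell process checks the whole file;
  # keep only the rejected words that contain a capital I.
  bad=$(hunspell -l < "$file" | grep I | sort -u)
  [ -n "$bad" ] || exit 0
  # Pass 2: one more process checks all I -> l candidates at once.
  still_bad=$(printf '%s\n' "$bad" | tr I l | hunspell -l | sort -u)
  printf '%s\n' "$bad" | while IFS= read -r word; do
    cand=$(printf '%s' "$word" | tr I l)
    # Emit the pair only if hunspell did not complain about the candidate.
    printf '%s\n' "$still_bad" | grep -qxF -- "$cand" ||
      printf '%s\t%s\n' "$word" "$cand"
  done
)
```

Something like "ocr_fix_pairs subtitles.srt > pairs.tsv" (both file names hypothetical) would then give a list of corrections to review before touching the file. Like the loop above, it replaces every I in a word at once, so a word that needs only some of its I's changed is left alone.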

score 1 · Accepted Answer

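Two suggestions:

Fix the problem closer to where it originates, i.e. in or near the OCR software. Can it not do a dictionary lookup itself, so that even the words that have nothing to do with "I" come out right? If not, try another OCR program that can.
Running every word through hunspell creates one process per word, which is a huge waste of CPU cycles. Try multiple passes instead: the first pass finds all the words containing "I", the next filters out the ones that are already correct, and the last replaces each correctable word.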
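The final replacement pass is where awk can do the word-level work asked about in the question. Here is a minimal sketch, assuming the earlier passes have already produced a tab-separated file of bad/good word pairs. The function name apply_pairs and that file format are hypothetical, and for simplicity a word is taken to be a run of ASCII letters.

```bash
# apply_pairs PAIRS FILE: write FILE to stdout, replacing each whole word
# listed in column 1 of PAIRS (tab-separated) with the word in column 2.
apply_pairs() {
  awk -F '\t' '
    # First file: load the bad -> good table.
    FILENAME == ARGV[1] { fix[$1] = $2; next }
    {
      out = ""; rest = $0
      # Walk the line word by word, copying everything between words
      # unchanged so that spacing and punctuation survive.
      # Assumption: words are ASCII letters only; widen the bracket
      # expression for apostrophes or accented letters.
      while (match(rest, /[A-Za-z]+/)) {
        word = substr(rest, RSTART, RLENGTH)
        out = out substr(rest, 1, RSTART - 1) ((word in fix) ? fix[word] : word)
        rest = substr(rest, RSTART + RLENGTH)
      }
      print out rest
    }
  ' "$1" "$2"
}
```

Called as "apply_pairs pairs.tsv filename > filename.new", this starts a single awk process for the whole file and, unlike splitting each line into an array, leaves the original whitespace alone.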
