python - Automatically synchronizing markdown in a translated text (ProGit book, sources available)

Question

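Summary: What is an efficient way to wrap a given list of substrings in backticks within a translated text?

Motivation: I am synchronizing the markdown markup of a translated text with the original. I have a good Czech translation of Scott Chacon's ProGit book. Unfortunately, it was typeset with a toolchain completely different from the original one, and the original markup was lost. So far I have managed to convert most of it back to markdown and to synchronize the document structure with the original. The next step is to fix the backticks around code in the translation.

The situation

Say I have the following paragraph from the original. It is actually one long line with no line breaks, in case that matters: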

    On Windows systems, Git looks for the `.gitconfig` file in the 
    `$HOME` directory (`C:\Documents and Settings\$USER` for most 
    people). It also still looks for /etc/gitconfig, although it’s 
    relative to the MSys root, which is wherever you decide to 
    install Git on your Windows system when you run the installer.

I also have the translated paragraph:

    Ve Windows používá Git soubor .gitconfig, který je umístěný v 
    domovském adresáři (u většiny uživatelů C:\Documents and 
    Settings\$USER). Dále se pokusí vyhledat ještě soubor 
    /etc/gitconfig, který je relativní vůči kořenovému adresáři. 
    Ten je umístěn tam, kam jste se rozhodli nainstalovat Git po 
    spuštění instalačního programu.

Using a regular expression, I extracted the following list from the original (shown here as repr(), hence the doubled backslashes):

    ['.gitconfig', '$HOME', 'C:\\Documents and Settings\\$USER']

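What is an efficient way to wrap the listed substrings in backticks in the translation? A further problem is that some paragraphs may repeat the same substring several times. I also cannot tell what other complications may come up. ("My head hurts, too!")

P.S. For those more interested in the problem, everything can be found at https://github.com/pepr/progitCZ (just committed 04d1354656276bf1e6ba7305d06c12faca267a19; warning, the comments are in Czech). The problem relates to the util/cz.py script. It is the fourth pass, in pass4.py. Currently, I convert the list to a set and then call str.replace() for each substring.

The info_aux_cs\pass4backticks.txt file shows a comparison produced by the automated process. info_aux_cs\pass4.txt shows the "fixed" result, and txtCorrected\RucneUpravovanyVysledekPass2.txt shows the last manually edited stage.

Another problem is that only the structure of the document has been synchronized so far. The content of the paragraphs (the translation) has not yet been checked against newer versions of the original.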

Update: a new problem observed

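Automatic replacement can be ambiguous. I did observe the case ['git clone', 'clone', ...]. Because a set is built first, the order of the substrings is arbitrary, so clone may be wrapped before git clone. In that case

    some text git `clone` other text

appears where

    some text `git clone` other text

would be the correct replacement.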
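The ambiguity can be reproduced with a small sketch. The helper name `wrap_naive` is hypothetical and only mimics the "call str.replace() for each substring" approach described above, with the order made explicit instead of coming from a set:

```python
def wrap_naive(text, substrings):
    """Wrap each substring in backticks with successive str.replace() calls."""
    for s in substrings:
        text = text.replace(s, '`' + s + '`')
    return text

text = 'some text git clone other text'

# Shorter substring first: 'clone' gets wrapped, so 'git clone' no longer
# occurs literally in the text and is never wrapped.
short_first = wrap_naive(text, ['clone', 'git clone'])

# Longer substring first: 'git clone' is wrapped, but the next step wraps
# the inner 'clone' a second time.
long_first = wrap_naive(text, ['git clone', 'clone'])

print(short_first)   # some text git `clone` other text
print(long_first)    # some text `git `clone`` other text
```

Neither order gives the wanted result with plain successive replacement, which is why a single-pass approach looks more attractive.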

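I know this approach is very heuristic, and it does not need to be done very precisely. The automatically replaced text will become the source for manual editing anyway. So a partial solution could visualize the suspicious differences that should be checked by a human eye and fixed by a human hand :)

Do you have any ideas on how to find the most reliable solution? Here are some heuristics I came up with, that is, cases where a potential problem should be visualized: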

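- All substrings from the original should be found in the translation. Otherwise the translation is somehow specific, not up to date, or simply bowdlerized. The translation may change a substring, but this should be recognized and later explicitly exempted from the check.
- The order of the substrings may not be preserved in the target language. Still, the same number of the same substrings in the same order is a good sign that the replacement succeeded.
- Should the longest substrings be replaced first?
- ... but then the shorter ones would be replaced again in the next step?
- Is it possible to construct a regular expression pattern from the substrings and use the greediness of regular expressions to wrap all of them in backticks at once?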
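The first two heuristics could be checked along these lines. This is only a sketch: the function name `check_substrings` and the shortened Czech sample are assumptions made for illustration, and only the first occurrence of each substring is considered.

```python
def check_substrings(en_subs, cs_text):
    """Return (missing, same_order) for the substrings of the original.

    missing:    substrings that do not occur in the translation at all
    same_order: True if the substrings that were found first occur in the
                translation in the same relative order as in the original
    """
    missing = [s for s in en_subs if s not in cs_text]
    positions = [cs_text.find(s) for s in en_subs if s in cs_text]
    same_order = positions == sorted(positions)
    return missing, same_order

en_subs = ['.gitconfig', '$HOME', 'C:\\Documents and Settings\\$USER']
cs_text = ('Ve Windows používá Git soubor .gitconfig '
           '(u většiny uživatelů C:\\Documents and Settings\\$USER).')

missing, same_order = check_substrings(en_subs, cs_text)
print(missing)      # ['$HOME']
print(same_order)   # True
```

A non-empty `missing` list or a False `same_order` would mark the paragraph as one to review by hand.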

Any good idea is very welcome ;)

Thanks for your time and experience,

Petr

score 1 · Accepted Answer

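So far, I have found the solution based on regular expressions the most promising. If you find a better one, I will be happy to accept yours :)

First, here is the regular expression for finding the backticked substrings: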

import re
rexBackticked = re.compile(r'`(\S.*?\S?)`')

Having the original paragraph enpara and the translated paragraph cspara, I can extract the lists of backticked substrings like this:

enlst = rexBackticked.findall(enpara)
cslst = rexBackticked.findall(cspara)
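As a quick check, the pattern can be run against a shortened version of the English paragraph from the question. The sample string below is an abbreviation made for this sketch:

```python
import re

rexBackticked = re.compile(r'`(\S.*?\S?)`')

# Shortened sample of the original paragraph (one long line).
enpara = ('On Windows systems, Git looks for the `.gitconfig` file in the '
          '`$HOME` directory (`C:\\Documents and Settings\\$USER` for most '
          'people).')

enlst = rexBackticked.findall(enpara)
print(enlst)
# ['.gitconfig', '$HOME', 'C:\\Documents and Settings\\$USER']
```

The capture group returns the content without the surrounding backticks, so the result can be compared directly between the two languages.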

Then I test whether the Czech paragraph should be modified:

if set(enlst) != set(cslst) or len(enlst) != len(cslst):

If so, I build a difference list of the substrings that are not yet backticked in cspara but should be (this could probably be written better):

    dlst = enlst[:]   # copy
    for s in cslst:
        if s in dlst:
            dlst.remove(s)
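The loop above behaves like a multiset difference that keeps the order of enlst. A minimal sketch, with the hypothetical wrapper name `missing_backticks` and made-up sample lists:

```python
def missing_backticks(enlst, cslst):
    """Return the items of enlst that are not accounted for in cslst."""
    dlst = enlst[:]            # copy, so the caller's list stays untouched
    for s in cslst:
        if s in dlst:
            dlst.remove(s)     # removes only the first matching occurrence
    return dlst

result = missing_backticks(['git clone', 'clone', 'clone'], ['clone'])
print(result)   # ['git clone', 'clone']
```

Because duplicates are counted, a substring backticked twice in the original and once in the translation still shows up once in the difference.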

Now I need to build a regular expression object that recognizes the dlst substrings. I have defined the following method:

def buildRex(self, lst):
    '''Build a regular expression matching substrings from the lst.'''

    # Build a list of escaped unique substrings from the input list.
    # The order is not important now as it must be corrected later.
    lst2 = [re.escape(s) for s in set(lst)]

    # Join the escaped substrings to form the regular expression
    # pattern, build the regular expression, and return it. There could
    # be longer patterns that contain shorter patterns. The longer patterns
    # should be matched first. This way, the lst2 must be reverse sorted
    # by the length of the patterns.
    pat = '|'.join(sorted(lst2, key=len, reverse=True))
    rex = re.compile(pat)
    return rex
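A short sketch of why the reverse sort by length matters. Python's re tries the alternatives of `|` from left to right and accepts the first one that matches at a position, so a shorter alternative that is a prefix of a longer one would win if it came first. The sample text and substrings are made up for illustration:

```python
import re

text = 'run git clone now'   # hypothetical sample text

# Shorter alternative first: 'git' matches and 'git clone' is never tried.
shorter_first = re.sub('git|git clone', r'`\g<0>`', text)

# Longer alternative first: the whole 'git clone' is wrapped.
longer_first = re.sub('git clone|git', r'`\g<0>`', text)

print(shorter_first)   # run `git` clone now
print(longer_first)    # run `git clone` now
```

Note that a pair like 'clone' and 'git clone' resolves correctly in either order, since the match starting further left is found first. The sort is needed for substrings that share a starting position.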

Now I can use it to replace all non-overlapping substrings in cspara:

    rex = self.buildRex(dlst)
    cspara, n = rex.subn(r'`\g<0>`', cspara)

where n is the number of replacements, which may be important for later checks.
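Putting the steps together, a standalone sketch might look as follows. The names `build_rex` and `sync_backticks` are hypothetical (the code above uses a method named buildRex), and the guard for an empty difference list is an addition: an empty pattern would match at every position of the string.

```python
import re

rexBackticked = re.compile(r'`(\S.*?\S?)`')

def build_rex(lst):
    """Compile an alternation of the escaped substrings, longest first."""
    escaped = [re.escape(s) for s in set(lst)]
    return re.compile('|'.join(sorted(escaped, key=len, reverse=True)))

def sync_backticks(enpara, cspara):
    """Return (new_cspara, n), where n is the number of replacements."""
    enlst = rexBackticked.findall(enpara)
    cslst = rexBackticked.findall(cspara)
    dlst = enlst[:]
    for s in cslst:
        if s in dlst:
            dlst.remove(s)
    if not dlst:
        # Nothing is missing; skip the empty pattern.
        return cspara, 0
    return build_rex(dlst).subn(r'`\g<0>`', cspara)

enpara = ('Git looks for the `.gitconfig` file in the `$HOME` directory '
          '(`C:\\Documents and Settings\\$USER` for most people).')
cspara = ('Git soubor .gitconfig (u většiny uživatelů '
          'C:\\Documents and Settings\\$USER).')

new_cspara, n = sync_backticks(enpara, cspara)
print(new_cspara)
print(n)   # 2
```

Here `$HOME` has no counterpart in the translation, so only two replacements happen. The sketch does not skip occurrences that are already backticked in cspara, so a repeated substring can end up wrapped twice; comparing n with len(dlst) is one way to flag such paragraphs for manual review.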

Any comments are welcome!
