python - How to change the word order of phrasal verbs in a POS-tagged corpus file

Question

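I have a POS-tagged parallel corpus text file in which I would like to reorder words, so that the "separable phrasal verb particle" appears next to the "verb" of the phrasal verb ("make up a plan" instead of "make a plan up"). This is a preprocessing step for a statistical machine translation system. Here are some example lines from the POS-tagged text file: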

you_PRP mean_VBP we_PRP should_MD kick_VB them_PRP out_RP ._.
don_VB 't_NNP take_VB it_PRP off_RP until_IN I_PRP say_VBP so_RB ._.
please_VB help_VB the_DT man_NN out_RP ._.
shut_VBZ it_PRP down_RP !_.

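I would like to move every particle (in the examples: out_RP, off_RP, out_RP, down_RP) next to the closest preceding verb (i.e. the verb that combines with the particle to form the phrasal verb). Here is what the lines should look like after the word order has been changed: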

you_PRP mean_VBP we_PRP should_MD kick_VB out_RP them_PRP ._.
don_VB 't_NNP take_VB off_RP it_PRP until_IN I_PRP say_VBP so_RB ._.
please_VB help_VB out_RP the_DT man_NN ._.
shut_VBZ down_RP it_PRP !_.

So far, I have tried to approach the problem with Python and regular expressions, using re.findall:

import re

with open('first100k.txt') as f:
    text = f.read()

# verb + determiner + noun + particle
matchline3 = r'\w*_VB.?\s\w*_DT\s\w*_NN\s\w*_RP'
wordorder1 = re.findall(matchline3, text)
print(wordorder1)

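This finds all phrasal verbs in word order 1 (see below), but that is as far as I got, because I can't figure out how to move the particle next to the verb. Any ideas on how to solve this properly (not necessarily with Python and regular expressions)? I would like to be able to search for all phrasal verbs and move the particle in the following word orders: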

(The tags used are taken from the Penn Treebank tagset (http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html); x denotes an optional character, so that all verb forms are included, and * denotes a wildcard.)

*_VBx+*_DT+*_NN+*_RP
*_VBx+*_DT+*_NNS+*_RP
*_VBx+*_DT+*_JJ+*_NN+*_RP
*_VBx+*_DT+*_JJ+*_NNS+*_RP
*_VBx+*_PRP$+*_NN+*_RP
*_VBx+*_PRP$+*_NNS+*_RP
*_VBx+*_PRP$+*_JJ+*_NN+*_RP
*_VBx+*_PRP$+*_JJ+*_NNS+*_RP
*_VBx+*_NNP+*_RP
*_VBx+*_JJ+*_NNP+*_RP
*_VBx+*_NNPS+*_RP
*_VBx+*_PRP+*_RP

Thanks in advance for your help!

score 3 · Accepted Answer

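I wouldn't recommend using regular expressions here. They are definitely less intuitive than iterating over each line after splitting it on whitespace, possibly rearranging the resulting list, and finally joining it again. You could try something like this: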

with open('corpus.txt', 'r') as corpus, \
        open('reordered_corpus.txt', 'w') as reordered_corpus:
    for phrase in corpus:
        phrase = phrase.split()                 # split on whitespace
        vb_index = rp_index = -1                # variables for the indices
        for i, word_pos in enumerate(phrase):
            pos = word_pos.rsplit('_', 1)[-1]   # POS is the part after the last _
            if pos in ('VB', 'VBZ'):            # can add more verb POS tags
                vb_index = i
            elif vb_index >= 0 and pos == 'RP': # or more particle POS tags
                rp_index = i
                break                           # found both so can stop
        if vb_index >= 0 and rp_index >= 0:     # do any rearranging
            phrase = phrase[:vb_index+1] + [phrase[rp_index]] + \
                     phrase[vb_index+1:rp_index] + phrase[rp_index+1:]
        reordered_corpus.write(' '.join(phrase) + '\n')

With this code, if corpus.txt reads,

you_PRP mean_VBP we_PRP should_MD kick_VB them_PRP out_RP ._.
don_VB 't_NNP take_VB it_PRP off_RP until_IN I_PRP say_VBP so_RB ._.
please_VB help_VB the_DT man_NN out_RP ._.
shut_VBZ it_PRP down_RP !_.

then after running it, reordered_corpus.txt will read,

you_PRP mean_VBP we_PRP should_MD kick_VB out_RP them_PRP ._.
don_VB 't_NNP take_VB off_RP it_PRP until_IN I_PRP say_VBP so_RB ._.
please_VB help_VB out_RP the_DT man_NN ._.
shut_VBZ down_RP it_PRP !_.
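
The loop above only recognises VB and VBZ and stops at the first particle in a line. Below is a minimal sketch of a more general variant. It assumes that every verb tag starts with VB (as in the Penn Treebank tagset) and that a particle should only be moved when the tags between verb and particle form one of the sequences listed in the question. The names OBJECT_PATTERNS and move_particles are made up for this illustration.

```python
# Tag sequences allowed between verb and particle (copied from the question's list).
OBJECT_PATTERNS = {
    ('DT', 'NN'), ('DT', 'NNS'), ('DT', 'JJ', 'NN'), ('DT', 'JJ', 'NNS'),
    ('PRP$', 'NN'), ('PRP$', 'NNS'), ('PRP$', 'JJ', 'NN'), ('PRP$', 'JJ', 'NNS'),
    ('NNP',), ('JJ', 'NNP'), ('NNPS',), ('PRP',),
}


def move_particles(line):
    """Return the line with each particle moved next to its verb.

    A particle is moved only if the tags between the nearest preceding
    verb and the particle form one of OBJECT_PATTERNS.
    """
    tokens = line.split()
    tags = [tok.rsplit('_', 1)[-1] for tok in tokens]  # tag follows the last _
    vb_index = -1
    for i in range(len(tokens)):
        if tags[i].startswith('VB'):            # assumption: all verb tags start with VB
            vb_index = i
        elif tags[i] == 'RP' and vb_index >= 0:
            if tuple(tags[vb_index + 1:i]) in OBJECT_PATTERNS:
                # Move the particle (and its tag) to just after the verb.
                tokens.insert(vb_index + 1, tokens.pop(i))
                tags.insert(vb_index + 1, tags.pop(i))
            vb_index = -1                       # this verb is used up; wait for the next one
    return ' '.join(tokens)
```

For example, move_particles('shut_VBZ it_PRP down_RP !_.') returns 'shut_VBZ down_RP it_PRP !_.', while a line such as 'he_PRP gave_VBD up_RP ._.' comes back unchanged because nothing stands between the verb and the particle.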

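If you would rather stay with regular expressions, as in the question, the move itself can be done with re.sub and capture groups. This is only a sketch of that alternative, not the approach recommended above. It assumes tokens are separated by single spaces, and the pattern is my own reading of the word orders listed in the question; PHRASAL_VERB and move_particles_re are illustrative names.

```python
import re

PHRASAL_VERB = re.compile(
    r'(\S+_VB[DGNPZ]?) '                               # group 1: verb (VB, VBD, VBG, VBN, VBP, VBZ)
    r'((?:\S+_(?:DT|PRP\$) (?:\S+_JJ )?\S+_NNS?)'      # group 2: DT/PRP$ (+ JJ) + NN/NNS
    r'|(?:(?:\S+_JJ )?\S+_NNP)'                        #          or (JJ +) NNP
    r'|(?:\S+_NNPS)'                                   #          or NNPS
    r'|(?:\S+_PRP))'                                   #          or PRP
    r' (\S+_RP)(?!\S)'                                 # group 3: particle, ending at a token boundary
)


def move_particles_re(line):
    """Swap object (group 2) and particle (group 3) so the particle follows the verb."""
    return PHRASAL_VERB.sub(r'\1 \3 \2', line)
```

For example, move_particles_re('please_VB help_VB the_DT man_NN out_RP ._.') returns 'please_VB help_VB out_RP the_DT man_NN ._.'. The pattern gets hard to read quickly as more word orders are added, which is the main reason to prefer the list-based loop.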