python - How to extract the line numbers of lines that match a regular expression in a text file

Question

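I am working on a statistical machine translation project in which I need to extract the numbers of the lines in a POS-tagged text file that match a regular expression (any non-separated phrasal verb with the particle "out"), and write those line numbers to a file (in Python).

I have this regular expression: '\w*_VB.?\sout_RP' and my POS-tagged text file: 'Corpus.txt'. I would like to get an output file containing the numbers of the lines that match the regular expression above, with exactly one line number per line (no empty lines), e.g.: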

2
5
44

So far, all I have in my script is the following:

OutputLineNumbers = open('OutputLineNumbers', 'w')
with open('Corpus.txt', 'r') as textfile:
    phrase='\w*_VB.?\sout_RP'
    for phrase in textfile: 

OutputLineNumbers.close()

Any idea how to solve this?

Thanks in advance for your help!

score 6 · Accepted Answer

This should solve your problem, assuming the variable 'phrase' holds the correct regular expression.

import re

# compile the regex from the question (the raw string keeps the backslashes intact)
phrase = r'\w*_VB.?\sout_RP'
regex = re.compile(phrase)

# open the files
with open('Corpus.txt', 'r') as inputFile:
    with open('OutputLineNumbers', 'w') as outputLineNumbers:
        # loop through each line in the corpus, counting from 1
        for line_i, line in enumerate(inputFile, 1):
            # check if we have a regex match
            if regex.search(line):
                # if so, write its line number to the output file
                outputLineNumbers.write("%d\n" % line_i)
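To try the matching logic without touching any files, the same loop can be wrapped in a small function and run on a few in-memory lines. This is only a sketch: the function name `matching_line_numbers` and the sample sentences are made up for illustration, and the pattern is the one quoted in the question.

```python
import re

# Pattern quoted in the question; a raw string so the backslashes reach the regex engine.
PHRASAL_VERB_OUT = re.compile(r'\w*_VB.?\sout_RP')


def matching_line_numbers(lines, regex=PHRASAL_VERB_OUT):
    """Return the 1-based numbers of the lines in which `regex` matches.

    `lines` can be any iterable of strings, including an open text file.
    (Hypothetical helper, not part of the original answer.)
    """
    return [number for number, line in enumerate(lines, 1) if regex.search(line)]


# Hypothetical POS-tagged sample lines, standing in for Corpus.txt.
sample = [
    "He_PRP slept_VBD ._.",
    "They_PRP found_VBD out_RP the_DT truth_NN ._.",
    "She_PRP went_VBD home_NN ._.",
    "We_PRP carried_VBD out_RP the_DT plan_NN ._.",
]

numbers = matching_line_numbers(sample)

# One line number per line, no empty lines, as requested in the question.
output_text = "".join("%d\n" % number for number in numbers)
print(output_text, end="")
```

Passing an open file object instead of `sample` should behave the same way, since iterating over a text file yields its lines one at a time.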

score 2 · Accepted Answer

If your regular expression is grep-friendly, you can do this directly in bash. Use "-n" to show line numbers.

For example:

grep -n  "[1-9][0-9]" tags.txt

will output the matching lines, each prefixed with its line number:

2569:vote2012
2570:30
2574:118
2576:7248
2578:2293
2580:9594
2582:577
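Since the question asks for the line numbers only, the "number:line" output above still needs the text after the first colon removed. A minimal Python sketch of that post-processing step follows; the helper name `line_numbers_from_grep` is hypothetical, and it assumes grep was run on a single file, so each row starts with the line number rather than a file name.

```python
def line_numbers_from_grep(grep_output):
    """Extract the leading line numbers from `grep -n` style output.

    Assumes every non-empty row looks like "<number>:<matched line>",
    which is what `grep -n` prints when searching a single file.
    (Hypothetical helper for illustration.)
    """
    numbers = []
    for row in grep_output.splitlines():
        if not row:
            continue  # skip empty rows
        prefix, _, _ = row.partition(":")  # split at the first colon only
        numbers.append(int(prefix))
    return numbers


# First rows of the output shown above.
grep_output = "2569:vote2012\n2570:30\n2574:118\n"

for number in line_numbers_from_grep(grep_output):
    print(number)
```

Splitting at the first colon only matters because the matched line itself may contain colons.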
