python - Find Pattern in Textfile From Several Elements In Several Lists?

Question

I am a beginner and have been learning Python for a few months as my very first programming language. I am trying to find a pattern in a text file. My first attempt used regex, which does work but has a limitation:

import re

noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
CC_list = ['and', 'or']

noun_list_pattern1 = r'\b\w+\b,\s\b\w+\b,\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b,\sor\s\b\w+\b|\b\w+\b,\s\b\w+\b\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b\sor\s\b\w+\b'

with open('test_sentence.txt', 'r') as input_f:
    read_input = input_f.read()
    word = re.findall(noun_list_pattern1, read_input)
    for w in word:
        print(w)

At this point you may be asking why the lists are in this code, since they are not being used. Well, I have been racking my brains, trying all sorts of for loops and if statements in functions to find a way to replicate the regex pattern using the lists.

The limitation with regex is that the \b\w+\b code, which appears a number of times in `noun_list_pattern1`, finds words (any words), not specific nouns. This could raise false positives. I want to narrow things down by using the elements in the lists above instead of the generic regex.

Since the regex pattern actually combines 4 different regexes (4 alternatives separated by |), I will just go with one of them here. So I would need to find a pattern such as:

'noun in noun_list' + ', ' + 'noun in noun_list' + ', ' + 'C in CC_list' + ' ' + 'noun in noun_list'

Obviously, the line quoted above is not real Python code, but an expression of my thoughts about the match needed. Where I say noun in noun_list I mean an iteration through noun_list; C in CC_list is an iteration through CC_list; ', ' is a literal string match for a comma followed by whitespace.

Hopefully I have made myself clear!

Here is the content of the test_sentence.txt file that I am using:

I need to buy are bacon, cheese and eggs. 
I also need to buy milk, cheese, and bacon.
What's your favorite: milk, cheese or eggs.
What's my favorite: milk, bacon, or eggs.

score 2 · Accepted Answer

Actually, you don't necessarily need regular expressions, since there are several ways to do this using just your original lists.

noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']

# This assumes that the file has been read into a list of newline-delimited lines called `rawlines`
for line in rawlines:
    # Collect every noun and every conjunction that appears somewhere in the line
    matches = [noun for noun in noun_list if noun in line] + [conj for conj in conjunctions if conj in line]
    if len(matches) == 4:
        for match in matches:
            print(match)

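The number of matches is checked against 4 because 4 is the correct number of matches for these sentences (three nouns plus one conjunction). (Note that this count could also come up in the case of repeated nouns or conjunctions.)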
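One thing to watch with the `in` test on a raw string is that it matches substrings: 'or' is found inside "favorite", for example. Below is a minimal sketch of the same list-based idea that compares whole words instead. It is written for Python 3, and the helper name `find_list_words` and the inline sample lines are assumptions made for illustration, not part of the answer above.

```python
import re

noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']

def find_list_words(line):
    """Return the nouns and conjunctions that occur in `line` as whole words."""
    # Split the line into lowercase words so that 'or' is not found inside 'favorite'.
    words = set(re.findall(r"[a-z']+", line.lower()))
    nouns = [noun for noun in noun_list if noun in words]
    conjs = [conj for conj in conjunctions if conj in words]
    return nouns, conjs

# Hypothetical sample lines; in practice these would come from the file.
rawlines = [
    "I also need to buy milk, cheese, and bacon.",
    "What's your favorite: milk or cheese?",
]

for line in rawlines:
    nouns, conjs = find_list_words(line)
    # Three nouns plus one conjunction, as in the sample sentences.
    if len(nouns) == 3 and len(conjs) == 1:
        print(line, nouns + conjs)
```

This still ignores word order and punctuation, so it acts as a filter on lines rather than a true pattern match.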

Edit:

This version prints the matching lines along with the matched words. It also fixes a possible multiple-word match problem:

words_matched = []
matching_lines = []

# `lst` is the list of lines read from the file (`rawlines` above)
for l in lst:
    matches = [noun for noun in noun_list if noun in l] + [conj for conj in conjunctions if conj in l]
    invalid = True
    valid_count = 0
    # Count the matched words that occur exactly once in `matches`
    for match in matches:
        if matches.count(match) == 1:
            valid_count += 1
    # The line is valid only if no matched word is duplicated
    if valid_count == len(matches):
        invalid = False

    if not invalid:
        words_matched.append(matches)
        matching_lines.append(l)

for line, matches in zip(matching_lines, words_matched):
    print(line, matches)

However, if this doesn't suit you, you can always build the regex as follows (using the itertools module):

import itertools

# Each permutation picks 3 nouns (as revealed by your examples)
for nouns, conj in itertools.product(itertools.permutations(noun_list, 3), conjunctions):
    matches = list(nouns)
    matches.append(conj)
    # matches[:2] is the sublist containing the first 2 nouns, matches[-1] is the last element (the conjunction), and matches[2:-1] is the noun before it (with more than 3 nouns, this would be everything between the 2nd and last elements).
    regex_string = r',\s'.join(matches[:2]) + r'\s' + matches[-1] + r'\s' + r',\s'.join(matches[2:-1])
    print(regex_string)
    # ... do regex related matching here

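The caveat with this approach is that it is pure brute force: it generates every possible combination (read: permutation) of the two lists, and each one then has to be tested against every line to see whether it matches. It is therefore very slow, but for this example it produces exact matches for the sentences given (those with no comma before the conjunction).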
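To make the "do regex related matching here" step concrete, here is a minimal Python 3 sketch of the brute-force idea run against the four sample sentences. The function names `brute_force_patterns` and `first_match` are hypothetical, and the optional comma (`,?`) before the conjunction is an addition that goes beyond the answer's pattern so that both punctuation styles match.

```python
import itertools
import re

noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']

def brute_force_patterns(nouns, conjs, count=3):
    """Yield one regex string per ordered choice of `count` nouns and one conjunction."""
    for chosen, conj in itertools.product(itertools.permutations(nouns, count), conjs):
        # e.g. bacon,\scheese,?\sand\seggs (the ",?" also allows a comma before the conjunction)
        yield r',\s'.join(chosen[:-1]) + r',?\s' + conj + r'\s' + chosen[-1]

def first_match(line):
    """Return the text matched by the first generated pattern found in `line`, or None."""
    for pattern in brute_force_patterns(noun_list, conjunctions):
        found = re.search(pattern, line)
        if found:
            return found.group(0)
    return None

lines = [
    "I need to buy are bacon, cheese and eggs.",
    "I also need to buy milk, cheese, and bacon.",
    "What's your favorite: milk, cheese or eggs.",
    "What's my favorite: milk, bacon, or eggs.",
]

for line in lines:
    print(first_match(line))
```

With 6 nouns and 2 conjunctions this already tries up to 240 patterns per line, which illustrates why the approach scales poorly as the lists grow.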

Adjust as needed.

score 2 · Accepted Answer

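Break your problem down a little. First, you need a pattern that matches the words in your list but no others. You can do that with the alternation operator | and the literal words. red|green|blue, for example, will match "red", "green", or "blue", but not "purple". Join the list of nouns with that character, then add the word boundary metacharacters and parentheses to group the alternation: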

noun_patt = r'\b(' + '|'.join(nouns) + r')\b'

Do the same for your list of conjunctions:

conj_patt = r'\b(' + '|'.join(conjunctions) + r')\b'
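To see what these two building blocks do on their own, here is a small Python 3 sketch that runs each against one of the sample sentences. The `re.escape` calls are an extra precaution that is not in the answer; they make no difference for these plain words but would matter if a list entry ever contained a regex metacharacter.

```python
import re

nouns = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']

# Alternation of the literal words, grouped and wrapped in word boundaries.
noun_patt = r'\b(' + '|'.join(re.escape(n) for n in nouns) + r')\b'
conj_patt = r'\b(' + '|'.join(re.escape(c) for c in conjunctions) + r')\b'

sentence = "What's your favorite: milk, cheese or eggs."

print(re.findall(noun_patt, sentence))  # ['milk', 'cheese', 'eggs']
print(re.findall(conj_patt, sentence))  # ['or']
```

Because of the `\b` anchors, the "or" inside "favorite" is not reported; only the standalone word is.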

The overall match you want is "one or more noun_patt matches, each optionally followed by a comma, then a conj_patt match, then one more noun_patt match". That is easy enough for a regex:

patt = r'({0},? )+{1} {0}'.format(noun_patt, conj_patt)

You don't really want to use re.findall() here, but re.search(), since you only expect one match per line:

>>> for line in lines:
...     print(re.search(patt, line).group(0))
...
bacon, cheese and eggs
milk, cheese, and bacon
milk, cheese or eggs
milk, bacon, or eggs
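Putting the pieces together, a self-contained Python 3 sketch might look like the following. The function name `find_noun_lists` and the inline `sample` string (standing in for the contents of test_sentence.txt) are assumptions for illustration. Unlike the interactive loop above, it checks the result of `re.search()` before calling `group()`, since `re.search()` returns None for a line without a match.

```python
import re

nouns = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']

noun_patt = r'\b(' + '|'.join(nouns) + r')\b'
conj_patt = r'\b(' + '|'.join(conjunctions) + r')\b'
# One or more "noun, " / "noun " items, then a conjunction, then a final noun.
patt = r'({0},? )+{1} {0}'.format(noun_patt, conj_patt)

def find_noun_lists(text):
    """Return the matched phrase for each line of `text` that contains one."""
    results = []
    for line in text.splitlines():
        found = re.search(patt, line)
        # Skip lines with no match instead of failing on None.group().
        if found:
            results.append(found.group(0))
    return results

sample = """I need to buy are bacon, cheese and eggs.
I also need to buy milk, cheese, and bacon.
What's your favorite: milk, cheese or eggs.
What's my favorite: milk, bacon, or eggs.
This line has no list in it."""

for phrase in find_noun_lists(sample):
    print(phrase)
```

To run it against the real file, the `sample` string could be replaced with the result of reading test_sentence.txt.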

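Note that you are close to (if not already bumping up against) the limits of regular expressions as far as parsing English goes. For anything more complex than this, you will want to look into actual parsing, perhaps with NLTK.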
