
I am a beginner and have been learning Python for a few months as my very first programming language. I am trying to find a pattern in a text file. My first attempt uses a regex, which does work but has a limitation:

import re

noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
CC_list = ['and', 'or']

noun_list_pattern1 = r'\b\w+\b,\s\b\w+\b,\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b,\sor\s\b\w+\b|\b\w+\b,\s\b\w+\b\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b\sor\s\b\w+\b'

with open('test_sentence.txt', 'r') as input_f:
    read_input = input_f.read()
    word = re.findall(noun_list_pattern1, read_input)
    for w in word:
        print(w)

At this point you may be asking why the lists are in this code, since they are not being used. Well, I have been racking my brains, trying all sorts of for loops and if statements in functions to find a way to replicate the regex pattern using the lists instead.

The limitation of the regex is that the \b\w+\b code, which appears a number of times in `noun_list_pattern1`, finds words (any words) rather than specific nouns. This could produce false positives. I want to narrow things down by using the elements of the lists above instead of \b\w+\b.

Since the regex pattern actually contains 4 different alternatives (separated by 3 | characters), I will just go with 1 of them here. So I would need to find a pattern such as:

'noun in noun_list' + ', ' + 'noun in noun_list' + ', ' + 'C in CC_list' + ' ' + 'noun in noun_list'

Obviously, the line above is not real Python code, but an expression of my thoughts about the match I need. Where I say noun in noun_list, I mean an iteration through noun_list; C in CC_list is an iteration through CC_list; and ', ' is a literal string match for a comma followed by whitespace.

Hopefully I have made myself clear!

Here is the content of the test_sentence.txt file that I am using:

I need to buy are bacon, cheese and eggs. 
I also need to buy milk, cheese, and bacon.
What's your favorite: milk, cheese or eggs.
What's my favorite: milk, bacon, or eggs.

2 Answers


Actually, you don't necessarily need regular expressions, since there are several ways to do this using just your original lists.

noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']

# This assumes that the file has been read into a list of newline-delimited lines called `rawlines`
for line in rawlines:
    # `in` is a substring test here, so 'or' is also found inside words such as 'favorite'
    matches = [noun for noun in noun_list if noun in line] + [conj for conj in conjunctions if conj in line]
    if len(matches) == 4:
        for match in matches:
            print(match)

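The match count is 4 because 4 is the correct number of matches: three nouns plus one conjunction. (Note that this count could also come from repeated nouns or conjunctions.)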
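
Because `noun in line` is a substring test, a conjunction such as `or` is also found inside longer words. As a minimal sketch of one way to restrict the list-based approach to whole words, the line can be split into word tokens first. The helper name `whole_word_matches` and the second sample sentence are hypothetical, not part of the original answer:

```python
import re

noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']

def whole_word_matches(line):
    """Return the listed nouns and conjunctions that occur in `line` as whole words."""
    # Tokenize into lowercase words so 'or' is not found inside 'favorite'.
    tokens = set(re.findall(r'\w+', line.lower()))
    nouns = [noun for noun in noun_list if noun in tokens]
    conjs = [conj for conj in conjunctions if conj in tokens]
    return nouns + conjs

print(whole_word_matches("What's your favorite: milk, cheese or eggs."))
# A plain substring test would report 'or' here (inside 'favorite' and 'color').
print(whole_word_matches("My favorite color is red."))
```

The `len(matches) == 4` check from the loop above can then be applied to the returned list in the same way.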

Edit:

This version prints the matching lines along with the matched words. It also fixes the possible multiple-word match problem:

words_matched = []
matching_lines = []

# `lst` is assumed to be the list of lines read from the file
for l in lst:
    matches = [noun for noun in noun_list if noun in l] + [conj for conj in conjunctions if conj in l]
    invalid = True
    valid_count = 0
    for match in matches:
        if matches.count(match) == 1:
            valid_count += 1
    if valid_count == len(matches):
        invalid = False

    if not invalid:
        words_matched.append(matches)
        matching_lines.append(l)

for line, matches in zip(matching_lines, words_matched):
    print(line, matches)

However, if that doesn't suit you, you can always build the regex as follows (using the itertools module):

import itertools

# The number of nouns per permutation is 3 (as your examples show)
for nouns, conj in itertools.product(itertools.permutations(noun_list, 3), conjunctions):
    matches = list(nouns)
    matches.append(conj)
    # matches[:2] holds the first 2 nouns, matches[-1] is the conjunction, and
    # matches[2:-1] is the noun before it (with more than 3 nouns, this would be
    # every noun between the 2nd and the conjunction).
    regex_string = r',\s'.join(matches[:2]) + r'\s' + matches[-1] + r'\s' + r',\s'.join(matches[2:-1])
    print(regex_string)
    # ... do regex related matching here

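The caveat with this approach is that it is pure brute force: it generates every possible combination (read: permutation) of the two lists, and each one can then be tested to see whether it matches each line. It is therefore very slow, but for this example, matching the given cases (no comma before the conjunction), it produces exact matches.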
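
To make the cost concrete, here is a rough sketch that generates every pattern of the "noun, noun conj noun" shape and tries them one by one against a line. The helper name `brute_force_search` is hypothetical, and the sketch assumes the list entries are plain words with no regex metacharacters:

```python
import itertools
import re

noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']

# One pattern per ordered choice of 3 distinct nouns and 1 conjunction:
# 6 * 5 * 4 = 120 permutations, times 2 conjunctions = 240 patterns.
patterns = []
for nouns, conj in itertools.product(itertools.permutations(noun_list, 3), conjunctions):
    first, second, last = nouns
    # Shape: "noun, noun conj noun" (no comma before the conjunction).
    patterns.append(re.compile(r'\b{0},\s{1}\s{2}\s{3}\b'.format(first, second, conj, last)))

def brute_force_search(line):
    """Try every generated pattern in turn; return the first matched text, or None."""
    for pattern in patterns:
        match = pattern.search(line)
        if match:
            return match.group(0)
    return None

print(len(patterns))
print(brute_force_search("I need to buy are bacon, cheese and eggs."))
# Returns None: this line has a comma before the conjunction.
print(brute_force_search("I also need to buy milk, cheese, and bacon."))
```

The pattern count grows quickly with the size of the noun list, which is why this only suits small vocabularies.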

Adjust as needed.

answered 2013-09-22T05:19:13.277

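Break your problem down. First, you need a pattern that matches the words in your list but no other words. You can do this with the alternation operator | and literals. red|green|blue, for example, will match "red", "green", or "blue", but not "purple". Join the list of nouns with that character, and add word-boundary metacharacters and parentheses to group the alternation: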

noun_patt = r'\b(' + '|'.join(nouns) + r')\b'

Do the same with your list of conjunctions:

conj_patt = r'\b(' + '|'.join(conjunctions) + r')\b'

The overall match you want is "one or more noun_patt matches, each optionally followed by a comma, then a conj_patt match, then one more noun_patt match". Simple enough for a regex:

patt = r'({0},? )+{1} {0}'.format(noun_patt, conj_patt)

You don't really want to use re.findall() here, but re.search(), since you only expect one match per line:

>>> for line in lines:
...     print(re.search(patt, line).group(0))
...
bacon, cheese and eggs
milk, cheese, and bacon
milk, cheese or eggs
milk, bacon, or eggs
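
Putting the pieces above together, a standalone version might look like the sketch below. The function name `find_noun_list`, the inline `lines` sample, and its final no-match sentence are assumptions added for illustration. The sketch also swaps in `re.escape` and non-capturing `(?:...)` groups, and returns None for a line without a match instead of raising an AttributeError:

```python
import re

nouns = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']

# re.escape guards against regex metacharacters in the word lists;
# (?:...) groups the alternation without capturing it.
noun_patt = r'\b(?:' + '|'.join(re.escape(n) for n in nouns) + r')\b'
conj_patt = r'\b(?:' + '|'.join(re.escape(c) for c in conjunctions) + r')\b'

# One or more "noun" + optional comma + space, then a conjunction, then a final noun.
patt = re.compile(r'(?:{0},? )+{1} {0}'.format(noun_patt, conj_patt))

def find_noun_list(line):
    """Return the matched noun list in `line`, or None when there is none."""
    match = patt.search(line)
    return match.group(0) if match else None

lines = [
    "I need to buy are bacon, cheese and eggs.",
    "I also need to buy milk, cheese, and bacon.",
    "What's your favorite: milk, cheese or eggs.",
    "What's my favorite: milk, bacon, or eggs.",
    "Nothing to buy today.",  # hypothetical line with no match
]

for line in lines:
    print(find_noun_list(line))
```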

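Note that, as far as parsing English goes, you are approaching (if not already hitting) the limits of regular expressions. For anything more complex than this, you will want to look into actual parsing, perhaps with NLTK.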

answered 2013-09-22T04:08:29.293