3

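I have a dictionary of 2- and 3-word phrases that I want to search for in RSS feeds. I grab the RSS feeds and process them, and they end up as strings in a list called "documents". I want to check each string against the dictionary below and, if any phrase in the dictionary matches part of the text, return the value for that key. I am not sure about the best way to approach this problem. Any suggestions would be greatly appreciated.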

ngramList = {"cash outflows": -1, "pull out": -1, "winding down": -1, "most traded": -1, "steep gains": -1, "military strike": -1,
             "resumed operations": +1, "state aid": +1, "bail out": -1, "cut costs": -1, "alleged violations": -1, "under perform": -1, "more than expected": +1,
             "pay more taxes": -1, "not for sale": +1, "struck a deal": +1, "cash flow problems": -2}
4

2 Answers

2

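I would combine all the strings into a single regular expression and iterate over the matches it finds in the text. I am not 100% sure, but I think the regex implementation in Python is smart enough to put all the words in a trie, which would give you good performance.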

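import re

# Escape each phrase so any regex metacharacters are matched literally.
strings = [re.escape(s) for s in ngramList]
regex = re.compile(r'\b(' + '|'.join(strings) + r')\b', re.IGNORECASE)
score = 0
for text in documents:
    scores = []
    for m in regex.finditer(text):
        # Lowercase the match: with IGNORECASE its case may differ from the dictionary key.
        scores.append(ngramList[m.group(1).lower()])
    # process the scores here, e.g. add their sum to a global variable:
    score += sum(scores)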
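
For reference, the same idea can be packaged as a function that returns the matched phrases together with their values for each document, which is what the question asks for. This is only a sketch: the name `score_documents`, the shortened dictionary, and the sample `documents` list are made up for illustration, and it assumes every dictionary key is lowercase.

```python
import re

# Shortened copy of the dictionary from the question, for illustration only.
ngramList = {"cash outflows": -1, "state aid": +1, "cash flow problems": -2}

def score_documents(documents, weights):
    """Return, per document, the list of (phrase, value) pairs found in it."""
    # Longest phrases first, so a longer phrase wins over a shorter one
    # that shares its prefix.
    phrases = sorted(weights, key=len, reverse=True)
    regex = re.compile(
        r'\b(' + '|'.join(re.escape(p) for p in phrases) + r')\b',
        re.IGNORECASE)
    results = []
    for text in documents:
        # Assumes the keys of `weights` are lowercase.
        hits = [(m.group(1).lower(), weights[m.group(1).lower()])
                for m in regex.finditer(text)]
        results.append(hits)
    return results

# Hypothetical sample input.
documents = ["State aid ended the cash flow problems.", "Nothing relevant here."]
results = score_documents(documents, ngramList)
print(results)
```

Summing the second element of each pair gives the per-document score used in the loop above.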
answered 2013-10-06T19:33:27.043
2

I am assuming the numbers in that dictionary (-2, -1, +1) are weights, so you need a count of each phrase in each document for them to be useful.

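So the pseudocode for doing this would be:

  1. Split the document into a list of lines, then split each line into a list of words.
  2. Loop over each word in a line, looking forward and backward within that line to generate the candidate phrases.
  3. As each phrase is generated, record it in a global dictionary that maps the phrase to its number of occurrences.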
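
The three steps above could be sketched roughly as follows. The function name `count_phrases`, the punctuation set it strips, and the sample text are assumptions made for illustration. The sketch only looks forward from each word (a sliding window), which still visits every 2- and 3-word phrase in a line, and it does not find phrases that span a line break.

```python
from collections import Counter

def count_phrases(document, phrases, sizes=(2, 3)):
    """Count how often each known phrase occurs as a run of consecutive words."""
    counts = Counter()
    for line in document.splitlines():
        # Lowercase and strip surrounding punctuation so "deal." matches "deal".
        words = [w.strip('.,;:!?"\'()').lower() for w in line.split()]
        words = [w for w in words if w]
        for n in sizes:
            # Slide a window of n words across the line.
            for i in range(len(words) - n + 1):
                candidate = " ".join(words[i:i + n])
                if candidate in phrases:
                    counts[candidate] += 1
    return counts

# Hypothetical sample input.
sample = "They struck a deal.\nThe state aid was more than expected; state aid helps."
known = {"struck a deal", "state aid", "more than expected"}
counts = count_phrases(sample, known)
print(dict(counts))
```

Because each candidate is looked up in a set, the cost grows with the length of the document rather than with the number of phrases in the dictionary.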

Here is some code for the simple case of counting each phrase in a document, which seems to be what you are trying to do:

text = """
I have a dictionary of 2 and 3 word phrases that I want to search in rss feeds for a match. 

I grab   the rss feeds, process them and they end up as a string IN a list entitled "documents". 
I want to check the dictionary below and if any of the phrases in the dictionary match part of a string of text I want to return the values for the key. 
I am not sure about the best way to approach this problem. Any suggestions would be greatly appreciated.
"""

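import re

ngrams = ["grab the rss", "approach this", "in"]

counts = {}
for ngram in ngrams:
    # Escape each word, then allow any run of whitespace between words.
    words = [re.escape(w) for w in ngram.split()]
    pattern = re.compile(r"\s+".join(words), re.IGNORECASE)
    # No \b anchors are used, so "in" is also counted inside "string".
    counts[ngram] = len(pattern.findall(text))

print(counts)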

Output:

{'grab the rss': 1, 'approach this': 1, 'in': 5}
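
If the numbers in the question's dictionary are indeed weights, the counts can be turned into a single score per document by multiplying each count by its weight. The following is a minimal sketch under that assumption; `weighted_score`, the shortened `weights` dictionary, and the sample sentence are made up for illustration. Unlike the counting code above, it wraps each pattern in `\b` so that a phrase is only counted as whole words.

```python
import re

# Shortened copy of the dictionary from the question, for illustration only.
weights = {"state aid": +1, "bail out": -1, "cash flow problems": -2}

def weighted_score(text, weights):
    """Sum weight * occurrences over all phrases, matching whole words only."""
    total = 0
    for phrase, weight in weights.items():
        # Escape each word and allow any run of whitespace between words.
        words = [re.escape(w) for w in phrase.split()]
        pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
        total += weight * len(pattern.findall(text))
    return total

# Hypothetical sample input.
doc = "Cash flow   problems forced a bail out, then more State Aid and state aid again."
total = weighted_score(doc, weights)
print(total)
```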
answered 2013-10-06T20:08:51.890