
This is related to the following question: Searching for Unicode characters in Python

I have a string like this:

sentence = 'AASFG BBBSDC FEKGG SDFGF'

I split it and get a list of words like this:

sentence.split()  # ['AASFG', 'BBBSDC', 'FEKGG', 'SDFGF']

I use the following code to search for part of a word and get the whole word:

[word for word in sentence.split() if word.endswith("GG")]

It returns ['FEKGG']

Now I need to find out what comes before and after that word.

For example, when I search for "GG", it returns ['FEKGG']. It should also be able to get

behind = 'BBBSDC'
infront = 'SDFGF'

5 Answers


Use this generator:

If you have the following string (edited from the original):

sentence = 'AASFG BBBSDC FEKGG SDFGF KETGG'

def neighborhood(iterable):
    """Yield (previous, current, following) triples; None marks a missing neighbor."""
    iterator = iter(iterable)
    prev = None
    try:
        item = next(iterator)
    except StopIteration:
        return  # empty input: yield nothing
    for following in iterator:
        yield (prev, item, following)
        prev = item
        item = following
    yield (prev, item, None)

matches = [word for word in sentence.split() if word.endswith("GG")]
results = []

# Keep every triple whose middle word is one of the matches.
for prev, item, following in neighborhood(sentence.split()):
    if item in matches:
        results.append((prev, item, following))

This returns:

[('BBBSDC', 'FEKGG', 'SDFGF'), ('SDFGF', 'KETGG', None)]
answered 2013-08-11T11:15:29.953

Here is one possibility:

words = sentence.split()
[pos] = [i for (i, word) in enumerate(words) if word.endswith("GG") ]
behind = words[pos - 1]
infront = words[pos + 1]

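You may need to watch for edge cases, such as "…GG" not appearing, appearing more than once, or being the first and/or last word. As it stands, a missing or repeated match raises ValueError from the unpacking and a match on the last word raises IndexError, which may well be the correct behavior. A match on the first word does not raise, though: words[pos - 1] becomes words[-1] and silently returns the last word.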
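
One way to cover those edge cases is sketched below. The helper name find_neighbors and its suffix parameter are my own and not part of the code above; it reports every match and uses None where a neighbor is missing, instead of raising.

```python
def find_neighbors(sentence, suffix):
    """Return (behind, word, infront) for every word ending with suffix.

    Hypothetical helper: a missing neighbor is reported as None.
    """
    words = sentence.split()
    results = []
    for pos, word in enumerate(words):
        if not word.endswith(suffix):
            continue
        # Guard the left edge explicitly: words[-1] would wrap around.
        behind = words[pos - 1] if pos > 0 else None
        infront = words[pos + 1] if pos + 1 < len(words) else None
        results.append((behind, word, infront))
    return results


print(find_neighbors('AASFG BBBSDC FEKGG SDFGF', 'GG'))
# [('BBBSDC', 'FEKGG', 'SDFGF')]
print(find_neighbors('XXGG AASFG', 'GG'))
# [(None, 'XXGG', 'AASFG')]
print(find_neighbors('AASFG', 'GG'))
# []
```

Whether None or an exception is the better signal depends on the caller; the sketch simply makes the choice explicit.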

A completely different solution, using a regular expression, avoids splitting the string into an array in the first place:

import re

match = re.search(r'\b(\w+)\s+(?:\w+GG)\s+(\w+)\b', sentence)
(behind, infront) = match.groups()  # match is None if nothing is found
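
If more than one word can end in "GG", re.search only reports the first. A possible extension, sketched here under my own assumptions, wraps the whole pattern in a lookahead so that re.finditer produces zero-width matches and two adjacent "…GG" words can both be found. The names PATTERN and regex_neighbors are hypothetical, and like the original pattern this skips a matching word that has no neighbor on one side.

```python
import re

# Assumed pattern: everything sits inside a lookahead, so each match is
# zero-width and a word can serve as both a neighbor and a later match.
PATTERN = re.compile(r'(?=\b(\w+)\s+(\w+GG)\s+(\w+)\b)')


def regex_neighbors(sentence):
    """Yield (behind, word, infront) for each "...GG" word with two neighbors."""
    for m in PATTERN.finditer(sentence):
        yield m.groups()


print(list(regex_neighbors('AASFG BBBSDC FEKGG SDFGF KETGG END')))
# [('BBBSDC', 'FEKGG', 'SDFGF'), ('SDFGF', 'KETGG', 'END')]
```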
answered 2013-08-11T11:02:40.083

Another option, based on itertools, which may be more memory-friendly on large datasets:

from itertools import tee

def sentence_targets(sentence, endstring):
    before, target, after = tee(sentence.split(), 3)
    # Offset the iterators: target runs one word ahead of before,
    # and after runs two words ahead.
    next(target, None)
    next(after, None)
    next(after, None)
    for trigram in zip(before, target, after):
        if trigram[1].endswith(endstring):
            yield trigram

Edit: fixed a typo
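
Note that sentence.split() builds the full word list before tee ever sees it, so if memory is the real concern, a lazier variant might look like the sketch below. This is an alternative under my own assumptions, not the code above: the names iter_words and lazy_targets are hypothetical, words are taken to be runs of non-whitespace found with re.finditer, and a collections.deque holds the three-word window.

```python
import re
from collections import deque


def iter_words(text):
    """Lazily yield whitespace-separated words without building a list."""
    for m in re.finditer(r'\S+', text):
        yield m.group()


def lazy_targets(text, endstring):
    """Yield (before, target, after) where target ends with endstring.

    Like the tee-based version, the first and last words are never
    reported as targets because they lack a neighbor on one side.
    """
    window = deque(maxlen=3)  # keeps only the three most recent words
    for word in iter_words(text):
        window.append(word)
        if len(window) == 3 and window[1].endswith(endstring):
            yield tuple(window)


print(list(lazy_targets('AASFG BBBSDC FEKGG SDFGF KETGG', 'GG')))
# [('BBBSDC', 'FEKGG', 'SDFGF')]
```

The input string itself still has to be in memory; only the intermediate word list is avoided.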

answered 2013-08-11T21:53:48.250

sentence = 'AASFG BBBSDC FEKGG SDFGF AAABGG FOOO EEEGG'

def make_trigrams(l):
    l = [None] + l + [None]

    for i in range(len(l)-2):
        yield (l[i], l[i+1], l[i+2])


for result in [t for t in make_trigrams(sentence.split()) if t[1].endswith('GG')]:
    behind,match,infront = result

    print('Behind:', behind)
    print('Match:', match)
    print('Infront:', infront, '\n')

Output:

Behind: BBBSDC
Match: FEKGG
Infront: SDFGF

Behind: SDFGF
Match: AAABGG
Infront: FOOO

Behind: FOOO
Match: EEEGG
Infront: None
answered 2013-08-11T21:32:13.617

Here is one way. The preceding and following elements will be None if the "GG" word is at the beginning or end of the sentence.

words = sentence.split()
[(behind, word, infront) for (behind, word, infront) in
 zip([None] + words[:-1], words, words[1:] + [None])
 if word.endswith("GG")]
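
For a quick check, here is the same zip-based idea run on a sample sentence. The sample string and the name triples are assumptions for illustration; the variable names follow the question's convention, where behind is the previous word and infront is the next one.

```python
sentence = 'EEEGG AASFG BBBSDC FEKGG SDFGF KETGG'
words = sentence.split()

# Pad with None on both sides so edge words still get a triple.
triples = [
    (behind, word, infront)
    for behind, word, infront in zip([None] + words[:-1], words, words[1:] + [None])
    if word.endswith('GG')
]
print(triples)
# [(None, 'EEEGG', 'AASFG'), ('BBBSDC', 'FEKGG', 'SDFGF'), ('SDFGF', 'KETGG', None)]
```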
answered 2013-08-11T20:37:38.227