I need to compute unigrams, bigrams and trigrams for a text file containing text like the following:

"Cystic fibrosis affects 30,000 children and young adults in the US alone. Inhaling the mists of salt water can reduce the pus and infection that fills the airways of cystic fibrosis sufferers, although side effects include a nasty coughing fit and a harsh taste. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."

I started out in Python and used the following code:

#!/usr/bin/env python
# File: n-gram.py
def N_Gram(N, text):
    NList = []                       # start with an empty list
    if N > 1:
        space = " " * (N - 1)        # add N - 1 spaces
        text = space + text + space  # pad both front and back
    # append the slices [i:i+N] to NList
    for i in range(len(text) - (N - 1)):
        NList.append(text[i:i + N])
    return NList                     # return the list

# test code
for i in range(5):
    print(N_Gram(i + 1, "text"))

# more test code
nList = N_Gram(7, "Here is a lot of text to print")
for ngram in nList:
    print('"' + ngram + '"')

http://www.daniweb.com/software-development/python/threads/39109/generating-n-grams-from-a-word

But this works on all the n-grams within a word, when what I want are n-grams between words, as in CYSTIC and FIBROSIS or CYSTIC FIBROSIS. Can someone help me out here?

8 Answers

A short Pythonesque solution, from a blog post:

def find_ngrams(input_list, n):
  return zip(*[input_list[i:] for i in range(n)])

Usage:

>>> input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
>>> find_ngrams(input_list, 1)
[('all',), ('this',), ('happened',), ('more',), ('or',), ('less',)]
>>> find_ngrams(input_list, 2)
[('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]
>>> find_ngrams(input_list, 3)
[('all', 'this', 'happened'), ('this', 'happened', 'more'), ('happened', 'more', 'or'), ('more', 'or', 'less')]
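
On Python 3, zip() returns a lazy iterator rather than a list, so wrap the call in list() to get output like the above:

>>> list(find_ngrams(input_list, 2))
[('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]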
Answered 2015-06-03T00:53:49.097

Assuming the input is a string of space-separated words, like x = "a b c d", you can use the following function (edit: see the last function for a possibly more complete solution):

def ngrams(input, n):
    input = input.split(' ')
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]

If you want to join those back into strings, you can call something like:

[' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d']

Finally, that doesn't total things up, so if your input was 'a a a a', you would need to count them up in a dict:

grams = {}
for g in (' '.join(x) for x in ngrams(input, 2)):
    grams.setdefault(g, 0)
    grams[g] += 1

Putting it all together into one final function gives:

def ngrams(input, n):
    input = input.split(' ')
    output = {}
    for i in range(len(input)-n+1):
        g = ' '.join(input[i:i+n])
        output.setdefault(g, 0)
        output[g] += 1
    return output

ngrams('a a a a', 2) # {'a a': 3}
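
Alternatively, collections.Counter from the standard library can replace the setdefault bookkeeping; a minimal sketch (the name ngram_counts is mine, not from the code above):

from collections import Counter

def ngram_counts(input, n):
    # count each space-joined n-gram in a single pass
    words = input.split(' ')
    return Counter(' '.join(words[i:i+n]) for i in range(len(words)-n+1))

ngram_counts('a a a a', 2)  # Counter({'a a': 3})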
Answered 2012-11-16T20:33:15.547

Use NLTK (the Natural Language Toolkit). Use its functions to tokenize (split) your text into a list, then find the bigrams and trigrams.

import nltk
words = nltk.word_tokenize(my_text)  # tokenize the text into a list of words
my_bigrams = nltk.bigrams(words)
my_trigrams = nltk.trigrams(words)
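
Note that in NLTK 3 both calls return lazy generators. A small sketch of materializing and counting them, using nltk.FreqDist (NLTK's frequency counter):

bigram_list = list(my_bigrams)               # materialize the generator
trigram_counts = nltk.FreqDist(my_trigrams)  # frequency count per trigram tuple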
Answered 2012-11-17T15:26:00.070

There is one more interesting module in Python called scikit-learn. The following code will help you get all the n-grams in a particular range:

from sklearn.feature_extraction.text import CountVectorizer 
text = "this is a foo bar sentences and i want to ngramize it"
vectorizer = CountVectorizer(ngram_range=(1,6))
analyzer = vectorizer.build_analyzer()
print(analyzer(text))

The output is:

['this', 'is', 'foo', 'bar', 'sentences', 'and', 'want', 'to', 'ngramize', 'it', 'this is', 'is foo', 'foo bar', 'bar sentences', 'sentences and', 'and want', 'want to', 'to ngramize', 'ngramize it', 'this is foo', 'is foo bar', 'foo bar sentences', 'bar sentences and', 'sentences and want', 'and want to', 'want to ngramize', 'to ngramize it', 'this is foo bar', 'is foo bar sentences', 'foo bar sentences and', 'bar sentences and want', 'sentences and want to', 'and want to ngramize', 'want to ngramize it', 'this is foo bar sentences', 'is foo bar sentences and', 'foo bar sentences and want', 'bar sentences and want to', 'sentences and want to ngramize', 'and want to ngramize it', 'this is foo bar sentences and', 'is foo bar sentences and want', 'foo bar sentences and want to', 'bar sentences and want to ngramize', 'sentences and want to ngramize it']

Here it gives all the n-grams in the range 1 to 6, using the CountVectorizer class.
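
The analyzer above only produces the n-grams. If you also want their counts, fit_transform gives you a document-term matrix; a minimal sketch (get_feature_names_out assumes scikit-learn 1.0 or newer):

X = vectorizer.fit_transform([text])  # 1 x vocabulary sparse count matrix
counts = dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0]))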

Answered 2014-10-30T14:14:55.450

Using collections.deque:

from collections import deque
from itertools import islice

def ngrams(message, n=1):
    it = iter(message.split())
    window = deque(islice(it, n), maxlen=n)  # the first window of n words
    yield tuple(window)
    for item in it:                          # slide the window one word at a time
        window.append(item)
        yield tuple(window)
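
For example, consuming the generator:

>>> list(ngrams("hello how are you today", 2))
[('hello', 'how'), ('how', 'are'), ('are', 'you'), ('you', 'today')]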

...or maybe you could do it as a one-line list comprehension:

n = 2
message = "Hello, how are you?".split()
myNgrams = [message[i:i+n] for i in range(len(message) - n + 1)]
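
which, for the message above, gives:

[['Hello,', 'how'], ['how', 'are'], ['are', 'you?']]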
Answered 2012-11-16T20:39:38.203

nltk has native support for n-grams.

n is the size of the n-gram; e.g. n=3 gives trigrams.

from nltk import ngrams

def ngramize(texts, n):
    output = []
    for text in texts:             # each text should be a sequence of tokens
        output += ngrams(text, n)  # ngrams() yields n-tuples
    return output
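
For example, assuming each text has already been tokenized into a list of words:

>>> ngramize([['all', 'this', 'happened'], ['more', 'or', 'less']], 2)
[('all', 'this'), ('this', 'happened'), ('more', 'or'), ('or', 'less')]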
Answered 2016-09-28T22:35:12.163

If efficiency is an issue and you have to build several different n-grams, I would consider using the following code (building upon Franck's excellent answer):

from itertools import chain

def n_grams(seq, n=1):
    """Returns an iterator over the n-grams given a list of tokens."""
    shift_token = lambda i: (el for j, el in enumerate(seq) if j >= i)
    shifted_tokens = (shift_token(i) for i in range(n))
    tuple_ngrams = zip(*shifted_tokens)
    return tuple_ngrams  # if joining in the generator: (" ".join(i) for i in tuple_ngrams)

def range_ngrams(list_tokens, ngram_range=(1,2)):
    """Returns an iterator over all n-grams for n in range(*ngram_range) given a list of tokens."""
    return chain(*(n_grams(list_tokens, i) for i in range(*ngram_range)))

Usage:

>>> input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngram_range=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

~ Same speed as NLTK (note that on Python 3, both nltk.ngrams and n_grams return lazy iterators, so these timings mostly measure setup rather than full iteration):

import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngram_range=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Answered 2018-01-18T07:34:53.407

Although this post is old, I would like to mention my answer here, so that most of the n-gram creation logic can be found in one post.

There is something in Python called TextBlob. It creates n-grams very easily, similar to NLTK.

Below is the code snippet and its output, for ease of understanding.

sent = """This is to show the usage of Text Blob in Python"""
blob = TextBlob(sent)
unigrams = blob.ngrams(n=1)
bigrams = blob.ngrams(n=2)
trigrams = blob.ngrams(n=3)

The output is:

unigrams
[WordList(['This']),
 WordList(['is']),
 WordList(['to']),
 WordList(['show']),
 WordList(['the']),
 WordList(['usage']),
 WordList(['of']),
 WordList(['Text']),
 WordList(['Blob']),
 WordList(['in']),
 WordList(['Python'])]

bigrams
[WordList(['This', 'is']),
 WordList(['is', 'to']),
 WordList(['to', 'show']),
 WordList(['show', 'the']),
 WordList(['the', 'usage']),
 WordList(['usage', 'of']),
 WordList(['of', 'Text']),
 WordList(['Text', 'Blob']),
 WordList(['Blob', 'in']),
 WordList(['in', 'Python'])]

trigrams
[WordList(['This', 'is', 'to']),
 WordList(['is', 'to', 'show']),
 WordList(['to', 'show', 'the']),
 WordList(['show', 'the', 'usage']),
 WordList(['the', 'usage', 'of']),
 WordList(['usage', 'of', 'Text']),
 WordList(['of', 'Text', 'Blob']),
 WordList(['Text', 'Blob', 'in']),
 WordList(['Blob', 'in', 'Python'])]

It's as simple as that.
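
Each WordList behaves like a plain list of strings, so if you prefer ordinary strings you can join the n-grams back together, e.g.:

[' '.join(ng) for ng in blob.ngrams(n=2)]
# ['This is', 'is to', 'to show', 'show the', 'the usage',
#  'usage of', 'of Text', 'Text Blob', 'Blob in', 'in Python']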

TextBlob does much more than this. Please go through its documentation for more details: https://textblob.readthedocs.io/en/dev/

Answered 2017-09-20T13:11:46.290