python - Unigrams in Python

Question

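I am trying to generate unigrams from a text file, but only the unigrams of the first line of the file are displayed. I want to display the unigrams for every sentence in the file.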

import string;
import sys;
import tokenize;

f = open("data.txt", 'r');
line=f.readline();
while line:
    line = line.rstrip();
    list = line.split();
    for word in list:
         print word
    line = f.readline();

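Why doesn't it display the unigrams for every sentence, and how can I turn this into bigrams?

Thanks in advance.

data.txt is a text file containing sentences. It has two sentences: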

        Hello world this is a test code
        today is 29th november 2011

The output I get:

    Hello
    world
    this
    is
    a
    test
    code

score 3 · Accepted Answer

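First, if you are using a recent version of Python, you can simply write "for line in f", which is much simpler than the readline approach. Also, you don't need a semicolon at the end of every line. Use one only when you want to put multiple statements on a single line.

The following lines work fine for me: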

f = open("data.txt", 'r')
for line in f:
    for word in line.split():
        print word

To build the bigrams of a line, something like this should be enough (untested!):

items = line.split()
bigrams = []
for i in xrange(len(items) - 1):
    bigrams.append((items[i], items[i + 1]))
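
The same pairing can also be written with zip. This is only a minimal sketch, assuming Python 3 (where print is a function and xrange no longer exists); the helper name line_bigrams is made up for illustration and is not part of the original answer.

```python
def line_bigrams(line):
    """Return the adjacent word pairs (bigrams) of one line of text."""
    items = line.split()
    # Pair each word with its successor; zip stops at the shorter
    # sequence, so a line with fewer than two words gives an empty list.
    return list(zip(items, items[1:]))


print(line_bigrams("today is 29th november 2011"))
# [('today', 'is'), ('is', '29th'), ('29th', 'november'), ('november', '2011')]
```

Calling it inside the "for line in f" loop would give the bigrams of every line in the file, one list per line.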

score 3 · Accepted Answer

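The snippet has a few obvious problems.

The semicolons are not needed.
None of the imported modules (e.g. tokenize) are used. That is valid, but pointless.
The loop over the file's lines uses while, which works but is odd.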

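You didn't show the structure of the text file, but I assume each sentence is on its own line (i.e. a text file containing two sentences will have two lines).

I'm not sure exactly what a bigram means in this context, so you may need to replace the bigrams function.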

from itertools import tee, izip

def bigrams(iterable):
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

with open("data.txt", 'r') as f:
    for line in f:
        words = line.strip().split()
        uni = words
        bi = bigrams(words)
        print uni
        print list(bi)
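
On Python 3, itertools.izip no longer exists and print is a function, so the snippet above needs small changes. The following is a sketch of one possible port, not the answerer's own code: the built-in zip stands in for izip, and the hypothetical helper unigrams_and_bigrams accepts any iterable of lines (an open file works too) so that it can be tried without data.txt.

```python
from itertools import tee


def bigrams(iterable):
    """Return an iterator of adjacent pairs: (s0, s1), (s1, s2), ..."""
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator by one item
    return zip(a, b)  # built-in zip is lazy on Python 3, like izip


def unigrams_and_bigrams(lines):
    """Yield (unigrams, bigrams) for each line; hypothetical helper."""
    for line in lines:
        words = line.split()
        yield words, list(bigrams(words))


# The two sample sentences from the question; a file opened with
# open("data.txt") could be passed in the same way.
sample = [
    "Hello world this is a test code\n",
    "today is 29th november 2011\n",
]
for uni, bi in unigrams_and_bigrams(sample):
    print(uni)
    print(bi)
```

Treating a bigram as a pair of neighbouring words within one line is the same assumption the answer makes; pairs that span two lines are not produced.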
