
I want to tokenize an input file in Python. Please advise; I am a new Python user.

I have read a little about regular expressions but am still confused, so please suggest any links or a code outline.
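For reference, the kind of regex-based tokenizer alluded to above can be very short. A minimal sketch (myfile.txt is a placeholder filename, and the pattern is just one possible choice); the answers below use NLTK and spaCy instead:

import re

with open('myfile.txt') as fin:
    text = fin.read()

# \w+ matches runs of word characters; [^\w\s] matches each remaining
# non-space character, so punctuation becomes its own token
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)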


2 Answers


Try something like this:

import nltk

# word_tokenize needs the "punkt" tokenizer models; if they are
# missing, run nltk.download('punkt') once first
file_content = open("myfile.txt").read()
tokens = nltk.word_tokenize(file_content)
print(tokens)
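
word_tokenize follows the Penn Treebank conventions, so punctuation is split off into separate tokens. For example:

>>> nltk.word_tokenize("Good muffins cost $3.88 in New York.")
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.']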

The NLTK tutorial is also full of easy-to-follow examples: https://www.nltk.org/book/ch03.html

Answered on 2012-10-03T07:37:57.357

Using NLTK

If your file is small:

  • Open the file with a context manager, with open(...) as x,
  • then .read() the contents and tokenize them with word_tokenize()

[code]:

from nltk.tokenize import word_tokenize

# read the whole file into memory, then tokenize it in one go
with open('myfile.txt') as fin:
    tokens = word_tokenize(fin.read())

If your file is larger:

  • Open the file with a context manager, with open(...) as x,
  • read the file line by line with a for-loop
  • tokenize the line with word_tokenize()
  • write each tokenized line to an output file opened with the write flag ('w')

[code]:

from nltk.tokenize import word_tokenize

# tokenize one line at a time and write each tokenized line out
with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        tokens = word_tokenize(line)
        print(' '.join(tokens), file=fout)
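
Processing the file line by line keeps memory use roughly constant: only the current line is held and tokenized at any moment, rather than the whole file at once.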

Using spaCy

from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

nlp = English()
# a bare Tokenizer built from the shared vocab, without loading a pipeline
tokenizer = Tokenizer(nlp.vocab)

with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        # a Tokenizer is called directly and returns a Doc of Token objects
        doc = tokenizer(line)
        print(' '.join(tok.text for tok in doc), file=fout)
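
Note that a Tokenizer constructed from just nlp.vocab has no prefix, suffix, or infix rules, so it effectively splits on whitespace only. To get spaCy's default punctuation handling, use the prebuilt nlp.tokenizer (or call nlp(line) directly) instead.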

Answered on 2018-01-29T23:48:09.087