python - Unable to read a complete text file in Python

Question

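I am having trouble reading a file in Python.

The file I am reading is 90 MB. Opened in Word, it has a total word count of about 14 million. But when I read the file in Python, it reports a length of about 9 million words (8,915,710 words).

When I look at the last 100 words of the file with this Python statement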

print "The length of the Corpus is ", len(tokens), tokens[-100:]

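I only get words from the middle of the original file.

I am using a 64-bit Windows operating system and a 32-bit version of Python.

PC specs: i7, 1.8 GHz, 6 GB RAM

I would like to understand why Python refuses to read more than 8,915,710 words.

Thanks

Code: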

f = open('testtext.txt')
raw = f.read()
corp = lowercase(raw)
tokens = nltk.word_tokenize(corp)
print "The number of words is ", len(tokens), tokens[-100:]
print "corp ", len(corp)
print "raw ", len(raw)

I get the following output:

>> The number of words is  8915710
>> corp  53322476
>> raw  53322476

score 1 · Accepted Answer


Replace this line:

f = open('testtext.txt')

with this line:

f = open('testtext.txt', 'rb')
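
The answer does not say why this helps. One likely explanation (an assumption, not confirmed in the thread) is that Python 2 on Windows treats the byte 0x1A (Ctrl-Z) as end of file in text mode, so `read()` stops there. A minimal sketch to check this follows; `diagnose` is a hypothetical helper name, not part of any library.

```python
import os


def diagnose(path):
    """Return (size_on_disk, bytes_read, first_ctrl_z_offset) for path."""
    size_on_disk = os.path.getsize(path)
    # Binary mode returns every byte: no newline translation, no EOF marker handling.
    with open(path, 'rb') as f:
        data = f.read()
    # find() returns -1 when the file contains no 0x1A byte.
    first_ctrl_z = data.find(b'\x1a')
    return size_on_disk, len(data), first_ctrl_z
```

Calling `diagnose('testtext.txt')` should give matching values for the first two numbers. If the third is not -1 and falls well before the end of the file, that is consistent with the Ctrl-Z explanation.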

Answered 2013-03-07T23:16:54.643

score 0 · Accepted Answer

Try processing the file as a binary file:

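# Read the file in binary mode, one fixed-size chunk at a time.
chunkSize = 1024
with open('file.txt', 'rb') as f:
    dataChunk = f.read(chunkSize)
    while dataChunk:
        processData(dataChunk)  # processData is your own handler
        dataChunk = f.read(chunkSize)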
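
One caveat with fixed-size chunks is that a word can be split across two chunks. The sketch below is one way to handle that when counting words. It assumes plain whitespace splitting rather than `nltk.word_tokenize`, so its count will not match the tokenizer's exactly, and `count_words` is a hypothetical helper name.

```python
def count_words(path, chunk_size=1024 * 1024):
    """Count whitespace-separated tokens in path, reading it in binary chunks."""
    count = 0
    tail = b''  # possibly incomplete word carried over from the previous chunk
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            data = tail + chunk
            words = data.split()
            # If the data does not end in whitespace, the last word may
            # continue in the next chunk, so hold it back.
            if words and not data[-1:].isspace():
                tail = words.pop()
            else:
                tail = b''
            count += len(words)
    # Whatever is left after the final chunk is a complete word.
    if tail:
        count += 1
    return count
```

For example, `count_words('testtext.txt')` gives a rough word count without holding the whole 90 MB file in memory at once.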
