I have a file with more than 5 million rows and 20 fields. I wanted to open it in Pandas, but got an out-of-memory error:
pandas.parser.CParserError: Error tokenizing data. C error: out of memory
I then read some posts about similar problems and found Blaze, but after following its three approaches (.Data, .CSV, .Table), apparently none of them worked.
# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import re
import numpy as np
import sys
import blaze as bz
reload(sys)
sys.setdefaultencoding('utf-8')
# Gave an out-of-memory error
'''data = pd.read_csv('file.csv', header=0, encoding='utf-8', low_memory=False)
df = DataFrame(data)
print df.shape
print df.head()'''
data = bz.Data('file.csv')
# Tried the following too, but no luck
'''data = bz.CSV('file.csv')
data = bz.Table('file.csv')'''
print data
print data.head(5)
Output:
_1
_1.head(5)
[Finished in 1.0s]
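As an aside, pandas itself can stream a large file with `read_csv`'s `chunksize` parameter, so only one chunk of rows is held in memory at a time. A minimal sketch of that approach (using an in-memory stand-in for `file.csv`, which is an assumption since I can't share the real file):

```python
import io
import pandas as pd

# Stand-in for file.csv: 10 rows, 2 columns
csv_text = "a,b\n" + "\n".join("%d,%d" % (i, i * 2) for i in range(10))

# chunksize makes read_csv return an iterator of DataFrames,
# each holding at most `chunksize` rows
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total_rows += len(chunk)  # process each chunk, then discard it

print(total_rows)  # 10
```

Whether chunked processing fits depends on the workload: it works for row-wise aggregation or filtering, but not if the whole table must sit in memory at once.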