
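I am attempting a Kaggle challenge here, and unfortunately I am stuck at a very basic step, which is probably down to my limited Python knowledge. I am trying to read the dataset into a pandas DataFrame by executing the following command: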

test = pd.DataFrame.from_csv("C:/Name/DataMining/hillary/data/output/emails.csv")

The problem is that this file has over 300,000 records, but I am only reading 7945 rows and 21 columns.

print (test.shape)
(7945, 21)

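I have double-checked the file and cannot find anything special about line 7945. Any pointers as to why this could be happening? It looks like a fairly common situation, so I hope someone who has run into this error can help me.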


1 Answer


I think it is better to use the function read_csv with the parameters quoting=csv.QUOTE_NONE and error_bad_lines=False:

import pandas as pd
import csv

test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, error_bad_lines=False)

print (test.shape)
#(381422, 22)

However, some (problematic) rows will be skipped.
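
Note that error_bad_lines was deprecated in pandas 1.3 and removed in pandas 2.0, so on a recent installation the call above raises a TypeError. A minimal sketch of the equivalent call, assuming pandas 1.3 or later and using a small made-up in-memory sample (not the real Emails.csv) so that it runs on its own:

```python
import csv
import io

import pandas as pd

# Hypothetical sample data, not the real Emails.csv:
# the line starting with "3" has one field too many.
raw = (
    "Id,Subject,Body\n"
    "1,hello,first\n"
    "2,re: hello,second\n"
    "3,oops,extra,field\n"
    "4,bye,last\n"
)

# on_bad_lines="skip" replaces error_bad_lines=False in pandas >= 1.3:
# lines with too many fields are dropped instead of raising an error.
test = pd.read_csv(io.StringIO(raw), quoting=csv.QUOTE_NONE, on_bad_lines="skip")

print(test.shape)
print(test["Id"].tolist())
```

In this sketch the malformed line is silently dropped, which is the same trade-off as in the original call: the read succeeds, but the skipped rows are lost.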

If you want to skip the email body data, you can use:

import pandas as pd
import csv

test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE,  sep=',', error_bad_lines=False, header=None,
    names=["Id","DocNumber","MetadataSubject","MetadataTo","MetadataFrom","SenderPersonId","MetadataDateSent","MetadataDateReleased","MetadataPdfLink","MetadataCaseNumber","MetadataDocumentClass","ExtractedSubject","ExtractedTo","ExtractedFrom","ExtractedCc","ExtractedDateSent","ExtractedCaseNumber","ExtractedDocNumber","ExtractedDateReleased","ExtractedReleaseInPartOrFull","ExtractedBodyText","RawText"])

print (test.shape)

#delete row with NaN in column MetadataFrom
test = test.dropna(subset=['MetadataFrom'])
#delete headers in data
test = test[test.MetadataFrom != 'MetadataFrom']
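
To illustrate why the quoting parameter changes the row count so much, here is a small sketch on made-up data (an assumption for illustration, not a confirmed diagnosis of this particular file). With the default quoting, a quoted field that contains line breaks is read as a single record, so the number of records can be far smaller than the number of lines in the file. With quoting=csv.QUOTE_NONE, every physical line becomes its own row:

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the Body of record 1 is quoted and spans two lines.
raw = 'Id,Body\n1,"line one\nline two"\n2,short\n'

# Default quoting: the quoted multi-line field stays inside one record.
default = pd.read_csv(io.StringIO(raw))

# QUOTE_NONE: quote characters are treated as ordinary text,
# so the embedded line break starts a new row.
unquoted = pd.read_csv(io.StringIO(raw), quoting=csv.QUOTE_NONE)

print(len(raw.splitlines()))  # physical lines, including the header
print(default.shape)
print(unquoted.shape)
```

If the email bodies in the real file are quoted multi-line fields, this would account for a file with several hundred thousand lines yielding only a few thousand records under the default settings, and it is also why rows with NaN in MetadataFrom appear and need to be dropped after reading with QUOTE_NONE.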
answered 2015-10-16T05:57:14.370