
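I am trying to parse a large .txt file with Pandas. The file is 1.6 GB. You can download the file here (it is the GeoNames database dump of all countries and settlements).

I consulted the answers here and here about loading and parsing files in Pandas, and this is what I have in my code: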

import pandas as pd

for chunk in pd.read_csv(
    "allCountries.txt",
    header=None,
    engine="python",
    sep=r"\s{1,}",
    names=[
        "geonameid",
        "name",
        "asciiname",
        "alternatenames",
        "latitude",
        "longitude",
        "feature class",
        "feature code",
        "country code",
        "cc2",
        "admin1 code",
        "admin2 code",
        "admin3 code",
        "admin4 code",
        "population",
        "elevation",
        "dem",
        "timezone",
        "modification date",
    ],
    chunksize=1000,
):
    print(chunk[0])  # just printing out the first row

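If I run the code above, I get the following error:

ParserError: Expected 20 fields in line 1, saw 25. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

I don't know what is going wrong here. Can someone tell me what is wrong and how I can fix it?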


2 Answers


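Your separator is wrong, because one of the columns (name) contains spaces:

2986043 Pic de Font Blanca Pic de Font Blanca Pic de Font Blanca,Pic du Port 42.64991 1.53335 T PK AD 00 0 2860 Europe/Andorra 2014-11-05

With a whitespace separator, every word of the name becomes its own field, so the row is parsed incorrectly.

This code works for me: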
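To see the effect without loading the whole file, here is a minimal sketch that splits the sample row both ways. The tab layout is an assumption: the post shows the row with the tabs flattened, so the positions of the empty fields in `fields` are a hypothetical reconstruction based on the 19 column names in the question.

```python
import re

# Hypothetical reconstruction of the sample row. The empty strings stand for
# fields assumed to be blank (cc2, admin2-4 codes, dem); the real tab layout
# is not visible in the post.
fields = [
    "2986043", "Pic de Font Blanca", "Pic de Font Blanca",
    "Pic de Font Blanca,Pic du Port", "42.64991", "1.53335",
    "T", "PK", "AD", "", "00", "", "", "", "0", "2860", "",
    "Europe/Andorra", "2014-11-05",
]
line = "\t".join(fields)

# Same pattern as sep=r"\s{1,}": splits on spaces as well as tabs.
by_whitespace = re.split(r"\s{1,}", line)
# Splitting on single tabs keeps multi-word names and empty fields intact.
by_tab = line.split("\t")

print(len(by_whitespace))  # 25
print(len(by_tab))         # 19
print(by_whitespace[1:5])  # ['Pic', 'de', 'Font', 'Blanca']
```

Under that assumed layout the whitespace split produces more fields than there are column names, which is the kind of mismatch the ParserError reports.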

import pandas as pd

for chunk in pd.read_csv(
    "allCountries.txt",
    header=None,
    engine="python",
    sep=r"\t+",
    names=[
        "geonameid",
        "name",
        "asciiname",
        "alternatenames",
        "latitude",
        "longitude",
        "feature class",
        "feature code",
        "country code",
        "cc2",
        "admin1 code",
        "admin2 code",
        "admin3 code",
        "admin4 code",
        "population",
        "elevation",
        "dem",
        "timezone",
        "modification date",
    ],
    chunksize=1000,
):
    print(chunk)
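One caveat that may be worth checking on your data: a pattern like `\t+` treats a run of tabs as a single separator, so an empty field between two tabs disappears and the values after it shift left. The sketch below uses a tiny made-up in-memory sample (the column names `id`, `name`, `cc2`, `code` and the values are illustrative, not taken from the GeoNames file) to compare `\t+` with a plain `"\t"`.

```python
import io

import pandas as pd

# Made-up sample: the second row has an empty third field (two tabs in a row).
sample = "2\tB\tb\ty\n1\tA\t\tx\n"
names = ["id", "name", "cc2", "code"]

# Regex separator: consecutive tabs collapse into one delimiter.
collapsed = pd.read_csv(
    io.StringIO(sample), header=None, engine="python", sep=r"\t+", names=names
)

# Single-character separator: each tab delimits exactly one field.
exact = pd.read_csv(io.StringIO(sample), header=None, sep="\t", names=names)

print(collapsed)  # in the second row, "x" ends up under "cc2"
print(exact)      # in the second row, "cc2" is NaN and "x" stays under "code"
```

If the file has blank fields, a plain `sep="\t"` is probably the safer choice, and it also lets pandas use its default parser instead of `engine="python"`.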
answered 2020-09-25T09:16:35.267

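Opening the first 10 lines of the file in LibreOffice with tab as the separator works fine, so a plain tab separator should work in pandas too: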

import csv
import pandas as pd

for chunk in pd.read_csv(
    'allCountries.txt',
    header=None,
    engine="python",
    sep="\t",
    names=[
        "geonameid",
        "name",
        "asciiname",
        "alternatenames",
        "latitude",
        "longitude",
        "feature class",
        "feature code",
        "country code",
        "cc2",
        "admin1 code",
        "admin2 code",
        "admin3 code",
        "admin4 code",
        "population",
        "elevation",
        "dem",
        "timezone",
        "modification date",
    ],
    quoting=csv.QUOTE_NONE,
    chunksize=1000
):
    print(chunk.iloc[0])  # just printing out the first row

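The file also contains the characters ' and ". By default pandas treats " as a quote character, which leads to errors when a field contains an unbalanced one; setting quoting to csv.QUOTE_NONE fixed it.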
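As a small illustration of the quoting point, the sketch below reads a made-up two-row sample in which one field starts with an unbalanced `"`. The sample data and the column names are assumptions for the demo, not rows from the GeoNames file.

```python
import csv
import io

import pandas as pd

# Made-up sample: the name in the first row begins with a stray double quote.
sample = '1\t"Rock\tx\n2\tPlain\ty\n'

df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["id", "name", "code"],
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
)

print(df)
print(df.loc[0, "name"])  # the quote is kept as part of the value
```

Without `quoting=csv.QUOTE_NONE`, the parser would try to read from the stray `"` up to a matching closing quote, so fields and rows after it can get merged or rejected.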

answered 2020-09-25T09:23:02.387