python - Strange python csv module behaviour - records are not split

Question

I am trying to parse cities5000.txt from geonames.org ( http://download.geonames.org/export/dump/cities5000.zip ) with Python's csv module and I am getting very strange behaviour: csv does not split all of the lines in the file.

For example:

>>> import csv
>>> len(open('cities5000.txt').read().splitlines())
46955
>>> len(list(csv.reader(open('cities5000.txt'))))
46955
# but here comes some fun
>>> len(list(csv.reader(open('cities5000.txt'), delimiter='\t')))
46048

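'\t' is the actual delimiter used in this file. So about 900 records are being treated as part of a field in some other record. Everything else in the parsed data looks fine.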

The question is: what causes this, and how can I avoid it without splitting all of these records by hand?

score 1 · Accepted Answer

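The default dialect also specifies a quote character ("), and a quoted field may contain newlines. A stray double quote in the data therefore opens a quoted field that swallows the following lines until the next quote appears. You can override this with quotechar=None.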

>>> len(open('cities5000.txt').read().splitlines())
46957
>>> len(list(csv.reader(open('cities5000.txt'), delimiter='\t')))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_csv.Error: field larger than field limit (131072)
>>> len(list(csv.reader(open('cities5000.txt'), delimiter='\t', quotechar=None)))
46957
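To make the effect easier to see without downloading the dump, here is a minimal sketch on a made-up three-line sample. The sample text and the helper name count_rows are assumptions for illustration only, and it is written for Python 3. It passes quoting=csv.QUOTE_NONE, which is the explicit way to tell the reader to treat quote characters as ordinary data.

```python
import csv
import io

# Hypothetical tab-separated sample: the unbalanced " on the first line
# stands in for whatever stray quotes the real file contains.
sample = 'a\t"b\tc\nd\te\tf\ng\th"\ti\n'


def count_rows(text, **fmtparams):
    """Count the rows csv.reader yields for tab-separated text."""
    stream = io.StringIO(text, newline='')
    return len(list(csv.reader(stream, delimiter='\t', **fmtparams)))


# Default dialect: the quote opens a field that spans the line breaks,
# so the three physical lines collapse into a single record.
default_rows = count_rows(sample)

# Quoting disabled: every physical line becomes its own record.
unquoted_rows = count_rows(sample, quoting=csv.QUOTE_NONE)

print(len(sample.splitlines()), default_rows, unquoted_rows)
```

With quoting disabled the quote characters stay in the field values, which is usually what you want for a dump that never intended them as quoting.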

score 0 · Accepted Answer

I think the default delimiter is defined by the default dialect, "excel" (https://docs.python.org/2/library/csv.html#csv-fmt-params).

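I don't know offhand which delimiter that is (the documentation lists it as a comma), but I think defining the delimiter yourself gives you more control over how the data is split.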

I could also imagine some problems with the city names and UTF-8 encoding (not sure, just a hint for further research).

Edit: a quick Google search turns up https://github.com/oamasood/GeonamesPy, which may also help.
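On the encoding hint above, a small sketch of how the file could be opened in Python 3 so that non-ASCII city names survive intact. It assumes the data is UTF-8 and uses a temporary file with two made-up rows in place of cities5000.txt; newline='' is what the csv documentation recommends when opening files for the module.

```python
import csv
import os
import tempfile

# Hypothetical sample rows with non-ASCII names, standing in for the dump.
rows_in = [['1', 'Zürich'], ['2', 'São Paulo']]

# Write the sample as UTF-8, tab-separated, with quoting disabled.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', newline='',
                                 suffix='.txt', delete=False) as tmp:
    csv.writer(tmp, delimiter='\t', quoting=csv.QUOTE_NONE).writerows(rows_in)
    path = tmp.name

# Read it back with an explicit encoding instead of the platform default.
with open(path, encoding='utf-8', newline='') as f:
    rows_out = list(csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE))

os.remove(path)
print(rows_out)
```

Relying on the platform default encoding can raise decoding errors or garble names on systems where that default is not UTF-8, so spelling it out is the safer choice.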
