python - Using numpy.genfromtxt to read a csv file with strings containing commas

Question

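I am trying to read in a csv file with numpy.genfromtxt, but some of the fields are strings that contain commas. The strings are in quotes, but numpy does not recognize the quotes as delimiting a single string. For example, with the data in 't.csv':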

2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0

the code

np.genfromtxt('t.csv', delimiter=',')

produces the error:

ValueError: Some errors were detected !
    Line #2 (got 4 columns instead of 3)

The data structure I am looking for is:

array([['2012', 'Louisville KY', '3.5'],
       ['2011', 'Lexington, KY', '4.0']], 
      dtype='|S13')

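Looking at the documentation, I don't see any option to deal with this. Is there a way to do it with numpy, or do I just need to read in the data with the csv module and then convert it to a numpy array?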

score 27 · Accepted Answer

You can use pandas (which is becoming the default library for working with dataframes, i.e. heterogeneous data, in scientific Python) for this. Its read_csv can handle this. From the docs:

quotechar : string

The character to used to denote the start and end of a quoted item. Quoted items 
can include the delimiter and it will be ignored.

The default value is ". An example:

In [1]: import pandas as pd

In [2]: from io import StringIO

In [3]: s="""year, city, value
   ...: 2012, "Louisville KY", 3.5
   ...: 2011, "Lexington, KY", 4.0"""

In [4]: pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)
Out[4]:
   year           city  value
0  2012  Louisville KY    3.5
1  2011  Lexington, KY    4.0

The trick here is that you also have to use skipinitialspace=True to deal with the spaces after the comma-delimiter.

Apart from its powerful csv reader, I would also strongly advise using pandas for the heterogeneous data you have (the example numpy output you give is all strings, although you could use structured arrays).
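If you do want a numpy object that keeps one type per column, a minimal sketch along these lines may help. It assumes the same sample text as above (with a header row) and uses DataFrame.to_records to get a record array; the variable names are only illustrative.

```python
from io import StringIO

import pandas as pd

# Same sample as in the session above, including the header row.
s = """year, city, value
2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0"""

# skipinitialspace=True drops the blank after each comma, so the
# opening quote is seen at the start of the field.
df = pd.read_csv(StringIO(s), skipinitialspace=True)

# Convert to a numpy record array; index=False leaves out the row index.
records = df.to_records(index=False)

print(records.dtype.names)
print(records["city"])
```

Each column can then be accessed by name, for example records["value"], while the city strings keep their embedded commas.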

score 14 · Accepted Answer

As for the problem with the additional comma: np.genfromtxt does not deal with that.

One simple solution is to read the file into a list with csv.reader() from Python's csv module and then dump it into a numpy array if you like.
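A minimal sketch of that approach, assuming the two sample rows from the question. An io.StringIO object stands in for the opened file here so the snippet runs on its own; with a real file you would pass open('t.csv', newline='') instead. The skipinitialspace=True option is needed for this particular data because each quote follows a space after the comma.

```python
import csv
import io

import numpy as np

# Stand-in for open('t.csv', newline=''); contents match the question's sample.
sample = io.StringIO(
    '2012, "Louisville KY", 3.5\n'
    '2011, "Lexington, KY", 4.0\n'
)

# skipinitialspace=True makes the reader ignore the blank after each comma,
# so the quote is recognized as opening a quoted field.
rows = list(csv.reader(sample, skipinitialspace=True))

# All fields are strings, so this becomes a 2-D array of strings.
arr = np.array(rows)
print(arr)
```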

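If you really want to use np.genfromtxt, note that it can take iterators instead of files, e.g. np.genfromtxt(my_iterator, ...). So you can wrap a csv.reader in an iterator and give that to np.genfromtxt.

That would go something like this: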

import csv
import numpy as np

np.genfromtxt(("\t".join(i) for i in csv.reader(open('myfile.csv'))), delimiter="\t")

This essentially just replaces the appropriate commas with tabs on the fly.
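For the sample data in the question, two extra options seem to be needed, so here is a hedged variant of the one-liner: skipinitialspace=True for csv.reader (the quotes follow a space) and dtype=str for np.genfromtxt (otherwise the city column would be parsed as a float and come back as nan). An io.StringIO object again stands in for the file.

```python
import csv
import io

import numpy as np

# Stand-in for open('t.csv', newline=''); contents match the question's sample.
sample = io.StringIO(
    '2012, "Louisville KY", 3.5\n'
    '2011, "Lexington, KY", 4.0\n'
)

# Let the csv module do the quote-aware splitting ...
reader = csv.reader(sample, skipinitialspace=True)

# ... then re-join each row with tabs, which do not occur in the data.
tab_lines = ("\t".join(row) for row in reader)

# dtype=str keeps every field as text instead of converting to float.
arr = np.genfromtxt(tab_lines, delimiter="\t", dtype=str)
print(arr)
```

This only works as long as no field contains a tab itself; if that can happen, pick another separator that is absent from the data.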

score 5 · Accepted Answer

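If you are using numpy, you probably want to work with a numpy.ndarray. This will give you a numpy.ndarray:

import pandas
# to_numpy() replaces as_matrix(), which was removed in pandas 1.0
data = pandas.read_csv('file.csv').to_numpy()

Pandas will handle the "Lexington, KY" case correctly.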
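Applied to the question's file, which has no header row and a space after each comma, a sketch could look like the following. The header=None and skipinitialspace=True arguments are my assumptions about that particular file, and astype(str) is only there to reproduce the all-strings array the question asks for.

```python
from io import StringIO

import pandas as pd

# Stand-in for 't.csv'; contents match the question's sample (no header row).
sample = StringIO(
    '2012, "Louisville KY", 3.5\n'
    '2011, "Lexington, KY", 4.0\n'
)

# header=None stops pandas from treating the first data row as column names.
df = pd.read_csv(sample, header=None, skipinitialspace=True)

# Convert every column to text, then to a 2-D numpy array.
arr = df.astype(str).to_numpy()
print(arr)
```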

score 2 · Accepted Answer

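Combine the power of the standard csv module with NumPy's recfromcsv to make a better function. For instance, the csv module gives good control over dialects, quotes, escape characters and so on, which you can add to the example below.

The example recfromcsv_mod function below reads a complicated CSV file, similar to what Microsoft Excel produces, which may contain commas within quoted fields. Internally, the function has a generator function that rewrites each row with tab delimiters.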

import csv
import numpy as np

def recfromcsv_mod(fname, **kwargs):
    def rewrite_csv_as_tab(fname):
        with open(fname, newline='') as fp:
            dialect = csv.Sniffer().sniff(fp.read(1024))
            fp.seek(0)
            for row in csv.reader(fp, dialect):
                yield "\t".join(row)
    return np.recfromcsv(
        rewrite_csv_as_tab(fname), delimiter="\t", encoding=None, **kwargs)

# Use it to read a CSV file into a record array
x = recfromcsv_mod("t.csv", case_sensitive=True)

score 0 · Accepted Answer

You can try this code. We read the .csv file with the np.genfromtxt() method. Code:

import numpy as np

myfile = np.genfromtxt('MyData.csv', delimiter=',')
myfile = myfile.astype('int64')
print(myfile)

Output:

[[ 1  1  1  1  1  1  1  1  1  1  1]
 [ 3  3  3  3  3  3  3  3  3  3  3]
 [ 3  3  3  3  3  3  3  3  3  3  3]
 [ 4  4  4  4  4  4  4  4  4  4  4]
 [ 5  5  5  5  5  5  5  5  5  5  5]
 [ 6  6  6  6  6  6  6  6  6  6  6]
 [ 7  7  7  7  7  7  7  7  7  7  7]
 [ 8  8  8  8  8  8  8  8  8  8  8]
 [ 9  9  9  9  9  9  9  9  9  9  9]
 [10 10 10 10 10 10 10 10 10 10 10]
 [11 11 11 11 11 11 11 11 11 11 11]
 [12 12 12 12 12 12 12 12 12 12 12]
 [13 13 13 13 13 13 13 13 13 13 13]
 [14 14 14 14 14 14 14 14 14 14 14]
 [15 15 15 15 15 15 15 15 15 15 15]
 [16 17 18 19 20 21 22 23 24 25 26]]

Input file: "MyData.csv"

Hope this helps.
