python - How to use numpy.genfromtxt when the first column is a string and the remaining columns are numbers?

Question

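Basically, I have a bunch of data where the first column is a string (a label) and the remaining columns are numeric values. I run the following: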

data = numpy.genfromtxt('data.txt', delimiter = ',')

This reads most of the data fine, but the label column just gets "nan". How can I deal with this?

score 74 · Accepted Answer

By default, np.genfromtxt uses dtype=float: that is why your string column is converted to NaNs, because, after all, the strings are not numbers...
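As a minimal sketch of that default behaviour (Python 3, with an in-memory string standing in for data.txt, which is an assumption made only for illustration), the label column comes back as NaN while the numeric columns survive:

```python
from io import StringIO

import numpy as np

# Two rows: a string label followed by two numbers.
text = "a,1,2\nb,3,4"

# No dtype given, so every field is converted to float.
data = np.genfromtxt(StringIO(text), delimiter=",")

# The labels cannot be parsed as floats and are replaced by NaN.
labels_all_nan = bool(np.isnan(data[:, 0]).all())

print(data)
print(labels_all_nan)
```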

You can ask np.genfromtxt to try to guess the actual type of each column by using dtype=None:

>>> import numpy as np
>>> from StringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO
>>> test = "a,1,2\nb,3,4"
>>> a = np.genfromtxt(StringIO(test), delimiter=",", dtype=None)
>>> a
array([('a', 1, 2), ('b', 3, 4)],
      dtype=[('f0', '|S1'), ('f1', '<i8'), ('f2', '<i8')])

You can access the columns by their names, like a['f0']...
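For readers on Python 3, a rough equivalent might look like the sketch below. The encoding="utf-8" argument is an assumption added here so that the label column comes back as text rather than bytes; the field names f0, f1 and f2 are the defaults genfromtxt assigns.

```python
from io import StringIO

import numpy as np

text = "a,1,2\nb,3,4"

# dtype=None asks genfromtxt to guess a type for each column.
a = np.genfromtxt(StringIO(text), delimiter=",", dtype=None, encoding="utf-8")

# Each column of the structured array is reachable by its field name.
labels = a["f0"]
first_numbers = a["f1"]

print(a.dtype.names)
print(labels, first_numbers)
```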

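Using dtype=None is a good trick if you don't know what your columns should be. If you already know what types they should have, you can give an explicit dtype. For example, in our test, we know that the first column is a string, the second an int, and we want the third to be a float. We would then use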

>>> np.genfromtxt(StringIO(test), delimiter=",", dtype=("|S10", int, float))
array([('a', 1, 2.0), ('b', 3, 4.0)], 
      dtype=[('f0', '|S10'), ('f1', '<i8'), ('f2', '<f8')])

Using an explicit dtype is much more efficient than using dtype=None and is the recommended way.

In both cases (dtype=None or an explicit, non-homogeneous dtype), you end up with a structured array.
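A Python 3 sketch of the explicit-dtype route follows. The field names label, count and value are hypothetical and chosen only for this illustration, and encoding="utf-8" is an assumption so that the labels are read as text. The last step shows one way to get from the structured array back to a plain 2-D float array.

```python
from io import StringIO

import numpy as np

text = "a,1,2\nb,3,4"

# Explicit, non-homogeneous dtype: string label, int, float.
# The field names are illustrative, not taken from the original answer.
dtype = [("label", "U10"), ("count", int), ("value", float)]
a = np.genfromtxt(StringIO(text), delimiter=",", dtype=dtype, encoding="utf-8")

# The result is a structured array; fields are accessed by name.
labels = a["label"]

# Stack the numeric fields into an ordinary 2-D array when needed.
numeric = np.column_stack([a["count"], a["value"]])

print(labels)
print(numeric)
```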

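[Note: With dtype=None, the input is parsed a second time and the type of each column is updated to match the largest type possible: first we try a bool, then an int, then a float, then a complex, and we keep a string if all else fails. The implementation is rather clunky, actually. There have been some attempts to make the type guessing more efficient (using regular expressions), but nothing has stuck so far.]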
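The upgrade order described in the note can be observed with a small sketch. The sample text is made up for illustration: the second column mixes an integer with a decimal, so it should end up as float, and the last column cannot be parsed as a number at all. encoding="utf-8" is again an assumption so that strings come back as text.

```python
from io import StringIO

import numpy as np

# Columns: all ints, int then decimal, all decimals, plain text.
text = "1,1,1.5,x\n2,2.5,2.5,y"

a = np.genfromtxt(StringIO(text), delimiter=",", dtype=None, encoding="utf-8")

# dtype "kind" codes: 'i' integer, 'f' float, 'U' Unicode string.
kinds = [a.dtype[name].kind for name in a.dtype.names]

print(kinds)
```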

score 41 · Accepted Answer

If your data file is structured like this

col1, col2, col3
   1,    2,    3
  10,   20,   30
 100,  200,  300

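then you can use the names=True option of numpy.genfromtxt to interpret the first line as column headers. With this you can access the data very conveniently by providing the column header: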

import numpy as np

data = np.genfromtxt('data.txt', delimiter=',', names=True)
print(data['col1'])    # array([   1.,   10.,  100.])
print(data['col2'])    # array([   2.,   20.,  200.])
print(data['col3'])    # array([   3.,   30.,  300.])
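The same idea can be tried without a file on disk. In this sketch an in-memory string stands in for data.txt (an assumption for illustration only), using the header and rows shown above:

```python
from io import StringIO

import numpy as np

text = (
    "col1, col2, col3\n"
    "   1,    2,    3\n"
    "  10,   20,   30\n"
    " 100,  200,  300\n"
)

# names=True takes the field names from the first line.
data = np.genfromtxt(StringIO(text), delimiter=",", names=True)

print(data.dtype.names)
print(data["col2"])
```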

Since in your case the data is laid out like this

row1,   1,  10, 100
row2,   2,  20, 200
row3,   3,  30, 300

you can achieve something similar using the following code snippet:

labels = np.genfromtxt('data.txt', delimiter=',', usecols=0, dtype=str)
raw_data = np.genfromtxt('data.txt', delimiter=',')[:,1:]
data = {label: row for label, row in zip(labels, raw_data)}

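The first line reads the first column (the labels) into an array of strings. The second line reads all the data from the file but discards the first column. The third line uses a dictionary comprehension to create a dictionary that can be used very much like the structured array that numpy.genfromtxt creates with the names=True option: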

print(data['row1'])    # array([   1.,   10.,  100.])
print(data['row2'])    # array([   2.,   20.,  200.])
print(data['row3'])    # array([   3.,   30.,  300.])
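A self-contained Python 3 sketch of the same two-pass approach is given below. The in-memory string replaces data.txt, and encoding="utf-8" plus the .tolist() call are assumptions added so that the dictionary keys are ordinary Python strings.

```python
from io import StringIO

import numpy as np

text = (
    "row1,   1,  10, 100\n"
    "row2,   2,  20, 200\n"
    "row3,   3,  30, 300\n"
)

# Pass 1: only the first column, read as strings.
labels = np.genfromtxt(StringIO(text), delimiter=",", usecols=0,
                       dtype=str, encoding="utf-8")

# Pass 2: everything as float, then drop the NaN label column.
values = np.genfromtxt(StringIO(text), delimiter=",")[:, 1:]

# Map each label to its row of numbers.
data = dict(zip(labels.tolist(), values))

print(data["row2"])
```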

score 10 · Accepted Answer

data=np.genfromtxt(csv_file, delimiter=',', dtype='unicode')

This works for me.

Answered 2017-06-02T06:40:00.507
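Reading everything as strings leaves the numbers as text, so a conversion step is usually still needed. The sketch below is one way to do that on Python 3; it passes dtype=str, which is assumed here to play the same role as 'unicode' in the answer, and the sample text is made up for illustration.

```python
from io import StringIO

import numpy as np

text = "a,1,2\nb,3,4"

# Every field is kept as a string, so nothing turns into NaN.
table = np.genfromtxt(StringIO(text), delimiter=",", dtype=str, encoding="utf-8")

# Split labels from numbers and convert the numeric part explicitly.
labels = table[:, 0]
values = table[:, 1:].astype(float)

print(labels)
print(values)
```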

score 2 · Accepted Answer

For a data set in this format:

CONFIG000   1080.65 1080.87 1068.76 1083.52 1084.96 1080.31 1081.75 1079.98
CONFIG001   414.6   421.76  418.93  415.53  415.23  416.12  420.54  415.42
CONFIG010   1091.43 1079.2  1086.61 1086.58 1091.14 1080.58 1076.64 1083.67
CONFIG011   391.31  392.96  391.24  392.21  391.94  392.18  391.96  391.66
CONFIG100   1067.08 1062.1  1061.02 1068.24 1066.74 1052.38 1062.31 1064.28
CONFIG101   371.63  378.36  370.36  371.74  370.67  376.24  378.15  371.56
CONFIG110   1060.88 1072.13 1076.01 1069.52 1069.04 1068.72 1064.79 1066.66
CONFIG111   350.08  350.69  352.1   350.19  352.28  353.46  351.83  350.94

this code works for my application:

import numpy

def ShowData(data, names):
    # Print each row label followed by that row's values, one per line.
    for name, row in zip(names, data):
        print(name + ": ")
        for value in row:
            print(value)
        print("")

def Main():
    print("The sample data is: ")
    fname = 'ANOVA.csv'
    # First pass: read everything as strings to get the shape and the labels.
    csv = numpy.genfromtxt(fname, dtype=str, delimiter=",")
    num_rows = csv.shape[0]
    num_cols = csv.shape[1]
    names = csv[:,0]
    # Second pass: read only the numeric columns (all but the first) as floats.
    data = numpy.genfromtxt(fname, usecols = range(1,num_cols), delimiter=",")
    print(names)
    print(str(num_rows) + "x" + str(num_cols))
    print(data)
    ShowData(data, names)

Main()

Python 2 output:

The sample data is:
['CONFIG000' 'CONFIG001' 'CONFIG010' 'CONFIG011' 'CONFIG100' 'CONFIG101'
 'CONFIG110' 'CONFIG111']
8x9
[[ 1080.65  1080.87  1068.76  1083.52  1084.96  1080.31  1081.75  1079.98]
 [  414.6    421.76   418.93   415.53   415.23   416.12   420.54   415.42]
 [ 1091.43  1079.2   1086.61  1086.58  1091.14  1080.58  1076.64  1083.67]
 [  391.31   392.96   391.24   392.21   391.94   392.18   391.96   391.66]
 [ 1067.08  1062.1   1061.02  1068.24  1066.74  1052.38  1062.31  1064.28]
 [  371.63   378.36   370.36   371.74   370.67   376.24   378.15   371.56]
 [ 1060.88  1072.13  1076.01  1069.52  1069.04  1068.72  1064.79  1066.66]
 [  350.08   350.69   352.1    350.19   352.28   353.46   351.83   350.94]]
CONFIG000:
1080.65
1080.87
1068.76
1083.52
1084.96
1080.31
1081.75
1079.98

CONFIG001:
414.6
421.76
418.93
415.53
415.23
416.12
420.54
415.42

CONFIG010:
1091.43
1079.2
1086.61
1086.58
1091.14
1080.58
1076.64
1083.67

CONFIG011:
391.31
392.96
391.24
392.21
391.94
392.18
391.96
391.66

CONFIG100:
1067.08
1062.1
1061.02
1068.24
1066.74
1052.38
1062.31
1064.28

CONFIG101:
371.63
378.36
370.36
371.74
370.67
376.24
378.15
371.56

CONFIG110:
1060.88
1072.13
1076.01
1069.52
1069.04
1068.72
1064.79
1066.66

CONFIG111:
350.08
350.69
352.1
350.19
352.28
353.46
351.83
350.94
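The sample in this answer is displayed with whitespace between fields, while the code passes delimiter=",". Assuming a file that really is whitespace-separated, a sketch of the same two-pass idea would simply omit the delimiter, since genfromtxt then splits on any run of whitespace. The two rows below are copied from the sample; the in-memory string and encoding="utf-8" are assumptions for illustration.

```python
from io import StringIO

import numpy as np

text = (
    "CONFIG000   1080.65 1080.87 1068.76 1083.52 1084.96 1080.31 1081.75 1079.98\n"
    "CONFIG001   414.6   421.76  418.93  415.53  415.23  416.12  420.54  415.42\n"
)

# Labels: first column only, as strings; no delimiter means "any whitespace".
names = np.genfromtxt(StringIO(text), usecols=0, dtype=str, encoding="utf-8")

# Numbers: columns 1..8 as floats.
data = np.genfromtxt(StringIO(text), usecols=range(1, 9))

print(names)
print(data.shape)
```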

score 0 · Accepted Answer

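You can use numpy.recfromcsv(filename): the type of each column will be determined automatically (as if you had used np.genfromtxt() with dtype=None), and by default delimiter=",". It is basically a shortcut for the np.genfromtxt(filename, delimiter=",", dtype=None) call that Pierre GM pointed to in his answer.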

score 0 · Accepted Answer

Here is a working example from start to finish:

Suppose I want to import the numbers from a file while skipping its first line:

 I like trains #this is the first line, a string

1 \t 2 \t 3   #\t signifies that the delimiter (separator) is a tab, not a comma

4 \t 5 \t 6

then running the following code:

import numpy as np              # provides genfromtxt
import matplotlib.pyplot as plt # enables plots
from pathlib import Path        # convenient when many files share the same folder
path = r'some_path'             # location of the file on your computer, e.g. r'C:my comp\folder\folder2'; the r prefix makes it a raw string, so backslashes in the Windows path are taken literally
fileNames = [r'\I like trains.txt',
             r'\die potato.txt']

data=np.genfromtxt(path + fileNames[0], delimiter='\t', skip_header=1)

produces this result:

data = [1 2 3
        4 5 6]

where each number has its own cell and can be accessed individually.
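The same call can be checked without touching the file system. In this sketch an in-memory string stands in for the tab-separated file (an assumption for illustration), and individual cells are read with ordinary row and column indexing:

```python
from io import StringIO

import numpy as np

# First line is text, the rest is tab-separated numbers.
text = "I like trains\n1\t2\t3\n4\t5\t6\n"

# skip_header=1 drops the text line before parsing.
data = np.genfromtxt(StringIO(text), delimiter="\t", skip_header=1)

# Individual cells: row index first, then column index.
last_cell = data[1, 2]

print(data)
print(last_cell)
```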
