python - the dtype parameter in numpy genfromtxt

Question

I am trying to create an Mx2 NumPy matrix or array from the following file contents:

shell: head WORLD#America.csv
"2013-04-17 12","3","WORLD","#America"
"2013-04-17 13","9","WORLD","#America"
"2013-04-17 14","4","WORLD","#America"
"2013-04-17 15","3","WORLD","#America"
"2013-04-17 16","7","WORLD","#America"
"2013-04-17 17","8","WORLD","#America"
"2013-04-17 18","6","WORLD","#America"
"2013-04-17 19","6","WORLD","#America"
"2013-04-17 20","6","WORLD","#America"
"2013-04-17 21","2","WORLD","#America"

I came across the genfromtxt() function but have not managed to extract my data with it. With a file called f, I first tried ts = genfromtxt(f, delimiter=",") and got an array filled entirely with nan. Since that was only a first attempt, I read the documentation for the dtype parameter, which specifies the data type of the resulting array. It appears that to get an Mx2 matrix with entries of the form (datetime, int) I need dtype=[('f1', datetime64), ('f2', uint)]. When I did this, the following was assigned to the variable ts:

(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L)],
dtype=[('f1', ('<M8[us]', {})), ('f2', '<u8')])

Every value in the matrix is the same constant. Why was my file not read? This is obviously not the output I should get.

How do I get the desired Mx2 matrix or array, with the first column being the datetime and the second column being the integer, as shown in the head output?

score 0 · Accepted Answer

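As pointed out in the comments, one difficulty in reading this file with genfromtxt is the presence of the quote characters. It is probably best to simply remove the quotes (programmatically), but the problem can also be worked around by specifying the quote character as the delimiter: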

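import numpy as np

# comments=None stops genfromtxt from treating '#' in "#America" as a comment marker
np.genfromtxt(filename, delimiter='"', dtype=str, comments=None)[0]
# array(['', '2013-04-17 12', ',', '3', ',', 'WORLD', ',', '#America', ''],
#       dtype='|S13')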

The file is now interpreted as having 9 columns, of which the second and fourth contain the data of interest.
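
The simpler route mentioned above, removing the quotes before parsing, could be sketched as follows. This is only an illustration: three rows copied from the question are placed in an in-memory buffer as a stand-in for the real file, and the names sample, unquoted, raw, when and counts are made up for the example.

```python
import io
import numpy as np

# Stand-in for the real file: three rows copied from the question.
sample = io.StringIO(
    '"2013-04-17 12","3","WORLD","#America"\n'
    '"2013-04-17 13","9","WORLD","#America"\n'
    '"2013-04-17 14","4","WORLD","#America"\n'
)

# Drop every quote character so that a plain comma delimiter works.
unquoted = [line.replace('"', '') for line in sample]

# Read the first two columns as text; comments=None keeps '#' from
# being treated as the start of a comment.
raw = np.genfromtxt(unquoted, delimiter=",", dtype=str,
                    usecols=(0, 1), comments=None, encoding="utf-8")

# Convert each column to its final type.
when = raw[:, 0].astype("M8[h]")
counts = raw[:, 1].astype(np.uint64)

print(when)
print(counts)
```

Reading the columns as strings first and converting afterwards keeps the parsing step independent of how genfromtxt handles datetime dtypes.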

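Another problem is specifying the dtype for the datetime column. In newer (?) versions of NumPy you have to specify the time/date unit, or genfromtxt raises an error. In this case you apparently need to use M8[h] as the dtype, which specifies a unit of hours.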
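
To illustrate what the unit in M8[h] means, here is a small sketch (not part of the original answer) that parses one timestamp from the question at hour resolution. The variable names are made up for the example.

```python
import numpy as np

# 'M8[h]' is shorthand for datetime64 with a resolution of one hour.
hour_dtype = np.dtype("M8[h]")
print(hour_dtype)

# A timestamp from the question, parsed at hour resolution.
stamp = np.array(["2013-04-17 12"], dtype=hour_dtype)[0]
print(stamp)

# Arithmetic on the value happens in whole hours.
next_hour = stamp + np.timedelta64(1, "h")
print(next_hour)
```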

Putting it all together, I was able to load the file with:

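ts = np.genfromtxt(filename,
                   delimiter='"',          # split on the quote character
                   dtype='M8[h], uint',    # hour-resolution datetime, unsigned int
                   usecols=[1,3])          # columns holding the timestamp and the count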

Alternatively, you might consider using converters, or try the CSV reader in Pandas.
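
A converter-based variant might look like the sketch below. It is an illustration rather than a recipe checked against every NumPy version: parse_hour and the field names when and count are hypothetical, the sample rows stand in for the real file, and encoding="utf-8" is passed on the assumption that the converter should receive str rather than bytes.

```python
import io
import numpy as np


def parse_hour(text):
    """Turn a string such as '2013-04-17 12' into an hour-resolution datetime64."""
    return np.datetime64(text, "h")


# Stand-in for the real file: three rows copied from the question.
sample = io.StringIO(
    '"2013-04-17 12","3","WORLD","#America"\n'
    '"2013-04-17 13","9","WORLD","#America"\n'
    '"2013-04-17 14","4","WORLD","#America"\n'
)

ts = np.genfromtxt(
    sample,
    delimiter='"',                       # same quote-as-delimiter trick as above
    usecols=(1, 3),                      # timestamp and count columns
    dtype=[("when", "M8[h]"), ("count", "u8")],
    converters={1: parse_hour},          # keyed by the original column index
    encoding="utf-8",
)

print(ts["when"])
print(ts["count"])
```

The result is a one-dimensional structured array, so each column is reached by field name rather than by a second index.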
