I need to write a program that collects different datasets and merges them. For this I have to read in a comma-separated matrix: each row represents an instance (in this case, a protein) and each column represents an attribute of the instances. If an instance has an attribute, it is represented by a 1, otherwise by a 0. The matrix looks like the example given below, but is much larger, with 35000 instances and hundreds of attributes.

Proteins,Attribute 1,Attribute 2,Attribute 3,Attribute 4
Protein 1,1,1,1,0
Protein 2,0,1,0,1
Protein 3,1,0,0,0
Protein 4,1,1,1,0
Protein 5,0,0,0,0
Protein 6,1,1,1,1

I need a way to store the matrix before writing it into a new file together with other information about the instances. I thought of using numpy arrays, since I would like to be able to select and check single columns. I tried to use numpy.empty to create an array of the given size, but it seems that you have to preselect the length of the strings and cannot change it afterwards.

Is there a better way to deal with such data? I also thought of dictionaries of lists, but then I cannot select single columns.


3 Answers



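pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.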
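As a minimal sketch of how this could look for the matrix in the question: the sample text is copied from the question, while the `Source` column and its value are purely hypothetical stand-ins for the "other information" to be added. With a real file, a path would be passed to `read_csv` instead of the `StringIO` object.

```python
import io

import pandas as pd

# Sample data copied from the question; a file path works the same way.
csv_text = """Proteins,Attribute 1,Attribute 2,Attribute 3,Attribute 4
Protein 1,1,1,1,0
Protein 2,0,1,0,1
Protein 3,1,0,0,0
Protein 4,1,1,1,0
Protein 5,0,0,0,0
Protein 6,1,1,1,1
"""

# Use the protein names as the row index so rows stay labelled.
df = pd.read_csv(io.StringIO(csv_text), index_col="Proteins")

# Select a single column by its header name.
attribute_1 = df["Attribute 1"]

# Names of all proteins that have Attribute 1.
has_attribute_1 = df.index[attribute_1 == 1].tolist()

# Hypothetical extra information, added as a new column before writing out.
df["Source"] = "dataset A"
csv_out = df.to_csv()
```

Because the strings live in the index and in ordinary columns, there is no need to fix a string length in advance, which was the problem with `numpy.empty`.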

answered 2013-08-14T11:21:14.403


import numpy as np
a = np.loadtxt(filename, delimiter=',',usecols=(1,2,3,4),
               skiprows=1, dtype=float)


#array([[ 1.,  1.,  1.,  0.],
#       [ 0.,  1.,  0.,  1.],
#       [ 1.,  0.,  0.,  0.],
#       [ 1.,  1.,  1.,  0.],
#       [ 0.,  0.,  0.,  0.],
#       [ 1.,  1.,  1.,  1.]])

Alternatively, use structured arrays (`np.recarray`):

a = np.loadtxt('stack.txt', delimiter=',',usecols=(1,2,3,4),
        skiprows=1, dtype=[('Attribute 1', float),
                           ('Attribute 2', float),
                           ('Attribute 3', float),
                           ('Attribute 4', float)])


a['Attribute 1']
#array([ 1.,  0.,  1.,  1.,  0.,  1.])
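Both calls above skip the first column, so the protein names are lost. One possible way to keep them, sketched here under the assumption that the file is small enough to read twice, is to load the names and the 0/1 values into two separate arrays that share the same row order. The variable names and the use of `np.int8` are illustrative choices, not part of the original answer.

```python
import io

import numpy as np

# Sample data copied from the question; a real file would be opened instead.
text = """Proteins,Attribute 1,Attribute 2,Attribute 3,Attribute 4
Protein 1,1,1,1,0
Protein 2,0,1,0,1
Protein 3,1,0,0,0
Protein 4,1,1,1,0
Protein 5,0,0,0,0
Protein 6,1,1,1,1
"""

# Read the header once to learn the column names and count.
header = text.splitlines()[0].split(",")
n_columns = len(header)

# First column: protein names as strings.
names = np.loadtxt(io.StringIO(text), delimiter=",", usecols=0,
                   skiprows=1, dtype=str)

# Remaining columns: the 0/1 matrix, stored compactly as 8-bit integers.
values = np.loadtxt(io.StringIO(text), delimiter=",",
                    usecols=tuple(range(1, n_columns)),
                    skiprows=1, dtype=np.int8)

# Select one column by header name (offset by 1 for the name column).
column_index = header.index("Attribute 2") - 1
attribute_2 = values[:, column_index]

# Names of the proteins that have this attribute.
proteins_with_attribute_2 = names[attribute_2 == 1]
```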
answered 2013-08-14T11:24:31.427


import numpy as np
data = np.genfromtxt('file.txt', delimiter=',', names=True, dtype=None)
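For the comma-separated file in the question, `genfromtxt` also needs a delimiter, and `names=True` lets it take field names from the header row. The sketch below assumes those settings plus an explicit `encoding` so that names are read as `str`. Note that `genfromtxt` sanitizes field names by default, so `Attribute 1` becomes `Attribute_1`.

```python
import io

import numpy as np

# Sample data copied from the question; a file name can be passed instead.
text = """Proteins,Attribute 1,Attribute 2,Attribute 3,Attribute 4
Protein 1,1,1,1,0
Protein 2,0,1,0,1
Protein 3,1,0,0,0
Protein 4,1,1,1,0
Protein 5,0,0,0,0
Protein 6,1,1,1,1
"""

# dtype=None lets genfromtxt guess a type per column (string for the names,
# integer for the 0/1 attributes); names=True reads the header row.
data = np.genfromtxt(io.StringIO(text), delimiter=",", names=True,
                     dtype=None, encoding="utf-8")

# Spaces in the header are replaced by underscores in the field names.
field_names = data.dtype.names

# Columns are selected by field name, including the string column.
attribute_1 = data["Attribute_1"]
proteins = data["Proteins"]
```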


answered 2013-08-14T13:44:03.420