I need to write a program that collects different datasets and merges them. For this I have to read in a comma-separated matrix in which each row represents an instance (here, a protein) and each column represents an attribute of the instances. If an instance has an attribute, it is represented by a 1, otherwise by a 0. The matrix looks like the example below, but is much larger, with 35000 instances and hundreds of attributes.

Proteins,Attribute 1,Attribute 2,Attribute 3,Attribute 4
Protein 1,1,1,1,0
Protein 2,0,1,0,1
Protein 3,1,0,0,0
Protein 4,1,1,1,0
Protein 5,0,0,0,0
Protein 6,1,1,1,1

I need a way to store the matrix before writing it to a new file together with other information about the instances. I thought of using numpy arrays, since I would like to be able to select and check single columns. I tried numpy.empty to create an array of the given size, but it seems that you have to fix the length of the strings in advance and cannot change it afterwards.

Is there a better way to deal with such data? I also thought of a dictionary of lists, but then I cannot select single columns.

3 Answers

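Take a look at pandas.

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.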
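As a minimal sketch of how that could look for the matrix in the question (the in-memory `csv_text` and the variable names are only assumptions for illustration; a real script would pass the file path to `read_csv`):

```python
import io
import pandas as pd

# Sample data copied from the question; a real run would pass a file path instead.
csv_text = """Proteins,Attribute 1,Attribute 2,Attribute 3,Attribute 4
Protein 1,1,1,1,0
Protein 2,0,1,0,1
Protein 3,1,0,0,0
Protein 4,1,1,1,0
Protein 5,0,0,0,0
Protein 6,1,1,1,1
"""

# index_col=0 turns the protein names into row labels, so no string
# length has to be chosen up front.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Select a single column by its header.
attr2 = df["Attribute 2"]

# Names of the proteins that have Attribute 2.
with_attr2 = df[df["Attribute 2"] == 1].index.tolist()
print(with_attr2)
```

Once other information has been added as extra columns, `df.to_csv` can write the combined table to a new file.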

answered 2013-08-14T11:21:14.403

You can use numpy.loadtxt, for example:

import numpy as np
a = np.loadtxt(filename, delimiter=',',usecols=(1,2,3,4),
               skiprows=1, dtype=float)

This will result in something like:

#array([[ 1.,  1.,  1.,  0.],
#       [ 0.,  1.,  0.,  1.],
#       [ 1.,  0.,  0.,  0.],
#       [ 1.,  1.,  1.,  0.],
#       [ 0.,  0.,  0.,  0.],
#       [ 1.,  1.,  1.,  1.]])

Alternatively, use structured arrays (`np.recarray`):

a = np.loadtxt('stack.txt', delimiter=',',usecols=(1,2,3,4),
        skiprows=1, dtype=[('Attribute 1', float),
                           ('Attribute 2', float),
                           ('Attribute 3', float),
                           ('Attribute 4', float)])

From there you can access each field, for example:

a['Attribute 1']
#array([ 1.,  0.,  1.,  1.,  0.,  1.])
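
Both calls above skip column 0, so the protein names are lost. One possible way to keep them, sketched here under the assumption that a second pass over the file is acceptable, is to read the names into a separate string array and use a boolean mask to connect the two (`csv_text`, `names` and `values` are illustrative names, not part of the original answer):

```python
import io
import numpy as np

# Sample data copied from the question; a real run would pass a file path instead.
csv_text = """Proteins,Attribute 1,Attribute 2,Attribute 3,Attribute 4
Protein 1,1,1,1,0
Protein 2,0,1,0,1
Protein 3,1,0,0,0
Protein 4,1,1,1,0
Protein 5,0,0,0,0
Protein 6,1,1,1,1
"""

# Column 0 holds the protein names; NumPy picks a string width that fits them.
names = np.loadtxt(io.StringIO(csv_text), delimiter=",", usecols=0,
                   skiprows=1, dtype=str)

# The remaining columns hold the 0/1 flags.
values = np.loadtxt(io.StringIO(csv_text), delimiter=",", usecols=(1, 2, 3, 4),
                    skiprows=1, dtype=int)

# Boolean mask over the rows: which proteins have Attribute 1?
has_attr1 = values[:, 0] == 1
proteins_with_attr1 = names[has_attr1].tolist()
print(proteins_with_attr1)
```

Because the two arrays share the same row order, any mask built from `values` can be used to index `names`.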
answered 2013-08-14T11:24:31.427

You can use genfromtxt instead:

import numpy as np
data = np.genfromtxt('file.txt', delimiter=',', names=True, dtype=None)  # comma-separated, first row gives field names

This will create a structured array (also known as a record array) for your table.
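
To make the result concrete, here is a small sketch that runs genfromtxt on the sample from the question. The in-memory `csv_text` and the arguments `delimiter=","`, `names=True` and `encoding="utf-8"` are assumptions chosen to fit a comma-separated file with a header row. Note that genfromtxt replaces the spaces in header names with underscores by default:

```python
import io
import numpy as np

# Sample data copied from the question; a real run would pass a file path instead.
csv_text = """Proteins,Attribute 1,Attribute 2,Attribute 3,Attribute 4
Protein 1,1,1,1,0
Protein 2,0,1,0,1
Protein 3,1,0,0,0
Protein 4,1,1,1,0
Protein 5,0,0,0,0
Protein 6,1,1,1,1
"""

# names=True takes the field names from the header row;
# dtype=None lets genfromtxt infer a type for each column.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True,
                     dtype=None, encoding="utf-8")

# Field names after genfromtxt's default clean-up of the header.
print(data.dtype.names)

# Columns are selected by field name; rows keep the protein name with the flags.
attr1 = data["Attribute_1"]
proteins = data["Proteins"]
```

Here the names and the flags live in one array, at the cost of addressing columns by field name rather than by position.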

answered 2013-08-14T13:44:03.420