python - Selecting records in a Python structure based on a related structure

Question

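In my real problem I will have two tables of information (x, y). x will have about 2.6 million records and y will have about 10,000 records; the two tables have a many-to-one (x->y) relationship. I want to subset x based on y.

The closest matching posts I found are this, that, and this. I settled on numpy arrays. I am open to other data structures; I just want to pick something that scales. Am I using an appropriate approach? Are there other posts that cover this? I would rather not use a database, since I only need to do this once.

The following code tries to illustrate what I am trying to do.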

import numpy, copy
x=numpy.array([(1,'a'), (1, 'b'), (3,'a'), (3, 'b'), (3, 'c'), (4, 'd')], dtype=[('id', int),('category', str, 22)]  )
y=numpy.array([('a', 3.2, 0), ('b', -1, 0), ('c', 0, 0), ('d', 100, 0)], dtype=[('category', str, 20), ('value', float), ('output', int)] )
for id, category in x:
    if y[y['category']==category]['value'][0] > 3:
        y[y['category']==category]['output']=numpy.array(copy.deepcopy(id))

score 3 · Accepted Answer

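You have to be careful when you index with a boolean array (y['category']==category) in order to modify the original array (y), because "fancy indexing" returns a copy (not a view), so modifying the copy does not change your original array y. If you do this as a direct assignment on a plain ndarray, it works fine (this confused me in the past). But with the structured array you are using, even as an assignment target it will not be a view if you apply the mask first and then index again with a field name. That sounds confusing, but the upshot is that it will not work the way you wrote it. Note that y is unchanged before and after: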

for i, category in x:
    c = y['category']==category   # generate the mask once
    if y[c]['value'][0] > 3:
        print('before:', y[c]['output'])
        y[c]['output'] = i        # writes into a temporary copy; y is untouched
        print('after:', y[c]['output'])

#output:
#before: [0]
#after: [0]
#before: [0]
#after: [0]
#before: [0]
#after: [0]

If you get a view with field access first and then apply fancy indexing to that view, you get a setitem call that works:

for i, category in x:
    c = y['category']==category   # generate the mask once
    if y[c]['value'][0] > 3:
        print('before:', y[c]['output'])
        y['output'][c] = i        # y['output'] is a view, so this updates y
        print('after:', y[c]['output'])

#output:
#before: [0]
#after: [1]
#before: [1]
#after: [3]
#before: [0]
#after: [4]
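
To isolate the difference between the two loops, here is a minimal sketch that performs the same write in both indexing orders on a single mask. The value 7 and the names lost and kept are made up for illustration; the array mirrors the y from the question.

```python
import numpy as np

y = np.array(
    [('a', 3.2, 0), ('b', -1, 0), ('c', 0, 0), ('d', 100, 0)],
    dtype=[('category', 'U20'), ('value', float), ('output', int)],
)
mask = y['category'] == 'a'

# Mask first, then field: y[mask] is a temporary copy, so the write is lost.
y[mask]['output'] = 7
lost = y['output'].copy()

# Field first, then mask: y['output'] is a view, so the write reaches y.
y['output'][mask] = 7
kept = y['output'].copy()

print(lost)  # [0 0 0 0]
print(kept)  # [7 0 0 0]
```

The rule of thumb this suggests: when assigning into a structured array, put the field name first and the mask last.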

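As you can see, I also removed your copy. i (or id, which I avoided because id is a built-in function) is just an integer and does not need to be copied. If you do need to copy something, it is better to use numpy's copy rather than the standard library copy, as in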

y[...]['output'] = np.array(id, copy=True)

or

y[...]['output'] = np.copy(id)

In fact, copy=True should be the default, so ... = np.array(id) is probably enough, but I am no authority on copying.
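
A quick way to check the copy behaviour yourself, as a small sketch (the names src, dup and alias are illustrative only): np.array copies its input by default, while np.asarray hands back the same object when the input is already a suitable ndarray.

```python
import numpy as np

src = np.array([1, 2, 3])

dup = np.array(src)      # copy=True is the default, so dup owns its own data
dup[0] = 99

alias = np.asarray(src)  # no copy is made for an ndarray of matching dtype

print(src[0])                      # 1
print(np.shares_memory(src, dup))  # False
print(alias is src)                # True
```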

score 2 · Accepted Answer

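You have 2.6 million records, each of which (potentially) overwrites one of 10,000 records. So there may be a lot of overwriting. Every time you write to the same location, all the work previously done for that location is wasted.

So you can be more efficient by looping over y (10K unique? categories) instead of looping over x (2.6 million records).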

import numpy as np
x = np.array([(1,'a'), (1, 'b'), (3,'a'), (3, 'b'), (3, 'c'), (4, 'd')], dtype=[('id', int),('category', str, 22)]  )
y = np.array([('a', 3.2, 0), ('b', -1, 0), ('c', 0, 0), ('d', 100, 0)], dtype=[('category', str, 20), ('value', float), ('output', int)] )

for idx in np.where(y['value'] > 3)[0]:
    row = y[idx]
    category = row['category']
    # Only the last record in `x` of the right category affects `y`.
    # So find the id value for that last record in `x`
    idval = x[x['category'] == category]['id'][-1]
    y[idx]['output'] = idval

print(y)

yields

[('a', 3.2, 3) ('b', -1.0, 0) ('c', 0.0, 0) ('d', 100.0, 4)]
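
If even the loop over y turns out to be too slow, the same "last record wins" rule can probably be expressed without a Python-level loop. The following is only a sketch under assumptions, not part of the answer above: it assumes categories are compared by exact string equality, and the helper names (cats, last_id, pos, found, update) are made up for illustration. It relies on np.unique returning the index of the first occurrence, so it is applied to the reversed array to get the last one.

```python
import numpy as np

x = np.array([(1, 'a'), (1, 'b'), (3, 'a'), (3, 'b'), (3, 'c'), (4, 'd')],
             dtype=[('id', int), ('category', 'U22')])
y = np.array([('a', 3.2, 0), ('b', -1, 0), ('c', 0, 0), ('d', 100, 0)],
             dtype=[('category', 'U20'), ('value', float), ('output', int)])

# np.unique reports the first occurrence of each value; on the reversed
# array that is the last occurrence in the original order.
cats, first_in_rev = np.unique(x['category'][::-1], return_index=True)
last_id = x['id'][::-1][first_in_rev]

# cats is sorted, so searchsorted locates each y category in it.
pos = np.searchsorted(cats, y['category'])
pos = np.clip(pos, 0, len(cats) - 1)   # guard categories sorting past the end
found = cats[pos] == y['category']     # False for categories absent from x

# Field first, then mask, so the assignment reaches y itself.
update = found & (y['value'] > 3)
y['output'][update] = last_id[pos[update]]

print(y['output'])  # [3 0 0 4]
```

The sort inside np.unique costs O(n log n) on x, but it runs once in compiled code, rather than scanning all of x once per category.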
