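Working with arrays of more than 2 million elements, I immediately noticed a big difference between Warren Weckesser's solution and Tonsic's solution (many thanks to both of you).
With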
first_array
[out]
array([(1633046400299000, 1.34707, 1.34748),
(1633046400309000, 1.347 , 1.34748),
(1633046400923000, 1.347 , 1.34749), ...,
(1635551693846000, 1.36931, 1.36958),
(1635551693954000, 1.36925, 1.36952),
(1635551697902000, 1.3692 , 1.36947)],
dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8')])
and
second_array
[out]
array([('2021-10-01T00:00:00.299000',), ('2021-10-01T00:00:00.309000',),
('2021-10-01T00:00:00.923000',), ...,
('2021-10-29T23:54:53.846000',), ('2021-10-29T23:54:53.954000',),
('2021-10-29T23:54:57.902000',)], dtype=[('date_time', '<M8[us]')])
I get
%timeit rfn.merge_arrays((first_array, second_array), flatten=True)
[out]
13.8 s ± 1.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
and
%timeit rfn.append_fields(first_array, 'date_time', second_array, dtypes='M8[us]').data
[out]
2.12 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
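Much better (note the .data at the end, which avoids getting the mask and fill_value of the masked array that append_fields returns by default).
Whereas using something like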
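import numpy as np

def building_new(first_array, other_array):
    # Allocate the target with the three original fields plus the new datetime field.
    new_array = np.zeros(
        first_array.size,
        dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8'), ('date_time', '<M8[us]')])
    # Copy the existing fields, then fill the new column.
    new_array[['timestamp', 'bid', 'ask']] = first_array[['timestamp', 'bid', 'ask']]
    new_array['date_time'] = other_array
    return new_array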
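(note that in a structured array each row is a single record, i.e. a tuple, so size gives the number of rows, which is exactly what is needed here)
I get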
%timeit building_new(first_array, second_array)
[out]
67.2 ms ± 3.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
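The remark about size can be checked on a tiny example. This is only a sketch: `sample` is a hypothetical three-row array with the same layout as first_array, not the real data.

```python
import numpy as np

# Hypothetical three-row sample with the same field layout as first_array.
sample = np.array(
    [(1633046400299000, 1.34707, 1.34748),
     (1633046400309000, 1.347, 1.34748),
     (1633046400923000, 1.347, 1.34749)],
    dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8')])

n_rows = sample.size                 # counts records, not individual field values
n_fields = len(sample.dtype.names)   # the fields live in the dtype, not in the shape
print(n_rows, sample.shape, n_fields)
```

So np.zeros(first_array.size, dtype=...) allocates one record per row of the source array.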
The output is the same for all three:
[out]
array([(1633046400299000, 1.34707, 1.34748, '2021-10-01T00:00:00.299000'),
(1633046400309000, 1.347 , 1.34748, '2021-10-01T00:00:00.309000'),
(1633046400923000, 1.347 , 1.34749, '2021-10-01T00:00:00.923000'),
...,
(1635551693846000, 1.36931, 1.36958, '2021-10-29T23:54:53.846000'),
(1635551693954000, 1.36925, 1.36952, '2021-10-29T23:54:53.954000'),
(1635551697902000, 1.3692 , 1.36947, '2021-10-29T23:54:57.902000')],
dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8'), ('date_time', '<M8[us]')])
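For anyone who wants to confirm the equivalence on their own data, here is a rough sketch of such a check. The three-row `first`, `dates` and `second` arrays are made up for illustration, and I pass the plain datetime column to append_fields and building_new to keep the example simple. The `usemask=False` call is shown as an alternative to `.data` that also returns a plain ndarray; I have not timed it.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical small inputs mirroring first_array and second_array.
first = np.array(
    [(1633046400299000, 1.34707, 1.34748),
     (1633046400309000, 1.347, 1.34748),
     (1633046400923000, 1.347, 1.34749)],
    dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8')])
dates = np.array(['2021-10-01T00:00:00.299000',
                  '2021-10-01T00:00:00.309000',
                  '2021-10-01T00:00:00.923000'], dtype='datetime64[us]')
second = np.zeros(dates.size, dtype=[('date_time', '<M8[us]')])
second['date_time'] = dates

def building_new(first_array, other_array):
    # Same approach as in the answer: allocate, copy old fields, fill the new one.
    new_array = np.zeros(
        first_array.size,
        dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8'),
               ('date_time', '<M8[us]')])
    new_array[['timestamp', 'bid', 'ask']] = first_array[['timestamp', 'bid', 'ask']]
    new_array['date_time'] = other_array
    return new_array

merged = rfn.merge_arrays((first, second), flatten=True)
appended = rfn.append_fields(first, 'date_time', dates, dtypes='M8[us]').data
# usemask=False asks append_fields for a plain ndarray instead of a masked array.
appended_nomask = rfn.append_fields(first, 'date_time', dates,
                                    dtypes='M8[us]', usemask=False)
built = building_new(first, second['date_time'])

# Compare field by field against the manually built array.
same = all(
    np.array_equal(result[name], built[name])
    for result in (merged, appended, appended_nomask)
    for name in built.dtype.names)
print(same)
```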
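A final thought:
when building the new array directly instead of using recfunctions, the second array does not even need to be a structured one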
third_array
[out]
array(['2021-10-01T00:00:00.299000', '2021-10-01T00:00:00.309000',
'2021-10-01T00:00:00.923000', ..., '2021-10-29T23:54:53.846000',
'2021-10-29T23:54:53.954000', '2021-10-29T23:54:57.902000'],
dtype='datetime64[us]')
%timeit building_new(first_array, third_array)
[out]
67 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
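How third_array was produced is not shown above. One possible way, if you already have a one-field structured array like second_array, is plain field access, which gives an ordinary datetime64 array. A minimal sketch with a hypothetical two-row stand-in:

```python
import numpy as np

# Hypothetical two-row stand-in for second_array.
second = np.zeros(2, dtype=[('date_time', '<M8[us]')])
second['date_time'] = np.array(['2021-10-01T00:00:00.299000',
                                '2021-10-01T00:00:00.309000'],
                               dtype='datetime64[us]')

# Selecting a single field returns an unstructured array of that field's dtype.
plain = second['date_time']
is_structured = plain.dtype.names is not None
is_view = np.shares_memory(plain, second)   # field access does not copy the data
print(plain.dtype, is_structured, is_view)
```

Since the result is a view and not a copy, getting the plain column this way costs essentially nothing.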