python - Fastest way to read and process 10k Excel cells in Python/Pandas?

Question

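I want to read and process real-time DDE data from a trading platform, using Excel as a "bridge" between the trading platform (which sends the data) and Python (which processes it), and print the results back to Excel as a front-end "GUI". Speed is crucial. I need to: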

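read 6,000-10,000 cells from Excel as fast as possible
sum the ticks that passed at the same time (same h:m:sec)
check whether the DataFrame contains any value from a static array (e.g. large volumes)
write the output to the same Excel file (a different sheet), which serves as the front-end output 'GUI'.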

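I imported the 'xlwings' library and use it to read data from one sheet, compute the needed values in Python, and then print the results to another sheet of the same file. I want Excel to stay open and visible so that it acts as an 'output dashboard'. The function runs in an infinite loop, reading real-time stock prices.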

import xlwings as xw
import numpy as np
import pandas as pd

...
...

tickdf = pd.DataFrame(xw.Book('datafile.xlsx').sheets['raw_data'].range((1,5), (1500, 8)).value)
tickdf.columns = ['time', 'price', 'all-tick','symb']
tickdf = tickdf[['time','symb', 'price', 'all-tick']]
#read data and fill a pandas.df with values, then re-order columns

try:
   global ttt #this is used as temporary global pandas.df
   global tttout #this is used as output global pandas.df copy
   #they are global as they can be zeroed with another function

   ttt = pd.concat([ttt, tickdf], ignore_index=False)  # DataFrame.append was removed in pandas 2.0
   #at each loop, newly read ticks are added as rows to the end of ttt global.df.

   ttt.drop_duplicates(inplace=True)

   tttout = ttt.copy()
   #as an extra safety step against printing incomplete data, a copy of ttt is used as the DataFrame written to the Excel file

   tttout = tttout.groupby(['time','symb'], as_index=False).agg({'all-tick':'sum', 'price':'first'})
   tttout = tttout.set_index('time')
   #group by time/symbol, sum the ticks, and set time as index

   tttout = tttout.loc[tttout['all-tick'].isin(target_ticker)] 
   #find matching values comparing an array of a dozen values

   tttout = tttout.sort_values(by = ['time', 'symb'], ascending = [False, True])
   xw.Book(file_path).sheets['OUTPUT'].range('B2').value = tttout

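I run this function on an i5 @ 4.2 GHz. Together with a few other small pieces of code, it takes 500-600 ms per loop, which is fairly good (but not great!). I would like to know whether there is a better approach and which steps might be the bottleneck.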
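
One way to find out which step dominates the 500-600 ms is to time each stage separately. The following is a minimal sketch, not part of the original script: `timed` and the `timings` dict are hypothetical helper names, and the DataFrame is synthetic stand-in data in place of the xlwings read.

```python
import time
from contextlib import contextmanager

import numpy as np
import pandas as pd

timings = {}  # hypothetical: stage name -> elapsed seconds


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Synthetic stand-in for the 1500 rows read from the 'raw_data' sheet.
rng = np.random.default_rng(0)
tickdf = pd.DataFrame({
    "time": rng.choice(["10:00:04", "10:00:05", "10:00:06"], size=1500),
    "symb": rng.choice(["ABC", "XYZ", "QRS"], size=1500),
    "price": rng.random(1500).round(2),
    "all-tick": rng.choice([1, 200, 50000, 1000000], size=1500),
})

with timed("dedupe"):
    deduped = tickdf.drop_duplicates()

with timed("groupby"):
    grouped = deduped.groupby(["time", "symb"], as_index=False).agg(
        {"all-tick": "sum", "price": "first"}
    )

for label, seconds in timings.items():
    print(f"{label}: {seconds * 1000:.2f} ms")
```

Wrapping the xlwings read and write calls the same way would show whether the COM round trips or the pandas work account for most of the loop time.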

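The code reads 1500 rows, one per listed stock in alphabetical order; each row is the 'last tick' that passed on the market for that particular stock, and it looks like this: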

'10:00:04 | ABC | 10.33 | 50000'
'09:45:20 | XYZ | 5.260 | 200 '
'....

That is: time, ticker symbol, price, volume.

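I want to find out whether certain specific volumes are being traded on the market, such as 1.000.000 (since it represents a huge order), or perhaps just '1', which is often used as a market 'heartbeat', a kind of fake order.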

My approach is to use Pandas, xlwings, and the 'isin' method. Is there a more efficient approach that would improve my script's performance?
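
The aggregation and lookup described in the question can be isolated from Excel and checked on their own. A minimal sketch, assuming the column names from the code above; `flag_target_volumes` and the sample values in `target_ticker` are hypothetical:

```python
import pandas as pd


def flag_target_volumes(ticks, targets):
    """Sum volume per (time, symb) and keep rows whose total is in `targets`."""
    grouped = ticks.groupby(["time", "symb"], as_index=False).agg(
        {"all-tick": "sum", "price": "first"}
    )
    matches = grouped[grouped["all-tick"].isin(targets)]
    # Newest time first, then symbols alphabetically, with time as the index.
    return matches.sort_values(
        by=["time", "symb"], ascending=[False, True]
    ).set_index("time")


ticks = pd.DataFrame({
    "time": ["10:00:04", "10:00:04", "09:45:20", "09:45:20"],
    "symb": ["ABC", "ABC", "XYZ", "QRS"],
    "price": [10.33, 10.34, 5.26, 7.10],
    "all-tick": [500000, 500000, 200, 1],
})
target_ticker = [1, 1000000]  # hypothetical target volumes

result = flag_target_volumes(ticks, target_ticker)
print(result)
```

Keeping this step in a plain function makes it possible to time and test the pandas work separately from the Excel read and write.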

score 1 · Accepted Answer

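It would be faster to use a UDF written with PyXLL, as that avoids going through COM and an external process. You would have a formula in Excel whose input is set to your data range, and it would be called each time the input data updates. This removes the need to keep polling the data in an infinite loop, and should be much faster than running Python outside of Excel.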

If you are not already familiar with PyXLL, see https://www.pyxll.com/docs/introduction.html.

PyXLL can convert the input range to a pandas DataFrame for you (see https://www.pyxll.com/docs/userguide/pandas.html), but that may not be the fastest way to do it.

The quickest way to transfer data from Excel to Python is via a floating-point numpy array, using the "numpy_array" type in PyXLL (see https://www.pyxll.com/docs/userguide/udfs/argtypes.html#numpy-array-types).
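
To illustrate working directly on a float array, the volume check can be written against a 2D numpy array with no pandas involved. This is a sketch under assumptions: the function name `rows_with_target_volume` and the column layout (time as an Excel day fraction, price, volume) are hypothetical, and the PyXLL registration itself is omitted because it only runs inside Excel.

```python
import numpy as np


def rows_with_target_volume(data, targets):
    """Return the rows of `data` whose volume column (index 2) is in `targets`."""
    data = np.asarray(data, dtype=float)
    mask = np.isin(data[:, 2], targets)
    return data[mask]


# Columns (assumed layout): time as Excel day fraction, price, volume.
data = np.array([
    [0.416713, 10.33, 50000.0],
    [0.406481, 5.26, 200.0],
    [0.416725, 7.10, 1.0],
    [0.416736, 3.15, 1000000.0],
])

hits = rows_with_target_volume(data, [1.0, 1000000.0])
print(hits)
```

Ticker symbols are strings and cannot go into a float array, so they would have to be passed separately from the numeric data.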

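As speed is a concern, maybe you could split the data up: have some functions that take mostly static data (e.g. row and column headers), other functions that take the variable data as numpy_arrays where possible (or other types), and then a final function that combines them all.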

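PyXLL can return Python objects to Excel as object handles. If you need to return intermediate results, doing so is generally faster than expanding the whole dataset into an Excel range.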

score 0 · Accepted Answer

@Tony Roberts, thank you.

I have one doubt and one observation.

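Doubt: the data updates very quickly, every 50-100 ms. Is it feasible to call the UDF that often? Would it be lightweight? I have very little experience in this area.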

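Observation: PyXLL is certainly very powerful, well made, and well maintained, but IMHO, at $25 per month, it goes against the pure nature of the free Python language. I do understand, though, that quality comes at a price.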
