python - How to efficiently read rows from Google BigTable into a pandas DataFrame

Question

Use case:

I am using Google BigTable to store counts like this:

|  rowkey  |    columnfamily    |
|          | col1 | col2 | col3 |
|----------|------|------|------|
| row1     | 1    | 2    | 3    |
| row2     | 2    | 4    | 8    |
| row3     | 3    | 3    | 3    |

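I want to read all rows for a given range of row keys (let's assume all of them in this case) and aggregate the values per column.

A naive implementation would query the rows and iterate over them while aggregating the counts, like so: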

from google.cloud.bigtable import Client

instance = Client(project='project').instance('my-instance')
table = instance.table('mytable')

col1_sum = 0
col2_sum = 0
col3_max = 0

# read_rows() returns a streaming result object; keep a reference to it
row_data = table.read_rows()
# Pull the whole stream into the row_data.rows dict (row key -> row)
row_data.consume_all()

for row in row_data.rows.values():
    cells = row.cells['columnfamily']
    # Cell.value is a bytes attribute (not a method) holding a big-endian integer
    col1_sum += int.from_bytes(cells[b'col1'][0].value, byteorder='big')
    col2_sum += int.from_bytes(cells[b'col2'][0].value, byteorder='big')
    col3_value = int.from_bytes(cells[b'col3'][0].value, byteorder='big')
    col3_max = max(col3_max, col3_value)

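Question:

Is there a way to efficiently load the resulting rows into a pandas DataFrame and leverage pandas performance for the aggregation?

I would like to avoid the for loop for computing the aggregates, since a Python-level loop is known to be slow.

I am aware of the Apache Arrow project and its Python bindings. Although HBase is mentioned as a backing project (and Google BigTable is advertised as being very similar to HBase), I can't seem to find a way to use it for the use case I described here.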

score 2 · Accepted Answer

I don't believe an existing pandas interface for Cloud Bigtable exists, but it would be a nice project to build, similar to the BigQuery interface in https://github.com/pydata/pandas-gbq.
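For what it's worth, here is a rough sketch of what the core of such an adapter might look like. It is not an existing library API: `rows_to_dataframe` and `fake_row` are hypothetical names, and the sketch assumes each row exposes `.row_key` and `.cells` the way the Python client's row objects do, with cell values stored as big-endian integers as in the question. The fake rows only exist so the snippet runs without a Bigtable instance.

```python
from types import SimpleNamespace

import pandas as pd


def rows_to_dataframe(rows, family, columns):
    """Build a DataFrame with one line per Bigtable row (hypothetical helper).

    `rows` is any iterable of objects with `.row_key` (bytes) and
    `.cells` ({family: {qualifier_bytes: [cell, ...]}}).
    Assumption: every cell value is a big-endian integer.
    """
    index = []
    records = []
    for row in rows:
        family_cells = row.cells[family]
        index.append(row.row_key.decode("utf-8"))
        records.append({
            col: int.from_bytes(family_cells[col.encode("utf-8")][0].value,
                                byteorder="big")
            for col in columns
        })
    return pd.DataFrame(records, index=index, columns=columns)


def fake_row(key, **values):
    """Stand-in for a row returned by the client, for offline use only."""
    cells = {
        "columnfamily": {
            name.encode("utf-8"): [SimpleNamespace(value=v.to_bytes(8, "big"))]
            for name, v in values.items()
        }
    }
    return SimpleNamespace(row_key=key.encode("utf-8"), cells=cells)


rows = [
    fake_row("row1", col1=1, col2=2, col3=3),
    fake_row("row2", col1=2, col2=4, col3=8),
    fake_row("row3", col1=3, col2=3, col3=3),
]
df = rows_to_dataframe(rows, "columnfamily", ["col1", "col2", "col3"])

# The aggregation itself is then a single vectorized call
totals = df.agg({"col1": "sum", "col2": "sum", "col3": "max"})
print(totals)
```

Note that the decoding step still loops over the rows in Python; only the aggregation afterwards is vectorized. With a real table you would pass the rows returned by `table.read_rows()` instead of the fake ones.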

score 2 · Accepted Answer

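After digging deeper into the BigTable mechanics, it seems that the Python client performs a ReadRows gRPC call when you call table.read_rows(). That gRPC call returns a streamed response of rows in key order over HTTP/2 (see the docs).

If the API returns the data row by row, it seems to me that the only useful way to consume that response is row-based. Trying to load the data into a columnar format just to avoid iterating over the rows seems to be of little use.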
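To make the row-based approach concrete, here is a minimal sketch of a single-pass aggregation that keeps only the running totals in memory. The generator `fake_read_rows` is a hypothetical stand-in for the ReadRows stream and yields already-decoded integers; with the real client you would decode each cell's bytes as in the question.

```python
def fake_read_rows():
    """Hypothetical stand-in for the ReadRows stream: yields rows in key order."""
    yield "row1", {"col1": 1, "col2": 2, "col3": 3}
    yield "row2", {"col1": 2, "col2": 4, "col3": 8}
    yield "row3", {"col1": 3, "col2": 3, "col3": 3}


def aggregate(stream):
    """Consume the stream once, keeping only running aggregates in memory."""
    col1_sum = 0
    col2_sum = 0
    col3_max = None  # None until the first row is seen
    for _row_key, values in stream:
        col1_sum += values["col1"]
        col2_sum += values["col2"]
        if col3_max is None or values["col3"] > col3_max:
            col3_max = values["col3"]
    return col1_sum, col2_sum, col3_max


result = aggregate(fake_read_rows())
print(result)
```

Because rows are folded in as they arrive, memory use does not grow with the number of rows, which a DataFrame holding the full table cannot offer.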

score 1 · Accepted Answer

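You may be able to use pdhbase together with google-cloud-happybase. If that doesn't work out of the box, it may at least give you some inspiration for how to do the integration.

There is also a Cloud Bigtable / BigQuery integration, which you could combine with https://github.com/pydata/pandas-gbq (thanks to Wes McKinney for the hint).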
