python - Difference between happybase table.scan() and hbase thrift scannerGetList()

Question

I have two versions of a Python script that scan a table in HBase 1000 rows at a time inside a while loop. The first uses happybase, as in https://happybase.readthedocs.org/en/latest/user.html#retrieving-rows

while variable:
    for key, data in hbase.table(tablename).scan(row_start=new_key, batch_size=1000, limit=1000):
        print key
    new_key = key

The second uses the HBase Thrift interface, as in http://blog.cloudera.com/blog/2014/04/how-to-use-the-hbase-thrift-interface-part-3-using-scans/

scanner_id = hbase.scannerOpenWithStop(tablename, '', '', [])
data = hbase.scannerGetList(scanner_id, 1000) 
while len(data):
    for dbpost in data:
        print row_of_dbpost
    data = hbase.scannerGetList(scanner_id, 1000)

The row keys in the database are numbers. My problem is that something strange happens at certain rows:

happybase prints (rows):

... 100161632382107648 
10016177552 
10016186396 
10016200693 
10016211838 
100162138374537217 (point of interest) 
193622937692155904 
193623435597983745...

and thrift_scanner prints (rows):

... 100161632382107648 
10016177552 
10016186396 
10016200693 
10016211838 
100162138374537217 (point of interest)
100162267416506368 
10016241167 
10016296927 ...

This does not happen at the boundary to the next 1000 rows (row_start=new_scan or the next data = scannerGetList call) but in the middle of a batch. And it happens every time.

I would say the second script, the one with scannerGetList, is doing it right.

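Why does happybase behave differently? Does it take timestamps into account, or some other happybase / HBase internal logic? Will it eventually scan the whole table, just in a different order?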

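P.S. I know the happybase version will scan and print the 1000th row twice, and that scannerGetList will skip the first row in the next data. That is not the point; the magic happens in the middle of a 1000-row batch.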

score 3 · Accepted Answer

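I'm not sure about your data, but these loops are not the same. Your Thrift version uses only one scanner, while your Happybase example repeatedly creates a new scanner. Also, your Happybase version imposes a scanner limit, while your Thrift version does not.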

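With Thrift you have to do the bookkeeping yourself, and the scannerGetList() loop needs duplicated code (the call appears twice), so that may be what is causing your confusion.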
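To illustrate the bookkeeping mentioned above, here is a minimal sketch that wraps one Thrift scanner in a generator, so the scannerGetList() call appears only once and the scanner gets closed. The name `scan_rows` is hypothetical, and `client` is assumed to be the generated Hbase.Client from the question (the `hbase` variable there). scannerClose() is part of the same Thrift interface.

```python
def scan_rows(client, table_name, batch_size=1000):
    """Yield every row from a single Thrift scanner, then close it."""
    # Four arguments, as in the question. Assumption: some generated
    # clients also expect a trailing `attributes` argument, so check
    # the Hbase.thrift version you generated from.
    scanner_id = client.scannerOpenWithStop(table_name, '', '', [])
    try:
        while True:
            # The only place scannerGetList() is called.
            batch = client.scannerGetList(scanner_id, batch_size)
            if not batch:
                break  # an empty list means the scanner is exhausted
            for result in batch:
                yield result
    finally:
        # Runs when the scan finishes or the generator is closed early.
        client.scannerClose(scanner_id)
```

The caller then loops with `for result in scan_rows(hbase, tablename):` and no longer needs to track the scanner id or repeat the fetch call.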

The right way to do this with Happybase is simply:

table = connection.table(tablename)
for key, data in table.scan(row_start=new_key, batch_size=1000):
    print key
    if some_condition:
        break  # this will cleanly close the scanner

Note: there is no nested loop here. Another benefit is that Happybase properly closes the scanner once you are done with it, while your Thrift version does not.
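If the point of the 1000-row loop was to handle rows in groups, one possible approach is to keep the single scanner and group its output on the client side. This is only a sketch: `in_chunks` is a hypothetical helper, not part of Happybase, and it works on any iterable of rows.

```python
from itertools import islice


def in_chunks(rows, size=1000):
    """Group any iterable of rows into lists of at most `size` items."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, size))
        if not chunk:
            return  # the underlying scan is exhausted
        yield chunk
```

With a Happybase table this would be used as `for chunk in in_chunks(table.scan(batch_size=1000)):`, where each `chunk` is a list of up to 1000 `(key, data)` pairs coming from one scanner, so no row is read twice and none is skipped.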
