python - Python Hive query limited to 100 rows

Question

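I am using the Python Apache Hive client ( https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python ) to run queries against a Shark server.

The problem is that when I run the query normally in the Shark CLI I get the full result set, but when I use the Hive Python client it only returns 100 rows. My SELECT query has no LIMIT clause.

Shark CLI: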

[localhost:10000] shark> SELECT COUNT(*) FROM table;
46831

Python:

import sys
from hive_service import ThriftHive
from hive_service.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

try:
    transport = TSocket.TSocket('localhost', 10000)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)

    client = ThriftHive.Client(protocol)
    transport.open()

    client.execute("SELECT * from table")
    hdata = client.fetchAll()
    transport.close()
    ....

In [97]: len(hdata)
Out[97]: 100

Strangely, when I run COUNT(*) from the Python code, I get:

In [104]: hdata
Out[104]: ['46831']

Is there a settings file or variable I can access to lift this limit?

score 1 · Accepted Answer

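The 100-row limit is set in the underlying Driver class; look for the field private int maxRows = 100;

If you call the fetchN() method instead, maxRows is set on the driver to the number of rows you request: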

public List<String> fetchN(int numRows)

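One possible workaround is to get the total row count first and then call fetchN() with it. However, you may run into trouble if the result can contain a very large number of rows. For that reason, fetching and processing the data in chunks seems like a better idea. For comparison, here is what the CLI does: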

do {
  results = client.fetchN(LINES_TO_FETCH);
  for (String line : results) {
    out.println(line);
  }
} while (results.size() == LINES_TO_FETCH);

where LINES_TO_FETCH = 40. This is a more or less arbitrary value, though, and you can tune it in your code to suit your needs.
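The same chunked loop could be written on the Python side roughly as follows. This is only a sketch: it assumes the Python Thrift client exposes fetchN(numRows) in the same way as the Java signature shown above, and fetch_in_chunks and chunk_size are hypothetical names chosen for illustration.

```python
def fetch_in_chunks(client, chunk_size=40):
    """Yield rows of an already-executed query, chunk_size rows per round trip.

    Assumes `client` has a fetchN(numRows) method that returns a list of
    rows and returns fewer than numRows rows once the results run out.
    """
    while True:
        rows = client.fetchN(chunk_size)
        for row in rows:
            yield row
        # A short (or empty) batch means the result set is exhausted.
        if len(rows) < chunk_size:
            break
```

With the client from the question, this would be used in place of fetchAll(): call client.execute("SELECT * from table") first, then iterate with for row in fetch_in_chunks(client, 1000): and process each row as it arrives. Because it is a generator, only one chunk is held in memory at a time.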
