I am using the pandas.read_sql function with a Hive connection to extract a very large amount of data. I have a script like this:
df = pd.read_sql(query_big, hive_connection)
df2 = pd.read_sql(query_simple, hive_connection)
The big query takes a long time. After it finishes, Python returns the following error when it tries to execute the second line:
raise NotSupportedError("Hive does not have transactions") # pragma: no cover
It seems that something is wrong with the connection.
Moreover, if I replace the second line with multiprocessing.Manager().Queue(), it returns the following error:
File "/usr/lib64/python3.6/multiprocessing/managers.py", line 662, in temp
token, exp = self._create(typeid, *args, **kwds)
File "/usr/lib64/python3.6/multiprocessing/managers.py", line 554, in _create
conn = self._Client(self._address, authkey=self._authkey)
File "/usr/lib64/python3.6/multiprocessing/connection.py", line 493, in Client
answer_challenge(c, authkey)
File "/usr/lib64/python3.6/multiprocessing/connection.py", line 732, in answer_challenge
message = connection.recv_bytes(256) # reject large message
File "/usr/lib64/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/usr/lib64/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/usr/lib64/python3.6/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
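This kind of error seems to be related to an exit function in connection.py that got messed up. Also, when I change the query in the first command to extract a smaller amount of data that does not take too long, everything works fine. So I think that, because the first query takes too long to execute, something is being terminated improperly. That would explain both errors: they are very different in nature, but both point to a broken connection.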
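One way I could test the broken-connection hypothesis is to give each query its own freshly opened connection instead of reusing hive_connection after the long query. The following is only a minimal sketch under assumptions: read_with_fresh_connection and its connect argument are hypothetical names (connect stands for whatever call created hive_connection), and sqlite3 is used purely as a stand-in so the snippet runs without a Hive server.

```python
import sqlite3
from contextlib import closing


def read_with_fresh_connection(connect, query, read=None):
    """Run one query on a newly opened connection, then close it.

    connect: hypothetical zero-argument callable returning a DB-API connection.
    read:    optional callable taking (query, connection), e.g. pd.read_sql.
    """
    with closing(connect()) as conn:
        if read is not None:
            return read(query, conn)
        # Fallback: plain DB-API cursor, no pandas required.
        with closing(conn.cursor()) as cur:
            cur.execute(query)
            return cur.fetchall()


# Stand-in for the real Hive connection factory (assumption: sqlite3 only
# replaces Hive here so that the example is runnable on its own).
def sqlite_connect():
    return sqlite3.connect(":memory:")


# Each query gets its own connection, so the second one does not depend
# on the state of the connection used by the first.
rows_big = read_with_fresh_connection(sqlite_connect, "SELECT 1 AS x")
rows_simple = read_with_fresh_connection(sqlite_connect, "SELECT 2 AS x")
```

In my script this would be called with read=pd.read_sql and a connect function that opens a new Hive connection. If the second query then succeeds, that would suggest the original connection had gone stale during the long query. It would not explain the multiprocessing EOFError by itself.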