
How do I read a large table from HDFS in a Jupyter notebook as a pandas DataFrame? The script is launched through a Docker image.

libraries:

  • sasl==0.2.1
  • thrift==0.11.0
  • thrift-sasl==0.4a1
  • impyla==0.16.2
from impala.dbapi import connect
from impala.util import as_pandas

impala_conn = connect(host='hostname', port=21050,
                      auth_mechanism='GSSAPI',
                      timeout=100000, use_ssl=True, ca_cert=None,
                      ldap_user=None, ldap_password=None,
                      kerberos_service_name='impala')

This works:


import pandas as pd
df = pd.read_sql("select id, crt_mnemo from demo_db.stg_deals_opn LIMIT 100", impala_conn)
print(df)

This does not work: the operation hangs and never raises an error.


import pandas as pd
df = pd.read_sql("select id, crt_mnemo from demo_db.stg_deals_opn LIMIT 1000", impala_conn)
print(df)


1 Answer


This seems to be an issue with the number of rows you can move from Impala with the pandas read_sql function. I had the same problem, but at a lower limit than yours. You may need to contact your database administrator to check the size limits. Here are some other options: https://docs.cloudera.com/machine-learning/cloud/import-data/topics/ml-running-queries-on-impala-tables.html
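One workaround worth trying before changing server-side limits is to pull the result set in chunks rather than in a single read_sql call. The sketch below assumes an open connection like the `impala_conn` from the question; the table and column names (`demo_db.stg_deals_opn`, `id`, `crt_mnemo`) are taken from the question, and the chunk size is an arbitrary example value.

```python
import pandas as pd

def read_in_chunks(conn, query, chunksize=10_000):
    """Fetch a large result set chunk by chunk via pandas.read_sql,
    then concatenate the pieces into a single DataFrame."""
    chunks = []
    # With chunksize set, read_sql returns an iterator of DataFrames
    # instead of materializing the whole result at once.
    for chunk in pd.read_sql(query, conn, chunksize=chunksize):
        chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)

# Hypothetical usage with the connection from the question:
# df = read_in_chunks(impala_conn,
#                     "select id, crt_mnemo from demo_db.stg_deals_opn")
```

This keeps each round trip small, which can also make it easier to see where a transfer stalls instead of the whole call hanging silently.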

answered 2020-08-13T16:05:28.140