How do I read a large table from HDFS into a pandas DataFrame in a Jupyter notebook? The script is launched from a Docker image.
Libraries:
- sasl==0.2.1
- thrift==0.11.0
- thrift-sasl==0.4a1
- Impyla==0.16.2
from impala.dbapi import connect
from impala.util import as_pandas
impala_conn = connect(host='hostname', port=21050,
                      auth_mechanism='GSSAPI',
                      timeout=100000, use_ssl=True, ca_cert=None,
                      ldap_user=None, ldap_password=None,
                      kerberos_service_name='impala')
This works (LIMIT 100):
import pandas as pd
df = pd.read_sql("select id, crt_mnemo from demo_db.stg_deals_opn LIMIT 100", impala_conn)
print(df)
This does not work (LIMIT 1000). The operation hangs and raises no errors:
import pandas as pd
df = pd.read_sql("select id, crt_mnemo from demo_db.stg_deals_opn LIMIT 1000", impala_conn)
print(df)
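To narrow down where the hang occurs, the rows could be fetched in small batches instead of all at once. The following is only a minimal sketch, assuming the impyla cursor behaves like a standard DB-API 2.0 cursor (`execute`, `description`, `fetchmany`, `close`). The helper name `fetch_in_chunks` and the default batch size are hypothetical and not part of impyla or pandas.

```python
def fetch_in_chunks(conn, query, chunk_size=100):
    """Yield (column_names, rows) batches from a DB-API 2.0 connection.

    Hypothetical helper: it relies only on the generic DB-API cursor
    methods, so it is not specific to impyla.
    """
    cursor = conn.cursor()
    try:
        cursor.execute(query)
        # By DB-API convention, the first item of each description
        # entry is the column name.
        columns = [col[0] for col in cursor.description]
        while True:
            rows = cursor.fetchmany(chunk_size)
            if not rows:
                break  # result set exhausted
            yield columns, rows
    finally:
        cursor.close()
```

With the connection above, each batch could be printed as it arrives, e.g. `for columns, rows in fetch_in_chunks(impala_conn, "select id, crt_mnemo from demo_db.stg_deals_opn LIMIT 1000"): print(len(rows))`, and the batches combined afterwards with `pd.DataFrame.from_records(rows, columns=columns)` and `pd.concat`. If the loop stalls after a certain number of batches, that would suggest the problem is in fetching results rather than in pandas.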