
I am trying to read a big table from Teradata, and it takes a very long time. My table has 5 million rows and 60 columns, and it takes 30 minutes to load into memory. I am using the teradatasql package, but the same table takes only 5 minutes to load into R with the RJDBC package.

Python code (this takes 30 minutes):

import teradatasql
import pandas as pd

conn = teradatasql.connect(host=host, user=user_name, password=password, database=database)
df = pd.read_sql("SELECT * FROM big_table", conn)

R code (this takes only 3 minutes):

library(RJDBC)

# Teradata connection (drv_tera is the Teradata JDBC driver object, created earlier)
con_tera <- dbConnect(drv_tera, "jdbc:teradata://{ip_host}/DATABASE=DBI_MIN,DBS_PORT=1025",
                      Sys.getenv("TERA_DB_USER"), Sys.getenv("TERA_DB_PASS"))

# create query
final_query <- 'select * from big_table'

# get data
dataset_caribu <- dbGetQuery(con_tera, final_query)

I tried increasing the cursor's arraysize in Python, but it did not significantly improve the execution time.
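This is roughly what I tried (a minimal sketch; arraysize and fetchmany() are standard DBAPI features, and the batch size of 10000 is just an example value):

import teradatasql

with teradatasql.connect(host=host, user=user_name, password=password, database=database) as conn:
    with conn.cursor() as cur:
        cur.arraysize = 10000  # hint for how many rows fetchmany() returns per call
        cur.execute("SELECT * FROM big_table")
        rows = []
        while True:
            batch = cur.fetchmany()  # fetches up to cur.arraysize rows
            if not batch:
                break
            rows.extend(batch)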


1 Answer


pandas.read_sql is slower than using the teradatasql driver directly.

Here is a simple Python script that tests 5 million rows and 60 columns, with 80% non-NULL and 20% NULL column values:

import pandas
import teradatasql
import time

with teradatasql.connect(host="whomooz", user="guest", password="please") as con:
    with con.cursor() as cur:
        # Create a 60-column volatile table.
        cur.execute("create volatile table voltab (" + ",".join(["c{} integer".format(n) for n in range(1, 61)]) + ") on commit preserve rows")
        # Seed 62,500 rows, then repeatedly double the row count up to 5,000,000.
        cur.execute("insert into voltab(c1) select row_number() over (order by calendar_date) as c1 from sys_calendar.calendar qualify c1 <= 62500")
        cur.execute("insert into voltab(c1) select c1 + 62500 from voltab")
        cur.execute("insert into voltab(c1) select c1 + 125000 from voltab")
        cur.execute("insert into voltab(c1) select c1 + 250000 from voltab")
        cur.execute("insert into voltab(c1) select c1 + 500000 from voltab")
        cur.execute("insert into voltab(c1) select c1 + 1000000 from voltab")
        cur.execute("insert into voltab(c1) select c1 + 2000000 from voltab")
        cur.execute("insert into voltab(c1) select c1 + 4000000 from voltab where c1 <= 1000000")
        # Populate columns c2..c48 so that 48 of the 60 columns (80%) are non-NULL.
        cur.execute("update voltab set " + ",".join(["c{} = c1".format(n) for n in range(2, 49)]))

        # Time fetching all rows directly through the driver.
        cur.execute("select * from voltab")
        print('beginning fetchall')
        dStartTime = time.time()
        rows = cur.fetchall()
        dElapsed = time.time() - dStartTime
        print("fetchall took {} seconds, or {} minutes, and returned {} rows".format(dElapsed, dElapsed / 60, len(rows)))

        # Time the same query through pandas.read_sql for comparison.
        dStartTime = time.time()
        df = pandas.read_sql("select * from voltab", con)
        dElapsed = time.time() - dStartTime
        print("read_sql took {} seconds, or {} minutes, and returned {} rows".format(dElapsed, dElapsed / 60, len(df)))

My results were:

fetchall took 638.6090559959412 seconds, or 10.64348426659902 minutes, and returned 5000000 rows
read_sql took 2293.84486413002 seconds, or 38.23074773550034 minutes, and returned 5000000 rows
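So if you need a DataFrame rather than a list of rows, a faster approach is to fetch through the cursor and build the DataFrame yourself. A minimal sketch, reusing the voltab table from the script above (column names come from cursor.description, per the DBAPI spec):

import pandas
import teradatasql

# Fetch through the driver (the fast path measured above) and construct
# the DataFrame manually instead of calling pandas.read_sql.
with teradatasql.connect(host="whomooz", user="guest", password="please") as con:
    with con.cursor() as cur:
        cur.execute("select * from voltab")
        rows = cur.fetchall()
        # cursor.description holds one (name, type, ...) tuple per result column.
        columns = [desc[0] for desc in cur.description]
        df = pandas.DataFrame(rows, columns=columns)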