python - Sorting a large, multi-indexed database table in SQLite

Question

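I'm trying to SELECT from (with a WHERE clause) and sort a large database table in sqlite3 via Python. The sort currently takes 30+ minutes on about 36 MB of data. I have a feeling it could run faster than this with indices, but I think the order of my code may be incorrect.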

The code is executed in the order listed here.

My CREATE TABLE statement looks like this:

c.execute('''CREATE table gtfs_stop_times (
  trip_id text , --REFERENCES gtfs_trips(trip_id),
  arrival_time text, -- CHECK (arrival_time LIKE '__:__:__'),
  departure_time text, -- CHECK (departure_time LIKE '__:__:__'),
  stop_id text , --REFERENCES gtfs_stops(stop_id),
  stop_sequence int NOT NULL --NOT NULL
)''')

The rows are then inserted in the next step:

stop_times = csv.reader(open("tmp\\avl_stop_times.txt"))
c.executemany('INSERT INTO gtfs_stop_times VALUES (?,?,?,?,?)', stop_times)

Next, I create an index on two columns (trip_id and stop_sequence):

c.execute('CREATE INDEX trip_seq ON gtfs_stop_times (trip_id, stop_sequence)')

Finally, I run a SELECT statement with a WHERE clause that sorts the data by the two columns used in the index, and then write the result to a csv file:

c.execute('''SELECT gtfs_stop_times.trip_id, gtfs_stop_times.arrival_time, gtfs_stop_times.departure_time, gtfs_stops.stop_id, gtfs_stop_times.stop_sequence
FROM gtfs_stop_times, gtfs_stops
WHERE gtfs_stop_times.stop_id=gtfs_stops.stop_code
ORDER BY gtfs_stop_times.trip_id, gtfs_stop_times.stop_sequence''')

f = open("gtfs_update\\stop_times.txt", "w")
writer = csv.writer(f, dialect = 'excel')
writer.writerow([i[0] for i in c.description]) # write headers
writer.writerows(c)
del writer

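Is there any way to speed up step 4 (possibly by changing how I add and/or use the index), or should I just go to lunch while this runs?

I have added PRAGMA statements to try to improve performance, but to no avail: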

c.execute('PRAGMA main.page_size = 4096')
c.execute('PRAGMA main.cache_size=10000')
c.execute('PRAGMA main.locking_mode=EXCLUSIVE')
c.execute('PRAGMA main.synchronous=NORMAL')
c.execute('PRAGMA main.journal_mode=WAL')
c.execute('PRAGMA main.cache_size=5000')

score 2 · Accepted Answer

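The SELECT executes extremely fast because there is no gtfs_stops table, so you only get an error message.

If we assume that there is a gtfs_stops table, then your trip_seq index is already quite optimal for the query. However, you also need an index for looking up the stop_code column values in gtfs_stops.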
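As a rough sketch of that suggestion, the snippet below builds both tables in an in-memory database, adds the extra index, and asks SQLite for its query plan. The gtfs_stops schema (only stop_id and stop_code), the index name stop_code_idx, and the sample rows are assumptions made for illustration, since the question does not show them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()

# Table from the question.
c.execute("""CREATE TABLE gtfs_stop_times (
  trip_id text,
  arrival_time text,
  departure_time text,
  stop_id text,
  stop_sequence int NOT NULL
)""")
# Hypothetical minimal gtfs_stops table: only the two columns the query uses.
c.execute("CREATE TABLE gtfs_stops (stop_id text, stop_code text)")

# Made-up sample rows, inserted out of order on purpose.
c.executemany("INSERT INTO gtfs_stop_times VALUES (?,?,?,?,?)", [
    ("t2", "08:00:00", "08:00:00", "100", 1),
    ("t1", "07:05:00", "07:05:00", "200", 2),
    ("t1", "07:00:00", "07:00:00", "100", 1),
])
c.executemany("INSERT INTO gtfs_stops VALUES (?,?)",
              [("S100", "100"), ("S200", "200")])

# Index from the question: matches the ORDER BY columns.
c.execute("CREATE INDEX trip_seq ON gtfs_stop_times (trip_id, stop_sequence)")
# Additional index for the join lookup (name is an assumption).
c.execute("CREATE INDEX stop_code_idx ON gtfs_stops (stop_code)")

query = """SELECT gtfs_stop_times.trip_id, gtfs_stop_times.arrival_time,
       gtfs_stop_times.departure_time, gtfs_stops.stop_id,
       gtfs_stop_times.stop_sequence
FROM gtfs_stop_times, gtfs_stops
WHERE gtfs_stop_times.stop_id = gtfs_stops.stop_code
ORDER BY gtfs_stop_times.trip_id, gtfs_stop_times.stop_sequence"""

# The last column of each EXPLAIN QUERY PLAN row is the readable description.
plan = [row[-1] for row in c.execute("EXPLAIN QUERY PLAN " + query)]
for step in plan:
    print(step)

rows = c.execute(query).fetchall()
```

If the plan shows gtfs_stops being searched through stop_code_idx rather than scanned, the join no longer has to walk the whole gtfs_stops table for every row of gtfs_stop_times. The exact wording of the plan output differs between SQLite versions.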
