postgresql - How can I efficiently run equality queries on key-value data that allows duplicate keys?

Question

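I have the following situation:

Data = approximately 400 million (string1, string2, score) tuples.
The data is about 20 GB and does not fit in memory.
The data is stored in a file in CSV format and is not sorted by any field.
I need to efficiently retrieve all tuples with a particular string, e.g. all tuples such that string1 = 'google'.

How can I design a system so that I can do this efficiently?

I have already tried PostgreSQL with a B-tree index and a GIN index, but each query is not fast enough (> 20-30 seconds).

Ideally, I need a solution that sorts the tuples by string1, stores them in sorted order, and then runs a binary search followed by a sequential scan to retrieve the matches. However, I don't know which database or system implements such functionality.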
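For illustration, the sort-then-binary-search idea can be sketched in plain Python, outside any database. This is only a minimal sketch under several assumptions that are not part of the original setup: the file has already been sorted by the raw bytes of its first field (the toy `write_sorted` helper does this in memory; for 20 GB an external sort would be needed instead), fields contain no embedded commas or newlines, and the function names `write_sorted` and `lookup` are hypothetical.

```python
import os


def write_sorted(path, rows):
    """Toy helper: write rows as CSV lines sorted by the bytes of string1.

    Sorts in memory, so it only stands in for a real external sort.
    """
    with open(path, "wb") as f:
        for s1, s2, score in sorted(rows, key=lambda r: r[0].encode("utf-8")):
            f.write(f"{s1},{s2},{score}\n".encode("utf-8"))


def _seek_line_start(f, pos):
    """Position f at the start of the first line beginning at offset >= pos."""
    if pos == 0:
        f.seek(0)
    else:
        f.seek(pos - 1)
        f.readline()  # consume the rest of the line containing byte pos - 1


def lookup(path, key):
    """Return all (string1, string2, score) rows whose string1 equals key.

    Assumes the file is sorted by the bytes of its first field.
    """
    target = key.encode("utf-8")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Binary search over byte offsets: find the smallest offset whose
        # next full line has a key >= target (or is past the end of file).
        while lo < hi:
            mid = (lo + hi) // 2
            _seek_line_start(f, mid)
            line = f.readline()
            if not line or line.split(b",", 1)[0] >= target:
                hi = mid
            else:
                lo = mid + 1
        # Sequential scan over the contiguous run of matching lines.
        _seek_line_start(f, lo)
        rows = []
        for line in f:
            s1, s2, score = line.rstrip(b"\n").split(b",")
            if s1 != target:
                break
            rows.append((s1.decode("utf-8"), s2.decode("utf-8"), float(score)))
        return rows
```

With a sorted file in place, a call such as `lookup("data_sorted.csv", "google")` (the file name is hypothetical) needs on the order of log2(file size) seeks to find the first match and then reads only the matching lines, which is the access pattern described above.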

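Update: here are the Postgres details:

I bulk-loaded the data into Postgres using the COPY command. Then I created two indexes on string1, one B-tree and one GIN. However, Postgres does not use either of them.

Table creation: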

  CREATE TABLE mytable(
    string1 varchar primary key,
    string2 varchar,
    source_id integer REFERENCES sources(id),
    score real
  );
  CREATE EXTENSION IF NOT EXISTS pg_trgm;
  CREATE INDEX string1_gin_index ON mytable USING gin (string1 gin_trgm_ops);
  CREATE INDEX string1_index ON mytable(lower(string1));

Query plans:

     isa=# EXPLAIN ANALYZE VERBOSE select * from mytable where string1 ilike 'google';
                                                             QUERY PLAN                                                                 
 --------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.mytable  (cost=235.88..41872.81 rows=11340 width=29) (actual time=20234.765..25566.128 rows=30971 loops=1)
   Output: hyponym, string2, source_id, score
   Recheck Cond: ((mytable.string1)::text ~~* 'google'::text)
   Rows Removed by Index Recheck: 34573
    ->  Bitmap Index Scan on string1_gin_index  (cost=0.00..233.05 rows=11340 width=0) (actual time=20218.263..20218.263 rows=65544 loops=1)
     Index Cond: ((mytable.string1)::text ~~* 'google'::text)
   Total runtime: 25568.209 ms
   (7 rows)

 isa=# EXPLAIN ANALYZE VERBOSE select * from isa where string1 = 'google';
                                                    QUERY PLAN                                                         
 ---------------------------------------------------------------------------------------------------------------------------
  Seq Scan on public.mytable  (cost=0.00..2546373.30 rows=3425 width=29) (actual time=11692.606..139401.099 rows=30511 loops=1)
    Output: string1, string2, source_id, score
    Filter: ((mytable.string1)::text = 'google'::text)
    Rows Removed by Filter: 124417194
    Total runtime: 139403.950 ms
    (5 rows)
