I have 2 tables. Their structure is roughly as follows, although I have changed the names.
CREATE TABLE overlay_polygon
(
    overlay_polygon_id SERIAL PRIMARY KEY,
    some_other_polygon_id INTEGER REFERENCES some_other_polygon (some_other_polygon_id),
    dollar_value NUMERIC,
    geom GEOMETRY(Polygon,26915)
);
CREATE TABLE point
(
    point_id SERIAL PRIMARY KEY,
    some_other_polygon_id INTEGER REFERENCES some_other_polygon (some_other_polygon_id),
    -- A bunch of other fields that this query won't touch
    geom GEOMETRY(Point,26915)
);
point has a spatial index on its geom column, named spix_point, and also an index on its some_other_polygon_id column.
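For reference, the two indexes described above could have been created roughly as follows. This is only a sketch: spix_point is the name given in the post, while point_some_other_polygon_id_idx is a hypothetical name, since the post does not say what the second index is called.

```sql
-- GiST spatial index on the geometry column (name taken from the post).
CREATE INDEX spix_point ON point USING gist (geom);

-- Plain B-tree index on the foreign-key column.
-- The index name here is hypothetical; the post does not give it.
CREATE INDEX point_some_other_polygon_id_idx ON point (some_other_polygon_id);
```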
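There are about 500,000 rows in point, and nearly all of them intersect some row in overlay_polygon. Originally, my overlay_polygon table contained several rows with very small areas (most of them less than 1 square meter) that did not spatially intersect any rows in point. After deleting the small rows that do not intersect any rows in point, 38 rows remain.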
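A cleanup of this kind can be sketched with an anti-join. This is an illustration only, not necessarily the statement that was actually run; it assumes the deletion criterion is purely spatial (no match on some_other_polygon_id).

```sql
-- Delete every overlay polygon that intersects no point at all.
-- Assumption: only the spatial relationship matters for the cleanup.
DELETE FROM overlay_polygon op
WHERE NOT EXISTS (
    SELECT 1
    FROM point p
    WHERE ST_Intersects(op.geom, p.geom)
);
```

NOT EXISTS lets the scan of point stop at the first intersecting row for each polygon, so no per-polygon count is needed just to decide what to delete.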
As the name implies, overlay_polygon is a table of polygons generated by overlaying the polygons of 3 other tables (including some_other_polygon).
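In particular, I need to do some calculations using dollar_value and some columns of point. When I set out to delete the rows that intersect no points, to speed up future processing, I ended up querying for the COUNT of points per row. The most obvious query seemed to be the following.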
SELECT op.*, COUNT(point_id) AS num_points
FROM overlay_polygon op
LEFT JOIN point ON op.some_other_polygon_id = point.some_other_polygon_id AND ST_Intersects(op.geom, point.geom)
GROUP BY op.overlay_polygon_id
ORDER BY op.overlay_polygon_id
;
Here is its EXPLAIN (ANALYZE, BUFFERS).
GroupAggregate (cost=544.45..545.12 rows=38 width=8049) (actual time=284962.944..540959.914 rows=38 loops=1)
  Buffers: shared hit=58694 read=17119, temp read=189483 written=189483
  I/O Timings: read=39171.525
  -> Sort (cost=544.45..544.55 rows=38 width=8049) (actual time=271754.952..534154.573 rows=415224 loops=1)
        Sort Key: op.overlay_polygon_id
        Sort Method: external merge Disk: 897016kB
        Buffers: shared hit=58694 read=17119, temp read=189483 written=189483
        I/O Timings: read=39171.525
        -> Nested Loop Left Join (cost=0.00..543.46 rows=38 width=8049) (actual time=0.110..46755.284 rows=415224 loops=1)
              Buffers: shared hit=58694 read=17119
              I/O Timings: read=39171.525
              -> Seq Scan on overlay_polygon op (cost=0.00..11.38 rows=38 width=8045) (actual time=0.043..153.255 rows=38 loops=1)
                    Buffers: shared hit=1 read=10
                    I/O Timings: read=152.866
              -> Index Scan using spix_point on point (cost=0.00..13.99 rows=1 width=200) (actual time=50.229..1139.868 rows=10927 loops=38)
                    Index Cond: (op.geom && geom)
                    Filter: ((op.some_other_polygon_id = some_other_polygon_id) AND _st_intersects(op.geom, geom))
                    Rows Removed by Filter: 13353
                    Buffers: shared hit=58693 read=17109
                    I/O Timings: read=39018.660
Total runtime: 542172.156 ms
However, I found that this query runs much faster:
SELECT *
FROM overlay_polygon
JOIN (SELECT op.overlay_polygon_id, COUNT(point_id) AS num_points
      FROM overlay_polygon op
      LEFT JOIN point ON op.some_other_polygon_id = point.some_other_polygon_id AND ST_Intersects(op.geom, point.geom)
      GROUP BY op.overlay_polygon_id
     ) x ON x.overlay_polygon_id = overlay_polygon.overlay_polygon_id
ORDER BY overlay_polygon.overlay_polygon_id
;
Its EXPLAIN (ANALYZE, BUFFERS) is below.
Sort (cost=557.78..557.88 rows=38 width=8057) (actual time=18904.661..18904.748 rows=38 loops=1)
  Sort Key: overlay_polygon.overlay_polygon_id
  Sort Method: quicksort Memory: 126kB
  Buffers: shared hit=58690 read=17134
  I/O Timings: read=9924.328
  -> Hash Join (cost=544.88..556.78 rows=38 width=8057) (actual time=18903.697..18904.210 rows=38 loops=1)
        Hash Cond: (overlay_polygon.overlay_polygon_id = op.overlay_polygon_id)
        Buffers: shared hit=58690 read=17134
        I/O Timings: read=9924.328
        -> Seq Scan on overlay_polygon (cost=0.00..11.38 rows=38 width=8045) (actual time=0.127..0.411 rows=38 loops=1)
              Buffers: shared hit=2 read=9
              I/O Timings: read=0.173
        -> Hash (cost=544.41..544.41 rows=38 width=12) (actual time=18903.500..18903.500 rows=38 loops=1)
              Buckets: 1024 Batches: 1 Memory Usage: 2kB
              Buffers: shared hit=58688 read=17125
              I/O Timings: read=9924.154
              -> HashAggregate (cost=543.65..544.03 rows=38 width=8) (actual time=18903.276..18903.379 rows=38 loops=1)
                    Buffers: shared hit=58688 read=17125
                    I/O Timings: read=9924.154
                    -> Nested Loop Left Join (cost=0.00..543.46 rows=38 width=8) (actual time=0.052..17169.606 rows=415224 loops=1)
                          Buffers: shared hit=58688 read=17125
                          I/O Timings: read=9924.154
                          -> Seq Scan on overlay_polygon op (cost=0.00..11.38 rows=38 width=8038) (actual time=0.004..0.537 rows=38 loops=1)
                                Buffers: shared hit=1 read=10
                                I/O Timings: read=0.279
                          -> Index Scan using spix_point on point (cost=0.00..13.99 rows=1 width=200) (actual time=4.422..381.991 rows=10927 loops=38)
                                Index Cond: (op.geom && geom)
                                Filter: ((op.some_other_polygon_id = some_other_polygon_id) AND _st_intersects(op.geom, geom))
                                Rows Removed by Filter: 13353
                                Buffers: shared hit=58687 read=17115
                                I/O Timings: read=9923.875
Total runtime: 18905.293 ms
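As you can see, they have comparable cost estimates, although I'm not sure how accurate those estimates are. I'm skeptical of cost estimates that involve PostGIS functions. Both tables have had VACUUM ANALYZE FULL run on them since they were last modified and before the queries were run.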
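For anyone reproducing this, the maintenance step can be sketched as below. Note that in PostgreSQL's unparenthesized syntax FULL is written before ANALYZE; the parenthesized form accepts the options in any order.

```sql
-- Rewrite each table (FULL) and refresh planner statistics (ANALYZE).
VACUUM FULL ANALYZE overlay_polygon;
VACUUM FULL ANALYZE point;

-- Equivalent parenthesized form, shown for one table.
VACUUM (FULL, ANALYZE) overlay_polygon;
```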
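Perhaps I simply can't read my EXPLAIN ANALYZEs, but I don't understand why these queries have such different run times. Can anyone identify anything? The only possibility I can think of is related to the LEFT JOIN.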
Edit 1
At @ChrisTravers's suggestion, I increased work_mem and reran the first query. I don't believe this represents a significant improvement.
Executed
SET work_mem='4MB';
(It was 1 MB.)
Then executing the first query gave these results.
GroupAggregate (cost=544.45..545.12 rows=38 width=8049) (actual time=339910.046..495775.478 rows=38 loops=1)
  Buffers: shared hit=58552 read=17261, temp read=112133 written=112133
  -> Sort (cost=544.45..544.55 rows=38 width=8049) (actual time=325391.923..491329.208 rows=415224 loops=1)
        Sort Key: op.overlay_polygon_id
        Sort Method: external merge Disk: 896904kB
        Buffers: shared hit=58552 read=17261, temp read=112133 written=112133
        -> Nested Loop Left Join (cost=0.00..543.46 rows=38 width=8049) (actual time=14.698..234266.573 rows=415224 loops=1)
              Buffers: shared hit=58552 read=17261
              -> Seq Scan on overlay_polygon op (cost=0.00..11.38 rows=38 width=8045) (actual time=14.612..15.384 rows=38 loops=1)
                    Buffers: shared read=11
              -> Index Scan using spix_point on point (cost=0.00..13.99 rows=1 width=200) (actual time=95.262..5451.636 rows=10927 loops=38)
                    Index Cond: (op.geom && geom)
                    Filter: ((op.some_other_polygon_id = some_other_polygon_id) AND _st_intersects(op.geom, geom))
                    Rows Removed by Filter: 13353
                    Buffers: shared hit=58552 read=17250
Total runtime: 496936.775 ms
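The plan above still reports "Sort Method: external merge Disk: 896904kB", so the sort did not fit in 4 MB of work_mem. A minimal sketch of how the setting can be inspected, changed for the current session only, and restored follows. The '1GB' value is purely an illustrative assumption prompted by the on-disk sort size; an in-memory sort may need more than the on-disk figure, and whether the server can spare that much memory is not known from the post.

```sql
-- Show the current value for this session.
SHOW work_mem;

-- Raise it for this session only. '1GB' is an illustrative assumption,
-- not a recommendation; it must fit the server's available memory.
SET work_mem = '1GB';

-- ... rerun the EXPLAIN (ANALYZE, BUFFERS) of the first query here ...

-- Return to the configured default.
RESET work_mem;
```

If the setting is large enough, the Sort node should report an in-memory method such as quicksort instead of external merge.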
Edit 2
Well, here's a nice, big smell that I didn't notice before (mostly because I have trouble reading the ANALYZE output). Sorry I didn't notice it sooner.
Sort (cost=544.45..544.55 rows=38 width=8049) (actual time=271754.952..534154.573 rows=415224 loops=1)
Estimated rows: 38. Actual rows: over 400K. Ideas, anyone?