json - Are postgres JSON indexes efficient enough compared with classic normalized tables?

Question

Current PostgreSQL versions have introduced various features for JSON content, but I'm not sure whether I should really use them. There is no established "best practice" yet on what works and what doesn't, or at least I can't find it.

I have a specific example: a table of objects which, among other things, contains a list of alternate names for each object. All that data will also be included in a JSON column for retrieval purposes. For example (skipping all the other irrelevant fields):

create table stuff (id serial primary key, data json);
insert into stuff(data) values('{"AltNames":["Name1","Name2","Name3"]}');

I will need some queries in the form "list all objects where one of altnames is 'foobar'." The expected table size is on the order of a few million records. Postgres JSON queries can be used for that, and it can also be indexed (Index for finding an element in a JSON array, for example). However, SHOULD it be done that way or is it a perverse workaround that's not recommended?

The classic alternative, of course, is to add an additional table for that one-to-many relation, containing the name and a foreign key to the main table; the performance of that is well understood. However, that has its own disadvantages: it means either duplicating data between that table and the JSON (with a possible integrity risk), or building the JSON return data dynamically on every request, which has its own performance penalty.

score 27 · Accepted Answer

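I will need some queries in the form "list all objects where one of altnames is 'foobar'." The expected table size is on the order of a few million records. Postgres JSON queries can be used for that, and it can also be indexed (Index for finding an element in a JSON array, for example). However, SHOULD it be done that way or is it a perverse workaround that's not recommended?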

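It can be done that way, but that doesn't mean that you should. In some sense, the best practice is already well documented (see e.g. using hstore vs. using XML vs. using EAV vs. using a separate table), and the new datatype is, for all intents and practical purposes (besides validation and syntax), no different from the earlier unstructured or semi-structured options.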

Put another way, it's the same old pig with new makeup.

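JSON offers the ability to use inverted search tree indexes, in the same way as hstore, array types and tsvectors do. They work fine, but keep in mind that they're primarily designed for extracting points in a neighborhood (think geometry types) ordered by distance, rather than for extracting a list of values in lexicographical order.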
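As a rough illustration of such an inverted (GIN) index applied to the question's data, here is a minimal sketch. It assumes PostgreSQL 9.4 or later and a `jsonb` column rather than the `json` column in the question; the table name `stuff_jsonb` and the index name are hypothetical.

```sql
-- Assumption: PostgreSQL 9.4+ with the jsonb type; names here are illustrative only.
create table stuff_jsonb (id serial primary key, data jsonb);
insert into stuff_jsonb (data) values ('{"AltNames":["Name1","Name2","Name3"]}');

-- Expression GIN index covering only the AltNames array.
create index ix_stuff_jsonb_altnames
    on stuff_jsonb using gin ((data -> 'AltNames'));

-- Containment test: rows whose AltNames array contains "Name2".
-- The expression must match the indexed one for the index to be usable.
select id
from stuff_jsonb
where (data -> 'AltNames') @> '["Name2"]';
```

Whether the planner actually picks the index depends on table size and statistics, so it is worth checking with `explain` on realistic data.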

To illustrate, take the two plans that Roman's answer outlines:

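The one that does an index scan goes through disk pages directly, retrieving the rows in the order indicated by the index.
The one that does a bitmap index scan begins by identifying every disk page that might contain a matching row, and then reads those pages in the order they appear on disk, as if it were (and in fact, precisely like) doing a sequential scan that skips useless areas.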

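Getting back to your question: cluttered and oversized inverted tree indexes will indeed improve the performance of your app if you use Postgres tables as giant JSON stores. But they're not a silver bullet either, and they won't get you as far as proper relational design when dealing with bottlenecks.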

The bottom line, in the end, is no different from what you'd get when deciding whether to use hstore or an EAV:

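If it needs an index (i.e. it frequently appears in a where clause or, even more importantly, in a join clause), you likely want the data in a separate field.
If it's primarily cosmetic, JSON/hstore/EAV/XML/whatever-makes-you-sleep-at-night works fine.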
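For the first case, the "separate field" option for the question's alternate names could look roughly like the sketch below. It assumes the `stuff` table from the question already exists; the table `stuff_altnames`, its columns, and the index name are hypothetical.

```sql
-- Hypothetical child table for the one-to-many relation described in the question.
create table stuff_altnames (
    stuff_id int  not null references stuff (id) on delete cascade,
    name     text not null,
    primary key (stuff_id, name)
);

-- Plain b-tree index supporting lookups by name.
create index ix_stuff_altnames_name on stuff_altnames (name);

-- "List all objects where one of the alternate names is 'foobar'."
select s.*
from stuff as s
where exists (
    select 1
    from stuff_altnames as a
    where a.stuff_id = s.id
      and a.name = 'foobar'
);
```

The `exists` form returns each object once even if several of its names match, which a plain join would not guarantee.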

score 20 · Accepted Answer

I'd say it's worth a try. I created a test (100000 records, ~10 elements per JSON array) and checked how it performs:

create table test1 (id serial primary key, data json);
create table test1_altnames (id int, name text);

create or replace function array_from_json(_j json)
returns text[] as
$func$
    select array_agg(x.elem::text)
    from json_array_elements(_j) as x(elem)
$func$
language sql immutable;

with cte as (
    select
        (random() * 100000)::int as grp, (random() * 1000000)::int as name
    from generate_series(1, 1000000)
), cte2 as (
    select
        array_agg(Name) as "AltNames"
    from cte
    group by grp
)
insert into test1 (data)
select row_to_json(t)
from cte2 as t;

insert into test1_altnames (id, name)
select id, json_array_elements(data->'AltNames')::text
from test1;

create index ix_test1 on test1 using gin(array_from_json(data->'AltNames'));
create index ix_test1_altnames on test1_altnames (name);

Query against the JSON column (30 ms on my machine):

select * from test1 where '{489147}' <@ array_from_json(data->'AltNames');

"Bitmap Heap Scan on test1  (cost=224.13..1551.41 rows=500 width=36)"
"  Recheck Cond: ('{489147}'::text[] <@ array_from_json((data -> 'AltNames'::text)))"
"  ->  Bitmap Index Scan on ix_test1  (cost=0.00..224.00 rows=500 width=0)"
"        Index Cond: ('{489147}'::text[] <@ array_from_json((data -> 'AltNames'::text)))"

Query against the table with names (15 ms on my machine):

select * from test1 as t where t.id in (select tt.id from test1_altnames as tt where tt.name = '489147');

"Nested Loop  (cost=12.76..20.80 rows=2 width=36)"
"  ->  HashAggregate  (cost=12.46..12.47 rows=1 width=4)"
"        ->  Index Scan using ix_test1_altnames on test1_altnames tt  (cost=0.42..12.46 rows=2 width=4)"
"              Index Cond: (name = '489147'::text)"
"  ->  Index Scan using test1_pkey on test1 t  (cost=0.29..8.31 rows=1 width=36)"
"        Index Cond: (id = tt.id)"

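I should also note that inserting rows into and deleting rows from the table with names (test1_altnames) has some cost of its own, so the comparison is a bit more complicated than just selecting rows. Personally, I like the JSON solution.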
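To make that maintenance cost concrete, one possible way to keep test1_altnames in step with test1 is a row-level trigger. This is only a sketch under assumptions: the function and trigger names are hypothetical, and `json_array_elements_text` requires PostgreSQL 9.4 or later. Note that it returns unquoted text, whereas the `::text` cast used above keeps the JSON quotes for string values; for the numeric test data the results are the same.

```sql
-- Hypothetical trigger function: rebuilds the name rows for the affected object.
create or replace function sync_test1_altnames()
returns trigger as
$func$
begin
    -- Remove the old names when a row is changed or deleted.
    if tg_op in ('UPDATE', 'DELETE') then
        delete from test1_altnames where id = old.id;
    end if;

    -- Add the current names when a row is inserted or changed.
    if tg_op in ('INSERT', 'UPDATE') then
        insert into test1_altnames (id, name)
        select new.id, e.elem
        from json_array_elements_text(new.data->'AltNames') as e(elem);
    end if;

    return null;  -- the return value of an AFTER row trigger is ignored
end
$func$
language plpgsql;

create trigger trg_sync_test1_altnames
after insert or update or delete on test1
for each row execute procedure sync_test1_altnames();
```

Every write to test1 then pays for an extra delete and/or insert plus the index update on test1_altnames, which is the overhead the timing comparison above does not show.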
