
I was playing around with PostgreSQL 9.3 and decided to try out the JSON type. I wrote a query that counts certain JSON data and groups it by user_id, but it took very long to finish. Could somebody give me a hint on how to improve the performance?

But let me show you the test data and the relations first:

I downloaded some test data containing one JSON document per line. Each JSON document consists of an array named "hashtags", an array named "urls" and a (non-unique) field named "user_id".

Test data:

{"hashtags": ["a", "b", "c"],"user_id": 210824047, "urls": ["http://www.google.com"]}
{"hashtags": ["b"], "user_id": 78149438, "urls": ["http://www.facebook.com"]}
{"hashtags": ["a", "c"], "user_id": 123791448, "urls": ["http://www.twitter.com","http://www.facebook.com"]}
...

Then I created a simple relation with an auto-incrementing primary key and a column of type JSON, like this:

CREATE TABLE tweets(id BIGSERIAL PRIMARY KEY, tweet JSON);

I loaded the data into PostgreSQL, so that each JSON document now corresponds to one row:

id | tweet
---+---------------------------------------------------------------------------------
1  |{"hashtags": ["a", "b", "c"],"user_id": 210824047, "urls": ["http://www.google.com"]}
2  |{"hashtags": ["b"], "user_id": 78149438, "urls": ["http://www.facebook.com"]}
3  |{"hashtags": ["a", "c"], "user_id": 123791448, "urls": ["http://www.twitter.com","http://www.facebook.com"]}

Now I wanted to know how often each hashtag was used by each user_id. To do that I had to use a sub-query that calls the json_array_elements() function, which takes a JSON array and expands it into a set of rows, one per array element.

The Query looks like this:

    select foo.uid, foo.tag, count(foo.*)
    from (
        select (tweet::json->>'user_id')::int as uid,
               json_array_elements(tweet::json->'hashtags')::text as tag
        from tweets
    ) as foo
    group by foo.uid, foo.tag;
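To make the intended result concrete, here is a minimal Python sketch of what the query is meant to compute, applied to the three sample documents shown above. It only illustrates the expand-then-group logic and says nothing about PostgreSQL performance. The names `SAMPLE_LINES` and `count_hashtags_per_user` are made up for this example.

```python
import json
from collections import Counter

# The three sample documents from the test data above, one JSON document per line.
SAMPLE_LINES = [
    '{"hashtags": ["a", "b", "c"],"user_id": 210824047, "urls": ["http://www.google.com"]}',
    '{"hashtags": ["b"], "user_id": 78149438, "urls": ["http://www.facebook.com"]}',
    '{"hashtags": ["a", "c"], "user_id": 123791448, "urls": ["http://www.twitter.com","http://www.facebook.com"]}',
]


def count_hashtags_per_user(lines):
    """Count how often each (user_id, hashtag) pair occurs.

    Hypothetical helper: the inner loop plays the role of
    json_array_elements() (one entry per array element), and the
    Counter plays the role of GROUP BY uid, tag with count().
    Note: the SQL ::text cast keeps the JSON quotes around each tag,
    whereas this sketch uses the bare strings.
    """
    counts = Counter()
    for line in lines:
        doc = json.loads(line)
        uid = int(doc["user_id"])
        for tag in doc["hashtags"]:
            counts[(uid, tag)] += 1
    return counts


if __name__ == "__main__":
    for (uid, tag), n in sorted(count_hashtags_per_user(SAMPLE_LINES).items()):
        print(uid, tag, n)
```

Since no user repeats a hashtag in these three documents, every (user_id, hashtag) pair in the sample has a count of 1; repeated pairs only show up in the larger data set.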

I got the result I wanted, but the query took very long to finish. For example, I loaded 200,000 JSON documents into PostgreSQL, which is slightly less than 25 MB, and it took around 10 minutes until I got a result. I also tried various indexes, but the run time only improved by half a minute at best.

Being a PostgreSQL beginner, I wonder: is it really that slow, or did I do something wrong? If it is the latter, what can I do about it?

Your help will be much appreciated.


1 Answer


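JSON is still quite new to PostgreSQL and has not been well optimized yet, but it is improving rapidly.

It looks like this will get much better when 9.4 is released later this year (I get about 2 seconds for your query over 196,608 records).

Actually, this improvement is also in 9.3.3, which should be released within a week.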

answered 2014-02-13T21:17:33.003