I was playing around with PostgreSQL9.3 and decided to try out the JSON type. So I wrote a Query which counts certain JSON Data and groups them by a user_id, but the process took very long to finish. Could somebody give me a hint to improve the performance?
But let me show you the test data and the relations first:
I downloaded some test data which contained one JSON Document per line. The JSON Document was composed out of one array named "hashtags", one array named "urls" and one field with a field named "user_id" (non-unique).
Test data:
{"hashtags": ["a", "b", "c"],"user_id": 210824047, "urls": ["http://www.google.com"]}
{"hashtags": ["b"], "user_id": 78149438, "urls": ["http://www.facebook.com"]}
{"hashtags": ["a", "c"], "user_id": 123791448, "urls": ["http://www.twitter.com","http://www.facebook.com"]}
...
Then I created a simple relation with an auto-incrementing primary key and a column which contains an JSON type like this:
CREATE TABLE tweets(id BIGSERIAL PRIMARY KEY, tweet JSON);
I loaded the data into postgresql where one JSON Document now equals one row:
id | tweet
---+---------------------------------------------------------------------------------
1 |{"hashtags": ["a", "b", "c"],"user_id": 210824047, "urls": ["http://www.google.com"]}
2 |{"hashtags": ["b"], "user_id": 78149438, "urls": ["http://www.facebook.com"]}
3 |{"hashtags": ["a", "c"], "user_id": 123791448, "urls": ["http://www.twitter.com","http://www.facebook.com"]}
Now I wanted to know how often a hastag was used by a certain user_id. To do that I had to use a sub-query which used the json_array_elements() function. The function takes a specified JSON array and turns it into a column.
The Query looks like this:
select foo.uid, foo.tag, count(foo.*) from
(select (tweet::json->>'user_id')::int as uid,
json_array_elements(tweet::json->'hashtags')::text as tag from tweets)
as foo group by foo.uid, foo.tag;
I got the result I wanted, but the query took very long to finish. For example I loaded 200.000 JSON Documents into postgresql, which is slightly less than 25MB, and it took around 10mins until I got a result. I also tried various indexes, but the performance only improved by half a minute at best.
Now being a postgresql beginner I ask myself, if it is really that slow or did I do something wrong? if it is the latter, what can I do against it?
Your help will be much appreciated.