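I was tasked with pulling all the Google Analytics data for our website, analysing it, loading it into a database, and then producing some reports on that data. The problem is that I have ended up with almost 2 million records in a single table. The reports consist of some SUM() and AVG() queries which, as you can imagine, take too long in some cases: the worst case, with no date filter set and broad criteria (depending on the type of report), takes 8-10 minutes. Given that users will have access to these reports, that much time is unacceptable... Currently the project uses Postgres. I am well aware that no RDBMS will process this kind of data in under 10 seconds, especially running on a single machine.
The question is: what would be the best choice of software and architecture/technology to get good results?
I tried MongoDB, but given that it is single-threaded (for now at least), map/reduce really does not run any faster on a single machine.
I know I am probably looking at a parallel system, but I am still not sure... I feel comfortable using MongoDB, and I have read that it has improved a lot with regard to clustering and using multiple nodes, but I would love for someone to share other opinions and experience on the most efficient way to go, not only in terms of computing power but also in terms of price.
Thanks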
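P.S. In response to some of the answers saying that Postgres should handle this amount of data, I am posting some more details:
Table structure: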
-- Table: ga_content_tracking
-- DROP TABLE ga_content_tracking;
CREATE TABLE ga_content_tracking
(
  id serial NOT NULL,
  country character varying(255),
  city character varying(255),
  page_title character varying(255),
  page_path character varying(255),
  source character varying(255),
  referral_path character varying(255),
  date date NOT NULL,
  visits integer,
  pageviews integer,
  avgtime_on_site double precision,
  yacht_id integer,
  charter_listing boolean DEFAULT false,
  sales_listing boolean DEFAULT false,
  directory_listing boolean DEFAULT false,
  news_related boolean DEFAULT false,
  visitor_type character varying(30),
  organisation_id integer
)
WITH OIDS;
ALTER TABLE ga_content_tracking OWNER TO postgres;
-- Index: ga_content_tracking_charter_listing
-- DROP INDEX ga_content_tracking_charter_listing;
CREATE INDEX ga_content_tracking_charter_listing
ON ga_content_tracking
USING btree
(charter_listing);
-- Index: ga_content_tracking_country
-- DROP INDEX ga_content_tracking_country;
CREATE INDEX ga_content_tracking_country
ON ga_content_tracking
USING btree
(country);
-- Index: ga_content_tracking_dates
-- DROP INDEX ga_content_tracking_dates;
CREATE INDEX ga_content_tracking_dates
ON ga_content_tracking
USING btree
(date);
-- Index: ga_content_tracking_directory_listing
-- DROP INDEX ga_content_tracking_directory_listing;
CREATE INDEX ga_content_tracking_directory_listing
ON ga_content_tracking
USING btree
(directory_listing);
-- Index: ga_content_tracking_news_related
-- DROP INDEX ga_content_tracking_news_related;
CREATE INDEX ga_content_tracking_news_related
ON ga_content_tracking
USING btree
(news_related);
-- Index: ga_content_tracking_organisation_id
-- DROP INDEX ga_content_tracking_organisation_id;
CREATE INDEX ga_content_tracking_organisation_id
ON ga_content_tracking
USING btree
(organisation_id);
-- Index: ga_content_tracking_sales_listing
-- DROP INDEX ga_content_tracking_sales_listing;
CREATE INDEX ga_content_tracking_sales_listing
ON ga_content_tracking
USING btree
(sales_listing);
-- Index: ga_content_tracking_visitor_type
-- DROP INDEX ga_content_tracking_visitor_type;
CREATE INDEX ga_content_tracking_visitor_type
ON ga_content_tracking
USING btree
(visitor_type);
-- Index: ga_content_tracking_yacht_id
-- DROP INDEX ga_content_tracking_yacht_id;
CREATE INDEX ga_content_tracking_yacht_id
ON ga_content_tracking
USING btree
(yacht_id);
Sample query:
superyachts=# SELECT SUM( pageviews ) as cnt, SUM( visits ) as cnt1, AVG( avgtime_on_site ) as avg1 FROM ga_content_tracking gact WHERE TRUE AND ( yacht_id IN ( 7727, 7600, 2352, 7735, 7735, 3739, 7620, 7631, 7633, 7775, 3688, 7779, 3930, 2884, 2724, 2547, 3465, 2324, 4981, 2369, 7820, 4772, 7802, 7249, 4364, 7623, 7803, 7804, 7805, 7957, 7728, 7728, 7252, 8044, 8067, 8016, 8017, 8019, 2726, 2726, 2233, 4549, 6021, 8286, 4773, 8326, 8312, 4881, 8349, 2223, 4387, 2697, 6224, 5947, 4967, 3031, 7494, 7497, 3833, 6594, 6608, 3587, 6599, 3160, 4934, 3122, 4895, 3505, 4980, 8360, 2311, 4885, 2660, 5260, 2362, 2783, 2992, 3286, 3434, 4849, 4117, 2179, 5002, 2187, 5006, 2266, 4900, 4069, 6219, 2951, 3317, 3474, 6218, 3847, 4745, 6480, 4498, 6094, 6312, 6092, 7221, 7220, 2257, 4883, 6518, 2597, 4902, 2638, 2690, 4872, 2790, 6221, 2881, 2887, 3082, 3131, 3141, 3166, 3166, 4887, 4979, 3295, 4886, 4875, 6516, 5005, 3400, 3401, 4990, 3489, 5007, 4882, 4899, 5116, 4082, 4970, 3760, 2202, 2411, 2605, 6291, 6513, 7176, 3029, 8412, 2064, 7826, 4876, 4949, 3626, 5587, 8554, 2837, 5086, 5118, 2058, 4484, 4041, 2953, 8136, 2490, 3291, 4991, 2611, 3590 ) OR organisation_id = 1 ) ;
cnt | cnt1 | avg1
--------+-------+-----------------
640826 | 46418 | 34.408638690454
(1 row)
Time: 114652.334 ms
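
For anyone who wants to experiment with the report query away from the production database, its shape can be reproduced on a handful of made-up rows. This is only a minimal sketch under several assumptions: it uses Python's built-in sqlite3 module instead of Postgres so that it runs anywhere, it keeps only the columns the sample query touches, and the helper name report_totals, its optional date_from/date_to parameters, and the toy rows are all hypothetical and not part of the real project.

```python
import sqlite3


def report_totals(conn, yacht_ids, organisation_id, date_from=None, date_to=None):
    """Run the SUM/AVG report query; the date range is optional.

    Hypothetical helper: assumes yacht_ids is a non-empty list of integers.
    """
    placeholders = ", ".join("?" for _ in yacht_ids)
    sql = (
        "SELECT SUM(pageviews), SUM(visits), AVG(avgtime_on_site) "
        "FROM ga_content_tracking "
        f"WHERE (yacht_id IN ({placeholders}) OR organisation_id = ?)"
    )
    params = list(yacht_ids) + [organisation_id]
    # With no date filter the query has to consider every matching row,
    # which is the worst case described in the question.
    if date_from is not None:
        sql += " AND date >= ?"
        params.append(date_from)
    if date_to is not None:
        sql += " AND date <= ?"
        params.append(date_to)
    return conn.execute(sql, params).fetchone()


# Toy stand-in for ga_content_tracking (only the columns the query uses).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE ga_content_tracking (
           id INTEGER PRIMARY KEY,
           date TEXT NOT NULL,
           visits INTEGER,
           pageviews INTEGER,
           avgtime_on_site REAL,
           yacht_id INTEGER,
           organisation_id INTEGER)"""
)
rows = [  # made-up data, for illustration only
    ("2011-01-01", 2, 10, 30.0, 7727, 5),
    ("2011-01-02", 1, 4, 10.0, 7600, 5),
    ("2011-01-03", 3, 6, 20.0, 9999, 1),   # matches via organisation_id = 1
    ("2011-01-04", 5, 50, 99.0, 9999, 2),  # matches neither condition
]
conn.executemany(
    "INSERT INTO ga_content_tracking "
    "(date, visits, pageviews, avgtime_on_site, yacht_id, organisation_id) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)

print(report_totals(conn, [7727, 7600], 1))                # (20, 6, 20.0)
print(report_totals(conn, [7727, 7600], 1, "2011-01-02"))  # (10, 4, 15.0)
```

The IN list is passed as bound parameters rather than pasted into the SQL text, so the same helper can be reused for lists of any length; timings from this toy setup say nothing about how Postgres behaves on the real 2-million-row table.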