The situation is as follows:

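I have a SaaS application that is a simple RSS feed reader. I think most people know what that is: users subscribe to RSS feeds and then read items from them. Nothing new. A feed can have multiple subscribers.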

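I have implemented some statistics for users, but I don't think I chose the right approach, because things are getting slower as the number of users and feeds grows.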

Here is what I'm doing now:

  1. Every hour, get the total number of articles for each feed:

    SELECT COUNT(*) FROM articles WHERE feed_id=?
    
  2. Get the previous value to calculate the delta (this is a bit slow):

    SELECT value FROM feeds_stats WHERE feed_id=? AND name='total_articles' ORDER BY date DESC LIMIT 1
    
  3. Insert the new value and the delta:

    INSERT INTO feeds_stats (date,feed_id,name,value,delta) VALUES ('".date("Y-m-d H:i:s",$global_timestamp)."','".$feed_id."','total_articles','".$value."','".($value-$old_value)."')
    
  4. For each user, get their feeds, and for each feed get the number of articles they have read:

    SELECT COUNT(*) FROM users_articles ua JOIN articles a ON a.id=ua.article_id WHERE a.feed_id='%s' AND ua.user_id='%s' AND ua.read=1
    

users_articles is a table that holds the read state of each article for each user.

  5. Then fetch the previous value again to calculate the delta:

    SELECT value FROM users_feeds_stats WHERE user_id='?' AND feed_id='?' AND name='total_reads' ORDER BY date DESC LIMIT 1
    
  6. And insert the new value and the delta:

    INSERT INTO users_feeds_stats (date,user_id,feed_id,name,value,delta) VALUES ('".date("Y-m-d H:i:s",$global_timestamp)."','".$user_id."','".$feed_id."','total_reads','".$value."','".($value-$old_value)."')
    

After all of a user's feeds have been processed, the aggregation part begins.

This is a bit trickier, and I think there should be a lot of room for optimization here. Here is the actual aggregation function in PHP:

<?php

function aggregate_user_stats($user_id=false,$feed_id=false){
    global $global_timestamp;
    // defined dimensions
    $feed_types[0] = array("days_back" => 31, "group_by" => "DATE_FORMAT(date, '%Y-%m-%d')");
    $feed_types[1] = array("days_back" => 31, "group_by" => "WEEKDAY(date)+1");
    $feed_types[2] = array("days_back" => 31, "group_by" => "HOUR(date)");

    // Default to no filter so $where is always defined.
    $where = "";
    if($user_id){
        $where = " WHERE id=".(int)$user_id;
    }

    $feed_where = "";
    $getusers = mysql_query("SELECT id FROM users".$where)or die(__LINE__." ".mysql_error());
    while($user = mysql_fetch_assoc($getusers)){
        if($feed_id){
            $feed_where = " AND feed_id=".$feed_id;
        }

        $user_feeds = array();
        $getfeeds = mysql_query("SELECT feed_id FROM subscriptions WHERE user_id='".$user["id"]."' AND active=1".$feed_where)or die(__LINE__." ".mysql_error());
        while($row = mysql_fetch_assoc($getfeeds)){
            foreach($feed_types as $tab => $type){
                $getdata = mysql_query("
                SELECT ".$type["group_by"]." AS date, name, SUM(delta) AS delta FROM feeds_stats WHERE feed_id = '".$row["feed_id"]."' AND name='total_articles' AND date > DATE_SUB(NOW(), INTERVAL ".$type["days_back"]." DAY) GROUP BY name, ".$type["group_by"]." 
                UNION 
                SELECT ".$type["group_by"]." AS date, name, SUM(delta) AS delta FROM users_feeds_stats WHERE user_id = '".$user["id"]."' AND feed_id = '".$row["feed_id"]."' AND name='total_reads' AND date > DATE_SUB(NOW(), INTERVAL ".$type["days_back"]." DAY) GROUP BY name, ".$type["group_by"]."
                ")or die(__LINE__." ".mysql_error());
                $data = array();
                // Use a separate variable so the outer $row (the feed) is not overwritten.
                while($stat = mysql_fetch_assoc($getdata)){
                    $data[$stat["date"]][$stat["name"]] = $stat["delta"];
                }
                if(count($data)){
                    db_start_trx();
                    mysql_query("DELETE FROM stats_feeds_over_time WHERE feed_id='".$row["feed_id"]."' AND user_id='".$user["id"]."' AND tab='".$tab."'")or die(__LINE__." ".mysql_error());
                    foreach($data as $time => $keys){
                        mysql_query("REPLACE INTO stats_feeds_over_time (feed_id,user_id,tab,date,total_articles,total_reads,total_favs) VALUES ('".$row["feed_id"]."','".$user["id"]."','".$tab."','".$time."','".$keys["total_articles"]."','".$keys["total_reads"]."','".$keys["total_favs"]."')")or die(__LINE__." ".mysql_error());
                    }
                    db_commit_trx();
                }
            }
        }
    }
}

Some notes:

Edit: here is the DDL of the tables involved:

CREATE TABLE `articles` (
  `id` INTEGER(11) UNSIGNED NOT NULL AUTO_INCREMENT,
  `feed_id` INTEGER(11) UNSIGNED NOT NULL,
  `date` INTEGER(10) UNSIGNED NOT NULL,
  `date_updated` INTEGER(11) UNSIGNED NOT NULL,
  `title` VARCHAR(1000) COLLATE utf8_general_ci NOT NULL DEFAULT '',
  `url` VARCHAR(2000) COLLATE utf8_general_ci NOT NULL DEFAULT '',
  `author` VARCHAR(200) COLLATE utf8_general_ci NOT NULL DEFAULT '',
  `hash` CHAR(32) COLLATE utf8_general_ci NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  UNIQUE KEY `feed_id_hash` (`feed_id`, `hash`),
  KEY `date` (`date`),
  KEY `url` (`url`(255))
)ENGINE=InnoDB
AUTO_INCREMENT=0
CHARACTER SET 'utf8' COLLATE 'utf8_general_ci'
COMMENT='';


CREATE TABLE `users_articles` (
  `id` BIGINT(20) NOT NULL AUTO_INCREMENT,
  `user_id` INTEGER(11) UNSIGNED NOT NULL,
  `article_id` INTEGER(11) UNSIGNED NOT NULL,
  `subscription_id` INTEGER(11) UNSIGNED NOT NULL,
  `read` TINYINT(4) UNSIGNED NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  UNIQUE KEY `user_id` (`user_id`, `article_id`),
  KEY `article_id` (`article_id`),
  KEY `subscription_id` (`subscription_id`)
)ENGINE=InnoDB
CHECKSUM=1 AUTO_INCREMENT=0
CHARACTER SET 'utf8' COLLATE 'utf8_general_ci'
COMMENT='';


CREATE TABLE `feeds_stats` (
  `id` INTEGER(11) UNSIGNED NOT NULL AUTO_INCREMENT,
  `feed_id` INTEGER(11) UNSIGNED NOT NULL,
  `date` DATETIME NOT NULL,
  `name` VARCHAR(50) COLLATE utf8_general_ci NOT NULL DEFAULT '',
  `value` INTEGER(11) NOT NULL,
  `delta` INTEGER(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `name` (`name`),
  KEY `feed_id` (`feed_id`),
  KEY `date` (`date`)
)ENGINE=InnoDB
AUTO_INCREMENT=0
CHARACTER SET 'utf8' COLLATE 'utf8_general_ci'
COMMENT='';


CREATE TABLE `users_feeds_stats` (
  `id` INTEGER(11) UNSIGNED NOT NULL AUTO_INCREMENT,
  `user_id` INTEGER(11) UNSIGNED NOT NULL DEFAULT '0',
  `feed_id` INTEGER(11) UNSIGNED NOT NULL,
  `date` DATETIME NOT NULL,
  `name` VARCHAR(50) COLLATE utf8_general_ci NOT NULL DEFAULT '',
  `value` INTEGER(11) NOT NULL,
  `delta` INTEGER(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `name` (`name`),
  KEY `feed_id` (`feed_id`),
  KEY `user_id` (`user_id`),
  KEY `date` (`date`)
)ENGINE=InnoDB
AUTO_INCREMENT=0
CHARACTER SET 'utf8' COLLATE 'utf8_general_ci'
COMMENT='';

CREATE TABLE `stats_feeds_over_time` (
  `feed_id` INTEGER(11) UNSIGNED NOT NULL,
  `user_id` INTEGER(11) NOT NULL,
  `tab` INTEGER(11) NOT NULL,
  `date` VARCHAR(30) COLLATE utf8_general_ci NOT NULL DEFAULT '',
  `total_articles` DOUBLE(9,2) UNSIGNED NOT NULL,
  `total_reads` DOUBLE(9,2) UNSIGNED NOT NULL,
  `total_favs` DOUBLE(9,2) UNSIGNED NOT NULL,
  PRIMARY KEY (`feed_id`, `user_id`, `tab`, `date`)
)ENGINE=InnoDB
AUTO_INCREMENT=0 
CHARACTER SET 'utf8' COLLATE 'utf8_general_ci'
COMMENT='';

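At the end of the aggregation function there is a REPLACE into the table stats_feeds_over_time. That table contains only the records that will be displayed on the charts, so the actual charting does not involve heavy queries.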

Finally, here are the resulting charts:

[Three chart images]

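I would be glad if someone could point me in the right direction on where and how to optimize this solution, even if that means dropping MySQL for the statistics.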

I have long experience with RRDTool, but the situation here is different because of the "time of day" and "day of week" aggregations.

1 Answer

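I don't know how important the queries you want to optimize are relative to the other queries you may run on the same set of tables. I'll assume you want to optimize these queries first.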

Seeing that all of the queries use feed_id as a WHERE predicate, I would try partitioning the articles table on that column:

CREATE TABLE `articles` (
  `id` INTEGER(11) UNSIGNED NOT NULL AUTO_INCREMENT,
  `feed_id` INTEGER(11) UNSIGNED NOT NULL,
  -- etc.
)ENGINE=InnoDB
AUTO_INCREMENT=0
CHARACTER SET 'utf8' COLLATE 'utf8_general_ci'
COMMENT=''
PARTITION BY KEY(feed_id)
PARTITIONS 10;

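The number of partitions (10 above) can be adjusted to your needs, but it must be greater than 1 to have any effect. You may want to use a larger number to make your SELECT queries faster. However, any query that does not filter on feed_id will be slowed down by this setup.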
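
One caveat worth checking before trying this: MySQL requires every unique key on a partitioned table, including the primary key, to contain all columns used in the partitioning expression. With the DDL from the question, where the primary key of articles is just id, partitioning on feed_id would be rejected. A minimal, untested sketch of one way around this, assuming a composite primary key is acceptable for your application, might look like this:

```sql
-- Sketch only: assumes widening the primary key to (id, feed_id) is acceptable.
-- The existing unique key feed_id_hash (feed_id, hash) already contains feed_id.
ALTER TABLE articles
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, feed_id);

ALTER TABLE articles
  PARTITION BY KEY (feed_id)
  PARTITIONS 10;

-- Check that a per-feed query only touches one partition.
-- 42 is a placeholder feed id. On MySQL 5.6 and earlier use EXPLAIN PARTITIONS;
-- later versions list the partitions in plain EXPLAIN output.
EXPLAIN SELECT COUNT(*) FROM articles WHERE feed_id = 42;
```

Both ALTER statements rebuild the table, so on a large articles table they are probably best run during a maintenance window.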

The same process can be applied to the other tables, where a column is frequently used as a discriminant in queries.

However, since the first two queries are executed for all feeds, you could rewrite them as follows:

SELECT feed_id, COUNT(feed_id) 
FROM articles
GROUP BY feed_id

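-- Latest value per feed: join each feed to its most recent row.
SELECT fs.feed_id, fs.value
FROM feeds_stats fs
JOIN (
    SELECT feed_id, MAX(date) AS max_date
    FROM feeds_stats
    WHERE name='total_articles'
    GROUP BY feed_id
) latest ON latest.feed_id = fs.feed_id AND latest.max_date = fs.date
WHERE fs.name='total_articles'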

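Both of these retrieve the results for all feeds, which saves you from running a query per feed. Using these queries makes partitioning counterproductive, so you have to choose between the two.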
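
Taking this one step further, the all-feeds count, the lookup of the previous value, and the insert from step 3 of the question could probably be folded into a single statement. The following is an untested sketch under a few assumptions that are not in the question: NOW() stands in for the PHP $global_timestamp, a feed with no earlier row gets a delta equal to its full count, and no feed has two total_articles rows with the same date.

```sql
-- Sketch: steps 1-3 of the question for all feeds in one statement.
INSERT INTO feeds_stats (date, feed_id, name, value, delta)
SELECT NOW(),                              -- assumption: replaces $global_timestamp
       c.feed_id,
       'total_articles',
       c.total,
       c.total - COALESCE(p.value, 0)      -- assumption: first run counts from 0
FROM (
    -- Current article count per feed.
    SELECT feed_id, COUNT(*) AS total
    FROM articles
    GROUP BY feed_id
) AS c
LEFT JOIN (
    -- Most recent stored value per feed.
    SELECT fs.feed_id, fs.value
    FROM feeds_stats AS fs
    JOIN (
        SELECT feed_id, MAX(date) AS max_date
        FROM feeds_stats
        WHERE name = 'total_articles'
        GROUP BY feed_id
    ) AS m ON m.feed_id = fs.feed_id AND m.max_date = fs.date
    WHERE fs.name = 'total_articles'
) AS p ON p.feed_id = c.feed_id;
```

This trades one round trip per feed for a single larger statement, so whether it helps depends on how many feeds you have and on the indexes available on feeds_stats.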

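The upside of partitioning: any query that selects on one specific value of feed_id (or whichever other column is used for partitioning) will get a significant boost. The downside is that general queries will become slower.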

The upside of the second solution is that it has no impact on other queries.
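
The same set-based idea might also apply to the per-user read counts from step 4 of the question. As an untested sketch, a single grouped query could return the count for every user and feed pair at once, instead of one COUNT(*) per user and per feed:

```sql
-- Sketch: read counts for every (user, feed) pair in one pass.
-- Pairs with zero read articles do not appear in the result, and
-- inactive subscriptions are not filtered out here.
SELECT ua.user_id,
       a.feed_id,
       COUNT(*) AS total_reads
FROM users_articles AS ua
JOIN articles AS a ON a.id = ua.article_id
WHERE ua.`read` = 1
GROUP BY ua.user_id, a.feed_id;
```

The backticks around read are there only because READ is a keyword in MySQL; the column name itself comes from the DDL in the question.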

answered 2013-04-19T22:27:34.607