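Please consider the following tables.

The users table holds several thousand Twitter users; their tweets are stored in the tweets table and indexed by sp100_id, which is the id of the company the tweet talks about (see sp100). tweets.class holds the sentiment class assigned to each tweet (1 = neutral, 2 = positive, 3 = negative). tweets.rt holds the number of times the tweet has been retweeted. Finally, every user is given a quality score and a follow score, as follows: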

users                       tweets
-------------------------   -----------------------------------------------
user_id quality follow      tweet_id sp100_id nyse_date   user_id class  rt
-------------------------   -----------------------------------------------
1       2.50    5.00        1        1        2011-03-12  1       1      0
2       0.75    1.00        2        1        2011-03-13  1       2      2
                            3        1        2011-03-13  1       2      1
daterange                   4        1        2011-03-13  2       2      0
----------------            5        1        2011-03-13  2       3      3
_date                       6        2        2011-03-12  2       2      3
----------------            7        2        2011-03-12  2       2      0
2011-03-11                  8        2        2011-03-12  1       3      5
2011-03-12                  9        2        2011-03-13  2       2      0
2011-03-13

sp100
----------------
sp100_id  _name
----------------
1         Alcoa
2         Apple

The desired output is a list, per sp100_id and per _date, of the number of positive (class=2) and negative (class=3) tweets, weighted by rt, quality and follow:

sp100_id  nyse_date  pos-rt pos-quality pos-follow neg-rt neg-quality neg-follow
--------------------------------------------------------------------------------
1         2011-03-11 0      0           0          0      0           0
1         2011-03-12 0      0           0          0      0           0
1         2011-03-13 5 (1)  5.75 (2)    11.00 (3)  3 (4)  0.75 (5)    1.00 (6)
2         2011-03-11 0      0           0          0      0           0
2         2011-03-12 3 (7)  5.00 (8)    10.00 (9)  5.00   2.50        2.50
2         2011-03-13 0      0.75        1.00       0      0           0
--------------------------------------------------------------------------------

(1) On 2011-03-13, 3 positive tweets for sp100_id 1. 1 tweet retweeted 2 times,
    1 tweet retweeted 1 time and 1 tweet retweeted 0 times = 2x2+1x1+1x0 = 5
(2) On 2011-03-13, 2 positive tweets made by user 1, who has quality 2.50 and
    1 positive tweet made by user 2, who has quality 0.75 = 2x2.50+1x0.75 = 5.75
(3) On 2011-03-13, 2 positive tweets made by user 1, who has follow 5.00 and
    1 positive tweet made by user 2, who has follow 1 = 2x5.00+1x1.00 = 11.00
(4) On 2011-03-13, 1 negative tweet made by user 2, retweeted 3 times = 1x3 = 3
(5) On 2011-03-13, 1 negative tweet made by user 2, who has quality 0.75, thus
    1x0.75 = 0.75
(6) On 2011-03-13, 1 negative tweet made by user 2, who has follow 1.00 so
    1x1.00 = 1.00
(7) 1 positive tweet which has been retweeted 3 times, 1 positive tweet without
    any retweets = 1x3+1x0 = 3
(8) 2 positive tweets from user 2 x quality 2.50 = 5.00
(9) 2 positive tweets x follow 5 = 10.00

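I have tried to explain this as clearly as I can. Can anyone help me build the correct query? As you can see, dates without tweets (all values zero) also need to be included in the result set. This is what I have so far, but I am having trouble completing the rest: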

SELECT
    s.sp100_id,
    d._date,
    COALESCE(c.pos-rt,0)      AS pos-rt,
    COALESCE(c.pos-quality,0) AS pos-quality,
    COALESCE(c.pos-follow,0)  AS pos-follow,
    COALESCE(c.neg-rt,0)      AS neg-rt,
    COALESCE(c.neg-quality,0) AS neg-quality,
    COALESCE(c.neg-follow,0)  AS neg-follow
FROM sp100 s
CROSS JOIN daterange d
LEFT JOIN (
    SELECT 
        sp100_id,
        nyse_date, 
        COUNT(CASE class WHEN 2 THEN 1 END) * [rt]      AS pos-rt,
        COUNT(CASE class WHEN 2 THEN 1 END) * [quality] AS pos-quality,
        COUNT(CASE class WHEN 2 THEN 1 END) * [follow]  AS pos-follow,
        COUNT(CASE class WHEN 3 THEN 1 END) * [rt]      AS neg-rt,
        COUNT(CASE class WHEN 3 THEN 1 END) * [quality] AS neg-quality,
        COUNT(CASE class WHEN 3 THEN 1 END) * [follow]  AS neg-follow
    FROM tweets 
    GROUP BY sp100_id, nyse_date
) c ON s.sp100_id = c.sp100_id AND d._date = c.nyse_date
ORDER BY s.sp100_id, d._date ASC

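Obviously, [rt], [quality] and [follow] need to be replaced with the correct syntax. I am also unsure about COUNT(...): it currently counts the number of tweets first, whereas it should take each tweet separately and multiply it by its own retweet count (rt).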

Can someone help me?

1 Answer

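Assuming I have understood the question correctly (see my comment above), you just need to group the joined tables and SUM() the relevant fields where the tweet belongs to the desired class, which can be determined using IF():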

SELECT      sp100.sp100_id                            AS `sp100_id`,
            daterange._date                           AS `nyse_date`,
            SUM(IF(tweets.class=2, tweets.rt,     0)) AS `pos-rt`,
            SUM(IF(tweets.class=2, users.quality, 0)) AS `pos-quality`,
            SUM(IF(tweets.class=2, users.follow,  0)) AS `pos-follow`,
            SUM(IF(tweets.class=3, tweets.rt,     0)) AS `neg-rt`,
            SUM(IF(tweets.class=3, users.quality, 0)) AS `neg-quality`,
            SUM(IF(tweets.class=3, users.follow,  0)) AS `neg-follow`       
FROM        sp100
       JOIN daterange
  LEFT JOIN tweets ON tweets.nyse_date = daterange._date
                  AND tweets.sp100_id  = sp100.sp100_id
  LEFT JOIN users  ON tweets.user_id   = users.user_id
GROUP BY    sp100.sp100_id, daterange._date

See it on sqlfiddle.
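
For anyone who wants to try the idea locally, here is a minimal sketch that loads the sample rows from the question into an in-memory database and runs an equivalent query. It makes several assumptions that are not part of the original: it uses Python's sqlite3 module in place of MySQL, it writes the condition as CASE WHEN ... ELSE 0 END instead of IF(), it uses underscored aliases (pos_rt and so on) rather than the hyphenated ones, and the helper names build_sample_db and weighted_sentiment are made up for illustration.

```python
import sqlite3


def build_sample_db():
    """Create an in-memory SQLite database holding the question's sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users     (user_id INTEGER PRIMARY KEY, quality REAL, follow REAL);
        CREATE TABLE sp100     (sp100_id INTEGER PRIMARY KEY, _name TEXT);
        CREATE TABLE daterange (_date TEXT PRIMARY KEY);
        CREATE TABLE tweets    (tweet_id INTEGER PRIMARY KEY, sp100_id INTEGER,
                                nyse_date TEXT, user_id INTEGER,
                                class INTEGER, rt INTEGER);

        INSERT INTO users     VALUES (1, 2.50, 5.00), (2, 0.75, 1.00);
        INSERT INTO sp100     VALUES (1, 'Alcoa'), (2, 'Apple');
        INSERT INTO daterange VALUES ('2011-03-11'), ('2011-03-12'), ('2011-03-13');
        INSERT INTO tweets    VALUES
            (1, 1, '2011-03-12', 1, 1, 0),
            (2, 1, '2011-03-13', 1, 2, 2),
            (3, 1, '2011-03-13', 1, 2, 1),
            (4, 1, '2011-03-13', 2, 2, 0),
            (5, 1, '2011-03-13', 2, 3, 3),
            (6, 2, '2011-03-12', 2, 2, 3),
            (7, 2, '2011-03-12', 2, 2, 0),
            (8, 2, '2011-03-12', 1, 3, 5),
            (9, 2, '2011-03-13', 2, 2, 0);
    """)
    return conn


# Same shape as the MySQL query above: every (company, date) pair is kept by
# the cross join, tweets and users are left-joined, and each tweet adds its
# own rt / its author's quality / its author's follow to the matching sum.
QUERY = """
    SELECT s.sp100_id,
           d._date AS nyse_date,
           SUM(CASE WHEN t.class = 2 THEN t.rt      ELSE 0 END) AS pos_rt,
           SUM(CASE WHEN t.class = 2 THEN u.quality ELSE 0 END) AS pos_quality,
           SUM(CASE WHEN t.class = 2 THEN u.follow  ELSE 0 END) AS pos_follow,
           SUM(CASE WHEN t.class = 3 THEN t.rt      ELSE 0 END) AS neg_rt,
           SUM(CASE WHEN t.class = 3 THEN u.quality ELSE 0 END) AS neg_quality,
           SUM(CASE WHEN t.class = 3 THEN u.follow  ELSE 0 END) AS neg_follow
    FROM sp100 s
    CROSS JOIN daterange d
    LEFT JOIN tweets t ON t.nyse_date = d._date
                      AND t.sp100_id  = s.sp100_id
    LEFT JOIN users  u ON u.user_id   = t.user_id
    GROUP BY s.sp100_id, d._date
    ORDER BY s.sp100_id, d._date
"""


def weighted_sentiment(conn):
    """Return one tuple per (sp100_id, date), including dates with no tweets."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for row in weighted_sentiment(build_sample_db()):
        print(row)
```

On the sample rows this returns six rows, one per company and date, with zeros where there are no tweets. The row for sp100_id 1 on 2011-03-13 comes out as 3, 5.75, 11.0, 3, 0.75, 1.0. pos_rt is 3 here (2 + 1 + 0) because each positive tweet contributes its own rt exactly once.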

[Edit] Here is the EXPLAIN:

id select_type table     type   possible_keys             key        key_len  ref                        rows  extra
-----------------------------------------------------------------------------------------------------------------------------------------------------------
1  SIMPLE      sp100     index  NULL                      PRIMARY    4        NULL                        101  Using index; Using temporary; Using filesort
1  SIMPLE      daterange index  NULL                      _date      3        NULL                        147  Using index; Using join buffer
1  SIMPLE      tweets    ref    query,nyse_date,sp100_id  nyse_date  3        sentimeter.daterange._date 3815    
1  SIMPLE      users     eq_ref PRIMARY                   PRIMARY    4        sentimeter.tweets.user_id     1    
answered 2012-07-31T18:11:53.927