I have a table that looks like this:

+-----+-----------+------------+
| id  |     value |       date |
+-----+-----------+------------+
| id1 |      1499 | 2012-05-10 |
| id1 |      1509 | 2012-05-11 |
| id1 |      1511 | 2012-05-12 |
| id1 |      1515 | 2012-05-13 |
| id1 |      1522 | 2012-05-14 |
| id1 |      1525 | 2012-05-15 |
| id2 |      2222 | 2012-05-10 |
| id2 |      2223 | 2012-05-11 |
| id2 |      2238 | 2012-05-13 |
| id2 |      2330 | 2012-05-14 |
| id2 |      2340 | 2012-05-15 |
| id3 |      1001 | 2012-05-10 |
| id3 |      1020 | 2012-05-11 |
| id3 |      1089 | 2012-05-12 |
| id3 |      1107 | 2012-05-13 |
| id3 |      1234 | 2012-05-14 |
| id3 |      1556 | 2012-05-15 |
| ... |       ... |        ... |
| ... |       ... |        ... |
| ... |       ... |        ... |
+-----+-----------+------------+

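What I want to do is produce, for each date, the sum of the value column over all the data in this table. Each id has one entry per day. The problem is that some ids don't have a value for every day; for example, id2 has no value for the date 2012-05-12.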

What I want is this: when a given date has no value for a particular id, the sum should use that id's value from the closest previous date instead.

For example, suppose we only have the data shown above. We can get the sum of all values for a specific date with this query:

SELECT SUM(value) FROM mytable WHERE date='2012-05-12';

The result will be: 1511 + 1089 = 2600

But what I want is a query that performs this calculation instead: 1511 + 2223 + 1089 = 4823

so that id2's value of 2223 from 2012-05-11 is added in place of the missing value:

| id2 |  2223 | 2012-05-11 |
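
To make the target calculation concrete, here is a minimal Python sketch of the carry-forward sum described above. The function name carry_forward_sums and the in-memory ROWS list are assumptions made for illustration (they are not part of the question), and the sketch relies on there being at most one row per id per date.

```python
# Sample rows copied from the table in the question: (id, value, date).
ROWS = [
    ("id1", 1499, "2012-05-10"), ("id1", 1509, "2012-05-11"), ("id1", 1511, "2012-05-12"),
    ("id1", 1515, "2012-05-13"), ("id1", 1522, "2012-05-14"), ("id1", 1525, "2012-05-15"),
    ("id2", 2222, "2012-05-10"), ("id2", 2223, "2012-05-11"), ("id2", 2238, "2012-05-13"),
    ("id2", 2330, "2012-05-14"), ("id2", 2340, "2012-05-15"),
    ("id3", 1001, "2012-05-10"), ("id3", 1020, "2012-05-11"), ("id3", 1089, "2012-05-12"),
    ("id3", 1107, "2012-05-13"), ("id3", 1234, "2012-05-14"), ("id3", 1556, "2012-05-15"),
]


def carry_forward_sums(rows):
    """Return {date: total}, counting each id's latest value on or before that date.

    Hypothetical helper: assumes at most one row per id per date.
    """
    latest = {}   # id -> most recent value seen so far
    running = 0   # sum of all values currently in `latest`
    totals = {}
    # ISO-formatted dates sort correctly as plain strings.
    for id_, value, date in sorted(rows, key=lambda r: r[2]):
        running += value - latest.get(id_, 0)  # swap the old value for the new one
        latest[id_] = value
        totals[date] = running  # overwritten until the last row of this date
    return totals


totals = carry_forward_sums(ROWS)
print(totals["2012-05-12"])  # 1511 + 2223 + 1089
```

The single sorted pass keeps a running total, so an id with no row on a given date simply keeps contributing its previous value. Only dates that appear somewhere in the table get a total.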

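Do you have any idea how to do this with an SQL query? Or with a script, for example in Python?

I have thousands of ids per date, so I would like the query to be reasonably fast if possible.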

3 Answers

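This isn't pretty, as it has to join four copies of your table to itself, which could lead to all sorts of performance problems (I strongly advise you to set up indexes on id and date)... but it will do the trick: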

SELECT   y.report_date, SUM(x.value)
FROM     mytable AS x
  NATURAL JOIN (
    SELECT   a.id, b.date AS report_date, MAX(c.date) AS date
    FROM     (SELECT DISTINCT id   FROM mytable) a JOIN
             (SELECT DISTINCT date FROM mytable) b JOIN
             mytable AS c ON (c.id = a.id AND c.date <= b.date)
    GROUP BY a.id, b.date
 ) AS y
GROUP BY y.report_date

See it on sqlfiddle.
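
For anyone who wants to try the query shape locally, below is a rough sketch that runs it against an in-memory SQLite database from Python. This is an adaptation, not the MySQL setup from the question: the explicit CROSS JOIN, the "AS id" alias, the ORDER BY, and the TEXT date column are my assumptions for SQLite, and performance will differ from MySQL.

```python
import sqlite3

# Sample rows copied from the table in the question: (id, value, date).
ROWS = [
    ("id1", 1499, "2012-05-10"), ("id1", 1509, "2012-05-11"), ("id1", 1511, "2012-05-12"),
    ("id1", 1515, "2012-05-13"), ("id1", 1522, "2012-05-14"), ("id1", 1525, "2012-05-15"),
    ("id2", 2222, "2012-05-10"), ("id2", 2223, "2012-05-11"), ("id2", 2238, "2012-05-13"),
    ("id2", 2330, "2012-05-14"), ("id2", 2340, "2012-05-15"),
    ("id3", 1001, "2012-05-10"), ("id3", 1020, "2012-05-11"), ("id3", 1089, "2012-05-12"),
    ("id3", 1107, "2012-05-13"), ("id3", 1234, "2012-05-14"), ("id3", 1556, "2012-05-15"),
]

# Same structure as the query above, lightly adapted for SQLite (assumption).
QUERY = """
SELECT   y.report_date, SUM(x.value)
FROM     mytable AS x
  NATURAL JOIN (
    SELECT   a.id AS id, b.date AS report_date, MAX(c.date) AS date
    FROM     (SELECT DISTINCT id   FROM mytable) AS a
             CROSS JOIN (SELECT DISTINCT date FROM mytable) AS b
             JOIN mytable AS c ON (c.id = a.id AND c.date <= b.date)
    GROUP BY a.id, b.date
  ) AS y
GROUP BY y.report_date
ORDER BY y.report_date
"""

conn = sqlite3.connect(":memory:")
# Dates are stored as ISO text, which compares correctly with <= and MAX().
conn.execute("CREATE TABLE mytable (id TEXT, value INTEGER, date TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", ROWS)
# The composite index reflects the indexing advice given above.
conn.execute("CREATE INDEX idx_id_date ON mytable (id, date)")

result = dict(conn.execute(QUERY).fetchall())
conn.close()

print(result["2012-05-12"])  # includes id2's carried-forward 2223
```

The inner query finds, for every (id, report date) pair, the latest date on or before the report date that actually has a row; the outer NATURAL JOIN then picks up the value stored on that date.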

answered 2012-05-15T18:20:24.123

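The SQL solution I can think of isn't very pretty (a sub-select inside a case statement on the value column, with a right join against a date sequence table... it's pretty ugly), so I'll go with the Python version: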

import datetime

import pyodbc

# connect to localhost
conn = pyodbc.connect('Driver={MySQL ODBC 5.1 Driver};Server=127.0.0.1;Port=3306;Database=information_schema;User=root; Password=root;Option=3;')
cursor = conn.cursor()

sums = {}  ## { id : { 'dates': [], 'values': [], 'sum': 0 } }      # sum is optional, you can always just sum() on the values list.

query = """SELECT
    id, value, date
FROM mytable
ORDER BY date ASC, id ASC;"""

## note that I use "fetchall()" here because in my experience the memory
## required to hold the result set is available. If this is not the case
## for you, see below for a row-by-row streaming

for row in cursor.execute(query).fetchall():
    # pyodbc rows expose columns as attributes (row.id), not as dict keys
    entry = sums.get(row.id)
    if entry is not None:  # previous records exist for this id
        # number of days since the last recorded date for this id
        days = (row.date - entry['dates'][-1]).days
        ## days == 1 means no gap: range(1, 1) is empty, so the loop body won't run
        for d in range(1, days):
            entry['dates'].append(entry['dates'][-1] + datetime.timedelta(days=1))  # add date at 1 day increments from last date point
            entry['values'].append(entry['values'][-1])  # add value of last date point again
            entry['sum'] = entry['sum'] + entry['values'][-1]    # add to sum
        ## finally add the actual time point
        entry['dates'].append(row.date)
        entry['values'].append(row.value)
        entry['sum'] = entry['sum'] + row.value

    else:  # this is the first record for the id
        sums[row.id] = {'dates': [row.date], 'values': [row.value], 'sum': row.value}

Alternative row-by-row streaming loop:

# requires "import datetime" at the top of the script
cursor.execute(query)
while True:
    row = cursor.fetchone()
    if row is None:  # no more rows
        break
    # pyodbc rows expose columns as attributes (row.id), not as dict keys
    entry = sums.get(row.id)
    if entry is not None:  # previous records exist for this id
        # number of days since the last recorded date for this id
        days = (row.date - entry['dates'][-1]).days
        ## days == 1 means no gap: range(1, 1) is empty, so the loop body won't run
        for d in range(1, days):
            entry['dates'].append(entry['dates'][-1] + datetime.timedelta(days=1))  # add date at 1 day increments from last date point
            entry['values'].append(entry['values'][-1])  # add value of last date point again
            entry['sum'] = entry['sum'] + entry['values'][-1]    # add to sum
        ## finally add the actual time point
        entry['dates'].append(row.date)
        entry['values'].append(row.value)
        entry['sum'] = entry['sum'] + row.value

    else:  # this is the first record for the id
        sums[row.id] = {'dates': [row.date], 'values': [row.value], 'sum': row.value}

Don't forget to close the connection when you're done!

conn.close()
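
The loop above fills the gaps per id, so one more step is needed to get the per-date totals the question asks for. Here is a minimal sketch of that step; the small sums dict is a hand-written stand-in (an assumption for illustration) with the same shape the loop produces, and totals_by_date is a name I made up.

```python
import datetime
from collections import defaultdict

d = datetime.date

# Hypothetical stand-in for the gap-filled `sums` dict built by the loop above:
# id2's 2012-05-12 entry repeats its 2012-05-11 value.
sums = {
    'id1': {'dates': [d(2012, 5, 11), d(2012, 5, 12)], 'values': [1509, 1511], 'sum': 3020},
    'id2': {'dates': [d(2012, 5, 11), d(2012, 5, 12)], 'values': [2223, 2223], 'sum': 4446},
    'id3': {'dates': [d(2012, 5, 11), d(2012, 5, 12)], 'values': [1020, 1089], 'sum': 2109},
}

# Add every (date, value) pair of every id into a per-date total.
totals_by_date = defaultdict(int)
for entry in sums.values():
    for date, value in zip(entry['dates'], entry['values']):
        totals_by_date[date] += value

print(totals_by_date[d(2012, 5, 12)])  # 1511 + 2223 + 1089
```

One caveat: the loop only fills gaps between two existing rows of the same id, so an id whose most recent days are missing is not carried forward to the end of the range.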
answered 2012-05-15T18:18:00.453

You might want to think a bit more about the semantics of the date column.

Perhaps you should add a column and make date a range.

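Anything you do that doesn't work from the recorded data is likely to be slow. A literal interpretation of your request would probably require a traversal for every date value in the sum.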
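
One possible reading of the range idea, sketched in Python: store each value with the span of dates over which it applies, so that a sum for any date becomes a simple containment test. The names to_ranges and sum_on, and the half-open valid_from / valid_to layout, are my assumptions and not something spelled out above.

```python
# Sample rows copied from the table in the question: (id, value, date).
ROWS = [
    ("id1", 1499, "2012-05-10"), ("id1", 1509, "2012-05-11"), ("id1", 1511, "2012-05-12"),
    ("id1", 1515, "2012-05-13"), ("id1", 1522, "2012-05-14"), ("id1", 1525, "2012-05-15"),
    ("id2", 2222, "2012-05-10"), ("id2", 2223, "2012-05-11"), ("id2", 2238, "2012-05-13"),
    ("id2", 2330, "2012-05-14"), ("id2", 2340, "2012-05-15"),
    ("id3", 1001, "2012-05-10"), ("id3", 1020, "2012-05-11"), ("id3", 1089, "2012-05-12"),
    ("id3", 1107, "2012-05-13"), ("id3", 1234, "2012-05-14"), ("id3", 1556, "2012-05-15"),
]


def to_ranges(rows):
    """Convert (id, value, date) rows to (id, value, valid_from, valid_to).

    valid_to is exclusive; None marks the still-current value of an id.
    """
    ranges = []
    open_range = {}  # id -> index in `ranges` of that id's open-ended range
    for id_, value, date in sorted(rows, key=lambda r: r[2]):
        if id_ in open_range:
            i = open_range[id_]
            ranges[i] = ranges[i][:3] + (date,)  # close the previous range
        open_range[id_] = len(ranges)
        ranges.append((id_, value, date, None))
    return ranges


def sum_on(ranges, date):
    """Sum the values whose range contains `date` (ISO date string)."""
    return sum(
        value
        for _, value, start, end in ranges
        if start <= date and (end is None or date < end)
    )


ranges = to_ranges(ROWS)
print(sum_on(ranges, "2012-05-12"))  # id2's range 2012-05-11..2012-05-13 covers the gap
```

Stored this way, the missing day needs no special handling: id2's value 2223 is valid from 2012-05-11 up to (but not including) 2012-05-13, so it is counted for 2012-05-12 automatically.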

answered 2012-05-15T18:09:30.910