0

我在 MySQL 中有这些数据(样本):

表格1:

ID     ITEM    CNT
--------------------
0001    AAB     5
0001    BBA     3
0001    BBB     8
0001    AAC     10
0002    BBA     2
0002    BBC     7
0003    FFG     2
0003    JPO     4
0003    PUI     22
..........

我想找到一种方法以计数向量的形式在 Python 中导入这些数据,例如:

0001 = [5,10,3,8,0,0,0,0]
0002 = [0,0,2,0,7,0,0,0]
0003 = [0,0,0,0,0,0,4,22]

其中元素以这种形式表示每个 id 的所有项目的计数:[AAB,AAC,BBA,BBB,BBC,FFG,JPO,PUI]

所以我想问一下,实现这一点的最佳和最有效的方法是什么?从 python 或 mysql 做它更好吗?如何做?

谢谢

4

2 回答 2

1

It is usually more efficient -- when possible -- to manipulate the data in SQL rather than in Python.

With this setup:

import config
import MySQLdb
conn = MySQLdb.connect(
    host=config.HOST, user=config.USER,
    passwd=config.PASS, db='test')
cursor = conn.cursor()

sql = '''\
DROP TABLE IF EXISTS foo 
'''
cursor.execute(sql)

sql = '''\
CREATE TABLE foo (
    ID varchar(4),
    ITEM varchar(3),
    CNT int)
'''

cursor.execute(sql)

sql = '''\
INSERT INTO foo VALUES (%s,%s,%s)
'''

cursor.executemany(sql, [['0001', 'AAB', 5],
                         ['0001', 'BBA', 3],
                         ['0001', 'BBB', 8],
                         ['0002', 'BBA', 2]])

You could form the desired SQL with:

items = 'AAB AAC BBA BBB BBC FFG JPO PUI'.split()
fields = ', '.join('COALESCE({}.CNT, 0)'.format(item) for item in items)
joins = '\n'.join('''\
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = '{i}') as {i}
    ON T.ID = {i}.ID'''.format(i=item) for item in items)
sql = '''\
SELECT T.ID, {f}
FROM (SELECT DISTINCT ID from foo) as T
{j}
'''.format(f=fields, j=joins)

print(sql)

And use it like this:

result = dict()
cursor.execute(sql)
for row in cursor:
    result[row[0]] = row[1:]
print(result)    

The SQL query used is:

SELECT T.ID, COALESCE(AAB.CNT, 0), COALESCE(AAC.CNT, 0), COALESCE(BBA.CNT, 0), COALESCE(BBB.CNT, 0), COALESCE(BBC.CNT, 0), COALESCE(FFG.CNT, 0), COALESCE(JPO.CNT, 0), COALESCE(PUI.CNT, 0)
FROM (SELECT DISTINCT ID from foo) as T
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'AAB') as AAB
    ON T.ID = AAB.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'AAC') as AAC
    ON T.ID = AAC.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'BBA') as BBA
    ON T.ID = BBA.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'BBB') as BBB
    ON T.ID = BBB.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'BBC') as BBC
    ON T.ID = BBC.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'FFG') as FFG
    ON T.ID = FFG.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'JPO') as JPO
    ON T.ID = JPO.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'PUI') as PUI
    ON T.ID = PUI.ID

and the resultant dict looks like:

{'0001': (5L, 0L, 3L, 8L, 0L, 0L, 0L, 0L), '0002': (0L, 0L, 2L, 0L, 0L, 0L, 0L, 0L)}

I know you asked for

0001 = [5,10,3,8,0,0,0,0]
0002 = [0,0,2,0,7,0,0,0]
0003 = [0,0,0,0,0,0,4,22]

but there are at least two problems with this. First, 0001 is not a valid Python variable name. Variable names can not start with a digit. Second, you do not want dynamically defined variable names because it is hard to program with a bare variable name which is not known until run-time.

Instead, use the would-be variable names as keys in a dict, result. Then you can refer to the "variable" 0001 with result['0001'].

于 2013-08-04T13:18:35.687 回答
0

You can do it on access via a crosstab query where the row header will be the id and the column header will be the item and cnt the value to be aggregated. You can then loop through each column of each row to get your vectors. See here for help on crosstab queries: http://allenbrowne.com/ser-67.html

于 2013-08-04T13:16:24.967 回答