0

我创建了一个字典myDict,其中包含以下形式的 1000 万个条目。字典中的每个条目代表{(id, age): code}

>>> myDict = {('1039', '68.0864'): '42731,42781,V4501', 
              ('1039', '68.1704'): '4770,4778,V071', 
              ('0845', '60.4476'): '2724,27800,4019', 
              ('0983', '63.3936'): '41401,4168,4240,V1582,V7281'
             }

常量ageOffset定义为 value =0.1

给定一个(id,age)元组,我如何从中获取所有myDict具有键的值(id, X)

age <= X <= age+ageOffset 

我需要执行这个 fetch 操作 200 亿次。

Examples: 
1. 
myTup = ('1039', '68.0')
the answer is: '42731,42781,V4501'

2. 
myTup = ('0845', '60.0')
Ans : No value returned 

编辑:我可以在键的第一个元素的部分匹配的基础上创建一个子字典吗?我的意思是,如果元组 Key 的第一个元素匹配,则创建一个子字典。根据我的数据,这不会超过几百个。然后执行线性范围搜索,比较元组键中的第二个元素并找到相应的值。

4

3 回答 3

3

要执行此操作 200 亿(!)次,您必须对数据进行一些预处理。

首先,我会按 id 分组:

def preprocess(data):
    from collections import defaultdict # Python 2.5+ only
    preprocessed = defaultdict(list)
    # group by id
    for (id, age), value in data.iteritems():
        preprocessed[id].append((float(age), value))
    # sort lists for binary search, see edit
    for key, value in preprocessed.iteritems():
        value.sort()
    return preprocessed

结果应如下所示:

>>> preprocess(myDict)
defaultdict(<type 'list'>, {
    '0845': [(60.4476, '2724,27800,4019')],
    '0983': [(63.3936, '41401,4168,4240,V1582,V7281')],
    '1039': [(68.0864, '42731,42781,V4501'), (68.1704, '4770,4778,V071')]} 

如果相对较少的项目共享相同的 id,从而导致列表较短,您可能会通过过滤列表而侥幸成功。

def lookup(data, id, age, age_offset=0.1):
    if id in data:
        return [value for x, value in data[id] if age <= x <= age+age_offset]
    else:
        return None     

lookup(preprocessed, '1039', 68.0) # Note that I use floats for age
['42731,42781,V4501']

但是,如果许多项目共享相同的 id,您将不得不遍历长列表,从而使查找相对较慢。在这种情况下,您将不得不应用进一步的优化。

编辑:正如@Andrey Petrov 所建议的那样

from bisect import bisect_left
from itertools import islice, takewhile
def optimized_lookup(data, id, age, age_offset=0.1):
    if id in data:
        l = data[id]
        idx = bisect_left(l, age)
        return [a for a,v in takewhile(lambda (x, value): x <= age+age_offset, islice(l, idx, None))]
    else:
        return None 
于 2013-02-20T15:40:31.590 回答
1

Here's a way to do it in numpy, and though I haven't tested it I'm pretty confident it will be vastly faster than looping over the dictionary. I replaced the dictionary structure with a Numpy record array, and used np.where to locate the rows where they match the parameters you gave.

import numpy as np

myDict = {('1039', '68.0864'): '42731,42781,V4501', 
              ('1039', '68.1704'): '4770,4778,V071', 
              ('0845', '60.4476'): '2724,27800,4019', 
              ('0983', '63.3936'): '41401,4168,4240,V1582,V7281'
             }

records=[]
for k,v in myDict.iteritems():
    records.append([k[0], float(k[1]), v])

myArr = np.rec.fromrecords(records, formats='S10, f4, S100', 
                             names="ID, Age, Code")

def findInMyArray(arr, requestedID, requestedAge, tolerance=0.1):
    idx = np.where(((arr["Age"] - requestedAge) < tolerance) & (arr["ID"] == requestedID))
    return idx

idx = findInMyArray(myArr, "1039", 68.0, tolerance=0.1)
print "The index found is: ", idx
print "The values are: ", myArr["Code"][idx[0]]
于 2013-02-20T16:03:18.107 回答
0
def getr(t):
  id = float(t[0])
  age = float(t[1])
  os = 0.1
  rs = []
  correct_id=fixed[id]
  for k in correct_id.keys():
      if (k > age and k <= age + os):
          rs.append(correct_id.get(k))
  return rs

ct = {('1039', '68.0864'): '42731,42781,V4501',
      ('1039', '68.1704'): '4770,4778,V071',
      ('0845', '60.4476'): '2724,27800,4019',
      ('0983', '63.3936'): '41401,4168,4240,V1582,V7281' }

fixed={}

for k in ct:
    if not(float(k[0]) in fixed):
        fixed[float(k[0])]={}
    fixed[float(k[0])][float(k[1])] = ct[k]

print "1"
myTup = ('1039', '68.0')
assert(getr(myTup) == ['42731,42781,V4501'])

#the answer is: '42731,42781,V4501'

print "2"
myTup = ('0845', '60.0')
assert(getr(myTup) == [])
#Ans : No value returned 
于 2013-02-20T15:44:00.680 回答