这里有几个很好的答案。我已经编写并测试了一些代码。
首先,这里是对需求的简单实现:
import pprint
t = [
[ 23, None, 'Delhi', None, 0.33 ],
[ None, 'M', 'Mumbai', None, 0.78 ],
[ None, None, None, 'Hindu', 0.23 ],
[ 34, 'F', 'Chennai', None, 0.33 ],
]
rlen = len(t[0])
# None may require special handling
m = [23, 'M', 'Delhi', 'Hindu', None]
a = [[] for i in range(rlen+1)]
for r in t:
s = sum([1 for i in range(rlen) if r[i] == m[i]])
if 0 < s:
a[s].append(r)
# Print rows from least matching to most matching (easy to reverse)
rtable = [row for n in a for row in n]
pprint.pprint(rtable)
问题是我们扫描每一行并检查每个元素值。为了避免在最后进行排序,我们为每个可能的匹配计数保留单独的列表,然后我们将列表列表展平以获得最终结果。我希望这相对于表的大小执行大约 O(n),如果我们有大量匹配项(构建大型结果列表将比 O(n) 慢,接近 O(n^2) 作为最坏的情况下)。
如果我们索引表,我们可以加快速度。我们可以每列使用一个 dict 并使用集合组合列。
from collections import defaultdict
import pprint
# data table
TABLE = [
[ 23, None, 'Delhi', None, 0.33 ],
[ None, 'M', 'Mumbai', None, 0.78 ],
[ None, None, None, 'Hindu', 0.23 ],
[ 34, 'F', 'Chennai', None, 0.33 ],
]
# The index is a list of dicts, cdictlist.
# cdictlist is indexed by column number to get the column dict.
# The column dict's key is the data value of the column
def BuildIndex(table):
rlen = len(table[0])
rrange = range(rlen)
cdictlist = [defaultdict(set) for i in range(rlen+1)]
for ir in range(len(table)):
r = table[ir]
for ic in rrange:
f = r[ic]
cdictlist[ic][f].add(ir)
return cdictlist
def multisearch(table, match, cdictlist):
# rcounts is row counts, number of times columns have matched for a row
rccounts = defaultdict(int)
#rset is the result set, set of row indexes returned for this search
rset = set()
for ic in range(len(table[0])):
cset = cdictlist[ic].get(match[ic], set())
rset = rset.union(cset)
for i in cset:
rccounts[i] += 1
# sort the list by column match count, original row index
l = sorted((v,k) for (k,v) in rccounts.iteritems())
# return list of rows, for each row we return (count, index, raw data)
lr = [ [l[i][0], l[i][1]] + table[l[i][1]] for i in range(len(l)) ]
return lr
def main():
cdictlist = BuildIndex(TABLE)
# None may require special handling
match = [23, 'M', 'Delhi', 'Hindu', None]
lr = multisearch(TABLE, match, cdictlist)
pprint.pprint(lr)
if __name__ == '__main__':
main()
性能将取决于返回的记录数而不是表的大小。对于大量匹配,集合并集操作将很快成为问题。如果任何字段匹配并且示例字段之一是 Gender,则记录匹配,因此我们应该期望至少返回一半的行。
如果您必须匹配所有列,这种方法会更好。我们可以通过构建未返回的记录集(使用集合交集)来改进这一点,然后过滤掉这些记录。