1

Hi. This is hard to explain properly in a title, so let me start by describing my data. I have 40 lists stored inside one list, in the following form:

data[0] = [[value1 value2 value3,80],[value1,90],[value1 value3,60],[value2 value3,70]]
data[1] = [[value2,40],[value1 value2 value3,90]]
data[2] = [[value1 value2,80],[value1,50],[value1 value3,20]]
   .
   .
   .

Now I expect output like this:

data[0] = [[value1 value2 value3,80],[value1,90],[value1 value3,60],[value2 value3,70],[value2,0],[value1 value2,0]]
data[1] = [[value2,40],[value1 value2 value3,90],[value1,0],[value1 value3,0],[value2 value3,0],[value1 value2,0]]
data[2] = [[value1 value2,80],[value1,50],[value1 value3,20],[value1 value2 value3,0],[value2 value3,0],[value2,0]]    

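I know this is a bit complicated to read, but I wanted to give a good demonstration of the data. Basically, every list needs to contain every combination of values that appears in any of the lists, and if a combination is not already present in a given list, its frequency (the second field) should be 0.

Thanks for your help. Please keep in mind that this is an intersection of 40 different lists, so it needs to be fast and efficient. I am not sure how best to do this...

Edit: I also do not know all the "values" in advance; for simplicity I just wrote 3 different values here (value1, value2, value3). In my project I do not know what the values are or how many distinct values there are (I know there are at least a few thousand).

Edit 2: Here is some real input data. I do not have real output data, but I will try to work it out: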

data[0] = [['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP syslog_priority:Info', '39.7769'], ['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP', '39.7769'], ['destination_ip:10.32.0.100 destination_service:http destination_port:80 syslog_priority:Info', '39.7769'], ['destination_ip:10.32.0.100 destination_service:http destination_port:80', '39.7769'], ['destination_ip:10.32.0.100 destination_service:http protocol:TCP syslog_priority:Info', '39.7769']]


data[1] = [['syslog_priority:Info', '100'], ['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80 protocol:TCP', '43.8362'], ['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80', '43.8362'], ['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http protocol:TCP', '43.8362'], ['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http', '43.8362']]


data[2] = [['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info protocol:TCP', '43.9506'], ['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info', '43.9506'], ['destination_ip:10.32.0.100 destination_port:80 destination_service:http protocol:TCP', '43.9506'], ['destination_ip:10.32.0.100 destination_port:80 destination_service:http', '43.9506'], ['destination_ip:10.32.0.100 destination_port:80 syslog_priority:Info protocol:TCP', '43.9506']]
4 Answers

4

It sounds like you could use sets:

>>> {1, 2, 3, 4, 5} & {2, 3, 4, 5, 6, 7} & {3, 4, 5}
{3, 4, 5}

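& is the intersection operator for sets. You can get a set from a list (which removes duplicates) with set(mylist).

Edit: Based on your comment, it seems you need some kind of union (the union operator is |) rather than an intersection. Here is a function that does what you described in your comment for 2 lists of lists: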
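As a small illustration of the point above, the sketch below removes duplicates with set() and intersects any number of lists at once. The name lists and its contents are made up for the example.

```python
# Hypothetical sample data: three lists, some with duplicates.
lists = [[1, 2, 2, 3, 4, 5], [2, 3, 4, 5, 6, 7], [3, 4, 5, 5]]

# set(mylist) drops duplicates; set.intersection accepts any number of sets.
common = set.intersection(*(set(mylist) for mylist in lists))
print(sorted(common))  # [3, 4, 5]
```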

def function(first, second):
    first_set = {tuple(i) for i in first}
    second_set = {tuple(i) for i in second}
    return (first_set | {(i[0], 0) for i in second_set},
            second_set | {(i[0], 0) for i in first_set})

>>> a = [(1,60),(3,90)]
>>> b = [(2,30),(4,50)]
>>> x, y = function(a, b)
>>> print(x)
{(2, 0), (3, 90), (1, 60), (4, 0)}
>>> print(y)
{(3, 0), (4, 50), (1, 0), (2, 30)}
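The same idea can probably be stretched to any number of lists. The following is only a sketch under that assumption; pad_all is a hypothetical name, and unlike the two-list function it adds a zero entry only when the key is actually missing from a list.

```python
def pad_all(lists):
    """Return one set of (key, freq) tuples per input list, padded with zeros."""
    # Union of every key seen in any list.
    all_keys = {key for lst in lists for key, _ in lst}
    padded = []
    for lst in lists:
        present = {key for key, _ in lst}
        # Keep the original pairs and add (key, 0) for every absent key.
        zeros = {(key, 0) for key in all_keys - present}
        padded.append({tuple(pair) for pair in lst} | zeros)
    return padded

a = [(1, 60), (3, 90)]
b = [(2, 30), (4, 50)]
c = [(1, 10)]
x, y, z = pad_all([a, b, c])
print(sorted(x))  # [(1, 60), (2, 0), (3, 90), (4, 0)]
print(sorted(z))  # [(1, 10), (2, 0), (3, 0), (4, 0)]
```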
answered 2013-08-09T12:36:07.060
1

OK, given your comments, I would use sets, as already suggested.

First, loop over your lists to build a set of every possible string:

possible_strings = set()
for row in mydata:
    for item in row:
        # item[0] is the string, item[1] is its frequency
        possible_strings.add(item[0])

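So possible_strings now holds every possible string in your data.

Now you need to check each row for each string, and if a string is not present, append it to the row with a frequency of 0: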

my_new_data = []
for row in mydata:
    # strings already present in this row
    row_strings = set(item[0] for item in row)
    # strings seen somewhere in the data but absent from this row
    missing_strings = possible_strings - row_strings
    for item in missing_strings:
        row.append([item, 0])
    row.sort()
    my_new_data.append(row)

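The reason I use sets is that you do not have to do any lookups yourself, and the items are strings, so they can be members of a set. There are ways to speed this up (and condense the code), but I like to lay things out so that I can see clearly what I am doing. Unless I have made a typo (and I have already corrected 3), this code runs on my machine.

Here is the unsorted result: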

newrow*************
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP syslog_priority:Info', '39.7769']
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP', '39.7769']
['destination_ip:10.32.0.100 destination_service:http destination_port:80 syslog_priority:Info', '39.7769']
['destination_ip:10.32.0.100 destination_service:http destination_port:80', '39.7769']
['destination_ip:10.32.0.100 destination_service:http protocol:TCP syslog_priority:Info', '39.7769']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http', 0]
['destination_ip:10.32.0.100 destination_port:80 syslog_priority:Info protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80 protocol:TCP', 0]
['syslog_priority:Info', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http', 0]
newrow*************
['syslog_priority:Info', '100']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80 protocol:TCP', '43.8362']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80', '43.8362']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http protocol:TCP', '43.8362']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http', '43.8362']
['destination_ip:10.32.0.100 destination_port:80 syslog_priority:Info protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP syslog_priority:Info', 0]
['destination_ip:10.32.0.100 destination_service:http protocol:TCP syslog_priority:Info', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 syslog_priority:Info', 0]
newrow*************
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info protocol:TCP', '43.9506']
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info', '43.9506']
['destination_ip:10.32.0.100 destination_port:80 destination_service:http protocol:TCP', '43.9506']
['destination_ip:10.32.0.100 destination_port:80 destination_service:http', '43.9506']
['destination_ip:10.32.0.100 destination_port:80 syslog_priority:Info protocol:TCP', '43.9506']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP syslog_priority:Info', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80 protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http protocol:TCP syslog_priority:Info', 0]
['syslog_priority:Info', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 syslog_priority:Info', 0]
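The answer mentions that the code could be condensed. One possible condensed version is sketched below; fill_missing and the sample data are hypothetical, and this variant builds new rows instead of modifying the input in place.

```python
def fill_missing(mydata):
    """Return new rows in which every string from any row appears, padded with 0."""
    # Every string that occurs in any row.
    possible_strings = {item[0] for row in mydata for item in row}
    new_data = []
    for row in mydata:
        row_strings = {item[0] for item in row}
        # Missing strings get a frequency of 0; sorted() keeps the output stable.
        extra = [[s, 0] for s in sorted(possible_strings - row_strings)]
        new_data.append(list(row) + extra)
    return new_data

sample = [[['a b', '80'], ['a', '90']], [['b', '40']]]
for row in fill_missing(sample):
    print(row)
# [['a b', '80'], ['a', '90'], ['b', 0]]
# [['b', '40'], ['a', 0], ['a b', 0]]
```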
answered 2013-08-09T12:44:38.420
1

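It sounds like you want dictionaries, and then you want to compare the keys, which are your lists of "values", rather than the dictionary values, which are the frequencies. Of course, there is no need to restructure the data into dictionaries, but it might make more sense.

Now, for an actual answer: create a new list/dictionary just to hold the complete list of all keys (the "value lists"). Then go through again and add the missing elements to the lists that lack them. Each outer loop runs 40 times. The first outer loop is O(n**2), where n is the total number of unique keys, though I imagine the average case will be less than n**2. The second outer loop is also O(n**2).

I hope this is not too brute-force. At least it is better than comparing data[n] with data[n+m] for n from 0 to 40... that would be 40**2 iterations of the outer loop... still a constant, but obviously larger than 80.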
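The two passes described above might look roughly like this. It is a sketch, not a tested solution for the real data; pad_with_zeros and the sample values are hypothetical, and it assumes each list holds [key, frequency] pairs with unique keys.

```python
def pad_with_zeros(data):
    # Restructure each list of [key, frequency] pairs as a dictionary.
    dicts = [dict(row) for row in data]
    # Pass 1: collect every key seen in any of the lists.
    all_keys = set()
    for d in dicts:
        all_keys.update(d)
    # Pass 2: add each missing key with a frequency of 0.
    for d in dicts:
        for key in all_keys:
            d.setdefault(key, 0)
    return dicts

data = [[['v1 v2', 80], ['v1', 50]], [['v2', 40]]]
result = pad_with_zeros(data)
print(sorted(result[1].items()))  # [('v1', 0), ('v1 v2', 0), ('v2', 40)]
```

Because sets and dictionaries are hash-based, each membership check here is constant time on average, so both passes should cost about one step per (list, key) pair in practice.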

answered 2013-08-09T13:05:44.533
1

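Correct me if I am wrong, but I think the best solution involves a dictionary for each desired output and a master set of keys. A set basically stores each value without allowing duplicates. Using your example above, I would do this: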

master_set = set()
for current_list in list_of_lists:
    # |= needs a set on the right-hand side, so build one from the keys
    master_set |= {entry[0] for entry in current_list}

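Here |= is the in-place union operator for sets.

Once you have that set, you build a dictionary for each list that holds, for every entry in the set, either the relevant value or zero. First I would build the dictionary, then I would add the results for the absent items.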

full_dictionary = {}
for entry in master_set:
    full_dictionary[entry] = [thing[1] for thing in current_list if thing[0] == entry]

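Then generate the full dictionary for each list you have.

Alternatively, if you can choose how the data is input, or just want to restructure it sensibly, I would suggest using a dictionary comprehension, which makes the whole thing much simpler: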

new_dict = {value[0]: value[1] for value in current_list}
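Putting the master set and the dictionary comprehension together, and converting back to the list-of-lists shape from the question, could look like the sketch below. fill_with_master_set is a hypothetical name, and the sample data is invented.

```python
def fill_with_master_set(list_of_lists):
    # Master set of every key that appears in any list.
    master_set = set()
    for current_list in list_of_lists:
        master_set |= {entry[0] for entry in current_list}
    filled = []
    for current_list in list_of_lists:
        new_dict = {value[0]: value[1] for value in current_list}
        # dict.get supplies 0 for keys this list never had.
        filled.append([[key, new_dict.get(key, 0)] for key in sorted(master_set)])
    return filled

list_of_lists = [[['v1 v2', 80], ['v1', 50]], [['v2', 40]]]
print(fill_with_master_set(list_of_lists)[1])
# [['v1', 0], ['v1 v2', 0], ['v2', 40]]
```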

I also had some trouble interpreting the question, so if this is not accurate, let me know and I can revise it.

answered 2013-08-09T13:18:52.283