python - A more efficient way to retrieve the first occurrence of each unique value from a csv column in Python

Question

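I was given a large csv containing a big table of flight data. A function I wrote to help parse it iterates over the Flight ID column and returns a dictionary holding the index and value of each unique Flight ID, in order of first occurrence.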

Dictionary = { Index: FID, ... }

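This was a quick adjustment to an older function that did not need to worry about FIDs repeating in the column (several hundred thousand rows later...).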

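Right now, I have it iterate over the values and compare each one in order. If a value equals the one after it, it skips it. If the next value is different, it stores that value in the dictionary. I have now changed it to also check whether the value has already occurred before and, if so, skip it.
Here is my code: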

def DiscoverEarliestIndex(self, number):
    finaldata = {}
    columnvalues = self.column(number)
    columnenum = {}
    for a, b in enumerate(columnvalues):
        columnenum[a] = b
    i = 0
    while i < (len(columnvalues) - 1):
        next = columnenum[i+1]
        if columnvalues[i] == next:
            i += 1
        else:
            if next in finaldata.values():
                i += 1
                continue
            else:
                finaldata[i+1] = next
                i += 1
    else:
        return finaldata

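It is very inefficient and slows down as the dictionary grows. The column has 5.2 million rows, so handling this much in Python is obviously not a great idea, but I am stuck with it for now.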

Is there a more efficient way to write this function?

score 1 · Accepted Answer

To answer your question directly, you should be able to do this with dict comprehensions and the itertools module.

>>> import itertools as it
>>> data = {1: 'a', 2: 'a', 3: 'c', 4: 'c', 5: 'd'}
>>> grouped = {k: list(v) for k, v in it.groupby(data.items(), key=lambda item: item[1])}
>>> first_seen = {v[0][0]: k for k, v in grouped.items()}
>>> first_seen
{1: 'a', 3: 'c', 5: 'd'}

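I think this could be tweaked a bit; I'm not too happy about going over the dict twice. Still, I think dict comprehensions are very efficient. Also, groupby assumes your keys are in order. That is, it assumes all the indices for 'a' are grouped together, which seems to be true in your case.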
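One way to avoid the second pass is to group over `enumerate` directly and take only the first element of each run. This is a minimal sketch, not the code from the question: the function name `first_index_per_run` and the sample list are made up for illustration, and it assumes the column is available as a plain list.

```python
from itertools import groupby


def first_index_per_run(values):
    """Map the index of the first element of each run of equal values to that value."""
    result = {}
    # groupby yields one group per run of consecutive equal values.
    for value, run in groupby(enumerate(values), key=lambda pair: pair[1]):
        index, _ = next(run)  # first (index, value) pair of this run
        result[index] = value
    return result


flight_ids = ['a', 'a', 'c', 'c', 'd']  # hypothetical sample column
print(first_index_per_run(flight_ids))  # {0: 'a', 2: 'c', 4: 'd'}
```

Like the version above, this only deduplicates adjacent values. If the same ID comes back later in the column, it gets a second entry under its later index.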

score 1 · Accepted Answer

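if next in finaldata.values():

is probably your problem. What you are doing here is:

creating a list
searching the list

Maybe you can use a set to hold the values and search that instead, like this: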

    searchable_data = set()  # values already stored; set membership tests are O(1) on average
    while i < (len(columnvalues) - 1):
        next = columnenum[i+1]
        if columnvalues[i] == next:
            i += 1
        else:
            if next in searchable_data:
                i += 1
                continue
            else:
                finaldata[i+1] = next
                searchable_data.add(next)
                i += 1
    else:
        return finaldata
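The same idea can be written as a single pass with `enumerate`, without comparing neighbours at all. This is only a sketch under the assumption that the column fits in memory as a list; `earliest_index` and the sample data are illustrative names, not part of the original class.

```python
def earliest_index(values):
    """Return {index: value} for the first occurrence of each distinct value."""
    seen = set()   # values already recorded
    first = {}     # index of first occurrence -> value
    for index, value in enumerate(values):
        if value not in seen:
            seen.add(value)
            first[index] = value
    return first


flight_ids = ['a', 'a', 'b', 'a', 'c', 'b']  # hypothetical sample column
print(earliest_index(flight_ids))  # {0: 'a', 2: 'b', 4: 'c'}
```

Note that this also records the value at index 0, which the loop above never stores. If the first row is a header, skip it before calling the function.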

score 1 · Accepted Answer

You are essentially looking for a database. Databases are made exactly for such operations on large datasets. It will be much faster to parse the entire CSV at once using the csv module and load the rows into a database than to store them in a dict and run checks against the entire dict.

*large* python dictionary with persistence storage for quick look-ups
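As a rough illustration of the database route, here is a sketch using the standard-library `csv` and `sqlite3` modules. The table name `flights`, the column names, the function `earliest_rows`, and the in-memory sample data are all assumptions made for the example; a real run would open the actual file and probably use an on-disk database.

```python
import csv
import io
import sqlite3


def earliest_rows(csv_file, column):
    """Load one CSV column into SQLite and return {first_row_index: value}."""
    conn = sqlite3.connect(":memory:")  # assumption: in-memory DB for the demo
    conn.execute("CREATE TABLE flights (row_index INTEGER, flight_id TEXT)")
    reader = csv.reader(csv_file)
    # Stream the rows in; the generator avoids building a list of all rows.
    conn.executemany(
        "INSERT INTO flights VALUES (?, ?)",
        ((i, row[column]) for i, row in enumerate(reader)),
    )
    rows = conn.execute(
        "SELECT MIN(row_index), flight_id FROM flights "
        "GROUP BY flight_id ORDER BY MIN(row_index)"
    ).fetchall()
    conn.close()
    return dict(rows)


# Hypothetical sample: flight ID is in the second column.
sample = io.StringIO("x,F1\ny,F1\nz,F2\nw,F1\nv,F3\n")
print(earliest_rows(sample, 1))  # {0: 'F1', 2: 'F2', 4: 'F3'}
```

The `GROUP BY` with `MIN(row_index)` does the "first occurrence" work inside the database, so no Python-side membership checks are needed.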
