I'm trying to create a pandas DataFrame from a list of School objects, each of which holds one row of information. The problem is that it takes several hours to finish; when I run it in a Jupyter notebook, it crashes after about an hour. I have an ordered list of School objects. The class is as follows:

class School:
    def __init__(self, distance, row):
        self.distance_to_origin = distance
        self.row = row
        self.name = row['name']
        self.lat = row['lat']
        self.lon = row['lon']
    def get_distance(self):
        return self.distance_to_origin
    def get_lat_lon(self):
        return [self.lat, self.lon]
    def get_name(self):
        return self.name
    def get_row(self):
        return self.row
    def __str__(self):
        return str(self.distance_to_origin)
    def __repr__(self):
        return str(self.distance_to_origin)
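
For context, a list like this might be built along the following lines. This is just a sketch: the DataFrame df, its contents, and the per-row distances are made-up placeholders, not my real data.

import pandas as pd

# Hypothetical source data with the columns the School class expects.
df = pd.DataFrame({'name': ['Lincoln Elementary', 'Lincoln Elem', 'Washington High'],
                   'lat': [40.001, 40.002, 40.100],
                   'lon': [-75.001, -75.002, -75.200]})

# distance_to_origin stands in for the distance from a fixed origin point;
# here it is faked with arbitrary numbers just to build the example list.
distances = [0, 150, 12000]
schools = [School(d, row) for d, (_, row) in zip(distances, df.iterrows())]
schools.sort(key=lambda s: s.get_distance())  # the list is ordered by distance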

I then try to create a pandas DataFrame from this list. The overall goal is to remove duplicate schools, where a duplicate is a school within 1600 meters of another and with a similar name.

The code that removes the duplicate schools is as follows:

def get_duplicates(ordered_list):
    total_dups = 0
    newDataFrame = pd.DataFrame()
    for i in trange(len(ordered_list) - 1):
        newDataFrame = newDataFrame.append(ordered_list[i].get_row())
        ite = i + 1
        # walk forward while the next school is within 1600 of this one
        # along the ordering distance
        while ite < len(ordered_list) and abs(ordered_list[i].get_distance() - ordered_list[ite].get_distance()) < 1600:
            if vincenty(ordered_list[i].get_lat_lon(), ordered_list[ite].get_lat_lon()).meters < 1600:
                if fuzzy_match(ordered_list[i].get_name(), ordered_list[ite].get_name()):
                    total_dups += 1  # it's a match, don't add
                else:  # within distance, but the name doesn't match
                    newDataFrame = newDataFrame.append(ordered_list[ite].get_row())
            else:  # it is not within distance
                newDataFrame = newDataFrame.append(ordered_list[ite].get_row())
            ite += 1
    print(total_dups)
    return newDataFrame
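
For completeness, the call site is roughly just this (schools being the ordered list described above):

no_dups = get_duplicates(schools)
print(len(no_dups), 'rows kept')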

vincenty comes from geopy.distance.

The fuzzy matching is:

stemmer = stem.PorterStemmer()
def normalize(s):
    words = tokenize.wordpunct_tokenize(s.lower().strip())
    return ' '.join([stemmer.stem(w) for w in words])

def fuzzy_match(s1, s2, max_dist=3):
    return edit_distance(normalize(s1), normalize(s2)) <= max_dist

edit_distance comes from nltk.metrics.
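
For reference, the snippets above need roughly the following imports (trange is presumably tqdm's trange; vincenty exists only in geopy versions before 2.0, which replaced it with geodesic):

import pandas as pd
from tqdm import trange                      # progress bar used in the outer loop
from geopy.distance import vincenty          # geopy < 2.0
from nltk import stem, tokenize
from nltk.metrics import edit_distance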

What am I doing wrong that makes this take hours? Is there a way to optimize it? Thanks!
