
I wrote a function that runs Dedupe, and it works well. However, when I pass it a DataFrame with 91k rows, I get an error:

# Filter the values that did not get any clusters (90605 rows)
data_ap3 = dedupe_ap1.loc[~dedupe_ap1['Cluster ID'].duplicated(keep=False), :]
data_ap3 = data_ap3.loc[data_ap3['total'] < 2]

data_ap3 = data_ap3[['clean_name']].merge(data[['lat_long', 'clean_name']], on='clean_name', suffixes=('', '_y')) # 90618 rows
data_ap3.drop_duplicates(subset='clean_name', inplace=True)

# Execute ML algorithm:
dedupe_setup_ap3 = {'settings_file': 'learned_settings_ap3',
                    'training_file': 'training_ap3.json',
                    'fields': [{'field': 'clean_name', 'type': 'String'},
                               {'field': 'lat_long', 'type': 'LatLong', 'has missing': True}]
                   }

dedupe_ap3 = execute_dedupe(data_ap3, dedupe_setup_ap3)

This is the error:

KeyError                                  Traceback (most recent call last)
<ipython-input-181-c24e80c6c8b1> in <module>
     13                    }
     14 
---> 15 dedupe_ap3 = execute_dedupe(data_ap3, dedupe_setup_ap3)

<ipython-input-11-5589b2a3407d> in execute_dedupe(data_input, setup)
     36                 deduper.prepare_training(data_d, f)
     37         else:
---> 38             deduper.prepare_training(data_d)
     39 
     40         # ## Active learning

~\Anaconda3\envs\hubster\lib\site-packages\dedupe\api.py in prepare_training(self, data, training_file, sample_size, blocked_proportion, original_length)
   1241         if training_file:
   1242             self._read_training(training_file)
-> 1243         self._sample(data, sample_size, blocked_proportion, original_length)
   1244 
   1245     def _sample(self,

~\Anaconda3\envs\hubster\lib\site-packages\dedupe\api.py in _sample(self, data, sample_size, blocked_proportion, original_length)
   1274         examples, y = flatten_training(self.training_pairs)
   1275 
-> 1276         self.active_learner = self.ActiveLearner(self.data_model,
   1277                                                  data,
   1278                                                  blocked_proportion,

~\Anaconda3\envs\hubster\lib\site-packages\dedupe\labeler.py in __init__(self, data_model, data, blocked_proportion, sample_size, original_length, index_include)
    423         data = core.index(data)
    424 
--> 425         self.candidates = self._sample(data, blocked_proportion, sample_size)
    426 
    427         random_pair = random.choice(self.candidates)

~\Anaconda3\envs\hubster\lib\site-packages\dedupe\labeler.py in _sample(self, data, blocked_proportion, sample_size)
     57         data = dict(data)
     58 
---> 59         return [(data[k1], data[k2])
     60                 for k1, k2
     61                 in blocked_sample_keys | random_sample_keys]

~\Anaconda3\envs\hubster\lib\site-packages\dedupe\labeler.py in <listcomp>(.0)
     57         data = dict(data)
     58 
---> 59         return [(data[k1], data[k2])
     60                 for k1, k2
     61                 in blocked_sample_keys | random_sample_keys]

KeyError: 2147550046
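
Reading the traceback, the failing line looks up every sampled key pair in dict(data), so the KeyError means that at least one sampled key is not among the record keys. The following is only a minimal sketch of that lookup with made-up keys, not dedupe's actual code; the helper name missing_keys and the sample data are hypothetical.

```python
def missing_keys(records, sampled_pairs):
    """Return the sampled keys that are absent from the records mapping."""
    missing = set()
    for k1, k2 in sampled_pairs:
        for key in (k1, k2):
            if key not in records:
                missing.add(key)
    return sorted(missing)


# Hypothetical data: three records keyed 0..2, and one sampled pair
# that points at a key (7) which does not exist in the mapping.
records = {
    0: {"clean_name": "a"},
    1: {"clean_name": "b"},
    2: {"clean_name": "c"},
}
sampled_pairs = [(0, 1), (1, 7)]

absent = missing_keys(records, sampled_pairs)
print(absent)  # keys that would raise KeyError in records[key]
```

Running a check like this on the real keys and sampled pairs would show whether 2147550046 is the only key that falls outside the records.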

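What confuses me even more is that if I remove the second condition, data_ap3 = data_ap3.loc[data_ap3['total'] < 2], it works fine. It also works if I call the last line as dedupe_ap3 = execute_dedupe(data_ap3.loc[:60000], dedupe_setup_ap3) or as dedupe_ap3 = execute_dedupe(data_ap3.loc[50000:], dedupe_setup_ap3). I really don't know why it fails on the full DataFrame, especially since it works on overlapping subsets of the dataset ([50000:90000], [0:60000]).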

I have tried resetting the index and checking for missing values, but neither helped.
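
For reference, the kind of checks mentioned above can be sketched as follows. This is an illustration under assumptions, not the exact code that was run: index_report is a hypothetical helper, and the small frame stands in for data_ap3 after the filter, merge, and drop_duplicates steps have left gaps in its index.

```python
import pandas as pd


def index_report(df):
    """Summarise index properties and missing values of a DataFrame."""
    n = len(df)
    return {
        "rows": n,
        "unique_index": df.index.is_unique,
        # True only when the index is exactly 0, 1, ..., n-1
        "contiguous_index": df.index.equals(pd.RangeIndex(n)),
        "missing_per_column": df.isna().sum().to_dict(),
    }


# Hypothetical stand-in for data_ap3: filtering left gaps in the index.
df = pd.DataFrame(
    {
        "clean_name": ["a", "b", "c"],
        "lat_long": [(1.0, 2.0), None, (3.0, 4.0)],
    },
    index=[0, 5, 9],
)

before = index_report(df)
# reset_index returns a new frame; the result has to be assigned.
after = index_report(df.reset_index(drop=True))

print(before)
print(after)
```

Note that .loc[:60000] slices by index label rather than by position, so on a frame whose index has gaps it may select a different number of rows than .iloc[:60000] would.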
