I wrote a function that runs Dedupe, and it works well. However, when I pass it a DataFrame with 91k rows, I get an error:
# Filter the values that did not get any clusters (90605 rows)
data_ap3 = dedupe_ap1.loc[~dedupe_ap1['Cluster ID'].duplicated(keep=False), :]
data_ap3 = data_ap3.loc[data_ap3['total'] < 2]
data_ap3 = data_ap3[['clean_name']].merge(data[['lat_long', 'clean_name']], on='clean_name', suffixes=('', '_y')) # 90618 rows
data_ap3.drop_duplicates(subset='clean_name', inplace=True)
# Execute ML algorithm:
dedupe_setup_ap3 = {
    'settings_file': 'learned_settings_ap3',
    'training_file': 'training_ap3.json',
    'fields': [
        {'field': 'clean_name', 'type': 'String'},
        {'field': 'lat_long', 'type': 'LatLong', 'has missing': True},
    ],
}
dedupe_ap3 = execute_dedupe(data_ap3, dedupe_setup_ap3)
The error is:
KeyError Traceback (most recent call last)
<ipython-input-181-c24e80c6c8b1> in <module>
13 }
14
---> 15 dedupe_ap3 = execute_dedupe(data_ap3, dedupe_setup_ap3)
<ipython-input-11-5589b2a3407d> in execute_dedupe(data_input, setup)
36 deduper.prepare_training(data_d, f)
37 else:
---> 38 deduper.prepare_training(data_d)
39
40 # ## Active learning
~\Anaconda3\envs\hubster\lib\site-packages\dedupe\api.py in prepare_training(self, data, training_file, sample_size, blocked_proportion, original_length)
1241 if training_file:
1242 self._read_training(training_file)
-> 1243 self._sample(data, sample_size, blocked_proportion, original_length)
1244
1245 def _sample(self,
~\Anaconda3\envs\hubster\lib\site-packages\dedupe\api.py in _sample(self, data, sample_size, blocked_proportion, original_length)
1274 examples, y = flatten_training(self.training_pairs)
1275
-> 1276 self.active_learner = self.ActiveLearner(self.data_model,
1277 data,
1278 blocked_proportion,
~\Anaconda3\envs\hubster\lib\site-packages\dedupe\labeler.py in __init__(self, data_model, data, blocked_proportion, sample_size, original_length, index_include)
423 data = core.index(data)
424
--> 425 self.candidates = self._sample(data, blocked_proportion, sample_size)
426
427 random_pair = random.choice(self.candidates)
~\Anaconda3\envs\hubster\lib\site-packages\dedupe\labeler.py in _sample(self, data, blocked_proportion, sample_size)
57 data = dict(data)
58
---> 59 return [(data[k1], data[k2])
60 for k1, k2
61 in blocked_sample_keys | random_sample_keys]
~\Anaconda3\envs\hubster\lib\site-packages\dedupe\labeler.py in <listcomp>(.0)
57 data = dict(data)
58
---> 59 return [(data[k1], data[k2])
60 for k1, k2
61 in blocked_sample_keys | random_sample_keys]
KeyError: 2147550046
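What confuses me even more is that if I remove the second condition,
data_ap3 = data_ap3.loc[data_ap3['total'] < 2]
it works fine. It also works if I replace the last line with
dedupe_ap3 = execute_dedupe(data_ap3.loc[:60000], dedupe_setup_ap3)
or with
dedupe_ap3 = execute_dedupe(data_ap3.loc[50000:], dedupe_setup_ap3)
I really don't know why it fails on the full DataFrame, especially since it works for overlapping subsets of the dataset ([50000:90000], [0:60000]).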
I have tried resetting the index and checking for missing values, but neither helped.
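One further check that the traceback suggests is whether the failing key is actually one of the record ids in the dict handed to prepare_training. The following is only a minimal sketch of that check: describe_keys and toy_data_d are hypothetical names, and the toy dict merely stands in for the data_d built inside execute_dedupe, which is not shown here.

```python
def describe_keys(data_d, failing_key):
    """Summarise the record ids in data_d and report whether failing_key is one of them."""
    keys = list(data_d)
    return {
        "n_records": len(keys),
        "min_key": min(keys),
        "max_key": max(keys),
        "failing_key_present": failing_key in data_d,
        # 2**31 - 1 is the largest value a signed 32-bit integer can hold
        "exceeds_int32": failing_key > 2**31 - 1,
    }


# Toy stand-in (assumption) for the {record_id: record} dict passed to Dedupe
toy_data_d = {
    i: {"clean_name": f"name {i}", "lat_long": None}
    for i in range(5)
}

# 2147550046 is the key reported in the KeyError above
report = describe_keys(toy_data_d, 2147550046)
print(report)
```

Running the same helper on the real data_d would show whether 2147550046 is a genuine record id or a value that never appeared in the index at all.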