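I have a dataset of real estate ads. Several rows describe the same property, so it is full of near-duplicates that are not exactly identical. It looks like this: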
ID URL CRAWL_SOURCE PROPERTY_TYPE NEW_BUILD DESCRIPTION IMAGES SURFACE LAND_SURFACE BALCONY_SURFACE ... DEALER_NAME DEALER_TYPE CITY_ID CITY ZIP_CODE DEPT_CODE PUBLICATION_START_DATE PUBLICATION_END_DATE LAST_CRAWL_DATE LAST_PRICE_DECREASE_DATE
0 22c05930-0eb5-11e7-b53d-bbead8ba43fe http://www.avendrealouer.fr/location/levallois... A_VENDRE_A_LOUER APARTMENT False Au rez de chaussée d'un bel immeuble récent,... ["https://cf-medias.avendrealouer.fr/image/_87... 72.0 NaN NaN ... Lamirand Et Associes AGENCY 54178039 Levallois-Perret 92300.0 92 2017-03-22T04:07:56.095 NaN 2017-04-21T18:52:35.733 NaN
1 8d092fa0-bb99-11e8-a7c9-852783b5a69d https://www.bienici.com/annonce/ag440414-16547... BIEN_ICI APARTMENT False Je vous propose un appartement dans la rue Col... ["http://photos.ubiflow.net/440414/165474561/p... 48.0 NaN NaN ... Proprietes Privees MANDATARY 54178039 Levallois-Perret 92300.0 92 2018-09-18T11:04:44.461 NaN 2019-06-06T10:08:10.89 2018-09-25
...
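I want to use recordlinkage to find the records in the dataset that belong to the same entity. So I read the documentation and followed its example: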
import recordlinkage

# Full index: every pair of records becomes a candidate link
indexer = recordlinkage.Index()
indexer.full()
candidate_links = indexer.index(df)
print(len(df), len(candidate_links))
2164 2340366
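The second number is what a full index should give when deduplicating a single DataFrame: every unordered pair of distinct records, i.e. n(n-1)/2. A small pure-Python check of that count, not using recordlinkage itself (`n_full_pairs` is a name made up for this illustration):

```python
from itertools import combinations


def n_full_pairs(n_records):
    """Number of unordered pairs of distinct records among n_records."""
    return n_records * (n_records - 1) // 2


# Enumerate a tiny case explicitly to confirm the formula.
small_pairs = list(combinations(range(4), 2))
assert len(small_pairs) == n_full_pairs(4)

# The dataset above has 2164 rows.
print(n_full_pairs(2164))
```

So with 2164 rows, a full index is expected to produce 2340366 candidate links, which matches the output above.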
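Every record pair is a candidate match. To classify the candidate pairs into matches and non-matches, I want to compare the records on all the attributes they have in common. The recordlinkage module has a class named Compare for comparing records. The following code shows how I compare attributes: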
compare_cl = recordlinkage.Compare()
compare_cl.exact('SURFACE', 'SURFACE', label='SURFACE')
features = compare_cl.compute(pairs, df)
But it gives me this error:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-51-1e55ea540dbd> in <module>
9 #compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')
10
---> 11 features = compare_cl.compute(pairs, df)
NameError: name 'pairs' is not defined
And I can't find in the documentation what pairs is supposed to be...
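In the documentation's examples, `pairs` seems to be just the variable that holds the result of `indexer.index(...)`, so here it would presumably be `candidate_links`, i.e. `compare_cl.compute(candidate_links, df)`. To illustrate what the comparison step does with those pairs, here is a minimal pure-Python sketch. It is not recordlinkage's implementation; the `records` values and the `exact_compare` helper are made up for illustration.

```python
from itertools import combinations

# Toy records keyed by index label; SURFACE values are invented for this sketch.
records = {
    0: {"SURFACE": 72.0},
    1: {"SURFACE": 48.0},
    2: {"SURFACE": 72.0},
}

# Same idea as a full index: all unordered pairs of index labels.
candidate_pairs = list(combinations(sorted(records), 2))


def exact_compare(pairs, data, column):
    """Map each pair to 1 if `column` is exactly equal in both records, else 0."""
    return {(a, b): int(data[a][column] == data[b][column]) for a, b in pairs}


# The pairs passed in here play the role that `pairs` plays in the docs.
features = exact_compare(candidate_pairs, records, "SURFACE")
print(features)
```

The point is only that the first argument to the comparison is the set of candidate pairs produced by the indexing step, and the result has one row (here, one dictionary entry) per pair.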