python - How to group similar rows in a dataset?

Question

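I'm trying to work out a good way to combine multiple datasets, where each dataset holds a slightly different set of information about the same set of items and therefore contains slightly different data. For example: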

+----+------+--------+-------+------------------+----------+---------+---------+
| h  |  db  |  name  |  age  |     location     |  colour  |  fruit  |  height |
+----+------+--------+-------+------------------+----------+---------+---------+
| 1  |  b   |  joe   |   22  |  redbush ave     |  blue    |  pear   |  _      |
| 2  |  b   |  joe   |   22  |  redbush avenue  |  blue    |  paer   |  _      |
| 3  |  c   |  macy  |   38  |  high street     |  green   |  apple  |  1.65   |
| 4  |  c   |  j. h  |   22  |  redbush         |  blue    |  pear   |  1.59   |
+----+------+--------+-------+------------------+----------+---------+---------+

From that set of rows (i.e. from DBs b and c combined), I would like to get:

+----+------+-----------+-------+-------------------------------+----------+--------------+---------+
| h  |  db  |  name     |  age  |        location               |  colour  |    fruit     |  height |
+----+------+-----------+-------+-------------------------------+----------+--------------+---------+
| 1  |  X   |  joe, j.h |  22   |  redbush ave, redbush avenue  |  blue    |  pear, paer  |    1.59 |
| 2  |  X   |  macy     |  38   |  high street                  |  green   |  apple       |    1.65 |
+----+------+-----------+-------+-------------------------------+----------+--------------+---------+

That is, the 3 very similar rows have been merged, and where their data differs, all versions are kept.

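I'm trying to learn Python, so I've looked at pandas (using groupby), at concatenating all the columns and comparing them, and at fuzzywuzzy, but nothing seems to quite fit. I'm guessing the answer will involve edit/Levenshtein distance, but I'm struggling to find a way to do it.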
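
To make the idea concrete, here is a minimal sketch of one possible approach, not a definitive solution. It scores each pair of rows by averaging per-field string similarity, links pairs whose score passes a cut-off, and merges each linked group by keeping every distinct value. It uses the standard library's `difflib.SequenceMatcher` ratio as a stand-in for a true Levenshtein distance. The function names, the list of compared columns, and the 0.7 threshold are all assumptions chosen for this sample, and `None` stands for the `_` placeholder.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Sample rows from the first table; None marks a missing value ("_").
rows = [
    {"db": "b", "name": "joe", "age": "22", "location": "redbush ave",
     "colour": "blue", "fruit": "pear", "height": None},
    {"db": "b", "name": "joe", "age": "22", "location": "redbush avenue",
     "colour": "blue", "fruit": "paer", "height": None},
    {"db": "c", "name": "macy", "age": "38", "location": "high street",
     "colour": "green", "fruit": "apple", "height": "1.65"},
    {"db": "c", "name": "j. h", "age": "22", "location": "redbush",
     "colour": "blue", "fruit": "pear", "height": "1.59"},
]

# Assumption: "db" only says where a row came from, so it is not compared.
COMPARE = ["name", "age", "location", "colour", "fruit", "height"]
THRESHOLD = 0.7  # assumed cut-off, picked by eye for this sample


def field_similarity(a, b):
    """Similarity in [0, 1]; difflib's ratio, not a true edit distance."""
    return SequenceMatcher(None, a, b).ratio()


def row_similarity(r1, r2):
    """Average field similarity over columns present in both rows."""
    scores = [
        field_similarity(r1[c], r2[c])
        for c in COMPARE
        if r1[c] is not None and r2[c] is not None
    ]
    return sum(scores) / len(scores) if scores else 0.0


def group_rows(rows, threshold=THRESHOLD):
    """Link every pair scoring >= threshold; return groups of row indices."""
    parent = list(range(len(rows)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(rows)), 2):
        if row_similarity(rows[i], rows[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(rows)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def merge_group(rows, indices):
    """Keep every distinct non-missing value per column, in first-seen order."""
    merged = {}
    for col in rows[0]:
        seen = []
        for i in indices:
            value = rows[i][col]
            if value is not None and value not in seen:
                seen.append(value)
        merged[col] = ", ".join(seen) if seen else None
    return merged


merged = [merge_group(rows, g) for g in group_rows(rows)]
for record in merged:
    print(record)
```

Comparing every pair of rows is quadratic in the number of rows, so this only suits small datasets. Treating numbers such as age and height as strings is also a simplification: a numeric tolerance would likely be a better fit for those columns. A fuzzy-matching library could be swapped in for `field_similarity` without changing the rest of the sketch.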

Thanks for your help,

Matt
