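I have several columns, such as file name, file size, and date, and I want to find near duplicates by taking all of these parameters into account.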
Id  Name                               Size     Date
1   lib_mysqludf_sys.html              8934     2020-11-10 06:25:57
2   lib_mysqludf_sys.c                 8715     2020-11-10 12:12:41
3   lib_mysqludf_sys.so                8480     2020-11-10 08:51:33
4   install.sh                         1544     2020-11-10 12:17:16
5   lib_mysqludf_sys.sql               7900     2020-11-10 06:25:59
6   Makefile                           124      2020-11-10 06:36:43
7   lib_mysqludf_sys-master            4096     2020-11-10 12:12:41
8   cmake-3.17.0.tar.gz                9466484  2020-11-09 08:23:31
9   fileclassification.cpython-36.pyc  522      2020-11-03 12:00:43
10  fileclassification.cpython-38.pyc  518      2020-11-04 05:49:24
11  __pycache__                        4096     2020-11-04 05:49:24
12  fileclassification.py              272      2020-11-03 12:00:41
13  asset_classifier                   4096     2020-11-03 12:00:42
14  pyvenv.cfg                         69       2020-11-04 04:56:36
As in the data frame above, there are 4 files with similar file names, sizes, and dates.
Expected output
Id  Name                   Near Duplicates
1   lib_mysqludf_sys.html  ['lib_mysqludf_sys.c', 'lib_mysqludf_sys.so',
                            'lib_mysqludf_sys.html', 'lib_mysqludf_sys.sql']
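
To make "near duplicate" concrete, here is a minimal sketch of one possible interpretation, using only the Python standard library. Everything about the criteria is an assumption on my part: two rows count as near duplicates when their names are at least 80% similar according to difflib.SequenceMatcher, their sizes differ by at most 15% of the larger one, and their dates are at most one day apart. The names ROWS, is_near_duplicate and near_duplicates are made up for illustration, and the rows are plain tuples rather than a real data frame.

```python
from datetime import datetime, timedelta
from difflib import SequenceMatcher

# Sample rows copied from the table above: (Id, Name, Size, Date).
ROWS = [
    (1, "lib_mysqludf_sys.html", 8934, "2020-11-10 06:25:57"),
    (2, "lib_mysqludf_sys.c", 8715, "2020-11-10 12:12:41"),
    (3, "lib_mysqludf_sys.so", 8480, "2020-11-10 08:51:33"),
    (4, "install.sh", 1544, "2020-11-10 12:17:16"),
    (5, "lib_mysqludf_sys.sql", 7900, "2020-11-10 06:25:59"),
    (6, "Makefile", 124, "2020-11-10 06:36:43"),
    (7, "lib_mysqludf_sys-master", 4096, "2020-11-10 12:12:41"),
    (8, "cmake-3.17.0.tar.gz", 9466484, "2020-11-09 08:23:31"),
    (9, "fileclassification.cpython-36.pyc", 522, "2020-11-03 12:00:43"),
    (10, "fileclassification.cpython-38.pyc", 518, "2020-11-04 05:49:24"),
    (11, "__pycache__", 4096, "2020-11-04 05:49:24"),
    (12, "fileclassification.py", 272, "2020-11-03 12:00:41"),
    (13, "asset_classifier", 4096, "2020-11-03 12:00:42"),
    (14, "pyvenv.cfg", 69, "2020-11-04 04:56:36"),
]

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def is_near_duplicate(a, b, name_ratio=0.8, size_tol=0.15,
                      max_gap=timedelta(days=1)):
    """Return True when rows a and b are close on name, size and date.

    All three thresholds are assumptions chosen for this sample data.
    """
    _, name_a, size_a, date_a = a
    _, name_b, size_b, date_b = b

    # Name similarity in [0, 1]; 1.0 means identical strings.
    similar_name = SequenceMatcher(None, name_a, name_b).ratio() >= name_ratio

    # Size difference relative to the larger file (two empty files match).
    largest = max(size_a, size_b)
    similar_size = largest == 0 or abs(size_a - size_b) / largest <= size_tol

    # Absolute gap between the two timestamps.
    gap = abs(datetime.strptime(date_a, DATE_FORMAT)
              - datetime.strptime(date_b, DATE_FORMAT))

    return similar_name and similar_size and gap <= max_gap


def near_duplicates(rows):
    """Map each Id to the names close to it (itself included).

    Rows that match nothing but themselves are left out of the result.
    """
    result = {}
    for row in rows:
        matches = [other[1] for other in rows if is_near_duplicate(row, other)]
        if len(matches) > 1:
            result[row[0]] = matches
    return result


if __name__ == "__main__":
    for file_id, names in near_duplicates(ROWS).items():
        print(file_id, names)
```

With these assumed thresholds, Id 1 is grouped with lib_mysqludf_sys.c, lib_mysqludf_sys.so and lib_mysqludf_sys.sql, while lib_mysqludf_sys-master is rejected because its size is too different. Note that the two fileclassification.cpython-*.pyc files also pair up under the same rules, so the thresholds would need tuning if only the 4 files above should match. The comparison is pairwise over all rows, which is fine for a small table but grows quadratically with the number of files.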