“sequencematcher”的相关标签问题

0 投票

1 回答

306 浏览

python - 使用 Difflib get_matching_blocks 的模糊字符串匹配未检测到所有子字符串

我正在尝试查找段落中所有出现的单词，并且我希望它也能解决拼写错误。代码：

Difflib get_matching_blocks 仅检测 search_here 字符串中“caterpillar”的第一个实例。我希望它给我所有紧密匹配的块的输出，即它应该识别“卡特彼勒”、“卡特彼勒”和“卡特彼勒”

我怎么解决这个问题？

2020-05-31T06:12:41.343

0 投票

2 回答

921 浏览

python - 如何遍历2列并一一匹配

假设我有 2 个 excel 文件，每个文件都包含一列名称和日期

Excel 1：

Excel 2：

我想将第 1 列中的每个单元格与第 2 列中的每个单元格匹配，然后找到最大的相似性。

以下函数将给出两个输入相互匹配的百分比值。

SequenceMatcher 代码示例：

输出：0.92

python pandas loops dataframe sequencematcher

2020-06-02T10:08:28.713

0 投票

1 回答

111 浏览

python - Python如何通过特定列和额外的行循环序列匹配数据帧

过去两周我一直在尝试解决这个问题，我几乎达到了目标。

案例： 我正在尝试的整体描述

对于这个例子，我从 2 个不同的 excel 表中提取了 2 个数据框让我们说 3x3（DF1 和 DF2）
我想将 DF1 中的 Column2 中的单元格与 DF2 中的 Column2 匹配
我需要一一匹配单元格

示例：假设我有单元格 X1，我匹配 Y(1,2,3) X1 中的每个单元格与 Y3 最匹配。

我想提取 X1 行所在的行和 Y3 行所在的行，并将它们彼此对齐保存在一行中，可能在 3. excel 表中

更新我有什么：

此代码能够与 sequencematcher 匹配并打印匹配，但是我只得到一个输出匹配而不是最大匹配列表：

输出：(1.0, ('Fruit', 'Fruit'))

我该如何解决，以便它会给我所有的最大匹配项，以及如何提取匹配项所在的相应行？

python excel pandas dataframe sequencematcher

2020-06-03T09:01:33.227

0 投票

1 回答

180 浏览

python - 使用 SequenceMatcher Python 在 pandas 中查找所有相似值

我正在尝试过滤列中 pandas 中的特定值，但也允许输入错误。我认为使用 SequenceMatcher 是一个很好的解决方案，但我不知道在 DataFrame 中应用它的最佳方法是什么。假设标题是“数字”和“位置”。

df1 = [[1, Amsterdam], [2, amsterdam], [3, rotterdam], [4, amstrdam], [5, Berlin]]

如果我想以一定的比率过滤“阿姆斯特丹”，比如说 0.6。输出可能是这样的。

df1 = [[1, Amsterdam], [2, amsterdam], [4, amstrdam]]

完成这项工作的最佳方法是什么？我正在考虑使用过滤器选项，但没有奏效。我是否需要先运行应用函数来添加具有比率的列，然后才能对其进行过滤？还是有更聪明的方法？

df2 = df1[SequenceMatcher(None, location, df1.location).ratio() > 0.6]

python pandas apply sequencematcher

2020-06-30T08:17:07.897

0 投票

1 回答

100 浏览

python - 使用 Python 在列表中查找相似元素

我需要使用 python 在列表中查找类似的项目。（例如“限制”类似于“限制”或“下载 ICD 文件”类似于“下载 ICD zip 文件”）我真的希望我的结果与字符相似，而不是数字（例如“角度 1”类似于'角度 2'）。我列表中的所有这些字符串都以 '\0' 结尾

我正在尝试做的是将每个项目拆分为空白，并查看是否有任何部分由数字组成。但不知何故，它并没有像我想要的那样工作。

这是我的代码示例：

python list difflib sequencematcher

2020-08-10T09:06:35.183

0 投票

0 回答

37 浏览

python - 用自定义词典标记句子

我正在尝试使用自定义词典来标记句子。例如，如果我有两个文本文件（1. 我的句子，2. 我的字典）

语句文件：

我有腹痛和呼吸困难

字典文件：

腹痛，呼吸困难

我希望输出是这样的：

新文件：

我有腹痛 (AE) 和呼吸困难 (AE)

如何才能做到这一点？请看以下代码：

上面的代码只是创建了一个空的“tagged_sentences”文件谢谢

python nltk tagging sequencematcher

2020-08-10T14:50:45.523

0 投票

1 回答

469 浏览

python - 使用 SequenceMatcher 比较 pandas 中两列中的字符串

我正在尝试确定 pandas 数据框中两列的相似性：

我想比较'Performance results ... 'with'The six...'和 ' Accuracy is one...'with 'Where am I?'。第一行应该是两列之间的相似度较高，因为它包含一些单词；第二个应该等于 0，因为两列之间没有共同的单词。

为了比较我使用的两列SequenceMatcher，如下所示：

但使用 . 似乎是错误的df.Text1, df.All。

你能告诉我为什么吗？

python pandas nlp sequencematcher

user12809368

2020-08-12T18:30:48.267

0 投票

3 回答

238 浏览

python - 如何检测交错日志文件中的序列

我想匹配给定模式库中的模式，返回检测到的最长模式。

但是，我在日志文件中只有多个并行任务的交错结果，例如来自处理器的多个内核。

这是数据挖掘中的已知应用程序吗？

我想到了一种与Regex subsequence matching类似的正则表达式解决方案。然而，有一种距离度量来允许一些模糊性会很好，例如，如果序列中的一个活动会丢失。

python algorithm pattern-matching data-mining sequencematcher

2020-08-13T23:04:08.460

0 投票

0 回答

73 浏览

python - Comparing strings in python with tools as SequenceMatcher and textdistance and the difference in their algorithms

I am working with a dataframe which has 2 columns of city names which should be equal. But they are not due to administrative errors, spelling mistakes or name changes. I am trying to see when those city names are 'equal enough' to be assumed equal. Using SequenceMatcher I can divide the list in roughly 3 parts: Everything is wrong, some are wrong, some are right, everything is right.

In a perfect world I would want the list to be divided in: Everything is wrong, Everything is right. Where the split can be made around a certain ratio/matching value.

Therefore, SequenceMatcher does not do the trick for me. I found textdistance but I get overwhelmed by the possibilities and probably there are more possibilities. An example where it goes wrong:

'Zeddam' and 'Didam' are classified with a ratio of 0.72 (Which are not equal). 'Nes' and 'Nes gem dongeradeel' with a ratio of 0.45 (Which are equal, the second just specifies it's province)

Just checking if one of the strings is a subset of the other string does not do the trick since it will give problems in other cases.

Do you guys have a suggestion on what string comparing algorithm is appropriate and why? I am comparing multiple columns in my dataframe, which is around 1000 rows.

python pandas string-comparison sequencematcher

2020-09-17T14:24:01.897

0 投票

1 回答

131 浏览

python - 排除字符串列表中的相似度

在比较 2 个字符串的相似性时，我想排除一个字符串列表，例如，忽略 'Texas' 和 'US'。

我尝试在 Difflib 的 SequenceMatcher 中使用参数“isjunk”：

相似度高达 0.72，因此显然它没有排除不需要的字符串。

这样做的正确方法是什么？

2020-10-07T04:40:33.877

问题标签 [sequencematcher]

Reference