string - Fuzzy search / approximate string matching with standard unix tools

Question

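I am working with prokka annotation files, which give me the protein products of genes as found in the uniprot database. Unfortunately, many genes are associated with several very similar product names, e.g.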

1%2C2-phenylacetyl-CoA epoxidase%2C subunit A
1%2C2 phenylacetyl-CoA epoxidase%2C subunit A
1%2C2-phenylacetyl CoA epoxidase%2C subunit A
1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A

whereas these variants really are different products:

1%2C2-phenylacetyl-CoA epoxidase%2C subunit A
1%2C2-phenylacetyl-CoA epoxidase%2C subunit B
1%2C2-phenylacetyl-CoA epoxidase%2C subunit C
1%2C2-phenylacetyl-CoA epoxidase%2C subunit E

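To avoid trouble when mapping my genes to their respective products, I decided to replace every potentially ambiguous or problematic character, such as "-", " " and "/", with "@", and to lowercase all strings.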

But is there a way to search for, e.g.,

1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A

so that closely related entries are included, using standard unix tools such as grep? So far I have not been able to find an answer.
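The normalization described in the question can be sketched with tr as below. This is only an illustration: the function name normalize is hypothetical, and the set of replaced characters ("-", " " and "/") is taken from the examples in the question rather than from any complete list.

```shell
# Hypothetical helper (not from the original post): lowercase a product
# name, then map "-", " " and "/" to "@".
normalize() {
  # The hyphen is placed last in the set so tr treats it as a literal
  # character, not as part of a range.
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr ' /-' '@@@'
}

normalize '1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A'
# prints: 1%2c2@phenylacetyl@coa@epoxidase%2c@subunit@a
```

With this scheme the four spelling variants above collapse to the same key, while the names for subunits A, B, C and E stay distinct.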

score 1 · Accepted Answer

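If you want a true fuzzy search defined by a string distance metric, have a look at tre-agrep. For your application, I would use grep with case-insensitive matching and the period special character, which matches any single character.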

grep -i "1.2C2.phenylacetyl.CoA.epoxidase.2C subunit A" drugNames.txt

This matches any single character at the position of each period and ignores case, which is what you want.
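Writing the periods by hand gets tedious for many names, so the pattern could also be derived from the query itself. The sketch below is one possible approach, not part of the original answer: the function name fuzzy_grep is hypothetical, and it assumes that every non-alphanumeric character in the query should be treated as a wildcard.

```shell
# Hypothetical helper: turn a product name into a grep pattern in which
# every non-alphanumeric character becomes ".", then search the given
# file case-insensitively.
fuzzy_grep() {
  # $1 = product name to look for, $2 = file to search
  pattern=$(printf '%s\n' "$1" | sed 's/[^[:alnum:]]/./g')
  grep -i -- "$pattern" "$2"
}

# Usage, with the file name from the answer above:
# fuzzy_grep '1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A' drugNames.txt
```

Each period still requires exactly one character, so a variant that drops a separator altogether (for example "phenylacetylCoA") would not be matched. That case calls for a true distance-based tool such as tre-agrep.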
