python - Dictionary of regular expressions [Google-style search and regex matching]

Question

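Edit: One of the main problems with the code below comes from storing regular expression objects in a dictionary, and from how to access them to check whether they match another string. I am still leaving my earlier question in place, though, because I think there may be a simple way to do all of this.

I would like to find a method in Python that returns a boolean indicating whether two strings refer to the same thing. I know this is difficult, if not downright absurd, in programming, but I am looking at handling it with a dictionary of alternative strings that refer to the same thing.

Here are some examples, since I know it does not make much sense without them.

If I give the string: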

'breakingBad.Season+01 Episode..02'

then I would like it to match the string:

'Breaking Bad S01E02'

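Or 'three.BuCkets+of H2O' could match '3 buckets of water'.

I know this is nearly impossible for synonyms such as '3' and 'water', but I am willing to supply those to the function as a dictionary of relevant regular expression synonyms if needed.

I have a feeling there is a much simpler way to do this in Python, as there usually is, but here is what I have so far: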

import re

def check_if_match(given_string, string_to_match, alternative_dictionary):
    print('matching: ', given_string, '  against: ', string_to_match)
    # split the string into its parts on pretty much any special character
    list_of_given_strings = re.split(r' |\+|\.|;|,|\*|\n', given_string)
    print('List of words retrieved from given string: ')
    print(list_of_given_strings)
    check = False
    counter = 0
    for i in range(len(list_of_given_strings)):
        m = re.search(list_of_given_strings[i], string_to_match, re.IGNORECASE)
        m_alt = None
        try:
            # look the word up in the synonym dictionary and search for its alternative
            m_alt = re.search(alternative_dictionary[list_of_given_strings[i]], string_to_match, re.IGNORECASE)
        except KeyError:
            pass
        if m or m_alt:
            if counter == len(list_of_given_strings)-1: check = True
            else: counter += 1
            print(list_of_given_strings[i], ' found to match')
        else:
            print(list_of_given_strings[i], ' did not match')
            break
    return check

string1 = 'breaking Bad.Season+01 Episode..02'
other_string_to_check = 'Breaking.Bad.S01+E01'
# make a dictionary of synonyms -  here we should be saying that "S01" is equivalent to "Season 01"
alternative_dict = {re.compile(r'S[0-9]', flags=re.IGNORECASE): re.compile(r'Season [0-9]', flags=re.IGNORECASE),
                    re.compile(r'E[0-9]', flags=re.IGNORECASE): re.compile(r'Episode [0-9]', flags=re.IGNORECASE)}
print(check_if_match(string1, other_string_to_check, alternative_dict))
print()
# another try
string2 = 'three.BuCkets+of H2O'
other_string_to_check2 = '3 buckets of water'
alternative_dict2 = {'H2O':'water', 'three':'3'}
print(check_if_match(string2, other_string_to_check2, alternative_dict2))

This returns:

matching:  breaking Bad.Season+01 Episode..02   against:  Breaking.Bad.S01+E01
List of words retrieved from given string: 
['breaking', 'Bad', 'Season', '01', 'Episode', '', '02']
breaking  found to match
Bad  found to match
Season  did not match
False

matching:  three.BuCkets+of H2O   against:  3 buckets of water
List of words retrieved from given string: 
['three', 'BuCkets', 'of', 'H2O']
three  found to match
BuCkets  found to match
of  found to match
H2O  found to match
True

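I realize this probably means something is wrong with my dictionary keys and values, but I feel I am drifting further and further from a simple, Pythonic solution that has probably already been written.

Does anyone have any ideas?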

score 1 · Accepted Answer

I was tinkering with it and found a few interesting things:

It may have to do with how you are breaking the initial string up into a list:

matching:  breaking Bad.Season 1.Episode.1   against:  Breaking.Bad.S1+E1
List of words retrieved from given string:
['breaking', 'Bad', 'Season', '1', 'Episode', '1']

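I think you want ..., 'Season 1', ... in the list rather than separate entries for 'Season' and '1'.
You specify S[0-9], but that pattern covers only a single digit, so it will not match a two-digit number in full.
You are right that the problem lies with the regular expressions stored in the dictionary; the mapping only works in one direction. I was fiddling with the code (unfortunately I do not remember exactly what I did) by mapping r'Season [0-9]' to r'S[0-9]' rather than the other way round, and it was able to match Season.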
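
To make the last two points concrete, here is a minimal sketch (not the code I originally fiddled with). It assumes the dictionary is keyed by patterns for the words that appear in the given string and maps them to patterns for the short form, and it uses \d+ so that two-digit numbers are covered. The helper name find_alternative is made up for illustration.

```python
import re

# Keys describe the token as it appears in the given string, values describe
# the alternative spelling to search for in the other string (assumed direction).
alternatives = {
    re.compile(r'Season', re.IGNORECASE): re.compile(r'S\d+', re.IGNORECASE),
    re.compile(r'Episode', re.IGNORECASE): re.compile(r'E\d+', re.IGNORECASE),
}

# A plain string never equals a compiled pattern, so a direct dictionary
# lookup by token raises KeyError. This is why m_alt stayed None.
try:
    alternatives['Season']
    lookup_failed = False
except KeyError:
    lookup_failed = True


def find_alternative(token, alternatives):
    """Return the alternative pattern whose key pattern matches the whole token, or None."""
    for key_pattern, alt_pattern in alternatives.items():
        if key_pattern.fullmatch(token):
            return alt_pattern
    return None


alt = find_alternative('Season', alternatives)
match = alt.search('Breaking.Bad.S01+E01')
print(lookup_failed, match.group())
```

The dictionary is only iterated here, never indexed, so a list of (pattern, pattern) pairs would work just as well.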

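Suggestions

Instead of a mapping, set up an equivalence class for each kind of string (e.g. title, season, episode) and write some matcher code for each.
Separate the parsing and comparison steps. Parse each string individually into a common format or object, then compare.
You may need to implement some kind of state machine so that you know you are processing a season and expect a number in a particular format immediately after it.
You may want to use a third-party tool instead; I have heard good things about Renamer.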
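
A rough sketch of the "parse first, then compare" idea, under some assumptions: the only formats handled are "Season 01 Episode 02", "S01E02" and "S01 E02", a single regular expression stands in for the state machine, and spaces in the title are ignored so that "breakingBad" and "Breaking Bad" compare equal. The names EpisodeRef, parse and same_episode are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class EpisodeRef(NamedTuple):
    """Common format that every input string is parsed into."""
    title: str
    season: int
    episode: int


# Same separators the question splits on, collapsed into single spaces.
SEPARATORS = re.compile(r'[\s+.;,*]+')

# Title, then a season marker with its number, then an episode marker with its number.
EPISODE = re.compile(
    r'^(?P<title>.+?) (?:season ?|s)(?P<season>\d+) ?(?:episode ?|e)(?P<episode>\d+)$'
)


def parse(text: str) -> Optional[EpisodeRef]:
    """Parse one string into an EpisodeRef, or return None if it does not fit."""
    cleaned = SEPARATORS.sub(' ', text).strip().lower()
    m = EPISODE.match(cleaned)
    if m is None:
        return None
    # Drop spaces in the title so camelCase and spaced titles are treated alike.
    title = m.group('title').replace(' ', '')
    return EpisodeRef(title, int(m.group('season')), int(m.group('episode')))


def same_episode(a: str, b: str) -> bool:
    """Compare two strings only after both have been parsed."""
    parsed_a, parsed_b = parse(a), parse(b)
    return parsed_a is not None and parsed_a == parsed_b


print(same_episode('breakingBad.Season+01 Episode..02', 'Breaking Bad S01E02'))
print(same_episode('breaking Bad.Season+01 Episode..02', 'Breaking.Bad.S01+E01'))
```

Converting the numbers to int is what makes "01" and "1" equal, and the second comparison fails because the episode numbers differ (02 versus 01), not because of formatting.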
