python - Regex matching with spaces removed: how do I remove the matched characters from the original string that still contains the spaces?

Question

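(Disclaimer: this is my first Stack Overflow question, so please forgive me in advance if I am not very clear.)

Expected result:

My task is to find company legal identifiers in strings that represent company names, separate them from the name, and save them in a separate string. The company names have already been cleaned, so they contain only lowercase alphanumeric characters.

Example: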

company_1 = 'uber wien abcd gmbh'
company_2 = 'uber wien abcd g m b h'
company_3 = 'uber wien abcd ges mbh'

should result in

company_1_name = 'uber wien abcd'
company_1_legal = 'gmbh'
company_2_name = 'uber wien abcd'
company_2_legal = 'gmbh'
company_3_name = 'uber wien abcd'
company_3_legal = 'gesmbh'

Where I am now:

I load a list of all company legal IDs from a csv file. Austria provides a good example. Two of its legal IDs are:

gmbh
gesmbh

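I use a regular expression that tells me whether the company name contains a legal identifier. However, to recognize the legal ID, this approach removes all spaces from the string before matching.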

company_1_nospace = 'uberwienabcdgmbh'
company_2_nospace = 'uberwienabcdgmbh'
company_3_nospace = 'uberwienabcdgesmbh'

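Because I look for the regex in the strings without spaces, I can see that all three companies have a legal ID in their name.

Where I am stuck:

I can tell whether there is a legal ID in company_1, company_2 and company_3, but I can only remove it from company_1. In fact, I cannot remove g m b h because it does not match, even though I can tell it is a legal ID. The only way I could remove it is by also removing the spaces in the rest of the company name, which I don't want to do (that is only a last resort).

Even if I were to insert spaces into gmbh so that it matches g m b h, I would not pick up ges mbh or ges m b h. (Note that the same thing happens for other countries.)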
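One way to avoid generating spaced variants by hand is to let the regex tolerate optional whitespace between every letter of each identifier. The following is only a minimal sketch under that assumption, not a definitive implementation: the helper names `spaced_pattern` and `split_legal_id` are made up for illustration, and it assumes identifiers appear only at the very start or very end of the name.

```python
import re

def spaced_pattern(legal_id):
    # 'gmbh' -> 'g\s*m\s*b\s*h': optional whitespace is allowed between letters
    return r'\s*'.join(re.escape(ch) for ch in legal_id)

def split_legal_id(name, legal_ids):
    # Try longer identifiers first so a longer id wins over a shorter one
    ordered = sorted(legal_ids, key=len, reverse=True)
    alternatives = '|'.join(spaced_pattern(i) for i in ordered)
    # \b keeps the identifier from matching in the middle of a word
    pattern = re.compile(r'^(?:{0})\b|\b(?:{0})$'.format(alternatives))
    match = pattern.search(name)
    if not match:
        return name, ''
    # Normalize the matched identifier by dropping its inner spaces
    legal = re.sub(r'\s+', '', match.group())
    rest = (name[:match.start()] + name[match.end():]).strip()
    return rest, legal

print(split_legal_id('uber wien abcd ges mbh', ['gmbh', 'gesmbh']))
# → ('uber wien abcd', 'gesmbh')
```

Because the match is made against the original string, the spaces in the rest of the company name are never touched.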

My code:

import re
re_code = re.compile('^gmbh|gmbh$|^gesmbh|gesmbh$')
comp_id_re = re_code.search(re.sub(r'\s+', '', company_name))
if comp_id_re:
    company_id = comp_id_re.group()
    company_name = re.sub(re_code, '', company_name).strip()
else:
    company_id = ''

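Is there a way for Python to know which characters were removed from the original string? Or would it be easier if I somehow (that is another question) generated all possible spacing alternatives of each legal ID? I.e. from gmbh I would create g mbh, gm bh, gmb h, g m bh, etc. and use those for matching/extraction?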
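On the first of these questions, one possible approach is to record, for every character of the space-free string, its index in the original string, so that a match found in the space-free string can be translated back to a slice of the original. The sketch below assumes exactly that; `strip_with_index_map` and `extract_legal_id` are hypothetical names, and the regex is the one from the code above.

```python
import re

def strip_with_index_map(name):
    # positions[i] is the index in `name` of the i-th non-space character
    positions = [i for i, ch in enumerate(name) if ch != ' ']
    stripped = ''.join(name[i] for i in positions)
    return stripped, positions

def extract_legal_id(name, re_code):
    stripped, positions = strip_with_index_map(name)
    match = re_code.search(stripped)
    if not match:
        return name, ''
    # Translate the match span in `stripped` back to a span in `name`
    start = positions[match.start()]
    end = positions[match.end() - 1] + 1
    return (name[:start] + name[end:]).strip(), match.group()

re_code = re.compile('^gmbh|gmbh$|^gesmbh|gesmbh$')
print(extract_legal_id('uber wien abcd g m b h', re_code))
# → ('uber wien abcd', 'gmbh')
```

This keeps the existing regex unchanged and works the same way whether the identifier sits at the beginning or at the end of the name.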

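I hope my explanation is clear enough. Coming up with a title for this was rather difficult.

Update 1: the legal ID is usually at the end of the company name string. In some countries it sometimes appears at the beginning.

Update 2: I thought this would handle the legal ID in the company name. It works for legal IDs at the end of the company name, but not for legal IDs at the beginning.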

legal_regex = '^ltd|ltd$|^gmbh|gmbh$|^gesmbh|gesmbh$'
def foo(name, legal_regex):
    #compile regex that matches company ids at beginning/end of string
    re_code = re.compile(legal_regex)
    #remove spaces
    name_stream = name.replace(' ','')
    #find regex matches for legal ids
    comp_id_re = re_code.search(name_stream)
    #save company_id, remove it from string
    if comp_id_re:
        company_id = comp_id_re.group()
        name_stream = re.sub(re_code, '', name_stream).strip()
    else:
        company_id = ''
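    #restore spaced string (only works if id is at the end)
    name_stream_it = iter(name_stream)
    #next(..., '') avoids the RuntimeError that Python 3.7+ raises when the iterator runs out inside a generator
    company_name = ''.join(next(name_stream_it, '') if e != ' ' else ' ' for e in name)
    return (company_name.strip(), company_id)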

score 1 · Accepted Answer

A non-regex solution would be easier here. This is how I would do it:

legal_ids = """gmbh
gesmbh"""
def foo(name, legal_ids):
    #Remove all spaces from the string
    name_stream = name.replace(' ','')
    #Now iterate through the legal_ids
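    for legal_id in legal_ids:
        #Remove the legal ID's from the string
        name_stream = name_stream.replace(legal_id, '')
    #Now Create an iterator of the modified string
    name_stream_it = iter(name_stream)
    #Fill in the missing/removed spaces, stopping once the remaining characters run out
    #(calling next() without a default inside a generator raises RuntimeError on Python 3.7+)
    restored = []
    for e in name:
        ch = ' ' if e == ' ' else next(name_stream_it, None)
        if ch is None:
            break
        restored.append(ch)
    return ''.join(restored)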

foo(company_1, legal_ids.splitlines())
'uber wien abcd '
foo(company_2, legal_ids.splitlines())
'uber wien abcd '
foo(company_3, legal_ids.splitlines())
'uber wien abcd '
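Note that `str.replace` removes an identifier wherever it occurs in the space-free string, not only at the end. As a hedged variation (not part of the answer above), the removal could be limited to a trailing identifier with `str.endswith`. In this sketch `strip_trailing_id` is a hypothetical helper name, and it returns the identifier along with the name.

```python
def strip_trailing_id(name, legal_ids):
    # Compare against the space-free name, longest identifier first
    compact = name.replace(' ', '')
    for legal_id in sorted(legal_ids, key=len, reverse=True):
        if compact.endswith(legal_id):
            # Walk back over the original until all id letters are consumed
            remaining = len(legal_id)
            cut = len(name)
            while remaining:
                cut -= 1
                if name[cut] != ' ':
                    remaining -= 1
            return name[:cut].rstrip(), legal_id
    return name, ''

print(strip_trailing_id('gmbhaus wien g m b h', ['gmbh', 'gesmbh']))
# → ('gmbhaus wien', 'gmbh')
```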

score 0 · Accepted Answer

Here is the code I came up with:

company_1 = 'uber wien abcd gmbh'
company_2 = 'uber wien abcd g m b h'
company_3 = 'uber wien abcd ges mbh'
legalids = ["gmbh", "gesmbh"]

def info(company, legalids):
    for legalid in legalids:
        found = []

        last_pos = len(company)-1
        pos = len(legalid)-1
        while True:
            if len(legalid) == len(found):
                newfound = found
                newfound.reverse()
                if legalid == ''.join(newfound):
                    return [company[:last_pos+1].strip(' '), legalid]
                else:
                    break

            if company[last_pos] == ' ':
                last_pos -= 1
                continue
            elif company[last_pos] == legalid[pos]:
                found.append(company[last_pos])
                pos -= 1
            else:
                break
            last_pos -= 1
    return

print(info(company_1, legalids))
print(info(company_2, legalids))
print(info(company_3, legalids))

Output:

['uber wien abcd', 'gmbh']
['uber wien abcd', 'gmbh']
['uber wien abcd', 'gesmbh']

score 0 · Accepted Answer

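I think I have arrived at an acceptable solution. I used part of my original code, part of @Abhijit's code, and the main idea behind @wei2912's code. Thank you both.

Here is the code I am going to use: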

import re

legal_ids = '^ltd|ltd$|^gmbh|gmbh$|^gesmbh|gesmbh$'

def foo(name, legal_ids):
    #initialize re (company id at beginning or end of string)
    re_code = re.compile(legal_ids)
    #remove spaces from name
    name_stream = name.replace(' ','')
    #search for matches
    comp_id_re = re_code.search(name_stream)
    if comp_id_re:
        #match was found, extract the matching company id
        company_id = comp_id_re.group()
        #remove the id from the string without spaces
        name_stream = re.sub(re_code, '', name_stream).strip()
        if comp_id_re.start()>0:
            #the legal id was NOT at the beginning of the string, proceed normally
            name_stream_it = iter(name_stream)
            #next(..., '') keeps the generator from raising RuntimeError on Python 3.7+ once the iterator is exhausted
            final_name = ''.join(next(name_stream_it, '') if e != ' ' else ' ' for e in name)
        else:
            #the legal id was at the beginning of the string, so do the same as above, but with the reversed strings
            name_stream_it = iter(name_stream[::-1])
            final_name = ''.join(next(name_stream_it, '') if e != ' ' else ' ' for e in name[::-1])
            #reverse the string to get it back to normal
            final_name = final_name[::-1]
    else:
        company_id = ''
        final_name = name
    return (final_name.strip(), company_id)
