
I want to split a sentence on a few keywords:

import re
p = r'(?:^|\s)(standard|of|total|sum)(?:\s|$)'
re.split(p, '10-methyl-Hexadecanoic acid of total fatty acids')

This outputs:

['10-methyl-Hexadecanoic acid', 'of', 'total fatty acids']

Expected output: ['10-methyl-Hexadecanoic acid', 'of', 'total', 'fatty acids']

I'm not sure why the regular expression does not split on the token 'total'.


2 Answers


You can use

import re
p = r'(?<!\S)(standard|of|total|sum)(?!\S)'
s = '10-methyl-Hexadecanoic acid of total fatty acids'
print([x.strip() for x in re.split(p,s) if x.strip()])
# => ['10-methyl-Hexadecanoic acid', 'of', 'total', 'fatty acids']

Details

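  • (?<!\S)(standard|of|total|sum)(?!\S) matches one of the words and captures it into Group 1, but only when it is enclosed in whitespace or sits at the start/end of the string. Unlike (?:^|\s) and (?:\s|$), the lookarounds do not consume the whitespace, so the space between 'of' and 'total' is still available to satisfy the check in front of 'total'.
  • The list comprehension drops blank items (if x.strip()) and trims whitespace from each remaining item (x.strip()).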
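
To make the difference concrete, here is a minimal sketch that runs both patterns over the sentence from the question and lists which keywords each one finds. The variable names (consuming, lookaround, consuming_hits, lookaround_hits) are made up for this illustration.

```python
import re

s = '10-methyl-Hexadecanoic acid of total fatty acids'

# Pattern from the question: the groups around the keyword consume whitespace.
consuming = r'(?:^|\s)(standard|of|total|sum)(?:\s|$)'
# Lookaround version: whitespace is only checked, never consumed.
lookaround = r'(?<!\S)(standard|of|total|sum)(?!\S)'

# The match for 'of' swallows the space after it, so the next search starts
# directly at 'total', where neither ^ nor \s can match.
consuming_hits = [m.group(1) for m in re.finditer(consuming, s)]
lookaround_hits = [m.group(1) for m in re.finditer(lookaround, s)]

print(consuming_hits)   # ['of']
print(lookaround_hits)  # ['of', 'total']
```

Because re.split only splits where the pattern matches, the first pattern yields a single split point and the second yields two.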
answered 2019-08-02T08:20:40.960

Using string slicing:

def search(string, search_terms):
    # Init
    ret = []
    # Find the first occurrence of each term (plain substring match)
    # Does not find duplicates, employ count() for that
    for term in search_terms:
        found = string.find(term)
        # Not found
        if found < 0:
            continue
        # Add index of found and length of term
        ret.append((found, len(term),))

    # Not found
    if not ret:
        return [string]

    # Sort by index
    ret.sort(key=lambda x: x[0])

    # Init results list
    end = []
    # Do first found as it is special
    generator = iter(ret)
    ind, length = next(generator)
    # End index of match
    end_index = ind + length
    # Add both to results list
    end.append(string[:ind])
    end.append(string[ind:end_index])

    # Do for all other results
    for ind, length in generator:
        end.append(string[end_index:ind])
        end_index = ind + length
        end.append(string[ind:end_index])
    # Add rest of the string to results
    end.append(string[end_index:])
    return end

# Initialize
search_terms = ("standard", "of", "total", "sum")
string = '10-methyl-Hexadecanoic acid of total fatty acids'

print(search(string, search_terms))
# ['10-methyl-Hexadecanoic acid ', 'of', ' ', 'total', ' fatty acids']

The whitespace can easily be stripped if necessary.
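
As a rough sketch of that cleanup step, the list printed above can be filtered with a comprehension. The names pieces and cleaned are made up for this illustration, and the input list is copied from the output shown above rather than recomputed.

```python
# Output of search() as shown above, copied in so the snippet runs on its own.
pieces = ['10-methyl-Hexadecanoic acid ', 'of', ' ', 'total', ' fatty acids']

# Trim each piece and drop the ones that were whitespace only.
cleaned = [piece.strip() for piece in pieces if piece.strip()]

print(cleaned)
# ['10-methyl-Hexadecanoic acid', 'of', 'total', 'fatty acids']
```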

answered 2019-08-02T09:01:45.280