In Python, what is the most efficient way to determine whether a position in a string lies within a specific pair of character sequences?

       0--------------16-------------------37---------48--------57
       |               |                    |          |        |
cost=r"a) This costs \$1 but price goes as $x^2$ for \(x\) item(s)."

In the string cost, I want to work out whether a given position is enclosed by a pair of $ characters, or lies within \( and \).

For the string cost, the function is_maths(cost, x) should return True for x in [37, 38, 39, 48] and False everywhere else.

The motivation is to find the valid LaTeX maths positions; any alternative efficient approach in Python is also welcome.

1 Answer

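You'll need to parse the string up to the requested position and, if that position is inside a pair of valid LaTeX environment delimiters, on to the closing delimiter, before you can answer with True or False. That's because you have to process each relevant metacharacter (backslashes, dollars and parentheses) to determine its effect.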

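As I understand it, LaTeX's $...$ and \(...\) environment delimiters can't be nested, so you don't have to worry about nested statements here; you only need to find the nearest complete $...$ or \(...\) pair.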

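You can't just match the literal $, \( or \) characters, however, because each of those could be preceded by an arbitrary number of \ backslashes. Instead, tokenize the input string on backslashes, dollars and parentheses, then iterate over the tokens in order and track what was last matched to determine the effect of each token (escaping the next character, or opening and closing a maths environment).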
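To make the tokenizing step concrete, here is a minimal sketch that lists the metacharacters found in the question's string and checks whether one of them is escaped by counting the backslashes in front of it. The names TOKENS and is_escaped are hypothetical helpers for illustration only; they are not part of the parser in this answer, which tracks the escape state while iterating instead of looking backwards.

```python
import re

# the same character class the parser tokenizes on:
# a backslash, a dollar, or either parenthesis
TOKENS = re.compile(r'[\\$()]')


def is_escaped(s, i):
    """True if s[i] is preceded by an odd number of backslashes.

    Hypothetical helper: an even-length run of backslashes consists of
    escaped backslashes only, so it leaves s[i] unescaped.
    """
    n = 0
    while i - n - 1 >= 0 and s[i - n - 1] == '\\':
        n += 1
    return n % 2 == 1


cost = r'a) This costs \$1 but price goes as $x^2$ for \(x\) item(s).'

# (position, character) for every metacharacter, in order
found = [(m.start(), m.group(0)) for m in TOKENS.finditer(cost)]
print(found)

print(is_escaped(cost, 15))  # the $ in \$1
print(is_escaped(cost, 36))  # the $ that opens $x^2$
```

Everything between two tokens is plain text and never needs inspecting, which is why the scan only has to visit a handful of positions in a string like this one.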

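There is no need to continue parsing once you are past the requested position and outside of a maths environment; at that point you already have your answer and can return False early.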

Here is my implementation of such a parser:

import re

_maths_pairs = {
    # keys are opening characters, values matching closing characters
    # each is a tuple of char (string), escaped (boolean)
    ('$', False): ('$', False),
    ('(', True): (')', True),
}
_tokens = re.compile(r'[\\$()]')

def _tokenize(s):
    """Generator that produces (token, pos, prev_pos) tuples for s

    * token is a single character: a backslash, dollar or parenthesis
    * pos is the index into s for that token
    * prev_pos is the position of the preceding token, or -1 if there
      was no preceding token

    """
    prev_pos = -1
    for match in _tokens.finditer(s):
        token, pos = match[0], match.start()
        yield token, pos, prev_pos
        prev_pos = pos

def is_maths(s, pos):
    """Determines if pos in s is within a LaTeX maths environment"""
    expected_closer = None  # (char, escaped) if within $...$ or \(...\)
    opener_pos = None  # position of last opener character
    escaped = False  # True if the most recent token was an escaping backslash

    for token, token_pos, prev_pos in _tokenize(s):
        if expected_closer is None and token_pos > pos:
            # we are past the desired position, it'll never be within a
            # maths environment.
            return False

        # if there was more text between the current token and the last
        # backslash, then that backslash applied to something else.
        if escaped and token_pos > prev_pos + 1:
            escaped = False

        if token == '\\':
            # toggle the escaped flag; doubled escapes negate
            escaped = not escaped
            continue

        if (token, escaped) == expected_closer:
            if opener_pos < pos < token_pos:
                # position is after the opener, before the closer
                # so within a maths environment.
                return True
            expected_closer = None
        elif expected_closer is None and (token, escaped) in _maths_pairs:
            expected_closer = _maths_pairs[(token, escaped)]
            opener_pos = token_pos

        # a backslash escapes only the character right after it, so the
        # flag must not carry over to a later adjacent token such as the
        # second $ in \$$
        escaped = False

    return False

Demo:

>>> cost = r'a) This costs \$1 but price goes as $x^2$ for \(x\) item(s).'
>>> is_maths(cost, 0)  # should be False
False
>>> is_maths(cost, 16)  # should be False, preceding $ is escaped
False
>>> is_maths(cost, 37)  # should be True, within $...$
True
>>> is_maths(cost, 48)  # should be True, within \(...\)
True
>>> is_maths(cost, 57)  # should be False, within unescaped (...)
False

And additional tests to show that escapes are handled correctly:

>>> is_maths(r'Doubled escapes negate: \\$x^2$', 27)  # should be true
True
>>> is_maths(r'Doubled escapes negate: \\(x\\)', 27)  # no longer escaped, so false
False
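If many positions in the same string need checking, one possible alternative is to scan the string once, record every (opener, closer) pair, and answer each query with a binary search. The sketch below is an assumption-laden variation on the parser above, not part of the original answer: maths_spans and in_spans are hypothetical names, and the scan clears the escape flag after every non-backslash token.

```python
import re
from bisect import bisect_right

_tokens = re.compile(r'[\\$()]')
# (char, escaped) of each opener mapped to its closer
_closers = {('$', False): ('$', False), ('(', True): (')', True)}


def maths_spans(s):
    """Return a sorted list of (opener_pos, closer_pos) pairs found in s."""
    spans = []
    expected_closer = None
    opener_pos = None
    escaped = False
    prev_pos = -1
    for match in _tokens.finditer(s):
        token, pos = match.group(0), match.start()
        if pos > prev_pos + 1:
            # other text intervened, so an earlier backslash is spent
            escaped = False
        prev_pos = pos
        if token == '\\':
            escaped = not escaped
            continue
        current = (token, escaped)
        escaped = False  # an escape covers one character only
        if current == expected_closer:
            spans.append((opener_pos, pos))
            expected_closer = None
        elif expected_closer is None and current in _closers:
            expected_closer = _closers[current]
            opener_pos = pos
    return spans


def in_spans(spans, pos):
    """True if pos lies strictly between an opener and its closer."""
    # index of the last span whose opener is at or before pos
    i = bisect_right(spans, (pos, float('inf'))) - 1
    return i >= 0 and spans[i][0] < pos < spans[i][1]


cost = r'a) This costs \$1 but price goes as $x^2$ for \(x\) item(s).'
spans = maths_spans(cost)
print(spans)
print(in_spans(spans, 37), in_spans(spans, 57))
```

The scan costs one pass over the tokens, after which each lookup is logarithmic in the number of maths environments. For a single query the early-exit parser above does less work, so this only pays off when the same string is queried repeatedly.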

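My implementation deliberately ignores malformed LaTeX: unescaped $ characters inside \(...\), or escaped \( and \) characters inside $...$, are ignored, as are further \( openers inside a \(...\) sequence and \) closers with no matching \( opener before them. This ensures that the function keeps working even when given input that LaTeX itself would not render. The parser could be altered to throw an exception or return False in those cases, however. To do that, add a global set created from _maths_pairs.keys() | _maths_pairs.values(). Test (token, escaped) against that set whenever the condition expected_closer is not None and (token, escaped) != expected_closer holds (to detect nested environment delimiters), and test for token == ')' and escaped and expected_closer is None to detect a \) closer without an opener.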
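One way the stricter behaviour described above might look is sketched here. It is an illustration under assumptions, not the answer's own code: is_maths_strict and _delimiters are hypothetical names, raising ValueError is just one of the two options mentioned, and the escape flag is cleared after every non-backslash token. Because the function still returns as soon as the answer is known, only the part of the string scanned up to that point is validated.

```python
import re

# (char, escaped) of each opener mapped to its closer
_maths_pairs = {
    ('$', False): ('$', False),
    ('(', True): (')', True),
}
# every delimiter, whether opener or closer
_delimiters = set(_maths_pairs.keys()) | set(_maths_pairs.values())
_tokens = re.compile(r'[\\$()]')


def is_maths_strict(s, pos):
    """Like is_maths(), but raise ValueError on malformed delimiters."""
    expected_closer = None
    opener_pos = None
    escaped = False
    prev_pos = -1

    for match in _tokens.finditer(s):
        token, token_pos = match.group(0), match.start()
        if expected_closer is None and token_pos > pos:
            # past the position and outside any environment
            return False
        if token_pos > prev_pos + 1:
            # other text intervened, so an earlier backslash is spent
            escaped = False
        prev_pos = token_pos

        if token == '\\':
            escaped = not escaped
            continue

        current = (token, escaped)
        escaped = False  # an escape covers one character only
        if current == expected_closer:
            if opener_pos < pos < token_pos:
                return True
            expected_closer = None
        elif expected_closer is not None:
            # inside an environment: no other delimiter may appear
            if current in _delimiters:
                raise ValueError(f'nested delimiter at position {token_pos}')
        elif current in _maths_pairs:
            expected_closer = _maths_pairs[current]
            opener_pos = token_pos
        elif current == (')', True):
            raise ValueError(f'\\) without opener at position {token_pos}')

    return False
```

With this variant, is_maths_strict(r'$a \(b\) c$', 1) raises because \( appears inside $...$, while well-formed input such as the cost string from the question gives the same answers as before. An environment that is opened but never closed still yields False rather than an error; that case would need an extra check after the loop.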

Answered 2018-10-03T16:19:05.867