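You need to parse the string up to the requested position and, if that position lies inside a valid pair of LaTeX environment delimiters, on to the closing delimiter, before you can answer `True` or `False`. That is because you have to process every relevant metacharacter (backslashes, dollars and parentheses) to determine its effect.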
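As I understand it, LaTeX's `$...$` and `\(...\)` environment delimiters cannot be nested, so you don't have to worry about nested statements here; you only need to find the nearest complete `$...$` or `\(...\)` pair.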
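You can't just match the literal `$`, `\(` or `\)` characters, however, because each of them can be preceded by an arbitrary number of `\` backslashes. Instead, tokenize the input string on backslashes, dollars and parentheses, then iterate over the tokens in order, tracking what was last matched to determine each token's effect (escaping the next character, or opening or closing a maths environment).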
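To illustrate why the number of preceding backslashes matters, here is a minimal sketch that checks whether a single character is escaped by counting the run of backslashes directly in front of it. The helper name `is_escaped` is my own invention for illustration and is not used by the parser; an odd count means the character is escaped, while an even count means the backslashes pair off with each other.

```python
def is_escaped(s, i):
    """Return True if s[i] is preceded by an odd number of backslashes.

    Hypothetical helper, for illustration only.
    """
    count = 0
    j = i - 1
    # walk backwards over the unbroken run of backslashes before s[i]
    while j >= 0 and s[j] == '\\':
        count += 1
        j -= 1
    return count % 2 == 1


print(is_escaped(r'\$1', 1))     # True: one backslash escapes the dollar
print(is_escaped(r'\\$x$', 2))   # False: the two backslashes pair off
print(is_escaped(r'\\\(x', 3))   # True: the third backslash escapes the (
```

Scanning backwards like this for every candidate delimiter repeats work, which is one reason to prefer a single forward pass over the tokens.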
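There is no need to continue parsing once you are past the requested position and outside of a maths environment; at that point you already have your answer and can return `False` early.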
Here is my implementation of such a parser:
import re

_maths_pairs = {
    # keys are opening characters, values matching closing characters
    # each is a tuple of char (string), escaped (boolean)
    ('$', False): ('$', False),
    ('(', True): (')', True),
}
_tokens = re.compile(r'[\\$()]')


def _tokenize(s):
    """Generator that produces token, pos, prev_pos tuples for s

    * token is a single character: a backslash, dollar or parenthesis
    * pos is the index into s for that token
    * prev_pos is the position of the preceding token, or -1 if there
      was no preceding token

    """
    prev_pos = -1
    for match in _tokens.finditer(s):
        token, pos = match[0], match.start()
        yield token, pos, prev_pos
        prev_pos = pos


def is_maths(s, pos):
    """Determines if pos in s is within a LaTeX maths environment"""
    expected_closer = None  # (char, escaped) if within $...$ or \(...\)
    opener_pos = None  # position of last opener character
    escaped = False  # True if the most recent token was an escaping backslash

    for token, token_pos, prev_pos in _tokenize(s):
        if expected_closer is None and token_pos > pos:
            # we are past the desired position, it'll never be within a
            # maths environment.
            return False

        # if there was more text between the current token and the last
        # backslash, then that backslash applied to something else.
        if escaped and token_pos > prev_pos + 1:
            escaped = False

        if token == '\\':
            # toggle the escaped flag; doubled escapes negate
            escaped = not escaped
        elif (token, escaped) == expected_closer:
            if opener_pos < pos < token_pos:
                # position is after the opener, before the closer
                # so within a maths environment.
                return True
            expected_closer = None
        elif expected_closer is None and (token, escaped) in _maths_pairs:
            expected_closer = _maths_pairs[(token, escaped)]
            opener_pos = token_pos

        if token != '\\':
            # a backslash only escapes the character right after it, so
            # the flag must not carry over to an adjacent token (as in \$$)
            escaped = False

    return False
Demo:
>>> cost = r'a) This costs \$1 but price goes as $x^2$ for \(x\) item(s).'
>>> is_maths(cost, 0) # should be False
False
>>> is_maths(cost, 16) # should be False, preceding $ is escaped
False
>>> is_maths(cost, 37) # should be True, within $...$
True
>>> is_maths(cost, 48) # should be True, within \(...\)
True
>>> is_maths(cost, 57) # should be False, within unescaped (...)
False
And some further tests to show that escapes are handled correctly:
>>> is_maths(r'Doubled escapes negate: \\$x^2$', 27) # should be true
True
>>> is_maths(r'Doubled escapes negate: \\(x\\)', 27) # no longer escaped, so false
False
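My implementation deliberately ignores malformed LaTeX: unescaped `$` characters inside `\(...\)`, or escaped `\(` and `\)` characters inside `$...$`, are ignored, as are further `\(` openers inside a `\(...\)` sequence and `\)` closers with no matching `\(` opener before them. This ensures the function keeps working even when given input that LaTeX itself would not render. The parser could, however, be altered to raise an exception or return `False` in those cases. To do that, add a global set created from `_maths_pairs.keys() | _maths_pairs.values()` and test `(token, escaped)` against that set whenever `expected_closer is not None and (token, escaped) != expected_closer` is true (to detect nested environment delimiters), and test for `token == ')' and escaped and expected_closer is None` to detect a `\)` closer without an opener.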
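As a rough sketch of that stricter behaviour, the variant below raises on misplaced delimiters instead of skipping them. The names `is_maths_strict`, `MalformedMaths` and `_delimiters` are my own and purely illustrative. The block repeats the delimiter table and token pattern so that it runs on its own, and it resets the escape flag after every non-backslash token, on the assumption that a backslash only escapes the character immediately following it.

```python
import re

# Same delimiter table as the parser above: (char, escaped) pairs.
_maths_pairs = {
    ('$', False): ('$', False),
    ('(', True): (')', True),
}
# Every opener and closer, used to spot delimiters in the wrong place.
_delimiters = _maths_pairs.keys() | _maths_pairs.values()
_tokens = re.compile(r'[\\$()]')


class MalformedMaths(ValueError):
    """Hypothetical exception for delimiters LaTeX would reject."""


def is_maths_strict(s, pos):
    """Like is_maths(), but raise MalformedMaths on misplaced delimiters."""
    expected_closer = None  # (char, escaped) while inside an environment
    opener_pos = None
    escaped = False
    prev_pos = -1
    for match in _tokens.finditer(s):
        token, token_pos = match[0], match.start()
        if expected_closer is None and token_pos > pos:
            return False
        # a backslash only escapes the character immediately after it
        if token_pos > prev_pos + 1:
            escaped = False
        prev_pos = token_pos
        if token == '\\':
            escaped = not escaped
            continue
        key = (token, escaped)
        escaped = False
        if expected_closer is not None:
            if key == expected_closer:
                if opener_pos < pos < token_pos:
                    return True
                expected_closer = None
            elif key in _delimiters:
                # some other delimiter inside an open environment
                raise MalformedMaths(f'unexpected delimiter at {token_pos}')
        elif key in _maths_pairs:
            expected_closer = _maths_pairs[key]
            opener_pos = token_pos
        elif key in _delimiters:
            # only an escaped ) gets here: a closer with nothing open
            raise MalformedMaths(f'closer without opener at {token_pos}')
    return False
```

Because this version still returns as soon as the answer is known, it only reports problems it encounters before that point; malformed delimiters later in the string go unnoticed, and an environment that is never closed still yields `False` rather than an error.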