python - Regex problem while making a Markdown parser

Question

I am trying to write a Markdown parser in Python, not because it is useful, but because it is fun and because I am trying to learn regular expressions.

#! /usr/bin/env python
#-*- coding: utf-8 -*-

import re

class Converter:

    def markdown2html(self, string):

        string = re.sub(r'\*{3}(.+)\*{3}', r'<strong>\1</strong>', string)
        string = re.sub(r'\*{2}(.+)\*{2}', r'<i>\1</i>', string)
        string = re.sub(r'^#{1}(.+)$', r'<h1>\1</h1>', string, flags=re.MULTILINE)
        string = re.sub(r'^#{2}(.+)$', r'<h2>\1</h2>', string, flags=re.MULTILINE)

        return string

markdown_sting = """
##h2 heading
#H1 heading
This should be a ***bold*** char
#anohter h1
anohter ***bold***
this is a **italic** string
"""

converter = Converter()
print(converter.markdown2html(markdown_sting))

It prints:

<h1>#h2 heading</h1>
<h1>H1 heading</h1>
This should be a <strong>bold</strong> char
<h1>anohter h1</h1>
anohter <strong>bold</strong>
this is a <i>italic</i> string

As you can see, it does not parse the h2 heading. Where did I go wrong?

score 4 · Accepted Answer

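When your parser sees a #, it performs the h1 replacement. It then tries to perform the h2 replacement, but there is no ## string left, because one of the hashes ('#') was already consumed when the h1 rule ran.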
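To make the intermediate state visible, here is a small sketch that runs the question's two heading rules one at a time on a single line. The variable names (`line`, `after_h1`, `after_h2`) are only illustrative.

```python
import re

line = "##h2 heading"

# The h1 rule runs first: '#' matches the first hash, and (.+) captures
# the rest of the line, including the second hash.
after_h1 = re.sub('^#{1}(.+)$', '<h1>\\1</h1>', line, flags=re.MULTILINE)
print(after_h1)  # <h1>#h2 heading</h1>

# The line now starts with '<', so the h2 rule has nothing to match
# and returns the string unchanged.
after_h2 = re.sub('^#{2}(.+)$', '<h2>\\1</h2>', after_h1, flags=re.MULTILINE)
print(after_h2 == after_h1)  # True
```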

A simple fix is to swap the order:

string = re.sub('^#{2}(.+)$', '<h2>\\1</h2>', string, flags=re.MULTILINE)
string = re.sub('^#{1}(.+)$', '<h1>\\1</h1>', string, flags=re.MULTILINE)

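In general, when you apply a series of transformations to data, order them from most restrictive to least restrictive to avoid this kind of problem.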
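One way to keep that ordering explicit is to hold the rules in a list and apply them in sequence. This is only a sketch: `HEADING_RULES` and `convert_headings` are hypothetical names, not part of the original code.

```python
import re

# Hypothetical rule table: (pattern, replacement), most restrictive first.
HEADING_RULES = [
    (r'^#{2}(.+)$', r'<h2>\1</h2>'),
    (r'^#{1}(.+)$', r'<h1>\1</h1>'),
]

def convert_headings(text):
    # Apply each rule to the whole text, in list order.
    for pattern, replacement in HEADING_RULES:
        text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    return text

print(convert_headings("##h2 heading\n#H1 heading"))
# <h2>h2 heading</h2>
# <h1>H1 heading</h1>
```

Adding an h3 rule under this scheme would mean placing it at the top of the list, ahead of the h2 rule.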

score 4 · Accepted Answer

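You can make sure that only the intended number of hash signs is matched by requiring that the first character of the heading text is not a hash. This can be done with [^#], like this: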

string = re.sub('^#{1}([^#].*)$', '<h1>\\1</h1>', string, flags=re.MULTILINE)
string = re.sub('^#{2}([^#].*)$', '<h2>\\1</h2>', string, flags=re.MULTILINE)

This way the order of the rules no longer matters, which makes them more robust.
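A quick way to check the order-independence claim is to apply the two rules in both orders and compare the results. The helper `apply_rules` and the names `H1`, `H2`, `forward`, and `backward` are assumptions made for this sketch.

```python
import re

H1 = (r'^#{1}([^#].*)$', r'<h1>\1</h1>')
H2 = (r'^#{2}([^#].*)$', r'<h2>\1</h2>')

def apply_rules(text, rules):
    # Apply each (pattern, replacement) pair in the order given.
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    return text

sample = "##h2 heading\n#H1 heading"
forward = apply_rules(sample, [H1, H2])
backward = apply_rules(sample, [H2, H1])

print(forward == backward)  # True
print(forward)
# <h2>h2 heading</h2>
# <h1>H1 heading</h1>
```

One caveat: [^#] also matches a newline, so a line consisting of a single # would pull in the following line. Using [^#\n] instead avoids that.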

score 1 · Accepted Answer

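The regexes are evaluated in order. The h1 regex grabs any line that starts with # and converts it to <h1>. So by the time the h2 regex runs, the line no longer starts with ##. Swap the two expressions.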

score 1 · Accepted Answer

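A more appropriate and more efficient approach might be to compare the first characters of the string and then do a simple string replacement: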

def markdown2html(self, string):

    # Test the longer prefix first; elif keeps "##" out of the "#" branch.
    if string.startswith("##"):
        string = "<h2>" + string[2:] + "</h2>"
    elif string.startswith("#"):
        string = "<h1>" + string[1:] + "</h1>"
    return string

This way you use simple string operations instead of regular expressions. In every case, though, the order matters.
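The prefix check works on one line at a time, while the input in the question spans several lines. A minimal sketch of applying it line by line follows; `heading_to_html` and `convert` are hypothetical names, and inline markup such as `***bold***` is deliberately left out.

```python
def heading_to_html(line):
    # Check the longer prefix first so "##" is not treated as "#".
    if line.startswith("##"):
        return "<h2>" + line[2:] + "</h2>"
    if line.startswith("#"):
        return "<h1>" + line[1:] + "</h1>"
    return line

def convert(text):
    # Split into lines, convert each one, and join them back together.
    return "\n".join(heading_to_html(line) for line in text.split("\n"))

print(convert("##h2 heading\n#H1 heading\nplain text"))
# <h2>h2 heading</h2>
# <h1>H1 heading</h1>
# plain text
```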
