python-3.x - Skip fenced code blocks when processing a Markdown file line by line

Question

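I am a very inexperienced Python coder, so it is quite possible that I am approaching this problem in entirely the wrong way, but I would appreciate any suggestions or help.

I have a Python script that walks through a Markdown file line by line and rewrites [[wikilinks]] as standard Markdown [wikilink](wikilink) style links. I do this with two regular expressions in a single function, shown below: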

import logging
import re


def modify_links(file_obj):
    """
    Function will parse file contents (opened in utf-8 mode) and modify standalone [[wikilinks]] and in-line
    [[wikilinks]](wikilinks) into traditional Markdown link syntax.

    :param file_obj: Path to file
    :return: List object containing modified text. Newlines will be returned as '\\n' strings.
    """

    file = file_obj
    linelist = []
    linelist_final = None  # Stays None if the file cannot be opened
    logging.debug("Going to open file %s for processing now.", file)
    try:
        with open(file, encoding="utf8") as infile:
            for line in infile:
                linelist.append(re.sub(r"(\[\[)((?<=\[\[).*(?=\]\]))(\]\])(?!\()", r"[\2](\2.md)", line))
                # Finds  references that are in style [[foo]] only by excluding links in style [[foo]](bar).
                # Capture group $2 returns just foo
            # Second pass, run once after every line has been read
            linelist_final = [re.sub(r"(\[\[)((?<=\[\[)\d+(?=\]\]))(\]\])(\()((?!=\().*(?=\)))(\))",
                                     r"[\2](\2 \5.md)", line) for line in linelist]
            # Finds only references in style [[foo]](bar). Capture group $2 returns foo and capture group $5
            # returns bar
    except EnvironmentError:
        logging.exception("Unable to open file %s for reading", file)
    logging.debug("Finished processing file %s", file)
    return linelist_final

This works for most Markdown files. However, I occasionally get a Markdown file that contains [[wikilinks]] inside a fenced code block, for example:

# Reference

Here is a reference to “the Reactome Project” using smart quotes.

Here is an image: ![](./images/Screenshot.png)


[[201802150808]](Product discovery)

```
[[201802150808 Product Prioritization]]

def foo():
    print("bar")

```

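In the case above, I should skip the [[201802150808 Product Prioritization]] inside the fenced code block. I have a regular expression that correctly identifies fenced code blocks, namely: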

(?<=```)(.*?)(?=```)

However, because the existing function runs line by line, I have not found a way to skip an entire section inside the for loop. How can I do this?

score 0 · Accepted Answer

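By making a few changes to my original function, I was able to put together a fairly complete solution to this problem, namely:

Replace Python's built-in re module with the regex module available on PyPI.
Change the function to read the entire file into a single variable instead of reading it line by line.

The modified function is as follows: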

import logging

import regex


def modify_links(file_obj):
    """
    Function will parse file contents (opened in utf-8 mode) and modify standalone [[wikilinks]] and in-line
    [[wikilinks]](wikilinks) into traditional Markdown link syntax.

    :param file_obj: Path to file
    :return: String containing modified text. Newlines will be returned as '\\n' in the string.
    """

    file = file_obj
    linelist_final = None  # Stays None if the file cannot be opened
    try:
        with open(file, encoding="utf8") as infile:
            line = infile.read()
            # Read the entire file as a single string
            linelist = regex.sub(r"(?V1)"
                                 r"(?s)```.*?```(*SKIP)(*FAIL)(?-s)|(?s)`.*?`(*SKIP)(*FAIL)(?-s)"
            #                    Ignore fenced & inline code blocks. V1 engine allows in-line flags so
            #                    we enable newline matching only here.
                                 r"|(\ {4}|\t).*(*SKIP)(*FAIL)"
            #                    Ignore code blocks beginning with 4 spaces/1 tab
                                 r"|(\[\[(.*)\]\](?!\s\(|\())", r"[\3](\3.md)", line)
            # Finds  references that are in style [[foo]] only by excluding links in style [[foo]](bar) or
            # [[foo]] (bar). Capture group $3 returns just foo
            linelist_final = regex.sub(r"(?V1)"
                                       r"(?s)```.*?```(*SKIP)(*FAIL)(?-s)|(?s)`.*?`(*SKIP)(*FAIL)(?-s)"
                                       r"|(\ {4}|\t).*(*SKIP)(*FAIL)"
            #                          Refer comments above for this portion.
                                       r"|(\[\[(\d+)\]\](\s\(|\()(.*)(?=\))\))", r"[\3](\3 \5.md)", linelist)
            # Finds only references in style [[123]](bar) or [[123]] (bar). Capture group $3 returns 123 and capture
            # group $5 returns bar
    except EnvironmentError:
        logging.exception("Unable to open file %s for reading", file)
    return linelist_final

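The function above handles [[wikilinks]] in inline code, fenced code blocks, and code blocks indented by 4 spaces. There is currently one false-positive scenario in which a valid [[wikilink]] is ignored: when the link sits at the third level or deeper of a Markdown list, its line is indented by four or more spaces and is treated as an indented code block, for example: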

* Level 1
  * Level 2
    * [[wikilink]] #Not recognized
      * [[wikilink]] #Not recognized.

However, none of my documents have wikilinks nested that deeply in a list, so this is not a problem for me.
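If installing regex is not an option, roughly the same effect can be had with the standard-library re module: put the code patterns first in the alternation and use a replacement function that returns those matches untouched. The sketch below is an illustration under assumptions, not the function above. The names PATTERN and convert_wikilinks are hypothetical, it rewrites only the standalone [[foo]] form, and it makes no attempt to detect indented code blocks.

```python
import re

# Alternatives are tried left to right at each position, so code is consumed
# before the wikilink branch ever gets a chance to look inside it.
PATTERN = re.compile(
    r"```.*?```"                        # fenced code block (may span lines)
    r"|`[^`\n]*`"                       # inline code span on a single line
    r"|\[\[([^\[\]\n]+)\]\](?!\s?\()",  # standalone [[foo]], target in group 1
    re.DOTALL,
)


def convert_wikilinks(text):
    """Rewrite [[foo]] as [foo](foo.md), leaving fenced and inline code alone."""

    def replace(match):
        target = match.group(1)
        if target is None:
            # One of the code branches matched: put the text back unchanged.
            return match.group(0)
        return f"[{target}]({target}.md)"

    return PATTERN.sub(replace, text)


sample = "See [[foo]] and `[[bar]]`.\n```\n[[baz]]\n```\n[[123]](qux)\n"
print(convert_wikilinks(sample))
```

Returning match.group(0) from the callback plays the role that (*SKIP)(*FAIL) plays in the regex version: the code is matched so that nothing inside it can be matched again, and then it is written back as it was.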

score 0 · Accepted Answer

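You need a full Markdown parser to cover all of the edge cases. Most Markdown parsers, of course, convert Markdown directly to HTML. Some, however, use a two-step process: the first step converts the raw text into an abstract syntax tree (AST), and the second renders the AST to the output format. It is not uncommon to find a Markdown renderer (one that outputs Markdown) that can replace the default HTML renderer.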

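You would then only need to modify the parsing step (using a plugin to add support for the wikilink syntax) or modify the AST directly. Then pass the AST to the Markdown renderer, which gives you a well-formed, normalized Markdown document. If you are looking for a Python solution, Mistune or Pandoc filters might be a good place to start.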

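But why go through all that when you could run a few well-crafted regular expressions over the source text? Because parsing Markdown is complicated. I know, it seems easy at first. After all, Markdown is easy for a human to read (that is one of its defining design goals). The parser, however, is actually quite complex, and parts of it depend on earlier steps.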

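For example, besides fenced code blocks, what about indented code blocks? You cannot simply check for indentation at the start of a line, because a single line of a nested list can look identical to an indented code block. You want to skip the code block, but not the paragraph nested in a list. And what if your wikilink is split across two lines? When parsing inline markup, a Markdown parser generally treats a single line break no differently from a space. The point of all this is that you first need to parse the whole document into its various block-level elements before you start parsing inline elements. Only then can you step through those blocks and parse inline elements such as links.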
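To make the "blocks first, inline second" order concrete, here is a minimal sketch that stays line by line but decides whether each line is inside a fenced block before it touches any links. It is an illustration under assumptions, not a Markdown parser. The names FENCE, WIKILINK, and convert_lines are hypothetical, only the standalone [[foo]] form is rewritten, and indented code blocks, inline code spans, and fences nested inside lists or block quotes are not handled.

```python
import re

# Up to three spaces of indentation, then a run of three or more backticks or tildes.
FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})")
# Standalone [[foo]], i.e. not followed by "(" or by whitespace and "(".
WIKILINK = re.compile(r"\[\[([^\[\]]+)\]\](?!\s?\()")


def convert_lines(lines):
    """Yield each line, rewriting [[foo]] to [foo](foo.md) outside fenced blocks."""
    open_fence = None  # marker that opened the current fenced block, or None
    for line in lines:
        match = FENCE.match(line)
        marker = match.group(1) if match else None
        if open_fence is None:
            if marker:
                # Block-level step: a fence opens, so leave this line alone.
                open_fence = marker
                yield line
            else:
                # Inline step: only reached for lines outside any fenced block.
                yield WIKILINK.sub(r"[\1](\1.md)", line)
        else:
            # A closing fence uses the same character, is at least as long as
            # the opening one, and has nothing else on the line.
            closes = (
                marker is not None
                and marker[0] == open_fence[0]
                and len(marker) >= len(open_fence)
                and line.strip() == marker
            )
            if closes:
                open_fence = None
            yield line
```

Because a file object iterates over its lines, something like "".join(convert_lines(infile)) inside the existing with block would produce the rewritten text. Every limitation listed above still applies, which is exactly the kind of gap a real parser closes.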

I am sure there are other edge cases I have not thought of. The only way to cover them all is to use a full-fledged Markdown parser.
