python - Using BeautifulSoup to parse a document without parsing the contents of <code> tags

Question

score 1 · Accepted Answer

The problem is that <code> is treated according to the normal rules for HTML markup, and content inside <code> tags is still HTML (the tags exist mainly to drive CSS formatting, not to change the parsing rules).

What you are trying to do is create a different markup language that is very similar, but not identical, to HTML. The simple solution would be to assume certain rules, such as, "<code> and </code> must appear on a line by themselves," and do some pre-processing yourself.

A very simple — though not 100% reliable — technique is to replace ^<code>$ with <code><![CDATA[ and ^</code>$ with ]]></code>. It isn't completely reliable, because if the code block contains ]]>, things will go horribly wrong.
A safer option is to replace dangerous characters inside code blocks (<, > and & probably suffice) with their equivalent character entity references (&lt;, &gt; and &amp;). You can do this by passing each block of code you identify to cgi.escape(code_block).

Once you've completed preprocessing, submit the result to BeautifulSoup as usual.
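
A minimal sketch of the escaping variant of this preprocessing, assuming Python 3, where html.escape stands in for cgi.escape (which was removed in Python 3.8). The function name escape_code_blocks is illustrative, not from any library, and the sketch relies on the rule suggested above that <code> and </code> each sit on a line by themselves (surrounding whitespace is tolerated here):

```python
import html


def escape_code_blocks(text):
    """Escape <, > and & on every line between standalone <code> and </code> lines."""
    out = []
    in_code = False
    for line in text.split("\n"):
        marker = line.strip()
        if not in_code and marker == "<code>":
            in_code = True           # opening marker: keep it as real markup
            out.append(line)
        elif in_code and marker == "</code>":
            in_code = False          # closing marker: keep it as real markup
            out.append(line)
        elif in_code:
            # inside a code block: turn markup characters into entities
            out.append(html.escape(line, quote=False))
        else:
            out.append(line)         # ordinary HTML: leave untouched
    return "\n".join(out)
```

The returned string can then be handed to BeautifulSoup like any other HTML. A <code> tag that shares a line with other text is not recognised by this sketch and passes through unchanged.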

score 1 · Accepted Answer

From the Python wiki:

>>> import cgi
>>> cgi.escape("<string.h>")
'&lt;string.h&gt;'

>>> BeautifulSoup('&lt;string.h&gt;',
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
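
For readers on Python 3, a rough standard-library analogue of the round trip above, offered as a sketch rather than a drop-in replacement: html.escape takes the place of cgi.escape (removed in Python 3.8), and html.unescape is used here only to illustrate the entity-to-character conversion that convertEntities requests from BeautifulSoup.

```python
import html

# Escape markup characters so the text survives an HTML parser intact.
escaped = html.escape("<string.h>")

# Convert the entities back into the original characters.
restored = html.unescape(escaped)

print(escaped)
print(restored)
```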

score 0 · Accepted Answer

Edit:

Use python-markdown2 to process the input, and have users indent their code regions.

>>> print html
I like this article, but the third code example <em>could have been simpler</em>:

    #include <stdbool.h>
    #include <stdio.h>

    int main()
    {
        printf("Hello World\n");
    }

>>> import markdown2
>>> marked = markdown2.markdown(html)
>>> marked
u'<p>I like this article, but the third code example <em>could have been simpler</em>:</p>\n\n<pre><code>#include &lt;stdbool.h&gt;\n#include &lt;stdio.h&gt;\n\nint main()\n{\n    printf("Hello World\\n");\n}\n</code></pre>\n'
>>> print marked
<p>I like this article, but the third code example <em>could have been simpler</em>:</p>

<pre><code>#include &lt;stdbool.h&gt;
#include &lt;stdio.h&gt;

int main()
{
    printf("Hello World\n");
}
</code></pre>

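If you still need to navigate and edit it with BeautifulSoup, do the following. Include the entity conversion if you need the '<' and '>' reinserted (instead of '&lt;' and '&gt;').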

>>> soup = BeautifulSoup(marked,
...                      convertEntities=BeautifulSoup.HTML_ENTITIES)
>>> soup
<p>I like this article, but the third code example <em>could have been simpler</em>:</p>
<pre><code>#include <stdbool.h>
#include <stdio.h>

int main()
{
    printf("Hello World\n");
}
</code></pre>


import cgi

def thickened(soup):
    """
    Escape the contents of every <code> tag in the soup, e.g.

    <code>
    blah blah <entity> blah
        blah
    </code>
    """
    codez = soup.findAll('code')  # get the code tags
    for code in codez:
        # take all the contents inside of the code tag and convert
        # them into a single string
        escape_me = ''.join([str(k) for k in code.contents])
        escaped = cgi.escape(escape_me)  # escape <, > and & with cgi
        # replace the Tag object with the escaped string
        code.replaceWith('<code>%s</code>' % escaped)
    return soup

score 0 · Accepted Answer

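Unfortunately, there is no way to stop BeautifulSoup from parsing the code blocks.

One solution to what you want to achieve is to:

1) Remove the code blocks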

soup = BeautifulSoup(unicode(content))
code_blocks = soup.findAll(u'code')
for block in code_blocks:
    block.replaceWith(u'<code class="removed"></code>')

2) Do your usual parsing to strip the disallowed tags.

3) Reinsert the code blocks and regenerate the HTML.

stripped_code = stripped_soup.findAll(u"code", u"removed")
# re-insert pygment formatted code
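
The three steps above could be sketched end to end roughly as follows. This is a standard-library illustration under stated assumptions, not the approach from the linked blog: it uses regular expressions in place of BeautifulSoup, html.escape in place of Pygments formatting, and the names ALLOWED, strip_disallowed and sanitize are hypothetical.

```python
import html
import re

PLACEHOLDER = '<code class="removed"></code>'
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)
ALLOWED = {"p", "em", "strong", "code"}  # hypothetical whitelist of tag names


def strip_disallowed(markup):
    """Stand-in for step 2: drop every tag whose name is not in ALLOWED."""
    def keep(match):
        return match.group(0) if match.group(1).lower() in ALLOWED else ""
    return re.sub(r"</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>", keep, markup)


def sanitize(content):
    blocks = []

    def stash(match):
        blocks.append(match.group(1))   # remember the raw code text
        return PLACEHOLDER

    # 1) remove the code blocks, leaving a placeholder behind
    without_code = CODE_RE.sub(stash, content)
    # 2) strip disallowed tags from everything that is left
    stripped = strip_disallowed(without_code)
    # 3) re-insert the code blocks in order, escaped so they stay literal text
    remaining = iter(blocks)
    return re.sub(
        re.escape(PLACEHOLDER),
        lambda m: "<code>" + html.escape(next(remaining), quote=False) + "</code>",
        stripped,
    )
```

Because the code text is held outside the document during step 2, the tag stripper never sees anything like <stdio.h> and cannot mistake it for markup. Nested <code> tags are not handled by this sketch.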

I would answer with some code, but I recently read a blog post that does this elegantly.

http://iboris.com/page/add-source-code-syntax-highlighting-your-django-content-pygments.html

score 0 · Accepted Answer

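If a <code> element contains unescaped <, & or > characters, then it is not valid HTML. BeautifulSoup will try to convert it into valid HTML, which is probably not what you want.

To convert the text into valid HTML, you can adapt a regex that strips tags from HTML so that it extracts the text from each <code> block and replaces it with its cgi.escape() version. It should work fine as long as there are no nested <code> tags. After that, you can feed the sanitized HTML to BeautifulSoup.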
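
A short sketch of that regex idea, assuming Python 3 (html.escape replaces cgi.escape, which was removed in Python 3.8) and assuming <code> tags carry no attributes and are never nested. The function name escape_code_contents is made up for this example:

```python
import html
import re

# Non-greedy match so that each <code>...</code> pair is handled separately.
CODE_BLOCK = re.compile(r"(<code>)(.*?)(</code>)", re.DOTALL)


def escape_code_contents(markup):
    """Replace the text inside each <code> block with its escaped version."""
    def fix(match):
        opening, body, closing = match.groups()
        return opening + html.escape(body, quote=False) + closing
    return CODE_BLOCK.sub(fix, markup)
```

With nested <code> tags the non-greedy match would stop at the first closing tag, which is why the no-nesting condition matters.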
