python - Processing pieces of data inside an HTML fragment in Python

Question

I'm sure this has been asked before, but I can't find the answer anywhere...

I have a string that is basically part of an HTML page. It looks a lot like this:

body = u'<div class="admonition warning">\n<p class="first admonition-title">Warning</p>\n<p class="last">Read all of this! ALL OF IT!</p>\n</div>\n<div class="section" id="pitfalls-and-common-mistakes">\n<h1>Pitfalls and Common Mistakes<a class="headerlink" href="#pitfalls-and-common-mistakes" title="Permalink to this headline">\xb6</a></h1>\n<p>New and old users alike can run into a pitfall. Below we outline issues that we\nsee frequently as well as explain how to resolve those issues. In the #nginx IRC\nchannel on Freenode, we see these issues frequently.</p>\n<div class="section" id="this-guide-says">\n<h2>This Guide Says<a class="headerlink" href="#this-guide-says" title="Permalink to this headline">\xb6</a></h2>\n<p>The most frequent issue we see happens when someone attempts to just copy and\npaste a configuration snippet from some other guide. Not all guides out there\nare wrong, but a scary number of them are. Even the Linode library has poor\nquality information that some Nginx community members have futily attempted to\ncorrect.</p>\n<p>The Ngx CC Docs were created and reviewed by community members that work\ndirectly with all types of Nginx users. This specific document exists only\nbecause of the volume of common and recurring issues that community members see.</p>\n</div>\n<div class="section" id="my-issue-isn-t-listed">\n<h2>My Issue Isn\'t Listed<a class="headerlink" href="#my-issue-isn-t-listed" title="Permalink to this headline">\xb6</a></h2>\n<p>You don\'t see something in here related to your specific issue. Maybe we didn\'t\npoint you here because of the exact issue you\'re experiencing. Don\'t skim this\npage and assume you were sent here for no reason. You were sent here because\nsomething you did wrong is listed here.</p>\n<p>When it comes to supporting many users on many issues, community members don\'t\nwant to support broken configurations. Fix your configuration before asking for\nhelp. Fix your configuration by reading through this. 
Don\'t just skim it.</p>\n</div>\n<div class="section" id="root-inside-location-block">\n<h2>Root inside Location Block<a class="headerlink" href="#root-inside-location-block" title="Permalink to this headline">\xb6</a></h2>\n<p>BAD</p>\n<div class="highlight-nginx"><pre>server {\n    server_name www.domain.com;\n      location / {\n          root /var/www/nginx-default/;\n          [...]\n      }\n      location /foo {\n          root /var/www/nginx-default/;\n          [...]\n      }\n      location /bar {\n          root /var/www/nginx-default/;\n          [...]\n      }\n}</pre>\n</div>\n<div class="highlight-nginx"><div class="highlight"><pre><span class="k">def</span> <span class="s">greet(name):</span>\n    <span class="s">print</span> <span class="s">&#39;Hello&#39;,</span> <span class="s">name</span>\n\n<span class="s">greet(&#39;Jack&#39;)</span>\n<span class="s">greet(&#39;Jill&#39;)</span>\n<span class="s">greet(&#39;Bob&#39;)</span>\n</pre></div>\n</div>\n'

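Anyway, that's the shortened version.

Inside that block are "<div class="highlight-nginx"><pre>" and "</pre></div>", which will appear many times in the same page. Every time they appear, I want to manipulate the text between them. I already have the function to pass the text through. However, I don't know how to get the text out, run it through the function, and then paste it back into the string while keeping everything else the same.

Any help would be greatly appreciated.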

score 5 · Accepted Answer

You can use an HTML parser such as Beautiful Soup.

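from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(body, 'html.parser')
for div in soup.find_all(class_='highlight-nginx'):
    # .string is None when <pre> holds nested tags (e.g. <span>), so skip those.
    if div.pre is not None and div.pre.string is not None:
        div.pre.string = my_function(div.pre.string)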
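
The second highlighted block in the question wraps its text in <span> tags, and for a <pre> like that .string is None. One possible way to cover both cases is to flatten the contents with get_text() first and then serialize the tree back to a string. This is only a sketch: my_function below is a hypothetical stand-in for the asker's own function, and the sample body is a shortened, made-up fragment in the same shape as the question's.

```python
from bs4 import BeautifulSoup


def my_function(text):
    # Hypothetical stand-in for the asker's own text-processing function.
    return text.upper()


# Shortened sample input (an assumption, modelled on the question's string).
body = (
    '<p>intro</p>\n'
    '<div class="highlight-nginx"><pre>root /var/www;</pre>\n</div>\n'
    '<div class="highlight-nginx"><div class="highlight"><pre>'
    '<span class="k">server</span> <span class="s">{}</span></pre></div>\n</div>\n'
)

soup = BeautifulSoup(body, 'html.parser')
for div in soup.find_all(class_='highlight-nginx'):
    pre = div.find('pre')  # searches all descendants, not just direct children
    if pre is None:
        continue
    # get_text() flattens nested tags such as <span>; assigning to .string
    # then replaces every child of <pre> with a single text node.
    pre.string = my_function(pre.get_text())

# Serialize the modified tree back into a string.
result = str(soup)
print(result)
```

Note that this discards the <span> markup inside the processed <pre>, which may or may not be acceptable depending on what the function is meant to do.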

score 0 · Answer

What you want is re.findall() combined with a non-greedy regular expression.

Try this (note: this is untested):

import re

your_new_text = your_text = '<div class="highlight-nginx"><pre>whatever is inbetween here</pre></div><div class="highlight-nginx"><pre>some more text to change</pre></div><div class="highlight-nginx"><pre>whatever is inbetween here</pre></div>'

pre_text = '<div class="highlight-nginx"><pre>'
post_text = '</pre></div>'
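# re.DOTALL lets "." match newlines, since <pre> blocks usually span several lines.
regex = re.compile(r'{pre_text}(.*?){post_text}'.format(
    pre_text=re.escape(pre_text), post_text=re.escape(post_text)), re.DOTALL)
# Find all the matches of our regular expression above
list_of_matches = regex.findall(your_text)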

for text in list_of_matches:
    # We look for an exact match, including the pre and post tags, so we don't perform
    # the wrong substitution later on.
    old_text = '{pre_text}{old_string}{post_text}'.format(
        pre_text=pre_text,
        old_string=text,
        post_text=post_text)

    new_text = '{pre_text}{manipulated_text}{post_text}'.format(
        pre_text=pre_text,
        manipulated_text=manipulate_text(text),
        post_text=post_text)

    # We have the old strings and we now replace them with the new strings.
    your_new_text = your_new_text.replace(old_text, new_text)

print(your_new_text)
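
The same idea can probably be done in a single pass by giving re.sub() a callback, which avoids the separate findall/replace steps. The following is a minimal sketch, not a tested drop-in: manipulate_text is a hypothetical stand-in for the asker's function, the sample text is made up, and the pattern allows optional whitespace between </pre> and </div> because the string in the question has a newline there.

```python
import re


def manipulate_text(text):
    # Hypothetical stand-in for the asker's own function.
    return text.upper()


# Group 1: opening tags, group 2: text to change, group 3: closing tags.
# \s* tolerates a newline between </pre> and </div>.
pattern = re.compile(
    r'(<div class="highlight-nginx"><pre>)(.*?)(</pre>\s*</div>)',
    re.DOTALL,  # let "." match newlines inside the <pre> block
)


def replace_block(match):
    # Keep the surrounding tags untouched and transform only the inner text.
    return match.group(1) + manipulate_text(match.group(2)) + match.group(3)


# Made-up sample input for illustration.
your_text = (
    '<p>keep me</p>'
    '<div class="highlight-nginx"><pre>first\nblock</pre>\n</div>'
    '<div class="highlight-nginx"><pre>second block</pre></div>'
)

your_new_text = pattern.sub(replace_block, your_text)
print(your_new_text)
```

Because each match is rewritten where it is found, two blocks with identical text cannot interfere with each other. A pattern this literal will not match blocks with extra wrapper tags between the <div> and the <pre>, which is one reason an HTML parser tends to be the more robust choice.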
