python - 用正则表达式替换字符串内容

Question

我正在尝试删除我从网页中寻找的数据周围的所有 html，这样剩下的就是我可以输入数据库的原始数据。所以如果我有类似的东西：

<p class="location"> Atlanta, GA </p>

以下代码将返回

Atlanta, GA </p>

但我所期望的不是返回的。这是我在这里找到的基本问题的更具体的解决方案。任何帮助将不胜感激，谢谢！代码如下。

def delHTML(self, html):
    """
    html is a list made up of items with data surrounded by html
    this function should get rid of the html and return the data as a list
    """

    for n,i in enumerate(html):
        if i==re.match('<p class="location">',str(html[n])):
            html[n]=re.sub('<p class="location">', '', str(html[n]))

    return html

score 2 · Accepted Answer

正如评论中正确指出的那样，您应该使用特定的库来解析 HTML 和提取文本，以下是一些示例：

html2text：功能有限，但正是您所需要的。
BeautifulSoup：更复杂，更强大。

score 0 · Accepted Answer

假设您只想提取<p class="location">标签中包含的数据，您可以使用 PythonHTMLParser模块（一个简单的 HTML SAX 解析器）的快速且肮脏（但正确）的方法，如下所示：

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    PLocationID=0
    PCount=0
    buf=""
    out=[]

    def handle_starttag(self, tag, attrs):
        if tag=="p":
            self.PCount+=1
            if ("class", "location") in attrs and self.PLocationID==0:
                self.PLocationID=self.PCount

    def handle_endtag(self, tag):
        if tag=="p":
            if self.PLocationID==self.PCount:
                self.out.append(self.buf)
                self.buf=""
                self.PLocationID=0
            self.PCount-=1

    def handle_data(self, data):
        if self.PLocationID:
            self.buf+=data

# instantiate the parser and fed it some HTML
parser = MyHTMLParser()
parser.feed("""
<html>
<body>
<p>This won't appear!</p>
<p class="location">This <b>will</b></p>
<div>
<p class="location">This <span class="someclass">too</span></p>
<p>Even if <p class="location">nested Ps <p class="location"><b>shouldn't</b> <p>be allowed</p></p> <p>this will work</p></p> (this last text is out!)</p>
</div>
</body>
</html>
""")
print parser.out

输出：

['This will', 'This too', "nested Ps shouldn't be allowed this will work"]

这将提取任何<p class="location">标签中包含的所有文本，剥离其中的所有标签。单独的标签（如果不是嵌套的 - 无论如何都不应该允许用于段落）将在out列表中具有单独的条目。

请注意，对于更复杂的需求，这很容易失控；在这些情况下，DOM 解析器更合适。

python - 用正则表达式替换字符串内容

2 回答 2

Related

Reference