>>> BeautifulSoup('<span>this is a</span>cat').text
u'this is acat'
>>> BeautifulSoup('Spelled f<b>o</b>etus in British English with extra "o"').text
u'Spelled foetus in British English with extra "o"'

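When some tags are parsed, a space should be left between them (as in the `acat` result above). What is a good way to make sure the parser puts spaces where they make sense? I am trying to convert emails to text.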


2 Answers


Edit based on comments:

BeautifulSoup supports the first example. All you have to do is:

BeautifulSoup('<span>this is a</span>cat').get_text(" ")

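This joins the text of the two elements with a space between them. It is covered in the Beautiful Soup documentation for `get_text()`.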
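To see what the separator does on both strings from the question, here is a minimal sketch. It assumes Python 3 with the `bs4` package installed, and it passes `"html.parser"` explicitly (my choice, not something from the question) so the result does not depend on which parser happens to be installed.

```python
from bs4 import BeautifulSoup

# Both inputs are taken from the question.
html1 = '<span>this is a</span>cat'
html2 = 'Spelled f<b>o</b>etus in British English with extra "o"'

# get_text(separator) places the separator between every pair of text nodes.
first = BeautifulSoup(html1, "html.parser").get_text(" ")
second = BeautifulSoup(html2, "html.parser").get_text(" ")

print(first)   # this is a cat
print(second)  # Spelled f o etus in British English with extra "o"
```

The separator is inserted around every tag, including inline ones such as `<b>`, which is why this only covers the first example.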

Answered 2018-11-16T12:09:41.587

Never mind, I was wrong:

from bs4 import BeautifulSoup

def grab(soup):
    # soup.body.contents is a list of the top-level nodes inside <body>:
    #   [<span>this is a</span>, 'cat']
    #   [<p>Spelled f<b>o</b>etus in British English with extra "o"</p>]
    # i.string is the text of a node, similar to .text, except that it is
    # None when the node has more than one child (for example, nested tags).
    # str() (unicode() on Python 2) turns the bs4 NavigableString into a plain
    # string for ' '.join(). Per the Beautiful Soup docs, a NavigableString
    # otherwise keeps a reference to the entire parse tree, which wastes memory.
    return ' '.join(str(i.string) for i in soup.body.contents)

# The "lxml" parser is named explicitly here (an assumption; it requires lxml).
# It adds the <body> wrapper, and the <p> shown above, that grab() relies on.
soup1 = BeautifulSoup('<span>this is a</span>cat', 'lxml')
soup2 = BeautifulSoup('Spelled f<b>o</b>etus in British English with extra "o"', 'lxml')
soups = [soup1, soup2]  # a list of the soups
for i in soups:
    result = grab(i)  # either 'None' or the correct string with a space
    if result == 'None':  # the text had a tag nested inside it (like your second example)
        print(i.text)
    else:
        print(result)  # the result with a space

Prints:

this is a cat
Spelled foetus in British English with extra "o"
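For input with more nesting than these two strings, a different technique may be easier to extend: walk the tree and add a space only around tags that are not plain text formatting. The sketch below is not the approach above and not a Beautiful Soup feature. The names `GLUE_TAGS`, `text_with_spaces`, and `html_to_text` are hypothetical, the tag list is a guess you would tune for your emails, and `"html.parser"` is assumed so no extra parser is needed.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical list: tags whose text is glued to its neighbours without a space.
GLUE_TAGS = {"b", "i", "em", "strong", "u", "sub", "sup"}

def text_with_spaces(node):
    """Collect text under node, padding every non-glue tag with spaces."""
    parts = []
    for child in node.children:
        if isinstance(child, Comment):
            continue  # HTML comments are not visible text
        if isinstance(child, NavigableString):
            parts.append(str(child))
        elif isinstance(child, Tag):
            inner = text_with_spaces(child)
            if child.name in GLUE_TAGS:
                parts.append(inner)  # e.g. f<b>o</b>etus stays "foetus"
            else:
                parts.append(" " + inner + " ")  # e.g. <span>, <p>, <div>
    return "".join(parts)

def html_to_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # Collapse the runs of whitespace introduced by the padding.
    return " ".join(text_with_spaces(soup).split())

print(html_to_text('<span>this is a</span>cat'))
print(html_to_text('Spelled f<b>o</b>etus in British English with extra "o"'))
```

On the two strings from the question this gives the same two lines as above. Note that `<span>` is deliberately left out of `GLUE_TAGS` because the question wants a space after it, which may not suit every document.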
Answered 2013-05-27T06:41:17.750