python - Python: simple substring/parsing

Question

I have a string like this:

 <img src="http://www.askgamblers.com/cache/97299a130feb2e59a08a08817daf2c0e6825991f_begado-casino-logo-review1.jpg" /><br/>
 Begado is the newest online casino in our listings. As the newest
 member of the Affactive group, Begado features NuWorks slots and games
 for both US and international players.
<img src="http://feeds.feedburner.com/~r/AskgamblesCasinoNews/~4/SXhvCskjiYo" height="1" width="1"/>

I need to get the src from the first img tag.

Is there an easy way to do this?

score 4 · Accepted Answer

For HTML screen scraping in Python, I recommend the Beautiful Soup library.

from bs4 import BeautifulSoup

# html_doc is assumed to hold the HTML string from the question
soup = BeautifulSoup(html_doc, 'html.parser')
images = soup.find_all('img')
print(images[0]['src'])
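If installing a third-party package is not an option, the same idea (use a real HTML parser instead of string tricks) can be sketched with the standard library's html.parser module. The class name ImgSrcCollector, the helper first_img_src, and the shortened sample string are illustrative assumptions, not part of the answer above.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        # Self-closing tags such as <img ... /> also end up here by default.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.srcs.append(src)


def first_img_src(html):
    """Return the src of the first <img> tag, or None if there is none."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.srcs[0] if collector.srcs else None


# Shortened, hypothetical stand-in for the string in the question
sample = (
    '<img src="http://example.com/logo.jpg" /><br/> Some text. '
    '<img src="http://example.com/pixel" height="1" width="1"/>'
)
print(first_img_src(sample))
```

This is more code than the Beautiful Soup version, and html.parser is less forgiving of badly broken markup, so it is only a fallback.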

score 2 · Accepted Answer

Obligatory "don't parse HTML with regex" warning: https://stackoverflow.com/a/1732454/505154

Evil regex solution:

import re
re.findall(r'<img\s*src="([^"]*)"\s*/>', text)

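This returns a list containing the src attribute of every <img> tag that has src as its only attribute (since you said you only want to match the first one).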
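To make that restriction concrete, here is a small sketch that runs the strict pattern against a shortened, hypothetical stand-in for the question's string, next to a looser pattern that accepts src anywhere among the attributes. The looser pattern is an assumption added for comparison, not something the answer proposes, and it is still regex-on-HTML with all the usual caveats.

```python
import re

# Shortened, hypothetical stand-in for the string in the question
text = (
    '<img src="http://example.com/logo.jpg" /><br/>\n'
    'Some text.\n'
    '<img src="http://example.com/pixel" height="1" width="1"/>'
)

# Strict pattern from the answer: only tags whose sole attribute is src
strict = re.findall(r'<img\s*src="([^"]*)"\s*/>', text)

# Looser pattern (assumption): src may appear anywhere inside an <img> tag
loose = re.findall(r'<img\b[^>]*?\bsrc="([^"]*)"', text)

print(strict)  # only the first tag matches
print(loose)   # both tags match
```

With the strict pattern the second tag is skipped because height and width follow src, which happens to give exactly the first image here.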

score 0 · Accepted Answer

One approach is to use a regex.

Another approach is to split the string on double quotes and take the second element of the result.

splits = your_string.split('"')
print(splits[1])
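The split trick only works when the first double-quoted value in the string is the src, and indexing fails with an IndexError when the string contains no quotes. A minimal sketch with a guard, where the helper name first_quoted and the sample strings are hypothetical:

```python
def first_quoted(s):
    """Return the text between the first pair of double quotes, or None."""
    parts = s.split('"')
    # A complete quoted value needs an opening and a closing quote,
    # which yields at least three pieces after the split.
    return parts[1] if len(parts) >= 3 else None


your_string = '<img src="http://example.com/logo.jpg" /><br/> Some text.'
print(first_quoted(your_string))

# If another attribute comes first, its value is returned instead of the URL
print(first_quoted('<img alt="logo" src="http://example.com/logo.jpg" />'))
```

The second call shows the weakness: attribute order decides what comes back.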

score 0 · Accepted Answer

Here is a quick and ugly way to do it without any library:

"""
    >>> get_src(data)
    ['http://www.askgamblers.com/cache/97299a130feb2e59a08a08817daf2c0e6825991f_begado-casino-logo-review1.jpg', 'http://feeds.feedburner.com/~r/AskgamblesCasinoNews/~4/SXhvCskjiYo']
"""

data = """<img src="http://www.askgamblers.com/cache/97299a130feb2e59a08a08817daf2c0e6825991f_begado-casino-logo-review1.jpg" /><br/>
 Begado is the newest online casino in our listings. As the newest
 member of the Affactive group, Begado features NuWorks slots and games
 for both US and international players.
<img src="http://feeds.feedburner.com/~r/AskgamblesCasinoNews/~4/SXhvCskjiYo" height="1" width="1"/>"""

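def get_src(text):
    srcs = []
    # Use the argument, not the global data variable
    for line in text.splitlines():
        start = line.find('src="')
        if start == -1:
            continue  # no src attribute on this line
        i = start + len('src="')
        f = line.find('"', i)
        if f != -1:
            srcs.append(line[i:f])
    return srcs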

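However, I would recommend Beautiful Soup, a very good library designed to handle the real web (broken HTML and all). Or, if your data is valid XML, you can use ElementTree from the Python standard library.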
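A minimal sketch of the ElementTree route, assuming the fragment is well-formed XML once it is wrapped in a single root element. The <root> wrapper and the shortened sample string are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the string in the question
fragment = (
    '<img src="http://example.com/logo.jpg" /><br/>\n'
    'Some text.\n'
    '<img src="http://example.com/pixel" height="1" width="1"/>'
)

# XML requires exactly one root element, so wrap the fragment in a made-up tag
root = ET.fromstring("<root>" + fragment + "</root>")

# ".//img" finds the first img element at any depth, or returns None
first_img = root.find(".//img")
first_src = first_img.get("src") if first_img is not None else None
print(first_src)
```

ET.fromstring raises xml.etree.ElementTree.ParseError on input that is not well-formed (an unescaped & in a URL is enough), which is why this only suits data known to be valid XML.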
