python - python match image tags from large content string using regular expressions

Question

I'm really a noob with regular expressions. I tried to do this on my own, but I couldn't understand from the manuals how to approach it. I'm trying to find all img tags in a given piece of content. I wrote the code below, but it's returning None:

    content = i.content[0].value
    prog = re.compile(r'^<img')
    result = prog.match(content)
    print(result)

Any suggestions?

score 1 · Accepted Answer

A multi-purpose solution:

import re

image_re = re.compile(r"""
    (?P<img_tag><img)\s+    # tag starts
    [^>]*?                  # other attributes
    src=                    # start of src attribute
    (?P<quote>["'])?        # optional open quote
    (?P<image>[^"'>]+)      # image file name
    (?(quote)(?P=quote))    # close quote, required only if one was opened
    [^>]*?                  # other attributes
    >                       # end of tag
    """, re.IGNORECASE | re.VERBOSE)  # re.VERBOSE allows a readable regex with comments

image_tags = []
for match in image_re.finditer(content):
    # group(0) is the whole matched tag; group("img_tag") alone is just "<img"
    image_tags.append(match.group(0))

# print the image tags that were found
for image_tag in image_tags:
    print(image_tag)

As you can see in the regex definition, it contains

(?P<group_name>regex)

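This lets you access the found groups by group_name instead of by number, which is there for readability. So, if you want to show the src attribute of every img tag, just write: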

image_tags = []  # start from an empty list so only src values are collected
for match in image_re.finditer(content):
    image_tags.append(match.group("image"))

After this, the image_tags list will contain the src of each image tag.
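
To make the named-group idea concrete, here is a small self-contained sketch. The sample string and the simplified one-line pattern are made up for this demo (it is not the pattern above, and it requires quoted src values), so treat it as an illustration rather than a drop-in solution.

```python
import re

# Hypothetical sample input, invented for this demo.
sample = '<p>hi</p><img src="a.png" alt="A"><IMG class="x" src=\'b.jpg\'>'

# Simplified pattern: group 1 is the quote character, group 2 is the src value.
src_re = re.compile(
    r"""<img\s[^>]*?src=(?P<quote>["'])(?P<image>[^"'>]+)(?P=quote)[^>]*>""",
    re.IGNORECASE,
)

by_name = [m.group("image") for m in src_re.finditer(sample)]
by_number = [m.group(2) for m in src_re.finditer(sample)]  # same group, by position
full_tags = [m.group(0) for m in src_re.finditer(sample)]  # entire matched tag

print(by_name)
print(full_tags)
```

Both by_name and by_number end up with the same values; the named form just survives if you later add or reorder groups.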

Also, if you need to parse HTML, there are tools designed specifically for that purpose. One example is lxml, which uses XPath expressions.
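
As a rough sketch of the parser-based approach: lxml is a third-party package, so to keep this runnable without installing anything I have swapped in the standard library's html.parser instead. It does not use XPath, but the idea is the same (let a real HTML parser find the tags). The class name ImgSrcCollector and the sample string are made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names;
        # attrs is a list of (name, value) pairs.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


# Hypothetical sample input, invented for this demo.
sample = '<div><img src="a.png" alt="A"><IMG SRC=b.jpg></div>'

collector = ImgSrcCollector()
collector.feed(sample)
collector.close()
print(collector.sources)
```

Unlike the regex, this copes with unquoted attributes and mixed case without any extra work in the pattern.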

score 0 · Accepted Answer

I don't know Python, but assuming it uses normal Perl-compatible regular expressions...

You probably want to look for "<img[^>]+>", that is: "<img", followed by anything that is not ">", followed by ">". Each match should give you a complete image tag.
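
In Python that could look roughly like the sketch below (the sample content is made up). It also shows why the code in the question printed None: re.match only tries to match at the very start of the string, so it fails unless the content begins with "<img", whereas re.findall scans the whole string.

```python
import re

# Hypothetical sample input, invented for this demo.
content = '<p>intro</p><img src="a.png"><p>text</p><img src="b.png" alt="B">'

# re.match only tries at position 0, so this is None unless content starts with "<img".
print(re.match(r"<img", content))

# re.findall scans the whole string and returns every non-overlapping match.
tags = re.findall(r"<img[^>]+>", content, re.IGNORECASE)
print(tags)
```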
