
I need to find the images in a page's HTML source. I'm using regex instead of html.parser because I know it better, but if you can explain HTML parsing to me the way you would to a child, I'll happily go down that road too.

I can't use BeautifulSoup. I wish I could, but I've got to learn to do this the hard way.

I've read through a lot of questions and answers on here on regex and html (example) so I'm aware of the feelings on this topic.

But hear me out!

Here's my coding attempt (Python 3):

import urllib.request
import re

website = urllib.request.urlopen('http://google.com')
html = website.read()
pat = re.compile (r'<img [^>]*src="([^"]+)')
img = pat.findall(html)

I double-checked my regex on regex101.com and it does find the img link, but when I run it in IDLE I get a syntax error, and it keeps highlighting the caret. Why?

I'm headed in the right direction... yes?

Update: Hi, I was thinking I might get a short, quick answer, but it seems I may have touched a nerve in the community.

I am definitely new and terrible at programming, no way around that. I've been reading all the comments and I really appreciate all the help and patience users have shown me.


3 Answers


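There is nothing wrong with the regular expression itself. You are missing two things:

  1. Python has no regex literal type, so you have to wrap the pattern in a string. Use a raw string so that it is passed to the regex compiler as-is, with no escape processing.
  2. The result of calling .read() is a byte sequence, not a string, so you need a byte-sequence regex.

The second point is specific to Python 3 (I see you are using Py3).

Putting it together, just fix the line above like this: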

pat = re.compile (rb'<img [^>]*src="([^"]+)')

r stands for raw and b for byte sequence.
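To see the str/bytes mismatch without touching the network, here is a minimal sketch. The sample markup and the variable names are made up for illustration, and decoding as UTF-8 is an assumption about the page's encoding, not something the original page guarantees.

```python
import re

# Hypothetical sample standing in for the bytes returned by website.read()
html_bytes = b'<p><img alt="logo" src="http://example.com/logo.png"></p>'

str_pat = re.compile(r'<img [^>]*src="([^"]+)')     # str pattern
bytes_pat = re.compile(rb'<img [^>]*src="([^"]+)')  # bytes pattern

# In Python 3, a str pattern cannot be applied to bytes: it raises TypeError
try:
    str_pat.findall(html_bytes)
    str_pattern_failed = False
except TypeError:
    str_pattern_failed = True

# Option 1: bytes pattern on bytes, results come back as bytes
bytes_result = bytes_pat.findall(html_bytes)

# Option 2: decode first (UTF-8 assumed), then use the str pattern
text_result = str_pat.findall(html_bytes.decode('utf-8'))

print(str_pattern_failed, bytes_result, text_result)
```

Either option works; decoding first is usually more convenient if you want to treat the URLs as ordinary strings afterwards.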

Also, test against a site that actually embeds images in <img> tags, such as http://stackoverflow.com. With http://google.com you will not find anything.

Here we go:

Python 3.3.2+
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.request
>>> import re
>>> website = urllib.request.urlopen('http://stackoverflow.com/')
>>> html = website.read()
>>> pat = re.compile (rb'<img [^>]*src="([^"]+)')
>>> img = pat.findall(html)
>>> img
[b'http://i.stack.imgur.com/tKsDb.png', b'http://i.stack.imgur.com/dmHl0.png', b'http://i.stack.imgur.com/dmHl0.png', b'http://i.stack.imgur.com/tKsDb.png', b'http://i.stack.imgur.com/6QN0y.png', b'http://i.stack.imgur.com/tKsDb.png', b'http://i.stack.imgur.com/L8rHf.png', b'http://i.stack.imgur.com/tKsDb.png', b'http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif']
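Since the question also asks about html.parser, here is a minimal sketch of that route using only the standard library. The class name ImgSrcCollector and the sample markup are hypothetical, chosen just for this illustration. The idea is that the parser calls handle_starttag for every opening tag it sees, so you only have to pick out the img tags and read their src attribute.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):  # hypothetical helper name
    """Collects the src attribute of every <img> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # tag arrives lowercased; attrs is a list of (name, value) pairs
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.sources.append(value)


# Made-up sample markup: mixed case and single quotes are both handled
sample_html = ('<div><IMG alt="a" src="http://example.com/a.png">'
               "<img src='/b.gif'></div>")

collector = ImgSrcCollector()
collector.feed(sample_html)
print(collector.sources)
```

For a real page you would pass the decoded text of the response to feed() instead of sample_html.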
answered 2013-10-20T13:09:18.513

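Instead of urllib, I use requests, which you can download here. They do the same thing; I just like requests better because it has a nicer API. The regex string is only slightly changed: I added \s* to allow for whitespace between < and img. You are headed in the right direction. You can find more information about the re module here.

Here is the code: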

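import requests
import re

# Fetch the page; .text is the response body already decoded to str
website = requests.get('http://stackoverflow.com/')
html = website.text
# \s* allows optional whitespace between '<' and 'img'
pat = re.compile(r'<\s*img [^>]*src="([^"]+)')
img = pat.findall(html)

# print with parentheses runs on both Python 2 and Python 3
print(img)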

And the output (the u prefixes come from running it under Python 2):

[u'http://i.stack.imgur.com/tKsDb.png', u'http://i.stack.imgur.com/L8rHf.png', u'http://i.stack.imgur.com/tKsDb.png', u'http://i.stack.imgur.com/Ryr18.png', u'http://i.stack.imgur.com/ASf0H.png', u'http://i.stack.imgur.com/tKsDb.png', u'http://i.stack.imgur.com/tKsDb.png', u'http://i.stack.imgur.com/tKsDb.png', u'http://i.stack.imgur.com/Ryr18.png', u'http://i.stack.imgur.com/VgvXl.png', u'http://i.stack.imgur.com/tKsDb.png', u'http://i.stack.imgur.com/tKsDb.png', u'http://i.stack.imgur.com/tKsDb.png', u'http://i.stack.imgur.com/tKsDb.png', u'http://i.stack.imgur.com/6QN0y.png', u'http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif']
answered 2013-10-20T12:56:40.287

re.compile (r'<img [^>]*src="([^"]+)')

You are missing quotes (single or double) around the pattern.

answered 2013-10-20T12:40:34.900