
I wrote a simple Python script that connects to a particular website and fetches all the links on it.

import urllib2
import re


request = urllib2.urlopen('http://www.securitytube.net/')
content = request.read()
match = re.findall(r'<a href=".\w+.\d+">.+</a>', content)
if match:
    for i in match:
        print i + "\n"

else:
    print 'Not Found!'

Result:

<a href="/video/3878"><img class="corner iradius20  ishadow33" width="100" heigh
t="75" src="http://videothumbs.securitytube.net.s3.amazonaws.com/3878.jpg"  alt=
"avatar" /></a>

<a href="/video/3878">NodeZero Linux Review</a>

<a href="/video/3877"><img class="corner iradius20  ishadow33" width="100" heigh
t="75" src="http://videothumbs.securitytube.net.s3.amazonaws.com/3877.jpg"  alt=
"avatar" /></a>

<a href="/video/3877">Post Attack Uploading Shell in Real Time</a>

<a href="/video/3867"><img class="corner iradius20  ishadow33" width="100" heigh
t="75" src="http://videothumbs.securitytube.net.s3.amazonaws.com/3867.jpg"  alt=
"avatar" /></a>

<a href="/video/3867">Using SQLMAP in Real Time (SQLinjection WEB)</a>

<a href="/video/3866"><img class="corner iradius20  ishadow33" width="100" heigh
t="75" src="http://videothumbs.securitytube.net.s3.amazonaws.com/3866.jpg"  alt=
"avatar" /></a>
....
...
...

I am trying to get only the links that have a readable title, for example <a href="/video/3867">Using SQLMAP in Real Time (SQLinjection WEB)</a>.

My pattern is: <a href=".\w+.\d+">.+</a>


2 Answers


If you really want to use regular expressions instead of a proper parser, you can match with groups and access them afterwards.

See http://docs.python.org/library/re.html

(...)

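Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed

Try: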
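To illustrate how groups are retrieved, here is a minimal sketch. It uses Python 3 syntax (the print function) and a hypothetical one-line sample string in place of the downloaded page; the pattern shown is only one possible choice, not the one the rest of this answer uses.

```python
import re

# Hypothetical sample markup standing in for the downloaded page.
sample = '<a href="/video/3867">Using SQLMAP in Real Time (SQLinjection WEB)</a>'

# Two capturing groups: the href value and the text between the tags.
pattern = re.compile(r'<a href="([^"]+)">([^<]+)</a>')

# After a successful match, each group is available by its number.
m = pattern.search(sample)
link, title = m.group(1), m.group(2)

# With more than one group, findall returns a list of tuples, one item per group.
pairs = pattern.findall(sample)

print(link, title)
print(pairs)
```

Because findall yields one tuple per match, the result can be unpacked directly in a for loop.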

import re
import urllib2

request = urllib2.urlopen('http://www.securitytube.net/')
content = request.read()
# Group 1 captures the href value, group 2 the text just before </a>
match = re.findall(r'<a href="(.*?)".*>(.*)</a>', content)
if match:
    for link, title in match:
        print "link %s -> %s" % (link, title)

This outputs:

link /video/3822 -> SecurityTube SpeakUp: Cloud Computing
link /video/3587 -> 
link /video/3587 -> Securitytube Speak Up: Antivirus Evasion attacks
link /video/3489 -> 
link /video/3489 -> SecurityTube SpeakUp: ThePirateBay LOSS
link /video/3375 -> 
link /video/3375 -> SecurityTube SpeakUp: .COM and .NET Domain Seizures
link /video/3130 -> 
link /video/3130 -> SecurityTube Speak Up: The MS12-020 Fiasco!
...etc

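You can of course filter the results so that only links with a non-empty title are kept. You will also want to discard links that start with #. As you can see, a proper parser would give you better results.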
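As a rough sketch of the parser route, the standard library's html.parser module can collect link text while skipping image-only links and # anchors. This is written for Python 3 (in Python 2 the module is named HTMLParser); the LinkCollector class and the sample markup are hypothetical names and data made up for illustration, not taken from the site.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has visible text."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            # Skip image-only links (no text) and in-page anchors such as "#top".
            if title and not self._href.startswith("#"):
                self.links.append((self._href, title))
            self._href = None


# Hypothetical sample markup; in the real script this would be the page content.
sample = """
<a href="/video/3878"><img width="100" height="75" src="3878.jpg" alt="avatar" /></a>
<a href="/video/3878">NodeZero Linux Review</a>
<a href="#top">Back to top</a>
"""

collector = LinkCollector()
collector.feed(sample)
for link, title in collector.links:
    print("link %s -> %s" % (link, title))
```

The parser tracks tag boundaries itself, so attribute order, extra attributes, and nested tags such as <img> do not need to be anticipated in a pattern.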

answered 2012-04-20T15:35:06.233

Never parse HTML with regular expressions. ;-)

But to help you improve your regex-fu for future non-HTML work, here are two places where your regex falls short:

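  • .\w+.\d+ accepts only a very narrow href shape, and it matches the / characters in /video/3877 only because an unescaped . matches any character. Try "[^"]+" for the whole attribute value.
  • .+ matches as many characters as possible, so it can run past the first </a> on a line. Match as few as possible instead: .+?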
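The difference between the greedy and the lazy quantifier can be seen in a small sketch. It assumes Python 3 and a hypothetical sample line that holds two links, which is where the greedy form goes wrong.

```python
import re

# Hypothetical one-line sample with two links on the same line.
sample = '<a href="/video/3877">Post Attack</a> <a href="/video/3867">Using SQLMAP</a>'

# Greedy: .+ runs to the last </a> on the line, so both links merge into one match.
greedy = re.findall(r'<a href="[^"]+">.+</a>', sample)

# Lazy: .+? stops at the first </a>, giving one match per link.
lazy = re.findall(r'<a href="[^"]+">.+?</a>', sample)

print(len(greedy), greedy)
print(len(lazy), lazy)
```

When every link sits on its own line the two forms give the same result, because . does not match a newline by default.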
answered 2012-04-20T15:26:38.590