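I'm fairly new to Python, so things like this don't come easily to me.
I just want to loop through the contents of a web page and, for now, print each occurrence to the console window, but my loop is obviously wrong.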
# Python 2 (urllib2, print statement)
import sys
import re
import urllib2
import urlparse

# tocrawl is defined elsewhere; it holds the URLs still to be visited
crawling = tocrawl.pop()
response = urllib2.urlopen(crawling)
msg = response.read()

endDiv = msg.find('</div>')
while endDiv != -1:
    endDiv = msg.find('</div>')
    # name: the link text inside the "facultyname" div
    startPos = msg.find('class="facultyname">', endDiv)
    if startPos != -1:
        nextPos = msg.find('.php">', startPos)
        endPos = msg.find('</a>', nextPos)
        if endPos != -1:
            name = msg[nextPos+6:endPos]
            print name, " ",
    # email: the value assigned to b inside escramble()
    startPos = msg.find('function escramble()')
    if startPos != -1:
        nextPos = msg.find('b=', startPos)
        endPos = msg.find('c', nextPos)
        if endPos != -1:
            email = msg[nextPos+3:endPos-1]
            email = email[:-13] + '@email.com'
            print email
    endDiv = msg.find('</div>', endPos)
I've managed to grab the first occurrence; I just want to loop to the end of the page and collect the rest.
Sample HTML:
<div id="main-text">
<p class="title">Research Scientists</p>
<div class="space"> </div>
<img src="photos/icons/bastolaicon.jpg" class="faculty" width="53" height="71" alt="Bastola Photo" />
<div class="facultyname">
<strong><a href="people/bastola.php">person1</a>
<br /><em>Post-Doctoral Scientist</em></strong>
<br />
</div>
<div class="facultybody">
Rm. 218A
<br /><em><script type="text/javascript">
<!--
function escramble(){
var a,b,c,d,e,f,g,h,i
a='<a href=\"mai'
b='person1'
c='\">'
a+='lto:'
b+='@'
e='</a>'
f=''
b+='email.com'
g='<img src=\"'
h=''
i='\" alt="Email us." border="0">'
if (f) d=f
else if (h) d=g+h+i
else d=b
document.write(a+b+c+d+e)
}
escramble()
//-->
</script></em>
</div>
<div class="space"> </div>
<img src="photos/icons/person2icon.jpg" class="faculty" width="53" height="71" alt="person2 Photo" />
<div class="facultyname">
<strong><a href="people/person2.shtml">person2</a>
<br /><em>Assistant Research Scientist</em></strong>
<br />
</div>
<div class="facultybody">
Rm. 227
<br />(850) 645-1253
<br /><em><script type="text/javascript">
<!--
function escramble(){
var a,b,c,d,e,f,g,h,i
a='<a href=\"mai'
b='person2'
c='\">'
a+='lto:'
b+='@'
e='</a>'
f=''
b+='email.com'
g='<img src=\"'
h=''
i='\" alt="Email us." border="0">'
if (f) d=f
else if (h) d=g+h+i
else d=b
document.write(a+b+c+d+e)
}
escramble()
//-->
</script></em>
</div>
<div class="spacer"> </div>
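
For what it's worth, here is one way the loop could be restructured. It is a minimal sketch, not a definitive fix. The main idea is to carry a single position forward through the page, so that every search starts where the previous match ended instead of going back to the start of `msg`. Assumptions: this is written for Python 3 and takes the page as an already-decoded string (fetching is left out); `extract_people`, `NAME_RE`, `LOCAL_RE` and `SAMPLE` are names made up for this illustration; the patterns are based only on the sample HTML above and may need adjusting for the real page.

```python
import re

# Assumed patterns, derived from the sample HTML only.
# Link text of the first <a> after a "facultyname" div (.php or .shtml).
NAME_RE = re.compile(r'class="facultyname">.*?<a\s+href="[^"]*">(.*?)</a>', re.DOTALL)
# Value of the first plain assignment to b, e.g. b='person1'.
LOCAL_RE = re.compile(r"\bb='([^']*)'")


def extract_people(msg):
    """Return a list of (name, email) tuples, one per facultyname block."""
    people = []
    pos = 0  # everything before pos has already been handled
    while True:
        name_match = NAME_RE.search(msg, pos)
        if name_match is None:
            break  # no more people on the page
        name = name_match.group(1).strip()
        pos = name_match.end()

        # Only look for the script up to the next facultyname block, so a
        # person without a script does not pick up the next person's email.
        next_block = msg.find('class="facultyname">', pos)
        limit = next_block if next_block != -1 else len(msg)

        email = None
        script_pos = msg.find('function escramble()', pos, limit)
        if script_pos != -1:
            local = LOCAL_RE.search(msg, script_pos, limit)
            if local:
                email = local.group(1) + '@email.com'
        people.append((name, email))
    return people


# Shortened copy of the sample HTML, for trying the function out.
SAMPLE = r"""
<div class="facultyname">
<strong><a href="people/bastola.php">person1</a>
</div>
<script type="text/javascript">
function escramble(){
var a,b,c,d,e,f,g,h,i
a='<a href=\"mai'
b='person1'
c='\">'
}
</script>
<div class="facultyname">
<strong><a href="people/person2.shtml">person2</a>
</div>
<script type="text/javascript">
function escramble(){
var a,b,c,d,e,f,g,h,i
a='<a href=\"mai'
b='person2'
c='\">'
}
</script>
"""

if __name__ == "__main__":
    for name, email in extract_people(SAMPLE):
        print(name, email)
```

Two things differ from the original loop: the search position only ever moves forward, and the name pattern does not depend on the link ending in `.php`, since the second person in the sample links to a `.shtml` page. For anything beyond a quick script, a real HTML parser would probably be more robust than string searching.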