The HTML I get back keeps telling me to restart my browser, and I'm a bit lost:

require 'rubygems'
require 'mechanize'

def getHtml(the_url)
  agent = Mechanize.new
  agent.keep_alive = false
  agent.user_agent = "gibsonSim"
  agent.user_agent_alias = "Mechanize"
  agent.redirect_ok = true
  agent.add_auth('www.http://corpus2.byu.edu/','omitted', 'omitted')
  resp = agent.get(the_url)
  puts resp.body
  return resp   
end

url = "http://corpus2.byu.edu/glowbe/x2.asp?chooser=seq&p=%5Bsolid%5D&w2=&wl=4&wr=4&r1=&r2=&ipos1=-select-&B7=SEARCH&showsec=y&sec1=0&sec2=0&sortBy=freq&sortByDo2=freq&minfreq1=freq&freq1=20&freq2=20&numhits=100&kh=100&groupBy=words&whatshow=raw&saveList=no&changed=&corpus=glowbe&word=&sbs=&sbs1=&sbsreg1=&sbsr=&sbsgroup=&redidID=&ownsearch=y&compared=&holder=&whatdo=seq&rand1=y&whatdo1=1&didRandom=n&minFreq=freq&s1=0&s2=0&s3=0&perc=mi"
puts getHtml(url)

I'm really not sure why this happens every time with Mechanize but only occasionally in Chrome.

The returned HTML is:

<style>

<!--



option { font-family: Verdana; font-size: 9px }
input { font-family: Verdana; font-size: 9px }
body { font-family: Verdana; font-size: 11px }
div { font-family: Verdana; font-size: 11px }
p { font-family: Verdana; font-size: 11px }
td { font-family: Verdana; font-size: 11px }



-->
</style>

<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>New Page 1</title>
<script language=Javascript>

function x(x1)
{
top.lefto.document.zabba.reset();
top.lefto.document.zabba.p.value = x1;
top.lefto.document.zabba.wl.options[0].selected = true;
top.lefto.document.zabba.whatsee[0].checked='true';
top.lefto.document.zabba.submit();
}

function x()
{
top.lefto.document.zabba.submit();
}

</script>

</head>


<body>


<div align="center">
<table align="center" border="0" cellpadding="10" cellspacing="0" style="border-    collapse: collapse" bordercolor="#111111" width="70%" id="AutoNumber1">
<tr><td style="background-color: #FFFFFF">&nbsp;</td></tr>
  <tr>
    <td align="center" width="100%">


Please close your browser <b>completely</b>, and then open your browser and start a new session.


</td>
  </tr>
</table>
<p>&nbsp;</p>
</body>
</html>

Thanks for any help!

1 Answer

I'm not sure whether this is the whole problem, but it does show that Mechanize/Nokogiri isn't happy with the HTML:

require 'nokogiri'

html = <<EOT
<style>

<!--
option { font-family: Verdana; font-size: 9px }
input { font-family: Verdana; font-size: 9px }
body { font-family: Verdana; font-size: 11px }
div { font-family: Verdana; font-size: 11px }
p { font-family: Verdana; font-size: 11px }
td { font-family: Verdana; font-size: 11px }
-->
</style>

<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>New Page 1</title>
<script language=Javascript>

function x(x1)
{
top.lefto.document.zabba.reset();
top.lefto.document.zabba.p.value = x1;
top.lefto.document.zabba.wl.options[0].selected = true;
top.lefto.document.zabba.whatsee[0].checked='true';
top.lefto.document.zabba.submit();
}

function x()
{
top.lefto.document.zabba.submit();
}

</script>
</head>
<body>
<div align="center">
<table align="center" border="0" cellpadding="10" cellspacing="0" style="border-    collapse: collapse" bordercolor="#111111" width="70%" id="AutoNumber1">
<tr><td style="background-color: #FFFFFF">&nbsp;</td></tr>
  <tr>
    <td align="center" width="100%">


Please close your browser <b>completely</b>, and then open your browser and start a new session.
</td>
  </tr>
</table>
<p>&nbsp;</p>
</body>
</html>
EOT

doc = Nokogiri::HTML(html)
puts doc.errors

Running that shows the HTML has errors:

>> htmlParseStartTag: misplaced <html> tag
>> htmlParseStartTag: misplaced <head> tag
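
Both errors appear to share one cause: the `<style>` block sits before `<html>`, so by the time the parser reaches `<html>` and `<head>` it has already opened those elements implicitly. Here is a minimal, stdlib-only sketch for spotting that kind of preamble. Note that `tags_before_html` is a hypothetical helper, not part of Nokogiri or Mechanize, and it is a rough regex scan rather than a real HTML parser:

```ruby
# Hypothetical helper (not a Nokogiri or Mechanize API): returns the names
# of any tags opened before the first <html> tag.
# Rough regex scan only; it does not parse HTML properly.
def tags_before_html(markup)
  html_start = markup =~ /<html[\s>]/i
  return [] if html_start.nil?

  preamble = markup[0...html_start]
  # Drop comments first so tag-like text inside them is not counted.
  preamble = preamble.gsub(/<!--.*?-->/m, '')
  # Collect opening-tag names only; closing tags and the doctype are skipped.
  preamble.scan(/<([a-z][a-z0-9]*)/i).flatten.map(&:downcase)
end

# Shortened stand-in for the page returned by the server.
sample = <<~HTML
  <style>
  <!--
  body { font-family: Verdana; font-size: 11px }
  -->
  </style>

  <html>
  <head>
  <title>New Page 1</title>
  </head>
  <body></body>
  </html>
HTML

p tags_before_html(sample)   # => ["style"]
```

An empty array would mean nothing other than comments or a doctype comes before `<html>`. Anything else is a hint that the parser will have to rearrange the document.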

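And this is what Nokogiri thinks the document should look like once it has finished its fix-ups (as printed by, for example, `puts doc.to_html`):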

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<style>

<!--
option { font-family: Verdana; font-size: 9px }
input { font-family: Verdana; font-size: 9px }
body { font-family: Verdana; font-size: 11px }
div { font-family: Verdana; font-size: 11px }
p { font-family: Verdana; font-size: 11px }
td { font-family: Verdana; font-size: 11px }
-->
</style>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>New Page 1</title>
<script language="Javascript">

function x(x1)
{
top.lefto.document.zabba.reset();
top.lefto.document.zabba.p.value = x1;
top.lefto.document.zabba.wl.options[0].selected = true;
top.lefto.document.zabba.whatsee[0].checked='true';
top.lefto.document.zabba.submit();
}

function x()
{
top.lefto.document.zabba.submit();
}

</script>
</head>
<body>
<div align="center">
<table align="center" border="0" cellpadding="10" cellspacing="0" style="border-    collapse: collapse" bordercolor="#111111" width="70%" id="AutoNumber1">
<tr><td style="background-color: #FFFFFF"> </td></tr>
<tr>
<td align="center" width="100%">


Please close your browser <b>completely</b>, and then open your browser and start a new session.
</td>
  </tr>
</table>
<p> </p>

</div>
</body>
</html>
answered 2014-02-11T04:14:30.780