
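I'm trying to download a page, fill in a form on it, and submit it. I like Python and came across mechanize. I can download the page successfully and verify that it contains 2 forms. However, even though I can confirm that the page data mechanize downloaded clearly contains the second form, mechanize does not recognize that second form (method POST). As a result I can't even go on to modify the values and submit the form I'm interested in. I'm using Python 2.6.1 on OS X 10.6.8. Any suggestions are much appreciated.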

My code:

import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)   # no robots
br.set_handle_refresh(False)  # can sometimes hang without this
br.addheaders = [('User-agent', 'Mozilla/6.0 (X11; U; i686; en-US; rv:1.9.0.1) Gecko/2008071615 OS X 10.2 Firefox/3.0.1')]
url = 'http://www.abcd.com/test.html'
response = br.open(url)

Using response.read() or get_data(), I can verify that the page contains two forms, as shown below:

<form id="lookupFormX" action="/lookup/" onSubmit="return submitLookupForm('lookupForm', 'download');" method="GET">

                <label style="font-weight:normal; font-size:85%; margin-right:5px;">View a Site Report </label>
                <input type="hidden" name="facet" style="margin-right:2px; font-weight:normal; font-size:85%;" value="sitereport" readonly/>

                <input style="margin-right:2px; font-weight:normal; font-size:85%;" name="q" type="text" id="railtext_v11pt" value="e.g. yahoo.com"
                        onfocus="clearDefaultNote(this,'e.g. yahoo.com');"
                        onblur="addDefaultNote(this,'e.g. yahoo.com');" />
                <a style="margin-right:10px;" href="#" onclick="submitLookupForm('lookupFormX');"><img src="/images/nav_right.gif" /></a>
            </form>

<br>
<FORM action="userfeedbackpost.html" id="friendForm" name="friendForm" method="post">
<TABLE id="userfeedbacktable" BORDER=0 style="padding:left:0px; margin-left:0px;">
    <TR>
        <TD style="width:200px;padding-left:10px">Your Name:</TD>
        <TD style="width:200px" ><input name="your_name" type="text" SIZE=35/></TD>

        <TD style="width:250px;text-align:right;padding-right:10px">Your E-mail:</TD>
        <TD style="width:140px" ><input name="your_email" type="text" SIZE=35/></TD>
    </TR>
    <!-- <TR></TR> -->
    <TR>
        <TD style="width:200px;padding-left:10px">Subject:</TD>
        <TD colspan="3" ><input name="subject" type="text" style="width:648px" SIZE=106/></TD>
    </TR>
    <!-- <TR></TR> -->
    <TR>
        <TD style="width:200px;padding-left:10px">URL this concerns:</TD>
        <TD colspan="3" ><input name="url" type="text" style="width:648px" SIZE=106/></TD>
    </TR>
    <!-- <TR></TR> -->
    <TR>
        <TD style="width:200px;padding-left:10px">User ID:</TD>
        <TD style="width:200px" ><input name="test_id" type="text" SIZE=35/></TD>

        <TD style="width:250px;text-align:right;padding-right:10px">Type of inquiry:</TD>
        <TD style="width:140px" >
            <SELECT name="type" id="type" style="width:262px" onchange="makeSelection()">
                <OPTION value="Choose">Choose One</OPTION>
                <OPTION value="Bug report">Report an error</OPTION>
                <OPTION value="Helpful Information">Send us a suggestion</OPTION>
                <OPTION value="Other">Other</OPTION>
            </SELECT>
        </TD>
    </TR>
    <!-- <TR></TR> -->
    <TR id="infoPanel" style="display:none">
        <TD style="width:200px;padding-left:10px">Facet in question:</TD>
        <TD style="width:200px" >
            <SELECT name="facet" style="width:263px" id="facet">
                <OPTION selected value="Choose">Choose One</OPTION>
                <OPTION value="Annoyances">Annoyances</OPTION>
                <OPTION value="Downloads">Downloads</OPTION>
                <OPTION value="Links">Links</OPTION>
           </SELECT>
       </TD>

       <TD style="width:250px;text-align:right;padding-right:10px">Are you the site owner?:</TD>
       <TD style="width:140px" >
           <input type="radio" id="siteowner_yes" name="siteowner" value="Yes">&nbsp;Yes&nbsp;&nbsp;&nbsp;
           <input type="radio" id="siteowner_no" name="siteowner" value="No" checked>&nbsp;No
       </TD>
    </TR>
    <!-- <TR></TR> -->
    <TR>
        <TD style="width:200px;padding-left:10px" >Your Message:</TD>
        <TD colspan=3><textarea class=userfeedbackTA NAME=message ROWS=12 COLS=80 style="width:646px;"></textarea></TD>
    </TR>
    <!-- <TR></TR> -->
</TABLE>

<br/><br/> <a href="javascript:document.getElementById('friendForm').submit();" class="btnOrangeLrg"><span>Send Your Feedback or Question.</span></a><br/>
<br/><br/> P.S. We will use the information above only to help provide you feedback. This information will not be used for any other purpose.

</FORM>

mechanize shows only the following:

Form name: None
<GET http://www.test.com/lookup/ application/x-www-form-urlencoded
  <HiddenControl(facet=sitereport) (readonly)>
  <TextControl(q=e.g. yahoo.com)>>

when I use the following code:

for form in br.forms():
    print "Form name:", form.name
    print form

My question: how can I access the second form? (Using nr=1 gives me an error.)

Edit:

I also tried this version, with the same result; the second form does not show up:

request = mechanize.Request(url)
request.add_header("User-agent", "Mozilla/6.0 (X11; U; i686; en-US; rv:1.9.0.1) Gecko/2008071615 OS X 10.2 Firefox/3.0.1")
response = mechanize.urlopen(request)
forms = mechanize.ParseResponse(response, backwards_compat=False)
response.close()

for form in forms:
  print form

Edit 2:

I also tried modifying my code as follows:

# Cookie Jar (cookielib is the Python 2 standard-library module)
import cookielib
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follow refresh 0, but do not hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

br.addheaders = [
    ('Cookie', 'mbox=PC#1327356910232-537677#1410633293|check#true#1347561353|session#1347561287712-498080#1347563153; s_cc=true; s_sq=%5B%5BB%5D%5D; s_nr=1347561671754-Repeat'),
    ('Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'),
    ('Accept-Encoding', 'gzip,deflate,sdch'),
    ('Accept-Language', 'en-US,en'),
    ('Cache-Control', 'max-age=0'),
    ('Connection', 'keep-alive'),
    ('Referer', 'http://www.siteadvisor.com'),
    ('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1'),
]

I took the header values from my browser and tried inserting them into the mechanize browser instance. However, I still see only 1 form.

2 Answers

I had a similar problem and found that changing my headers and passing RobustFactory() to cope with "bad" HTML solved it.

br = mechanize.Browser(factory=mechanize.RobustFactory())
br.set_handle_robots(False)
br.addheaders = [('User-agent','Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6')]

This came after a lot of fiddling. The solution worked both in general and for the specific URL I was using, but adding:

br.addheaders.append(['Accept-Encoding', 'gzip'])

may be necessary if the URL you are trying to access is gzipped. You can check whether that is the case here: http://checkgzipcompression.com/
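A local alternative to an external checker is to look at the first two bytes of the raw response body, since every gzip stream starts with 0x1f 0x8b. The following is only a minimal sketch using the standard library: `maybe_gunzip` is a hypothetical helper name (it is not part of mechanize), and the `page` bytes stand in for whatever `response.read()` returned.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream begins with these two bytes


def maybe_gunzip(body):
    """Return body decompressed if it looks like gzip data, else unchanged."""
    if body[:2] == GZIP_MAGIC:
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body


# Stand-in for a real response body (assumption: two tiny forms).
page = b"<form id='a'></form><form id='b'></form>"

# Build a gzip-compressed copy to simulate a compressed response.
buf = io.BytesIO()
gz = gzip.GzipFile(fileobj=buf, mode="wb")
gz.write(page)
gz.close()
compressed = buf.getvalue()

print(compressed[:2] == GZIP_MAGIC)      # compressed data carries the magic bytes
print(maybe_gunzip(compressed) == page)  # and round-trips to the original HTML
print(maybe_gunzip(page) == page)        # plain HTML passes through untouched
```

If the body does turn out to be compressed, a form parser fed the raw bytes would have no HTML to work with, which may be one reason a form seems to go missing.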

answered 2015-06-04T08:28:33.640

You should provide the URL, because if I put the HTML you posted (with both forms) into a variable named text, this is what happens:

In [61]: forms = mech.ParseString(text, 'fake') # imported mechanize as mech

In [62]: for form in forms: print form; print '-'*5
   ....: 
<GET fake application/x-www-form-urlencoded>
-----
<GET /lookup/ application/x-www-form-urlencoded
  <HiddenControl(facet=sitereport) (readonly)>
  <TextControl(q=e.g. yahoo.com)>>
-----
<friendForm POST userfeedbackpost.html application/x-www-form-urlencoded
  <TextControl(your_name=)>
  <TextControl(your_email=)>
  <TextControl(subject=)>
  <TextControl(url=)>
  <TextControl(test_id=)>
  <SelectControl(type=[*Choose, Bug report, Helpful Information, Other])>
  <SelectControl(facet=[*Choose, Annoyances, Downloads, Links])>
  <RadioControl(siteowner=[Yes, *No])>
  <TextareaControl(message=)>>
-----

The first one is a default form (added by the parser); ignore it. After that, both forms show up just fine.
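To cross-check what mechanize reports against what is really in the downloaded data, the form tags can also be listed independently of mechanize. The sketch below uses only the standard-library HTML parser. `FormCounter` and `list_forms` are hypothetical names chosen for illustration, and `sample` is a shortened stand-in for the two form tags from the question.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class FormCounter(HTMLParser):
    """Collect (name or id, METHOD) for every <form> start tag seen."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":  # the parser lowercases tag and attribute names
            attrs = dict(attrs)
            label = attrs.get("name") or attrs.get("id")
            method = (attrs.get("method") or "GET").upper()
            self.forms.append((label, method))


def list_forms(html_text):
    """Return the list of forms found in html_text, in document order."""
    parser = FormCounter()
    parser.feed(html_text)
    parser.close()
    return parser.forms


# Shortened stand-in for the two forms shown in the question.
sample = """
<form id="lookupFormX" action="/lookup/" method="GET"></form>
<FORM action="userfeedbackpost.html" id="friendForm" name="friendForm" method="post"></FORM>
"""

print(list_forms(sample))
# [('lookupFormX', 'GET'), ('friendForm', 'POST')]
```

If this lists two forms for the live page while `br.forms()` yields one, the difference most likely lies in how the page is parsed rather than in what was downloaded.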

answered 2012-09-12T08:13:10.917