python - Unable to download a cookie-protected file using Python

Question

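I have been looking for a solution to this problem all day. The page http://www.some.site/index.php asks for a username and password and sends a cookie. Well, this is how I get in: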

import urllib, urllib2, cookielib, os
import re # not required here but tried it out though
import requests # not required here but tried it out though
username = 'somebody'
password = 'somepass'

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'j_password' : password})
resp = opener.open('http://www.some.site/index.php', login_data)
print resp.read()

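The problem is that in the middle of the page there is a link to download an .xls file: http://www.some.site/excel_file.php?/t=1303457489. I can download the file in any browser (Mozilla, Chrome, IE), but not with Python. The post data after .php (i.e. ?t=1370919996) keeps changing whenever I log in or refresh the page.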

Maybe I'm wrong, but I believe the post data is generated from the cookie (or session cookie), yet the cookie contains only the following: ('set-cookie', 'PHPSESSID=9cde55534fcc8e136fcf6588c0d0f1df; path=/')

This is one of the ways I tried to save the file:

print "downloading with urllib2"
f = urllib2.urlopen('http://www.some.site/excel_file.php')
data = f.read()
with open("exceldoc.xls", "wb") as code:
    code.write(data)

Whether I save it or print it, I get the same bad-request error:

<b>Fatal error</b>:  Call to a member function FetchRow() on a non-object in <b>http://www.some.site/excel_file.php</b> on line <b>112</b><br

How can I download this file with Python? Thank you very much for your help!

There are many similar posts. I have checked them, and my example was inspired by them, but nothing has helped. I am not very familiar with cookies, PHP, or JS.

Edit: this is what I get when I print out the contents of index.php:

<html>
<head>
<title>SOMETITLE</title>
<meta http-equiv="Page-Enter" content="blendTrans(Duration=0.5)">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel='stylesheet' type='text/css' href='somesite.css'>
<SCRIPT LANGUAGE="JavaScript">
<!-- JavaScript hiding

function clearDefault(obj) {
    if (!obj._cleared) {
                obj.value='';
                obj._cleared=true;
    }
}

// -->
</SCRIPT>
</head>

<body bgcolor="#FFFFFF" text="#000000">

<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
  <tr>
    <td>
      <table width="1000" height="150" border="0" align="center" cellpadding="16" cellspacing="0" class="header" style="background: #989896 url('images/header.png') no-repeat;">
        <tr>
          <td valign="middle">
            <table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
              <tr>
                <td width="380">&nbsp;</td>
                <td>
                  <div id="login">
                       <form name="flogin" method="post" action="/index.php">
                      <h1>Login</h1>
                      <input name="uName" type="text" value="Username:" class="name" onfocus="clearDefault(this)">
                      <br>
                      <input type="password" name="uPw"  value="Password:" class="pass" onfocus="clearDefault(this)">
                      <input type="submit" name="Submit" value="OK" class="submit">
                    </form>
                  </div>                                                                
                                                                </td>
              </tr>
            </table>
          </td>
        </tr>
      </table>
                </td>
  </tr>
</table>

</body>
</html>
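
One detail in this output is worth checking: the login form names its fields uName and uPw, while the login code above posts username and j_password. If the server only reads the form's own field names, the login would not succeed, which would also explain why the login page comes back. The following is a minimal Python 3 sketch (not the Python 2 used in the post) that lists the input names in a form. FormFieldParser, input_names, and LOGIN_FORM are hypothetical names introduced for illustration, and LOGIN_FORM is a trimmed copy of the form above.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect the name attribute of every <input> element."""

    def __init__(self):
        super().__init__()
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "input":
            name = dict(attrs).get("name")
            if name:
                self.field_names.append(name)


def input_names(html):
    """Return the names of all <input> fields found in the given HTML."""
    parser = FormFieldParser()
    parser.feed(html)
    return parser.field_names


# Trimmed copy of the login form printed above.
LOGIN_FORM = """
<form name="flogin" method="post" action="/index.php">
  <input name="uName" type="text" value="Username:" class="name">
  <input type="password" name="uPw" value="Password:" class="pass">
  <input type="submit" name="Submit" value="OK" class="submit">
</form>
"""

print(input_names(LOGIN_FORM))
```

The names it prints are the keys the server most likely expects in the POST body, though only the site's behaviour can confirm that.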

score 1 · Accepted Answer

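You could try parsing the response from the first code section and opening the extracted URL with the same opener. The module-level urllib2.urlopen does not use the CookieJar attached to opener, so that request goes out without the session cookie. Without knowing the actual format of the link: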

import urllib, urllib2, cookielib, os
import re # going to use this now!

username = 'somebody'
password = 'somepass'

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# Field names must match the login form (uName / uPw in the HTML shown in the question).
login_data = urllib.urlencode({'uName' : username, 'uPw' : password})
resp = opener.open('http://www.some.site/index.php', login_data)
content = resp.read()
print content

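# Escape '.' and '?' so they match literally; unescaped, '?' would make the preceding 'p' optional.
match = re.search(
    r"<a\s+href=\"(?P<file_link>http://www\.some\.site/excel_file\.php\?t=\d+)\">",
    content,
    re.IGNORECASE
)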

assert match is not None, "Couldn't find the file link..."

file_link = match.group('file_link')
print "downloading {} with urllib2".format(file_link)
f = opener.open(file_link)
data = f.read()
with open("exceldoc.xls", "wb") as code:
    code.write(data)
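
For readers on Python 3, where urllib2 and cookielib no longer exist, the same idea can be sketched with urllib.request and http.cookiejar. This is an untested sketch against the placeholder site, under several assumptions: the link format is still a guess, the field names uName, uPw, and Submit are taken from the login form in the question, the page is decoded as UTF-8 because its meta tag says so, and find_file_link and download_report are hypothetical helper names.

```python
import re
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

BASE_URL = "http://www.some.site/"  # placeholder host from the question

# Assumed link format: any href pointing at excel_file.php with a query string.
# '.' and '?' are escaped so that they match literally.
LINK_RE = re.compile(
    r'href=["\']([^"\']*excel_file\.php\?[^"\']*)["\']', re.IGNORECASE
)


def find_file_link(html, base_url=BASE_URL):
    """Return the absolute URL of the first excel_file.php link, or None."""
    match = LINK_RE.search(html)
    if match is None:
        return None
    # urljoin leaves absolute links alone and resolves relative ones.
    return urllib.parse.urljoin(base_url, match.group(1))


def download_report(username, password, out_path="exceldoc.xls"):
    """Log in, find the link, and fetch it with the same cookie-aware opener."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    # Field names assumed from the login form shown in the question.
    login_data = urllib.parse.urlencode(
        {"uName": username, "uPw": password, "Submit": "OK"}
    ).encode("utf-8")
    with opener.open(urllib.parse.urljoin(BASE_URL, "index.php"), login_data) as resp:
        page = resp.read().decode("utf-8", errors="replace")

    file_link = find_file_link(page)
    if file_link is None:
        raise RuntimeError("Couldn't find the file link...")

    # Reusing the opener sends the PHPSESSID cookie set during login.
    with opener.open(file_link) as resp, open(out_path, "wb") as out:
        out.write(resp.read())
    return file_link
```

Calling download_report('somebody', 'somepass') would perform the login and the download in one go. Keeping the link extraction in its own function makes it easy to adjust the pattern once the real markup of the logged-in page is known.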
