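This is my first question, so please be kind if I get something wrong.

I am using the requests module in Python 3.3 to automate downloading files from several sites, but one site in particular is giving me trouble when I try to fetch a CSV file. I have a workable level of competence in Python, but when it comes to website interaction I am not familiar with HTML or JavaScript.

Here is the relevant code.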

import requests
import datetime

now = datetime.datetime.now().strftime("%Y%m%d")

folder = 'some path'

url = 'https://gats.pjm-eis.com/gats2/PublicReports/RenewableGeneratorsRegisteredInGATS/'#ExportTo'
payload = {'exportType' : 'CSV',
           'tabNumber' : ''}
doc = requests.post(url, data=payload, stream=True)

output = open(folder+now+'_GATSRegistered.csv','wb')
output.write(doc.content)
output.close()

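I am not getting any errors, but the document I end up creating is built from an error page. I have succeeded with a URL that points directly at a file (for example 'http://www.place.com/path/file.xlsx'), but that only needed a GET request.

So, my questions:

  • What is the correct request to POST?
  • Is POST even the right thing to use?
  • Is this a special case, or something I should know how to solve in general?
  • Is there anything else I should be doing differently?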

1 Answer

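I looked at the page in Chrome with the developer console open on the Network tab. There you can see that clicking the "CSV" button sends a POST request containing a large amount of form data.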

exportType:CSV
tabNumber:
CSV_CH:1
PRN_CH:0
GridView$DXFREditorcol0:
GridView$DXFREditorcol1:
GridView$DXFREditorcol2:
GridView$DXFREditorcol3:
GridView$DXFREditorcol4:
GridView$DXFREditorcol5:
GridView$DXFREditorcol6:
GridView$DXFREditorcol7:
GridView$DXFREditorcol8:
GridView$DXFREditorcol9:
GridView$DXFREditorcol10:
GridView$DXFREditorcol11:
GridView$DXFREditorcol12:
GridView$DXFREditorcol13:
GridView$DXFREditorcol14:
GridView$DXFREditorcol15:
GridView$DXFREditorcol16:
GridView$DXFREditorcol17:
GridView$DXFREditorcol18:
GridView$DXFREditorcol19:
GridView$DXFREditorcol20:
GridView$DXFREditorcol21:
GridView$DXFREditorcol22:
GridView$DXFREditorcol23:
GridView$DXFREditorcol24:
GridView$DXFREditorcol25:
GridView$DXFREditorcol26:
GridView_custwindowWS:0:0:-1:-10000:-10000:0:1px:-10000:1:0:0:0
GridView_DXHFPWS:0:0:-1:-10000:-10000:0:180px:100px:1:0:0:0
GridView_DXPagerBottom_PSPSI:2
GridView$DXSelInput:
GridView$DXKVInput:[]
GridView$CallbackState:BwMHAQIFU3RhdGUGEAEHGwcAAgEHAQIBBwICAQcDAgEHBAIBBwUCAQcGAgEHBwIBBwgCAQcJAgEHCgIBBwsCAQcMAgEHDQIBBw4CAQcPAgEHEAIBBxECAQcSAgEHEwIBBxQCAQcVAgEHFgIBBxcCAQcYAgEHGQIBBxoCAQcABxsHAAcABwEHAAcCBwAHAwcABwQHAAcFBwAHBgcABwcHAAcIBwAHCQcABwoHAAcLBwAHDAcABw0HAAcOBwAHDwcABxAHAAcRBwAHEgcABxMHAAcUBwAHFQcABxYHAAcXBwAHGAcABxkHAAcaBwAHAAcAAgAFAAAAgAkCCUVudGl0eUtleQkCAAIAAwcEAgAHAAIBBTaVAAAHAAIBBwAHAAIQRmlsdGVyRXhwcmVzc2lvbgcCAAIIUGFnZVNpemUDBzI=
GridView$DXSyncInput:
GridView_DXFilterRowMenuCI:
DXScript:1_142,1_80,1_135,1_91,14_0,1_90,1_113,14_23,14_10,1_98,1_105,1_77,1_128,1_126,1_124,1_133,1_119,1_127,1_104,1_101,1_84,1_109,1_92,14_1,1_94,1_97,1_95,1_96,1_106,14_4,1_100,1_117,1_103,14_12,14_13,1_102,1_129,1_107,1_137,1_114,14_16,10_2,10_1,10_3,10_4,14_3
DXMVCEditorsValues:{"GridView_DXFREditorcol0":null,"GridView_DXFREditorcol1":null,"GridView_DXFREditorcol2":null,"GridView_DXFREditorcol3":null,"GridView_DXFREditorcol4":null,"GridView_DXFREditorcol5":null,"GridView_DXFREditorcol6":null,"GridView_DXFREditorcol7":null,"GridView_DXFREditorcol8":null,"GridView_DXFREditorcol9":null,"GridView_DXFREditorcol10":null,"GridView_DXFREditorcol11":null,"GridView_DXFREditorcol12":null,"GridView_DXFREditorcol13":null,"GridView_DXFREditorcol14":null,"GridView_DXFREditorcol15":null,"GridView_DXFREditorcol16":null,"GridView_DXFREditorcol17":null,"GridView_DXFREditorcol18":null,"GridView_DXFREditorcol19":null,"GridView_DXFREditorcol20":null,"GridView_DXFREditorcol21":null,"GridView_DXFREditorcol22":null,"GridView_DXFREditorcol23":null,"GridView_DXFREditorcol24":null,"GridView_DXFREditorcol25":null,"GridView_DXFREditorcol26":null}

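You can work out which of the fields above you absolutely need to send to the server. I doubt all of them are required (but I have been wrong plenty of times :)).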
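Rather than copying those fields by hand, one option is to fetch the report page first and harvest the form values from its HTML. The following is only a minimal sketch: it assumes the fields appear as ordinary <input> elements in the page source (any that are filled in by JavaScript would be missed), and FormFieldParser, collect_form_fields and build_export_payload are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect name/value pairs from every <input> element in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # An input with no value attribute is recorded as an empty string.
            self.fields[name] = attributes.get("value") or ""


def collect_form_fields(html_text):
    """Return a dict of the named <input> fields found in html_text."""
    parser = FormFieldParser()
    parser.feed(html_text)
    return parser.fields


def build_export_payload(html_text, export_type="CSV"):
    """Start from the page's own field values, then request the export."""
    payload = collect_form_fields(html_text)
    payload["exportType"] = export_type
    return payload
```

With requests this could be used roughly as page = session.get(url) followed by session.post(url, data=build_export_payload(page.text), stream=True), where session is a requests.Session() so that any cookies set by the first request are sent with the second. Whether this particular server needs those cookies, or every one of the fields, is something to find out by experiment.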

That said, when you use stream=True you should use iter_content. So your code would look like:

payload = {
# Form contents
}
r = requests.post(url, data=payload, stream=True)
with open(filename, 'wb') as output:
    for chunk in r.iter_content():
        output.write(chunk)

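The for loop ensures the data is written to your file as it becomes available. You do not have to worry about it hanging on you while the download is stalled.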
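Since the original symptom was an error page being saved silently, it may also be worth checking the response before writing it out. This is a rough sketch, not a definitive implementation: save_response is a hypothetical helper, the 8192-byte chunk size is an arbitrary choice, and treating an HTML Content-Type as a failure is an assumption about how this server reports errors.

```python
def save_response(response, filename, chunk_size=8192):
    """Stream a requests-style response to disk; return the bytes written."""
    # raise_for_status() raises an exception for 4xx/5xx responses.
    response.raise_for_status()

    content_type = response.headers.get("Content-Type", "")
    if "html" in content_type.lower():
        # Assumption: a CSV export should not come back labelled as HTML.
        raise ValueError("Expected a file but got: " + content_type)

    written = 0
    with open(filename, "wb") as output:
        # Without an explicit chunk_size, iter_content yields one byte at a time.
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                output.write(chunk)
                written += len(chunk)
    return written


# Example use (url, payload and filename as defined earlier):
#   r = requests.post(url, data=payload, stream=True, timeout=30)
#   save_response(r, filename)
```

Passing timeout to requests.post makes a connection that stops sending data raise an exception instead of waiting indefinitely; the 30 seconds here is just a placeholder value.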

Answered 2013-08-10T02:49:05.060