Hi everyone!

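I'm still discovering Twisted, and I wrote this script to parse the contents of HTML tables into Excel. The script works well! My question is: how can I do the same thing with only one web page (http://bandscore.ielts.org/) but many POST requests, so that I can fetch all the results, parse them with BeautifulSoup, and then put them into Excel?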

Parsing the source and writing it to Excel is fine, but I don't know how to make the POST requests with Twisted.

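Here is the script I use to parse many different pages with Twisted. I'd like to write the same kind of script, but sending many different POST payloads to a single page instead of fetching many pages: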

from twisted.web import client
from twisted.internet import reactor, defer
from bs4 import BeautifulSoup
import time
import xlwt

start = time.time()
wb = xlwt.Workbook(encoding='utf-8')
ws = wb.add_sheet("BULATS_IA_PARSED")
x = 0  # current spreadsheet row, advanced in finish()
Countries_List = ['Afghanistan','Armenia','Brazil','Argentina','Armenia','Australia','Austria','Azerbaijan','Bahrain','Bangladesh','Belgium','Belize','Bolivia','Bosnia and Herzegovina','Brazil','Brunei Darussalam','Bulgaria','Cameroon','Canada','Central African Republic','Chile','China','Colombia','Costa Rica','Croatia','Cuba','Cyprus','Czech Republic','Denmark','Dominican Republic','Ecuador','Egypt','Eritrea','Estonia','Ethiopia','Faroe Islands','Fiji','Finland','France','French Polynesia','Georgia','Germany','Gibraltar','Greece','Grenada','Hong Kong','Hungary','Iceland','India','Indonesia','Iran','Iraq','Ireland','Israel','Italy','Jamaica','Japan','Jordan','Kazakhstan','Kenya','Kuwait','Latvia','Lebanon','Libya','Liechtenstein','Lithuania','Luxembourg','Macau','Macedonia','Malaysia','Maldives','Malta','Mexico','Monaco','Montenegro','Morocco','Mozambique','Myanmar (Burma)','Nepal','Netherlands','New Caledonia','New Zealand','Nigeria','Norway','Oman','Pakistan','Palestine','Papua New Guinea','Paraguay','Peru','Philippines','Poland','Portugal','Qatar','Romania','Russia','Saudi Arabia','Serbia','Singapore','Slovakia','Slovenia','South Africa','South Korea','Spain','Sri Lanka','Sweden','Switzerland','Syria','Taiwan','Thailand','Trinadad and Tobago','Tunisia','Turkey','Ukraine','United Arab Emirates','United Kingdom','United States','Uruguay','Uzbekistan','Venezuela','Vietnam']
urls = ["http://www.cambridgeesol.org/institutions/results.php?region=%s&type=&BULATS=on" % Countries for Countries in Countries_List]


def finish(results):
    # Called once with the bodies of all pages, in the same order as urls.
    global x
    for result in results:
        print("GOT PAGE %d bytes" % len(result))
        soup = BeautifulSoup(result)
        tableau = soup.findAll('table')
        try:
            # The data of interest is in the fourth table of each page.
            rows = tableau[3].findAll('tr')
            print("Fetching")
            for tr in rows:
                cols = tr.findAll('td')
                y = 0
                x = x + 1
                for td in cols:
                    # One cell per <td>: x is the row, y is the column.
                    ws.write(x, y, td.text)
                    y = y + 1
        except IndexError:
            print("No IA for this country")

    reactor.stop()

waiting = [client.getPage(url) for url in urls]
defer.gatherResults(waiting).addCallback(finish)

reactor.run()
wb.save("IALOL.xls")
print("Elapsed Time: %s" % (time.time() - start))

Thank you very much for your help!

1 Answer

You have two options: keep using getPage and tell it to use POST instead of GET, or use Agent.

The API documentation for getPage directs you to the API documentation for HTTPClientFactory to discover the other supported options.

The latter API documentation explicitly covers method and hints at (but does not explain well) postdata. So, to make a POST with getPage:

d = getPage(url, method='POST', postdata="hello, world, or whatever.")
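To connect this with the question, here is a minimal sketch of sending several different POST payloads to the same URL and gathering all the responses. It assumes Python 3 (where Twisted expects bytes for the URL, method, and body) and a form that accepts URL-encoded data. The field names in FORMS are hypothetical; the real ones have to be read from the page's form. Note that getPage was deprecated in Twisted 16.7 and may be absent from recent releases.

```python
from urllib.parse import urlencode

URL = b"http://bandscore.ielts.org/"

# Hypothetical form fields, for illustration only: the real names and
# values must be taken from the <form> on the target page.
FORMS = [
    {"country": "France", "year": "2011"},
    {"country": "Japan", "year": "2011"},
]


def encode_form(fields):
    """Encode a dict as an application/x-www-form-urlencoded body."""
    return urlencode(fields).encode("ascii")


def post_all(url, forms):
    """Send one POST per form to the same URL.

    Returns a Deferred that fires with the list of response bodies,
    in the same order as forms.
    """
    # Imported here so encode_form can be used without Twisted installed.
    from twisted.internet import defer
    from twisted.web.client import getPage

    # Without this header many servers will not parse the body as a form.
    headers = {b"Content-Type": b"application/x-www-form-urlencoded"}
    requests = [
        getPage(url, method=b"POST", postdata=encode_form(form), headers=headers)
        for form in forms
    ]
    return defer.gatherResults(requests)
```

In the script from the question, the last lines would then become post_all(URL, FORMS).addCallback(finish) followed by reactor.run(), with finish left unchanged.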

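There is a howto-style document for Agent (linked from the overall web howto documentation index). It gives examples of sending a request with a body (i.e., see the FileBodyProducer example).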
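For comparison, a rough sketch of the Agent route is given here, again assuming Python 3 and URL-encoded form data. The helper names make_body, post_form, and post_many are made up for this example, and the form fields passed in are whatever the target page expects. The body is wrapped in a FileBodyProducer, and readBody collects the response into a single bytes object.

```python
from io import BytesIO
from urllib.parse import urlencode


def make_body(fields):
    """Encode form fields as URL-encoded bytes (hypothetical helper)."""
    return urlencode(fields).encode("ascii")


def post_form(agent, url, fields):
    """POST one form with an Agent; the Deferred fires with the body bytes."""
    # Imported here so make_body can be used without Twisted installed.
    from twisted.web.client import FileBodyProducer, readBody
    from twisted.web.http_headers import Headers

    # Agent takes the request body as a producer, not as a plain string.
    body = FileBodyProducer(BytesIO(make_body(fields)))
    headers = Headers({b"Content-Type": [b"application/x-www-form-urlencoded"]})
    d = agent.request(b"POST", url, headers, body)
    d.addCallback(readBody)  # read the whole response body
    return d


def post_many(url, forms):
    """POST every form to the same URL and gather the response bodies."""
    from twisted.internet import defer, reactor
    from twisted.web.client import Agent

    agent = Agent(reactor)
    return defer.gatherResults([post_form(agent, url, f) for f in forms])
```

As with getPage, the Deferred returned by post_many can be given the finish callback from the question before calling reactor.run().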

Answered 2012-04-24T16:57:19.460