0

我想从网页获取所有 GET 和 POST 参数。假设有一些网页。我可以从此页面获取所有链接。但是,如果此页面采用输入参数(GET 和 POST),我该如何获取它们?我的算法是这样的:

find in web page this type of strings <form method="GET">...</form>;
then for each found result:
     get <input> fields and construct request
     then save it somewhere

我的目的是编写爬虫,从网站获取所有链接、GET 和 POST 参数,然后将其保存在某处以供进一步分析。我的算法很简单,所以我想知道有没有其他方法(在 python 中)?你能推荐任何python库吗?

4

1 回答 1

0

像这样的东西让你开始怎么样?它提取表单和输入属性:

from BeautifulSoup import BeautifulSoup

s = urllib2.urlopen('http://stackoverflow.com/questions/10614974/how-to-get-post-and-get-parameters-from-web-page-in-python').read()
soup = BeautifulSoup(s)

forms = soup.findall('form')
for form in forms:
  print 'form action: %s (%s)' % (form['action'], form['method'])
  inputs = form.findAll('input')
  for input in inputs:
    print "  -> %s" % (input.attrs) 

输出(本页):

form action: /search (get)
  -> [(u'autocomplete', u'off'), (u'name', u'q'), (u'class', u'textbox'), (u'placeholder', u'search'), (u'tabindex', u'1'), (u'type', u'text'), (u'maxlength', u'140'), (u'size', u'28'), (u'value', u'')]
form action: /questions/10614974/answer/submit (post)
  -> [(u'id', u'fkey'), (u'name', u'fkey'), (u'type', u'hidden'), (u'value', u'923d3d8b45bbca57cbf0b126b2eb9342')]
  -> [(u'id', u'author'), (u'name', u'author'), (u'type', u'text')]
  -> [(u'id', u'display-name'), (u'name', u'display-name'), (u'type', u'text'), (u'size', u'30'), (u'maxlength', u'30'), (u'value', u''), (u'tabindex', u'105')]
  -> [(u'id', u'm-address'), (u'name', u'm-address'), (u'type', u'text'), (u'size', u'40'), (u'maxlength', u'100'), (u'value', u''), (u'tabindex', u'106')]
  -> [(u'id', u'home-page'), (u'name', u'home-page'), (u'type', u'text'), (u'size', u'40'), (u'maxlength', u'200'), (u'value', u''), (u'tabindex', u'107')]
  -> [(u'id', u'submit-button'), (u'type', u'submit'), (u'value', u'Post Your Answer'), (u'tabindex', u'110')]
于 2012-05-16T09:15:15.650 回答