0

我仍在学习 python,这是我第一次访问网站并为自己抓取某些信息。我正在努力理解这种语言。因此,欢迎任何意见。

下面的数据是我从页面源中看到的。我必须访问某个页面才能输入我的登录信息。成功进入后。我被重定向到另一个页面以获取我的密码。我正在尝试通过 python 请求发帖。在我可以废弃第三页信息之前,我必须浏览两页。但是,我只能通过登录的第一页。

这是为 USERNAME 调用的 Header 和 POST 信息。

对于用户名页面:

(Request-Line)  
POST /client/factor2UserId.recip;jsessionid=15AD9CDEB48362372EFFC268C146BBFC HTTP/1.1
Host    www.card.com
User-Agent  Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0
Accept  text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language en-US,en;q=0.5
Accept-Encoding gzip, deflate
DNT 1
Referer https://www.card.com/client/
Cookie  JSESSIONID=15AD9CDEB48362372EFFC268C146BBFC
Connection  keep-alive
Content-Type    application/x-www-form-urlencoded
Content-Length  13

Post Data: 
login,  USERLOGIN

这是为密码调用的 Header 和 Post 信息:

For the PASSWORD PAGE:
(Request-Line)  
POST /client/siteLogonClient.recip HTTP/1.1
Host    www.card.com
User-Agent  Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0
Accept  text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language en-US,en;q=0.5
Accept-Encoding gzip, deflate
DNT 1
Referer https://www.card.com/client/factor2UserId.recip;jsessionid=15AD9CDEB48362372EFFC268C146BBFC
Cookie  JSESSIONID=15AD9CDEB48362372EFFC268C146BBFC
Connection  keep-alive
Content-Type    application/x-www-form-urlencoded
Content-Length  133

Post Data: 
org.apache.struts.taglib.html.TOKEN,     583ed0aefe4b04b
login,  USERLOGIN
password, PASSWORD

这是我想出的,但是我只能访问第一页。一旦我调用我的函数 second_pass(),我就会被重定向回第一页。

使用我的函数 first_pass(),我收到一个响应代码 200。但是,我在 second_pass() 上收到相同的代码,但是如果我打印出页面的文本,它会重定向到第一页。我从未成功进入第三页。

import requests
import re

response = None
r = None

payload = {'login' : 'USERLOGIN'}
# acesses the username screen and adds username
# give login name
def first_pass():
    global response
    global payload
    url = 'https://www.card.com/client/factor2UserId.recip'
    s = requests.Session()
    response = s.post(url, payload)
    return response.status_code


# merges payload with x that contains user password
def second_pass():
    global payload
    global r
    # global response
    x = {'password' : 'PASSWORD'} # I add the Password in this function cause I am not sure if it will fail the first_pass()
    payload.update(x)
    url = 'https://www.card.com/client/siteLogonClient.recip'
    r = requests.post(url, payload)
    return payload
    return r.status_code



# searches response for Token!
# if token found merges key:value into payload
def token_search():
    global response
    global payload
    f = response.text

    # uses regex to find the Token from the HTML
    patFinder2 = re.compile(r"name=\"(org.apache.struts.taglib.html.TOKEN)\"\s+value=\"(.+)\"",re.I)
    findPat2 = re.search(patFinder2, f)

    # if the Token in found it turns it into a dictionary. and prints the dictionary 
    # if no Token is found it prints "nothing found" 
    if(findPat2):
        newdict = dict(zip(findPat2.group(1).split(), findPat2.group(2).split()))
        payload.update(newdict)
        print payload
    else:
        print "No Token Found"

我现在从 shell 调用我的函数。我按这个顺序打电话给他们。first_pass()、token_search()、second_pass()。

当我调用 token_search() 时,它会使用 unicode 更新字典。我不确定这是否是导致我的错误的原因。

任何关于代码的建议都将受到欢迎。我喜欢学习。但在这一点上,我正在用头撞墙。

4

1 回答 1

2

如果您正在抓取数据,那么我建议您学习诸如lxmlBeautifulSoup之类的库,以更可靠地从页面收集数据(与使用正则表达式相比)。

如果令牌查找代码有效,那么我的建议是像这样重新排列代码。它避免了全局变量,将变量保持在它们所属的范围内。

login('USERLOGIN', 'PASSWORD')

def login(username, password):
    loginPayload = {'login' : username}
    passPayload = {'password' : password}
    s = requests.Session()

    # POST the username
    url = 'https://www.card.com/client/factor2UserId.recip'
    postData = loginPayload.copy()
    response = s.post(url, postData)
    if response.status_code != requests.codes.ok:
        raise ValueError("Bad response in first pass %s" % response.status_code)
    postData.update(passPayload)
    tokenParam = token_search(response.text)
    if tokenParam is not None:
        postData.update(tokenParam)
    else:
        raise ValueError("No token value found!")
    # POST with password and the token
    url = 'https://www.card.com/client/siteLogonClient.recip'
    r = s.post(url, postData)
    return r


def token_search(resp_text):
    # uses regex to find the Token from the HTML
    patFinder2 = re.compile(r"name=\"(org.apache.struts.taglib.html.TOKEN)\"\s+value=\"(.+)\"",re.I)
    findPat2 = re.search(patFinder2, resp_text)

    # if the Token in found it turns it into a dictionary. and prints the dictionary 
    # if no Token is found it prints "nothing found" 
    if findPat2:
        newdict = dict(zip(findPat2.group(1).split(), findPat2.group(2).split()))
        return newdict
    else:
        print "No Token Found"
        return None
于 2013-08-05T19:11:00.453 回答