python - Appending a string to the end of a URL

Question

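To get more practice with Python, I have been working through the challenges on pythonchallenge.com.

In short, the first step of this challenge is to load an HTML page from a URL that contains a number. The page contains a line of text with another number in it. That number replaces the existing number in the URL and so takes you to the next page in the sequence. Obviously this goes on for a while... (there is more to the challenge, but getting this part working is the first step).

My code for this is below (limited to the first four pages of the sequence for now). For some reason it works the first time: it gets the second page in the sequence, reads the number, goes to the third page, and reads the number there. But then it gets stuck on the third page. I don't understand why, although I think it may have something to do with how I convert the number to a string before putting it on the end of the URL. To answer the obvious question: yes, I know pythonchallenge works properly. If you have the patience, you can do the URL-number thing by hand to confirm it, if you like :p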

import httplib2
import re

counter = 0
new = '12345' #the number for the initial page in the sequence, as a string

while True:
    counter = counter + 1
    if counter == 5:
        break

    original = 'http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing='
    nextpage = original+new     #each page in the sequence is visited by adding 
                                #the number after 'nothing='
    print(nextpage)

    h = httplib2.Http('.cache')
    response, content = h.request(nextpage, "GET")  #get the content of the page, 
                                                    #which includes the number for the 
                                                    #*next* page in the sequence

    p = re.compile(r'\d{4,5}$')     #regex to find a 4 to 5 digit number at the end of
                                    #the content

    new = str((p.findall(content)))     #make the regex result a string - is this
                                            #where the problem lies?

    print('cached?', response.fromcache)    #I was worried my requests were somehow
                                            #being cached not actually sent afresh to
                                            #pythonchallenge. But it seems they aren't.

    print(content)
    print(new)

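The output is below. The first pass seems to work (it adds 92512 to the URL, successfully fetches the next page, and finds the next value), but after that it just gets stuck and does not seem to load the next page in the sequence. Testing by changing the URL by hand in a browser confirms that the numbers are correct and that pythonchallenge works properly.

It looks to me as if something is going wrong when I turn my regex search result into a string to add to the end of the URL, but I have no idea why that would work the first time and not the second. I was also worried that my requests might just be coming from the cache (I am new to httplib2 and not confident about how it caches), but they don't seem to be. I also added a no-cache argument to the request, just to be sure (not shown in this code), but it did not help.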

http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=12345
('cached?', False)
and the next nothing is 92512
['92512']
http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=['92512']
('cached?', False)
and the next nothing is 72758
['72758']
http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=['72758']
('cached?', False)
and the next nothing is 72758
['72758']
http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=['72758']
('cached?', False)
and the next nothing is 72758
['72758']

I would be grateful to anyone who can point out where I am going wrong, and for any related tips.

Thanks in advance...

score 1 · Accepted Answer

http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=['72758']
                                                             ^^     ^^

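I think the problem is here: the square brackets and quotes end up in the URL because str() is applied to a list rather than to the matched text. findall() returns a list: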

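re.findall(pattern, string[, flags])

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

-- Python documentation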
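
To make the difference concrete, here is a minimal sketch of what str() does to the list that findall() returns, and one possible way to pull out just the digits. The sample page text and the helper name next_number are assumptions made up for illustration; they are not part of the question's code. It also assumes the page content is already a text string, so if your HTTP library hands back bytes you would need to decode it first.

```python
import re

# Assumed sample page body, modelled on the output shown in the question.
content = "and the next nothing is 92512"

pattern = re.compile(r"\d{4,5}$")
matches = pattern.findall(content)  # a list of strings, e.g. ['92512']

# str() on a list gives the list's repr, brackets and quotes included.
broken = str(matches)

# Indexing the list gives the matched digits on their own.
fixed = matches[0] if matches else None

base = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
print(base + broken)  # URL ends in ['92512']
print(base + fixed)   # URL ends in 92512


def next_number(text):
    """Hypothetical helper: return the trailing 4-5 digit number, or None."""
    match = re.search(r"\d{4,5}$", text.strip())
    return match.group(0) if match else None
```

In the loop from the question, replacing the str(p.findall(content)) line with something like new = next_number(content), and stopping when it returns None, should keep the URL clean on every pass.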
