python - How to check whether a value on a website has changed

Question

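Basically, I'm trying to run some code (Python 3.2) if a value on a website changes; otherwise, wait for a bit and check again later.

First, I thought I could just save the value in a variable and compare it with the new value fetched the next time the script runs. That quickly ran into problems, because the value is overwritten when the script runs again and re-initializes the variable.

So then I tried saving the web page's HTML to a file and comparing it with the HTML fetched on the next run. No luck there either: the comparison kept coming up False even when nothing had changed.

Next up was pickling the web page and then comparing it with the HTML. Interestingly, that didn't work within the script either. However, if I type file = pickle.load( open( 'D:\Download\htmlString.p', 'rb')) after the script has run and then file == html, it shows True when nothing has changed.

I'm a bit confused as to why it doesn't work when the script runs, yet it shows the correct answer if I do the above.

Edit: Thanks for the responses so far. My question wasn't really about other ways to go about this (although it's always good to learn more ways to accomplish a task!) but rather why the code below doesn't work when run as a script, whereas if I reload the pickle object at the prompt after the script has run and then test it against the html, it returns True when nothing has changed.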

try: 
    file = pickle.load( open( 'D:\\Download\\htmlString.p', 'rb'))
    if pickle.load( open( 'D:\\Download\\htmlString.p', 'rb')) == htmlString:
        print("Values haven't changed!")
        sys.exit(0)
    else:
        pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )  
        print('Saving')
except: 
    pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )
    print('ERROR')

score 9 · Accepted Answer

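Edit: I hadn't realized you were just looking for the problem with your script. Here's what I think the problem is, followed by my original answer, which addresses another approach to the bigger problem you're trying to solve.

Your script is a great example of the dangers of using a blanket except statement: you catch everything. In this case, that includes your sys.exit(0), which works by raising SystemExit.

I'm assuming you have the try block there to catch the case where D:\Download\htmlString.p doesn't exist yet. That error is called IOError, and you can catch it specifically with except IOError: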
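To see the effect in isolation, here is a minimal sketch (the two function names are made up for this illustration). It shows a bare except swallowing the SystemExit raised by sys.exit(0), while except IOError lets it through:

```python
import sys

def with_bare_except():
    try:
        sys.exit(0)      # raises SystemExit
    except:              # matches every exception, SystemExit included
        return "swallowed"

def with_narrow_except():
    try:
        sys.exit(0)
    except IOError:      # SystemExit is not an IOError, so it propagates
        return "swallowed"

print(with_bare_except())

try:
    with_narrow_except()
except SystemExit as exc:
    print("exit propagated with code", exc.code)
```

In the script from the question, this means the "Values haven't changed!" branch falls straight into the except block, which rewrites the pickle and prints ERROR instead of exiting.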

Here is your script plus a bit of code to make it go, with your except issue fixed:

import sys
import pickle
import urllib2

request = urllib2.Request('http://www.iana.org/domains/example/')
response = urllib2.urlopen(request) # Make the request
htmlString = response.read()

try: 
    file = pickle.load( open( 'D:\\Download\\htmlString.p', 'rb'))
    if file == htmlString:
        print("Values haven't changed!")
        sys.exit(0)
    else:
        pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )  
        print('Saving')
except IOError: 
    pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )
    print('Created new file.')
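The question mentions Python 3.2, where urllib2 no longer exists, so here is a rough Python 3 sketch of the same fix. The names fetch and has_changed and the idea of passing the pickle path as a parameter are my own assumptions, not part of the original script:

```python
import pickle
import urllib.request

def fetch(url):
    """Download the page body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()

def has_changed(html, pickle_path):
    """Compare html with the pickled copy; save html when it differs or is new."""
    try:
        with open(pickle_path, 'rb') as f:
            if pickle.load(f) == html:
                return False
    except IOError:
        # No saved copy yet. Only this case is handled here; anything
        # else (including SystemExit) is left to propagate.
        pass
    with open(pickle_path, 'wb') as f:
        pickle.dump(html, f)
    return True
```

Something like has_changed(fetch('http://www.iana.org/domains/example/'), 'htmlString.p') would then return False when the page matches the saved copy, and the caller can decide whether to exit or carry on.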

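As a side note, you might consider using os.path for your file paths: it will help anyone who later wants to use your script on another platform, and it saves you the ugly double backslashes.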
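A small sketch of what that could look like. The "Download" folder under the home directory is a hypothetical location picked for the example, not the path from the question:

```python
import os.path

# Hypothetical location: a "Download" folder under the user's home directory.
pickle_path = os.path.join(os.path.expanduser('~'), 'Download', 'htmlString.p')

# If the Windows path has to stay hard-coded, a raw string avoids the doubling.
windows_path = r'D:\Download\htmlString.p'
```

os.path.join picks the right separator for whatever platform the script runs on, so the same line works on Windows, Linux and macOS.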

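Edit 2: Adapted for your specific URL.

There is a dynamically generated number for the ads on that page that changes with each page load. It's right near the end, after all the content, so we can split the HTML string at that point and take the first half, discarding the part with the dynamic number.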

import sys
import pickle
import urllib2

request = urllib2.Request('http://ecal.forexpros.com/e_cal.php?duration=weekly')
response = urllib2.urlopen(request) # Make the request
# Grab everything before the dynamic DoubleClick link
htmlString = response.read().split('<iframe src="http://fls.doubleclick')[0]

try: 
    # Pickle files should be opened in binary mode
    file = pickle.load( open( 'D:\\Download\\htmlString.p', 'rb'))
    if file == htmlString:
        print("Values haven't changed!")
        sys.exit(0)
    else:
        pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )  
        print('Saving')
except IOError: 
    pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )
    print('Created new file.')

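If it matters, your string is no longer a valid HTML document. If so, you might just remove that line or something. There's probably a more elegant way of doing this, perhaps deleting the number with a regex, but this at least answers your question.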
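For the regex idea, a rough sketch might look like the following. I'm assuming the changing number shows up as "ord=" followed by digits, which is a guess about the ad markup and not something I've checked against that page; normalize and DYNAMIC_NUMBER are names invented here:

```python
import re

# Hypothetical pattern: assumes the changing number appears as "ord=<digits>".
DYNAMIC_NUMBER = re.compile(r'(ord=)\d+')

def normalize(html):
    """Blank out the per-request number so two loads of the same page compare equal."""
    return DYNAMIC_NUMBER.sub(r'\1', html)
```

This works on text, so with Python 3 you would decode the bytes from response.read() first (or compile the pattern from a bytes literal). Unlike the split approach, the rest of the document stays intact.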

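Original answer: a different approach to your problem.

What do the response headers look like from the web server? HTTP specifies a Last-Modified property that you could use to check whether the content has changed (assuming the server tells the truth). Use this one with a HEAD request, as Uku showed in his answer, if you'd like to conserve bandwidth and be nice to the server you're polling.

And there is also an If-Modified-Since header, which sounds like what you might be looking for.

If we combine them, you might come up with something like this: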

import sys
import os.path
import urllib2

url = 'http://www.iana.org/domains/example/'
saved_time_file = 'last time check.txt'

request = urllib2.Request(url)
if os.path.exists(saved_time_file):
    """ If we've previously stored a time, get it and add it to the request"""
    last_time = open(saved_time_file, 'r').read()
    request.add_header("If-Modified-Since", last_time)

try:
    response = urllib2.urlopen(request) # Make the request
except urllib2.HTTPError, err:
    if err.code == 304:
        print "Nothing new."
        sys.exit(0)
    raise   # some other http error (like 404 not found etc); re-raise it.

last_modified = response.info().get('Last-Modified', False)
if last_modified:
    open(saved_time_file, 'w').write(last_modified)
else:
    print("Server did not provide a last-modified property. Continuing...")
    """
    Alternately, you could save the current time in HTTP-date format here:
    http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.3
    This might work for some servers that don't provide Last-Modified, but do
    respect If-Modified-Since.
    """

"""
You should get here if the server won't confirm the content is old.
Hopefully, that means it's new.
HTML should be in response.read().
"""

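Also check out this blog post by Stii, which may provide some inspiration. I don't know enough about ETags to have put them in my example, but his code checks for them as well.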
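For anyone on Python 3, here is a hedged sketch of the same conditional-request idea using urllib.request. The ETag part (sending the saved value back in If-None-Match) is my own addition based on how HTTP validators generally work, not something taken from the blog post, and both function names are made up:

```python
import urllib.error
import urllib.request

def build_conditional_request(url, last_modified=None, etag=None):
    """Create a request that asks the server to answer 304 if nothing changed."""
    request = urllib.request.Request(url)
    if last_modified:
        request.add_header('If-Modified-Since', last_modified)
    if etag:
        request.add_header('If-None-Match', etag)
    return request

def fetch_if_changed(request):
    """Return (body, last_modified, etag), or None when the server answers 304."""
    try:
        with urllib.request.urlopen(request) as response:
            return (response.read(),
                    response.headers.get('Last-Modified'),
                    response.headers.get('ETag'))
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None
        raise  # any other HTTP error (404 and so on) is re-raised
```

The caller would save the returned Last-Modified and ETag values somewhere (a text file, as above) and pass them back in on the next run.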

score 4 · Accepted Answer

It would be more efficient to do a HEAD request and check the Content-Length of the document.

import urllib2
"""
read old length from file into variable
"""
request = urllib2.Request('http://www.yahoo.com')
request.get_method = lambda : 'HEAD'

response = urllib2.urlopen(request)
new_length = response.info()["Content-Length"]
if old_length != new_length:
    print "something has changed"

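Note that it is unlikely, although possible, for the content to change while the Content-Length stays exactly the same; at the same time, this is the most efficient way. Depending on what kind of changes you expect, this method may or may not be suitable.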
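A Python 3 take on the same idea, as a sketch only. The helper names are invented here, and treating a missing Content-Length as "changed" is my own design choice, since some servers don't send the header at all:

```python
import urllib.request

def build_head_request(url):
    """Build a HEAD request (the method argument needs Python 3.3 or newer)."""
    return urllib.request.Request(url, method='HEAD')

def fetch_content_length(url):
    """Return the Content-Length header as a string, or None if the server omits it."""
    with urllib.request.urlopen(build_head_request(url)) as response:
        return response.headers.get('Content-Length')

def length_changed(old_length, new_length):
    """Report a change when the lengths differ or when either one is unknown."""
    if old_length is None or new_length is None:
        return True
    return old_length != new_length
```

You would keep the last value returned by fetch_content_length in a file and feed it to length_changed on the next check.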

score 4 · Accepted Answer

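You can always tell about ANY change in the data between the locally stored file and the remote one by hashing the contents of both. This is commonly employed to verify the veracity of downloaded data. For a continuous check, you will need a while loop.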

import hashlib
import urllib

num_checks = 20
last_check = 1
while last_check != num_checks:
    remote_data = urllib.urlopen('http://remoteurl').read()
    remote_hash = hashlib.md5(remote_data).hexdigest()

    local_data = open('localfilepath').read()
    local_hash = hashlib.md5(local_data).hexdigest()
    if remote_hash == local_hash:
        print('right now, we match!')
    else:
        print('right now, we are different')
    last_check += 1  # count this check so the loop terminates

If the actual data never needs to be saved locally, I would only ever store the md5 hash and calculate it on the fly when checking.
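Storing only the hash could be sketched like this in Python 3. Note that I've swapped in SHA-256 for MD5 here; either works for change detection. The function names and the hash_path parameter are assumptions made for the example:

```python
import hashlib

def digest(data):
    """Hex digest of the page body (bytes). SHA-256 is swapped in for MD5 here."""
    return hashlib.sha256(data).hexdigest()

def changed_since_last_check(data, hash_path):
    """Compare data's digest with the stored one, then store the new digest."""
    new_hash = digest(data)
    try:
        with open(hash_path) as f:
            old_hash = f.read().strip()
    except IOError:  # first run: nothing stored yet
        old_hash = None
    with open(hash_path, 'w') as f:  # overwrite, so the file holds one digest
        f.write(new_hash)
    return old_hash != new_hash
```

The first call reports a change because there is nothing to compare against yet; after that it only returns True when the downloaded bytes differ from the previous check.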

score 1 · Accepted Answer

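This answer is an extension of @DeaconDesperado's answer.

For simplicity and faster code execution, one can first create a local hash (instead of storing a copy of the page) and compare it with the newly obtained hash.

To create the locally stored hash initially, one can use this code: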

import hashlib
import urllib

remote_data = urllib.urlopen('http://remoteurl').read()
remote_hash = hashlib.md5(remote_data).hexdigest()

# Open the file with access mode 'w' so it only ever holds one hash
# (mode 'a' would append a second hash on the next run)
file_object = open('localhash.txt', 'w')
# Write the hash
file_object.write(remote_hash)
# Close the file
file_object.close()

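and replace local_data = open('localfilepath').read() with local_hash = open(r'local\file\path\localhash.txt').read() (dropping the line that hashes local_data, since the file already contains the hash).

That is: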

import hashlib
import urllib

num_checks = 20
last_check = 1
while last_check != num_checks:
    remote_data = urllib.urlopen('http://remoteurl').read()
    remote_hash = hashlib.md5(remote_data).hexdigest()

    # Raw string so that \f is not read as a form feed
    local_hash = open(r'local\file\path\localhash.txt').read()

    if remote_hash == local_hash:
        print('right now, we match!')
    else:
        print('right now, we are different')
    last_check += 1  # count this check so the loop terminates

Source: https://thispointer.com/how-to-append-text-or-lines-to-a-file-in-python/

DeaconDesperado's answer
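The loops above re-check straight away, whereas the question asks to wait a bit between checks. Here is one possible sketch of a polling loop with a pause. The poll function and all of its parameters are hypothetical; fetch stands for whatever returns the current value or hash:

```python
import time

def poll(fetch, on_change, interval_seconds=60, max_checks=20, sleep=time.sleep):
    """Call fetch() up to max_checks times, pausing between calls.

    on_change(value) is called each time the fetched value differs from
    the previous one. Returns the number of changes seen.
    """
    previous = fetch()
    changes = 0
    for _ in range(max_checks - 1):
        sleep(interval_seconds)
        current = fetch()
        if current != previous:
            on_change(current)
            changes += 1
            previous = current
    return changes
```

Passing sleep in as a parameter is only there so the loop can be exercised without actually waiting; in normal use you would leave it at the default.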

score 0 · Accepted Answer

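It's not entirely clear to me whether you just want to see if the website has changed, or whether you're going to do more with the website's data. If the former, definitely hash, as previously mentioned. Here is a working (Python 2.6.1 on a Mac) example that compares the complete old HTML with the new HTML; it should be easy to modify so that it uses hashes or only specific parts of the website, as you need. Hopefully the comments and docstrings make everything clear.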

import urllib2

def getFilename(url):
    '''
    Input: url
    Return: a (string) filename to be used later for storing the urls contents
    '''
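    name = str(url)
    if name.startswith('http://'):
        # lstrip('http://') would strip a set of characters, not the prefix
        name = name[len('http://'):]
    return name.replace("/",":")+'.OLD'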


def getOld(url):
    '''
    Input: url- a string containing a url
    Return: a string containing the old html, or None if there is no old file
    (checks if there already is a url.OLD file, and make an empty one if there isn't to handle the case that this is the first run)
    Note: the file created with the old html is the format url(with : for /).OLD
    '''
    oldFilename = getFilename(url)
    oldHTML = ""
    try:
        oldHTMLfile = open(oldFilename,'r')
    except IOError:
        # file doesn't exist! so make it
        with open(oldFilename,'w') as oldHTMLfile:
            oldHTMLfile.write("")
        return None
    else:
        oldHTML = oldHTMLfile.read()
        oldHTMLfile.close()

    return oldHTML

class ConnectionError(Exception):
    def __init__(self, value):
        if type(value) != type(''):
            self.value = str(value)
        else:
            self.value = value
    def __str__(self):
        return 'ConnectionError: ' + self.value       


def htmlHasChanged(url):
    '''
    Input: url- a string containing a url
    Return: a boolean stating whether the website at url has changed
    '''

    try:
        fileRecvd = urllib2.urlopen(url).read()
    except:
        print 'Could not connect to %s, sorry!' % url
        #handle bad connection error...
        raise ConnectionError("urlopen() failed to open " + str(url))
    else:
        oldHTML = getOld(url)
        if oldHTML == fileRecvd:
            hasChanged = False
        else:
            hasChanged = True

        # rewrite file
        with open(getFilename(url),'w') as f:
            f.write(fileRecvd)

        return hasChanged

if __name__ == '__main__':
    # test it out with whatismyip.com
    try:
        print htmlHasChanged("http://automation.whatismyip.com/n09230945.asp")
    except ConnectionError,e:
        print e
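If only a specific part of the page matters, one way to adapt the comparison is to cut that part out before comparing. This is just a sketch with plain string searching; the function names and the markers you pass in are assumptions, and a real HTML parser would be more robust:

```python
def extract_between(html, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker.

    Returns None when either marker is missing.
    """
    start = html.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = html.find(end_marker, start)
    if end == -1:
        return None
    return html[start:end]

def part_has_changed(old_html, new_html, start_marker, end_marker):
    """Compare only the marked part of two versions of a page."""
    return (extract_between(old_html, start_marker, end_marker)
            != extract_between(new_html, start_marker, end_marker))
```

That way changes elsewhere on the page, such as rotating ads, don't trigger a false alarm.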
