python - Checking a website for updates (web automation with Python + Selenium)

Question

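I'm trying to write a simple script that does the following:

1. Runs automatically every 6 hours
2. Checks a real estate website for new listings
3. If any new listings are found, emails their details; otherwise, terminates the script until the next run

I plan to use crontab for (1). Here is the script I've come up with so far for one particular website: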

from selenium import webdriver
import smtplib
import sys

driver = webdriver.Firefox()

#Capital Pacific Website
#Commercial Real Estate

#open text file containing property titles we already know about
properties = open("properties.txt", "r+")
currentList = []
for line in properties:
    #strip the trailing newline so stored titles can match page text
    currentList.append(line.strip())

#to search for new listings
driver.get("http://cp.capitalpacific.com/Properties")

assert "Capital" in driver.title

#holds any new listings
newProperties = []

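#find all listings on page by Property Name
#(Selenium 4.3+ removed this helper; there, use find_elements(By.CLASS_NAME, 'overview'))
newList = driver.find_elements_by_class_name('overview')

#find listings on the page that are not in currentList,
#add them to newProperties and append them to the text file
knownTitles = set(line.strip() for line in currentList)
for y in newList:
    #compare text, not WebElement objects; keep each title on one line
    title = " ".join(y.text.split())
    if title not in knownTitles:
        knownTitles.add(title)
        newProperties.append(title)
        properties.write(title + "\n")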

properties.close()
driver.close()

#if no new properties found, terminate script
#else, email properties
if not newProperties:
    sys.exit()
else:
    fromaddr = 'someone@gmail.com'
    toaddrs = ['someoneelse@yahoo.com']
    server = smtplib.SMTP('smtp.gmail.com:587')
    server.starttls()
    #Gmail requires authentication at this point: call
    #server.login(fromaddr, password) with your own credentials

    for item in newProperties:
        msg = item
        server.sendmail(fromaddr, toaddrs, msg)

    server.quit()

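My questions so far (please bear with me, I'm new to Python):

1. I'm using a list to store the web elements returned by Selenium's "find by class" method. Is there a better way to read from and write to the text file so that I only pick up newly added properties?

2. If the script does find a property on the website that isn't already in my list, is there a way to go through just that div to get the details of the listing?

Any advice or suggestions are welcome! Thank you.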

score 0 · Accepted Answer

What if you switched to JSON and stored the listings as a list of dictionaries:

[
    {
        "location": "OREGON CITY, OR",
        "price": 33000000,
        "status": "active",
        "marketing_package_url": "http://www.capitalpacific.com/inquiry/TrailsEndMarketplaceExecSummary.pdf"
        ...
    },
    ...
]
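Reading and writing such a file takes only a few lines with the standard `json` module. This is a minimal sketch, not part of the original answer: the file name `listings.json` and the helper names `load_listings` and `save_listings` are assumptions made for illustration.

```python
import json
from pathlib import Path

# Hypothetical file name; the answer does not specify one.
LISTINGS_FILE = Path("listings.json")


def load_listings(path=LISTINGS_FILE):
    """Return the stored listings, or an empty list on the first run."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def save_listings(listings, path=LISTINGS_FILE):
    """Overwrite the file with the full, current list of listings."""
    with path.open("w", encoding="utf-8") as f:
        json.dump(listings, f, indent=4)
```

Rewriting the whole file on each run keeps it valid JSON, which appending to a text file line by line would not.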

To identify new listings, you need something unique for each property. For example, you could use the marketing package URL, which looks unique to me.
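The comparison itself could look roughly like this. It is a sketch under the assumption that every listing dictionary carries a `marketing_package_url` key; the function name `find_new_listings` and the `example.com` URLs are made up for illustration, and listings without the key are skipped.

```python
def find_new_listings(scraped, stored, key="marketing_package_url"):
    """Return the scraped listings whose key value is not in stored.

    Listings that lack the key are skipped, since they cannot be compared.
    """
    known = {item[key] for item in stored if key in item}
    new = []
    for listing in scraped:
        value = listing.get(key)
        if value is not None and value not in known:
            new.append(listing)
            known.add(value)  # also skips duplicates within one scrape
    return new


# Made-up sample data: "A" is already stored, "B" is new.
stored = [{"title": "A", "marketing_package_url": "http://example.com/a.pdf"}]
scraped = [
    {"title": "A", "marketing_package_url": "http://example.com/a.pdf"},
    {"title": "B", "marketing_package_url": "http://example.com/b.pdf"},
]
print(find_new_listings(scraped, stored))
# prints [{'title': 'B', 'marketing_package_url': 'http://example.com/b.pdf'}]
```

Using a set for the known URLs keeps each lookup cheap, and it avoids the nested loop from the question, which treats an item as new as soon as it differs from any single stored entry.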

Here is sample code that gets the list of listings from the page:

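properties = []
# each listing is a div.property inside the table.property element
# (Selenium 4.3+ removed the find_element_by_* helpers; there, use
# find_element(By.CSS_SELECTOR, ...) and friends instead)
for listing in driver.find_elements_by_css_selector('table.property div.property'):
    title = listing.find_element_by_css_selector('div.title h2')
    location = listing.find_element_by_css_selector('div.title h4')
    marketing_package = listing.find_element_by_partial_link_text('Marketing Package')

    properties.append({
        'title': title.text,
        'location': location.text,
        # the Python bindings spell this get_attribute, not getAttribute
        'marketing_package_url': marketing_package.get_attribute('href')
    })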
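With dictionaries like these in hand, the emailing step from the question could send one message covering all new listings instead of one message per item. The following is only a sketch: it assumes each dictionary has the `title`, `location` and `marketing_package_url` keys used above, and the function names, the subject wording and the `password` parameter are illustrative assumptions.

```python
import smtplib
from email.message import EmailMessage


def build_message(new_listings, fromaddr, toaddrs):
    """Put every new listing into the body of a single email."""
    msg = EmailMessage()
    msg["Subject"] = "%d new listing(s)" % len(new_listings)
    msg["From"] = fromaddr
    msg["To"] = ", ".join(toaddrs)
    lines = []
    for item in new_listings:
        lines.append("%s (%s)\n%s" % (
            item["title"], item["location"], item["marketing_package_url"]))
    msg.set_content("\n\n".join(lines))
    return msg


def send_message(msg, password):
    """Send msg through Gmail; how the password is supplied is up to you."""
    with smtplib.SMTP("smtp.gmail.com", 587) as server:
        server.starttls()
        server.login(msg["From"], password)
        server.send_message(msg)
```

Keeping message construction separate from sending makes it possible to check the email text without contacting a mail server.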
