I have two scripts in Python.
Script 1 (login): visits the website, signs in through the login form, and stores the cookies in a JSON file for later use:
import json
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(slow_mo=50)
    context = browser.new_context(user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36')
    page = context.new_page()
    try:
        page.goto('https://www.url.us/signin')
        # Wait for the login form, then fill it in and submit it.
        page.wait_for_selector('#signInFormPage input[name="userName"]', state='visible')
        page.type('#signInFormPage input[name="userName"]', "aaa")
        page.type('#signInFormPage input[name="password"]', "aa")
        page.click('#userNamePasswordSignInButton')
        # Give the site time to finish signing in and set its cookies.
        page.wait_for_timeout(3000)
        cookies = context.cookies()
        page.wait_for_timeout(10000)
        # Save the cookies so the second script can reuse the session.
        with open('./cookies.json', 'w') as f:
            json.dump(cookies, f)
    except Exception as e:
        print("Error in playwright script:", e)
    finally:
        # Clean up whether or not the login succeeded.
        page.close()
        context.close()
        browser.close()
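This script works fine. The second script loads the stored cookies from the file, opens another page on the same website, and prints the page source: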
import json
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=50)
    context = browser.new_context(user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36')
    page = context.new_page()
    # Restore the session saved by the login script.
    with open('./cookies.json') as cookie_file:
        cookies = json.load(cookie_file)
    context.add_cookies(cookies)
    try:
        page.goto('https://www.url.us/Product/10aaa')
        # Let the page finish rendering before reading its source.
        page.wait_for_timeout(6000)
        print(page.content())
    except Exception as e:
        print("Error in playwright script:", e)
    finally:
        page.close()
        context.close()
        browser.close()
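This script works fine too.
The problem is that the information I want to extract is served by the site's APIs and is not available in the page source that front-end users see. When I put an API link into the second script instead of the product page, I get an empty JSON page. Those API requests use a token value, but because I only use cookies to fetch the page source, I don't have the token. I use these scripts because they are the only way I have found to get past the Cloudflare protection on this website. Is there a way to combine the requests module with the playwright module? Or is there any other suggestion for this situation on how to fetch the JSON pages using the cookies?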
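For the "requests plus playwright" idea, here is a minimal sketch of turning the saved cookies.json into a Cookie header that a non-browser HTTP client could send. It assumes the file holds the list of dicts returned by context.cookies() (keys such as name, value, domain, path, secure). The helper names cookie_header_for and load_cookie_header are made up for illustration, and the domain/path matching is deliberately simplified. Cloudflare may still reject a non-browser client even with valid cookies, and this does not supply the API token.

```python
import json
from urllib.parse import urlparse


def cookie_header_for(cookies, url):
    """Build a Cookie header value from Playwright-style cookie dicts.

    Only cookies whose domain, path and secure flag match `url` are kept.
    The matching is a simplification of real browser rules.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    path = parsed.path or "/"
    pairs = []
    for cookie in cookies:
        # A leading dot means "this domain and its subdomains".
        domain = cookie.get("domain", "").lstrip(".")
        if not domain or not (host == domain or host.endswith("." + domain)):
            continue
        # Simplified path check: plain prefix match.
        if not path.startswith(cookie.get("path", "/")):
            continue
        # Secure cookies are only sent over HTTPS.
        if cookie.get("secure") and parsed.scheme != "https":
            continue
        pairs.append(f"{cookie['name']}={cookie['value']}")
    return "; ".join(pairs)


def load_cookie_header(cookie_path, url):
    """Read a cookies.json written by the login script and build the header."""
    with open(cookie_path) as f:
        return cookie_header_for(json.load(f), url)
```

The result could then be passed along with the same User-Agent string, for example requests.get(api_url, headers={"Cookie": load_cookie_header("./cookies.json", api_url), "User-Agent": ...}), where api_url stands for whatever API link is being called.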
Updated code using a persistent context:
Script 1:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # launch_persistent_context returns a browser context (not a Browser);
    # the profile, including cookies, is kept in the given directory.
    context = p.chromium.launch_persistent_context(r'C:\Users\test\Downloads\pyyy', headless=False)
    page = context.new_page()
    try:
        page.goto('https://www.url.us/signin')
        page.wait_for_selector('#signInFormPage input[name="userName"]', state='visible')
        page.type('#signInFormPage input[name="userName"]', "aaaaa")
        page.type('#signInFormPage input[name="password"]', "aaaa")
        page.click('#userNamePasswordSignInButton')
        # Give the site time to finish signing in.
        page.wait_for_timeout(3000)
    except Exception as e:
        print("Error in playwright script:", e)
    finally:
        page.close()
        context.close()
Script 2:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Reuse the profile directory written by script 1, so the session persists.
    context = p.chromium.launch_persistent_context(r'C:\Users\test\Downloads\pyyy', headless=False)
    page = context.new_page()
    try:
        page.goto('https://www.url.us/Product/aaa')
        # Let the page finish rendering before reading its source.
        page.wait_for_timeout(6000)
        print(page.content())
    except Exception as e:
        print("Error in playwright script:", e)
    finally:
        page.close()
        context.close()
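Since the page itself presumably calls the API with a valid token, another option is to let the browser make that request and read the JSON from the response, rather than calling the API link directly. Below is a rough sketch using Playwright's page.expect_response with the same persistent profile. The "/api/" URL fragment and the function names are assumptions for illustration only; the real API path has to be looked up in the browser's network tab. If no matching response arrives before the timeout, expect_response raises an error.

```python
def looks_like_api_json(url, content_type, api_fragment="/api/"):
    """Return True for responses that look like the JSON API we are after.

    `api_fragment` is a hypothetical placeholder for part of the real API URL.
    """
    return api_fragment in url and "application/json" in (content_type or "").lower()


def capture_api_json(page_url, user_data_dir, api_fragment="/api/"):
    """Open `page_url` in the persistent profile and return the JSON body of
    the first matching API response that the page itself triggers."""
    # Imported here so the filter above can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        context = p.chromium.launch_persistent_context(user_data_dir, headless=False)
        page = context.new_page()
        try:
            # Start waiting before navigating, so the response is not missed.
            with page.expect_response(
                lambda r: looks_like_api_json(
                    r.url, r.headers.get("content-type"), api_fragment
                )
            ) as response_info:
                page.goto(page_url)
            return response_info.value.json()
        finally:
            page.close()
            context.close()
```

It could be called as capture_api_json('https://www.url.us/Product/aaa', r'C:\Users\test\Downloads\pyyy'), after script 1 has signed in with that profile directory.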