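I'm trying to parse a section of the Coinbase blog at https://blog.coinbase.com/, namely the first of the nine articles listed below the first <div class="streamItem streamItem--section js-streamItem" data-action-scope="_actionscope_6">, in order to get the latest news. (I don't know another way to do this on Medium, the platform that hosts the Coinbase blog: posts appear in random date order both on the main page and in search.) For some reason I can't. I first tried requests, and it works, but only up to this section. I then tried Playwright with the following code: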

#!/usr/bin/env python
# coding: utf-8
import asyncio

from playwright.async_api import async_playwright


async def parser():
    page_path = "https://blog.coinbase.com/"
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page(user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36')
        await page.goto(page_path)
        # Grab the HTML as it stands right after navigation completes.
        page_content = await page.content()
        await browser.close()
        print(page_content)


asyncio.run(parser())

Same result: it works only up to this section.

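I also tried a scraping service like the one described at https://scrapingant.com/blog/scrape-dynamic-website-with-python, and it works, but I need to solve this another way, with requests or Playwright, and preferably with requests.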

1 Answer

I was able to get the titles of the news articles with the following code:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto('https://blog.coinbase.com/', wait_until='domcontentloaded')
    # Select every element that has a data-post-id attribute.
    elements = page.query_selector_all('*[data-post-id]')
    titles = []
    for element in elements:
        title_element = element.query_selector('h3 div')
        if title_element is None:
            continue  # this element has no title heading
        title = title_element.text_content()
        if title not in titles:  # skip duplicates
            titles.append(title)
    print(titles)
    browser.close()
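
If the list sometimes comes back empty, one possible cause is that `wait_until='domcontentloaded'` returns before the page's scripts have added the posts. Below is a minimal sketch of a variant that waits for the first post explicitly. It assumes the posts are rendered client-side and still carry the `data-post-id` attribute. The names `collect_titles` and `main` and the 10-second timeout are illustrative, not something the site or Playwright prescribes.

```python
def collect_titles(page, post_selector="[data-post-id]", timeout_ms=10_000):
    """Wait until at least one post is attached to the DOM, then return unique titles."""
    # Raises Playwright's TimeoutError if no post appears within timeout_ms.
    page.wait_for_selector(post_selector, state="attached", timeout=timeout_ms)
    titles = []
    for post in page.query_selector_all(post_selector):
        heading = post.query_selector("h3")
        text = (heading.text_content() or "").strip() if heading else ""
        if text and text not in titles:  # skip empty and duplicate titles
            titles.append(text)
    return titles


def main():
    # Imported here so collect_titles can be exercised without a browser installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://blog.coinbase.com/")
        try:
            print(collect_titles(page))
        finally:
            browser.close()
```

Calling `main()` runs it against the live site. Whether the selector still matches depends on the blog's current markup, so treat it as a starting point.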

It may not be exactly what you're looking for, but hopefully it gets you moving in the right direction!
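
Once you have the rendered HTML as a string (for example from `page.content()` in the question's code), the titles can also be extracted without a browser. Here is a rough sketch using only the standard library. It assumes the markup matches the selectors used above, that is, each post is an element with a `data-post-id` attribute containing an `<h3>` title. `PostTitleParser` and `extract_titles` are hypothetical names.

```python
from html.parser import HTMLParser


class PostTitleParser(HTMLParser):
    """Collect <h3> text found inside elements that carry a data-post-id attribute."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._post_tag = None  # tag name of the post element we are inside, if any
        self._depth = 0        # nesting depth of that tag name
        self._in_h3 = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._post_tag is None:
            # Enter a post when an element has a data-post-id attribute.
            if any(name == "data-post-id" for name, _ in attrs):
                self._post_tag, self._depth = tag, 1
            return
        if tag == self._post_tag:
            self._depth += 1
        if tag == "h3":
            self._in_h3, self._buffer = True, []

    def handle_data(self, data):
        if self._in_h3:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._post_tag is None:
            return
        if tag == "h3" and self._in_h3:
            # Collapse runs of whitespace and skip empty or duplicate titles.
            title = " ".join("".join(self._buffer).split())
            if title and title not in self.titles:
                self.titles.append(title)
            self._in_h3 = False
        if tag == self._post_tag:
            self._depth -= 1
            if self._depth == 0:  # left the post element
                self._post_tag = None
                self._in_h3 = False


def extract_titles(html):
    """Return unique post titles in document order."""
    parser = PostTitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles
```

Usage would be `extract_titles(page_content)`. This only helps if the HTML you fetched already contains the posts, so it does not by itself make a plain requests download work.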

Answered 2021-10-27T17:37:26.917