I want to extract the title and description from the following website:

view-source:http://www.virginaustralia.com/au/en/bookings/flights/make-a-booking/

using the following snippet of its source code:

<title>Book a Virgin Australia Flight | Virgin Australia
</title>
    <meta name="keywords" content="" />
        <meta name="description" content="Search for and book Virgin Australia and partner flights to Australian and international destinations." />

I want the title and the content of the meta description.

I tried Goose, but it did not extract them properly. Here is my code:

website_title = [g.extract(url).title for url in clean_url_data]

website_meta_description=[g.extract(urlw).meta_description for urlw in clean_url_data] 

The result is empty.

4 Answers

Take a look at BeautifulSoup as a solution.

For the question above, you can use the following code to extract the "description" information:

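import requests
from bs4 import BeautifulSoup

url = 'http://www.virginaustralia.com/au/en/bookings/flights/make-a-booking/'
response = requests.get(url)
# Name the parser explicitly; omitting it triggers a warning in current bs4 versions.
soup = BeautifulSoup(response.text, 'html.parser')

metas = soup.find_all('meta')

# Collect the content of every <meta> tag whose name attribute is "description".
print([meta.attrs['content'] for meta in metas if 'name' in meta.attrs and meta.attrs['name'] == 'description'])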

Output:

['Search for and book Virgin Australia and partner flights to Australian and international destinations.']
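Since the question asks for the title as well, here is a minimal sketch that pulls both values with BeautifulSoup. It parses an inline copy of the snippet from the question rather than fetching the live page, and the function name `extract_title_and_description` is only illustrative.

```python
from bs4 import BeautifulSoup

# Inline copy of the snippet from the question, so no network access is needed.
HTML = """<title>Book a Virgin Australia Flight | Virgin Australia
</title>
    <meta name="keywords" content="" />
        <meta name="description" content="Search for and book Virgin Australia and partner flights to Australian and international destinations." />
"""


def extract_title_and_description(html):
    """Return (title, description); either value is None if its tag is missing."""
    soup = BeautifulSoup(html, "html.parser")

    # The <title> text carries a trailing newline in this page, so strip it.
    title_tag = soup.find("title")
    title = title_tag.get_text().strip() if title_tag is not None else None

    # The description lives in the content attribute, not in the tag body.
    meta_tag = soup.find("meta", attrs={"name": "description"})
    description = meta_tag.get("content") if meta_tag is not None else None

    return title, description


title, description = extract_title_and_description(HTML)
print(title)
print(description)
```

To run it against the live page, pass `requests.get(url).text` instead of `HTML`.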
answered 2016-06-24T10:17:36.933

Do you know XPath for HTML? Using the lxml library with XPath is a fast way to extract HTML elements.

import lxml.html

# html_content is assumed to hold the page's HTML source as a string.
doc = lxml.html.document_fromstring(html_content)
title_element = doc.xpath("//title")
website_title = title_element[0].text_content().strip()
# The description sits in the content attribute of <meta name="description">.
meta_description_element = doc.xpath("//meta[@name='description']")
website_meta_description = meta_description_element[0].get('content').strip()
answered 2016-06-24T10:29:23.757

You can use BeautifulSoup to do this.

This should help:

# soup is assumed to be a BeautifulSoup object built from the page's HTML.
# Get the meta description
metas = soup.find_all('meta')
for m in metas:
    if m.get('name') == 'description':
        desc = m.get('content')
        print(desc)
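If installing a third-party parser is not an option, the same loop-and-match idea can be sketched with the standard library's `html.parser` module. This is a different technique from BeautifulSoup, shown only as an alternative; the class name `HeadMetadataParser` is hypothetical, and the HTML is an inline copy of the snippet from the question.

```python
from html.parser import HTMLParser

# Inline copy of the snippet from the question, so no network access is needed.
HTML = """<title>Book a Virgin Australia Flight | Virgin Australia
</title>
    <meta name="keywords" content="" />
        <meta name="description" content="Search for and book Virgin Australia and partner flights to Australian and international destinations." />
"""


class HeadMetadataParser(HTMLParser):
    """Collects the <title> text and the content of <meta name="description">."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._title_parts = []
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # attrs arrives as a list of (name, value) pairs.
            attr_map = dict(attrs)
            if attr_map.get("name") == "description":
                self.description = attr_map.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>.
        if self._in_title:
            self._title_parts.append(data)

    @property
    def title(self):
        text = "".join(self._title_parts).strip()
        return text or None


parser = HeadMetadataParser()
parser.feed(HTML)
parser.close()
print(parser.title)
print(parser.description)
```

The trade-off is more boilerplate in exchange for no extra dependency; for messy real-world pages BeautifulSoup or lxml is usually the more forgiving choice.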
        
answered 2021-02-18T10:45:04.030

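import metadata_parser

# 'www.xyz.com' is a placeholder; replace it with the page you want to inspect.
page = metadata_parser.MetadataParser(url='www.xyz.com')
# This reads the Open Graph description (og:description) collected for the page.
metaDesc = page.metadata['og']['description']
print(metaDesc)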

answered 2021-01-06T09:23:37.263