python - Extract values from meta tags and img URLs given only a website URL, written in Python for a Django application

Question

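I would like to know how to write this in Python and hook it up to a Django application. By "this" I mean extracting values from the meta tags and the img URLs of a page when all I have is the website URL, the same way Facebook does when a user pastes a link.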

score 4 · Accepted Answer

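Personally, I would use the excellent Requests, BeautifulSoup, and lxml libraries to solve this.

Assuming we have the following model in models.py, we can override the save() method to populate the title, description, and keywords attributes: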

from bs4 import BeautifulSoup
import requests

from django.db import models

class Link(models.Model):
    url = models.URLField(blank=True)
    title = models.CharField(max_length=20, blank=True)
    description = models.TextField(blank=True)
    keywords = models.TextField(blank=True)

    def save(self, *args, **kwargs):
        if self.url and not (self.title or self.keywords or self.description):
            # optionally, use 'html' instead of 'lxml' if you don't have lxml installed
            soup = BeautifulSoup(requests.get(self.url).content, "lxml")
            self.title = soup.title.string
            meta = soup.find_all('meta')
            for tag in meta:
                if 'name' in tag.attrs and tag.attrs['name'].lower() in ['description', 'keywords']:
                    setattr(self, tag.attrs['name'].lower(), tag.attrs['content'])

        super(Link, self).save(*args, **kwargs)

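The logic in the overridden save() method could just as well live in a view or a utility function, or even in another method on the Link model that is called conditionally.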
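As a rough sketch of the utility-function route, the parsing step can be pulled out of the model so it can be exercised without a network request. This is an illustration under stated assumptions, not part of the original answer: it uses the standard library's html.parser in place of BeautifulSoup so that it runs with no extra packages, and the names MetaExtractor and extract_metadata are hypothetical. It also collects og:image and img src values, which the question asks about.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaExtractor(HTMLParser):
    """Collects the <title>, <meta> values, and <img> URLs of an HTML page.

    Hypothetical helper; stands in for the BeautifulSoup-based parsing.
    """

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.meta = {}      # e.g. {"description": "...", "og:image": "..."}
        self.images = []    # absolute URLs taken from <img src="...">
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Regular tags use name=..., Open Graph tags use property=...
            key = (attrs.get("name") or attrs.get("property") or "").lower()
            if key and attrs.get("content"):
                # Keep the first value seen for each key
                self.meta.setdefault(key, attrs["content"])
        elif tag == "img" and attrs.get("src"):
            # Resolve relative image paths against the page URL
            self.images.append(urljoin(self.base_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html, base_url=""):
    """Return title, description, keywords, og:image, and img URLs as a dict."""
    parser = MetaExtractor(base_url)
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keywords": parser.meta.get("keywords", ""),
        "image": parser.meta.get("og:image", ""),
        "images": parser.images,
    }
```

The dict returned by something like extract_metadata(requests.get(url).text, base_url=url) could then be copied onto the Link fields from save(), a view, or a separate model method.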

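The above works with Django 1.4. No guarantees, but it should work with earlier versions as well.

Edit: fixed a syntax error and mentioned an alternative parser, thanks to @jinesh and @stonefury.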

score 2 · Accepted Answer

jnovinger's answer mostly worked for me in Django 1.5, but I had to make a few tweaks. First, the code itself seems to have a typo. The line

soup = BeautifulSoup(requests.get(self.url).contents, "lxml")

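raises AttributeError: 'Response' object has no attribute 'contents'. According to the Requests documentation, I believe the correct attribute is requests.get(self.url).content, although requests.get().text also seems to work.

Before putting it into my Django project, I first tried it out in a simple script, and in that case I also got the following error: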

requests.exceptions.MissingSchema: Invalid URL u'www.example.com/': No schema supplied

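This was caused by naively leaving off the http:// in front of the URL. In Django this is handled automatically, but I mention it in case other beginners make the same mistake and don't understand what "missing schema" means.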
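For a standalone script, one way to avoid the MissingSchema error is to add a scheme before calling requests.get(). The helper below is only a sketch: the name ensure_scheme and the choice of http as the default are my assumptions, not something Requests or Django provides.

```python
import re

# A scheme is a letter followed by letters, digits, "+", "." or "-", then "://"
_SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def ensure_scheme(url, default="http"):
    """Return url unchanged if it already has a scheme, else prefix default://."""
    url = url.strip()
    if _SCHEME_RE.match(url):
        return url
    # Also covers protocol-relative input such as "//example.com"
    return "{}://{}".format(default, url.lstrip("/"))
```

With this, ensure_scheme("www.example.com/") yields a URL that requests.get() will accept, while URLs that already start with http:// or https:// pass through untouched.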

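The last problem I ran into was due to the lack of validation on content length. When the link being checked yields a title, keywords, or description longer than the limit of the corresponding model field (the max_length parameter), saving the new Link fails with a DatabaseError: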

DatabaseError: value too long for type character varying(20)

Here is my rough fix; there may be a better way, but this seems to work.

from bs4 import BeautifulSoup
import requests

from django.db import models

class Link(models.Model):
    url = models.URLField(blank=True)
    title = models.CharField(max_length=100, blank=True)
    description = models.TextField(blank=True)
    keywords = models.TextField(blank=True)

    def save(self, *args, **kwargs):
        if self.url and not (self.title or self.keywords or self.description):
            soup = BeautifulSoup(requests.get(self.url).content, "lxml")
            limit = self._meta.get_field('title').max_length    # check field max_length
            self.title = soup.title.string[:limit]              # limit title to max_length
            meta = soup.find_all('meta')
            for tag in meta:
                if 'name' in tag.attrs and tag.attrs['name'].lower() in ['description', 'keywords']:
                    field = tag.attrs['name'].lower()                   # check whether description or keywords
                    limit = self._meta.get_field(field).max_length      # check field max_length
                    content = tag.attrs['content'][:limit]              # limit field to max_length
                    setattr(self, tag.attrs['name'].lower(), content)

        super(Link, self).save(*args, **kwargs)

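Note that this simply cuts the title, description, and keyword list off mid-word rather than stopping at the last complete word, so you may end up with meaningless word fragments.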
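If the mid-word cut is a problem, the slice could be replaced with a small helper that backs up to the last space. This is a minimal sketch, assuming plain space-separated text; truncate_words is a hypothetical name. It accepts limit=None because max_length is None for a TextField.

```python
def truncate_words(text, limit):
    """Shorten text to at most limit characters, cutting at a word boundary."""
    if limit is None or len(text) <= limit:
        return text
    cut = text[:limit]
    # If the next character is whitespace, the cut already ends on a whole word
    if text[limit].isspace():
        return cut.rstrip()
    head, sep, _ = cut.rpartition(" ")
    # With no space at all there is no word boundary, so fall back to a hard cut
    return head.rstrip() if sep else cut
```

In the save() method above, soup.title.string[:limit] would then become truncate_words(soup.title.string, limit), and likewise for the meta tag content.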
