
We have to extract a specified number n of blogs by reading them from a text file that contains a list of blogs.

Then I extract each blog's data and append it to a file.

This is just one part of a larger task that applies NLP to the data.

This is what I have done so far:

import urllib2
from bs4 import BeautifulSoup
def create_data(n):
    blogs=open("blog.txt","r") # open the file containing the list of blogs

    f=file("data.txt","wt") # create a file data.txt

    with open("blog.txt") as blogs:
        head = [blogs.next() for x in xrange(n)]
        page = urllib2.urlopen(head['href'])

        soup = BeautifulSoup(page)
        link = soup.find('link', type='application/rss+xml')
        print link['href']

        rss = urllib2.urlopen(link['href']).read()
        souprss = BeautifulSoup(rss)
        description_tag = souprss.find('description')

        f = open("data.txt","a") # data file created for applying NLP
        f.write(description_tag)

This code doesn't work. It only works when I give it the link directly, like:

page = urllib2.urlopen("http://www.frugalrules.com")

I am calling this function from a different script, where the user provides the input n.

What am I doing wrong?

Traceback:

Traceback (most recent call last):
  File "C:/beautifulsoup4-4.3.2/main.py", line 4, in <module>
    create_data(2)#calls create_data(n) function from create_data
  File "C:/beautifulsoup4-4.3.2\create_data.py", line 14, in create_data
    page=urllib2.urlopen(head)
  File "C:\Python27\lib\urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 395, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'

1 Answer


head is a list:

    head = [blogs.next() for x in xrange(n)]

Lists are indexed by integers (or slices). You can't use head['href'] when head is a list:

    page = urllib2.urlopen(head['href'])
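A minimal reproduction of the error (the URLs here are hypothetical, standing in for lines read from blog.txt):

```python
# head, as built by the list comprehension, is a list of lines from blog.txt
head = ["http://www.frugalrules.com\n", "http://example.com/blog\n"]

print(head[0])        # integer indexing works fine

try:
    head['href']      # string indexing on a list raises TypeError
except TypeError as e:
    print(e)
```

Indexing with the string 'href' only makes sense on a dict-like object, such as a BeautifulSoup Tag, not on a plain list of strings.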

Without knowing what the contents of blog.txt look like, it's hard to say how to fix this. If each line of blog.txt contains a URL, then you could use:

with open("blog.txt") as blogs:
    for url in list(blogs)[:n]:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        ...
        with open('data.txt', 'a') as f:
            f.write(...)

Note that file is a deprecated form of open (and was removed in Python 3). Instead of using f=file("data.txt","wt"), use the more modern with-statement syntax (shown above).
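As a small illustration (with placeholder file content), the with statement also guarantees that the file is closed when the block exits, even if the body raises an exception:

```python
with open("data.txt", "wt") as f:
    f.write("placeholder description\n")

print(f.closed)  # True: the file was closed on leaving the with block
```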


For example,

import urllib2
import bs4 as bs

def create_data(n):
    with open("data.txt", "wt") as f:
        pass
    with open("blog.txt") as blogs:
        for url in list(blogs)[:n]:
            page = urllib2.urlopen(url)
            soup = bs.BeautifulSoup(page.read())

            link = soup.find('link', type='application/rss+xml')
            print(link['href'])

            rss = urllib2.urlopen(link['href']).read()
            souprss = bs.BeautifulSoup(rss)
            description_tag = souprss.find('description')

            with open('data.txt', 'a') as f:
                f.write('{}\n'.format(description_tag))

create_data(2)

I assume you are opening, writing to, and closing data.txt each time through the loop because you want to save partial results, perhaps in case the program is forced to terminate prematurely.

Otherwise, it would be easier to just open the file once, at the beginning:

with open("blog.txt") as blogs, open("data.txt", "wt") as f:
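To sketch what that single-open version might look like, with a hypothetical blog.txt and the urllib2/BeautifulSoup work replaced by a placeholder so the example runs on its own:

```python
# create a hypothetical blog.txt with one URL per line
with open("blog.txt", "wt") as f:
    f.write("http://example.com/a\nhttp://example.com/b\nhttp://example.com/c\n")

n = 2
# open data.txt once, instead of re-opening it on every iteration
with open("blog.txt") as blogs, open("data.txt", "wt") as out:
    for url in list(blogs)[:n]:
        # stand-in for the urlopen / BeautifulSoup / description extraction
        out.write('{}\n'.format(url.strip()))

with open("data.txt") as f:
    print(f.read())
```

The trade-off is that if the program dies mid-run, anything buffered but not yet flushed to data.txt may be lost, which is why the per-iteration open/append pattern above can still be the right choice.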
Answered 2013-11-12T13:06:22.067