
Apologies in advance for the long block of code that follows. I'm new to BeautifulSoup, but I found some useful tutorials on using it to scrape RSS feeds for blogs. Full disclosure: this code is adapted from this video tutorial, which has been immensely helpful in getting this off the ground: http://www.youtube.com/watch?v=Ap_DlSrT-iE.

Here's my problem: the video does a great job of showing how to print the relevant content to the console, but I need to write each article's text out to a separate .txt file and save it to some directory (right now I'm just trying to save to my Desktop). I know the problem lies in the scope of the two for-loops near the end of the code (I've tried to flag this in a comment so people can find it quickly; it's the last comment, beginning # Here's where I'm lost...), but I can't seem to figure it out on my own.

Currently, the program takes the text of the last article it reads and writes it out to as many .txt files as there are entries in the variable listIterator. So, in this case, I believe 20 .txt files get written, but they all contain the text of that last article. What I want the program to do is loop over each article and write each article's text to its own .txt file. Sorry for the verbosity, but any insight would be really appreciated.

from urllib import urlopen
from bs4 import BeautifulSoup
import re

# Read in webpage.
webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()

# On RSS Feed site, find tags for title of articles and 
# tags for article links to be downloaded.

patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

# Find the tags listed in variables above in the articles.
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)

# Create a list that is the length of the number of links
# from the RSS feed page. Use this to iterate over each article,
# read it in, and find relevant text or <p> tags.
listIterator = []
listIterator[:] = range(len(findPatTitle))

for i in listIterator:
    # Print each title to console to ensure program is working. 
    print findPatTitle[i]

    # Read in the linked-to article.
    articlePage = urlopen(findPatLink[i]).read()

    # Find the beginning and end of articles using tags listed below.
    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")

    # Define article variable that will contain all the content between the 
    # beginning of the article to the end as indicated by variables above.
    article = articlePage[divBegin:divEnd]

    # Parse the page using BeautifulSoup
    soup = BeautifulSoup(article)

    # Compile list of all <p> tags for each article and store in paragList
    paragList = soup.findAll('p')

    # Create empty string to eventually convert items in paragList to string to 
    # be written to .txt files.
    para_string = ''

    # Here's where I'm lost and have some sort of scope issue with my for-loops.
    for i in paragList:
        para_string = para_string + str(i)
        newlist = range(len(findPatTitle))
        for i in newlist:
            ofile = open(str(listIterator[i])+'.txt', 'w')
            ofile.write(para_string)
            ofile.close()

1 Answer


It only looks as though the last article was written, because every article is being written, over and over, into the same 20 separate files. Let's look at the following:

for i in paragList:
    para_string = para_string + str(i)
    newlist = range(len(findPatTitle))
    for i in newlist:
        ofile = open(str(listIterator[i])+'.txt', 'w')
        ofile.write(para_string)
        ofile.close()

You are writing the same 20 files over and over with para_string on every iteration. What you need to do instead is append each para_string to a separate list, say paraStringList, and then write all of its contents out to separate files, like this:

for i, var in enumerate(paraStringList):  # Enumerate creates a tuple
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
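
As an aside, here is a minimal standalone sketch of what enumerate is doing there (the two strings are made-up sample data standing in for paraStringList entries):

sample = ['first article text', 'second article text']  # stand-ins for paraStringList entries
for i, var in enumerate(sample):
    # enumerate pairs each item with its index, so the filenames
    # 0.txt, 1.txt, ... line up with the order the articles were scraped in.
    print i, var
# prints:
# 0 first article text
# 1 second article text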

The file-writing loop needs to live outside of your main loop, i.e. outside for i in listIterator: (...). Here is a working version of the program:

from urllib import urlopen
from bs4 import BeautifulSoup
import re


webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()

patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

findPatTitle = re.findall(patFinderTitle, webpage)[0:4]
findPatLink = re.findall(patFinderLink, webpage)[0:4]

listIterator = []
listIterator[:] = range(len(findPatTitle))
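# Collect one string of concatenated <p> tags per article, so all the
# files can be written after the main loop finishes.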
paraStringList = []

for i in listIterator:

    print findPatTitle[i]

    articlePage = urlopen(findPatLink[i]).read()

    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")

    article = articlePage[divBegin:divEnd]

    soup = BeautifulSoup(article)

    paragList = soup.findAll('p')

    para_string = ''

    for i in paragList:
        para_string += str(i)

    paraStringList.append(para_string)

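# Outside the main loop: write each article's collected paragraphs to its own numbered .txt file.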
for i, var in enumerate(paraStringList):
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
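
With the [0:4] slice on the regex matches, this scrapes at most four entries and writes them to 0.txt, 1.txt, 2.txt and 3.txt in the directory the script is run from, one article per file. If you want the files on your Desktop, as the question mentions, one way (a sketch, assuming a standard home-directory layout) is to build the path with os.path:

import os

# Hypothetical variation: same write loop as above, but with an absolute path to the Desktop.
desktop = os.path.join(os.path.expanduser('~'), 'Desktop')
for i, var in enumerate(paraStringList):
    with open(os.path.join(desktop, '{0}.txt'.format(i)), 'w') as writer:
        writer.write(var)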
Answered 2013-10-27T18:52:11.507