I have a series of HTML files that are parsed into a single text file using Beautiful Soup. The HTML files are formatted such that their output is always three lines within the text file, so the output will look something like:
Hello!
How are you?
Well, Bye!
But it could just as easily be:
83957
And I ain't coming back!
hgu39hgd
In other words, the contents of the HTML files are not really standard across each of them, but they do always produce three lines.
So, I was wondering where I should start if I want to then take the text file that is produced from Beautiful Soup and parse that into a CSV file with columns such as (using the above examples):
Title     Intro                       Tagline
Hello!    How are you?                Well, Bye!
83957     And I ain't coming back!    hgu39hgd
The Python code that strips the HTML and appends the text to the text file is this:
import os
import glob
import codecs
import csv
from bs4 import BeautifulSoup

path = "c:\\users\\me\\downloads\\"

for infile in glob.glob(os.path.join(path, "*.html")):
    # Read each HTML file as UTF-8 and parse it.
    with codecs.open(infile, "r", "utf-8") as html_file:
        soup = BeautifulSoup(html_file.read(), "html.parser")
    # Append the extracted text to a single output file.
    with open("extracted.txt", "a") as myfile:
        myfile.write(soup.get_text())
And I gather I can use this to set up the columns in my CSV file:
csv.put_HasColumnNames(True)
csv.SetColumnName(0,"title")
csv.SetColumnName(1,"intro")
csv.SetColumnName(2,"tagline")
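For comparison, the calls above are not part of Python's standard csv module (the one imported earlier). A minimal sketch of writing the same header row with the standard module might look like this; the function name write_header and the output path are assumptions made for illustration:

```python
import csv

# Column names taken from the example table in the question.
FIELDNAMES = ["title", "intro", "tagline"]


def write_header(csv_path):
    """Create (or overwrite) csv_path and write only the header row."""
    # newline="" lets the csv module control line endings itself,
    # which avoids extra blank rows on Windows.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(FIELDNAMES)
```

Calling write_header("output.csv") would produce a file whose first line is title,intro,tagline.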
Where I'm drawing a blank is how to iterate through the text file (extracted.txt) one line at a time and, as I reach each new line, put it in the correct cell of the CSV file. The first several lines of the file are blank, and there are many blank lines between each grouping of text. So, first I would need to open the file and read it:
with open("extracted.txt") as file:
    for line in file:
        pass  # csv.SetCell(0, 0, X) (obviously, I don't know what to put in X)
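As a possible way to deal with the blank lines, here is a minimal sketch of a generator that yields only the lines that contain text. The name nonblank_lines is hypothetical, and the file is opened with the platform default encoding on the assumption that it was written the same way:

```python
def nonblank_lines(path):
    """Yield each non-blank line of the file, stripped of surrounding whitespace."""
    # Default encoding is used here (an assumption: same as when the file was written).
    with open(path) as f:
        for line in f:
            text = line.strip()
            if text:  # skip lines that are empty or whitespace-only
                yield text
```

Iterating over nonblank_lines("extracted.txt") would then give the text lines in order, regardless of how many blank lines sit between them.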
Also, I don't know how to tell Python to keep reading the file and adding to the CSV file until it reaches the end. In other words, there's no way to know in advance how many lines the HTML files will produce, so I can't just go from csv.SetCell(0,0) to csv.SetCell(999,999).
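One possible approach, sketched here under the assumption that every record really is exactly three non-blank lines in a row, is to let the file iterator decide when to stop and group the lines in threes as they arrive. This uses the standard csv module rather than the SetCell-style calls above; the names rows_of_three and text_to_csv and the file paths are hypothetical:

```python
import csv
from itertools import islice


def rows_of_three(lines):
    """Group an iterable of strings into lists of three until it is exhausted."""
    it = iter(lines)
    while True:
        row = list(islice(it, 3))
        if not row:
            break
        # A trailing row may be shorter than three if the input is malformed.
        yield row


def text_to_csv(txt_path, csv_path):
    """Read txt_path, drop blank lines, and write three-column rows to csv_path."""
    with open(txt_path) as src, open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["title", "intro", "tagline"])
        # Generator of stripped, non-blank lines; no line count is needed.
        nonblank = (line.strip() for line in src if line.strip())
        writer.writerows(rows_of_three(nonblank))
```

With this shape, text_to_csv("extracted.txt", "output.csv") would simply run until the text file ends. The csv writer also quotes fields that contain commas, such as "Well, Bye!", so they stay in a single column.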