I am using BeautifulSoup4 to scrape this web page; however, the text that BeautifulSoup returns contains weird Unicode characters.
Here is my code:
import urllib2
import StringIO
import gzip
from bs4 import BeautifulSoup

site = "http://en.wikipedia.org/wiki/" + a + "_" + str(b)
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site, headers=hdr)
req.add_header('Accept-Encoding', 'gzip')  # ask the server for a gzipped response
page = urllib2.urlopen(req)

if page.info().get('Content-Encoding') == 'gzip':  # response came back gzipped
    data = page.read()
    data = StringIO.StringIO(data)
    gzipper = gzip.GzipFile(fileobj=data)
    html = gzipper.read()
    soup = BeautifulSoup(html, fromEncoding='gbk')
else:
    soup = BeautifulSoup(page)

section = soup.find('span', id='Events').parent
events = section.find_next('ul').find_all('li')
print soup.originalEncoding
for x in events:
    print x
Basically, I want x to be in plain English. Instead, I get things that look like this:
<li><a href="/wiki/153_BC" title="153 BC">153 BC</a> – <a href="/wiki/Roman_consul" title="Roman consul">Roman consuls</a> begin their year in office.</li>
There's only one example in this particular string, but you get the idea.
Related: I go on to cut up this string with some regex and other string-cutting methods; should I convert it to plain text before or after I cut it up? I'm assuming it doesn't matter, but since I'm deferring to SO anyway, I thought I'd ask.
If anyone knows how to fix this, I'd appreciate it. Thanks
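For reference, this is the kind of thing I've been experimenting with to get plain text out (just a guess on my part; get_text() is the bs4 call for pulling a tag's text out as unicode with the markup stripped), but I'm not sure it's the right approach:

for x in events:
    # get_text() returns a unicode string without the tags;
    # encode it to UTF-8 bytes before printing (assuming my terminal is UTF-8)
    print x.get_text().encode('utf-8')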
EDIT: Thanks J.F. for the tip. I now use this as my for loop:
for x in events:
    x = x.encode('ascii')
    x = str(x)
    # Find content between the tags
    regex2 = re.compile(">[^>]*<")
    textList = re.findall(regex2, x)
    text = "".join(textList)
    text = text.replace(">", "")
    text = text.replace("<", "")
    contents.append(text)
However, I still get things like this:
2013 – At least 60 people are killed and 200 injured in a stampede after celebrations at Félix Houphouët-Boigny Stadium in Abidjan, Ivory Coast.
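From what I can tell (and I may be wrong about this), those characters look like the UTF-8 bytes for an en dash being displayed as if they were Windows-1252. A quick illustration of what I think is happening:

s = u'\u2013'.encode('utf-8')  # the en dash as UTF-8 bytes: '\xe2\x80\x93'
print s.decode('utf-8')        # decoded correctly, this is just an en dash
print s.decode('cp1252')       # decoded wrongly, it becomes three junk characters like the ones above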
EDIT: Here is how I create my Excel spreadsheet (CSV) and feed in my lists:
rows = zip(days, contents)
with open("events.csv", "wb") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
So the CSV file is created while the program runs, and everything is written to it after the lists are generated. I just need it to be readable text at that point.
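One thing I'm unsure about: as far as I can tell from the docs, the Python 2 csv module doesn't accept unicode strings directly, so maybe I need to encode each field myself before writing it. Something like this is what I had in mind (just a sketch; I don't actually know which encoding Excel expects):

rows = zip(days, contents)
with open("events.csv", "wb") as f:
    writer = csv.writer(f)
    for row in rows:
        # encode any unicode fields to UTF-8 bytes; leave plain byte strings alone
        writer.writerow([c.encode('utf-8') if isinstance(c, unicode) else c for c in row])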