You don't need to "clean" the HTML before using BeautifulSoup. Just parse the dates and events straight into a CSV file:
# Python 2 script (urllib2, CSV file opened in binary mode)
import csv
from urllib2 import urlopen

from bs4 import BeautifulSoup

url = "http://www.thecomedystudio.com/schedule.html"
soup = BeautifulSoup(urlopen(url))

with open('output.csv', 'wb') as f:
    writer = csv.writer(f)
    # Each date is a <b> directly inside a centered <div> inside a <td>
    for item in soup.select('td div[align=center] > b'):
        date = ' '.join(el.strip() for el in item.find_all(text=True))
        # b -> div -> td, then the next <td> in the row holds the event text
        event = item.parent.parent.find_next_sibling('td').get_text(strip=True)
        writer.writerow([date, event])
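
The script above targets Python 2 (`urllib2`, and a CSV file opened in `'wb'` mode). As a rough sketch of the same approach on Python 3, assuming the page still uses the same table layout, the version below swaps in `urllib.request`, opens the file in text mode with `newline=""`, and names the parser explicitly. The helper names `extract_rows` and `main` are illustrative choices, not part of the original answer, and `stripped_strings` is used in place of `find_all(text=True)`.

```python
import csv
from urllib.request import urlopen

from bs4 import BeautifulSoup

URL = "http://www.thecomedystudio.com/schedule.html"


def extract_rows(html):
    """Yield (date, event) pairs from the schedule markup (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Each date is a <b> directly inside a centered <div> inside a <td>.
    for item in soup.select('td div[align="center"] > b'):
        date = " ".join(item.stripped_strings)
        # b -> div -> td; the event text is in the next cell of the same row.
        event_cell = item.parent.parent.find_next_sibling("td")
        if event_cell is None:
            continue  # skip rows without an event cell instead of crashing
        yield date, event_cell.get_text(strip=True)


def main():
    """Fetch the live page and write output.csv (needs network access)."""
    html = urlopen(URL).read()
    # Python 3's csv module expects a text-mode file opened with newline="".
    with open("output.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(extract_rows(html))
```

Calling `main()` fetches the page and writes the file. Keeping the parsing in `extract_rows` means it can be tried on a saved copy of the HTML without hitting the site.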
Contents of output.csv after running the script:
Fri 2.27.15,"Rick Canavan hosts with Christine An, Rachel Bloom, Dan Crohn, Wes Hazard, James Huessy, Kelly MacFarland, Peter Martin, Ted Pettingell."
Sat 2.28.15,"Rick Jenkins hosts Taylor Connelly, Lilian DeVane, Andrew Durso, Nate Johnson, Peter Martin, Andrew Mayer, Kofi Thomas, Tim Willis."
Sun 3.1.15,"Peter Martin hosts Sunday Funnies with Nonye Brown-West, Ryan Donahue, Joe Kozlowski, Casey Malone, Etrane Martinez, Kwasi Mensah, Anthony Zonfrelli, Christa Weiss and Sam Jay closing."
Tue 3.3.15,Mystery Lounge! The old-est and only-est magic show in New England! with guest comedian Ryan Donahue.
...
Thu 12.31.15,"New Year's Eve! with Rick Jenkins, Nathan Burke."
Fri 1.1.16,Rick Canavan hosts New Year's Day.
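
Some rows in the output are wrapped in double quotes and others are not. That is the `csv` module's default quoting (`csv.QUOTE_MINIMAL`): a field is quoted only when it contains the delimiter, the quote character, or a line break. A small self-contained sketch, writing two of the rows above to an in-memory buffer instead of a file (the variable names are just for illustration):

```python
import csv
import io

rows = [
    ["Thu 12.31.15", "New Year's Eve! with Rick Jenkins, Nathan Burke."],
    ["Fri 1.1.16", "Rick Canavan hosts New Year's Day."],
]

buffer = io.StringIO()
writer = csv.writer(buffer)  # default quoting is csv.QUOTE_MINIMAL
writer.writerows(rows)
lines = buffer.getvalue().splitlines()

# The first event contains commas, so it is quoted; the second is not.
for line in lines:
    print(line)

# Reading the text back recovers the original two fields per row.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
```

Any CSV reader that follows the same quoting convention will split each row back into exactly two fields, date and event, regardless of commas inside the event text.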