I extracted the following HTML from IMDb's awards page:
<table><tr><td><big>Academy Awards, USA</big> </td> </tr> <tr> <th>Year</th><th>Result</th><th>Award</th><th>Category/Recipient(s)</th> </tr> <tr> <td rowspan="11" align="center" valign="middle"> 1978 </td> <td rowspan="7" align="center" valign="middle"><b>Won</b></td> <td rowspan="6" align="center" valign="middle">Oscar</td> <td valign="top"> Best Art Direction-Set Decoration John Barry Norman Reynolds Leslie Dilley Roger Christian <small> </small> </td> </tr> <tr> <td valign="top"> Best Costume Design John Mollo <small> </small> </td> </tr> <tr> <td valign="top"> Best Effects, Visual Effects John Stears John Dykstra Richard Edlund Grant McCune Robert Blalack <small> </small> </td> </tr> <tr> <td valign="top"> Best Film Editing Paul Hirsch Marcia Lucas Richard Chew <small> </small> </td> </tr> <tr> <td valign="top"> Best Music, Original Score John Williams <small> </small> </td> </tr> <tr> <td valign="top"> Best Sound Don MacDougall Ray West Bob Minkler Derek Ball <small> Derek Ball was not present at the awards ceremony. </small> </td> </tr> <tr> <td rowspan="1" align="center" valign="middle">Special Achievement Award</td> <td valign="top"> Ben Burtt (as Benjamin Burtt Jr.) <small> For sound effects. (For the creation of the alien, creature and robot voices.) </small> </td> </tr> <tr> <td rowspan="4" align="center" valign="middle"><b>Nominated</b></td> <td rowspan="4" align="center" valign="middle">Oscar</td> <td valign="top"> Best Actor in a Supporting Role Alec Guinness <small> </small> </td> </tr> <tr> <td valign="top"> Best Director George Lucas <small> </small> </td> </tr> <tr> <td valign="top"> Best Picture Gary Kurtz <small> </small> </td> </tr> <tr> <td valign="top"> Best Writing, Screenplay Written Directly for the Screen George Lucas <small> </small> </td> </tr> <tr> </tr></table>
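I want to pull the headers (Year, Result, Award, and Category/Recipient(s)) into one list, and then pull each column into its own list. For example, using the Academy Awards table (reference page: http://www.imdb.com/title/tt0076759/awards):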
Columns = {"Year", "Result", "Award", "Category/Recipient"}
Years = {"1978", "1978", "1978", "1978", "1978", "1978", "1978"}
Results = {"Won", "Won", "Won", "Won", "Won", "Won", "Won"}
Awards = {"Oscar", "Oscar", "Oscar", "Oscar", "Oscar", "Oscar", "Special Achievement Award"}
Categories/Recipients = {"Best Art Direction-Set Decoration (John Barry, Norman Reynolds, Leslie Dilley, Roger Christian)", "Best Costume Design (John Mollo)", "Best Effects, Visual Effects (John Stears, John Dykstra, Richard Edlund, Grant McCune, Robert Blalack)", "Best Film Editing (Paul Hirsch, Marcia Lucas, Richard Chew)", "Best Music, Original Score (John Williams)", "Best Sound (Don MacDougall, Ray West, Bob Minkler, Derek Ball)", "(Ben Burtt (as Benjamin Burtt Jr.))"}
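As you can see, I removed the unnecessary whitespace from the table and put all of the names in parentheses. The names all had tags around them, but I stripped those out (they can stay if that makes it easier to wrap the names in parentheses). Apart from the Columns list, every list has the same number of items.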
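To make the goal concrete, here is a rough sketch of one way the column lists could be built, with each `rowspan` cell repeated on every row it covers. It is written for Python 3 and uses the standard library's `html.parser` rather than the lxml/urllib2 setup in my script. The names `TableGrid`, `expand_rowspans`, `table_columns` and `SAMPLE` are made up for illustration. `SAMPLE` is an abridged version of the table above, and the sketch assumes the title row containing `<big>` has already been removed so that the first row is the header row. It drops the `<small>` notes and collapses whitespace, but it does not try to put the names in parentheses, because once the tags are stripped the names cannot be told apart from the category text.

```python
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Collect each table row as a list of (text, rowspan) cells."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # row currently being read
        self._cell = None       # (text_parts, rowspan) of the open cell
        self._in_small = False  # True while inside a <small> note

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = ([], int(dict(attrs).get("rowspan", 1)))
        elif tag == "small":
            self._in_small = True

    def handle_data(self, data):
        if self._cell is not None and not self._in_small:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag == "small":
            self._in_small = False
        elif tag in ("td", "th") and self._cell is not None:
            parts, span = self._cell
            # Collapse runs of whitespace and trim both ends.
            self._row.append((" ".join("".join(parts).split()), span))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def expand_rowspans(rows, width):
    """Repeat every rowspan cell on each of the rows it covers."""
    pending = {}  # column index -> (text, rows still to fill)
    grid = []
    for row in rows:
        cells = iter(row)
        out = []
        for col in range(width):
            if col in pending:
                text, left = pending.pop(col)
                if left > 1:
                    pending[col] = (text, left - 1)
            else:
                text, span = next(cells, ("", 1))
                if span > 1:
                    pending[col] = (text, span - 1)
            out.append(text)
        grid.append(out)
    return grid


def table_columns(html):
    """Return (headers, columns); assumes the first row is the header row."""
    parser = TableGrid()
    parser.feed(html)
    rows = [row for row in parser.rows if row]  # drop empty <tr></tr>
    headers = [text for text, _ in rows[0]]
    grid = expand_rowspans(rows[1:], len(headers))
    return headers, [list(col) for col in zip(*grid)]


# Abridged from the table above (title row removed, three data rows kept).
SAMPLE = """
<table>
<tr><th>Year</th><th>Result</th><th>Award</th><th>Category/Recipient(s)</th></tr>
<tr><td rowspan="3"> 1978 </td><td rowspan="2"><b>Won</b></td>
    <td rowspan="1">Oscar</td>
    <td> Best Costume Design John Mollo <small> </small></td></tr>
<tr><td rowspan="1">Special Achievement Award</td>
    <td> Ben Burtt (as Benjamin Burtt Jr.) <small> For sound effects. </small></td></tr>
<tr><td rowspan="1"><b>Nominated</b></td><td rowspan="1">Oscar</td>
    <td> Best Director George Lucas <small> </small></td></tr>
</table>
"""

headers, columns = table_columns(SAMPLE)
print(headers)
for name, values in zip(headers, columns):
    print(name, values)
```

Unlike the example lists above, this sketch keeps the Nominated rows as well; filtering on the Result column afterwards would restrict it to the wins.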
Here is my current script, so you can see how I'm going about it:
import shutil
import urllib2
import re
from lxml import etree
award_usock = urllib2.urlopen('http://www.imdb.com/title/tt0076759' + '/awards')
award_html = award_usock.read()
award_usock.close()
if "<big>" in award_html:
    for a_show in re.finditer('<big>',award_html):
        award_show_full_end = award_html.find('<td colspan="4"> </td>',a_show.end())
        award_show_full = award_html[a_show.start():award_show_full_end]
        award_show_full = award_show_full.replace('\n','')
        # award_show_full = award_show_full.replace(' ','')
        award_show_full = award_show_full.replace('</a>','')
        award_show_full = award_show_full.replace('<br />','')
        award_show_full = re.sub('<a href="/name/[^>]*>', '', award_show_full)
        award_show_full = re.sub('<a href="/title/[^>]*>', '', award_show_full)
        for a_s_title in re.finditer('<a href="',award_show_full):
            award_title_loc = award_show_full.find('<a href="')
            award_title_end = award_show_full.find('">',award_title_loc+10)
            award_title_del = award_show_full[award_title_loc:award_title_end+2]
            award_show_full = award_show_full.replace(award_title_del,'')
        award_show_full = '<table><tr><td>' + award_show_full.replace('<br>','') + '</tr></table>'
        award_show_loc = award_html.find('>',a_show.end())
        award_show_end = award_html.find('</a></big>',a_show.end())
        award_show = award_html[award_show_loc+1:award_show_end]
        award_show_table = etree.XML(award_show_full)
        award_show_rows = iter(award_show_table)
        award_show_headers = [award_show_col.text for award_show_col in next(award_show_rows)]
        for award_show_row in award_show_rows:
            award_show_values = [award_show_col.text for award_show_col in award_show_row]
            print dict(zip(award_show_headers,award_show_values))
But this produces the following output:
{None: 'Year'}
{None: ' 1978 '}
{None: ' Best Costume Design John Mollo '}
{None: ' Best Effects, Visual Effects John Stears John Dykstra Richard Edlund Grant McCune Robert Blalack '}
{None: ' Best Film Editing Paul Hirsch Marcia Lucas Richard Chew '}
{None: ' Best Music, Original Score John Williams '}
{None: ' Best Sound Don MacDougall Ray West Bob Minkler Derek Ball '}
{None: 'Special Achievement Award'}
{None: None}
{None: ' Best Director George Lucas '}
{None: ' Best Picture Gary Kurtz '}
{None: ' Best Writing, Screenplay Written Directly for the Screen George Lucas '}
{}
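One likely reading of this output, sketched below under some assumptions: the first row of the rebuilt table is the title row, whose single `<td>` holds only a `<big>` child, so its `.text` is `None` and the header list becomes `[None]`. `zip` then stops after the first cell of every later row, which would explain why each dict has exactly one `None` key. The snippet uses Python 3 and the standard library's `xml.etree.ElementTree` in place of lxml (its `.text` behaves the same way in this case), and the three-row table is a cut-down stand-in for the real one.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the rebuilt table: title row, header row, one data row.
table = ET.fromstring(
    "<table>"
    "<tr><td><big>Academy Awards, USA</big></td></tr>"
    "<tr><th>Year</th><th>Result</th><th>Award</th>"
    "<th>Category/Recipient(s)</th></tr>"
    "<tr><td> 1978 </td><td><b>Won</b></td><td>Oscar</td>"
    "<td> Best Costume Design John Mollo </td></tr>"
    "</table>"
)

rows = iter(table)
# The title row has one cell whose text lives inside <big>, so .text is None.
headers = [col.text for col in next(rows)]
# zip() stops at the shorter sequence, so only the first cell of each row survives.
results = [dict(zip(headers, [col.text for col in row])) for row in rows]

print(headers)
for result in results:
    print(result)

# itertext() also walks into child elements such as <big> or <b>.
title_cell = table[0][0]
title = "".join(title_cell.itertext())
print(title)
```

If that reading is right, skipping the title row before reading the headers and using `itertext()` for the cell text would address the `None` values, although the `rowspan` cells would still leave the later rows shorter than the header row.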