python - 将HTML表格解析为python中的列表

Question

因此，我从 IMDb 的奖励页面中提取了一些字符串：

<table><tr><td><big>Academy Awards, USA</big>        </td>      </tr>      <tr>        <th>Year</th><th>Result</th><th>Award</th><th>Category/Recipient(s)</th>      </tr>            <tr>        <td rowspan="11" align="center" valign="middle">          1978         </td>                          <td rowspan="7" align="center" valign="middle"><b>Won</b></td>                                                      <td rowspan="6" align="center" valign="middle">Oscar</td>                                      <td valign="top">          Best Art Direction-Set Decoration                                John Barry                                              Norman Reynolds                                              Leslie Dilley                                              Roger Christian                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Costume Design                                John Mollo                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Effects, Visual Effects                                John Stears                                              John Dykstra                                              Richard Edlund                                              Grant McCune                                              Robert Blalack                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Film Editing                                Paul Hirsch                                              Marcia Lucas                                              Richard Chew                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Music, Original Score                                John Williams                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Sound                                Don MacDougall                                              Ray West                                              Bob Minkler                                              Derek Ball                                                      <small>                                                                      Derek Ball was not present at the awards ceremony.          </small>        </td>      </tr>                                                      <tr>                                            <td rowspan="1" align="center" valign="middle">Special Achievement Award</td>                                      <td valign="top">                                          Ben Burtt             (as Benjamin Burtt Jr.)                                          <small>                                                                      For sound effects. (For the creation of the alien, creature and robot voices.)          </small>        </td>      </tr>                                                            <tr>                  <td rowspan="4" align="center" valign="middle"><b>Nominated</b></td>                                                      <td rowspan="4" align="center" valign="middle">Oscar</td>                                      <td valign="top">          Best Actor in a Supporting Role                                Alec Guinness                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Director                                George Lucas                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Picture                                Gary Kurtz                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Writing, Screenplay Written Directly for the Screen                                George Lucas                                                      <small>                                                                                </small>        </td>      </tr>                                                  <tr>        </tr></table>

我想将标题（年份、结果、奖项和类别/收件人）拉到一个列表中，然后将每个列分别拉到它们自己的列表中。例如（使用奥斯卡奖表）（参考网站：http ://www.imdb.com/title/tt0076759/awards ）：

Columns = {"Year", "Result", "Award", "Category/Recipient"}
Years = {"1978", "1978", "1978", "1978", "1978", "1978", "1978"}
Results = {"Oscar", "Oscar", "Oscar", "Oscar", "Oscar", "Oscar", "Special Achievement Award"}
Categories/Recipients = {"Best Art Direction-Set Decoration (John Barry, Norman Reynolds, Leslie Dilley, Roger Christian)", "Best Costume Design (John Mollo)", "Best Effects, Visual Effects (John Stears, John Dykstra, Richard Edlund, Grant McCune, Robert Blalack)", Best Film Editing (Paul Hirsch, Marcia Lucas, Richard Chew)", "Best Music, Original Score (John Williams)", "Best Sound (Don MacDougall, Ray West, Bob Minkler, Derek Ball)", "(Ben Burtt (as Benjamin Burtt Jr.))"}

如您所见，我从表格中删除了不必要的空格并将所有名称放在括号中。所有名称周围都有标签，但我删除了它们（如果有助于将它们放在括号中更容易，它们可以保留）。除了 Columns 列表之外，我在每个列表中也有相同数量的项目。

这是我当前的脚本，所以你知道我是如何操作它的：

import shutil
import urllib2
import re
from lxml import etree

award_usock = urllib2.urlopen('http://www.imdb.com/title/tt0076759' + '/awards')
award_html = award_usock.read()
award_usock.close()
if "<big>" in award_html:
    for a_show in re.finditer('<big>',award_html):
        award_show_full_end = award_html.find('<td colspan="4">&nbsp;</td>',a_show.end())
        award_show_full = award_html[a_show.start():award_show_full_end]
        award_show_full = award_show_full.replace('\n','')
        # award_show_full = award_show_full.replace('  ','')
        award_show_full = award_show_full.replace('</a>','')
        award_show_full = award_show_full.replace('<br />','')
        award_show_full = re.sub('<a href="/name/[^>]*>',  '', award_show_full)
        award_show_full = re.sub('<a href="/title/[^>]*>',  '', award_show_full)
        for a_s_title in re.finditer('<a href="',award_show_full):
            award_title_loc = award_show_full.find('<a href="')
            award_title_end = award_show_full.find('">',award_title_loc+10)
            award_title_del = award_show_full[award_title_loc:award_title_end+2]
            award_show_full = award_show_full.replace(award_title_del,'')
        award_show_full = '<table><tr><td>' + award_show_full.replace('<br>','') + '</tr></table>'
        award_show_loc = award_html.find('>',a_show.end())
        award_show_end = award_html.find('</a></big>',a_show.end())
        award_show = award_html[award_show_loc+1:award_show_end]
        award_show_table = etree.XML(award_show_full)
        award_show_rows = iter(award_show_table)
        award_show_headers = [award_show_col.text for award_show_col in next(award_show_rows)]
        for award_show_row in award_show_rows:
            award_show_values = [award_show_col.text for award_show_col in award_show_row]
            print dict(zip(award_show_headers,award_show_values))

但这会产生一个结果：

{None: 'Year'}
{None: '          1978         '}
{None: '          Best Costume Design                                John Mollo                                                      '}
{None: '          Best Effects, Visual Effects                                John Stears                                              John Dykstra                                              Richard Edlund                                              Grant McCune                                              Robert Blalack                                                      '}
{None: '          Best Film Editing                                Paul Hirsch                                              Marcia Lucas                                              Richard Chew                                                      '}
{None: '          Best Music, Original Score                                John Williams                                                      '}
{None: '          Best Sound                                Don MacDougall                                              Ray West                                              Bob Minkler                                              Derek Ball                                                      '}
{None: 'Special Achievement Award'}
{None: None}
{None: '          Best Director                                George Lucas                                                      '}
{None: '          Best Picture                                Gary Kurtz                                                      '}
{None: '          Best Writing, Screenplay Written Directly for the Screen                                George Lucas                                                      '}
{}

score 3 · Accepted Answer

使用正则表达式解析 HTML不是一个好主意，最好尝试使用解析器，例如Beautiful Soup。

score 1 · Accepted Answer

You don't have to use an external library for parsing an HTML table if you are using python 3. You can use the built-in HTMLParser class from html.parser (it replaces SGMLParser from python 2). I've written code for a simple derived HTMLParser class. It is here in a github repo. It simply does remember the current scope of a <td>, <tr> or <table> tag. The advantages over using etree or BeautifulSoup are that it runs correctly on non-xml-compliant html and that it doesn't use external libraries.

You can use that class (here named HTMLTableParser) the following way:

import urllib.request
from html_table_parser import HTMLTableParser

target = 'http://www.twitter.com'

# get website content
req = urllib.request.Request(url=target)
f = urllib.request.urlopen(req)
xhtml = f.read().decode('utf-8')

# instantiate the parser and feed it
p = HTMLTableParser()
p.feed(xhtml)
print(p.tables)

The output of this is a list of 2D-lists representing tables. It maybe looks like this:

[[['   ', ' Anmelden ']],
 [['Land', 'Code', 'Für Kunden von'],
  ['Vereinigte Staaten', '40404', '(beliebig)'],
  ['Kanada', '21212', '(beliebig)'],
  ...
  ['3424486444', 'Vodafone'],
  ['  Zeige SMS-Kurzwahlen für andere Länder ']]]

python - 将HTML表格解析为python中的列表

2 回答 2

Related

Reference