python - pandas read_html function with colspan=2

Question

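I am using the pandas read_html function to load an HTML table into a DataFrame, but it fails because the source data has a merged header cell with colspan=2, which raises this error: AssertionError: 6 columns passed, passed data had 7 columns.

I have tried various options for the header kwarg (header=None, header=['Code'...]), but nothing seems to work.

Does anyone know of a way to parse an HTML table with merged columns using pandas read_html?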

score 5 · Accepted Answer

If you do not insist on using read_html from pandas, this code does the job:

import pandas as pd
from lxml.html import parse
from urllib.request import urlopen  # Python 3 replacement for urllib2
from pandas.io.parsers import TextParser

def _unpack(row, kind='td'):
    # Return the text of every <td> (or <th>) cell found in a row.
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]

def parse_options_data(table):
    # The first row holds the header cells; the remaining rows hold the data.
    rows = table.findall('.//tr')
    header = _unpack(rows[0], kind='th')
    data = [_unpack(r) for r in rows[1:]]
    return TextParser(data, names=header).get_chunk()

parsed = parse(urlopen('http://www.bmfbovespa.com.br/en-us/intros/Limits-and-Haircuts-for-accepting-stocks-as-collateral.aspx?idioma=en-us'))
doc = parsed.getroot()
tables = doc.findall('.//table')
table = parse_options_data(tables[0])

This is taken from Wes McKinney's book "Python for Data Analysis".
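If the header and data rows still disagree on the number of columns, one option is to repeat each merged cell once per column it spans before building the frame. The following is only a minimal sketch of that idea using the standard library's html.parser; the class name SpanAwareTableParser, the helper expand_colspans, and the sample table (ABCD3, Limit) are made up for illustration and are not part of pandas or the book's code. It ignores rowspan and nested tables.

```python
from html.parser import HTMLParser


class SpanAwareTableParser(HTMLParser):
    """Collect table rows, repeating each cell's text `colspan` times."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._text = None  # text fragments of the open cell, or None
        self._span = 1     # colspan of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            span = dict(attrs).get("colspan") or "1"
            # Fall back to 1 for missing, zero, or non-numeric values.
            self._span = max(int(span), 1) if span.isdigit() else 1
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            cell = "".join(self._text).strip()
            self._row.extend([cell] * self._span)
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._text = None


def expand_colspans(html):
    """Return every row of the HTML as a list of strings, colspans expanded."""
    parser = SpanAwareTableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample: a two-cell header that covers three data columns.
SAMPLE = """
<table>
  <tr><th>Code</th><th colspan="2">Limit</th></tr>
  <tr><td>ABCD3</td><td>10</td><td>20</td></tr>
</table>
"""

rows = expand_colspans(SAMPLE)
header, data = rows[0], rows[1:]
```

With the header expanded this way, header and each data row have the same length, so they can be handed to something like pd.DataFrame(data, columns=header) without the column-count mismatch.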

score 0 · Accepted Answer

pandas >= 0.24.0 understands the colspan and rowspan attributes. According to the release notes:

result = pd.read_html("""
    <table>
      <thead>
        <tr>
          <th>A</th><th>B</th><th>C</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td colspan="2">1</td><td>2</td>
        </tr>
      </tbody>
    </table>""")

result

Out:

[   A  B  C
 0  1  1  2]

Previously this would return the following:

[   A  B   C
 0  1  2 NaN]

I could not test this with your link because the URL could not be found.
