python - How can I use pandas to parse a CSV that has already been loaded from somewhere else?

Question

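I download and scrape a web page to get some data in TSV format. The TSV data is surrounded by HTML that I don't want.

I download the page's HTML and use BeautifulSoup to extract the data I want. At that point, however, the TSV data is already in memory.

How can I use this in-memory TSV data with pandas? Every method I can find seems to want to read from a file or a URI, not from data I have already scraped.

I don't want to download the text, write it to a file, and then read it back in.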

#!/usr/bin/env python2

from pandas import pandas as p
from BeautifulSoup import BeautifulSoup
import urllib2

def main():
    url = "URL"
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html)
    # pre is the tag that the data is within
    tab_sepd_vals = soup.pre.string

    data = p.LOAD_CSV(tab_sepd_vals)
    process(data)

score 3 · Accepted Answer

If you feed the text/string version of your data into a StringIO.StringIO (or io.StringIO in Python 3.X), you can pass that object to the pandas parser. So your code becomes:

#!/usr/bin/env python2

import pandas as p
from BeautifulSoup import BeautifulSoup
import urllib2
import StringIO

def main():
    url = "URL"
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html)
    # pre is the tag that the data is within
    tab_sepd_vals = soup.pre.string

    # make the StringIO object
    tsv = StringIO.StringIO(tab_sepd_vals)

    # something like this
    data = p.read_csv(tsv, sep='\t') 

    # then what you had
    process(data)
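
For Python 3, the same idea might look like the minimal sketch below. The sample string is a hypothetical stand-in for whatever soup.pre.string returns, and the column names are made up for illustration; only io.StringIO and read_csv with sep='\t' come from the answer above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the text scraped out of the <pre> tag
tab_sepd_vals = "name\tscore\nalice\t1\nbob\t2\n"

# Wrap the in-memory string in a file-like object
tsv = io.StringIO(tab_sepd_vals)

# read_csv accepts a file-like object in place of a path or URL
data = pd.read_csv(tsv, sep="\t")
print(data)
```

Because read_csv only needs something it can read from, nothing is ever written to disk.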

score 1 · Accepted Answer

Methods like read_csv do two things: they parse the CSV and they construct a DataFrame object. So in your case you might want to construct the DataFrame directly:

>>> import pandas as pd
>>> df = pd.DataFrame([['a', 1], ['b', 2], ['c', 3]])
>>> print(df)
   0  1
0  a  1
1  b  2
2  c  3

The constructor accepts a variety of data structures.
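
Applied to tab-separated text, that approach might look like the sketch below. It assumes the first line is a header and that no field contains an embedded tab or newline; the sample string and column names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the scraped TSV text
tab_sepd_vals = "name\tscore\nalice\t1\nbob\t2\n"

# Do the parsing step by hand: one list of fields per line
rows = [line.split("\t") for line in tab_sepd_vals.strip("\n").splitlines()]
header, records = rows[0], rows[1:]

# Hand the parsed rows to the DataFrame constructor
df = pd.DataFrame(records, columns=header)

# The constructor keeps every field as a string, so convert numeric columns explicitly
df["score"] = pd.to_numeric(df["score"])
print(df)
```

Note that splitting by hand skips the quoting and type inference that read_csv performs, so this is mainly worthwhile when the data is already parsed into Python objects.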
