python - How to use pandas-profiling to profile a large dataset?

Question

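My data is not perfectly clean, but pandas handles it without trouble. The pandas library provides many very useful functions for EDA.

However, when I run profiling on large data (100 million records with 10 columns) read from a database table, it never completes and my laptop runs out of memory. The data is about 6 GB as CSV, my RAM is 14 GB, and my idle usage is about 3 - 4 GB.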

import pandas as pd
import pandas_profiling

df = pd.read_sql_query("select * from table", conn_params)
profile = pandas_profiling.ProfileReport(df)
profile.to_file(outputfile="myoutput.html")

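I have also tried the check_recoded = False option, but it does not help the profiling run to completion. Is there a way to read the data in chunks and generate one overall summary report at the end? Or is there any other way to use this function on a large dataset?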

score 9 · Accepted Answer

v2.4 introduced a minimal mode that disables expensive computations (such as correlations and dynamic binning):

from pandas_profiling import ProfileReport


profile = ProfileReport(df, minimal=True)
profile.to_file(output_file="output.html")

score 3 · Accepted Answer

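The syntax for disabling the correlation calculations (which greatly reduces the amount of computation) changed considerably between pandas-profiling=1.4 and the current (beta) version pandas-profiling=2.0. It now looks like this: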

profile = df.profile_report(correlations={
    "pearson": False,
    "spearman": False,
    "kendall": False,
    "phi_k": False,
    "cramers": False,
    "recoded": False,
})

In addition, you can reduce the computation performed by disabling the bin calculation used for plotting histograms.

profile = df.profile_report(plot={'histogram': {'bins': None}})

score 1 · Accepted Answer

Did you try the option below? Running correlation analysis on large free-text fields with pandas-profiling can cause this issue.

df = pd.read_sql_query("select * from table", conn_params)
profile = pandas_profiling.ProfileReport(df, check_correlation=False)

Please refer to the GitHub link below for more details: https://github.com/pandas-profiling/pandas-profiling/issues/84

score 0 · Accepted Answer

Another option is to reduce the data.

One way to do that is with sample:

df.sample(number)

See the pandas documentation for more details.
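Since df.sample needs the whole DataFrame in memory first, one possible variation is to sample while reading. The sketch below is only an illustration under stated assumptions: it uses the chunksize parameter of pd.read_sql_query to stream the result and samples a fraction of each chunk. The helper name sample_table_in_chunks, the in-memory SQLite table "demo", and the frac and chunksize values are all made up for the example. Depending on the database driver, the full result set may still be buffered on the client, so check that memory actually stays low with your connection.

```python
import sqlite3

import pandas as pd


def sample_table_in_chunks(conn, query, frac=0.01, chunksize=100_000, seed=0):
    """Return a random sample of a query result, reading it chunk by chunk.

    Hypothetical helper for illustration; not part of pandas or pandas-profiling.
    """
    parts = []
    # With chunksize set, read_sql_query returns an iterator of DataFrames
    # holding at most `chunksize` rows each.
    for i, chunk in enumerate(pd.read_sql_query(query, conn, chunksize=chunksize)):
        # Vary the seed per chunk so every chunk does not pick the same positions.
        parts.append(chunk.sample(frac=frac, random_state=seed + i))
    return pd.concat(parts, ignore_index=True)


# Demo data: an in-memory SQLite table standing in for the real database.
conn = sqlite3.connect(":memory:")
demo = pd.DataFrame({"id": range(10_000), "value": [i % 7 for i in range(10_000)]})
demo.to_sql("demo", conn, index=False)

sampled = sample_table_in_chunks(conn, "select * from demo", frac=0.1, chunksize=1_000)
print(len(sampled), "rows sampled out of", len(demo))
```

The resulting sampled DataFrame can then be passed to ProfileReport in place of the full table. The report describes the sample, not the full data, so rare values and exact counts will differ.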

score -2 · Accepted Answer

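The implementation of issue #43 added the ability to disable the correlation check, but it is not part of the latest pandas-profiling release available on PyPI (1.4). It was implemented afterwards and will, I suppose, be available in the next release. In the meantime, if you really need it, you can download the current version from GitHub and use it, for example by adding it to your PYTHONPATH.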

#!/bin/sh

PROF_DIR="$HOME/Git/pandas-profiling/"

export PYTHONPATH="$PYTHONPATH:$PROF_DIR"

jupyter notebook
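If the notebook server is already running, a similar effect can probably be had from inside Python by extending sys.path before the import. This is a minimal sketch that assumes the repository was cloned to the same ~/Git/pandas-profiling/ location used in the script. Adjust the path to your own checkout.

```python
import os
import sys

# Assumed clone location, mirroring PROF_DIR in the shell script.
PROF_DIR = os.path.expanduser("~/Git/pandas-profiling/")

# Put the checkout at the front of the search path so it takes precedence
# over any pandas-profiling release installed in site-packages.
if PROF_DIR not in sys.path:
    sys.path.insert(0, PROF_DIR)

# After this, `import pandas_profiling` should resolve to the checkout,
# provided the module has not already been imported in this session.
```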
