python - Changing Pandas code to cuDF for better GPU utilization

Question

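I am building image pairs by mixing positive and negative pairs. The process is computationally expensive and needs a lot of RAM and CPU. To speed it up, I want to use the GPU and change the pandas code to cuDF. Right now the cuDF documentation is very limited, and I would like to convert the code below to cuDF.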

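import itertools
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import cpu_count
from pathlib import Path

import more_itertools
import pandas as pd
from tqdm import tqdm

# `identities` (dict: identity -> list of image files) is assumed to be defined earlier
positives = pd.DataFrame()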
for value in tqdm(identities.values(), desc="Positives"):
    positives = positives.append(pd.DataFrame(itertools.combinations(value, 2), columns=["file_x", "file_y"]),
                                 ignore_index=True)
positives["decision"] = "Yes"
print(positives)
samples_list = list(identities.values())
negatives = pd.DataFrame()
######################====================Functions=============##############

def compute_cross_samples(x):
    return pd.DataFrame(itertools.product(*x), columns=["file_x", "file_y"])

####################################
if __name__ == "__main__":
    if Path("positives_negatives.csv").exists():
        df = pd.read_csv("positives_negatives.csv")
    else:
        with ProcessPoolExecutor() as pool:
            # take cpu_count combinations from identities.values
            for combos in tqdm(more_itertools.ichunked(itertools.combinations(identities.values(), 2), cpu_count())):
                # for each combination iterator that comes out, calculate the cross
                for cross_samples in pool.map(compute_cross_samples, combos):
                    # for each product iterator "cross_samples", iterate over its values and append them to negatives
                    negatives = negatives.append(cross_samples)

        negatives["decision"] = "No"

        # keep as many negatives as positives, then combine and cache the result
        negatives = negatives.sample(positives.shape[0])
        df = pd.concat([positives, negatives]).reset_index(drop=True)
        df.to_csv("positives_negatives.csv", index=False)

score 1 · Accepted Answer

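For your code, there are two things you need to consider:

Because the APIs are similar, start by importing cudf. Then, wherever you use pd (your pandas import alias), replace it with cudf. That is only a start, so take a look at this guide, which will help you with the basics of the transition. On the coding side, start with the cuDF and Dask cuDF tutorial notebooks, especially this one.
In addition to removing the CPU multiprocessing code, you will want to refactor your functions so that they do not need for loops. cuDF and the other RAPIDS libraries do a lot of work under the hood to parallelize code for the GPU. Adding for loops serializes the process and slows you down.
Finally, please read our official documentation here, which will help with your CPU -> GPU refactor: https://docs.rapids.ai/api/cudf/stable/api.html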
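
To make the two points above concrete, here is a minimal sketch of a loop-free version of the pair generation. It is not code from the answer. It assumes that joins can stand in for itertools.combinations and itertools.product: a self-join on the identity gives the positive pairs, and a join on a constant key gives the cross product for the negative pairs. The sample identities dict, the xdf alias, the helper columns row and key, and the pandas fallback are all illustrative assumptions.

```python
# Assumption: fall back to pandas when cuDF is missing or no GPU is usable,
# so the same code runs on either backend.
try:
    import cudf as xdf
except Exception:
    import pandas as xdf

# Hypothetical stand-in for the question's `identities` dict.
identities = {
    "alice": ["a1.jpg", "a2.jpg", "a3.jpg"],
    "bob": ["b1.jpg", "b2.jpg"],
    "carol": ["c1.jpg"],
}

# Flatten the dict into one long table: one row per image file.
rows = [(idx, path)
        for idx, paths in enumerate(identities.values())
        for path in paths]
files = xdf.DataFrame({
    "identity": [idx for idx, _ in rows],
    "file": [path for _, path in rows],
    "row": list(range(len(rows))),  # helper column used to drop mirrored pairs
})

# Positives: self-join on identity, keep each unordered pair once.
same = files.merge(files, on="identity", suffixes=("_x", "_y"))
positives = same[same["row_x"] < same["row_y"]][["file_x", "file_y"]]
positives = positives.reset_index(drop=True)
positives["decision"] = "Yes"

# Negatives: cross product through a constant join key, then keep pairs
# whose identities differ (identity_x < identity_y counts each pair once).
keyed = files.assign(key=0)
cross = keyed.merge(keyed, on="key", suffixes=("_x", "_y"))
negatives = cross[cross["identity_x"] < cross["identity_y"]][["file_x", "file_y"]]
negatives = negatives.reset_index(drop=True)
negatives["decision"] = "No"

# Balance the classes and combine, as in the question.
sampled = negatives.sample(n=len(positives), random_state=0)
df = xdf.concat([positives, sampled], ignore_index=True)
print(df)
```

The cross join materializes every file pair before filtering, so on a large dataset it may not fit in GPU memory. In that case the join would likely need to be run over chunks of identities, which this sketch does not attempt.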
