I have a workflow that benefits greatly from GPU acceleration, but each task has relatively low memory requirements (2-4 GB). I am using a combination of dask.dataframe, dask.distributed.Client, and dask_cuda.LocalCUDACluster. The process would benefit greatly from more CUDA workers, so I would like to split a physical GPU (Nvidia RTX A600, V100, A100) into multiple virtual/logical GPUs to increase the number of workers in my dask_cuda LocalCUDACluster. My initial thought was to try passing logical GPUs created with TensorFlow to the LocalCUDACluster, but I cannot seem to pass them to the cluster.
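For context, the TensorFlow splitting I tried looks roughly like the sketch below. The memory_limit values are placeholders for illustration, not the exact numbers from my setup; this is a device-configuration fragment and only takes effect on a machine with a visible GPU.

```python
import tensorflow as tf

# Split the first physical GPU into two logical GPUs (~2 GB each).
# memory_limit values here are placeholders, not tuned settings.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=2048),
         tf.config.LogicalDeviceConfiguration(memory_limit=2048)],
    )
    logical_gpus = tf.config.list_logical_devices('GPU')
```

These logical devices are what I was hoping to hand to LocalCUDACluster, but the cluster does not accept them.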
I am working in a docker environment, and I would like to keep this splitting within Python. Ideally this workflow would scale from a local workstation to multi-node MPI jobs, but I am not sure whether that is feasible, and I am open to any suggestions.
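For the multi-node case, my understanding (unconfirmed on my end) is that dask-mpi can launch a scheduler and workers under mpirun; a sketch of the kind of launch I have in mind, with a placeholder scheduler-file path:

```shell
# Launch a dask scheduler (rank 0) and workers (remaining ranks) under MPI;
# client scripts then connect via the shared scheduler file.
mpirun -np 4 dask-mpi --scheduler-file /tmp/scheduler.json
```

I do not know whether a per-worker GPU split can survive this kind of launch, which is part of what I am asking.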
Adding a minimal example:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from dask_cuda.initialize import initialize
import pandas as pd
import dask.dataframe as dd
import time
# fake function
def my_gpu_sim(x):
    """
    GPU simulation which is independent of any others (in the real-world
    case it calls a C++ program, which saves a file).
    """
    ...
    return None
# fake data creation
dic = {'random':['apple' for i in range(40)], 'main':[i for i in range(40)]}
df = pd.DataFrame.from_dict(dic)
ddf = dd.from_pandas(df, npartitions=4)
# Configurations
protocol = "ucx"
enable_tcp_over_ucx = True
enable_nvlink = True
enable_infiniband = False
initialize(
    create_cuda_context=True,
    enable_tcp_over_ucx=enable_tcp_over_ucx,
    enable_infiniband=enable_infiniband,
    enable_nvlink=enable_nvlink,
)
cluster = LocalCUDACluster(
    local_directory="/tmp/USERNAME",
    protocol=protocol,
    enable_tcp_over_ucx=enable_tcp_over_ucx,
    enable_infiniband=enable_infiniband,
    enable_nvlink=enable_nvlink,
    rmm_pool_size="35GB",
)
client = Client(cluster)
# Simulation
ddf.map_partitions(lambda df: df.apply(lambda x: my_gpu_sim(x.main), axis=1)).compute(scheduler=client)
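For debugging the dataframe plumbing without a GPU, I use a pandas-only version of the same per-row mapping. my_cpu_sim below is a hypothetical stand-in for my_gpu_sim, not part of the real workflow:

```python
import pandas as pd

def my_cpu_sim(x):
    # Hypothetical CPU stand-in for my_gpu_sim: returns its input
    # instead of launching a simulation, so results can be checked.
    return x

# Same fake data as above: 40 rows.
dic = {'random': ['apple' for i in range(40)], 'main': [i for i in range(40)]}
df = pd.DataFrame.from_dict(dic)

# Same shape of call as the map_partitions lambda above, on plain pandas.
result = df.apply(lambda row: my_cpu_sim(row.main), axis=1)
print(result.tolist() == list(range(40)))  # → True
```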