I'm trying to build a Keras Tokenizer from a single column spread across hundreds of large CSV files. Dask seems like a good tool for this. My current approach eventually runs into memory problems:
import dask.dataframe as dd
from tensorflow.keras.preprocessing.text import Tokenizer

df = dd.read_csv('data/*.csv', usecols=['MyCol'])
# Process the column and get the underlying NumPy array.
# This greatly reduces memory consumption, but it still eventually
# materializes the entire dataset in memory.
my_ids = df.MyCol.apply(process_my_col).compute().values
tokenizer = Tokenizer()
tokenizer.fit_on_texts(my_ids)
How can I do this in a streaming fashion instead? Something like the following:
import pandas as pd

df = pd.read_csv('a-single-file.csv', chunksize=1000)
for chunk in df:
    # Process one chunk at a time
    ...
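For reference, here is a rough sketch of the kind of per-partition loop I imagine on the Dask side. It assumes that Tokenizer.fit_on_texts can be called repeatedly to accumulate counts across chunks, and it reuses the process_my_col function from above:

import dask.dataframe as dd
from tensorflow.keras.preprocessing.text import Tokenizer

df = dd.read_csv('data/*.csv', usecols=['MyCol'])
tokenizer = Tokenizer()

# Walk the partitions one at a time so only a single partition
# is materialized in memory at any moment.
for part in df.to_delayed():
    chunk = part.compute()                   # pandas DataFrame for this partition
    ids = chunk.MyCol.apply(process_my_col)  # same per-column processing as above
    tokenizer.fit_on_texts(ids)              # accumulate counts chunk by chunk

Is something along these lines the right way to do it, or is there a more idiomatic Dask pattern for this?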