
I have a dataset with 45 million rows. I have three GPUs with 6 GB of RAM each, and I am trying to train a language model on the data.

For that, I am trying to load the data as a fastai TextLMDataBunch, but this step always fails because of a memory error.

data_lm = TextLMDataBunch.from_df('./', train_df=df_trn,
                                  valid_df=df_val, bs=10)

How do I handle this issue?


1 Answer


When you use this function, your DataFrame is loaded into memory. Since your DataFrame is very big, this causes the memory error. Fastai handles tokenization with a chunksize, so you should still be able to tokenize your text.
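
As a quick sanity check (assuming df_trn and df_val are ordinary pandas DataFrames, which is an assumption on my part), you can measure how much memory they already occupy before any tokenization starts:

import pandas as pd

# Rough in-memory size of the DataFrames, including string (object) columns
print(df_trn.memory_usage(deep=True).sum() / 1e9, "GB")
print(df_val.memory_usage(deep=True).sum() / 1e9, "GB")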

Here are two things you should try:

  • Add a chunksize argument (the default value is 10k) to your TextLMDataBunch.from_df call, so that the tokenization process needs less memory (see the sketch after this list).

  • If this is not enough, I would suggest not loading your whole DataFrame into memory. Unfortunately, even if you use TextLMDataBunch.from_folder, it just loads the full DataFrame and passes it to TextLMDataBunch.from_df, so you might have to create your own DataBunch constructor. Feel free to comment if you need help with that.
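
For the first point, a minimal sketch of what that call could look like, assuming you simply lower chunksize from its 10k default (the exact value is something to tune for your machine):

data_lm = TextLMDataBunch.from_df('./', train_df=df_trn,
                                  valid_df=df_val, bs=10,
                                  chunksize=2000)  # tokenize 2k rows at a time instead of 10k

Keep in mind that lowering chunksize only reduces peak memory during tokenization; the raw DataFrame itself still has to fit in RAM, which is why the second point may still be necessary.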

answered 2019-03-04T10:06:32.377