I have been learning PyTorch for a few weeks. While practicing with the CIFAR-10 dataset from PyTorch datasets, I also wanted to practice with the ImageFolder class, so I found a version of CIFAR-10 on Kaggle in which the images are organized into folders. (If I remember correctly, the PyTorch dataset ships as a tar.gz archive, not a folder structure.)
To my surprise, despite using the same loss function, learning rate, and architecture, the test accuracy on the Kaggle dataset starts at 0.18 in epoch 1, while on the PyTorch dataset it starts at 0.56.
After 20 epochs, the former saturates at around 0.45, while the latter settles at around 0.86.
I have checked again and again, but I cannot find any major difference between the two pieces of code. I would really like to know whether I am doing something wrong, or whether the two datasets are fundamentally different.
To clarify, I am using this PyTorch dataset and this Kaggle dataset. The code is too long to include here, so I am sharing my notebooks instead; you are welcome to read the full code and run it if needed [you only need your Kaggle API key to download the dataset from Kaggle; I cannot make my dataset public... sorry for the inconvenience]: Kaggle Dataset Notebook here and PyTorch Dataset Notebook here.
I have also included the code blocks that I think differ the most.
Kaggle dataset:
Epoch 1 score = 0.18, Epoch 20 score = 0.45
    import os
    import shutil
    import numpy as np
    from tqdm import tqdm
    from torchvision import datasets
    from torch.utils.data import DataLoader

    def createVal(train_list, root_folder, classes, valid_split):
        # Create the val/ directory tree, skipping folders that already exist
        try:
            os.mkdir(os.path.join(root_folder, 'val'))
        except FileExistsError:
            pass
        for cls in classes:
            try:
                os.mkdir(os.path.join(root_folder, 'val', cls))
            except FileExistsError:
                pass
        # Move a random valid_split fraction of the training images into val/
        np.random.shuffle(train_list)
        valid_len = len(train_list) * valid_split
        for i in tqdm(range(int(valid_len))):
            shutil.move(train_list[i], train_list[i].replace('/train/', '/val/'))

    valid_split = 0.2
    batch_size = 32
    num_workers = 4

    root_folder = "/content/cifar10/cifar10"
    train_folder = os.path.join(root_folder, "train")
    test_folder = os.path.join(root_folder, "test")

    if valid_split:
        createVal(train_list, root_folder, classes, valid_split=valid_split)
        val_folder = os.path.join(root_folder, "val")
        val_data = datasets.ImageFolder(val_folder, transform=transform)
        val_loader = DataLoader(val_data, batch_size=batch_size, num_workers=num_workers)

    train_data = datasets.ImageFolder(train_folder, transform=transform)
    train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size, num_workers=num_workers)
    test_data = datasets.ImageFolder(test_folder, transform=transform)
    test_loader = DataLoader(test_data, batch_size=batch_size, num_workers=num_workers)
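One thing worth noting about this pipeline: ImageFolder derives integer labels from the sorted sub-folder names, so the label order is fixed by the folder names rather than by any canonical CIFAR-10 order. A minimal sketch of that behavior (the folder names below are illustrative, and the function is my own stand-in for what torchvision's ImageFolder does internally):

```python
import os
import tempfile

def folder_class_to_idx(root):
    # Mirrors ImageFolder's labeling: sort sub-folder names,
    # then map them to consecutive integer labels
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return {cls: i for i, cls in enumerate(classes)}

# Demo with a throwaway directory tree using a few CIFAR-10 class names
with tempfile.TemporaryDirectory() as root:
    for cls in ["dog", "airplane", "cat"]:
        os.mkdir(os.path.join(root, cls))
    mapping = folder_class_to_idx(root)
    print(mapping)  # {'airplane': 0, 'cat': 1, 'dog': 2}
```

The real mapping used at training time is available as `train_data.class_to_idx`, which is worth printing when comparing two pipelines.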
PyTorch dataset:
Epoch 1 score = 0.56, Epoch 20 score = 0.86
    import numpy as np
    from torch.utils.data import DataLoader, SubsetRandomSampler

    valid_split = 0.2
    batch_size = 32
    num_workers = 4

    if valid_split:
        # Shuffle the indices once, then split them into train/val subsets
        num_train = len(train_data)
        idx = list(range(num_train))
        np.random.shuffle(idx)
        train_idx = idx[int(valid_split * num_train):]
        val_idx = idx[:int(valid_split * num_train)]
        train_sampler = SubsetRandomSampler(train_idx)
        val_sampler = SubsetRandomSampler(val_idx)
        train_loader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size, num_workers=num_workers)
        val_loader = DataLoader(train_data, sampler=val_sampler, batch_size=batch_size, num_workers=num_workers)
    else:
        train_loader = DataLoader(train_data, batch_size=batch_size, num_workers=num_workers)

    test_loader = DataLoader(test_data, batch_size=batch_size, num_workers=num_workers)
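When two pipelines use the same model but diverge this much, one quick sanity check is to compare per-channel statistics of batches drawn from each loader: if the Kaggle images were re-encoded or are normalized differently, the means and standard deviations will not match. A small sketch of that check, using synthetic arrays in place of real batches (the helper and shapes are my own assumptions, not part of the original code):

```python
import numpy as np

def channel_stats(batch):
    # Reduce over the batch and spatial dims, keeping the channel dim,
    # for a batch of shape (N, C, H, W)
    return batch.mean(axis=(0, 2, 3)), batch.std(axis=(0, 2, 3))

# Demo with synthetic data standing in for batches from the two loaders
rng = np.random.default_rng(0)
batch_a = rng.normal(0.0, 1.0, size=(32, 3, 32, 32))  # e.g. normalized pipeline
batch_b = rng.normal(0.5, 1.0, size=(32, 3, 32, 32))  # e.g. shifted / unnormalized
mean_a, _ = channel_stats(batch_a)
mean_b, _ = channel_stats(batch_b)
print(mean_b - mean_a)  # a large per-channel gap hints the pipelines preprocess differently
```

On real data, `next(iter(train_loader))[0].numpy()` from each notebook would take the place of the synthetic arrays.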