python - Custom Dataset, DataLoader, Sampler, or something else?

Question

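I'm working on a project that requires training a PyTorch neural network on a very large image dataset. Some of the images are completely irrelevant to the problem, but they are not labeled as such. However, I can identify the irrelevant ones by computing a metric (for example, summing all pixel values gives me a good idea of which images are relevant and which are not). Ideally, I would have a DataLoader that takes a Dataset class and creates batches from relevant images only. The Dataset class would only know the list of images and their labels, and the DataLoader would decide whether each image it is batching is relevant, then build batches using only the relevant ones.

To put this in an example, suppose I have a dataset of black and white images. The white images are irrelevant, but they are not labeled as such. I want to be able to load batches from a file location and have those batches contain only the black images. I could filter at some point by summing all the pixels and checking that the sum equals 0.

What I'm wondering is whether a custom Dataset, DataLoader, or Sampler could solve this task for me. I've already written a custom Dataset that stores the directory where the images are saved and a list of all the images in that directory, and that returns an image with its label in the __getitem__ function. Should I add something there to filter out certain images? Or should I apply the filter in a custom DataLoader or Sampler?

Thanks!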

score 0 · Accepted Answer

I'm assuming that the images in your dataset belong to two classes (0 or 1) but are unlabeled. As @PranayModukuru mentioned, you can determine the similarity using some measure (e.g. aggregating all the pixel intensity values of an image, as you mentioned) in the __getitem__ function of your custom Dataset class.

However, determining the similarity in the __getitem__ function while training your model will make the training process very slow. So I would recommend approximating the similarity before you start training (not in the __getitem__ function). Moreover, if your dataset consists of complex images (not just black and white images), it's better to use a pretrained deep learning model (e.g. a ResNet or an autoencoder) for dimensionality reduction, followed by a clustering approach (e.g. agglomerative clustering) to label your images.

With the second approach you only need to label your images once, and if you apply augmentation to the images during training you don't need to re-determine the similarity (label) in the __getitem__ function. With the first approach, on the other hand, you need to determine the similarity (label) every time (after applying transformations to the images) in the __getitem__ function, which is redundant and time-consuming.
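To make the "label once, before training" idea concrete, here is a minimal sketch. The names find_relevant_indices and is_black are hypothetical helpers made up for this illustration, and the rule "pixel sum equals 0" is only the toy criterion from the question; a real dataset would likely need a threshold or the clustering step described above.

```python
def find_relevant_indices(dataset, is_relevant):
    """Visit every sample once and return the indices that pass is_relevant.

    `dataset` is anything with __len__ and __getitem__ (for example a
    map-style PyTorch Dataset created without random augmentation).
    `is_relevant` is a hypothetical callable: one sample in, True/False out.
    """
    return [i for i in range(len(dataset)) if is_relevant(dataset[i])]


def is_black(sample):
    """Toy rule from the question: keep images whose pixel values sum to 0."""
    image, _label = sample
    # Tensors and NumPy arrays expose .sum(); nested Python lists are handled
    # as a fallback so this sketch runs without extra dependencies.
    total = image.sum() if hasattr(image, "sum") else sum(map(sum, image))
    return float(total) == 0.0


# Tiny stand-in for a real dataset: (2x2 image, label) pairs.
toy_dataset = [
    ([[0, 0], [0, 0]], 0),          # black, kept
    ([[255, 255], [255, 255]], 1),  # white, dropped
    ([[0, 0], [0, 0]], 1),          # black, kept
]
relevant = find_relevant_indices(toy_dataset, is_black)
```

The scan runs once, outside the training loop. With a real PyTorch dataset, the resulting index list could then be wrapped as torch.utils.data.Subset(dataset, relevant) and handed to a DataLoader, so __getitem__ never has to repeat the check.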

Hope this will help.

score 0 · Accepted Answer

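It sounds like your goal is to remove the irrelevant images from training entirely.

The best way to handle this is to figure out the file names of all the relevant images up front and save them to a csv or other file. Then pass only the good file names to your dataset.

The reason is that you will run through the dataset many times during training. That means you would load, analyze, and discard the irrelevant images over and over, which is a waste of compute.

It's best to do this kind of preprocessing/filtering up front.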
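A rough sketch of that up-front filtering step might look like the following. Everything named here (write_manifest, read_manifest, pixel_sum, the one-column "path" CSV layout) is an assumption made for illustration, not something from the question. pixel_sum additionally assumes Pillow is installed and that the images can be opened with it.

```python
import csv


def pixel_sum(path):
    """Sum of all grayscale pixel values in one image file (assumes Pillow)."""
    from PIL import Image, ImageStat  # imported lazily; only needed here

    with Image.open(path) as img:
        # Stat.sum holds one total per band; mode "L" has a single band.
        return sum(ImageStat.Stat(img.convert("L")).sum)


def write_manifest(image_paths, manifest_path, is_relevant):
    """Write the paths that pass is_relevant to a CSV; return how many were kept."""
    kept = 0
    with open(manifest_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path"])  # header row
        for path in image_paths:
            if is_relevant(path):
                writer.writerow([str(path)])
                kept += 1
    return kept


def read_manifest(manifest_path):
    """Read back the list of relevant image paths written by write_manifest."""
    with open(manifest_path, newline="") as f:
        return [row["path"] for row in csv.DictReader(f)]
```

For the black/white example in the question, this could be run once as write_manifest(all_paths, "relevant.csv", lambda p: pixel_sum(p) == 0), where all_paths is the full file list. The custom Dataset would then be built from read_manifest("relevant.csv") instead of the full directory listing, so no DataLoader or Sampler changes are needed.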
