I'm assuming that your image dataset belongs to two classes (0 or 1) but it's unlabeled. As @PranayModukuru mentioned that you can determine the similarity by using some measure (e.g aggregating all the pixels intensity values of a image, as you mentioned) in the getitem function in tour custom Dataset class.
However, determining the similarity in getitem function while training your model will make the training process very slow. So, i would recommend you to approximate the similarity before start training (not in the getitem function). Moreover if your image dataset is comprised of complex images (not black and white images) it's better to use a pretrained deep learning model (e.g. resnet or autoencoder) for dimentionality reduction followed by applying clustering approach (e.g. agglomerative clustering) to label your image.
In the second approach you only need to label your images for exactly one time and if you apply augmentation on images while training you don't need to re-determine the similarity (label) in the getitem funcion. On the other hand, in the first approach you need to determine the similarity (label) every time (after applying transformation on images) in the getitem function which is redundant, unnecessary and time consuming.
Hope this will help.