I am working on implementing YOLO v2 and v3 for object detection on a custom dataset. By default, YOLO v2 uses 5 anchor boxes and YOLO v3 uses 9, but my images typically contain 50-100 objects each. My sense is that if there are only 5 or so anchor boxes, then there can be at most that many detections per image, right? So I am trying to understand whether I need to adjust the number of anchor boxes for my dataset.
My question is: does the number of anchor boxes need to be larger than the maximum number of bounding boxes in any training image? That way, I would never run into an object that has no corresponding anchor box. Is that the right way to think about adapting YOLO?
If my intuition is correct, would I then need to run k-means to cluster the ground-truth bounding boxes in the training images and use the cluster centers to set the anchor box coordinates? After that, I would use the usual regression method as specified in this blog post.
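For reference, here is a minimal sketch of the clustering step I have in mind. It assumes the ground-truth boxes have already been reduced to (width, height) pairs normalized by image size, and it uses 1 - IoU as the distance, as the YOLO v2 paper describes. The function names `iou_wh` and `kmeans_anchors` and the random stand-in data are my own placeholders, not taken from any YOLO codebase.

```python
import numpy as np


def iou_wh(boxes, centroids):
    """IoU between (N, 2) boxes and (K, 2) centroids given as (width, height).

    Boxes are treated as if they share the same top-left corner, so only
    their shapes matter, not their positions in the image.
    """
    inter_w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = inter_w * inter_h
    area_boxes = boxes[:, 0] * boxes[:, 1]
    area_centroids = centroids[:, 0] * centroids[:, 1]
    return inter / (area_boxes[:, None] + area_centroids[None, :] - inter)


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (width, height) pairs into k anchors using 1 - IoU as distance."""
    rng = np.random.default_rng(seed)
    # Start from k distinct boxes picked at random.
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Highest IoU is the same as smallest (1 - IoU) distance.
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        updated = centroids.copy()
        for j in range(k):
            members = boxes[assign == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # Return anchors sorted from smallest to largest area.
    return centroids[np.argsort(centroids[:, 0] * centroids[:, 1])]


# Random stand-in data; real input would be the (w, h) of every labeled box.
example_boxes = np.random.default_rng(1).uniform(0.05, 0.9, size=(200, 2))
example_anchors = kmeans_anchors(example_boxes, k=5)
print(example_anchors)
```

Here `k` is the number of anchor boxes, so this is the value I would be changing if the number of anchors does need to match my dataset.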
Thanks for any help that anyone can provide.