Localizing objects in an unsupervised manner poses significant challenges due to the absence of key visual
information such as the appearance, type and number of objects, as well as the lack of labeled object classes
typically available in supervised settings. While recent approaches to unsupervised object localization have
demonstrated significant progress by leveraging self-supervised visual representations, they often rely on
computationally intensive training, placing high demands on compute, learnable parameters, and data.
They also lack explicit modeling of visual context, potentially limiting their localization accuracy.
To tackle these challenges, we propose PEEKABOO, a single-stage learning framework for unsupervised object
localization that learns context-based representations of localized objects at both the pixel and shape level
through image masking. The key idea is to selectively hide parts of an image and leverage the remaining
image information to infer the location of objects without explicit supervision.
Quantitative and qualitative results across various benchmark datasets demonstrate the simplicity,
effectiveness and competitive performance of our approach compared to state-of-the-art methods on both single
object discovery and unsupervised salient object detection tasks.
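To illustrate the masking idea described above, the following is a minimal, hypothetical sketch of hiding random image patches so that a model must infer object locations from the remaining visible context. The function name, patch size, mask ratio, and the consistency-loss usage shown in the comments are illustrative assumptions, not the authors' actual implementation.

```python
import torch

def mask_random_patches(images, patch_size=16, mask_ratio=0.5):
    """Zero out a random subset of non-overlapping patches in a batch of images.

    Assumes image height and width are divisible by patch_size.
    """
    b, c, h, w = images.shape
    gh, gw = h // patch_size, w // patch_size
    # One Bernoulli keep/hide decision per patch, shared across channels.
    keep = (torch.rand(b, 1, gh, gw, device=images.device) > mask_ratio).float()
    # Upsample the per-patch decisions to pixel resolution.
    keep = keep.repeat_interleave(patch_size, dim=2).repeat_interleave(patch_size, dim=3)
    return images * keep

# Hypothetical usage: predictions on the masked view are encouraged to agree with
# predictions on the full view, so localization must rely on surrounding context.
# masked = mask_random_patches(batch)              # hidden-patch view
# pred_full, pred_masked = model(batch), model(masked)
# loss = consistency_loss(pred_full, pred_masked)  # illustrative training signal
```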