Localized Embeddings (LocEm): A multi-task model for localization, classification, and retrieval of multiple objects in images
Shravan Kale
Committee: Boyana Norris (chair)
Directed Research Project (July 2021)
Keywords: Computer Vision, Deep Learning, Embedding Generation, Fine-Grained Classification

Embeddings are generated by many Deep Learning models to represent individual objects or entire images, and are then used for tasks such as object retrieval by matching against a database of other object embeddings. We propose LocEm, a single-pass model that generates embeddings for multiple objects in an image, each tied to the object's location. We also repurpose the ImageNet video dataset, whose frames provide natural augmentation in the form of pose and motion variation of objects, to build a triplet generator. LocEm is constructed by extending an existing object detection model to predict embeddings and by training with an extended triplet loss function that additionally encodes embeddings generated for non-object (background) regions of an image. We evaluate our model against other models in the literature, each of which performs at most two of the three tasks our model performs, i.e., object detection, classification, and embedding generation. All models are evaluated on fine-grained categorization via object retrieval. We keep the base architecture and dataset constant across the models to show that the strength of our model is derived from the final LocEm layers and our training loss function. We also analyze the performance of our model across the variety of object instances and discuss methods for further improvement.
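To make the training objective above concrete, the following is a minimal PyTorch sketch of a triplet loss extended with a background term. The function name, margins, and the specific way background embeddings are penalized are assumptions for illustration, not the report's exact formulation.

```python
# Hypothetical sketch of an extended triplet loss with a background term.
# Names, margins, and the background formulation are assumptions, not the
# report's exact definition.
import torch
import torch.nn.functional as F


def extended_triplet_loss(anchor, positive, negative, background,
                          margin=0.2, bg_margin=0.2):
    """Triplet loss over object embeddings plus a hinge term that keeps
    background (non-object) embeddings away from anchor embeddings.

    anchor, positive, negative: (N, D) L2-normalized object embeddings.
    background: (M, D) L2-normalized embeddings from non-object regions.
    """
    # Standard triplet term: pull anchor-positive pairs together,
    # push anchor-negative pairs apart by at least `margin`.
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    triplet = F.relu(d_ap - d_an + margin).mean()

    # Background term (assumed form): every background embedding should stay
    # at least `bg_margin` away from every anchor embedding.
    d_bg = torch.cdist(anchor, background)  # (N, M) pairwise distances
    bg = F.relu(bg_margin - d_bg).mean()

    return triplet + bg


if __name__ == "__main__":
    # Toy usage with random embeddings.
    n, m, d = 8, 16, 128
    a = F.normalize(torch.randn(n, d), dim=1)
    p = F.normalize(torch.randn(n, d), dim=1)
    neg = F.normalize(torch.randn(n, d), dim=1)
    bg = F.normalize(torch.randn(m, d), dim=1)
    print(extended_triplet_loss(a, p, neg, bg).item())
```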