
Dec 15, 2024

Vision Embedding Comparison for Image Similarity Search: EfficientNet vs. ViT vs. DINO-v2 vs. CLIP vs. BLIP-2


Author(s): Yuki Shizuya. Originally published on Towards AI.

Photo by gilber franco on Unsplash

Recently, I needed to research image similarity search, and I wondered whether embeddings differ depending on each model's architecture and training method. However, few blogs compare embeddings across several models. So, in this blog, I will compare the vision embeddings of EfficientNet [1], ViT [2], DINO-v2 [3], CLIP [4], and BLIP-2 [5] for image similarity search using the Flickr dataset [6]. I will mainly use the Hugging Face and Faiss libraries for the implementation. First, I will briefly introduce each deep learning model. Then, I will walk through the code and present the comparison results.
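As a preview of the pipeline, here is a minimal sketch of how such a comparison can be set up with Hugging Face and Faiss: embed each image with a pretrained vision backbone, L2-normalize the vectors, and search with an inner-product index (which equals cosine similarity after normalization). The model name, file paths, and mean-pooling choice below are illustrative assumptions, not necessarily the exact setup used later in the post.

```python
import faiss
import numpy as np
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Illustrative backbone; any of the compared models could be swapped in here.
MODEL_NAME = "google/vit-base-patch16-224-in21k"
processor = AutoImageProcessor.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def embed(image: Image.Image) -> np.ndarray:
    """Return an L2-normalized embedding (here: mean-pooled last hidden state)."""
    inputs = processor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    vec = outputs.last_hidden_state.mean(dim=1).squeeze(0).numpy()
    return vec / np.linalg.norm(vec)

# Index a small gallery of images (hypothetical file paths).
gallery_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
gallery = [Image.open(p).convert("RGB") for p in gallery_paths]
vectors = np.stack([embed(img) for img in gallery]).astype("float32")

index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine here
index.add(vectors)

# Query with a new image and retrieve the two nearest neighbors.
query = embed(Image.open("query.jpg").convert("RGB")).astype("float32")[None, :]
scores, ids = index.search(query, k=2)
print(scores, ids)
```

Swapping the backbone (e.g., to a CLIP or DINO-v2 checkpoint) only changes the embedding step; the Faiss indexing and search code stays the same, which is what makes a side-by-side comparison straightforward.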
