BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

#blip #review #ai

Cross-modal pre-training has been all the rage lately in deep learning, especially training vision and language models together. However, there are a number of issues, such as low-quality datasets that limit the performance of any model trained on them, and the fact that models pre-trained with a purely contrastive objective cannot easily be fine-tuned for most downstream tasks. BLIP unifies different tasks and objectives in a single pre-training run and achieves a much more versatile model, which the paper immediately uses to create, filter, clean and thus bootstrap its own dataset to improve performance even more!

Sponsor: Zeta Alpha
Use code YANNIC for 20% off!

OUTLINE:
0:00 - Intro
0:50 - Sponsor: Zeta Alpha
3:40 - Paper Overview
6:40 - Vision-Language Pre-Training
11:15 - Contributions of the paper
14:30 - Model architecture: many parts for many tasks
19:50 - How data flows in the model
26:50 - Parameter sharing between the modules
29:45 - Captioning & Filtering bootstrapping
41:10