Vision-Language Models in Practice

  • 90 Days
  • 58 Steps
Get a certificate by completing the program.
Everyone who has completed all steps in the program will get a badge.

About

This hands-on course introduces participants to modern vision-language models that combine image and text understanding. Building on the foundations established in the previous course on YOLO and object detection, it extends participants' skills into the multimodal domain by integrating natural language with visual perception. Over 14 structured sessions, participants explore models such as CLIP, BLIP, BLIP-2, OWL-ViT, X-CLIP, GLIP, and LLaVA through practical examples and coding exercises. The course is designed for those with prior experience in deep learning and Python who want to apply state-of-the-art vision-language models in real-world settings.

Each session focuses on a specific task, including zero-shot classification, image captioning, open-vocabulary detection, visual question answering, and image-based dialogue. Every lesson includes a short video or reading, followed by a guided hands-on implementation using PyTorch and Hugging Face libraries. Participants work directly with pre-trained models, explore prompt design, and build simple yet functional applications.

By the end of the course, participants will understand how to apply vision-language models effectively, evaluate their outputs, and recognize their limitations. The course concludes with a review of practical use cases across domains such as healthcare, robotics, education, and digital media.
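To give a flavour of the zero-shot classification task covered early in the course, here is a minimal sketch of the scoring step CLIP-style models use: L2-normalise the image and text embeddings, take scaled cosine similarities, and apply a softmax over the candidate labels. The embeddings below are random stand-ins, not outputs of a real model.

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, logit_scale=100.0):
    """Score candidate labels CLIP-style: cosine similarity between
    L2-normalised embeddings, scaled, then softmax over labels."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * (txt @ img)        # one logit per candidate label
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return exp / exp.sum()

# Toy stand-in embeddings (hypothetical values, not real CLIP outputs)
rng = np.random.default_rng(0)
image_emb = rng.normal(size=8)                # "image" embedding
text_embs = rng.normal(size=(3, 8))           # three candidate label embeddings
probs = zero_shot_scores(image_emb, text_embs)
print(probs)  # three probabilities over the candidate labels
```

In the course itself the same computation happens inside pre-trained models loaded through Hugging Face, with label strings turned into embeddings via prompts such as "a photo of a camel".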


Price

Single Payment
£39.00
2 Plans Available
From £29.00/month
