Visual Question Answering System with YOLO, CLIP, ViT, BLIP Captioning, and LLMs

Built an end-to-end VQA platform for image upload, scene understanding, and LLM-based answers, covering the full path from architecture and implementation to a working delivery.

Personal Projects · Year: 2026

Project Overview

Objective

Built an end-to-end VQA platform for image upload, scene understanding, and LLM-based answers.

Stack

FastAPI · React · YOLO · CLIP · ViT · BLIP · GPT-4o-mini · GPT-4.1 · GPT-5
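CLIP's role in this stack is image–text similarity reasoning: the image and each candidate text are embedded, and candidates are ranked by cosine similarity. A minimal sketch of that ranking step, using stand-in embedding vectors rather than real CLIP encoder outputs:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_texts(image_emb, text_embs, labels):
    """Rank candidate texts by similarity to the image embedding, best first."""
    sims = [cosine(image_emb, t) for t in text_embs]
    return sorted(zip(labels, sims), key=lambda p: p[1], reverse=True)

# Stand-in vectors: in the real system these come from CLIP's
# image and text encoders (e.g. via Hugging Face transformers).
image_emb = (0.9, 0.1, 0.2)
text_embs = [
    (0.8, 0.2, 0.1),   # "a photo of a dog"
    (0.1, 0.9, 0.3),   # "a photo of a cat"
]
print(rank_texts(image_emb, text_embs, ["dog", "cat"]))
```

The function names and toy vectors are illustrative assumptions; only the cosine-similarity ranking itself is what CLIP-based matching actually does.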

Delivery highlights

  • Developed an end-to-end Visual Question Answering (VQA) system that lets users upload an image and ask natural-language questions about the scene.
  • Combined YOLO for object detection, CLIP for image–text similarity reasoning, Vision Transformer (ViT) for image classification, and BLIP for automatic image captioning, then integrated the extracted visual information with a selectable Large Language Model (LLM) to generate context-aware answers.
  • Supported multiple LLM options (GPT-4o-mini, GPT-4.1, and GPT-5), allowing users to compare responses from different models.
  • Built the backend with FastAPI for model inference and API services, and developed an interactive React web interface where users can upload images, select the LLM, visualize detected objects and bounding boxes, and receive AI-generated explanations of the image content.
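The handoff from the vision models to the LLM described above can be sketched as a prompt-assembly step: the BLIP caption and YOLO detections are serialized into text context alongside the user's question. The function name and detection dictionary format here are illustrative assumptions, not the project's actual interfaces.

```python
def build_vqa_prompt(caption, detections, question):
    """Merge a BLIP caption and YOLO-style detections into one LLM prompt."""
    lines = [f"Image caption: {caption}", "Detected objects:"]
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        lines.append(
            f"- {det['label']} (confidence {det['conf']:.2f}, "
            f"box [{x1}, {y1}, {x2}, {y2}])"
        )
    lines.append(f"Question: {question}")
    lines.append("Answer using only the visual evidence above.")
    return "\n".join(lines)

# Example detections in an assumed (label, conf, box) format.
detections = [
    {"label": "dog", "conf": 0.91, "box": (34, 50, 210, 300)},
    {"label": "ball", "conf": 0.78, "box": (220, 240, 260, 280)},
]
prompt = build_vqa_prompt(
    "a dog playing with a ball on grass",
    detections,
    "What is the dog doing?",
)
print(prompt)
```

The resulting string would be sent as the user message to whichever LLM the user selected (GPT-4o-mini, GPT-4.1, or GPT-5), keeping the vision pipeline and the answering model loosely coupled.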

Related Projects


AI Document Question Answering System with RAG and LLM

Personal Projects · Year: 2026

Built a PDF upload and natural-language QA system with retrieval-augmented generation.

Electric Vehicle Charger Socket Semantic Visual Search System with YOLO, CLIP, and FAISS

Personal Projects · Year: 2026

Built a two-stage pipeline for semantic retrieval and socket-type refinement over EV charger images.

AI Document Search & Question Answering System (RAG)

Personal Projects · Year: 2026

Built a multimodal AI system for document, audio, and image understanding with natural language question answering using retrieval-augmented generation (RAG).