Project Overview
Objective
Built a multimodal AI system that understands documents, audio, and images and answers natural-language questions using retrieval-augmented generation (RAG).
Stack
- Backend: FastAPI, background job processing, persistent storage
- Ingestion: PyMuPDF, Typhoon OCR API, Whisper, BLIP
- Retrieval: SentenceTransformers (BAAI/bge-m3), Qdrant, Elasticsearch
- Generation: LangChain with GPT-4o-mini, GPT-4.1, GPT-5
- Frontend: Next.js
Delivery highlights
- Developed a unified multimodal AI platform by extending and integrating several existing systems — AI Document Question Answering with RAG and LLMs, Multimodal Semantic Search, AI Meeting Transcription & Q&A, and Text-to-Image Semantic Search — into a single architecture that enables cross-modal retrieval and context-aware reasoning across documents, audio, and images.
- Designed and implemented RESTful APIs with FastAPI for file upload, background processing, indexing, and question-answering workflows.
- Built the ingestion pipeline: PyMuPDF for document parsing, the Typhoon OCR API for extracting text from images and scanned PDFs, Whisper for speech-to-text transcription of audio, and BLIP for image captioning.
- Applied text chunking and generated semantic embeddings with SentenceTransformers (BAAI/bge-m3), stored in Qdrant for vector similarity search; combined this with Elasticsearch keyword retrieval in a hybrid search system that improves retrieval accuracy and reduces hallucination.
- Leveraged selectable large language models (GPT-4o-mini, GPT-4.1, GPT-5) via LangChain to generate context-grounded answers with source attribution.
- Supported the platform with scalable backend services, background job processing, persistent storage, and a modern Next.js frontend for file upload, semantic search, and interactive knowledge exploration across multiple data sources.
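The upload and background-processing workflow above can be sketched with FastAPI's `BackgroundTasks`. This is a minimal illustration, not the project's actual code; the `/upload` route and `index_document` function are assumed names:

```python
from fastapi import BackgroundTasks, FastAPI, UploadFile

app = FastAPI()

def index_document(filename: str, data: bytes) -> None:
    # Placeholder for the parse -> chunk -> embed -> index pipeline
    # (PyMuPDF / OCR / Whisper / BLIP depending on file type).
    ...

@app.post("/upload")
async def upload(file: UploadFile, background_tasks: BackgroundTasks):
    data = await file.read()
    # Queue indexing in the background so the request returns immediately.
    background_tasks.add_task(index_document, file.filename, data)
    return {"status": "queued", "filename": file.filename}
```

Returning a "queued" status and indexing asynchronously keeps large uploads from blocking the request path.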
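A minimal sketch of the chunking step that feeds the embedding model. The `chunk_text` and `embed_chunks` helpers and the character-based sizes are assumptions for illustration; BAAI/bge-m3 is the embedding model the project uses:

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> List[str]:
    """Split text into overlapping character windows so each chunk
    retains context from its neighbours."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def embed_chunks(chunks: List[str]):
    # Assumes sentence-transformers is installed; bge-m3 yields dense
    # vectors that can be upserted into a Qdrant collection.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("BAAI/bge-m3")
    return model.encode(chunks, normalize_embeddings=True)
```

Overlapping windows are a common default; the project's real chunking strategy (token-based, sentence-based, etc.) is not specified.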
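Hybrid search merges Qdrant similarity hits with Elasticsearch keyword hits. One common fusion strategy is reciprocal rank fusion (RRF); the project does not state which fusion method it uses, so this is an illustrative sketch over two ranked ID lists:

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(
    vector_hits: List[str],
    keyword_hits: List[str],
    k: int = 60,
) -> List[str]:
    """Merge ranked document-ID lists from semantic (Qdrant) and keyword
    (Elasticsearch) retrieval: each hit scores 1 / (k + rank), and
    documents found by both retrievers accumulate both scores."""
    scores: Dict[str, float] = defaultdict(float)
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because both retrievers must agree for a document to score highly, fusion like this is what lets hybrid search improve precision over either retriever alone.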
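Context-grounded answering with source attribution ultimately hinges on the prompt handed to the LLM. A hypothetical sketch of such a prompt builder — the exact wording and the `build_grounded_prompt` name are assumptions, not the project's:

```python
from typing import List, Tuple

def build_grounded_prompt(question: str, chunks: List[Tuple[str, str]]) -> str:
    """Assemble a prompt that instructs the model to answer only from
    the retrieved chunks and to cite them, reducing hallucination.
    chunks: (source_name, text) pairs from hybrid retrieval."""
    context = "\n\n".join(
        f"[{i}] ({source}) {text}"
        for i, (source, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the sources below and cite them as [n]. "
        "If the answer is not in the sources, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The numbered `[n]` markers are what allow the generated answer to carry per-claim source attribution back to the original documents, audio transcripts, or image captions.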