Multimodal Semantic Search Chat System (FastAPI, Qdrant, CLIP, BLIP, Typhoon OCR)

Built a multimodal chat system leveraging vector embeddings and cross-modal retrieval for semantic search across text and images with multilingual support. The project covers architecture, implementation, and delivery.

Personal Projects | Year: 2026

Project Overview

Objective

Built a multimodal chat system leveraging vector embeddings and cross-modal retrieval for semantic search across text and images with multilingual support.

Stack

FastAPI, Next.js, PostgreSQL, Qdrant, SentenceTransformer, CLIP, BLIP, GoogleTranslator, Typhoon OCR

Delivery highlights

  • Developed a multimodal semantic search chat system enabling cross-modal retrieval across text and images using a dual-database architecture (PostgreSQL for structured chat data and Qdrant for vector similarity search).
  • Leveraged SentenceTransformers for multilingual text embeddings and CLIP for unified image–text representation, enhancing image understanding through BLIP-based captioning and Typhoon OCR for text extraction with translation.
  • Designed a hybrid search pipeline combining semantic similarity from both text and image modalities with ranking and filtering to improve retrieval relevance and accuracy.
  • Built scalable backend services using FastAPI and integrated with a Next.js frontend to support real-time chat interaction and efficient semantic search.
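
The hybrid search step described above can be illustrated with a minimal sketch. It assumes the text search and the image search each return (id, similarity) pairs, and it combines them with a weighted sum before filtering and ranking. The names `Hit` and `fuse_hits`, the 0.6/0.4 weighting, and the 0.3 score threshold are hypothetical choices for illustration; the project description does not state which fusion method or parameters were actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    item_id: str   # hypothetical shared id (e.g. a chat message or image id)
    score: float   # similarity score returned by one vector search


def fuse_hits(text_hits, image_hits, text_weight=0.6, min_score=0.3, top_k=5):
    """Weighted-sum fusion of two result lists, followed by filtering and ranking.

    The weights and threshold are illustrative assumptions, not project values.
    """
    image_weight = 1.0 - text_weight
    combined = {}
    # Accumulate the weighted score of every item seen in either modality.
    for hit in text_hits:
        combined[hit.item_id] = combined.get(hit.item_id, 0.0) + text_weight * hit.score
    for hit in image_hits:
        combined[hit.item_id] = combined.get(hit.item_id, 0.0) + image_weight * hit.score
    # Filter out weak matches, then rank by fused score (ties broken by id).
    kept = [Hit(item_id, score) for item_id, score in combined.items() if score >= min_score]
    kept.sort(key=lambda h: (-h.score, h.item_id))
    return kept[:top_k]


# Toy scores standing in for results from the text and image vector searches.
text_hits = [Hit("msg-1", 0.9), Hit("msg-2", 0.4)]
image_hits = [Hit("msg-2", 0.8), Hit("img-7", 0.5)]

results = fuse_hits(text_hits, image_hits)
print([h.item_id for h in results])  # ['msg-2', 'msg-1']
```

In this toy run, "msg-2" outranks "msg-1" because it matches in both modalities, while "img-7" is dropped by the threshold. Other fusion schemes, such as reciprocal rank fusion, would fit the same interface.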

Related Projects

Multimodal Semantic Retrieval (Video and Image Search)

Personal Projects | Year: 2026

Unified text-to-video and text-to-image search in one cross-modal retrieval platform.

Electric Vehicle Charger Socket Semantic Visual Search System with YOLO, CLIP, and FAISS

Personal Projects | Year: 2026

Built two-stage semantic retrieval and socket-type refinement for EV charger images.

Visual Question Answering System with YOLO, CLIP, ViT, BLIP, BLIP Caption, and LLM

Personal Projects | Year: 2026

Built an end-to-end VQA platform for image upload, scene understanding, and LLM-based answers.