This paper describes our method for the ECCV 2024 Workshop W-CODA Track 1: Corner Case Scene Understanding. We propose LiteViLA: a Lightweight Vision-Language model for scene understanding in Autonomous driving, which leverages the TinyLLaVA backbone for efficient processing of large-scale multimodal data. Our approach extracts visual features with a Vision Encoder and a Q-Former, and fuses the visual and language modalities in the Language Model (LM) through a Mixture of Adapters (MoA) mechanism. The MoA dynamically selects task-specific adapters for General Perception, Region Perception, and Driving Suggestions, optimizing performance across these three tasks. Finally, a Reviewer component refines the generated answers to improve their accuracy and relevance.
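
To illustrate the MoA mechanism described above, the following is a minimal sketch of how task-specific adapters could be selected dynamically inside the LM. It assumes bottleneck adapters and a soft router over the three tasks; the names (`AdapterMoA`, `BottleneckAdapter`, `bottleneck_dim`) and the pooled-routing strategy are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of a Mixture-of-Adapters (MoA) layer: a router weights
# task-specific bottleneck adapters for general perception, region perception,
# and driving suggestions. Dimensions and module names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BottleneckAdapter(nn.Module):
    """Task-specific adapter: down-project, non-linearity, up-project, residual."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(F.gelu(self.down(x))) + x


class AdapterMoA(nn.Module):
    """Routes LM hidden states to one adapter per task via a learned gate."""

    def __init__(self, hidden_dim: int, num_tasks: int = 3):
        super().__init__()
        self.adapters = nn.ModuleList(
            BottleneckAdapter(hidden_dim) for _ in range(num_tasks)
        )
        self.router = nn.Linear(hidden_dim, num_tasks)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim)
        # Gate on the mean-pooled sequence representation.
        gate = F.softmax(self.router(hidden.mean(dim=1)), dim=-1)      # (batch, num_tasks)
        outputs = torch.stack([a(hidden) for a in self.adapters], dim=1)  # (batch, tasks, seq, dim)
        return torch.einsum("bt,btsd->bsd", gate, outputs)


# Usage example: one MoA layer applied to dummy LM hidden states.
moa = AdapterMoA(hidden_dim=768)
states = torch.randn(2, 16, 768)
print(moa(states).shape)  # torch.Size([2, 16, 768])
```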