Please use this identifier to cite or link to this item: http://repository.aaup.edu/jspui/handle/123456789/2208
Title: Using Multimodal Large Language Models (MLLMs) for Automated Detection of Traffic Safety-Critical Events
Authors: Abu Tami, Mohammad (AAUP, Palestine)
Ashqar, Huthaifa (AAUP, Palestine)
Elhenawy, Mohammed
Glaser, Sebastien
Rakotonirainy, Andry
Issue Date: 2-Sep-2024
Publisher: MDPI
Citation: Abu Tami, M.; Ashqar, H.I.; Elhenawy, M.; Glaser, S.; Rakotonirainy, A. Using Multimodal Large Language Models (MLLMs) for Automated Detection of Traffic Safety-Critical Events. Vehicles 2024, 6, 1571-1590. https://doi.org/10.3390/vehicles6030074
Abstract: Traditional approaches to safety event analysis in autonomous systems have relied on complex machine and deep learning models and extensive datasets for high accuracy and reliability. However, the emergence of multimodal large language models (MLLMs) offers a novel approach by integrating textual, visual, and audio modalities. Our framework leverages the logical and visual reasoning power of MLLMs, directing their output through object-level question–answer (QA) prompts to ensure accurate, reliable, and actionable insights for safety-critical event detection and analysis. By incorporating models like Gemini-Pro-Vision 1.5, we aim to automate safety-critical event detection and analysis and to mitigate common issues such as hallucinations in MLLM outputs. The results demonstrate the framework’s potential in different in-context learning (ICL) settings such as zero-shot and few-shot learning methods. Furthermore, we investigate other settings such as self-ensemble learning and a varying number of frames. The results show that a few-shot learning model consistently outperformed other learning models, achieving the highest overall accuracy of about 79%. A comparative analysis with previous studies on visual reasoning revealed that earlier models achieved only moderate performance on driving safety tasks, while our proposed model significantly outperformed them. To the best of our knowledge, our proposed MLLM model stands out as the first of its kind, capable of handling multiple tasks for each safety-critical event. It can identify risky scenarios, classify diverse scenes, determine car directions, categorize agents, and recommend the appropriate actions, setting a new standard in safety-critical event management. This study shows the significance of MLLMs in advancing the analysis of naturalistic driving videos to improve safety-critical event detection and the understanding of interactions in complex environments.
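
Note: As a rough illustration of the few-shot, object-level QA prompting setup described in the abstract, the sketch below uses the google-generativeai Python SDK to query a Gemini multimodal model over sampled driving-video frames. The model identifier, prompt wording, example file names, and QA format are illustrative assumptions and are not taken from the paper.

# Minimal sketch: few-shot QA prompting of a Gemini multimodal model over
# video frames. Prompts, labels, and file names are illustrative assumptions.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")  # assumed model identifier

# Few-shot exemplars: (frame image, expected answer) pairs shown before the
# query frame so the model imitates the answer format.
few_shot = [
    (Image.open("example_safe.jpg"),
     "Q: Is this a safety-critical event? A: No. Scene: highway, free flow."),
    (Image.open("example_near_crash.jpg"),
     "Q: Is this a safety-critical event? A: Yes. Scene: intersection, "
     "lead vehicle braking hard; recommended action: brake."),
]

def classify_frame(frame_path: str) -> str:
    """Ask the same object-level questions about a new frame."""
    parts = ["You label driving video frames for safety-critical events."]
    for image, answer in few_shot:
        parts.extend([image, answer])
    parts.extend([
        Image.open(frame_path),
        "Q: Is this a safety-critical event? Identify the risky agents, "
        "the scene type, the ego-vehicle direction, and a recommended action.",
    ])
    return model.generate_content(parts).text

print(classify_frame("query_frame.jpg"))

In a zero-shot setting, the few_shot exemplars would simply be omitted and only the query frame and question sent to the model.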
URI: http://repository.aaup.edu/jspui/handle/123456789/2208
DOI: https://doi.org/10.3390/vehicles6030074
Appears in Collections:Faculty & Staff Scientific Research publications

Files in This Item:
File: vehicles-06-00074.pdf
Size: 6.72 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
