Please use this identifier to cite or link to this item: http://repository.aaup.edu/jspui/handle/123456789/2941
Full metadata record
DC Field                   Value                                                      Language
dc.contributor.author      Ashqar, Huthaifa (AAUP; Palestinian)                       -
dc.contributor.author      Al-Hadidi, Taqwa (Other; Other)                            -
dc.contributor.author      Elhenawy, Mohammed (Other; Other)                          -
dc.contributor.author      Khanfar, Nour (AAUP; Palestinian)                          -
dc.date.accessioned        2024-11-05T12:35:21Z                                       -
dc.date.available          2024-11-05T12:35:21Z                                       -
dc.date.issued             2024-10-10                                                 -
dc.identifier.citation     Ashqar, H.I.; Alhadidi, T.I.; Elhenawy, M.; Khanfar, N.O. Leveraging Multimodal Large Language Models (MLLMs) for Enhanced Object Detection and Scene Understanding in Thermal Images for Autonomous Driving Systems. Automation 2024, 5, 508–526. https://doi.org/10.3390/automation5040029    en_US
dc.identifier.issn         https://doi.org/10.3390/automation5040029                  -
dc.identifier.uri          http://repository.aaup.edu/jspui/handle/123456789/2941     -
dc.description.abstract    The integration of thermal imaging data with multimodal large language models (MLLMs) offers promising advancements for enhancing the safety and functionality of autonomous driving systems (ADS) and intelligent transportation systems (ITS). This study investigates the potential of MLLMs, specifically GPT-4 Vision Preview and Gemini 1.0 Pro Vision, for interpreting thermal images for applications in ADS and ITS. Two primary research questions are addressed: the capacity of these models to detect and enumerate objects within thermal images, and to determine whether pairs of image sources represent the same scene. Furthermore, we propose a framework for object detection and classification by integrating infrared (IR) and RGB images of the same scene without requiring localization data. This framework is particularly valuable for enhancing the detection and classification accuracy in environments where both IR and RGB cameras are essential. By employing zero-shot in-context learning for object detection and the chain-of-thought technique for scene discernment, this study demonstrates that MLLMs can recognize objects such as vehicles and individuals with promising results, even in the challenging domain of thermal imaging. The results indicate a high true positive rate for larger objects and moderate success in scene discernment, with a recall of 0.91 and a precision of 0.79 for similar scenes. The integration of IR and RGB images further enhances detection capabilities, achieving an average precision of 0.93 and an average recall of 0.56. This approach leverages the complementary strengths of each modality to compensate for individual limitations. This study highlights the potential of combining advanced AI methodologies with thermal imaging to enhance the accuracy and reliability of ADS, while identifying areas for improvement in model performance.    en_US
dc.language.iso            en_US                                                      en_US
dc.publisher               MDPI                                                       en_US
dc.title                   Leveraging Multimodal Large Language Models (MLLMs) for Enhanced Object Detection and Scene Understanding in Thermal Images for Autonomous Driving Systems    en_US
dc.type                    Article                                                    en_US
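The abstract above describes zero-shot in-context learning for object detection in thermal images using vision-capable chat models. A minimal sketch of what such a request might look like is given below, assuming an OpenAI-style chat message format; the prompt wording, function name, and payload structure are illustrative assumptions, not the authors' exact setup (only the model name "gpt-4-vision-preview" appears in the record).

```python
import base64


def build_thermal_detection_request(image_bytes: bytes,
                                    model: str = "gpt-4-vision-preview") -> dict:
    """Build a zero-shot object-detection request for one thermal image.

    NOTE: the prompt text and payload layout are assumptions sketching the
    zero-shot approach described in the abstract, not the paper's exact prompts.
    """
    # Vision chat APIs typically accept images as base64 data URLs.
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "This is a thermal (infrared) road scene. "
                            "List each object class you can identify "
                            "(e.g. car, pedestrian, cyclist) and count how "
                            "many of each appear. Answer as 'label: count' "
                            "lines only."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
    }
```

Because no in-context examples are included in the message, the model must rely on its pretrained knowledge alone, which is what makes the setup zero-shot; a fused IR+RGB variant would simply attach a second `image_url` entry for the RGB frame of the same scene.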
Appears in Collections:Faculty & Staff Scientific Research publications

Files in This Item:
File                       Description    Size       Format
automation-05-00029.pdf                   8.95 MB    Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
