We explore the cognitive capabilities of large language models (LLMs) to reason about a scene based solely on knowledge of relevant objects and their spatial positions. We designed a specialized module that leverages these capabilities to assist visually impaired individuals in navigating and interpreting their environments.
We formalize our problem statement as: “Given a stream of RGB-D video, how can we process this information using an LLM to generate coherent language output that informs a visually impaired person about their surroundings?”
Our methodology first uses an object detector to identify and localize objects in the scene. These detections are then paired with depth information and passed to the LLM for reasoning. This approach lets the LLM incorporate temporal data, enabling a better understanding of the scene. In addition, because object detectors are computationally lighter than VLMs, our method is less resource-intensive.
Below are the modules used in our system and how they work:
Images captured at regular intervals by a LiDAR camera (Intel RealSense L515) are passed through an object detector (YOLOv11) to extract object labels and pixel locations. Combined with the depth channel, these detections yield the 3D spatial positions of the objects. A custom proximity map then translates this numerical data into textual descriptions that the LLM can reason over more effectively. Finally, the LLM (GPT-4o-mini) is queried with the last 15 seconds of data and its previous five responses to generate user-friendly descriptions of the scene.
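A minimal sketch of this pipeline is given below. It assumes the Ultralytics Python API for YOLOv11 and the OpenAI chat-completions API for GPT-4o-mini, uses standard pinhole back-projection for the 3D positions, and expects a depth frame already aligned to the colour image in metres; the checkpoint name, proximity thresholds, phrases, and prompt wording are illustrative placeholders rather than our exact proximity map and prompt.

```python
import numpy as np
from ultralytics import YOLO          # assumed Ultralytics package for YOLOv11
from openai import OpenAI             # assumed OpenAI client for GPT-4o-mini

detector = YOLO("yolo11n.pt")         # illustrative checkpoint name
client = OpenAI()

def pixel_to_3d(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth (metres) to a camera-frame 3D point
    using the pinhole model and the camera intrinsics (fx, fy, cx, cy)."""
    return np.array([(u - cx) * depth_m / fx, (v - cy) * depth_m / fy, depth_m])

def proximity_phrase(p):
    """Illustrative proximity map: coarse textual description of a 3D position.
    Thresholds and wording are placeholders, not our exact mapping."""
    dist = float(np.linalg.norm(p))
    side = "on your left" if p[0] < -0.3 else "on your right" if p[0] > 0.3 else "ahead"
    if dist < 1.0:
        return f"very close, {side}"
    if dist < 3.0:
        return f"a few steps away, {side}"
    return f"far away, {side}"

def describe_frame(rgb, depth, intrinsics):
    """Detect objects in one RGB frame and describe each with a proximity phrase."""
    fx, fy, cx, cy = intrinsics
    result = detector(rgb)[0]
    phrases = []
    for box in result.boxes:
        u, v = (float(x) for x in box.xywh[0][:2])   # box centre in pixels
        d = float(depth[int(v), int(u)])             # aligned depth at the centre (metres)
        label = result.names[int(box.cls)]
        phrases.append(f"{label} {proximity_phrase(pixel_to_3d(u, v, d, fx, fy, cx, cy))}")
    return "; ".join(phrases)

def query_llm(recent_descriptions, previous_responses):
    """Summarise the last window of frame descriptions for the user.
    The prompt wording is illustrative, not our exact prompt."""
    prompt = (
        "You are assisting a visually impaired person. Based on the object positions "
        "observed over the last 15 seconds, describe the surroundings in one or two "
        "short, friendly sentences.\n\nObservations:\n" + "\n".join(recent_descriptions)
        + "\n\nYour previous five descriptions (avoid repeating them):\n"
        + "\n".join(previous_responses)
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content
```

In practice, `describe_frame` runs on every captured frame, and the accumulated descriptions from the last 15 seconds, together with the previous five responses, are passed to `query_llm`.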
This module helps the user explore static scenes in more detail by using an off-the-shelf Visual Question Answering model (BLIP) to answer multiple user queries.
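A minimal sketch of this question-answering step, assuming the Hugging Face transformers implementation of BLIP and the Salesforce/blip-vqa-base checkpoint (both our assumptions; the module only requires an off-the-shelf BLIP VQA model):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Assumed checkpoint; any off-the-shelf BLIP VQA model would fit here.
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def answer_question(image_path: str, question: str) -> str:
    """Answer one user question about a static scene image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(image, question, return_tensors="pt")
    output_ids = model.generate(**inputs)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# The user can ask several follow-up questions about the same frame, e.g.:
# answer_question("scene.jpg", "Is there a chair in front of me?")
# answer_question("scene.jpg", "What colour is the door?")
```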
This module uses an off-the-shelf pretrained BLIP model to generate a caption for the image, providing a general description of the scene for use cases where the user seeks an overview.
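Similarly, a captioning sketch assuming the transformers BLIP captioning checkpoint (Salesforce/blip-image-captioning-base, our assumption):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed checkpoint; any pretrained BLIP captioning model would fit here.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(image_path: str) -> str:
    """Generate a one-sentence overview of the scene."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```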
To evaluate our system, we opted for human evaluation, which appeared to be the most suitable approach given the absence of ground-truth data for our task. For comparison, we used a vision-language model (VLM) as the baseline, since VLMs are the closest alternative to our intended methodology: the VLM served as the control, while the experimental variable was the navigation module integrated into our system. A comparative survey was conducted to gather user preferences and feedback.
In the survey, participants reviewed two sample videos, each showing side-by-side responses from the baseline VLM and our method. We gathered data on two aspects: which method's responses participants preferred, and how useful they found each method on a scale of 1–5.
The results provided insights into user preferences and the perceived utility of each approach. Detailed findings are summarized below.
Preference | Ours | Baseline |
---|---|---|
Sample 1 | 21/22 | 1/22 |
Sample 2 | 18/21 | 3/21 |
Total | 39/43 | 4/43 |
Net Preference Score = (Preference for Ours - Preference for Baseline) / Total Responses
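As a quick check, the score follows directly from the counts in the table above:

```python
# Net preference score computed from the survey counts above.
ours, baseline, total = 39, 4, 43
net_preference = (ours - baseline) / total
print(f"{net_preference:.1%}")  # -> 81.4%
```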
Based on the 43 responses, the net preference score for our method was 81.4%.

We also collected ratings on a scale of 1–5 for the perceived usefulness of each method; the results are summarized below:
Metric | Ours | Baseline |
---|---|---|
Average | 3.65 | 2.02 |
Standard Deviation | 1.16 | 1.00 |
Median | 4 | 2 |
The scores for both methods are shown in the bar graph below.
In this work, we explored the use of large language models as cognitive filters and parsers of depth and visual information for the specific use case of assisting visually impaired individuals. The results demonstrate that the concept is feasible and warrants further exploration with enhanced hardware and resources. Due to time and resource constraints, we tested the method in a limited number of scenarios. However, with additional experiments and fine-tuning using reward-based methods, we anticipate significant improvements in the system.
1. Hong, Y., Zhen, H., Chen, P., Zheng, S., Du, Y., Chen, Z., & Gan, C. (2023). 3D-LLM: Injecting the 3D World into Large Language Models. arXiv. https://arxiv.org/abs/2307.12981
2. Li, J., Li, D., Xiong, C., & Hoi, S. (2022). BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. arXiv. https://arxiv.org/abs/2201.12086
3. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv. https://arxiv.org/abs/2103.00020
4. Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You Only Look Once: Unified, Real-Time Object Detection. arXiv. https://arxiv.org/abs/1506.02640