Scene Descriptions for the Visually Impaired

Sentimentals
CSCI 5541: Natural Language Processing Final Project

Model output in navigation mode.

Model output in Question/Answering mode.

Motivation

We explore the cognitive capabilities of large language models (LLMs) to reason about a scene based solely on knowledge of relevant objects and their spatial positions. We designed a specialized module that leverages these capabilities to assist visually impaired individuals in navigating and interpreting their environments.

Problem Definition

We formalize our problem statement as: “Given a stream of RGB-D video, how can we process this information using an LLM to generate coherent language output that informs a visually impaired person about their surroundings?”

Proposed Idea

Our methodology first uses an object detector to identify and localize objects in the scene. These detections are then paired with depth information and provided to the LLM for reasoning. This approach allows the LLM to incorporate temporal data, enabling a better understanding of the scene. In addition, because object detectors are computationally lighter than VLMs, our method is less resource-intensive.


Information flow for navigation module of our system.

Methods

Below are the modules that we are using in our system and how they work:

Navigation

Images captured at regular intervals by a LiDAR camera (Intel RealSense L515) are passed through an object detector to extract object labels and pixel locations. Combined with the depth channel, these detections yield the 3D spatial positions of the objects. A custom proximity map then translates this numerical data into textual descriptions that are easier for the LLM to reason over. The LLM is queried on the last 15 seconds of observations together with its previous five responses to generate user-friendly descriptions. We use YOLOv11 for object detection and GPT-4o-mini as the LLM.
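
As a concrete illustration of the steps above, the sketch below back-projects detections into 3D using the depth frame, maps positions to coarse textual cues, and queries the LLM on recent observations plus its own previous responses. It is a minimal sketch, not our exact implementation: the camera intrinsics, proximity thresholds, prompt wording, and update rate are illustrative placeholders, and it assumes the depth frame is already aligned to the RGB frame and expressed in meters.

```python
# Sketch of the navigation loop: detections + depth -> 3D positions -> text -> LLM.
# Camera intrinsics, the proximity map, and the prompt wording are illustrative.
from collections import deque
from ultralytics import YOLO          # object detector (YOLOv11)
from openai import OpenAI             # GPT-4o-mini client

FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0   # placeholder L515 intrinsics

detector = YOLO("yolo11n.pt")
llm = OpenAI()
history = deque(maxlen=15)            # ~15 s of observations at 1 frame/s
responses = deque(maxlen=5)           # previous five LLM answers

def to_3d(u, v, z):
    """Back-project a pixel (u, v) with depth z (metres) into camera coordinates."""
    return ((u - CX) * z / FX, (v - CY) * z / FY, z)

def proximity_phrase(x, z):
    """Illustrative proximity map: numeric position -> coarse textual cue."""
    side = "on your left" if x < -0.3 else "on your right" if x > 0.3 else "straight ahead"
    dist = "very close" if z < 1.0 else "a few steps away" if z < 3.0 else "far away"
    return f"{side}, {dist}"

def describe_frame(rgb, depth):
    """Run detection on one frame and return textual object descriptions."""
    result = detector(rgb)[0]
    lines = []
    for box, cls in zip(result.boxes.xyxy.tolist(), result.boxes.cls.tolist()):
        u = int((box[0] + box[2]) / 2)          # bounding-box centre pixel
        v = int((box[1] + box[3]) / 2)
        z = float(depth[v, u])
        if z <= 0:                              # skip missing depth readings
            continue
        x, _, z = to_3d(u, v, z)
        lines.append(f"{detector.names[int(cls)]}: {proximity_phrase(x, z)}")
    return lines

def navigation_update(rgb, depth):
    """Append the newest observation and ask the LLM for a user-facing description."""
    history.append("; ".join(describe_frame(rgb, depth)) or "no objects detected")
    prompt = (
        "You are guiding a visually impaired person. Recent observations (oldest first):\n"
        + "\n".join(history)
        + "\nYour previous guidance:\n" + "\n".join(responses)
        + "\nGive one short, friendly update about the surroundings."
    )
    reply = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    responses.append(reply)
    return reply
```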

Visual Question Answering

This module helps the user explore static scenes in more depth by using an off-the-shelf Visual Question Answering package (BLIP) to answer multiple user queries about the scene.
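
The write-up only specifies an off-the-shelf BLIP package; one common way to run BLIP visual question answering is the Hugging Face checkpoint below (a minimal sketch; the image path and question are placeholders).

```python
# Minimal BLIP VQA sketch using the Hugging Face checkpoint.
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("scene.jpg").convert("RGB")     # placeholder image path
question = "Is there a chair in front of me?"      # placeholder user query

inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```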

Scene Description

This module uses an off-the-shelf pretrained BLIP model to generate a caption for the image, providing a general description of the scene for use cases where the user seeks an overview.
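
Likewise, a minimal sketch of the captioning step with the pretrained BLIP captioning checkpoint on Hugging Face (the image path is a placeholder):

```python
# Minimal BLIP captioning sketch for the scene-description module.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("scene.jpg").convert("RGB")     # placeholder image path
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```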

Evaluation and Results

To evaluate our system, we opted for human evaluation, which appeared to be the most suitable approach given the absence of ground-truth data for our task. As a baseline we used a vision-language model (VLM), since VLMs are the closest alternative to our intended methodology: the VLM served as the control, and the experimental variable was the integration of our navigation module.

We conducted a comparative survey to gather user preferences and feedback. Participants reviewed two sample videos, each displaying side-by-side responses from the baseline VLM and our method. We gathered data on two aspects:

  1. a preference comparison between our method and the VLM baseline, and
  2. the perceived usefulness of the methods on a scale of 1–5.
The form used to collect data is available here: Link to Google Form

The results provided insights into user preferences and the perceived utility of each approach. Detailed findings are summarized below.

Preference Comparison

Preference    Ours     Baseline
Sample 1      21/22    1/22
Sample 2      18/21    3/21
Total         39/43    4/43

Net Preference Score

Based on the user preference data, we calculated the net preference score as:

Net Preference Score = (Preference for Ours - Preference for Baseline) / Total Responses

Based on the 43 responses, the net preference score for our method was: 81.4%
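
The score can be recomputed directly from the preference table above; a minimal check:

```python
# Net preference score from the survey counts reported above.
ours, baseline, total = 39, 4, 43
net_preference = (ours - baseline) / total
print(f"Net preference score: {net_preference:.1%}")   # 81.4%
```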

Rating Metrics

We also collected ratings on a scale of 1–5 for the perceived usefulness of each method; the results are summarized below:

Metric                Ours    Baseline
Average               3.65    2.02
Standard deviation    1.16    1.00
Median                4       2

The scores for both methods are shown in the bar graph below.

Bar graph summarizing the comparative results of our method and the baseline (VLM).

Conclusion

In this work, we explored the use of large language models as cognitive filters and parsers of depth and visual information for the specific use case of assisting visually impaired individuals. The results demonstrate that the concept is feasible and warrants further exploration with enhanced hardware and resources. Due to time and resource constraints, we tested the method in a limited number of scenarios. However, with additional results and fine-tuning using reward-based methods, we anticipate significant improvements in the system.

Poster

References

1. Hong, Y., Zhen, H., Chen, P., Zheng, S., Du, Y., Chen, Z., & Gan, C. (2023). 3D-LLM: Injecting the 3D World into Large Language Models. arXiv. https://arxiv.org/abs/2307.12981

2. Li, J., Li, D., Xiong, C., & Hoi, S. (2022). BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. arXiv. https://arxiv.org/abs/2201.12086

3. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv. https://arxiv.org/abs/2103.00020

4. Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You Only Look Once: Unified, Real-Time Object Detection. arXiv. https://arxiv.org/abs/1506.02640