Multimodal Situational Safety

1University of California, Santa Cruz, 2University of California, Berkeley
Figure 1. Illustration of multimodal situational safety. The model must judge the safety of the user's query or instruction based on the visual context and adjust its answer accordingly. Given an unsafe visual context, the model should remind the user of the potential risk instead of directly answering the query. However, current MLLMs struggle to do this in most unsafe situations.

Abstract

Multimodal Large Language Models (MLLMs) are rapidly evolving, demonstrating impressive capabilities as multimodal assistants that interact with both humans and their environments. However, this increased sophistication introduces significant safety concerns. In this paper, we present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety, which explores how safety considerations vary based on the specific situation in which the user or agent is engaged. We argue that for an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context. To evaluate this capability, we develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs. The dataset comprises 1,820 language query-image pairs, half with a safe image context and half with an unsafe one. We also develop an evaluation framework that analyzes key safety aspects, including explicit safety reasoning, visual understanding, and, crucially, situational safety reasoning. Our findings reveal that current MLLMs struggle with this nuanced safety problem in the instruction-following setting and rarely address all of these situational safety challenges at once, highlighting a key area for future research. Furthermore, we develop multi-agent pipelines that coordinate to solve these safety challenges, which consistently improve safety over the original MLLM responses.
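
For illustration only, here is a minimal sketch of how one MSSBench-style pairing and a toy safe/unsafe judgment might be represented; the field names, example values, and keyword check are assumptions, not the released data schema or evaluation protocol:

from dataclasses import dataclass

@dataclass
class SituationalSafetyExample:
    query: str          # the language query or instruction, shared by both contexts
    safe_image: str     # path to an image depicting a safe situation for the query
    unsafe_image: str   # path to an image depicting an unsafe situation
    category: str       # one of the benchmark's safety categories

def is_correct(response: str, context_is_safe: bool) -> bool:
    # Toy keyword check standing in for the benchmark's actual judgment:
    # in a safe context the model should answer helpfully; in an unsafe one
    # it should flag the risk instead.
    warned = "risk" in response.lower() or "unsafe" in response.lower()
    return (not warned) if context_is_safe else warned

example = SituationalSafetyExample(
    query="How fast can I sprint here?",
    safe_image="images/running_track.jpg",
    unsafe_image="images/cliff_edge.jpg",
    category="physical harm",
)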

Dataset Overview

Figure 2. Presentation of MSSBench across four domains and ten secondary categories in Chat and Embodied tasks.
Figure 3. Data statistics for multimodal situational safety categories with percentages.

Data Collection

Figure 4. The overall structure of the chat data collection pipeline (left) and examples of two multimodal assistant scenarios (right). The pipeline includes four parts: (1) Generating Intended Activities and Unsafe Textual Situations, (2) Iterative Filtering with an LLM, (3) Constructing the Multimodal Situational Safety Dataset via Image Retrieval, and (4) Human Verification & Query Generation.
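
A rough sketch of how these four stages could be composed in code, assuming placeholder function names and generic LLM/retrieval interfaces rather than the authors' actual implementation:

def generate_activities_and_unsafe_situations(llm, domains):
    """Stage 1: prompt an LLM to propose intended activities together with
    textual situations in which performing each activity would be unsafe."""
    ...

def iterative_llm_filter(llm, candidates):
    """Stage 2: repeatedly ask the LLM to discard candidates whose unsafe
    situation is implausible or not genuinely risky."""
    ...

def retrieve_images(retriever, situations):
    """Stage 3: retrieve images depicting each safe and unsafe textual
    situation, turning text-only situations into query-image pairs."""
    ...

def human_verify_and_write_queries(pairs):
    """Stage 4: human annotators verify the image-situation matches and
    write the final user queries."""
    ...

def build_chat_dataset(llm, retriever, domains):
    candidates = generate_activities_and_unsafe_situations(llm, domains)
    filtered = iterative_llm_filter(llm, candidates)
    pairs = retrieve_images(retriever, filtered)
    return human_verify_and_write_queries(pairs)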

Result and Diagnosis

We assess the performance of eight leading MLLMs on MSSBench.

Table 1. Accuracy of MLLMs under the instruction-following setting. All of the MLLMs struggle to respond with safety awareness in unsafe situations and perform even worse on the Embodied task.

We identify three main reasons for MLLMs' poor performance on the MSS benchmark: lack of explicit safety reasoning, insufficient visual understanding, and weak situational safety judgment. To validate these hypotheses, we design four distinct evaluation settings: (1) explicit safety reasoning about the user query, (2) explicit safety reasoning about the user intent, (3) explicit safety reasoning about the user intent with self-captioning, and (4) explicit safety reasoning about the user intent using ground-truth situation information.
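
The sketch below shows, with assumed wording and identifiers, how the prompt could differ across these diagnostic settings; the paper's exact prompts are not reproduced here:

def build_prompt(setting: str, query: str, intent: str,
                 self_caption: str = "", gt_situation: str = "") -> str:
    if setting == "QC":  # (1) reason explicitly about the safety of the user query
        return f"Given the image, is it safe to answer this query? Query: {query}"
    if setting == "IC":  # (2) reason explicitly about the safety of the user's intent
        return f"The user intends to: {intent}. Is this intent safe in the depicted situation?"
    if setting == "IC_self_cap":  # (3) intent reasoning with a self-generated caption
        return (f"Image caption written by the model itself: {self_caption}\n"
                f"The user intends to: {intent}. Is this intent safe in this situation?")
    if setting == "IC_gt_cap":  # (4) intent reasoning with ground-truth situation information
        return (f"Ground-truth situation: {gt_situation}\n"
                f"The user intends to: {intent}. Is this intent safe in this situation?")
    raise ValueError(f"unknown setting: {setting}")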

Table 2 panels: Chat task safe situations; Chat task unsafe situations; Chat task average; Embodied task safe situations; Embodied task unsafe situations; Embodied task average.
Table 2. Diagnosis of different factors influencing the MLLMs' situational safety performance. Besides the instruction-following (IF) setting, we design four extra settings: (1) query classification (QC), where the MLLM explicitly reasons about the safety of the user query; (2) intent classification (IC), where it explicitly reasons about the safety of the user's intent; (3) IC w/ Self Cap, intent classification given a self-generated caption; and (4) IC w/ GT Cap, intent classification given ground-truth situation information. We report and compare the individual (a) and average (b) performance of open-source and closed-source MLLMs.

Multi-Agent System For Better Safety Reasoning

Figure 5. Workflow of our Multi-Agent framework for enhancing situational safety in user queries, incorporating Intent Reasoning, Safety Judgment, QA and Visual Understanding agents.
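
A minimal sketch of such a coordination loop, assuming a single callable mllm(image, prompt) interface and illustrative prompts rather than the authors' exact agents:

def multi_agent_answer(mllm, image, query):
    # Visual Understanding agent: describe the situation shown in the image.
    situation = mllm(image, "Describe the situation shown in this image.")

    # Intent Reasoning agent: infer the activity the user actually intends to perform.
    intent = mllm(image, f"A user asks: '{query}'. What activity do they intend to perform?")

    # Safety Judgment agent: decide whether that intent is safe in this situation.
    verdict = mllm(
        image,
        f"Situation: {situation}\nIntended activity: {intent}\n"
        "Is carrying out this activity safe here? Answer 'safe' or 'unsafe'.",
    )

    # QA agent: answer helpfully if safe; otherwise remind the user of the risk.
    if "unsafe" in verdict.lower():
        return mllm(image, f"The user asked: '{query}'. Instead of answering directly, "
                           "politely explain why this is risky in the depicted situation.")
    return mllm(image, query)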
Table 3. MLLMs' performance on our benchmark under three reasoning settings. Base: no explicit safety reasoning. 1-step CoT: the MLLM reasons about the safety of the user query and generates its response in a single step. Multi-agent: our designed multi-agent pipeline. The results show that the multi-agent pipeline improves safety performance in most cases.
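
For comparison, the Base and 1-step CoT settings reduce to single calls; a hedged sketch with the same assumed mllm(image, prompt) interface as above:

def base_response(mllm, image, query):
    # Base setting: answer the query directly, with no explicit safety reasoning.
    return mllm(image, query)

def one_step_cot_response(mllm, image, query):
    # 1-step CoT setting: reason about safety and produce the response in one generation.
    return mllm(image, "First decide whether answering the following query is safe in the "
                       f"depicted situation, then respond accordingly: {query}")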

BibTeX

@misc{zhou2024multimodalsituationalsafety,
      title={Multimodal Situational Safety}, 
      author={Kaiwen Zhou and Chengzhi Liu and Xuandong Zhao and Anderson Compalas and Dawn Song and Xin Eric Wang},
      year={2024},
      eprint={2410.06172},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2410.06172}, 
}