Vision-Language Models (VLMs) are known to struggle with spatial reasoning and visual alignment. To help overcome these limitations, we introduce iVISPAR, an interactive multi-modal benchmark designed to evaluate the spatial reasoning capabilities of VLMs acting as agents. iVISPAR is based on a variant of the sliding tile puzzle, a classic problem that demands logical planning, spatial awareness, and multi-step reasoning. The benchmark supports visual 2D, visual 3D, and text-based input modalities, enabling comprehensive assessments of VLMs' planning and reasoning skills. We evaluate a broad suite of state-of-the-art open-source and closed-source VLMs, comparing their performance while also providing optimal-path solutions and a human baseline to assess the task's complexity and feasibility for humans. Results indicate that while some VLMs perform well on simple spatial tasks, they encounter difficulties with more complex configurations and problem properties. Notably, while VLMs generally perform better on 2D vision than on 3D or text-based representations, they consistently fall short of human performance, illustrating the persistent challenge of visual alignment. These findings underscore critical gaps in current VLM capabilities and their limitations in achieving human-level cognition.
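To illustrate the task family the benchmark builds on, here is a minimal Python sketch of the classic sliding tile puzzle: the agent must plan a multi-step sequence of slide moves to transform a scrambled board into the goal configuration. The board encoding, move set, and function names below are illustrative assumptions and do not reflect iVISPAR's actual implementation or API.

from typing import List, Tuple

# Row-major 3x3 board; 0 marks the empty slot.
Board = Tuple[int, ...]

GOAL: Board = (1, 2, 3, 4, 5, 6, 7, 8, 0)
MOVES = {"up": -3, "down": 3, "left": -1, "right": 1}

def legal_moves(board: Board) -> List[str]:
    """Return the directions in which the empty slot can move."""
    i = board.index(0)
    row, col = divmod(i, 3)
    moves = []
    if row > 0: moves.append("up")
    if row < 2: moves.append("down")
    if col > 0: moves.append("left")
    if col < 2: moves.append("right")
    return moves

def apply_move(board: Board, move: str) -> Board:
    """Swap the empty slot with the adjacent tile in the given direction."""
    i = board.index(0)
    j = i + MOVES[move]
    tiles = list(board)
    tiles[i], tiles[j] = tiles[j], tiles[i]
    return tuple(tiles)

# One step away from the goal: a single "right" move solves the board.
state: Board = (1, 2, 3, 4, 5, 6, 7, 0, 8)
state = apply_move(state, "right")
assert state == GOAL

In the benchmark, such a puzzle state is presented to the VLM as a 2D rendering, a 3D scene, or a textual description, and the model, acting as an agent, has to choose the sequence of moves itself.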
Publications
Semi-autonomous vehicles (AVs) enable drivers to engage in non-driving tasks but require them to be ready to take control during critical situations. This “out-of-the-loop” problem demands a quick transition to active information processing, raising safety concerns and anxiety. Multimodal signals in AVs aim to deliver take-over requests and facilitate driver–vehicle cooperation. However, the effectiveness of auditory, visual, or combined signals in improving situational awareness and reaction time for safe maneuvering remains unclear. This study investigates how signal modalities affect drivers’ behavior using virtual reality (VR). We measured drivers’ reaction times from signal onset to take-over response and gaze dwell time on driving hazards as a measure of situational awareness across twelve critical events. Furthermore, we assessed self-reported anxiety and trust levels using the Autonomous Vehicle Acceptance Model questionnaire. The results showed that visual signals significantly reduced reaction times, whereas auditory signals did not. Additionally, receiving any warning signal, combined with having seen the driving hazard, increased the rate of successful maneuvers. The analysis of gaze dwell time on driving hazards revealed that auditory and visual signals improved situational awareness. Lastly, warning signals reduced anxiety and increased trust. These results highlight the distinct effectiveness of signal modalities in improving driver reaction times, situational awareness, and perceived safety, mitigating the “out-of-the-loop” problem and fostering human–vehicle cooperation.
Natural eye movements have primarily been studied for over-learned activities such as tea-making, sandwich-making, and hand-washing, which have a fixed sequence of associated actions. These studies demonstrate a sequential activation of low-level cognitive schemas facilitating task completion. However, it is unclear whether these action schemas are activated in the same pattern when a task is novel and a sequence of actions must be planned in the moment. Here, we recorded gaze and body movements in a naturalistic task to study action-oriented gaze behavior. In a virtual environment, subjects moved objects on a life-size shelf to arrange them in a given order. To compel cognitive planning, we added complexity to the sorting tasks. Fixations aligned to action onset showed that gaze was tightly coupled with the action sequence, and task complexity moderately affected the proportion of fixations on task-relevant regions. Our analysis revealed that gaze fixations were allocated to action-relevant targets just in time. Planning behavior predominantly corresponded to a greater visual search for task-relevant objects before action onset. The results support the idea that natural behavior relies on the frugal use of working memory, and humans refrain from encoding objects in the environment to plan long-term actions. Instead, they prefer just-in-time planning: searching for the action-relevant item at the moment it is needed, directing body and hand toward it, monitoring the action until it is completed, and moving on to the next action.
Visual attention is mainly goal-directed and allocated based on the upcoming action. However, it is unclear how far this feature of gaze behaviour generalizes to more naturalistic settings. The present study investigates the influence of action affordances on active inference processes revealed by eye movements during interaction with familiar and novel tools. In a between-subject design, one cohort of participants interacted with a virtual reality controller in a low-realism environment; another performed the task with an interaction setup that allowed differentiated hand and finger movements in a high-realism environment. We investigated the differences in the odds of fixations and their eccentricity towards the tool parts before action initiation. The results show that participants fixate more on the tool's effector part before action initiation when asked to produce tool-specific movements, especially with unfamiliar tools. These findings suggest that fixations are made in a task-oriented way to plan the distal goals of producing the task- and tool-specific actions well before action initiation. Moreover, with more realistic action affordances, fixations were biased towards the tool handle when it was oriented incongruently with the subjects' handedness. We hypothesize that these fixations are made towards the proximal goal of planning the grasp, even though the perceived action on the tools is identical for both experimental setups. Taken together, proximal and distal goal-oriented planning is contextualized to the realism of action/interaction afforded by an environment.
Multi-participant experiments in virtual reality (VR) could provide a new way to investigate real-time interactions in a controlled and ecologically valid environment. However, to create the impression of a shared world between participants, all non-static elements of the environment need to be networked. Currently available networking solutions contain complex automated functionalities and were developed for the demands of multiplayer video games. The level of expertise required to tweak such automated functionality is difficult to acquire during the short time frame of an experimental project's implementation phase. As the focus of multi-participant experiments lies on the control of experimental variables and data quality, we propose a new lightweight networking solution called LightNet, specifically designed for the needs of implementing multi-participant experiments. LightNet is an open-source networking library. It provides a transparent software architecture, multiple customization options, and precise networking of variables to ensure control over experimental variables and data quality. In this article, we present a "how-to" section in which we explain the necessary steps to include networking functionalities using LightNet. We conclude this section with a list of additional recommendations for any LightNet implementation. Furthermore, we describe the networking logic and properties of a more complex example experiment incorporating shared gaze. Overall, we believe that the provided networking solution facilitates the implementation of multi-participant experiments and offers comprehensible access to investigating social interactions in VR, even without expert-level expertise in the field of networking.
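LightNet itself targets the Unity engine, so the snippet below is only a language-agnostic illustration, written in Python, of what networking a variable involves at its core: serializing each update of a tracked value together with a timestamp and sending it to the other participant's client. The packet layout, address, and function name are hypothetical and are not LightNet's API.

import json
import socket
import time

# Address of the other participant's client (assumed for this sketch).
PEER = ("127.0.0.1", 9000)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_update(participant_id: str, variable: str, value) -> None:
    """Serialize one state update together with a send timestamp."""
    packet = {
        "participant": participant_id,
        "variable": variable,
        "value": value,
        "t_sent": time.time(),  # lets the receiver quantify and log latency
    }
    sock.sendto(json.dumps(packet).encode("utf-8"), PEER)

# Example: stream a participant's head position once per frame.
send_update("P01", "head_position", [0.12, 1.65, -0.40])

Keeping each update explicit and timestamped in this way is what allows experimenters to log and verify exactly what every participant saw and when, which is the kind of control over experimental variables and data quality the abstract above refers to.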
With the further development of highly automated vehicles, drivers will engage in non-driving-related tasks while being driven. Still, drivers have to take over control when requested by the car. This raises the question of how potentially distracted drivers can get back into the control loop quickly and safely when the car requests a takeover. To investigate effective human–machine interactions, a mobile, versatile, and cost-efficient setup is needed. Here, we describe a virtual reality toolkit for the Unity 3D game engine containing all the necessary code and assets to enable fast adaptation to various human–machine interaction experiments, including close monitoring of the subject. The presented project contains all the functionalities needed for realistic traffic behavior, cars, pedestrians, and a large, open-source, scriptable, and modular VR environment. It covers roughly 25 km² and includes a package of 125 animated pedestrians as well as numerous vehicles, including motorbikes, trucks, and cars. It also contains all the nature assets needed to make it both highly dynamic and realistic. The presented repository contains a C++ library made for LoopAR that enables force feedback for gaming steering wheels as a fully supported component. It also includes all necessary scripts for eye tracking with the devices used. All the main functions are integrated into the graphical user interface of the Unity® editor or are available as prefab variants to ease the use of the embedded functionalities. This project’s primary purpose is to serve as an open-access, cost-efficient toolkit that enables interested researchers to conduct realistic virtual reality research studies without costly and immobile simulators. To ensure the accessibility and usability of the toolkit, we conducted a user experience evaluation, which is also reported in this paper.