Paper Audit

Grounding and Interaction Lab — Peer Audit

Team Members: Heyang Huang, Yuni Wu, Yi-Shiuan Tung


Overview

A prevailing assumption in Vision-Language-Action (VLA) research is that scaling data leads to better generalization. However, in robotics this creates a chicken-and-egg problem: robots require large amounts of embodied data to be deployed safely, yet without safe deployment they cannot collect such data.

Across our projects, we explore how robots can instead leverage human feedback, structured decision interfaces, and selective reinforcement learning (RL) to enable safe deployment and continual adaptation, without relying on large-scale offline robot datasets.


Heyang: Kinematic-Aware Grasp Selection via VLM Reasoning

Problem

Most grasping pipelines (e.g., GraspNet, AnyGrasp) assume the highest-scoring grasp is executable. In real-world settings, this assumption often breaks: some top-score grasps are kinematically unreachable (e.g., due to obstacles or joint limits), others fail under depth noise (e.g., RealSense artifacts), and human intent may favor grasps that are semantically valid but physically infeasible. Once grasp candidates are collapsed into a single “best grasp,” these distinct failure modes become indistinguishable.

Key Insight

The failure occurs before control, at the grasp selection step, where feasibility information is discarded. We propose a feasibility-aware grasp selection interface: generate a grasp pool from GraspNet (or AnyGrasp), then use a fine-tuned VLM to reason over why certain grasps are infeasible and select alternatives that preserve task intent.

The VLM operates only at this decision interface, not as a controller.
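To make the decision interface concrete, here is a minimal sketch of the selection loop. The helpers `generate_grasps` (a GraspNet/AnyGrasp wrapper), `check_feasibility` (explicit kinematic and collision checks), and `vlm_select` (the fine-tuned VLM) are hypothetical placeholders, not a released API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GraspCandidate:
    pose: List[float]            # 6-DoF grasp pose (position + orientation)
    score: float                 # detector confidence from GraspNet/AnyGrasp
    reachable: bool = True
    collision_free: bool = True
    reason: str = ""             # human-readable infeasibility reason shown to the VLM

def select_grasp(rgb, depth, instruction, robot,
                 generate_grasps, check_feasibility, vlm_select):
    # 1. Keep the full grasp pool instead of collapsing to a single "best grasp".
    candidates = generate_grasps(rgb, depth)          # hypothetical detector wrapper

    # 2. Attach structured feasibility signals (reachability, collisions) to each candidate.
    for g in candidates:
        g.reachable, g.collision_free, g.reason = check_feasibility(g, robot)

    # 3. The VLM reasons over feasibility signals plus task intent at this
    #    decision interface only; it never produces low-level controls.
    return vlm_select(rgb, instruction, candidates)
```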

Why This Matters

Simulation is not used to mimic reality. Instead, it is used to systematically generate counterexamples at low cost, exposing feasibility boundaries—such as kinematic limits and collision constraints—that are not encoded in internet-scale data.

Optional Extension

RL can be used in simulation to make small, local adjustments to GraspNet outputs. In this setting, rewards are explicit and task-aligned (e.g., executability, smoothness), reducing human intervention without expanding the scope to end-to-end policy learning.
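As an illustration of what explicit, task-aligned rewards could look like for such local adjustments, the sketch below combines executability, smoothness, and a locality penalty that keeps the adjustment close to the detector output; the terms and weights are assumptions rather than the project's final reward design.

```python
import numpy as np

def adjustment_reward(executed, ik_solved, joint_path, delta_pose,
                      w_smooth=0.1, w_reg=0.05):
    """joint_path: (T, n_joints) trajectory; delta_pose: 6-vector offset from the detector grasp."""
    r_exec = 1.0 if (executed and ik_solved) else 0.0                   # executability
    r_smooth = -w_smooth * np.abs(np.diff(joint_path, axis=0)).sum()    # motion smoothness
    r_reg = -w_reg * np.abs(delta_pose).sum()                           # keep the adjustment local
    return r_exec + r_smooth + r_reg
```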

Discussion

Yuni:
How do you define and represent “feasibility” for the VLM during grasp selection? Is feasibility encoded through explicit kinematic checks, collision constraints, or only via visual reasoning from rendered counterexamples?

Heyang:
Feasibility is encoded through explicit kinematic and collision checks. The VLM reasons over these structured feasibility signals together with semantic intent, while simulation is used to expose counterexamples under sensing noise.
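As a sketch, the structured signals above could be produced by checks like the following, which expand the `check_feasibility` placeholder from the earlier sketch; the `ik_solver` and `scene` interfaces are assumed stand-ins for whatever IK and collision-checking backends are actually used.

```python
def check_feasibility(grasp_pose, ik_solver, scene, joint_limits):
    """Return (reachable, collision_free, reason) for one grasp candidate."""
    # Kinematic reachability: is there an IK solution within joint limits?
    q = ik_solver.solve(grasp_pose)                    # hypothetical IK call
    if q is None:
        return False, False, "no IK solution (unreachable)"
    if any(not (lo <= qi <= hi) for qi, (lo, hi) in zip(q, joint_limits)):
        return False, False, "IK solution violates joint limits"

    # Collision constraints: does the arm/gripper at q collide with the scene?
    if scene.in_collision(q):                          # hypothetical collision query
        return True, False, "grasp configuration collides with an obstacle"

    return True, True, "feasible"
```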

Yi-Shiuan:
If you fine-tune a VLM on successful grasps for a scene, how well would it generalize to other scenes? Would this require a large dataset?

Heyang:
We do not fine-tune the VLM on scene-specific successful grasps. Instead, the VLM reasons over robot-specific feasibility signals at the grasp selection interface. While some data is needed to learn common infeasibility patterns, the dataset can be small and efficiently collected in simulation using structured counterexamples.


Yuni: Click-to-Action VLA for Real Robots

Problem

A major challenge in VLA systems is the chicken-and-egg problem between data and deployment: robots need large amounts of embodied data to learn reliable policies, yet they cannot be safely deployed to collect that data without already having such policies. This limits the practicality of large-scale VLA policies and slows real-world learning. For example, a robot may understand a high-level instruction such as “push the object” but still fail due to ambiguous goals, unsafe motions, or insufficient task-specific experience.

Key Insight

Instead of relying on large offline robot datasets, this project leverages human-in-the-loop feedback, specifically click-based demonstrations with optional natural language, to provide precise, low-cost supervision at deployment time. A click-grounded geometric bottleneck is inserted between vision-language perception (e.g., LLaVA) and action execution: the click grounds human intent into an explicit 3D target, and the robot learns lightweight action policies with minimal embodied data.
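As a sketch of the geometric bottleneck, a 2D click and the aligned depth image can be back-projected into an explicit 3D target with the standard pinhole model; the function below is illustrative and assumes known camera intrinsics.

```python
import numpy as np

def click_to_3d(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) using the depth image (meters) into the camera frame."""
    z = float(depth[v, u])
    if not np.isfinite(z) or z <= 0.0:
        return None                       # invalid depth at the click; caller handles the abort
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])            # explicit 3D target in the camera frame
```

The resulting target (transformed into the robot base frame) is what the lightweight action policy consumes, keeping perception and control decoupled.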

Why This Matters

This approach shifts the burden of generalization to pretrained vision-language priors while keeping real-robot learning data-efficient, interpretable, and safe. Robots can be deployed earlier and improve through interaction, without requiring large-scale teleoperation datasets.

Optional Extensions

Future extensions include incorporating natural language feedback, online reinforcement learning for continual improvement, or adapting the click-grounded interface to new tasks, objects, or robot platforms without retraining the full system.

Discussion

Heyang:
How sensitive is the system to click precision? If the operator clicks slightly off the object boundary, does the downstream policy fail gracefully?

Yuni:
A click serves as an intent cue rather than a precise control signal. Moderate imprecision is tolerated and typically maps to a nearby valid 3D target. If a click is clearly invalid (e.g., background or unreachable region), the execution layer detects this via reachability or depth checks and aborts safely rather than producing unsafe behavior.
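A minimal sketch of this tolerance and abort behavior, reusing the `click_to_3d` helper from the earlier sketch; the search window size and the `is_reachable` callback are illustrative assumptions.

```python
def ground_click(u, v, depth, intrinsics, is_reachable, window=7):
    """Snap an imprecise click to a nearby valid 3D target, or return None to abort safely."""
    h, w = depth.shape
    # Try pixels in order of distance from the click, starting with the click itself,
    # so clicks near an object boundary still land on valid depth.
    offsets = sorted(((du, dv) for du in range(-window, window + 1)
                               for dv in range(-window, window + 1)),
                     key=lambda o: o[0] ** 2 + o[1] ** 2)
    for du, dv in offsets:
        uu, vv = u + du, v + dv
        if 0 <= uu < w and 0 <= vv < h:
            target = click_to_3d(uu, vv, depth, *intrinsics)
            if target is not None and is_reachable(target):
                return target             # nearest valid, reachable 3D target
    return None                           # clearly invalid click: abort instead of acting unsafely
```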


Yi-Shiuan: Learning from Human Feedback

Problem

Robots operating in human environments must adapt to ambiguous task requirements and varying preferences. When uncertainty is high, robots should proactively ask clarifying questions and present alternative outcomes. Additionally, when a task cannot be reliably performed, the robot should communicate uncertainty and learn from user feedback or demonstrations.

Key Insight

Robots should actively reduce uncertainty about human intent rather than passively executing commands. In the base implementation, a VLM maintains a belief over the user’s reward function and generates clarification queries that efficiently reduce uncertainty. An LLM-based teacher simulates a human with a hidden reward function, allowing systematic evaluation of how effectively the robot learns preferences through interaction.
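One way to make this concrete is a discrete belief over hypothesized reward weights that is updated by Bayes' rule from the simulated teacher's answers, with candidate queries scored by expected information gain. The sketch below assumes a given hypothesis set and answer-likelihood model; it is illustrative rather than the project's actual implementation.

```python
import numpy as np

def bayes_update(belief, hypotheses, query, answer, likelihood):
    """belief: (K,) array over K reward hypotheses; likelihood(answer, query, h) -> P(answer | query, h)."""
    post = np.array([belief[k] * likelihood(answer, query, hypotheses[k])
                     for k in range(len(hypotheses))])
    return post / post.sum()

def expected_info_gain(belief, hypotheses, query, answers, likelihood):
    """Expected reduction in belief entropy if `query` is asked."""
    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    gain = 0.0
    for a in answers:
        # Marginal probability of this answer under the current belief.
        p_a = sum(belief[k] * likelihood(a, query, hypotheses[k])
                  for k in range(len(hypotheses)))
        if p_a > 0:
            post = bayes_update(belief, hypotheses, query, a, likelihood)
            gain += p_a * (entropy(belief) - entropy(post))
    return gain
```

At each step the robot would ask the candidate query with the highest expected information gain, or act directly if no query is worth the interruption (see the value-of-information discussion below).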

Why This Matters

Enabling robots to recognize uncertainty and proactively seek clarification improves alignment with human goals while reducing task failures and supervision burden.

Optional Extensions

As a stretch goal, the VLM is fine-tuned using reinforcement learning (via GRPO) to autonomously decide when and how to ask clarification questions, integrating uncertainty awareness directly into the policy.

Discussion

Heyang:
How does the robot decide when uncertainty is high enough to justify asking a clarification question instead of attempting an action?

Yi-Shiuan:
Prior work uses value-of-information criteria to determine when queries are beneficial. Incorporating this into VLM fine-tuning remains an open challenge.
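One common formalization, sketched below as an assumption rather than the project's final criterion, builds on the belief-update helper from the earlier sketch: ask only when the expected improvement in task value from hearing the answer exceeds the cost of interrupting the user.

```python
def value_of_information(belief, hypotheses, query, answers, likelihood,
                         best_action_value):
    """best_action_value(belief) -> expected task value of acting greedily under a belief
    (hypothetical helper). Returns the expected value gained by asking `query` first."""
    v_now = best_action_value(belief)
    v_after = 0.0
    for a in answers:
        p_a = sum(belief[k] * likelihood(a, query, hypotheses[k])
                  for k in range(len(hypotheses)))
        if p_a > 0:
            post = bayes_update(belief, hypotheses, query, a, likelihood)
            v_after += p_a * best_action_value(post)
    return v_after - v_now   # ask only if this exceeds the query/interruption cost
```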

Yuni:
When fine-tuning the VLM with RL, how do you prevent degenerate behaviors such as overly generic or trivial questions?

Yi-Shiuan:
The reward is designed to favor information gain while penalizing redundant or low-impact queries, encouraging clarification only when it meaningfully reduces uncertainty.


Question Answering — Lab Report-Out

1. What was the most common “Load-Bearing Wall” identified?

The dominant assumption across projects was that large-scale embodied robot data is required for safe deployment. All projects identified failures where high-scoring or end-to-end outputs break down due to unmodeled constraints such as kinematics, sensing noise, ambiguity in human intent, or safety considerations.

2. Which “Initial Dissolve” was the most robust?

The most robust dissolve was the shift toward intermediate decision interfaces—such as feasibility-aware grasp selection, click-grounded geometric bottlenecks, and clarification queries—that incorporate human feedback or explicit constraints, enabling interpretability and deployability under uncertainty.

3. What engineering question remains unresolved?

Open questions include how to efficiently collect sufficient human feedback, how to evaluate sim-to-real gaps for grasping pipelines, and how to formalize uncertainty, preference modeling, and intervention costs to support continual learning without overburdening users or degrading foundation model performance.