Autonomous Surgical Robotics (2025)

Paper 1: SRT-H — A Hierarchical Framework for Autonomous Surgery via Language-Conditioned Imitation Learning

Authors: Ji Woong (Brian) Kim, Juo-Tung Chen, Pascal Hansen, Lucy X. Shi, et al.
Affiliations: Johns Hopkins University, Stanford University, Optosurgical
Date: 2025

SRT-H [1] introduces a hierarchical robot learning framework for performing complex, long-horizon surgical steps autonomously. The system splits control into two levels: a high-level policy that plans in language space — generating task-level and corrective instructions — and a low-level policy that translates those instructions into robot trajectories. When the low-level policy makes errors, the high-level planner can issue corrective language instructions to guide recovery, a mechanism that proves critical in ablation studies.
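The two-level control loop described above can be sketched as follows. This is a minimal conceptual illustration, not the authors' implementation: the class names, the toy failure model, and the instruction strings are all invented for exposition; in SRT-H both policies are learned networks conditioned on camera images.

```python
# Conceptual sketch of hierarchical language-conditioned control:
# a high-level planner emits language instructions (including corrective
# ones after a failure), and a low-level policy executes each instruction.
# All names and the failure model are illustrative placeholders.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HighLevelPolicy:
    """Plans in language space over a fixed sub-task sequence."""
    plan: List[str]
    step: int = 0

    def next_instruction(self, last_failed: bool) -> str:
        if last_failed:
            # Issue a corrective instruction for the current step
            # instead of advancing the plan.
            return "correct: " + self.plan[self.step]
        return self.plan[self.step]

    def advance(self) -> None:
        self.step += 1


class LowLevelPolicy:
    """Translates an instruction into a trajectory (stubbed out here)."""

    def __init__(self) -> None:
        self.attempts: Dict[str, int] = {}

    def execute(self, instruction: str) -> bool:
        key = instruction.removeprefix("correct: ")
        self.attempts[key] = self.attempts.get(key, 0) + 1
        # Toy failure model: "clip duct" fails on its first attempt only.
        return not (key == "clip duct" and self.attempts[key] == 1)


def run_episode(high: HighLevelPolicy, low: LowLevelPolicy) -> List[str]:
    log, last_failed = [], False
    while high.step < len(high.plan):
        instr = high.next_instruction(last_failed)
        ok = low.execute(instr)
        log.append(f"{instr} -> {'ok' if ok else 'error'}")
        last_failed = not ok
        if ok:
            high.advance()
    return log


plan = ["grab gallbladder", "clip duct", "cut duct"]
trace = run_episode(HighLevelPolicy(plan), LowLevelPolicy())
# The trace shows the failed step followed by a corrective retry.
```

The point of the sketch is the recovery mechanism: an error does not terminate the episode or require a human; it triggers a corrective instruction at the semantic level, which the low-level policy then re-executes.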

The framework was validated on cholecystectomy (gallbladder removal), a real minimally invasive procedure, using eight unseen ex vivo gallbladders. SRT-H achieved a 100% success rate with no human intervention, substantially outperforming end-to-end and ablated variants. Key design choices — wrist cameras, corrective instructions, DAgger-based fine-tuning — each contributed measurably to performance. The end-to-end baseline scored only 33.3% by comparison.


Paper 2: VPPV — Surgical Embodied Intelligence for Generalized Task Autonomy in Laparoscopic Robot-Assisted Surgery

Authors: Yonghao Long, Anran Lin, Derek Hang Chun Kwok, Lin Zhang, et al.
Affiliations: CUHK, Johns Hopkins University, and collaborators
Published in: Science Robotics, 2025

VPPV (Visual Parsing → Perceptual Regressor → Policy Learning → Visual Servoing) [2] presents a vision-based learning paradigm for surgical robot autonomy. A central contribution is an open-source surgical embodied intelligence simulator (built on SurRoL) that enables reinforcement learning with zero-shot sim-to-real transfer — policies trained entirely in simulation can be deployed on real robots without additional real-world training.
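The four-stage decomposition can be illustrated as a chain of narrow interfaces, where each stage consumes only the previous stage's output. The sketch below is a hypothetical rendering of that structure, not the authors' code: the stubbed stage bodies (a threshold mask, a centroid regressor, a proportional policy) stand in for learned components.

```python
# Illustrative parse -> regress -> policy -> servo chain.
# Stage names mirror the VPPV decomposition; the bodies are toy stubs.
import numpy as np


def visual_parsing(image: np.ndarray) -> np.ndarray:
    """Segment task-relevant structures (stub: brightness threshold)."""
    return (image > 0.5).astype(np.float32)


def perceptual_regressor(mask: np.ndarray) -> np.ndarray:
    """Regress a low-dimensional state (stub: mask centroid in x, y)."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.mean(), ys.mean()]) if xs.size else np.zeros(2)


def policy(state: np.ndarray, goal: np.ndarray) -> np.ndarray:
    """Simulation-trained policy proposing a target in state space
    (stub: proportional step toward the goal)."""
    return state + 0.5 * (goal - state)


def visual_servoing(current: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Convert the target into an incremental motion command."""
    return target - current


def control_step(image: np.ndarray, goal: np.ndarray) -> np.ndarray:
    mask = visual_parsing(image)
    state = perceptual_regressor(mask)
    target = policy(state, goal)
    return visual_servoing(state, target)


img = np.zeros((8, 8))
img[2:4, 2:4] = 1.0                       # bright blob centered at (2.5, 2.5)
cmd = control_step(img, goal=np.array([6.0, 6.0]))
```

A design consequence worth noting: because the policy operates on the regressed low-dimensional state rather than raw pixels, it can be trained entirely in simulation, with the perception stages absorbing the sim-to-real appearance gap — the structural property behind the paper's zero-shot transfer claim.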

The framework was tested across seven game-based skill tasks on the da Vinci Research Kit, five surgical assistive tasks on ex vivo animal tissue using the Sentire surgical system, and validated in live-animal in vivo trials for three tasks. The average sim-to-real performance drop was only 8%. An additional haptic-assisted training application demonstrated that AI-guided training significantly reduced task completion times for novice users. The open-source infrastructure is positioned as a scalable platform for the field.


Brief Review

Both papers represent significant advances in autonomous surgical robotics, but they approach the problem from different angles that are ultimately complementary.

SRT-H’s strength lies in its hierarchical language-conditioned architecture. The idea of using natural language as an intermediate representation for surgical planning is well-motivated: it provides interpretability, enables error correction at a semantic level, and decouples high-level reasoning from low-level motor control. The 100% success rate on unseen tissue is compelling, and the ablation studies are thorough — each component clearly justifies its inclusion. The main limitation is scope. Validation on a single procedure (cholecystectomy) and a modest number of specimens leaves open the question of whether the framework generalizes to other surgical domains. The reliance on imitation learning also means the system is bounded by the quality and diversity of the demonstration data.

VPPV takes a broader bet on generalization and infrastructure. The zero-shot sim-to-real transfer result is the headline contribution, addressing one of the most persistent barriers in the field. By open-sourcing the simulator and demonstrating across a range of tasks, instruments, and environments — including live-animal trials — the work provides a foundation that others can build on. The four-stage pipeline (visual parsing, perceptual regressor, policy learning, visual servoing) offers a clean separation of concerns. However, the tasks demonstrated are still relatively short-horizon compared to full surgical procedures, and the policy learning is task-specific rather than exhibiting cross-task generalization within a single model.

Looking at these papers together, the field appears to be at an inflection point. Both move beyond simple tabletop demonstrations into clinically relevant tissue manipulation, which has been a missing step for years. They also highlight a tension that the field will need to resolve: hierarchical, procedure-specific systems (SRT-H) versus general-purpose, platform-oriented approaches (VPPV). Clinical adoption will likely require elements of both — general foundations for perception and motor skills, layered with procedure-specific planning and safety constraints. For instance, language-based error correction (as in SRT-H) could naturally augment simulation-trained low-level policies (as in VPPV).

The gap to clinical deployment remains substantial. Neither paper addresses the regulatory pathway, and both operate in research settings even when using real tissue. Patient variability, rare failure modes, and guaranteed safety margins are challenges that current experimental scales cannot fully answer. Nonetheless, these works collectively shift the conversation from "can learning-based methods work in surgery?" to "how do we scale and validate them for clinical use?" — which is meaningful progress.

  1. Ji Woong (Brian) Kim et al., SRT-H: A hierarchical framework for autonomous surgery via language-conditioned imitation learning. Sci. Robot. 10, eadt5254 (2025). DOI: 10.1126/scirobotics.adt5254

  2. Yonghao Long et al., Surgical embodied intelligence for generalized task autonomy in laparoscopic robot-assisted surgery. Sci. Robot. 10, eadt3093 (2025). DOI: 10.1126/scirobotics.adt3093