
ABG Procedure Vision Dataset

A multi-angle annotated vision dataset for arterial blood gas procedure recognition — comprising over 100,000 frames across five camera angles plus a first-person component, built for real-time AI inference in clinical simulation and vision-guided robotics.

This project documents the creation of what is believed to be one of the first purpose-built computer vision datasets for the Arterial Blood Gas (ABG) sampling procedure, annotated at the object level and designed for real-time detection. The dataset was captured primarily across five fixed camera angles, producing over 80,000 annotated frames that give the model varied spatial perspectives on the ABG procedure. An additional first-person component of approximately 20,000 frames was captured using a Meta Quest headset, replicating the direct viewpoint of a clinician performing the procedure. This multi-angle approach maximises model robustness for real-world deployment, while the first-person component supports integration directly within XR simulation environments. The dataset is also intended for future use in vision-guided robot reaching applications, where procedural understanding from multiple viewpoints is a key requirement.

The work was carried out collaboratively with two teaching fellows at Imperial College London over three months (October–December 2025), combining clinical procedure expertise with machine learning practice. A YOLOv12s model was trained on the resulting dataset using Roboflow for annotation management, with training runs executed locally on a laptop GPU. The project has since informed a pivot toward robotic procedure guidance, with the dataset now serving as a foundation for ongoing research.

Project at a Glance

Status Dataset complete — model inference ongoing; robotic application in development
Period October – December 2025
Annotation Classes alcohol_wipe · needle_syringe · procedural_hand · puncture_site · stabilizing_hand
Capture Method Five fixed camera angles + first-person video (Meta Quest headset) · 1024 × 1024 px resolution
Model Trained YOLOv12s · exported to ONNX for deployment · trained on local GPU
Toolchain Roboflow (annotation & dataset management) · YOLO · Unity Sentis · ONNX
Original Target Real-time inference in Meta Quest XR simulation via Unity Sentis / Meta XR Building Blocks
Current Direction Robotic ABG guidance — dataset repurposed as training foundation for robotic procedure recognition

Dataset Design & Annotation

ABG sampling is a precise, multi-step clinical procedure involving palpation of the radial artery, site preparation, needle insertion, sample aspiration, and safe sharps disposal. Each step involves distinct objects and hand configurations that must be reliably distinguished by a computer vision model. The five annotation classes were designed to reflect the procedure’s key safety-critical elements: the instruments in use (needle_syringe, alcohol_wipe), the anatomical target (puncture_site), and the two functional hand roles (procedural_hand, stabilizing_hand).
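In YOLO-format exports, this class schema lives in a small dataset config file. A minimal sketch of what the exported `data.yaml` might look like is shown below; the paths and the numeric class ordering are illustrative assumptions, not the project's actual export:

```yaml
# Illustrative YOLO dataset config for the five ABG classes.
# Paths and class order are assumptions, not the project's actual file.
path: abg-dataset
train: images/train
val: images/val

names:
  0: alcohol_wipe
  1: needle_syringe
  2: procedural_hand
  3: puncture_site
  4: stabilizing_hand
```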

The first-person component was recorded through a Meta Quest headset worn in a clinical simulation environment, producing footage with the characteristic wide-angle field of view, natural head-movement blur, and visual clutter (blue tray, red sharps bin, sterile packaging, absorbent drape) typical of a real simulation station. This choice was intentional: a model trained on this perspective generalises directly to inference running on the headset itself, without the domain shift that would arise from a conventional camera rig.

Annotation was managed in Roboflow, with bounding box labels applied frame by frame. Roboflow’s dataset versioning, augmentation pipeline, and YOLO export format were used throughout. The team annotated procedural footage across the full sequence of ABG steps to ensure class balance and procedural coverage, with particular attention to challenging cases: heavy occlusion of the puncture site by hands, partial syringe visibility, and motion blur during needle handling.
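For context, YOLO-format annotations store one normalised box per line as `class_id x_center y_center width height`. A small stdlib sketch of converting such a line back to pixel coordinates at the dataset's 1024×1024 resolution (the label values here are made up for illustration):

```python
def yolo_to_pixels(line: str, img_w: int = 1024, img_h: int = 1024):
    """Convert one YOLO label line (normalised) to (class_id, x1, y1, x2, y2) in pixels."""
    class_id, cx, cy, w, h = line.split()
    cx, cy = float(cx) * img_w, float(cy) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    return int(class_id), cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

# Example: a hypothetical box centred in the frame
cls, x1, y1, x2, y2 = yolo_to_pixels("3 0.5 0.5 0.1 0.2")
print(cls, x1, y1, x2, y2)  # roughly (3, 460.8, 409.6, 563.2, 614.4)
```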

Synthetic Data Augmentation Strategy

A key research contribution of this project is a structured synthetic image generation strategy developed to augment the real dataset. Using generative AI (DALL-E / OpenAI Images API), a prompt framework was designed to produce photorealistic first-person ABG images that visually match the Quest-captured footage — including lens distortion, motion blur, sensor noise, and the specific objects present on a simulation station.

Rather than generating generic clinical images, the prompt strategy was engineered label-first: each generation prompt is written to maximise the visibility and variety of a specific annotation class, while maintaining plausible co-occurrence of other objects. Separate prompt variants target hard cases such as heavy occlusion, partial hand crops, motion blur from head turns, and varying lighting (daylight, tungsten). This approach allows systematic construction of a training distribution that addresses known model weaknesses — such as confusion between procedural_hand and needle_syringe at close range.
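The label-first idea can be sketched as a small template system: one base scene description, per-class emphasis fragments, and hard-case modifiers composed into final prompts. Everything below (fragment wording, modifier names) is an illustrative assumption, not the project's actual prompt framework:

```python
BASE_SCENE = (
    "Photorealistic first-person view through a VR headset of an arterial "
    "blood gas simulation station: blue tray, red sharps bin, sterile packaging"
)

# Hypothetical per-class emphasis fragments (assumption, not the real prompts)
CLASS_FOCUS = {
    "needle_syringe": "a syringe with safety needle held prominently in frame",
    "puncture_site": "a clearly visible marked radial puncture site on the wrist",
    "alcohol_wipe": "an opened alcohol wipe being used on the skin",
}

# Hard-case modifiers targeting known model weaknesses
HARD_MODES = ["heavy occlusion by hands", "motion blur from a head turn",
              "partial crop at the frame edge", "dim tungsten lighting"]

def build_prompts(target_class: str, hard: bool = False) -> list[str]:
    """Compose prompts that maximise visibility of one annotation class."""
    focus = CLASS_FOCUS[target_class]
    if not hard:
        return [f"{BASE_SCENE}, {focus}"]
    return [f"{BASE_SCENE}, {focus}, {mode}" for mode in HARD_MODES]

prompts = build_prompts("needle_syringe", hard=True)
print(len(prompts))  # one prompt per hard-case modifier
```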

The recommended synthetic distribution for a 500-image augmentation batch is approximately 35% needle_syringe-heavy, 20% puncture_site-clear, 15% each for alcohol_wipe, stabilizing_hand, and procedural_hand — with roughly 25–30% of the full set comprising “hard mode” variants (occlusion, crop, blur, off-centre). Synthetic images are reviewed before annotation and ingested into Roboflow alongside real frames using the same labelling schema.
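As a sanity check, the recommended split resolves to whole-image counts for a 500-image batch (a quick stdlib sketch; note the hard-mode quota overlaps the class buckets rather than adding to them):

```python
BATCH = 500

# Class-focused shares from the recommended distribution
SHARES = {
    "needle_syringe_heavy": 0.35,
    "puncture_site_clear": 0.20,
    "alcohol_wipe": 0.15,
    "stabilizing_hand": 0.15,
    "procedural_hand": 0.15,
}

counts = {name: round(BATCH * share) for name, share in SHARES.items()}
# Hard-mode variants are drawn from across the class buckets, not added on top
hard_mode_range = (round(BATCH * 0.25), round(BATCH * 0.30))

print(counts)           # 175 / 100 / 75 / 75 / 75
print(hard_mode_range)  # (125, 150) images flagged as hard-mode variants
assert sum(counts.values()) == BATCH
```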

Model Training & Deployment Pathway

YOLOv12s was selected for its balance of inference speed and detection accuracy at the scale of objects present in ABG footage — particularly the small and often occluded puncture_site marker and the visually similar hand classes. Training was run locally on a laptop GPU at 1024×1024 input resolution, with the trained model exported to ONNX for cross-platform deployment.

The initial deployment target was the Meta XR Building Blocks Object Detection system in Unity, using the Unity Sentis runtime to run the ONNX model on-device via the Quest’s neural processing capability. The ONNX export includes integrated NMS post-processing (outputs: boxes / class_ids / scores), reducing the custom pipeline required in Unity. This pathway was explored in detail and documented as part of the research — including ONNX import into Unity Sentis, Provider Asset configuration, and input resolution alignment between training (1024) and inference (640).
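To illustrate the 1024-versus-640 resolution alignment on the consumption side, here is a hedged stdlib sketch of mapping box coordinates from a letterboxed 640×640 inference input back to source-frame pixels. It assumes the common YOLO scale-to-fit-and-pad preprocessing convention; the project's actual Unity pipeline may differ:

```python
def unletterbox(box, src_w: int, src_h: int, infer: int = 640):
    """Map (x1, y1, x2, y2) from a letterboxed infer×infer input back to source pixels.

    Assumes the frame was scaled to fit and centred with symmetric padding
    (a common YOLO preprocessing convention; an assumption here).
    """
    scale = min(infer / src_w, infer / src_h)
    pad_x = (infer - src_w * scale) / 2
    pad_y = (infer - src_h * scale) / 2
    x1, y1, x2, y2 = box
    return ((x1 - pad_x) / scale, (y1 - pad_y) / scale,
            (x2 - pad_x) / scale, (y2 - pad_y) / scale)

# Square 1024×1024 Quest frames: scale = 0.625, no padding
print(unletterbox((100, 100, 300, 300), 1024, 1024))  # (160.0, 160.0, 480.0, 480.0)
```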

Following a strategic pivot, the model and dataset are now being adapted for a robotic guidance application, where real-time procedure recognition will inform robotic arm positioning and step-sequencing during simulated ABG. The dataset’s first-person visual characteristics remain highly relevant in this context, as robotic camera placement can be configured to approximate the same viewpoint.

“Creating a labelled vision dataset for a specific clinical procedure, from scratch, in first-person perspective — this kind of domain-specific annotation work is what bridges the gap between general-purpose AI models and tools that are actually deployable in clinical simulation.”

Team & Collaborators

Dr Risheka Walls

Project Lead

Payal Guha

Teaching Fellow

Oscar L. Oglina

Teaching Fellow

Adrian Cowell

Innovation Lead — Technology & Development

Upcoming Research Directions

The robotic guidance application represents the most immediate next phase. Beyond that, several research threads are open: expanding the dataset to cover additional clinical procedures (e.g. venepuncture, cannulation) using the same first-person capture methodology; exploring semi-automatic annotation using the trained model to pre-label new footage; and investigating the use of synthetic data pipelines to build datasets for procedures where real footage is difficult to obtain due to clinical access constraints.
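The semi-automatic annotation idea reduces to a small step: run the trained detector over new frames and write high-confidence detections out as YOLO-format label lines for human correction. A hedged stdlib sketch (the detection tuples here are made up; in practice they would come from the model's boxes / class_ids / scores outputs):

```python
def detections_to_labels(dets, img_w: int, img_h: int, conf_threshold: float = 0.5):
    """Turn (class_id, score, x1, y1, x2, y2) detections into YOLO label lines."""
    lines = []
    for class_id, score, x1, y1, x2, y2 in dets:
        if score < conf_threshold:
            continue  # low-confidence boxes are left for a human to add
        cx = (x1 + x2) / 2 / img_w
        cy = (y1 + y2) / 2 / img_h
        w = (x2 - x1) / img_w
        h = (y2 - y1) / img_h
        lines.append(f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return lines

# Hypothetical detections on a 1024×1024 frame
dets = [(1, 0.91, 100, 200, 300, 400), (3, 0.32, 500, 500, 600, 600)]
print(detections_to_labels(dets, 1024, 1024))  # only the 0.91 box survives the threshold
```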

There is also potential for this dataset and methodology to be shared more broadly — as a benchmark for medical procedure recognition research, or as a teaching resource for students learning computer vision annotation practices in a clinical context.