Data-driven spatio-temporal RGBD feature encoding for action recognition in operating rooms

Bibliographic Details
Published in: International Journal of Computer Assisted Radiology and Surgery, Vol. 10, No. 6, pp. 737-747
Main Authors: Twinanda, Andru P., Alkan, Emre O., Gangi, Afshin, de Mathelin, Michel, Padoy, Nicolas
Format: Journal Article
Language: English
Published: Berlin/Heidelberg: Springer Berlin Heidelberg, 01-06-2015
Description

Purpose: Context-aware systems for the operating room (OR) offer the potential to significantly improve the surgical workflow through applications such as efficient OR scheduling, context-sensitive user interfaces, and automatic transcription of medical procedures. As an essential element of such systems, surgical action recognition is an important research area. In this paper, we tackle the problem of classifying surgical actions from video clips that capture the activities taking place in the OR.

Methods: We acquire recordings using a multi-view RGBD camera system mounted on the ceiling of a hybrid OR dedicated to X-ray-based procedures and annotate clips of the recordings with the corresponding actions. To recognize the surgical actions from the video clips, we use a classification pipeline based on the bag-of-words (BoW) approach and propose a novel feature encoding method that extends classical BoW. Instead of dividing the space of feature locations with the typical rigid grid, we learn the layout from the actual 4D spatio-temporal locations of the visual features. The result is a data-driven, non-rigid layout that retains more spatio-temporal information than its rigid counterpart.

Results: We classify multi-view video clips from a new dataset generated from 11 days of recordings of real operations. The dataset comprises 1734 video clips of 15 actions, including generic actions (e.g., moving the patient to the OR bed) and actions specific to the vertebroplasty procedure (e.g., hammering). The experiments show that the proposed non-rigid feature encoding outperforms the rigid encoding, increasing the classifier's accuracy by over 4 percentage points, from 81.08 to 85.53 %.

Conclusion: Combining the intensity and depth information from the RGBD data yields more discriminative power for surgical action recognition than using either modality alone. Furthermore, the proposed non-rigid spatio-temporal feature encoding produces more discriminative histogram representations than its rigid counterpart. To the best of our knowledge, this is also the first work to present action recognition results on multi-view RGBD data recorded in the OR.
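To make the encoding concrete, below is a minimal sketch of the non-rigid spatio-temporal encoding as described in the summary. The summary only states that the layout is learned from the 4D feature locations, not how; the use of k-means (via scikit-learn) for both the visual vocabulary and the layout, along with all function names, is our own illustrative assumption, not the authors' released code.

    # Hypothetical sketch of a data-driven, non-rigid BoW encoding.
    # Assumption: k-means learns both the visual vocabulary (on descriptors)
    # and the spatio-temporal layout (on 4D feature locations).
    import numpy as np
    from sklearn.cluster import KMeans

    def learn_models(descriptors, locations, vocab_size=256, n_cells=8, seed=0):
        """Learn a visual vocabulary and a non-rigid spatio-temporal layout.

        descriptors: (N, D) local feature descriptors pooled over training clips
        locations:   (N, 4) corresponding (x, y, z, t) feature locations
        """
        vocab = KMeans(n_clusters=vocab_size, random_state=seed).fit(descriptors)
        layout = KMeans(n_clusters=n_cells, random_state=seed).fit(locations)
        return vocab, layout

    def encode_clip(descriptors, locations, vocab, layout):
        """Encode one clip as concatenated per-cell BoW histograms.

        Each feature votes in the histogram of the learned (non-rigid) cell
        that its 4D location falls into, rather than a fixed grid cell.
        """
        words = vocab.predict(descriptors)    # visual word index per feature
        cells = layout.predict(locations)     # learned cell index per feature
        hist = np.zeros((layout.n_clusters, vocab.n_clusters))
        for w, c in zip(words, cells):
            hist[c, w] += 1
        hist = hist.flatten()
        return hist / max(hist.sum(), 1)      # L1-normalized representation

A classifier such as a linear SVM, as is common in BoW pipelines, could then be trained on these histogram representations; the summary does not specify which classifier the authors use.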
ISSN: 1861-6410 (print); 1861-6429 (electronic)
DOI: 10.1007/s11548-015-1186-1