Segmenting and classifying activities in robot-assisted surgery with recurrent neural networks

  • Original Article
International Journal of Computer Assisted Radiology and Surgery

Abstract

Purpose

Automatically segmenting and classifying surgical activities is an important prerequisite to providing automated, targeted assessment and feedback during surgical training. Prior work has focused almost exclusively on recognizing gestures, or short, atomic units of activity such as pushing needle through tissue, whereas we also focus on recognizing higher-level maneuvers, such as suture throw. Maneuvers exhibit more complexity and variability than the gestures from which they are composed; however, working at this granularity has the benefit of being consistent with existing training curricula.

Methods

Prior work has focused on hidden Markov model and conditional-random-field-based methods, which typically leverage unary terms that are local in time and linear in model parameters. Because maneuvers are governed by long-term, nonlinear dynamics, we argue that the more expressive unary terms offered by recurrent neural networks (RNNs) are better suited for this task. Four RNN architectures are compared for recognizing activities from kinematics: simple RNNs, long short-term memory, gated recurrent units, and mixed history RNNs. We report performance in terms of error rate and edit distance, and we use a functional analysis-of-variance framework to assess hyperparameter sensitivity for each architecture.
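To make the recurrent formulation concrete, the following is a minimal sketch of the simplest of the four architectures, an Elman-style simple RNN that emits one label per kinematics frame. It is an illustration under stated assumptions, not the authors' implementation: the weights are random rather than trained, the three input features and five hidden units are hypothetical, and only the four output classes are taken from the maneuver setting.

```python
import math
import random


def simple_rnn_labels(frames, W_x, W_h, b_h, W_y, b_y):
    """Run an Elman-style RNN over a sequence; return one class index per frame."""
    h = [0.0] * len(b_h)  # initial hidden state
    labels = []
    for x in frames:
        # h_t = tanh(W_x x_t + W_h h_{t-1} + b_h): the state carries history forward
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(W_x[k], x))
                + sum(w * hj for w, hj in zip(W_h[k], h))
                + b_h[k]
            )
            for k in range(len(b_h))
        ]
        # Per-frame class scores; the argmax of a softmax equals the argmax of the scores
        scores = [
            sum(w * hk for w, hk in zip(W_y[c], h)) + b_y[c]
            for c in range(len(b_y))
        ]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


def random_matrix(rows, cols, rng):
    """Hypothetical untrained weights, for illustration only."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_inputs, n_hidden, n_classes = 3, 5, 4  # input and hidden sizes are assumptions
W_x = random_matrix(n_hidden, n_inputs, rng)
W_h = random_matrix(n_hidden, n_hidden, rng)
W_y = random_matrix(n_classes, n_hidden, rng)
b_h = [0.0] * n_hidden
b_y = [0.0] * n_classes

# Twenty synthetic "kinematics" frames standing in for real robot data
frames = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)] for _ in range(20)]
labels = simple_rnn_labels(frames, W_x, W_h, b_h, W_y, b_y)
print(labels)
```

Long short-term memory and gated recurrent units replace the single tanh update with gated updates, which is what makes long-term dependencies easier to learn; the per-frame readout stays the same.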

Results

We obtain state-of-the-art performance for both maneuver recognition from kinematics (4 maneuvers; error rate of \(8.6 \pm 3.4\%\); normalized edit distance of \(9.3 \pm 4.3\%\)) and gesture recognition from kinematics (10 gestures; error rate of \(15.2 \pm 6.0\%\); normalized edit distance of \(8.4 \pm 6.3\%\)).
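As a rough illustration of the two metrics reported above, the sketch below computes a frame-wise error rate and a segment-level edit distance for one toy sequence. The labels "A", "B", and "C" and the function names are hypothetical, the edit distance is a standard Levenshtein distance over collapsed segment labels, and no normalization is applied here; the exact definitions used in the article may differ.

```python
from itertools import groupby


def error_rate(true_frames, pred_frames):
    """Fraction of frames whose predicted label differs from the ground truth."""
    if len(true_frames) != len(pred_frames) or not true_frames:
        raise ValueError("sequences must be non-empty and of equal length")
    wrong = sum(t != p for t, p in zip(true_frames, pred_frames))
    return wrong / len(true_frames)


def to_segments(frames):
    """Collapse runs of identical frame labels into one label per segment."""
    return [label for label, _ in groupby(frames)]


def edit_distance(a, b):
    """Levenshtein distance between two label sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


# Toy example: the prediction briefly flips back to "A" inside the "B" segment
true_frames = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
pred_frames = ["A"] * 3 + ["B"] * 2 + ["A"] + ["B"] * 3 + ["C"] * 3

frame_error = error_rate(true_frames, pred_frames)
segment_edits = edit_distance(to_segments(true_frames), to_segments(pred_frames))
print(frame_error, segment_edits)
```

The two metrics are complementary: a short spurious flip barely changes the error rate but adds whole segments to the predicted sequence, which the edit distance penalizes.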

Conclusions

Automated maneuver recognition is feasible with RNNs, an exciting result which offers the opportunity to provide targeted assessment and feedback at a higher level of granularity. In addition, we show that multiple hyperparameters are important for achieving good performance, and our hyperparameter analysis serves to aid future work in RNN-based activity recognition.

Notes

  1. First, the dataset-level normalization provides immediate connections to the original edit distances, which aids analysis in future work; in contrast, one cannot invert the sequence-level normalizations without access to the number of segments in each predicted sequence. Second, the dataset-level normalization continues to penalize predicted sequences with the same weight as more spurious segments are added; in contrast, the sequence-level normalization penalizes predicted sequences less and less as spurious segments are added.
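The two properties in this note can be checked with a small numerical sketch. It rests on assumptions not stated in the note: sequence-level normalization is taken to divide by the longer of the ground-truth and predicted segment sequences, dataset-level normalization is taken to divide by a single constant shared by all sequences (the value 10 is hypothetical), and each spurious segment is assumed to add one to the raw edit distance.

```python
def sequence_level(edit_distance, n_true_segments, n_pred_segments):
    """Assumed form: normalize by the longer of the two segment sequences."""
    return edit_distance / max(n_true_segments, n_pred_segments)


def dataset_level(edit_distance, dataset_constant):
    """Assumed form: normalize by one constant shared across the dataset."""
    return edit_distance / dataset_constant


N_TRUE = 4             # ground-truth segments in the toy sequence (hypothetical)
DATASET_CONSTANT = 10  # hypothetical dataset-wide normalizer

rows = []
for spurious in (0, 2, 4, 8, 16):
    raw = spurious  # assumption: one extra edit per spurious segment
    rows.append((
        spurious,
        dataset_level(raw, DATASET_CONSTANT),
        sequence_level(raw, N_TRUE, N_TRUE + spurious),
    ))

for spurious, by_dataset, by_sequence in rows:
    print(f"{spurious:2d} spurious  dataset-level={by_dataset:.3f}  "
          f"sequence-level={by_sequence:.3f}")

# Inversion: the raw distance is recovered from the constant alone
recovered = dataset_level(8, DATASET_CONSTANT) * DATASET_CONSTANT
```

Under these assumptions the dataset-level value grows by the same amount for every added spurious segment and can exceed 1, while the sequence-level value flattens toward 1; recovering the raw distance from the sequence-level value would additionally require the predicted segment count.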

Acknowledgements

This research was supported by NSF Grant OISE-1065092, “A US-Germany Research Collaboration on Systems for Computer-Integrated Healthcare,” and by a fellowship for modeling, simulation, and training from the Link Foundation (Grant No. 90078471).

Author information

Corresponding author

Correspondence to Robert DiPietro.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflicts of interest.

Ethical standard

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Figures 8, 9, 10, 11 and Tables 4, 5, 6, 7.

Fig. 8

Qualitative results: MISTIC-SL trials with median error rate (for each architecture). Ground-truth labels are shown above predicted labels. From top to bottom, the error rates are 11.3%, 7.7%, 7.0%, and 8.6% (with normalized edit distances of 30.5%, 20.3%, 6.8%, and 20.3%)

Fig. 9

Qualitative results: MISTIC-SL trials with median edit distance (for each architecture). Ground-truth labels are shown above predicted labels. From top to bottom, the normalized edit distances are 30.5%, 10.2%, 6.8%, and 28.8% (with error rates of 15.8%, 16.3%, 6.4%, and 21.1%)

Fig. 10

Qualitative results: JIGSAWS trials with median error rate (for each architecture). Ground-truth labels are shown above predicted labels. From top to bottom, the error rates are 16.1%, 12.9%, 12.3%, and 12.9% (with normalized edit distances of 16.2%, 5.4%, 2.7%, and 8.1%)

Fig. 11

Qualitative results: JIGSAWS trials with median edit distance (for each architecture). Ground-truth labels are shown above predicted labels. From top to bottom, the normalized edit distances are 16.2%, 8.1%, 5.4%, and 10.8% (with error rates of 16.1%, 15.0%, 9.0%, and 9.4%)

Table 4 MISTIC-SL test-set error rates (%)
Table 5 MISTIC-SL test-set edit distances (%)
Table 6 JIGSAWS test-set error rates (%)
Table 7 JIGSAWS test-set edit distances (%)

About this article

Cite this article

DiPietro, R., Ahmidi, N., Malpani, A. et al. Segmenting and classifying activities in robot-assisted surgery with recurrent neural networks. Int J CARS 14, 2005–2020 (2019). https://doi.org/10.1007/s11548-019-01953-x

  • DOI: https://doi.org/10.1007/s11548-019-01953-x
