BeamAtt: Generating Medical Diagnosis from Chest X-Rays Using Sampling-Based Intelligence

Chapter in: Concepts and Real-Time Applications of Deep Learning

Abstract

Medical imaging coupled with image captioning opens the possibility of generating accurate medical reports with minimal human intervention. In economically disadvantaged nations, this creates opportunities for under-served patients to access world-class diagnostic expertise with a short turnaround time. Chest X-ray images are integral to the diagnosis and treatment of respiratory disease. In this paper, we propose BeamAtt, an end-to-end deep CNN-RNN encoder-decoder framework that incorporates spatial visual attention to generate a concise diagnosis from chest X-ray films. We use a GRU decoder instead of the LSTMs or hierarchical LSTMs of previous work and justify this choice through extensive evaluation. To outperform state-of-the-art methods built on complex architectures, we employ sampling-based techniques together with beam search optimisation at inference time, and argue that a simpler framework with intelligent decoding can achieve higher performance. We also show how attention plots reveal the image regions on which the network concentrates when generating each word token. We compare our model with recent prior art using the standard evaluation metrics BLEU-1/2/3/4 and ROUGE-L and demonstrate the superiority of the proposed method. BeamAtt achieves a BLEU-1 score of 0.56 and a CIDEr score of 2.077, a significant improvement over contemporary solutions.
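The inference-time beam search the abstract refers to can be illustrated with a minimal, self-contained sketch. Everything below is a hypothetical stand-in, not the chapter's actual code: `step_fn` plays the role of one decoder step (in BeamAtt this would be the GRU conditioned on attended CNN features), and the toy bigram table and token names are invented for illustration only.

```python
from math import log

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    """Generic length-bounded beam search over a next-token distribution.

    step_fn(prefix) returns a dict mapping candidate next tokens to their
    probabilities given the prefix generated so far.
    """
    # Each beam entry is (cumulative log-probability, token sequence).
    beams = [(0.0, [start_token])]
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end_token:
                completed.append((score, seq))   # hypothesis finished
                continue
            for tok, p in step_fn(seq).items():
                candidates.append((score + log(p), seq + [tok]))
        if not candidates:
            break
        # Keep only the top-k partial hypotheses by cumulative score.
        beams = sorted(candidates, reverse=True)[:beam_width]
    completed.extend(b for b in beams if b[1][-1] == end_token)
    completed = completed or beams
    return max(completed)[1]

# Toy bigram "language model" standing in for the trained decoder.
TABLE = {
    "<s>": {"no": 0.6, "mild": 0.4},
    "no": {"acute": 0.9, "</s>": 0.1},
    "mild": {"cardiomegaly": 1.0},
    "acute": {"disease": 1.0},
    "cardiomegaly": {"</s>": 1.0},
    "disease": {"</s>": 1.0},
}

def step(prefix):
    return TABLE[prefix[-1]]

print(beam_search(step, "<s>", "</s>", beam_width=2))
# → ['<s>', 'no', 'acute', 'disease', '</s>']
```

A greedy decoder would make the same choices here, but with a wider beam the search can recover sequences whose first token is locally sub-optimal yet globally more probable, which is the behaviour the chapter exploits to boost caption quality over greedy generation.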



Author information


Corresponding author

Correspondence to Shefali Srivastava.


Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Sawarn, A., Srivastava, S., Gupta, M., Srivastava, S. (2021). BeamAtt: Generating Medical Diagnosis from Chest X-Rays Using Sampling-Based Intelligence. In: Srivastava, S., Khari, M., Gonzalez Crespo, R., Chaudhary, G., Arora, P. (eds) Concepts and Real-Time Applications of Deep Learning. EAI/Springer Innovations in Communication and Computing. Springer, Cham. https://doi.org/10.1007/978-3-030-76167-7_9
