Abstract
Medical imaging coupled with image captioning is enabling the generation of accurate medical reports with minimal human intervention. In resource-constrained regions, this creates an opportunity for underserved populations to access world-class diagnostic expertise with fast turnaround. Chest X-ray images are integral to the diagnosis and treatment of respiratory disease. In this paper, we propose BeamAtt: an end-to-end deep CNN-RNN encoder-decoder framework that incorporates spatial visual attention to generate a terse diagnosis from chest X-ray films. We adopt a GRU decoder rather than the LSTM or hierarchical LSTM decoders used in prior work, and justify this choice through extensive evaluation. To surpass state-of-the-art methods built on complex architectures, we employ sampling-based techniques together with beam search optimisation at inference time, and argue that a simpler framework with intelligent decoding can achieve higher performance. We show how attention plots provide insight into the image region on which the network concentrates while generating each word token. We compare our model with recent prior art using the standard evaluation metrics BLEU-1/2/3/4, ROUGE-L, and CIDEr, and demonstrate the superiority of the proposed method. BeamAtt achieves a BLEU-1 score of 0.56 and a CIDEr score of 2.077, a significant improvement over contemporary solutions.
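The beam search decoding named in the abstract can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: `toy_step` is a hypothetical stand-in for the attention-weighted GRU decoder step, returning a fixed toy distribution over next tokens so the search procedure itself is visible. Beam search keeps the `beam_width` highest-scoring partial captions at each step instead of committing greedily to the single best next token.

```python
import math

# Toy next-token scorer standing in for a GRU decoder step: given the
# tokens generated so far, return {token: log-probability}. In a real
# captioner this would come from the attention-weighted decoder.
def toy_step(prefix):
    table = {
        (): {"no": math.log(0.6), "the": math.log(0.4)},
        ("no",): {"acute": math.log(0.8), "<eos>": math.log(0.2)},
        ("no", "acute"): {"disease": math.log(0.9), "<eos>": math.log(0.1)},
        ("no", "acute", "disease"): {"<eos>": math.log(1.0)},
        ("the",): {"<eos>": math.log(1.0)},
    }
    return table.get(prefix, {"<eos>": 0.0})

def beam_search(step_fn, beam_width=2, max_len=5):
    # Each hypothesis is (token_tuple, cumulative log-probability).
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, logp in step_fn(prefix).items():
                hyp = (prefix + (tok,), score + logp)
                # Completed hypotheses leave the beam; others compete on.
                (finished if tok == "<eos>" else candidates).append(hyp)
        if not candidates:
            break
        # Keep only the beam_width highest-scoring partial hypotheses.
        beams = sorted(candidates, key=lambda h: h[1], reverse=True)[:beam_width]
    finished.extend(beams)
    return max(finished, key=lambda h: h[1])[0]

print(beam_search(toy_step))  # → ('no', 'acute', 'disease', '<eos>')
```

Note that a purely greedy decoder would also need only one step memory, but beam search recovers sequences whose first token is not the single most likely continuation; combining it with sampling over the candidate pool, as the abstract describes, trades some of this determinism for diversity.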
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Sawarn, A., Srivastava, S., Gupta, M., Srivastava, S. (2021). BeamAtt: Generating Medical Diagnosis from Chest X-Rays Using Sampling-Based Intelligence. In: Srivastava, S., Khari, M., Gonzalez Crespo, R., Chaudhary, G., Arora, P. (eds) Concepts and Real-Time Applications of Deep Learning. EAI/Springer Innovations in Communication and Computing. Springer, Cham. https://doi.org/10.1007/978-3-030-76167-7_9
DOI: https://doi.org/10.1007/978-3-030-76167-7_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-76166-0
Online ISBN: 978-3-030-76167-7
eBook Packages: Intelligent Technologies and Robotics