MISS-D: A fast and scalable framework of medical image storage service based on distributed file system

https://doi.org/10.1016/j.cmpb.2019.105189

Highlights

  • A fast and scalable framework based on a distributed file system is proposed, allowing rapid data access to massive medical images.

  • An integrated medical imaging content indexing file model is designed to achieve high storage efficiency on HDFS.

  • A virtual file pooling technology is proposed to enable efficient data reading and provide a data swapping strategy.

  • Experimental results demonstrate advantages of the proposed framework.

Abstract

Background and Objective Processing of medical imaging big data is deeply challenging due to the size of the data, computational complexity, secure storage and inherent privacy issues. The traditional picture archiving and communication system (PACS), an imaging technology used in the healthcare industry, generally relies on centralized high-performance disk storage arrays in practical deployments. Such storage solutions are not suitable for the diverse range of medical imaging big data that needs to be stored reliably and accessed in a timely manner. Cloud computing is emerging as an economical alternative, providing scalability, elasticity, performance and better cost management. Cloud-based storage architectures for medical imaging big data have therefore attracted growing attention in industry and academia.

Methods This study presents a novel, fast and scalable framework for a medical image storage service based on a distributed file system. The framework introduces two innovations. First, an integrated medical imaging content indexing file model for large-scale image sequences is designed to achieve high storage efficiency on the distributed file system. Second, a virtual file pooling technology is proposed, which uses memory-mapped files to achieve efficient data reading and provides a data swapping strategy within the pool.

Results The experiments show that the framework not only delivers file reading and writing performance that meets the requirements of real-time application domains, but also brings greater convenience to clinical system developers through multiple client access types. The framework supports different client types through unified micro-service interfaces, which largely meet the needs of clinical system development, especially for online applications. The experimental results demonstrate that the framework can satisfy real-time data access needs as well as a traditional picture archiving and communication system.

Conclusions The framework aims to allow rapid data access to massive medical images, as demonstrated by the online web client for MISS-D implemented in this paper for real-time data interaction. The framework also provides a substantial subset of the features of existing open-source and commercial alternatives, giving it a wide range of potential applications.

Introduction

Over the past two decades, successive breakthroughs in medical imaging technology have driven explosive growth in image data. The new generation of imaging equipment makes this growth nearly exponential, and radiologists are beginning to observe datasets at the petabyte scale [1]. This is highly challenging due to the size of the data, computational complexity, secure storage and inherent privacy issues [2].

The picture archiving and communication system (PACS), an imaging technology used in the healthcare industry, plays a central supporting role in managing and storing this image big data. It helps healthcare professionals to collect data easily, manage it efficiently and access it quickly whenever necessary. PACS has gradually expanded from a single machine or department to whole hospitals and regions, replacing the conventional model of manually collecting, storing and retaining image-based medical information (such as X-rays) with digital image creation and storage [3].

In hospitals and healthcare institutes, it is important that patient image data is collected correctly and then kept on reliable storage media where it can persist for an indefinite period of time (e.g. 7 years in the USA). This requires, first, that PACS provide a large, safe and durable data storage capacity. Moreover, the data should be quickly accessible at any given time. There is a diverse range of medical imaging data that needs to be stored reliably and accessed in a timely manner.

In the current application market, most PACS providers use centralized high-performance disk storage arrays in their archive solutions, including direct attached storage (DAS), storage area networks (SAN), network attached storage (NAS), tapes or hybrid storage. Data in a PACS is divided into three levels according to access frequency: online data, near-line data and off-line data. With the rapid progress of storage technology, the cost per storage unit has certainly fallen. However, as mentioned earlier, the immense size of image data still drives up total cost because of its relatively high consumption of storage space. Data is growing much faster than storage capacity [4]. Medical imaging storage, together with the required space, equipment and energy usage, is expensive and will only become more so as data grows.

Therefore, another PACS archive solution is emerging as cloud computing provides scalability, elasticity, performance and thus better cost management. The economical approach is to leverage cost-efficient cloud computing services to host data and conduct the required analysis on demand [5], [6]. Cloud-based data archive solutions leverage archive storage tiers offered by public and private cloud service providers (e.g., Amazon EC2, Microsoft Azure, Google Cloud Platform, AliCloud). There are multiple ways to integrate cloud data archives with existing storage solutions. Users who already own storage media can integrate cloud archive tiers using storage gateway technology. Another way to leverage cloud archive tiers for PACS archiving or medical record storage is to set up hybrid solutions [7], [8].

The rapid development of cloud computing technology provides an effective way to build a low-cost, highly available and high-performance shared medical image collaboration platform, of which medical image cloud storage is an important part. However, PACS is currently weak at supporting collaborative sharing and regional coordination. Services such as image consultation, image referral, online remote education and Web DICOM access mostly adopt a point-to-point model; integrated, cross-platform, highly available regional medical image collaboration software is lacking. At the same time, mobile devices such as smartphones and tablets make access to information more convenient, and doctors are no longer limited to reading in the darkroom. Reading and interpreting medical images in real time at any time, in any place and on any device has become an urgent need for radiologists [9].

As is well known, the amount of medical image data reaches the TB or PB level. Traditional storage architectures usually have high cost and poor flexibility for heterogeneous integration and storage scaling. Large hospital PACS therefore commonly uses an "online / near-line / offline" storage mode, in which offline data is mostly stored in tape libraries with poor usability, so that data cannot be obtained in real time. In practice, however, there is no such distinction in data usage: all data is hot data. For example, patients may access their historical data in online applications for many years. At the same time, the parallel performance of online applications is easily degraded by centralized storage bottlenecks, and the application server side cannot simply be scaled out into a larger cluster. Even a high-performance FC SAN with large bandwidth and processing power struggles to meet the rapid processing and transmission requirements of TB-level data.

Cloud-based storage architectures for PACS are attracting more and more attention in industry and academia. Among them, the Hadoop distributed file system (HDFS) is highly reliable and scalable; it stores large files in a streaming data access mode and runs on inexpensive hardware clusters. It is a viable foundation for building secure, reliable, stable and cost-effective cloud storage. There are of course other distributed file systems to choose from, such as GPFS or Lustre. We choose HDFS as the storage infrastructure because Hadoop also allows distributed processing of large data sets across clusters of computers, for example with MapReduce or Spark. Distributed computing on these medical image data is part of our future work, and Hadoop will make such deep analysis more convenient. The framework in this paper would work equally well if HDFS were replaced by another distributed file system.

In this paper, we describe an alternative solution for storing and retrieving medical image data on a distributed computing cluster to provide high-performance and cost-effective distributed processing. The core of this solution is a fast and scalable framework of Medical Image Storage Service based on a Distributed File System (MISS-D), which directly replaces the traditional centralized PACS storage center with the Hadoop distributed file system. The advantages of HDFS are as follows.

  • (1)

    High performance in parallel, and all data are hot data. HDFS adopts a master-slave architecture consisting of a name node and multiple data nodes; there is often also a backup name node. The name node is a central server that manages the metadata of the whole file system. The data nodes are responsible for storing blocks of data and use redundant backup mechanisms. As a result, HDFS is ideal for storing and processing massive amounts of data in parallel: once a client obtains the block information from the name node, it can read data from multiple HDFS data nodes in parallel.

  • (2)

    Scalability and reliability. HDFS is highly scalable, with storage capacity and computing power growing linearly as servers are added. Each data block is replicated on three servers by default, providing the redundancy needed to keep stored data reliable. This mechanism suits the write-once, read-many streaming access scenario. Medical image data is rarely modified after being written to storage, which fits the characteristics of HDFS very well.

  • (3)

    Low cost and computing environment. The Hadoop framework can take full advantage of the computing resources of the cluster's servers to support image fusion, image content retrieval, three-dimensional reconstruction and other computationally intensive algorithms on massive medical image data.
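The master-slave read path in advantage (1) and the replica placement in advantage (2) can be sketched as follows. This is a toy in-memory model for illustration only, not a real HDFS client; all names here (`block_map`, `place_replicas`, and so on) are invented, and real HDFS placement is additionally rack-aware.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

# Toy in-memory stand-ins for HDFS's master-slave layout:
# the "name node" keeps only metadata; the "data nodes" hold block bytes.
DATA_NODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]

def place_replicas(block_id, replication=3):
    """Pick `replication` distinct data nodes for one block.
    Simple round-robin sketch; real HDFS placement is rack-aware."""
    start = sum(block_id.encode()) % len(DATA_NODES)
    return list(itertools.islice(itertools.cycle(DATA_NODES),
                                 start, start + replication))

# Name-node metadata: file path -> ordered list of blocks,
# plus the replica sites chosen for each block.
block_map = {"/pacs/series1.img": ["blk_1", "blk_2", "blk_3"]}
replica_map = {b: place_replicas(b)
               for blocks in block_map.values() for b in blocks}

# Data-node payloads (first replica only, for brevity).
block_store = {"blk_1": b"AAAA", "blk_2": b"BBBB", "blk_3": b"CCCC"}

def read_file(path):
    """One metadata lookup at the name node, then parallel block fetches."""
    blocks = block_map[path]
    with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        return b"".join(pool.map(block_store.__getitem__, blocks))

print(read_file("/pacs/series1.img"))  # b'AAAABBBBCCCC'
```

The single metadata lookup followed by parallel block reads is what makes HDFS fast for large files, and, as discussed below, what makes it slow for many small ones.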

The reference architecture of a MISS-D based PACS is shown in Fig. 1. It differs from traditional PACS in two modules. The original disk storage array is replaced by HDFS, and whereas traditional PACS clients connect to the storage system through a localized API or a networked disk mount, the clients of the HDFS-based PACS connect to the HDFS system through HDFS APIs. A service cluster for online or mobile applications is easily deployed in this environment. The advantages of such a solution are as follows.

  • (1)

    Allow organizations to bring their medical imaging data closer to other sources already in use like genomics, electronic medical records, pharmaceuticals, wearable data, and so on, reducing cost of data storage and movement.

  • (2)

    Enable solutions with lower cost using open source software and commodity servers.

  • (3)

    Enable medical image analytics and correlation of many forms of data within the same cluster.

Theoretically, the HDFS-based architecture has advantages in inter-operation and image sharing across different healthcare systems. However, the architecture of HDFS is not suitable for real-time applications, because each data block must be replicated in at least 3 copies during the writing process. Compared with the local file system, efficiency drops significantly for frequent sequential I/O over hundreds of images. HDFS is therefore suited to high throughput but not to low-latency access. In general, the wait for data loading should not exceed 2 minutes for most PACS products, yet depositing 1 million files at once takes HDFS hours longer than local disk storage. More importantly, HDFS uses streaming I/O, which is not suitable for multi-user reading and writing: accessing a small file requires jumping from one data node to another, which greatly reduces read performance. HDFS is therefore generally used as second-level storage rather than as a real-time access facility.

Moreover, as is well known, medical image data is generally large in scale; common medical image sizes are listed in Table 1. The data is rarely reduced in size with lossy compression for storage, because lossy compression affects image reading tasks in clinical applications.

In addition to the large size of medical images described above, the characteristics of data access in medical image applications are shown in Fig. 2.

Problem 1: Storing performance decrease from massive small files in the name node of HDFS

Distributed file systems such as Lustre, GFS, HDFS and GlusterFS are designed and optimized for large file storage, with a default block size of 64 MB. However, a scan series contains about 100–200 images, and each image is usually about 512 KB. Too many small files cause the HDFS name node to consume too much of the cluster's memory: the metadata of all files is kept in the name node, so its memory capacity limits the number of files. Each block and index directory is stored as an in-memory object of about 150 bytes, so storing 100 million or more files requires 20 GB or more of name node memory. Hadoop currently offers solutions to this kind of small-file problem, such as Hadoop Archive and SequenceFile. However, these methods do not fully meet the requirements of medical DICOM sequence image applications; for example, they lack content indexing and random access to single images.
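The integrated content indexing file model addresses this by packing a whole image series into one large file with an embedded index, so the name node tracks one object instead of hundreds, while single images remain randomly accessible. The following is a minimal sketch of that idea using an invented layout (raw payloads, a JSON index, then a 4-byte index length); it is not the actual MISS-D or SequenceFile on-disk format.

```python
import io
import json
import struct

def pack_series(images):
    """Pack {image_name: bytes} into one container blob: raw payloads,
    then a JSON index of {name: [offset, length]}, then the index length
    as a 4-byte big-endian integer.  Illustrative layout only."""
    body = io.BytesIO()
    index = {}
    for name, data in images.items():
        index[name] = [body.tell(), len(data)]
        body.write(data)
    index_bytes = json.dumps(index).encode()
    body.write(index_bytes)
    body.write(struct.pack(">I", len(index_bytes)))
    return body.getvalue()

def read_image(blob, name):
    """Random access to a single image: read the index from the tail,
    then slice out just that image's bytes."""
    (index_len,) = struct.unpack(">I", blob[-4:])
    index = json.loads(blob[-4 - index_len:-4])
    offset, length = index[name]
    return blob[offset:offset + length]

# A ~200-image series of ~512 KB each packs into one ~100 MB file,
# replacing ~200 name-node metadata objects with a single one.
series = {"img_%03d.dcm" % i: bytes([i]) * 4 for i in range(3)}
blob = pack_series(series)
print(read_image(blob, "img_001.dcm"))  # b'\x01\x01\x01\x01'
```

Because the index travels with the payload, a client can fetch one image from a packed series with a single ranged read instead of scanning the whole container.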

Problem 2: Low efficiency of random frequency reading in real time application

The architecture of HDFS is not suitable for real-time applications because each data block must be replicated in at least 3 copies during the writing process, so writing performance is significantly lower than reading performance and too slow for PACS to obtain images quickly. In addition, the client must establish a new connection to HDFS for each file I/O operation. HDFS suits high throughput rather than low-latency access, and is therefore generally used as second-level storage rather than as a real-time access facility.

The MISS-D framework proposed in this paper solves these two problems with two methods, data content packing and a virtual file pool, which are presented in detail in the next section. The rest of the paper is organized as follows. The proposed framework is introduced in Section 2, the experimental results are presented in Section 3, the framework is discussed in Section 4, and conclusions are drawn in Section 5.

Section snippets

Methods

From the introduction above, we know that medical image data loading in applications falls into two basic categories: browsing an entire series image by image, and visualizing the volume data as a whole. HDFS lacks the rapid response capability needed for the former and does not provide high access performance over massive small files on decentralized storage for the latter. Therefore, the MISS-D framework in this paper proposes two new approaches in response
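The virtual file pool described in the abstract can be pictured as a set of memory-mapped files managed with an LRU swapping strategy. The class below is a hypothetical minimal sketch (class and method names are invented here); the real MISS-D pool additionally stages packed series files fetched from HDFS, which is omitted.

```python
import mmap
import os
import tempfile
from collections import OrderedDict

class VirtualFilePool:
    """Sketch of a virtual file pool: files are exposed as memory-mapped
    regions, and an LRU swapping strategy evicts the least recently used
    mapping once the pool reaches capacity."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.pool = OrderedDict()  # path -> mmap object, in LRU order

    def get(self, path):
        if path in self.pool:
            self.pool.move_to_end(path)        # mark most recently used
            return self.pool[path]
        if len(self.pool) >= self.capacity:
            _, victim = self.pool.popitem(last=False)  # swap out LRU entry
            victim.close()
        with open(path, "rb") as f:  # mmap duplicates the fd internally
            mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        self.pool[path] = mapped
        return mapped

# Demo: three files through a pool of capacity 2 forces one swap-out.
paths = []
for i in range(3):
    fd, p = tempfile.mkstemp()
    os.write(fd, b"series-%d" % i)
    os.close(fd)
    paths.append(p)

pool = VirtualFilePool(capacity=2)
print(bytes(pool.get(paths[0])))   # b'series-0'
print(bytes(pool.get(paths[1])))   # b'series-1'
print(bytes(pool.get(paths[2])))   # b'series-2' (paths[0] is swapped out)
```

Memory-mapping lets repeated reads of a hot series hit page cache rather than re-issuing file I/O, which is the core of the pool's fast response for series browsing.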

Test data and environment

The implementation environment of the MISS-D framework used for testing is shown in Table 2. The test data comprises serial image data-sets of four different total sizes, 100 MB (A), 250 MB (B), 500 MB (C) and 1 GB (D), which are listed in Table 3.

Performance testing of MISS-D framework

Firstly, the reading and writing performances on different file systems are tested on local file system and HDFS. Secondly, tests are conducted on each data-set whether the image files are compressed or not. Table 4 lists the

Discussion

This section compares the proposed framework with the literature in detail. Much research has been done on medical data processing and analytics on the Hadoop platform. MIAPS (medical image access and presentation system) is a web-based system designed for remotely accessing and presenting DICOM images. The DICOM image indexing engine of MIAPS is deployed on a computing cluster powered by Hadoop, with 10 PCs used as slaves to store DICOM images. MIAPS provides a web-based

Conclusion

This paper presents a fast and scalable framework for a medical image storage service based on a distributed file system (Hadoop). The framework aims to allow rapid data access to massive stores of medical images. MISS-D also provides a substantial subset of the features of existing open-source and commercial alternatives. It supports different types of client (e.g. web browser, mobile application) through unified micro-service interfaces, with which the traditional PACS client is compatible too. Two

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Declaration of Competing Interest

There are no conflicts of interest declared.

Acknowledgments

This work was supported by the National Key Research and Development Program (2018YFC0830701), the National Natural Science Foundation of China (U1708261) and the Innovation Talent Program of Shenyang (RC170521).

References (25)

  • V. Joshi et al.

    PACS administrators’ and radiologists’ perspective on the importance of features for PACS selection

    J. Digit. Imaging

    (2014)
  • J. Philbin et al.

    Will the next generation of PACS be sitting on a cloud

    J. Digit. Imaging

    (2011)