Text segmentation of health examination item based on character statistics and information measurement

Abstract: This study explores a segmentation algorithm for item text data in health examination, especially for single items of long text. A large amount of historical health examination data is analysed. Using character statistics, the connection tightness value T_AB between every pair of adjacent characters is calculated. Three parameters are set: the candidate number N, the best position BP, and the balance weight BW. The total segmentation indexes SIs are calculated, thus determining the segmentation position Pos. The optimal parameter values are determined by information measurement. Experimental results show an accuracy rate of 78.6%, which reaches 82.9% on the most frequently appearing text items. The complexity of the algorithm is O(n). Using no existing domain knowledge, it is very simple and fast. By being executed repeatedly, it can conveniently obtain the characteristics of each single item of text data and, furthermore, distinguish the expression preferences of different physicians for the same item. The assumption is verified that, without professional domain knowledge, a large amount of historical data can provide valuable clues for text understanding. The results of this research are being applied and verified in follow-up research in the field of health examination.


Introduction
Health information collection is the first step in the trilogy of health management and disease preventive treatment in traditional Chinese medicine (TCM), and is the basis of subsequent health risk assessment and health intervention [1]. Health examination data is the most important source of health information and plays a pivotal role in the health management industry chain in China. At present, a large amount of health examination data has been obtained [2], among which the precious data of unstructured text type is difficult to use for automatic health assessment. Up to now, text data analysis and evaluation have mainly been performed by manual string matching. This approach lacks automation and intelligence, because the texts are difficult to comprehend in full detail, and the results still need to be checked manually, leading to low efficiency.
In China, health examination became popular after SARS in 2003. With the development of the social economy, the improvement of people's living standards, and people's increasing attention to their own health, the health examination industry has developed rapidly. This work is not only medical, but also closely tied to commercial operation. A large number of records have accumulated over the past 10 years. These records are not as strict or formal as clinical medical files, especially the text type data. Mixed use and misuse of traditional Chinese medicine and Western medicine terminology, colloquial expressions, vague concepts, and similar problems lead to poor-quality health examination records. These text data are difficult to analyse and utilise, and little research has specifically addressed them. However, these examination records capture the changes in people's health, especially of those who take a regular annual examination, and thus have important potential value.
Our team is carrying out several research projects related to health examination: construction of a knowledge graph in the field of health examination, development of a special input method for health examination results, design of an intelligent and automated method for evaluating health examination results, visualisation of health examination results, and so on. All these projects need to analyse health examination results of text type. In previous attempts, we found that the tools and methods of clinical text analysis are not well suited. Although health examination data are not standardised and contain a large number of individual categories and items, each specific item has its own characteristics: the information it expresses is confined to a very limited range. What we need are the characteristics of each single item of text data in health examination and, furthermore, the expression preferences of different physicians for the same item.
No published research on the characteristic analysis of item text data in health examination has been found. Therefore, an algorithm is needed as the starting point of the above studies, and it should be as simple as possible. No existing domain knowledge is used for the time being, to avoid over-constraining the algorithm and its results, since the purpose of the algorithm is text feature and knowledge discovery. Similarities and differences within the large sample of item data are chosen as clues. The algorithm must be simple enough to be executed repeatedly: it will run repeatedly over a large number of existing and continuously emerging item data, and the personal data of different physicians may need to be analysed in real time. The algorithm does not pursue a perfect result in one pass; it will be continuously verified and improved through use and interaction with doctors, and will be upgraded to exploit the verified knowledge to improve its text analysis ability.
This study presents such a simple starting algorithm. It analyses a large amount of historical health examination data using character statistics and information measurement. The goal is to search for the inherent regularities of the jargon of this specific field, and to explore appropriate algorithms and tools for encoding and analysing text data in health examination. It will provide a basis for follow-up research.

Related work
There is a large amount of health information in form of natural language, which is difficult to be analysed and utilised. The analysis of medical texts for the purpose of information extraction and knowledge discovery has been the focus of the research. Spasić reported KneeTex (a system for information extraction of knee pathology from MRI reports) which is modelled by a set of sophisticated lexico-semantic rules with minimal syntactic analysis in combination with the ontology [3]. Nguyen assessed the utility of Medtex on automating cancer registry notifications from pathology HL7 messages [4]. Koopman automatically extracted ICD-10 classification information of cancers from free-text death certificates [5]. Yepes used the technology of machine learning to improve the performance of Mesh keyword indexing program such as MTI [6]. Chard leveraged cloud-based approaches to solve the problem of poor accessibility, scalability, and flexibility of natural language processing (NLP) systems on processing medical text [7]. Botsis demonstrated a multilevel text mining approach for automatic rule-based text classification of adverse event reports that could potentially reduce human workload [8]. Li reported the research on information extraction based on domain ontology, which can improve the computer's ability of information extracting and knowledge discovering from electronic medical records in Chinese [9]. Nishmoto constructed a medical dictionary for ChaSen from unified medical language system (UMLS) believing that retrieval of transitional probability would improve the accuracy of parsing compound medical terms [10]. Zhou proposed a method and a prototype system for discovering implicit temporal assertions in medical text by applying discourse analysis as well as semantic and syntactic analysis, and by generating heuristic rules that encode the discovered domain and linguistic knowledge [11]. 
Yetisgenyildiz improved the efficiency of MEDLINE document classification by medical phrases extracting based on the medical knowledge base and NLP [12]. Niu treated analysis of the polarity information of clinical outcomes as a classification problem, which could be solved by NLP and supervised machine learning [13]. Travers evaluated an emergency medical text processor, a system for cleaning chief complaint text data [14]. There are many similar researches in China, in which Chinese word segmentation methods are used [15][16][17], and the research field is extended to traditional Chinese medicine [18][19][20][21][22].
As mentioned above, current research and applications on medical text processing are based on NLP: lexical, syntactic, and semantic analysis. Ontologies, knowledge bases, and other domain-specific medical expertise are often used, and the goal is to extract a small amount of specific information. Medical NLP is difficult to apply, comprehensive domain knowledge is difficult to obtain and maintain, and such specific systems are difficult to extend to related fields. Reports on the analysis of text data in health examination are rare.
These related works use specific domain knowledge to extract a small amount of purpose-specific information from a large volume of raw data. The information obtained is limited in quantity and important information may be omitted, which makes these approaches unsuitable for analysing health examination data and discovering unknown knowledge and rules.

Data source
The data used in this paper came from the health examination department of a top-level first-grade hospital in Wenzhou, Zhejiang, China. Health examination has been carried out there for 20 years. Health examination software was introduced at the end of 2009, and electronic data have been saved for more than 7 years since then, covering about 20,000 people per year. The software is developed by a Hangzhou medical software company with a relatively high market share, so the data reflects the common state of data in Chinese health examination.

Data status
Health examination results of 130,028 people have been stored in the database. There are 11,380,790 rows in the detailed data table, involving 599 items. The items can be divided into three types according to the examination method: laboratory test, physical examination, and instrument check. The results are saved as numeric or text data, as shown in Table 1.
Laboratory results are mainly numerical, and their text data are very short, with an average length of 2.3 characters and all within 5 characters. They also have a strictly limited input range, with an average of only six distinct values. Two-thirds of physical examination results are of text type. They are also mainly short, while their input freedom varies greatly: usually no more than ten distinct values, but sometimes very high. Instrument check results are mainly text, and their length and input freedom increase significantly, as shown in Table 1 and Figs. 1 and 2.

Problems
The difficulty of analysing and utilising health item data varies greatly with the data type. Numerical results are the easiest to use, because they always have reference ranges, from which a given result can be confirmed as normal or not, and even its degree of abnormality obtained. Most laboratory test results and some physical examination indicators fall into this category. Text results of short length and limited input freedom are not difficult either, because the possible results can easily be listed and assessed separately. All laboratory text results and many physical examination and instrument check results are of this type. The most difficult case is long text data with a high degree of input freedom, since there is no strict format specification and the content can be input arbitrarily.
Current measures for the analysis and utilisation of long text data include the following: (i) the data are ignored and simply not used; (ii) in addition to the original data, physicians are required to input a thumbnail copy that can be assessed relatively easily, leading to duplicated work and an increased burden on medical staff; (iii) manual reading and analysis; (iv) keyword matching, although natural language is too flexible and complex for this, and without strict input constraints it is difficult to list all keywords comprehensively, so manual review is still necessary; regular expressions share the same problem as keywords. These methods lack automation and intelligence, resulting in low efficiency.
To make better use of these texts, the structure and rules of the data must be analysed, and the large amount of historical data accumulated in the examination system can play an important role here. In this study, we explore methods of analysing long text data and, based on the historical health examination data, provide methods and tools for encoding, compression, structuring, analysis, and assessment, thus achieving more automatic and intelligent health assessment.

Data processing algorithm
Natural languages have a very high degree of freedom of expression, especially Chinese. However, when applied to a specific context, the degree of freedom is limited. A health examination item describes a single physiological or test outcome, so its degree of freedom is obviously stricter. Among the 347 types of health examination items, those with input freedom of 4, of between 5 and 64, and of more than 256 account for 32.3, 79.3, and 9.5%, respectively. A higher degree of freedom results in longer text; nevertheless, there must exist context domain constraints and unique language fingerprints such as character frequencies, word frequencies, and their connection rules.
For better analysis and evaluation, the long unstructured texts should first be segmented, encoded, and structured. The information in a long unstructured text consists of its short sentences and their order. First, the short sentences need to be segmented; each sentence can then be regarded as a piece of basic information comprising an item name and the corresponding value. Take the sentence 'Intrahepatic light spots are thickening and disorder' as an example: 'Intrahepatic light spots' is its item name and 'are thickening and disorder' its value. After segmentation, the sentence is easier to encode and classify, ready for analysis and evaluation.
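As a minimal illustration (in Python rather than the authors' C#), a segmented sentence can be represented as a KEY-CHOICE pair once a split position is known; determining that split position is the task of the algorithm described in the following sections:

```python
def segment(sentence: str, pos: int) -> tuple[str, str]:
    """Split a short result sentence at position pos into its item
    name (KEY) and the corresponding value (CHOICE)."""
    return sentence[:pos], sentence[pos:]

# Using the English gloss of the example sentence from the text:
key, choice = segment("Intrahepatic light spots are thickening and disorder", 24)
# key    -> "Intrahepatic light spots"
# choice -> " are thickening and disorder"
```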
Based on the assumptions above, this study employs the large amount of historical health examination data and constructs a text analysis algorithm using character statistics and information measurement. The algorithm is implemented in C#, and is exemplified below with the B ultrasound results of the liver.

Data preparation
To avoid impact on online medical services, the 11,380,790 rows of data are exported into a Microsoft LocalDB database with the table name 'ExaminItemResults'. The main column information is shown in Table 2. Liver B ultrasound data are one of the most common types of long text, with examination item number '050001' and a total of 82,772 rows saved.

Data loading and numerical substitution
To merge identical results, a structured query language (SQL) aggregation statement is used (code 1). 12,941 distinct results are returned from the database, among which the default normal result occurs most frequently, with a count of 41,383. The texts contain many measured values, such as the size of the liver or of a liver cyst, and these figures would affect classification. Therefore, a regular expression is used to identify and replace all figures with the placeholder '┻', after which the number of distinct results drops to 7438.
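The substitution and merging steps can be sketched as follows. This is an illustrative Python version, not the authors' C# code, and the paper's exact regular expression is not reproduced in the text, so the pattern for integers and decimals below is an assumption:

```python
import re
from collections import Counter

# Assumed pattern for integers and decimal figures; the paper's exact
# regular expression is not given in the text.
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")

def substitute_figures(text: str, placeholder: str = "\u253b") -> str:
    """Replace every measured figure with the placeholder character
    (U+253B, the character used in the paper)."""
    return NUMBER_PATTERN.sub(placeholder, text)

def merge_results(results: list[str]) -> Counter:
    """Merge identical results after figure substitution, mirroring
    the SQL aggregation plus placeholder step described in the text."""
    return Counter(substitute_figures(r) for r in results)
```

For example, 'liver cyst 1.2cm' and 'liver cyst 0.8cm' both become 'liver cyst ┻cm' and are merged into one result kind.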

Character frequency counting and segmentation
The connection tightness value T_AB between two adjacent characters A and B is calculated as follows. First, three frequencies are counted: F_A* is the frequency of adjacent character pairs starting with A, F_*B that of pairs ending with B, and F_AB that of pairs starting with A and ending with B. Three candidate formulas, (1a)-(1c), were compared, and the best-performing one was adopted. By adding an end tag to each sentence, the number of counted T_AB values in a sentence equals its number of characters. The T_AB values are then sorted in ascending order, and the first N are chosen as candidate positions for segmenting the sentence. All the resulting front parts are counted and sorted in descending order, so that each T_AB obtains its front-part order FO. With the parameter BP (best position), a balance index BI can be calculated for each candidate position according to (2), where Pos is the split position and Len is the length of the sentence. With another parameter BW (balance weight), the total split index SI of each candidate position is calculated, and in each sentence the position with the largest SI is finally chosen.
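The frequency counting and candidate selection can be sketched as below (Python, not the authors' C#). Since formulas (1a)-(1c) and (2) are not reproduced in the text, the tightness formula here is an assumption, the end-tag character is arbitrary, and the final BI/BW scoring is omitted; the sketch simply takes the loosest candidate:

```python
from collections import Counter

def connection_tightness(sentences):
    """Count adjacent-character frequencies over the corpus and derive
    a tightness value T_AB for every observed character pair (A, B)."""
    f_ab = Counter()     # F_AB: pairs starting with A and ending with B
    f_a_any = Counter()  # F_A*: pairs starting with A
    f_any_b = Counter()  # F_*B: pairs ending with B
    for s in sentences:
        s = s + "\u2400"  # end tag so pair count equals character count (assumed tag)
        for a, b in zip(s, s[1:]):
            f_ab[(a, b)] += 1
            f_a_any[a] += 1
            f_any_b[b] += 1
    # Assumed tightness formula; the paper compares three candidates
    # (1a)-(1c), which are not reproduced in the text.
    return {(a, b): n / (f_a_any[a] + f_any_b[b] - n)
            for (a, b), n in f_ab.items()}

def split_position(sentence, tightness, n_candidates=2):
    """Sort the pair tightness values ascending and keep the first N
    positions as candidates. The full algorithm then scores each
    candidate with the balance index BI and weight BW; this sketch
    simply returns the loosest candidate."""
    pairs = [(tightness.get((a, b), 0.0), i + 1)
             for i, (a, b) in enumerate(zip(sentence, sentence[1:]))]
    pairs.sort()
    candidates = pairs[:n_candidates]
    return candidates[0][1] if candidates else len(sentence)
```

On a toy corpus such as ["abxcd", "abycd", "abzcd"], the pair 'a'-'b' always co-occurs and gets high tightness, while 'b' connects to three different successors, so the split lands between 'b' and the variable part.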

Determination of the optimal value of parameters N, BP, and BW
After segmentation, each sentence can be classified according to its front part, which represents the KEY (the problem the sentence describes), while the latter part represents the CHOICE the sentence makes about that KEY. A dictionary is then built that stores all N KEYs and, for each KEY, all its M CHOICEs; the storage space of the dictionary, SD, can be calculated accordingly. Encoding and storing each sentence requires two parts: the first for the KEY code and the second for the CHOICE code. The storage space for all detailed sentences, SS, and the total storage space, ST, are calculated likewise. Different SD, SS, and ST values are obtained for different values of the parameters N, BP, and BW. By sorting SD and ST in ascending order, their order ranks OSD and OST are obtained, and OAVG is the average of OSD and OST. The optimal parameter values, 2, 9, and 0.8, are determined by the minimum OAVG, as shown in Table 3.

Experimental results and analysis

Segmentation results
The experimental results show that, for sentences appearing 10 or more times, the segmentation accuracy rate is 78.6%. As shown in Table 4, the weighted accuracy rate is 80.3%, and reaches 82.9% for the most frequently appearing (more than 100 times) long texts.

Algorithm efficiency
The algorithm has high execution efficiency; its complexity is O(n) in the number of data rows n. In the VS.Net 2015 development environment, a demo was developed in C# with a WPF interface. Excluding the time to load data from the database, the algorithm takes 170 ms on the first run in a debugging environment (x64 Windows 10, i5-4590 CPU, 4 GB memory), and only 90 ms on subsequent runs. Determining the optimal values of N, BP, and BW requires up to 810 executions of the algorithm, consuming about 46,912 ms.
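The parameter search behind those 810 executions, ranking the storage costs SD and ST of each (N, BP, BW) combination and taking the minimum average rank OAVG, can be sketched as follows (Python, not the authors' C#; the SD/ST formulas themselves are not reproduced in the text, so the costs are taken here as precomputed inputs, and the example values are hypothetical):

```python
def choose_parameters(storage):
    """storage maps each parameter triple (N, BP, BW) to its measured
    storage costs (SD, ST). Rank each cost ascending to obtain OSD and
    OST, average them into OAVG, and return the triple minimising it."""
    triples = list(storage)

    def ranks(idx):
        # Rank of each triple when sorted by the idx-th cost, ascending.
        order = sorted(triples, key=lambda t: storage[t][idx])
        return {t: r for r, t in enumerate(order)}

    osd, ost = ranks(0), ranks(1)
    return min(triples, key=lambda t: (osd[t] + ost[t]) / 2)

# Hypothetical costs for three of the candidate combinations:
example = {(2, 9, 0.8): (10, 100), (3, 9, 0.8): (12, 90), (2, 8, 0.8): (11, 120)}
```

Here (2, 9, 0.8) ranks first on SD and second on ST, giving it the lowest OAVG of the three hypothetical candidates.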

Limitations and further improvement
This algorithm accomplishes the segmentation of historical text data in health examination based only on character statistics and information measurement, without manual intervention. It runs fast and efficiently, and achieves the expected results. However, the algorithm has limitations, as its accuracy still needs improvement. Possible reasons include: (i) results are input arbitrarily, causing irregularities and errors; (ii) some result sentences occur too infrequently to exhibit the language clues the algorithm needs; (iii) some sentences do not match the assumed KEY-CHOICE pattern; (iv) Chinese syntax and semantics are complicated; (v) the algorithm only measures and compares the connection tightness between two adjacent characters. To improve segmentation accuracy, future work may: (i) introduce professional Chinese word segmentation and other NLP tools; (ii) maintain a custom dictionary to adjust abnormal T_AB values; (iii) standardise physicians' input operation and screen for high-quality data; (iv) consider connection tightness across more than two characters.

Practical application
This algorithm has achieved the expected goals, laying a good foundation for follow-up work and research. Based on it, several research projects in our team are progressing smoothly.
Using this algorithm, we obtain the structural characteristics of all individual text item data and construct a mini knowledge graph for each item. Physicians can use these mini graphs to input text item data. Applying the segmentation results greatly reduces the degree of input freedom, so input can be performed by sliding a finger on the touch screen. As the algorithm can analyse and adapt to a physician's personal preferences, it can greatly improve the convenience and speed of Chinese character input. In the process of using this input method, accurately segmented results are touched often, while poor ones are seldom or never touched, so the accuracy of segmentation can be judged through use and interaction with physicians. Later, we will develop a new algorithm that uses this interaction information to judge the segmentation results and help the present algorithm improve its text segmentation ability.
This algorithm also structures the originally unstructured text data, reducing its degree of freedom and thereby greatly reducing the difficulty of analysing health examination text. It therefore contributes substantially to the design of an intelligent and automated method for evaluating health examination results.
The above studies will be reported later.

Conclusion
This study employs historical health examination data and performs long text segmentation in health examination based on character statistics and information measurement. The assumption is verified that, without professional domain knowledge, a large amount of historical data can provide valuable clues for text understanding. The toolkit can be used for automatic data analysis, encoding, lossless compression, encryption, structured storage, and information classification, thus making health assessment more automatic and intelligent. The results of this research are being applied and verified in the work of our team, such as the construction of a knowledge graph in the field of health examination, development of a special input method for health examination results, design of an intelligent and automated method for evaluating health examination results, visualisation of health examination results, and so on. Possible applications of the algorithm include: (i) automatic encoding and compression of text data: in the experiment above, each liver B ultrasound result needs an average of only 5.89 bytes of storage, significantly fewer than the original 56 bytes; the compression is lossless and faithful to the physician's input, so the data can be completely recovered, and it can greatly reduce the load on the network and database system; (ii) with this encoding, a certain degree of encryption can be achieved, improving the safety of medical information; (iii) with this encoding, the texts are better structured and greatly reduced in freedom, leading to better information classification, evaluation, and analysis.

Acknowledgments
This project is supported by the Health and Family Planning Commission of Zhejiang Province, Wenzhou Science & Technology Bureau, and Wenzhou People's Hospital.