Fine-Tuning for Accuracy: Evaluation of GPT for Automatic Assignment of ICD Codes to Clinical Documentation

doi:10.21203/rs.3.rs-4392229/v1

Download PDF

Research Article

Fine-Tuning for Accuracy: Evaluation of GPT for Automatic Assignment of ICD Codes to Clinical Documentation

https://doi.org/10.21203/rs.3.rs-4392229/v1

This work is licensed under a CC BY 4.0 License

Version 1

posted

You are reading this latest preprint version

Background:

Assignment of International Classification of Disease (ICD) codes to clinical documentation is a tedious but important task that is mostly done manually. This study evaluated the widely popular OpenAI’s Generative Pretrained Model (GPT) 3.5 Turbo in facilitating the automation of assigning ICD codes to clinical notes.

Methods:

We identified the 10 most prevalent ICD-10 codes in the Medical Information Mart for Intensive Care (MIMIC-IV) dataset. We selected 200 notes for each code, and then split them equally into two groups of 100 each (randomly selected) for training and testing. We then passed each note to GPT 3.5 Turbo via OpenAI’s API, prompting the model to assign ICD-10 codes to each note. We evaluated the model’s response for the presence of the target ICD-10 code. After fine-tuning the GPT model on the training data, we repeated the process with the test data, comparing the fine-tuned model’s performance against the default model.

Results:

Initially the target ICD-10 code was present in the assigned codes by the default GPT 3.5 Turbo model in 29.7% of the cases. After fine-tuning with 100 notes for each top code, the accuracy improved to 62.6%.

Conclusions:

Historically, GPT’s performance for healthcare related tasks is sub-optimal. Fine-tuning as in this study provides great potential for improved performance, highlighting a path forward for integration of Artificial Intelligence (AI) in healthcare for improved efficiency and accuracy of this administrative task. Future research should focus on expanding the training datasets with specialized data and exploring the potential integration of these models into existing healthcare systems to maximize their utility and reliability.

Artificial Intelligence and Machine Learning

ICD-10 codes

Artificial Intelligence

GPT 3.5 Turbo

Clinical Documentation

Automation

The coding of diseases in the United States using the International Classification of Diseases (ICD) is required for billing and is the underlying technical structure for problem lists. Although it is an international common language, it is a largely manual process in electronic health records (EHR), which can be tedious, produces a backlog of work, and delays clinical processes.

The World Health Organization (WHO) regulates the ICD, utilizing a coding system that designates short codes for specific diseases.(1) Each code comprises alpha-numeric characters corresponding to various health categories, etiologies, manifestations, and severity.(2) The ICD is currently available in 43 languages and all WHO member states use it to share death and disease statistics. The system's global use also makes it a common language for healthcare providers to record, report, and communicate disease information efficiently between hospitals in various regions and countries.(1) Additionally, designating ICD codes for diseases based on health information listed in clinical notes includes a subjective component that may decrease the accuracy of code assignment. Artificial intelligence-based software that automates the ICD code assignment process can increase clinical efficiency by reducing the need for manual classification by healthcare providers.(3)

OpenAI is a non-profit artificial intelligence research company founded in 2015.(4) One of OpenAI’s most notable projects, ChatGPT, is a web application that is powered by a state-of-the-art LLM called Generative Pretrained Transformer (GPT).(5) ChatGPT functions as an intelligent chatting bot built upon its various language understanding mechanisms, such as multilingual machine translation, code debugging, story writing, mistake correction, and identification and rejection of inappropriate requests. These mechanisms allow users to input specific prompts and receive detailed responses.(6) GPT’s impressive performance in various tasks is possible due to its pre-training process, where the model is trained with large amounts of structured and unstructured data from books, articles, reviews, and online conversations.

This extensive pre-training process ultimately separates GPT from previous LLMs that failed to interpret the context of a given input and produce relevant output.(5) GPT’s ability to derive context from data without needing domain knowledge from medical experts can be utilized to extract relevant information from clinical notes. Electronic health records (EHR) are clinical patient records that contain medical information such as vitals, lab results, medical history, and clinical notes from providers. The efficient transmission and analysis of EHR data between various providers improve clinical care quality considerably.(7) The two primary methods to automate ICD code assignment to text-free clinical notes are rule-based systems and learning-based systems. The former depends on the manual intervention of medical professionals, thus limiting the scale by which the process can be optimized. The latter does not require manual manipulation and relies on learning algorithms to extract meaningful underlying distributions in datasets.(8) OpenAI’s GPT model is an example of a learning-based system that can be trained to assign ICD codes to clinical notes using datasets prelabelled with ICD codes. Fine-tuning is the process of providing the pre-trained model with a smaller, specialized dataset for further training. Considering the model possess contextual knowledge from the larger corpus it was originally trained on, it is able to derive important insights from the smaller but more specific dataset and potentially, significantly improve the performance of the model on a specialized task.(9) By using clinical training data and fine-tuning the GPT model using OpenAI’s platform, one can assess the effectiveness of using GPT for ICD code assignment based on clinical notes.

Electronic health records (EHR) are clinical patient records that contain medical information such as vitals, lab results, medical history, and clinical notes from providers. The efficient transmission and analysis of EHR data between various providers improve clinical care quality considerably.(7) GPT’s ability to derive context from data without needing domain knowledge from medical experts can be utilized to extract relevant information from clinical notes. The two primary methods to automate ICD code assignment to text-free clinical notes are rule-based systems and learning-based systems. The former depends on the manual intervention of medical professionals, thus limiting the scale by which the process can be optimized. The latter does not require manual manipulation and relies on learning algorithms to extract meaningful underlying distributions in datasets.(8)

This extensive pre-training process ultimately separates GPT from previous LLMs that failed to interpret the context of a given input and produce relevant output.(5) OpenAI’s GPT model is an example of a learning-based system that can be trained to assign ICD codes to clinical notes using datasets prelabelled with ICD codes. Fine-tuning is the process of providing the pre-trained model with a smaller, specialized dataset for further training. Considering that the model possesses contextual knowledge from the larger corpus on which it was originally trained, it is able to derive important insights from the smaller but more specific dataset and potentially to improve the performance of the model on a specialized task (9) By using clinical training data and fine-tuning the GPT model using OpenAI’s platform, one can assess the effectiveness of using GPT for ICD code assignment based on clinical notes.

Literature Review

AI in Healthcare

Artificial intelligence in healthcare can potentially lower healthcare costs and improve outcomes. An estimate is savings of $150 billion in the United States healthcare industry by 2026.(10) AI has found its place in healthcare as robotic-assisted surgical systems, virtual nurse assistants, medications management, medical diagnostics, and so on.(10) This paper will discuss the potential of a specific AI model, GPT-3.5 Turbo for ICD code assignment.

Challenges of ICD Codes Implementation

There are several costs associated with the use of ICD codes, especially during a time of transition (from ICD-9 to ICD-10 in 2015 or currently from ICD-10 to ICD-11). One survey of 6000 medical centers found the average time spent on staff education was 61.2 hours for small, and 139 hours for medium-sized practices, and for physician education was 35.6 hours and 75.1 hours, respectively.(11) The average cost of the ICD-10-CM implementation in the United States was between $6,748 to $9,564 for a small medical practice and between $14,577 to $23,062 for a medium-sized medical practice. These costs included software updates, staff education, and EHR quality assurance projects.(11) Furthermore, the transition from ICD-10-CM code to the new ICD-11 system will take time. One 2021 study found that approximately 23.5% of ICD-10-CM codes only could be fully represented by a single ICD-11 stem code without the need for combining multiple codes and, if necessary, introducing new stem codes.(12) Most studies focus on the financial and time burden, but less research is available on the emotional stress within the healthcare system surrounding ICD coding. In clinical practice, many clinicians are frustrated by the emphasis on medical coding. One 2015 survey found that over 85% of surveyed clinicians said ICD-10 diverts focus from patient-centered care and more towards insurance and billing.(13)

Coding Errors

Insurance companies and Medicare use these codes in the diagnosis-related group’s payment system (DRG) to determine payments to hospitals.(14) Correct coding of patient encounters is exceedingly important, and failure to correctly code can have several financial and even legal repercussions for a medical practice. Some of the most common errors include upcoding (reporting that a provider spent more time with a patient than in reality), selecting the wrong procedure code, and using dated coding term instead of updated ones.(15) In the U.S., the quality of the coding process has been questioned by many studies showing there is significant room for improvement. One study by the National Academy of Medicine on the reliability of hospital discharge coding showed that only 65% agreed with independent re-coding.(16) Hsia, et al revealed a coding error rate of 20%.(17) Other similar studies showed a typical error rate of 25–30%, with low agreement between coders.(18) A report analyzing the previous ICD-9-CM codes estimated that the cost of correcting wrong codes in the U.S. was upwards of $25 billion per year. Manual coding for diverse disease etiologies, pathologies, clinical manifestations, and treatment plans is not only prone to errors but is also time-consuming and inefficient.(19)

To overcome the challenges associated with using ICD code assignments and implementation, AI seems to offer a viable solution.

AI for Assigning ICD Codes

A review of 1611 publications with automated coding from 1974–2020 found a significant increase in AI-based coding publications after 2009, with Natural Language Processing (NLP) and Machine Learning (ML) as the most used methodologies for automated coding.(20) An example is the successful collaboration between a Clinical Documentation Integrity Specialist and an embedded Computer Assisted Coding (CAC) system.(21) The ICD provides a taxonomy of classes, representing various conditions addressed at an episode of care for a patient, as presented in clinical documentation. Considering clinical documentation consists of unstructured textual data, and a single note can have multiple ICD codes assigned to it, it can thus be treated as a multi-label classification problem.(22) Deep learning-based methods have outperformed other conventional models in ICD codes assignment.(23) A systemic review of studies from 2010 to 2021 provided an overview of automatic ICD coding assignment systems that utilized NLP, machine learning, and deep learning techniques, and concluded that deep learning models were found to be better than other traditional machine learning models when automating clinical coding systems.(24)

Utilizing NLP techniques such as Word Embedding (a representation of words and phrases by vectors in a low-dimensional space such that it retains semantic and syntactic information) and a Convolutional Neural Network model (a deep learning algorithm that captures hierarchical patterns in textual data utilizing convolutional layers), another study processed 21,953 clinical records from five departments, significantly enhancing the accuracy of automated ICD-10 code predictions and potentially easing the manual coding process for physicians.(25) A similar study analyzed the use of a natural language processing-bidirectional recurrent neural network (NLP-BIRNN) algorithm to optimize the medical records and identified areas of error by medical coders. NLP-BIRNN is a deep learning algorithm that processes sequences of text in both forward and reverse directions, thus retaining contextual information from both past and future states. NLP-BIRNN reduced errors in the assignment of principal diagnosis and ICD coding.(26) The introduction of transformers (deep learning models that rely on self-attention mechanisms, processing entire sentences simultaneously, rather than word-by-word, being a lot more efficient at retaining context and thus exceptional at linguistic tasks) and Large Language Models (AI systems based on transformer architecture, trained on diverse language dataset to understand, generate and interact with human language at large) opened new doors. Publicly available systems like ChatGPT make these models available to the general public and it was only a matter of time before professionals started experimenting with these systems for healthcare applications. One study found that ChatGPT was able to generate at least one correct ICD code for an encounter 70% of the time.(27) Another compared off-the-shelf LLM models GPT-4, Llama-2 and a model specifically trained for ICD code assignments known as PLM-ICD and showed that PLM-ICD had a consistent accuracy of 22%, while GPT-4 accuracy was 22.5% as represented by F1-score.(28) The objective of the current project is to evaluate the precision of GPT to assign ICD-10 codes, and whether fine tuning can improve its performance.

The Dataset

We used free-text discharge summaries from the Medical Information Mart for Intensive Care IV Note (MIMIC-IV-Note) dataset which contains de-identified clinical data of over 40,000 patients admitted to the Beth Israel Deaconess Medical Center.(29, 30) Each discharge summary contains free text data describing the initial presentation, course of treatment during the specific hospital encounter, and also includes diagnostic data. Access to this dataset is restricted and requires the user to sign a Data Use Agreement with PhysioNet.(31) Each admission encounter includes a unique Hospital Admission ID denoted as ‘hadm_id’ in the dataset. We downloaded the free text de-identified clinical notes from the dataset contained in the “discharges.csv” file(30) as well as the file containing the ICD diagnosis contained in the “diagnosis_icd.csv” file.(29) The file containing ICD codes has both ICD-10 as well as ICD-9 codes assigned for each admission encounter as represented by “hadm_id”. By default, the dataset is arranged such that the diagnosis_icd.csv file will have multiple entries for the same note, depending on how many ICD-9 and ICD-10 codes are assigned to that specific note. In order make a single aggregated table, we then joined the two tables at the data field “hadm_id”, such that each “hadm_id” was a unique entry, with a single unique discharge summary note entry, and a list of ICD-10 codes assigned to that note as a single entry in the ICD-codes column. Some basic statistics about the dataset are shown in Table 1.

Table 1

Summary statistics of the final dataset.
Total number of discharge summaries	122300
Average number of ICD-10 codes per discharge summary	14.4
Maximum number of ICD-10 codes assigned to a note	39
Minimum number of ICD-10 codes assigned to a note	1

To evaluate the performance of the model, we adopted an approach similar to Huang, et. al.(3) In this approach, each ICD code is treated independently of the others and discharge summaries related to only the top 10 most prevalent ICD codes are considered. For each specific ICD code under review, we verify whether it appears in the model’s list for each corresponding note. Accuracy is calculated based on this criterion. For instance, if the code is correctly predicted to be applicable in 80 out of the 100 notes, the model achieves an accuracy rate of 80% for that code. The reason this approach seems logical as the assignment of ICD codes to clinical documentation is subjective to some extent. It is possible that in the MIMIC-IV dataset, since codes were assigned manually, not all possible codes were covered for each note. Thus, comparing the entire list of ICD-10 codes assigned by the model to a specific note to the manual list might be challenging since exact matches will be rare. Thus, we calculated the top 10 most prevalent ICD-10 codes in the dataset as shown in Table 2.

Table 2

Top 10 most prevalent ICD-10 codes in the dataset.
ICD-10 Code	Diagnosis	Count
E785	Hyperlipidemia, unspecified	44044
I10	Essential (primary) hypertension	43574
Z87891	Personal history of nicotine use	36299
K219	Gastroesophageal reflux disease without esophagitis	30803
F329	Major depressive disorder, single episode, unspecified	23231
I2510	Atherosclerotic heart disease of native coronary artery without angina pectoris	22609
N179	Acute kidney failure, unspecified	19706
F419	Anxiety disorder, unspecified	19155
Z7901	Long term (current) use of anticoagulants	15323
Z794	Long term (current) use of insulin	15277

With Python code, for each of the top-10 most prevalent code, we selected 200 notes randomly and divided them into 2 groups of 100 notes each: a training group, and a testing group. Thus, total of 2000 notes were selected, with 1000 for fine-tuning, and 1000 for testing. Figure 1 displays a summary of the dataset preparation.

The Model

OpenAI has several GPT models available to the public via their Application Programming Interface (API). The latest one is GPT-4, however, as of writing this manuscript, fine tuning GPT-4 is available only on an experimental basis to a limited number of users. Therefore, we decided to use the GPT-3.5 Turbo model, which would be the latest model offered by OpenAI that can be fine-tuned.

Evaluating the Base Model

For each ICD-10 code in the top 10 codes, we passed the notes from the testing dataset one-by-one to the GPT-3.5 model via an OpenAI API and prompted it to assign a list of ICD-10 codes based on the information in each note. We then checked if the target ICD-10 code was present in the returned response or not.

Model Fine-Tuning and Evaluation

Once the base model evaluation was completed, we used the remaining 1000 notes in the training dataset (100 notes for each of the top-10 most common ICD-10 codes) to fine-tune the model. We used the web-based methodology for fine-tuning the model as offered by OpenAI. The training data was prepared in the format required and specified by OpenAI and uploaded to their server. The model expects each data point in the fine-tuning dataset to be labelled as “prompt”, which represents the input data, which in this case is a discharge summary note, along with the instructions for the model, and “output”, representing the expected output, which in this case is a list of ICD-10 codes assigned to the provided note. Table 3 shows an example of a data point in the dataset used for fine-tuning.

Table 3

A sample data point in the fine-tuning dataset.
Prompt	Response
“Assign ICD-10 codes to the following discharge summary: {Discharge Summary}*”	"['M1612', 'E871', 'I10', 'Z96641', 'I482', 'Z7901', 'E785', 'K219', 'F17290', 'R339', 'R42']"
*A place holder representing a discharge summary note. Full note not displayed to save space.

Once fine-tuning was completed, the custom fine-tuned model was then evaluated with the testing dataset and data recorded, similar to the methodology used for the base model as described above. Figure 2 shows a summary of the model evaluation flow.

This study did not enroll individual participants nor use identifiable private information. The Penn State Health Institutional Review Board granted an exemption for informed consent.

The target ICD-10 was present in the assigned list of ICD-10 codes by the model 29.7% of the time without fine tuning. The fine-tuned model, however, performed almost twice as better with the target ICD-10 code present in the assigned list of codes by the fine-tuned model 62.6% of the time. Figure 3 summarizes the results.

Performance improved across the board for all codes 5 to 52% but varied from 47 to 82% for each code individually. Table 4 shows the base-model as well as the fine-tuned model’s accuracy for each code.

Table 4

A comparison of the models’ accuracy for each code as well as summary statistics.
Index	Code	Base Model Accuracy (%)	Fine-tuned Model Accuracy (%)	Absolute Improvement
0	E785	39	69	30
1	I10	59	82	23
2	Z87891	10	47	37
3	K219	33	74	41
4	F329	35	75	40
5	I2510	56	61	5
6	N179	12	58	46
7	F419	15	67	52
8	Z7901	23	33	10
9	Z794	15	60	45
Max		59	82
Min		10	33
Mean		29.7	62.6

Training a specialized model for such tasks is resource-intensive involving the collection and preparation of extensive datasets. Fine-tuning, however, presents a viable alternative. This approach involves taking a pre-trained model, which has already learned features from a large, diverse dataset, and adjusting it to perform a specific task. This provides the model with a task specific dataset, including the expected outputs. Fine-tuning can thus significantly improve a task specific performance, as demonstrated by our work which doubled the accuracy of the model.

Fine-tuning offers several other benefits:

It requires a reduced amount of data compared to training a model from scratch. This is particularly advantageous in healthcare where acquiring large amounts of data can be challenging due to privacy concerns as well as due to the rarity of certain medical conditions.

Since the model has already learned the basic patterns and features, fine-tuning for a specific task requires much less computational time and resources.

Fine-tuning allows the model to maintain general capabilities while gaining proficiency in a specific task.

Cloud-based, publicly available AI models offer an option that is accessible and cost-effective as compared to developing, training, and deploying a custom model from scratch. This has led to a wide adoption of these models in various sectors, including healthcare. We are seeing a wave of AI based applications in healthcare, mostly powered by third party cloud-based models offered via APIs. These models offer significant advantages like scalability, reduced infrastructure cost, and ease of integration.

Performance of these models for specialized healthcare-related tasks like assigning ICD codes to clinical notes remains subpar. Saroush, et al(32) demonstrated that prompting GPT-3.5 and − 4 via the ChatGPT interface by providing descriptions of the ICD-10 code predicted the correct ICD-10 codes only 10% (GPT-3.5) and 13% (GPT-4) of the time. Boyle, et al(28) observed similar results. Healthcare tasks require an understanding of medical terminology and context, which the generic AI models might not possess. The GPT models have been trained on a large dataset obtained from the internet. One would assume that these data will contain medical data as well, curated from openly accessible journals, user posts on open forums, and websites like Wikipedia and Medscape. However, there is no room for error when one is working with real patient data. The data used for training these models have not been vetted for medical accuracy and the model output may not always be accurate. Our study, using a combination of specific EHR data, AI, and fine-tuning, suggests that we can improve the accuracy of ICD-10 coding. This is true even when using the less mature GPT-3.5.

Furthermore, healthcare data is inherently private and data security is a major concern. Using a model hosted by a third-party risks exposure of Protected Health Information (PHI). Therefore, before implementing such technologies, one must ensure security of the PHI remains uncompromised. This requires deploying models locally within the healthcare institution to maintain control over data security, strict business associate and data usage agreements with any external parties, absolute restrictions on selling or otherwise sharing the data with any other entities and insisting that technology partners verify their compliance with data security standards.

We acknowledge the limitations of our study. First, we used a specific dataset, the MIMIC-IV. Our observations may not be generalized to other datasets.

Another limitation is the extent of the model’s fine-tuning, which was done with a predetermined number for notes for each of the 10 most prevalent codes. The number was chosen arbitrarily and thus a larger number of notes could enhance performance with a larger number of notes.

The task of ICD code assignment to clinical notes by itself comes with inherent challenges. The linguistic similarities between different codes can lead to complexities in accurate assignment. The subjective nature of code assignment by different experts can result in varying sets of codes assigned to the same clinical note. Furthermore, the consistency of the content of the model’s response for the exact same prompt and summary passed to it may vary. Thus, a direct comparison of the model assigned codes with the manually assigned codes can be very challenging. Keeping this in mind, adopting the methodology used by Huang, et al(3) for the model evaluation offers a structured and logical way to navigate this challenge. This also means that the model will still likely miss some codes as well as assign wrong codes. But with fine-tuning, the accuracy will improve significantly as shown by our study.

The AI landscape is rapidly evolving, and we are seeing more LLMs being released. Companies like Google have announced LLMs specifically trained on healthcare data but their availability at the moment is limited to a selected group of users. It would be interesting to evaluate the performance of such models for various healthcare related tasks including assignment of ICD codes.

This study presents an evaluation of the OpenAI GPT-3.5 Turbo model for the assignment of ICD-10 codes to a clinical note. We demonstrated that at baseline the model’s performance of a healthcare-related task is inadequate, however, there is potential for marked enhancement of performance with fine-tuning.

Our study further illuminates the broader implications for the adoption of publicly available and affordable AI models, emphasizing the importance of fine-tuning these pre-trained models to meet the unique demands of healthcare tasks like medical coding.

Future work to evaluate the amount of data required for fine-tuning such models for optimal performance would be revealing, as would evaluating models trained on healthcare data for healthcare associated tasks.

Author Contributions

(I) Conception and design: Khalid Nawab

(II) Administrative support: Shadi Hijjawi, Richard Schreiber

(III) Provision of study materials or patients: Khalid Nawab, Shadi Hijjawi, Richard Schreiber

(IV) Collection and assembly of data: Khalid Nawab, Gulalai Khan, Iqbal Hussain, Riya Arora

(V) Data analysis and interpretation: Khalid Nawab, Gulalai Khan, Riya Arora, Sayuj Atreya

(VI) Manuscript writing: All authors

(VII) Final approval of manuscript: All authors

Conflict of interest statement

The authors declare that there are no conflicts of interest regarding the publication of this manuscript. No financial or personal relationships have influenced the work reported in this paper.

Hirsch JA, Nicola G, McGinty G, et al. ICD-10: History and Context. AJNR Am J Neuroradiol [Internet]. 2016 Apr 1 [cited 2024 Jan 18];37(4):596. Available from: /pmc/articles/PMC7960170/
Cartwright DJ. ICD-9-CM to ICD-10-CM Codes: What? Why? How? Adv Wound Care (New Rochelle) [Internet]. 2013 Dec [cited 2024 Jan 18];2(10):588. Available from: /pmc/articles/PMC3865615/
Huang J, Osorio C, Sy LW. An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes. Comput Methods Programs Biomed. 2019 Aug 1;177:141–53.
Introducing OpenAI [Internet]. [cited 2024 Jan 18]. Available from: https://openai.com/blog/introducing-openai
Roumeliotis KI, Tselikas ND. ChatGPT and Open-AI Models: A Preliminary Review. Future Internet 2023, Vol 15, Page 192 [Internet]. 2023 May 26 [cited 2024 Jan 18];15(6):192. Available from: https://www.mdpi.com/1999-5903/15/6/192/htm
Wu T, He S, Liu J, et al. A Brief Overview of ChatGPT: The History, Status Quo and Potential Future Development. IEEE/CAA Journal of Automatica Sinica. 2023 May 1;10(5):1122–36.
Campanella P, Lovato E, Marone C, et al. The impact of electronic health records on healthcare quality: a systematic review and meta-analysis. Eur J Public Health [Internet]. 2016 Feb 1 [cited 2024 Jan 31];26(1):60–4. Available from: https://dx.doi.org/10.1093/eurpub/ckv122
Sy LW. An Empirical Evaluation of Deep Learning for ICD-9 Code Assignment using MIMIC-III Clinical Notes. Comput Methods Programs Biomed [Internet]. [cited 2024 Jan 18]; Available from: https://www.academia.edu/48411204/An_Empirical_Evaluation_of_Deep_Learning_for_ICD_9_Code_Assignment_using_MIMIC_III_Clinical_Notes
Fine-tuning - OpenAI API [Internet]. [cited 2024 Apr 11]. Available from: https://platform.openai.com/docs/guides/fine-tuning
Väänänen A, Haataja K, Vehviläinen-Julkunen K, et al. AI in healthcare: A narrative review. F1000Research 2021 10:6 [Internet]. 2021 Oct 8 [cited 2024 Jan 18];10:6. Available from: https://f1000research.com/articles/10-6
Desai P, Eljazzar R. Post-Implementation Cost-Analysis of the ICD-10-CM Transition on Small and Medium-Sized Medical Practices. J Health Med Econ [Internet]. 2018 May 25 [cited 2024 Jan 22];4(1):4. Available from: https://health-medical-economics.imedpub.com/postimplementation-costanalysis-of-the-icd10cm-transition-on-small-andmediumsized-medical-practices.php?aid=22696
Fung KW, Xu J, McConnell-Lamptey S, Pickett D, et al. Feasibility of replacing the ICD-10-CM with the ICD-11 for morbidity coding: A content analysis. J Am Med Inform Assoc [Internet]. 2021 Nov 1 [cited 2024 Jan 22];28(11):2404. Available from: /pmc/articles/PMC8510319/
% of Physicians Say ICD-10 Diverts Focus from Patient Care [Internet]. [cited 2024 Jan 22]. Available from: https://revcycleintelligence.com/news/86-of-physicians-say-icd-10-diverts-focus-from-patient-care
Mihailovic N, Kocic S, Jakovljevic M. Review of Diagnosis-Related Group-Based Financing of Hospital Care. Health Serv Res Manag Epidemiol [Internet]. 2016 May 10 [cited 2024 Jan 22];3. Available from: https://pubmed.ncbi.nlm.nih.gov/28462278/
ICD-10-CM and CPT® Coding Mistakes Can Cost You – And not Just Financially – MedLearn Publishing [Internet]. [cited 2024 Jan 22]. Available from: https://medlearn.com/icd-10-cm-and-cpt-coding-mistakes-can-cost-you-and-not-just-financially/
Medicine I of. Reliability of Medicare Hospital Discharge Records: Report of a Study. Reliability of Medicare Hospital Discharge Records. 1977 Jan 1;
Hsia DC, Krushat WM, Fagan AB, et al. Accuracy of diagnostic coding for Medicare patients under the prospective-payment system. N Engl J Med [Internet]. 1988 Feb 11 [cited 2024 Jan 31];318(6):352–5. Available from: https://pubmed.ncbi.nlm.nih.gov/3123929/
Fung KW, Xu J, Rosenbloom ST, Campbell JR. Using SNOMED CT-encoded problems to improve ICD-10-CM coding—A randomized controlled experiment. Int J Med Inform. 2019 Jun 1;126:19–25.
Lang: Consultant report-natural language processing... - Google Scholar [Internet]. [cited 2024 Jan 22]. Available from: https://scholar.google.com/scholar_lookup?journal=Cincinnati+Children%E2%80%99s+Hospital+Medical+Center&title=Consultant+report-natural+language+processing+in+the+health+care+industry&author=D+Lang&volume=Winter&issue=6&publication_year=2007&
Ramalho A, Souza J, Freitas A. The use of artificial intelligence for clinical coding automation: A bibliometric analysis. Advances in Intelligent Systems and Computing [Internet]. 2021 [cited 2024 Jan 18];1237 AISC:274–83. Available from: https://link.springer.com/chapter/10.1007/978-3-030-53036-5_30
Bossen C, Pine KH. Batman and Robin in Healthcare Knowledge Work: Human-AI Collaboration by Clinical Documentation Integrity Specialists. ACM Transactions on Computer-Human Interaction [Internet]. 2023 Mar 17 [cited 2024 Jan 31];30(2). Available from: https://dl.acm.org/doi/10.1145/3569892
Kaur R, Ginige JA. Analysing Effectiveness of Multi-Label Classification in Clinical Coding. ACM International Conference Proceeding Series [Internet]. 2019 Jan 29 [cited 2024 Jan 18]; Available from: https://dl.acm.org/doi/10.1145/3290688.3290728
Huang J, Osorio C, Sy LW. An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes. Comput Methods Programs Biomed. 2019 Aug 1;177:141–53.
Kaur R, Ginige JA, Obst O. AI-based ICD coding and classification approaches using discharge summaries: A systematic literature review. Expert Syst Appl. 2023 Mar 1;213:118997.
Masud JHB, Kuo CC, Yeh CY, et al. Applying Deep Learning Model to Predict Diagnosis Code of Medical Records. Diagnostics [Internet]. 2023 Jul 1 [cited 2024 Jan 18];13(13). Available from: /pmc/articles/PMC10340491/
Wang C, Yao C, Chen P, et al. Artificial Intelligence Algorithm with ICD Coding Technology Guided by Embedded Electronic Medical Record System in Medical Record Information Management. Microprocess Microsyst [Internet]. 2023 Oct 13 [cited 2024 Jan 18];104962. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0141933123002065
Ong J, Kedia N, Harihar S, Vupparaboina SC, et al. Applying large language model artificial intelligence for retina International Classification of Diseases (ICD) coding. J Med Artif Intell [Internet]. 2023 Oct 30 [cited 2024 Jan 19];6(0). Available from: https://jmai.amegroups.org/article/view/8198/html
Boyle JS, Kascenas A, Lok P, et al. Automated clinical coding using off-the-shelf large language models. 2023 Oct 10 [cited 2024 Jan 19]; Available from: https://arxiv.org/abs/2310.06552v3
MIMIC-IV v2.2 [Internet]. [cited 2024 Jan 30]. Available from: https://physionet.org/content/mimiciv/2.2/#files-panel
MIMIC-IV-Note: Deidentified free-text clinical notes v2.2 [Internet]. [cited 2024 Jan 30]. Available from: https://physionet.org/content/mimic-iv-note/2.2/note/#files-panel
License Content [Internet]. [cited 2024 Apr 12]. Available from: https://physionet.org/about/licenses/physionet-credentialed-health-data-license-150/
A S, BS G, E Z, et al. Assessing GPT-3.5 and GPT-4 in Generating International Classification of Diseases Billing Codes. 2023 Jul 9 [cited 2024 Jan 30]; Available from: https://europepmc.org/article/ppr/ppr688592

The authors declare no competing interests.

Download PDF

Version 1

posted

You are reading this latest preprint version

Fine-Tuning for Accuracy: Evaluation of GPT for Automatic Assignment of ICD Codes to Clinical Documentation

Status:

Version 1

Abstract

Background:

Methods:

Results:

Conclusions:

Figures

Introduction

Background

Methods

Results

Discussion

Conclusions

Declarations

Author Contributions

Conflict of interest statement

References

Additional Declarations

Status:

Version 1