Curation of an international drug proprietary names dataset

A drug dataset containing international proprietary names is essential for researchers investigating drugs from different countries. Many websites offer free lookup of a single international drug trade name at a time, but not batch identification of a list of drugs. This becomes a problem when a researcher has a list of hundreds or thousands of drug trade names to identify. In this project, we created an International Drug Dictionary (IDD) by curating drug lists collected from open-access websites belonging to official drug regulatory agencies, official healthcare systems, or recognized scientific bodies in 44 countries, in addition to the European public assessment reports (EPAR) and the DRUGBANK vocabulary published in the public domain. Researchers interested in pharmacovigilance, pharmacoepidemiology, or pharmacoeconomics can benefit from this dataset, particularly when identifying lists of proprietary drug names of multinational origin. To enhance its adaptability, we also mapped the IDD to the standardized drug vocabulary RxNorm, so the IDD can also serve as a tool for mapping international drug trade names to RxNorm. Each drug entity in the IDD is mapped to a unique RxNorm identifier, the Atom Unique Identifier (RXAUI).

• The added English translation for active ingredients to some sources would also enhance matching.

Data Description
The data provided consist of one Excel file and one Microsoft Access database file. The Excel file contains a single table holding the whole dataset of drug names in the IDD, mapped to the ingredient level in RxNorm, as shown in Table 1.
The Access database file helps users with limited technical background retrieve active ingredient names from lists of brand names. Through a simplified, user-friendly interface, the user can search for a single drug or run a batch match by importing an Excel sheet of the drugs of interest into the Access database. The results of the list matching can be viewed immediately inside the Access database or exported to Excel or PDF format. However, since this dataset was not manually reviewed in full, we do not recommend using it to make any clinical decision.
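For readers who prefer a script to the Access interface, the batch matching described above can be approximated as a dictionary lookup. The following is a minimal sketch, not the authors' implementation: the function names, the normalization rule, and the toy rows in `IDD_ROWS` are assumptions made for illustration, and real rows would be read from the Excel table.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.split()).casefold()


def build_index(rows: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each normalized drug name to its ingredient string (first occurrence wins)."""
    index: Dict[str, str] = {}
    for drug_name, ingredient in rows:
        index.setdefault(normalize(drug_name), ingredient)
    return index


def batch_match(names: Iterable[str], index: Dict[str, str]) -> List[Tuple[str, Optional[str]]]:
    """Return (input name, ingredient or None) for every name in the input list."""
    return [(name, index.get(normalize(name))) for name in names]


# Toy rows for illustration only; they are not copied from the IDD file.
IDD_ROWS = [
    ("Panadol", "acetaminophen"),
    ("Tylenol", "acetaminophen"),
    ("Brufen", "ibuprofen"),
]

if __name__ == "__main__":
    index = build_index(IDD_ROWS)
    for name, ingredient in batch_match(["  panadol ", "BRUFEN", "Unknownol"], index):
        print(f"{name!r} -> {ingredient}")
```

Names that are not found come back with `None` instead of being dropped, so the output stays aligned with the input list.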

Experimental Design, Materials and Methods
Drug lists were downloaded from the official websites, and drug names were extracted from the original lists (i.e., Excel sheets, Access databases, CSV, XML, text, and PDF files), then merged into one larger table of proprietary drug names with their relevant alternative names or active ingredient names.
All individual source tables (i.e., text, CSV, and Excel files) were imported directly into a SQL Server database. Datasets composed of more than one table were first imported into a Microsoft Access database to define the needed relationships between tables and to export one unified table into the SQL Server database. If the source vocabulary was provided as a PDF file, the PDF was first converted into an Excel file, the Excel file was checked for integrity, and it was then imported into the SQL Server database. Only one dataset was in XML format (NHSBSA dm+d, UK); for that dataset, the relevant Excel files were extracted with the official dm+d XML transformation tool [4]. These tables were then imported into Microsoft Access to establish the proper relationships between the dataset files, and one table was extracted and exported to SQL Server. All tables from all sources were then merged into one larger table using SQL Server queries that selected distinct rows from each source. Next, multistep SQL Server queries were written for each source dataset separately to create alternate drug names that are shorter and cleaner than the source names as needed (i.e., removing dosage form, strength, unit of strength, or manufacturer name). The original drug entries were kept without any processing, except for very limited removal of trailing spaces and some unexplained trailing symbols from the original fields.
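The name-cleaning step can be illustrated with a short sketch. The source SQL Server queries are not given here, so this example substitutes Python regular expressions; the patterns, the list of dosage forms, and the function name `clean_trade_name` are all assumptions chosen for illustration and would need tuning for each source. Manufacturer names are not handled.

```python
import re

# Hypothetical patterns: a number followed by a unit of strength, e.g. "500 mg" or "1%".
STRENGTH = re.compile(r"\b\d+(?:[.,]\d+)?\s*(?:mg|mcg|g|ml|iu|%)(?![a-z])", re.IGNORECASE)
# Hypothetical, deliberately short list of dosage forms.
DOSAGE_FORM = re.compile(
    r"\b(?:tablets?|capsules?|syrup|cream|gel|injection|suspension)\b", re.IGNORECASE
)
# Punctuation that is typically left behind once strengths are removed.
LEFTOVER = re.compile(r"[/,;()]+")


def clean_trade_name(raw: str) -> str:
    """Return a shorter name with strength, unit, and dosage form removed.

    The raw value is not modified; the cleaned name is meant to be stored
    next to it as an additional field.
    """
    name = STRENGTH.sub(" ", raw)
    name = DOSAGE_FORM.sub(" ", name)
    name = LEFTOVER.sub(" ", name)
    return " ".join(name.split())


if __name__ == "__main__":
    for raw in ["Panadol 500 mg tablets", "Amoxil 250 mg/5 ml suspension", "Vitamin B12 100 mcg"]:
        print(f"{raw!r} -> {clean_trade_name(raw)!r}")
```

Hyphens are left in place on purpose, since some trade names contain them.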
If the source drug list was not in English and did not contain ATC codes for drug names, the relevant active ingredient names from the source list were translated: the list was exported to a Google sheet, translated into English on that sheet through the Google translation API, and the source drug list was then updated with the English translation as an alternative name for the active ingredient. The DRUGBANK vocabulary was added to the IDD because it contains synonym terms for active ingredients. This enhances the matching and linking between drug names, especially when mapping active ingredients that are spelled differently in different languages.
To enhance the adaptability of the IDD, we mapped drug names to the standardized drug vocabulary RxNorm. All the drug name fields in each source dataset (i.e., trade name, active ingredient, other alternative names if present, cleaned trade names, and cleaned active ingredient names) were matched separately to the STR column (drug string) of the RXNCONSO table in RxNorm, to map each drug entity in the IDD to a specific Atom Unique Identifier (RXAUI) from RxNorm. We then matched the ATC codes available in the source drug lists against the ATC codes in RxNorm to identify further drug names that could not be matched directly to the STR field. Only drug names that were successfully mapped to RxNorm were included in the final dictionary. Finally, each row in the dataset was split into multiple rows so that each new row contains only one drug name field (i.e., trade name, cleaned trade name, active ingredient, cleaned active ingredient, or alternative name) together with the RxNorm attributes of the original row (i.e., RXAUI, RXCUI, STR, SAB, TTY, and CODE).
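The mapping and row-splitting logic can be sketched as follows. This is an illustrative approximation under stated assumptions, not the SQL Server queries used for the IDD: the field names in `NAME_FIELDS`, the `atc_code` key, and the rule of stopping at the first name field that matches are hypothetical, ATC atoms are assumed to be RXNCONSO rows with `SAB == "ATC"` whose `CODE` holds the ATC code, and the RXAUI and RXCUI values in the toy data are placeholders rather than real identifiers.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]

# Hypothetical source field names, tried in this order.
NAME_FIELDS = ["trade_name", "cleaned_trade_name", "active_ingredient",
               "cleaned_active_ingredient", "alternative_name"]
RXNORM_ATTRS = ["RXAUI", "RXCUI", "STR", "SAB", "TTY", "CODE"]


def key(text: str) -> str:
    """Case-insensitive, whitespace-normalized matching key."""
    return " ".join(text.split()).casefold()


def build_indexes(rxnconso: List[Row]) -> Tuple[Dict[str, Row], Dict[str, Row]]:
    """Index atoms by STR and, for ATC atoms (assumed SAB == 'ATC'), by CODE."""
    by_str: Dict[str, Row] = {}
    by_atc: Dict[str, Row] = {}
    for atom in rxnconso:
        by_str.setdefault(key(atom["STR"]), atom)
        if atom["SAB"] == "ATC":
            by_atc.setdefault(atom["CODE"], atom)
    return by_str, by_atc


def map_row(row: Row, by_str: Dict[str, Row], by_atc: Dict[str, Row]) -> Optional[Row]:
    """Try each name field against STR, then fall back to the ATC code."""
    for field in NAME_FIELDS:
        value = row.get(field)
        if value and key(value) in by_str:
            return by_str[key(value)]
    atc = row.get("atc_code")
    return by_atc.get(atc) if atc else None


def split_row(row: Row, atom: Row) -> List[Row]:
    """Emit one output row per non-empty name field, each carrying the RxNorm attributes."""
    out: List[Row] = []
    for field in NAME_FIELDS:
        value = row.get(field)
        if value:
            record: Row = {"drug_name": value, "name_type": field}
            record.update({attr: atom[attr] for attr in RXNORM_ATTRS})
            out.append(record)
    return out


def build_dictionary(source_rows: List[Row], rxnconso: List[Row]) -> List[Row]:
    by_str, by_atc = build_indexes(rxnconso)
    result: List[Row] = []
    for row in source_rows:
        atom = map_row(row, by_str, by_atc)
        if atom is not None:  # rows that cannot be mapped are excluded
            result.extend(split_row(row, atom))
    return result


# Toy data; identifiers and term types are placeholders, not real RxNorm values.
RXNCONSO = [
    {"RXAUI": "A1", "RXCUI": "C1", "STR": "ibuprofen", "SAB": "RXNORM", "TTY": "IN", "CODE": "C1"},
    {"RXAUI": "A2", "RXCUI": "C2", "STR": "paracetamol", "SAB": "ATC", "TTY": "PT", "CODE": "N02BE01"},
]
SOURCE = [
    {"trade_name": "Brufen 400 mg", "cleaned_trade_name": "Brufen", "active_ingredient": "Ibuprofen"},
    {"trade_name": "Doliprane", "active_ingredient": "Paracétamol", "atc_code": "N02BE01"},
    {"trade_name": "Unknownol", "active_ingredient": "unknownium"},
]

if __name__ == "__main__":
    for record in build_dictionary(SOURCE, RXNCONSO):
        print(record["drug_name"], "->", record["RXAUI"], record["STR"])
```

In the toy data, the second source row has an ingredient name that differs from the RxNorm string only by an accent, so it is reached through its ATC code, while the third row matches nothing and is left out.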
To verify the integrity of the IDD, we randomly selected 50 drug names and manually verified the trade names and the linked active ingredients from RxNorm with other attributes.
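A random draw for this kind of manual check could look like the sketch below. The function name, the optional seed, and the placeholder entries are assumptions; the text does not describe how the 50 names were actually drawn.

```python
import random
from typing import List, Optional, Sequence


def sample_for_review(rows: Sequence[str], k: int = 50, seed: Optional[int] = None) -> List[str]:
    """Draw up to k distinct rows without replacement; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(list(rows), min(k, len(rows)))


if __name__ == "__main__":
    names = [f"drug_{i}" for i in range(1000)]  # placeholder entries, not IDD content
    picked = sample_for_review(names, seed=1)
    print(len(picked), "entries selected for manual verification")
```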
The data source for trade names should be an official website in each country and be freely available to the public. The trade name list must contain either the active ingredient name or the ATC code of the drugs in that list. Drug names should be in a Latin-based alphabet. The DRUGBANK [1] vocabulary was used as a source of synonyms and alternative names rather than as a source of trade names.
Description of data collection: We downloaded the datasets from the official websites of drug authorities or agencies in 44 countries, in addition to the European public assessment reports (EPAR) and the DRUGBANK vocabulary.
• Being a combined and curated dataset from 46 sources makes the IDD suitable for identifying most proprietary drug names globally.
• Researchers interested in pharmacovigilance, pharmacoepidemiology, or pharmacoeconomics can benefit from this dataset, especially when identifying lists of proprietary drug names.
• The IDD provides a tool for mapping proprietary drug names to the active ingredient level in the standard vocabulary RxNorm [3].
• The introduction of the cleaned trade names field would enhance matching with other drug lists.

Table 1
Preview of IDD in the Excel sheet format.
Abbreviations: RXAUI, the RxNorm atom identifier; RXCUI, the RxNorm concept identifier; STR, String as in RxNorm; SAB, Source abbreviation as in RxNorm; TTY, Term type in the source vocabulary as in RxNorm; CODE, the source asserted identifier as in RxNorm.