Mapping chemical structure-activity information of HAART-drug cocktails over complex networks of AIDS epidemiology and socioeconomic data of U.S. counties

doi:10.1016/j.biosystems.2015.04.007

Biosystems

Volumes 132–133, June 2015, Pages 20-34


Abstract

Using computational algorithms to design tailored drug cocktails for highly active antiretroviral therapy (HAART) on specific populations is a goal of major importance for both the pharmaceutical industry and public health policy institutions. New combinations of compounds need to be predicted in order to design HAART cocktails. On the one hand, there are the biomolecular factors related to the drugs in the cocktail (experimental measure, chemical structure, drug target, assay organisms, etc.); on the other hand, there are the socioeconomic factors of the specific population (income inequalities, employment levels, fiscal pressure, education, migration, population structure, etc.), which are needed to study the relationship between socioeconomic status and the disease. In this context, machine learning algorithms able to seek models for problems with multi-source data have to be used. In this work, the first artificial neural network (ANN) model is proposed for the prediction of HAART cocktails to halt AIDS on epidemic networks of U.S. counties, using information indices that codify both biomolecular and several socioeconomic factors. The data were obtained from at least three major sources. The first dataset included assays of anti-HIV chemical compounds released to ChEMBL. The second dataset is the AIDSVu database of Emory University. AIDSVu compiled AIDS prevalence for >2300 U.S. counties. The third dataset included socioeconomic data from the U.S. Census Bureau. Three scales or levels were employed to group the counties according to location or population structure codes: state, rural urban continuum code (RUCC), and urban influence code (UIC). An analysis of >130,000 pairs (network links) was performed, corresponding to AIDS prevalence in 2310 U.S. counties vs. drug cocktails made up of combinations of ChEMBL results for 21,582 unique drugs, 9 viral or human protein targets, 4856 protocols, and 10 possible experimental measures.
The best model found with the original data was a linear neural network (LNN) with AUROC > 0.80 and accuracy, specificity, and sensitivity ≈ 77% in training and external validation series. Changing the spatial and population structure scale (State, UIC, or RUCC codes) does not affect the quality of the model. An imbalance was detected in all the models found when comparing positive/negative cases and linear/non-linear model accuracy ratios. Using the synthetic minority over-sampling technique (SMOTE), data pre-processing, and machine-learning algorithms implemented in the WEKA software, more balanced models were found. In particular, a multilayer perceptron (MLP) with AUROC = 97.4% and precision, recall, and F-measure >90% was found.
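To make the over-sampling step concrete, the following is a minimal sketch of the SMOTE idea (Chawla et al., 2002): each synthetic case is placed at a random position on the segment joining a minority-class case and one of its nearest minority-class neighbours. It is not the WEKA implementation used in the study; the function name `smote`, its parameters, and the two-dimensional toy descriptors are assumptions made for illustration only.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority cases by interpolating between a minority
    case and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority cases by Euclidean distance to the base case.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: math.dist(base, m))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical 2-D descriptors for four minority-class cases (illustrative only).
minority_cases = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_cases = smote(minority_cases, n_synthetic=6, k=2)
```

Because every synthetic case is a convex combination of two existing minority cases, the new points stay inside the region already occupied by the minority class, which is what distinguishes SMOTE from simply duplicating cases.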

Introduction

Computational algorithms may play an important role in elucidating structure–activity relationships for many molecular systems and biological problems (Aguilera and Rodriguez-Gonzalez, 2014; Barresi et al., 2013; Gonzalez-Diaz et al., 2011; Munteanu et al., 2009). In particular, theoretical biology has been useful in the study of anti-HIV drugs and/or their molecular targets (Jain Pancholi et al., 2014; Ogul, 2009; Speck-Planche et al., 2012; Weekes and Fogel, 2003; Xu et al., 2013). However, classic algorithms that connect the structure of a single molecule with its biological properties are unable to study the effect of combinations (cocktails) of drugs on epidemiological outbreaks in large populations with different social and economic factors. For instance, infections with HIV are commonly treated with antiretroviral drug combinations. These treatments could diminish the risk of HIV transmission (Castilla et al., 2005; Ping et al., 2013). In addition, the rates of disease progression, opportunistic infections, and mortality decreased with the implementation of HAART, and the combination of anti-HIV drugs resulted in longer survival and a better quality of life for people infected with the virus (Colombo et al., 2014). The most common drug treatment administered to patients consists of two nucleoside reverse transcriptase inhibitors combined with a non-nucleoside reverse transcriptase inhibitor, a “boosted” protease inhibitor, or an integrase strand transfer inhibitor (INSTI), which resulted in decreased HIV RNA levels (<50 copies/mL) at 48 weeks and CD4 cell increases in the majority of patients (Usach et al., 2013). Research indicates (McMahon et al., 2011) that, despite HAART, HIV-infected individuals who are poor, homeless, hungry, or less educated continue to have a higher risk of death.
Additionally, researchers (McMahon et al., 2011) suggest that HIV-infected individuals of low socioeconomic status (SES) are more likely to have increased mortality rates than those who are not living under these adverse conditions. Therefore, resources for HIV testing, care, and proven economic interventions should be directed to economically disadvantaged areas (McDavid Harrison et al., 2008).

The case of the United States (U.S.) is interesting for theoretical studies due to the abundance of epidemiological information. Holtgrave and Crosby (2003) found an important correlation (r = 0.469, p < 0.01) between income inequality and AIDS case rates at the state level in the U.S. In addition, in 2010, the U.S. National HIV Behavioural Surveillance System conducted a study of HIV infection among heterosexuals at increased risk, involving a total of 12,478 persons. Out of 8473 participants, 197 (2.3%) were positive for HIV infection, and prevalence was similar for men (2.2%) and women (2.5%). The study shows a higher prevalence in persons who reported less than a high school education (3.1%), compared with those with a high school education (1.8%). Income inequality, employment, and other social variables also seem to be relevant to AIDS epidemiology. Prevalence was also higher in those with an annual household income of less than $10,000 (2.8%), compared to those with an income of $20,000 or more (1.2%) (CDC, 2013). Moreover, the percentage of HIV-infected individuals was higher in participants who reported being unemployed (1.1%) or disabled (and unemployed) (2.7%), compared to employed ones (0.4%). Some authors, such as Mondal and Shitan (2013), have discussed the connections among life expectancy, income, educational attainment, fertility, health facilities, and HIV prevalence.

Recently, large amounts of data have been accumulated in public databases in the field of molecular biology. For instance, the ChEMBL database (https://www.ebi.ac.uk/chembl/) (Bento et al., 2013; Gaulton et al., 2012; Heikamp and Bajorath, 2011) provides data from life science experiments (Bento et al., 2013). In the same way, there are online resources containing epidemiological data on AIDS prevalence and data about socioeconomic factors at county level. These resources include AIDSVu (http://aidsvu.org), created by researchers at the Rollins School of Public Health at Emory University, and the U.S. Centers for Disease Control and Prevention (CDC). In this context, the search for computational chemistry algorithms that may prove useful for mapping structure–activity data of HAART-drug cocktails over AIDS epidemiology networks and socioeconomic data is of major importance. In a recent paper (González-Díaz et al., 2014), ANNs were used to link data related to AIDS in U.S. counties to ChEMBL data about the chemical structure and preclinical activity of anti-HIV compounds. ANNs are prediction models widely used in many areas of science, such as medicine, chemistry, and biochemistry, as well as in drug development. In the latter, they are very useful for the prediction of properties of potential drugs. ANNs loosely mimic the operation of the human brain and can extract results from complicated or imprecise data whose patterns are very difficult for humans or other computer techniques to detect (Burbidge et al., 2001; Guha, 2013; Patel, 2013; Speck-Planche et al., 2012). In that paper, indices of social networks and molecular graphs were used as input information. A Shannon information index based on the Gini coefficient was employed to quantify the effect of income inequality in the social network. In addition, Balaban’s information indices were used to quantify changes in the chemical structure of single anti-HIV drugs.
Last, Box–Jenkins moving average (MA) operators were also employed to quantify information about the deviations of drugs with respect to reference data subsets (targets, organisms, experimental parameters, protocols). In our previous paper (González-Díaz et al., 2014), the model found was able to link the deviations in the AIDS prevalence rates in the a-th county to the changes in the biological activity of the q-th drug (d_q).
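The moving-average idea described above can be illustrated with a short sketch. It assumes that the deviation of a drug is its index value minus the mean index of all assays sharing the same reference condition (here a hypothetical `target` label). The function name `ma_deviations`, the record layout, the field names, and the numbers are invented for illustration and are not taken from the paper.

```python
from collections import defaultdict


def ma_deviations(records, index_key, condition_key):
    """Return, for each record, its index value minus the mean index of the
    reference subset (all records sharing the same condition label)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        sums[rec[condition_key]] += rec[index_key]
        counts[rec[condition_key]] += 1
    means = {cond: sums[cond] / counts[cond] for cond in sums}
    return [rec[index_key] - means[rec[condition_key]] for rec in records]


# Hypothetical assay records: an information index per drug, grouped by target.
assays = [
    {"drug": "d1", "target": "t1", "index": 2.0},
    {"drug": "d2", "target": "t1", "index": 4.0},
    {"drug": "d3", "target": "t2", "index": 5.0},
]
deviations = ma_deviations(assays, "index", "target")
```

The same function could be called once per reference condition (target, organism, experimental parameter, protocol) to obtain one deviation term per condition, which is how a single drug descriptor becomes several condition-aware inputs.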

However, the previous computational chemistry algorithm fails to account for drug cocktails and many socioeconomic factors. This work is aimed at developing, for the first time, a computational algorithm for network epidemiology that is able to map structure–activity data of HAART-drug cocktails over complex networks of AIDS epidemiology and socioeconomic factors for >2000 U.S. counties.

Section snippets

Socioeconomic variables and Shannon-entropy transformation into information indices

In total, 17 variables were drawn from the AIDSVu and U.S. Census Bureau databases (http://www.census.gov/) and from the Internal Revenue Service (2014) (http://taxfoundation.org/). See the symbols and details of these variables in Table 1. All 17 socioeconomic variables (v_a) discussed previously come from very different original sources, describe different phenomena, and therefore use different scales.

In order to obtain a uniform and scale-unbiased representation of information, all these variables were
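The snippet above is cut off before the transformation itself is stated, so the exact formula is not reproduced here. As a hedged illustration of what a Shannon-entropy transformation of a raw variable can look like, the sketch below assumes that each county value is first rescaled to a probability-like share p = v / Σv and then mapped to the Shannon term −p·ln(p). Both the rescaling and the function name `shannon_indices` are assumptions, not the authors' definition.

```python
import math


def shannon_indices(values):
    """Map non-negative raw values to Shannon terms -p*ln(p), where p is the
    share of each value in the total (an assumed rescaling)."""
    total = sum(values)
    if total <= 0 or any(v < 0 for v in values):
        raise ValueError("values must be non-negative with a positive sum")
    shares = [v / total for v in values]
    # By convention the term is 0 when p == 0, since p*ln(p) -> 0 as p -> 0.
    return [-p * math.log(p) if p > 0 else 0.0 for p in shares]


# Hypothetical raw values of one socioeconomic variable in three counties.
indices = shannon_indices([1.0, 1.0, 2.0])
```

Whatever the original units of the variable (dollars, percentages, head counts), the output of such a transformation is dimensionless, which is the property the text asks for when it calls for a scale-unbiased representation.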

Two-way joining cluster analysis and principal components analysis

Two-way joining cluster analysis (TWJCA) and principal components analysis (PCA) are useful methods for reducing the dimensionality of datasets with many input variables. Two-way joining is useful in circumstances in which both cases and variables are expected to contribute simultaneously to meaningful patterns of clusters (Hill and Lewicki, 2006). A dichotomous approach to both TWJCA and PCA was used herein. That is, TWJCA and PCA of socioeconomic and biomolecular factors were
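As a small illustration of the PCA half of this section (TWJCA is not covered), the sketch below extracts the first principal component of a data table by power iteration on the sample covariance matrix. It uses only the Python standard library; the function name `first_principal_component` and the toy data are assumptions for illustration, and a real analysis would normally rely on a statistics package.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return the unit vector of largest variance (first principal component)
    via power iteration on the sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Start from a vector of ones; this works as long as it is not
    # orthogonal to the dominant eigenvector of the covariance matrix.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            raise ValueError("no variance along the current direction")
        vec = [x / norm for x in nxt]
    return vec


# Hypothetical table with two perfectly correlated variables (y = 2x).
table = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
pc1 = first_principal_component(table)
```

Projecting each row onto `pc1` would replace the two correlated columns with a single score, which is the sense in which PCA reduces the number of input variables.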

Conclusions

ALMA models were used to carry out a back-projection of the preclinical activity of drugs combined in a HAART cocktail over a complex network of AIDS in the U.S. counties. In this work, the UIC–LNN model was chosen because UIC is a more specific classification scheme of the population structure than the other ones and LNN is the simplest type of classification model. However, an imbalance was noted regarding the classification of positive/negative cases, as well as regarding the predictive

Acknowledgements

R.O.M. acknowledges financial support from an FPI fellowship funded by MECD (Ministry of Education, Culture and Sport, Spain).

References (44)

Internal Revenue Service (February, 2014). Tax Foundation,...
L.U. Aguilera et al.
Studying HIV latency by modeling the interaction between HIV proteins and the innate immune response
J. Theor. Biol.
(2014)
T. Barnett et al.
AIDS in the Twenty-first Century: Disease and Globalization
(2006)
V. Barresi et al.
Modeling, design and synthesis of new heteroaryl ethylenes active against the MCF-7 breast cancer cell-line
Mol. Biosyst.
(2013)
A.P. Bento et al.
The ChEMBL bioactivity database: an update
Nucleic Acids Res.
(2013)
S.H. Bertz
The first general index of molecular complexity
J. Am. Chem. Soc.
(1981)
D. Bonchev et al.
On topological characterization of molecular branching
Int. J. Quantum Chem. Quant. Chem. Symp.
(1978)
D.L. Brown et al.
Social and Economic Characteristics of the Population in Metro and Nonmetro Counties: 1970
(1976)
R. Burbidge et al.
Drug design by machine learning: support vector machines for pharmaceutical data analysis
Comput. Chem.
(2001)
J. Castilla et al.
Effectiveness of highly active antiretroviral therapy in reducing heterosexual transmission of HIV
J. Acquir. Immune Defic. Syndr.
(2005)
CDC
HIV infection among heterosexuals at increased risk – United States, 2010
MMWR Morb. Mortal Wkly. Rep.
(2013)
G.L. Colombo et al.
Cost analysis of initial highly active antiretroviral therapy regimens for managing human immunodeficiency virus-infected patients according to clinical practice in a hospital setting
Ther. Clin. Risk Manage.
(2014)
N.V. Chawla et al.
SMOTE: synthetic minority over-sampling technique
J. Artif. Int. Res.
(2002)
S.M. Dancoff et al.
Essays on the Use of Information Theory in Biology
(1953)
M.E. Falagas et al.
Socioeconomic status (SES) as a determinant of adherence to treatment in HIV infected patients: a systematic review of the literature
Retrovirology
(2008)
A. Gaulton et al.
ChEMBL: a large-scale bioactivity database for drug discovery
Nucleic Acids Res.
(2012)
L.M. Ghelfi et al.
A county-level measure of urban influence
Rural Dev. Perspect.
(1997)
H. Gonzalez-Diaz et al.
NL MIND-BEST: a web server for ligands and proteins discovery – theoretic-experimental study of proteins of Giardia lamblia and new compounds active against Plasmodium falciparum
J. Theor. Biol.
(2011)
H. González-Díaz et al.
Model of the multiscale complex network of AIDS prevalence in US at county level vs. preclinical activity of anti-HIV drugs based on information indices of molecular graphs and social networks
J. Chem. Inf. Model.
(2014)
R. Guha
On exploring structure-activity relationships
Methods Mol. Biol.
(2013)
M. Hall et al.
The WEKA data mining software: an update
SIGKDD Explor. Newslett.
(2009)
K. Heikamp et al.
Large-scale similarity search profiling of ChEMBL compound data sets
J. Chem. Inf. Model.
(2011)
