Elsevier

Biosystems

Volumes 132–133, June 2015, Pages 20-34

Mapping chemical structure-activity information of HAART-drug cocktails over complex networks of AIDS epidemiology and socioeconomic data of U.S. counties

https://doi.org/10.1016/j.biosystems.2015.04.007

Abstract

Using computational algorithms to design tailored drug cocktails for highly active antiretroviral therapy (HAART) in specific populations is a goal of major importance for both the pharmaceutical industry and public health policy institutions. New combinations of compounds need to be predicted in order to design HAART cocktails. On the one hand, there are the biomolecular factors related to the drugs in the cocktail (experimental measure, chemical structure, drug target, assay organisms, etc.); on the other hand, there are the socioeconomic factors of the specific population (income inequalities, employment levels, fiscal pressure, education, migration, population structure, etc.), which are needed to study the relationship between socioeconomic status and the disease. In this context, machine learning algorithms able to seek models for problems with multi-source data have to be used. In this work, the first artificial neural network (ANN) model is proposed for the prediction of HAART cocktails to halt AIDS on epidemic networks of U.S. counties, using information indices that codify both biomolecular and several socioeconomic factors. The data were obtained from at least three major sources. The first dataset included assays of anti-HIV chemical compounds released to ChEMBL. The second dataset is the AIDSVu database of Emory University, which compiled AIDS prevalence for >2300 U.S. counties. The third dataset included socioeconomic data from the U.S. Census Bureau. Three scales or levels were employed to group the counties according to location or population structure codes: state, rural–urban continuum code (RUCC), and urban influence code (UIC). An analysis of >130,000 pairs (network links) was performed, corresponding to AIDS prevalence in 2310 U.S. counties vs. drug cocktails made up of combinations of ChEMBL results for 21,582 unique drugs, 9 viral or human protein targets, 4856 protocols, and 10 possible experimental measures.
The best model found with the original data was a linear neural network (LNN) with AUROC > 0.80 and accuracy, specificity, and sensitivity ≈ 77% in training and external validation series. Changing the spatial and population structure scale (State, UIC, or RUCC codes) does not affect the quality of the model. An imbalance was detected in all the models found when comparing positive/negative cases and linear/non-linear model accuracy ratios. Using the synthetic minority over-sampling technique (SMOTE), data pre-processing, and machine-learning algorithms implemented in the WEKA software, more balanced models were found. In particular, a multilayer perceptron (MLP) with AUROC = 97.4% and precision, recall, and F-measure >90% was found.
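As a reading aid for the metrics quoted above, the following minimal Python sketch shows how accuracy, sensitivity (recall), specificity, precision, and F-measure are conventionally derived from a confusion matrix. It is not the authors' WEKA workflow: the names `ConfusionCounts` and `classification_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts from a binary confusion matrix (hypothetical container)."""
    tp: int  # positive cases predicted as positive
    fn: int  # positive cases predicted as negative
    tn: int  # negative cases predicted as negative
    fp: int  # negative cases predicted as positive


def classification_metrics(c: ConfusionCounts) -> dict:
    """Return the standard metrics; assumes every denominator is non-zero."""
    total = c.tp + c.fn + c.tn + c.fp
    sensitivity = c.tp / (c.tp + c.fn)   # also called recall
    specificity = c.tn / (c.tn + c.fp)
    precision = c.tp / (c.tp + c.fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }


# Illustrative counts only; they are not taken from the paper.
example = classification_metrics(ConfusionCounts(tp=80, fn=20, tn=60, fp=40))
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With these toy counts, sensitivity is clearly higher than specificity, which is one simple way a positive/negative imbalance of the kind mentioned in the abstract can show up in a classifier's results.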

Introduction

Computational algorithms may play an important role in elucidating structure–activity relationships for many molecular systems and biological problems (Aguilera and Rodriguez-Gonzalez, 2014; Barresi et al., 2013; Gonzalez-Diaz et al., 2011; Munteanu et al., 2009). In particular, theoretical biology has been useful in the study of anti-HIV drugs and/or their molecular targets (Jain Pancholi et al., 2014; Ogul, 2009; Speck-Planche et al., 2012; Weekes and Fogel, 2003; Xu et al., 2013). However, classic algorithms that connect the structure of a single molecule with its biological properties are unable to study the effect of combinations (cocktails) of drugs on epidemiological outbreaks in large populations with different social and economic factors. For instance, infections with HIV are commonly treated with antiretroviral drug combinations. These treatments could diminish the risk of HIV transmission (Castilla et al., 2005; Ping et al., 2013). In addition, the rates of disease progression, opportunistic infections, and mortality decreased with the implementation of HAART, and the combination of anti-HIV drugs resulted in longer survival and a better quality of life for people infected with the virus (Colombo et al., 2014). The most common drug treatment administered to patients consists of two nucleoside reverse transcriptase inhibitors combined with a non-nucleoside reverse transcriptase inhibitor, a “boosted” protease inhibitor, or an integrase strand transfer inhibitor (INSTI), which resulted in decreased HIV RNA levels (<50 copies/mL) at 48 weeks and CD4 cell increases in the majority of patients (Usach et al., 2013). Research indicates (McMahon et al., 2011) that, despite HAART, HIV-infected individuals who are poor, homeless, hungry, or less educated continue to have a higher risk of death.
Additionally, researchers (McMahon et al., 2011) suggest that HIV-infected individuals of low socioeconomic status (SES) are more likely to have increased mortality rates than those who are not living under these adverse conditions. Therefore, resources for HIV testing, care, and proven economic interventions should be directed to economically disadvantaged areas (McDavid Harrison et al., 2008).

The case of the United States (U.S.) is interesting for theoretical studies due to the abundance of epidemiological information. Holtgrave and Crosby (2003) found an important correlation (r = 0.469, p < 0.01) between income inequality and AIDS case rates at the state level in the U.S. In addition, in 2010, the U.S. National HIV Behavioral Surveillance System conducted a study of HIV infection among heterosexuals at increased risk, involving a total of 12,478 persons. Out of 8473 participants, 197 (2.3%) were positive for HIV infection, and prevalence was similar for men (2.2%) and women (2.5%). The study showed a higher prevalence in persons who reported less than a high school education (3.1%) than in those with a high school education (1.8%). Income inequality, employment, and other social variables also seem to be relevant to AIDS epidemiology. Prevalence was also higher in those with an annual household income of less than $10,000 (2.8%) than in those with an income of $20,000 or more (1.2%) (CDC, 2013). Moreover, the percentage of HIV-infected individuals was higher in participants who reported being unemployed (1.1%) or disabled (and unemployed) (2.7%) than in employed ones (0.4%). Some authors, such as Mondal and Shitan (2013), have discussed connections among life expectancy, income, educational attainment, fertility, health facilities, and HIV prevalence.

Recently, large amounts of data in the field of molecular biology have been accumulated in public databases. For instance, the ChEMBL database (https://www.ebi.ac.uk/chembl/) (Bento et al., 2013; Gaulton et al., 2012; Heikamp and Bajorath, 2011) provides data from life science experiments (Bento et al., 2013). In the same way, there are online resources containing epidemiological data on AIDS prevalence and data on socioeconomic factors at the county level. These resources are AIDSVu (http://aidsvu.org), created by researchers at the Rollins School of Public Health at Emory University, and the U.S. Centers for Disease Control and Prevention (CDC). In this context, the search for computational chemistry algorithms that may prove useful for mapping structure–activity data of HAART-drug cocktails over AIDS epidemiology networks and socioeconomic data is of major importance. In a recent paper (González-Díaz et al., 2014), ANNs were used to link data related to AIDS in U.S. counties to ChEMBL data on the chemical structure and preclinical activity of anti-HIV compounds. ANNs are prediction models widely used in many areas of science, such as medicine, chemistry, and biochemistry, as well as in drug development. In the latter, they are very useful for predicting the properties of potential drugs. ANNs approximate the operation of the human brain and can obtain results from complicated or imprecise data, which are very difficult for humans or other computer techniques to interpret (Burbidge et al., 2001; Guha, 2013; Patel, 2013; Speck-Planche et al., 2012). In that study, indices of social networks and molecular graphs were used as input information. A Shannon information index based on the Gini coefficient was employed to quantify the effect of income inequality in the social network. In addition, Balaban’s information indices were used to quantify changes in the chemical structure of single anti-HIV drugs.
Finally, Box–Jenkins moving average (MA) operators were also employed to quantify information about the deviations of drugs with respect to reference data subsets (targets, organisms, experimental parameters, protocols). In our previous paper (González-Díaz et al., 2014), the model found was able to link the deviations in the AIDS prevalence rates in the a-th county to the changes in the biological activity of the q-th drug (d_q).
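The idea of measuring how far a drug deviates from a reference subset can be illustrated with a short Python sketch. It computes, for each record, the difference between a descriptor value and the mean of that descriptor over all records sharing the same condition (here, the same target). This is only a plain group-mean deviation under stated assumptions; the exact operators used by the authors are not given in this excerpt, and the function name `moving_average_deviations`, the record keys, and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def moving_average_deviations(records, value_key, group_key):
    """For each record, return value - mean(value over its reference subset).

    The reference subset of a record is the set of all records that share
    the same label under `group_key` (e.g. the same target or organism).
    """
    # Collect the descriptor values of each reference subset.
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])

    # Average of each subset, then the deviation of every record from it.
    subset_means = {label: mean(values) for label, values in groups.items()}
    return [r[value_key] - subset_means[r[group_key]] for r in records]


# Toy records: the index values and target labels are made up.
assays = [
    {"drug": "d1", "index": 2.0, "target": "A"},
    {"drug": "d2", "index": 4.0, "target": "A"},
    {"drug": "d3", "index": 5.0, "target": "B"},
]
deviations = moving_average_deviations(assays, "index", "target")
```

The same helper could be called with a different `group_key` (organism, experimental parameter, or protocol) to obtain one deviation term per type of reference subset.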

However, the previous computational chemistry algorithm fails to account for drug cocktails and many socioeconomic factors. This work aims to develop, for the first time, a computational algorithm for network epidemiology that is able to map structure–activity data of HAART-drug cocktails over complex networks of AIDS epidemiology and socioeconomic factors for >2000 U.S. counties.

Section snippets

Socioeconomic variables and Shannon-entropy transformation into information indices

In total, 17 variables were extracted from the AIDSVu and U.S. Census Bureau (http://www.census.gov/) databases and from the Internal Revenue Service (2014) (http://taxfoundation.org/). See the symbols and details of these variables in Table 1. All 17 socioeconomic variables (va) discussed previously come from very different original sources, describe different phenomena, and therefore use different scales.

In order to obtain a uniform and scale-unbiased representation of information, all these variables were
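The snippet above is cut off before the transformation is stated, so the following Python sketch is only one plausible reading of a Shannon-entropy transformation, not the authors' formula. It assumes that the values of a single variable across counties are first normalised into proportions p = v / sum(v) and that each proportion is then mapped to the term -p·ln(p). The function name `shannon_information_indices` and the toy values are hypothetical.

```python
import math


def shannon_information_indices(values):
    """Map non-negative raw values to Shannon terms -p * ln(p).

    Assumption: p is the share of each value in the total, which removes
    the original measurement scale of the variable.
    """
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    total = sum(values)
    if total <= 0:
        raise ValueError("at least one value must be positive")

    indices = []
    for v in values:
        p = v / total
        # By convention, the term for p = 0 is taken as 0.
        indices.append(-p * math.log(p) if p > 0 else 0.0)
    return indices


# Toy values for one variable in three hypothetical counties.
toy_indices = shannon_information_indices([1.0, 1.0, 2.0])
```

Because every variable is reduced to dimensionless proportions before the logarithm is applied, variables recorded in dollars, percentages, or head counts end up on a comparable scale, which is the stated purpose of the transformation.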

Two-way joining cluster analysis and principal components analysis

Two-way joining cluster analysis (TWJCA) and principal components analysis (PCA) are useful methods for reducing the size of datasets with many input variables. Two-way joining is useful in circumstances in which both cases and variables are expected to contribute simultaneously to finding meaningful patterns of clusters (Hill and Lewicki, 2006). A dichotomous approach for both TWJCA and PCA was used herein. That is, TWJCA and PCA of socioeconomic and biomolecular factors were

Conclusions

ALMA models were used to carry out a back-projection of the preclinical activity of drugs combined in a HAART cocktail over a complex network of AIDS in U.S. counties. In this work, the UIC–LNN model was chosen because UIC is a more specific classification scheme of the population structure than the other ones and LNN is the simplest type of classification model. However, an imbalance was noted regarding the classification of positive/negative cases, as well as regarding the predictive

Acknowledgements

R.O.M. acknowledges the financial support of an FPI fellowship funded by MECD (Ministry of Education, Culture and Sport, Spain).

References (44)

  • Internal Revenue Service (February, 2014). Tax Foundation,...
  • L.U. Aguilera et al. Studying HIV latency by modeling the interaction between HIV proteins and the innate immune response. J. Theor. Biol. (2014)
  • T. Barnett et al. AIDS in the Twenty-first Century: Disease and Globalization (2006)
  • V. Barresi et al. Modeling, design and synthesis of new heteroaryl ethylenes active against the MCF-7 breast cancer cell-line. Mol. Biosyst. (2013)
  • A.P. Bento et al. The ChEMBL bioactivity database: an update. Nucleic Acids Res. (2013)
  • S.H. Bertz. The first general index of molecular complexity. J. Am. Chem. Soc. (1981)
  • D. Bonchev et al. On topological characterization of molecular branching. Int. J. Quantum Chem. Quant. Chem. Symp. (1978)
  • D.L. Brown et al. Social and Economic Characteristics of the Population in Metro and Nonmetro Counties: 1970 (1976)
  • R. Burbidge et al. Drug design by machine learning: support vector machines for pharmaceutical data analysis. Comput. Chem. (2001)
  • J. Castilla et al. Effectiveness of highly active antiretroviral therapy in reducing heterosexual transmission of HIV. J. Acquir. Immune Defic. Syndr. (2005)
  • CDC. HIV infection among heterosexuals at increased risk – United States, 2010. MMWR Morb. Mortal Wkly. Rep. (2013)
  • G.L. Colombo et al. Cost analysis of initial highly active antiretroviral therapy regimens for managing human immunodeficiency virus-infected patients according to clinical practice in a hospital setting. Ther. Clin. Risk Manage. (2014)
  • N.V. Chawla et al. SMOTE: synthetic minority over-sampling technique. J. Artif. Int. Res. (2002)
  • S.M. Dancoff et al. Essays on the Use of Information Theory in Biology (1953)
  • M.E. Falagas et al. Socioeconomic status (SES) as a determinant of adherence to treatment in HIV infected patients: a systematic review of the literature. Retrovirology (2008)
  • A. Gaulton et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. (2012)
  • L.M. Ghelfi et al. A county-level measure of urban influence. Rural Dev. Perspect. (1997)
  • H. Gonzalez-Diaz et al. NL MIND-BEST: a web server for ligands and proteins discovery – theoretic-experimental study of proteins of Giardia lamblia and new compounds active against Plasmodium falciparum. J. Theor. Biol. (2011)
  • H. González-Díaz et al. Model of the multiscale complex network of AIDS prevalence in US at county level vs. preclinical activity of anti-HIV drugs based on information indices of molecular graphs and social networks. J. Chem. Inf. Model. (2014)
  • R. Guha. On exploring structure-activity relationships. Methods Mol. Biol. (2013)
  • M. Hall et al. The WEKA data mining software: an update. SIGKDD Explor. Newslett. (2009)
  • K. Heikamp et al. 2011: Large-scale similarity search profiling of ChEMBL compound data sets. J. Chem. Inf. Model. (2011)