Published in Vol 8, No 6 (2020): June

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/16886.
Identification of High-Order Single-Nucleotide Polymorphism Barcodes in Breast Cancer Using a Hybrid Taguchi-Genetic Algorithm: Case-Control Study

Original Paper

1I-Shou University, Kaohsiung City, Taiwan

2Ditmanson Medical Foundation Chia-Yi Christian Hospital, Chiayi City, Taiwan

3Department of Electronic Engineering, National Kaohsiung University of Science and Technology, Kaohsiung City, Taiwan

4Drug Development and Value Creation Research Center, Kaohsiung Medical University, Kaohsiung, Taiwan

5College of Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan

Corresponding Author:

Cheng-Hong Yang, PhD

Department of Electronic Engineering

National Kaohsiung University of Science and Technology

No. 415 Jiangong Road, San-Min District

Kaohsiung City, 82778

Taiwan

Phone: 886 7 381 4526

Email: chyang@nkust.edu.tw


Background: Breast cancer has a major disease burden in the female population, and it is a highly genome-associated human disease. However, in genetic studies of complex diseases, modern geneticists face challenges in detecting interactions among loci.

Objective: This study aimed to investigate whether variations of single-nucleotide polymorphisms (SNPs) are associated with histopathological tumor characteristics in breast cancer patients.

Methods: A hybrid Taguchi-genetic algorithm (HTGA) was proposed to identify the high-order SNP barcodes in a breast cancer case-control study. A Taguchi method was used to enhance a genetic algorithm (GA) for identifying high-order SNP barcodes. The Taguchi method was integrated into the GA after the crossover operations in order to optimize the generated offspring systematically for enhancing the GA search ability.

Results: The proposed HTGA effectively converged to a promising region within the problem space and provided excellent SNP barcode identification. Regression analysis was used to validate the association between breast cancer and the identified high-order SNP barcodes. The maximum OR was less than 1 (range 0.870-0.755) for two- to seven-order SNP barcodes.

Conclusions: We systematically evaluated the interaction effects of 26 SNPs within growth factor–related genes for breast carcinogenesis pathways. The HTGA could successfully identify relevant high-order SNP barcodes by evaluating the differences between cases and controls. The validation results showed that the HTGA can provide better fitness values as compared with other methods for the identification of high-order SNP barcodes using breast cancer case-control data sets.

JMIR Med Inform 2020;8(6):e16886

doi:10.2196/16886


Breast cancer has a major disease burden in the female population, with a growing incidence recently [1,2]. Previously, several interpretations of associations between breast cancer and tumor characteristics [3-5], single-nucleotide polymorphisms (SNPs) [6-8], clinicopathological factors [9], and biomarkers [10] revealed relevant association effects between these factors and the risk of cancer. Previous studies also indicated that genomic variation could contribute to the tumorigenicity process in breast cancer [11-14]. Thus, effective approaches for breast cancer estimation are required.

SNPs are crucial genetic variants in genomic association analyses involving leukemia [15], cancers [16], and other diseases [17-19]. Numerous SNPs cannot be excluded from analyses simply because conventional methods find no relevant differences between cases and controls. Some SNPs may have relevant associations with other SNPs, and these associations are referred to as SNP barcodes. Consequently, the detection of SNP barcodes is vital for association analyses of diseases and cancers [20-23].

An SNP barcode consists of SNPs, and each SNP includes three genotypes. The large space of suitable SNP barcode combinations complicates the statistical evaluation and identification of relevant SNP barcodes. Evolutionary algorithms have been proposed to facilitate statistical identification of SNP barcodes, and a genetic algorithm (GA) is one of the most frequently used algorithms in genomic studies [24,25]. A GA is an effective approach in the identification of relevant genetic associations for various diseases through the use of more efficient search abilities to enhance population diversity [26]. The crossover and local search operations in a GA can reduce the probability of the same vector being identified between two selected SNPs, and hence, they can improve the search ability of this algorithm.

Breast cancer is a major health issue, and machine learning algorithms are frequently employed to detect the complex genomic associations in breast cancer studies. Although previous machine learning approaches could effectively identify SNP associations in genomic studies, the detection rate of SNP barcodes remains challenging for high-order SNP barcodes. Thus, we proposed a hybrid Taguchi-genetic algorithm (HTGA) for high-order SNP barcode identification in a breast cancer case-control study.


Genetic Algorithm

A GA is a machine learning algorithm inspired by biological evolutionary processes [27]. The first GA operation is population initialization, in which solutions are produced over the solution space; these initial solutions are designated as parents. In the population, two parents are strategically selected according to some fitness values for crossover operators. Crossover operators generate offspring by combining the chromosomal matter from the two parents. Mutation operations can increase population diversity through localized change, eliminating inferior chromosomes from the population and retaining good offspring. Thus, the good factors within the population can be passed on to the next generation. The aforementioned operations and population replacement are repeated until the stopping criterion is satisfied.

Taguchi Method

The methods proposed by Taguchi et al [28] are based on a statistical experimental design to improve the evaluation and performance of products, process conditions, and parameter settings. Taguchi methods primarily rely on orthogonal arrays (OAs) and the signal-to-noise ratio (SNR). An OA is a fractional factorial matrix that provides a comprehensive analysis of interactions among all design factors. This matrix ensures a proportionate comparison of levels for all factors. A two-level OA can be defined as $L_n(2^{n-1})$, where $n = 2^k$ is the number of experimental runs, $k$ (>1) is a positive integer, base 2 represents two levels for each design parameter, and $n-1$ is the number of columns in the OA. “L” represents “Latin,” because the OA experimental design concept is associated with the Latin square. An example of an OA is shown in Table 1.

SNR (η) is used as the selection quality characteristic in the field of communications engineering; it can be used to optimize the parameters for a target. Taguchi methods can classify the parameter design problem into several categories according to the problem. Both smaller-the-better and larger-the-better SNR types are used. Considering the set of characteristics $y_1, y_2, \ldots, y_n$, in the smaller-the-better case, the SNR can be determined using the following equation:

$\eta = -10 \log_{10}\left(\frac{1}{n}\sum_{i=1}^{n} y_i^2\right)$ (1)

In the larger-the-better case, the SNR can be determined using the following equation:

$\eta = -10 \log_{10}\left(\frac{1}{n}\sum_{i=1}^{n} \frac{1}{y_i^2}\right)$ (2)

The SNR evaluates the robustness of the levels of each design parameter. A high-quality result can be achieved for a particular target by controlling the parameters at a particular level with a high SNR value.
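
As a hedged illustration, the two SNR types can be computed in a few lines of Python. The sketch assumes the standard Taguchi definitions (the mean of $y_i^2$ for smaller-the-better and the mean of $1/y_i^2$ for larger-the-better, each passed through $-10\log_{10}$); the function names are illustrative and do not come from the paper.

```python
import math

def snr_smaller_the_better(ys):
    """eta = -10 * log10(mean(y^2)); a larger eta means smaller responses."""
    return -10.0 * math.log10(sum(y * y for y in ys) / len(ys))

def snr_larger_the_better(ys):
    """eta = -10 * log10(mean(1 / y^2)); a larger eta means larger responses."""
    return -10.0 * math.log10(sum(1.0 / (y * y) for y in ys) / len(ys))

responses = [10.0, 10.0, 10.0]
print(snr_smaller_the_better(responses))
print(snr_larger_the_better(responses))
```

In both cases a parameter level is preferred when it yields the higher η, which is why the same "pick the larger SNR" rule serves both problem types.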

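Table 1. An $L_8(2^7)$ orthogonal array.

| Experiment number | Factor A | Factor B | Factor C | Factor D | Factor E | Factor F | Factor G |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 1 | 2 | 2 | 2 | 2 |
| 3 | 1 | 2 | 2 | 1 | 1 | 2 | 2 |
| 4 | 1 | 2 | 2 | 2 | 2 | 1 | 1 |
| 5 | 2 | 1 | 2 | 1 | 2 | 1 | 2 |
| 6 | 2 | 1 | 2 | 2 | 1 | 2 | 1 |
| 7 | 2 | 2 | 1 | 1 | 2 | 2 | 1 |
| 8 | 2 | 2 | 1 | 2 | 1 | 1 | 2 |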
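
A two-level orthogonal array of this kind can be generated programmatically. The following is a minimal sketch, assuming the usual parity (XOR) construction; the bit reversal is only there so that the columns come out in the same order as in Table 1, and both function names are hypothetical.

```python
from itertools import combinations

def two_level_orthogonal_array(k):
    """Build L_n(2^(n-1)) with n = 2**k rows, n - 1 columns, and levels 1 and 2."""
    n = 2 ** k

    def reverse_bits(x):
        # Reversing the k-bit row index reproduces the column order of Table 1.
        return int(format(x, f"0{k}b")[::-1], 2)

    return [
        [1 + bin(reverse_bits(row) & col).count("1") % 2 for col in range(1, n)]
        for row in range(n)
    ]

def is_orthogonal(array):
    """Every pair of columns must contain each level pair equally often."""
    n = len(array)
    for a, b in combinations(range(len(array[0])), 2):
        pairs = [(row[a], row[b]) for row in array]
        if any(pairs.count(p) != n // 4 for p in [(1, 1), (1, 2), (2, 1), (2, 2)]):
            return False
    return True

L8 = two_level_orthogonal_array(3)
for number, row in enumerate(L8, start=1):
    print(number, *row)
```

The balance check in `is_orthogonal` is what the text means by a proportionate comparison of levels: for any two factors, each of the four level pairs appears in exactly n/4 experiments.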

Hybrid Taguchi-Genetic Algorithm

In the HTGA, a Taguchi method is inserted between the GA crossover and mutation operations. Figure 1 depicts a flowchart of the HTGA approach, which includes the 17 steps listed below. The pseudocode of the HTGA is shown in Textbox 1.

HTGA Procedure

The procedure involves the following 17 steps:

(1) Population initialization: execute the algorithm and generate an initial population.
(2) Fitness value evaluation: evaluate the population’s fitness values.
(3) Selection operation: select candidates using the tournament approach.
(4) Crossover operation: the probability of crossover is determined by the crossover rate pc.
(5) Select a suitable two-level orthogonal array for the experiment.
(6) Randomly choose two chromosomes at a time to execute a matrix experiment.
(7) Calculate the function values and SNRs of the n experiments in the orthogonal array $L_n(2^{n-1})$.
(8) Calculate the effects of the different factors in the experiment.
(9) Generate an optimal chromosome.
(10) Repeat steps 5 through 8 until the expected number (1/2) × M × pc is reached.
(11) New chromosomes are generated through the Taguchi method.
(12) Mutation operation: the mutation probability is determined by the mutation rate pm.
(13) Add chromosomes from a pool into the population.
(14) Sort the population by fitness.
(15) Select the fittest chromosomes as the new population for the next generation.
(16) If the stopping criterion is met, execute step 17; if not, go back to step 2.
(17) The chromosome with the highest fitness value is the HTGA solution.

Figure 1. Hybrid Taguchi-genetic algorithm flowchart. SNR: signal-to-noise ratio.
Pseudocode of the hybrid Taguchi-genetic algorithm.

Input: maximum iteration T (termination criterion)
     population size M
     crossover rate Pc
     mutation rate Pm
Output: optimal chromosome (the optimal solution)

begin
     # Initialization
     /* Initialize M chromosomes as population */
     for iteration ← 1 to T do
          # Selection operation
          for i ← 1 to M do
               /* Randomly select two chromosomes from population as chromosome1 and chromosome2 */
               if fitness1 > fitness2 then
                    winner ← chromosome1
               else
                    winner ← chromosome2
               end if
               /* put winner into mating pool */
          end for
          # Crossover operation
          for i ← 1 to (M / 2) do
               /* Sequentially select two chromosomes from mating pool as chromosome1 and chromosome2 */
               if random() < Pc then
                    crossover(chromosome1, chromosome2)
                    /* generate two offspring */
               end if
               /* put two offspring into offspring pool */
          end for
          # Taguchi operation
          for i ← 1 to (0.5 × M × Pc) do
               /* Randomly select two chromosomes from offspring pool as chromosome1 and chromosome2 */
               Taguchi(chromosome1, chromosome2)
               /* generate one offspring */
               /* put one offspring into offspring pool */
          end for
          # Mutation operation
          for i ← 1 to size of offspring pool do
               for j ← 1 to dimensions of chromosome do
                    if random() < Pm then
                         mutation(chromosome[j])
                    end if
               end for
               /* generate one offspring */
               /* put one offspring into offspring pool */
          end for
          # Replacement operation
          /* Reserve best M chromosomes as new population from population and offspring pool */
     end for
     /* Obtain optimal chromosome */
end

Textbox 1. Pseudocode of the hybrid Taguchi-genetic algorithm.
Encoding Schemes and Population Initialization

In the proposed GA, a suitable solution to a problem is denoted as chromosome C = {c1, c2, …, cn}, and the encoding scheme aims to design suitable elements in a chromosome. In the SNP barcode problem, the elements in a chromosome include (1) the indexes of the selected SNPs in the data set and (2) the genotypes of these selected SNPs. Thus, a chromosome Ci is expressed as shown in equation 3.

$C_i = (\mathrm{SNP}_{i,s}, \mathrm{Genotype}_{i,g})$ (3)

where $i = 1, 2, \ldots, m$, and $m$ is the population size. $\mathrm{SNP}_{i,s}$, where $s = 1, 2, \ldots, n/2$, is a selected SNP dimension in which no SNP is repeated, and $n$ is the SNP barcode order. $\mathrm{Genotype}_{i,g}$ represents the three possible genotypes of the selected $\mathrm{SNP}_{i,s}$, where $g = n/2 + 1, n/2 + 2, \ldots, n$ is the selected genotype dimension. In the population initialization, all chromosomes are stochastically generated according to the encoding schemes.
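
The encoding and the random initialization can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: a chromosome is a flat list whose first half holds 1-based SNP indexes and whose second half holds genotype codes 1 to 3, the parameter `order` is the number of SNPs in the barcode (half the chromosome length), and sorting the SNP indexes is a convenience not stated in the text.

```python
import random

def random_chromosome(order, n_snps, rng):
    """Return [snp_1..snp_order, genotype_1..genotype_order].

    SNP indexes are 1-based and never repeat; genotypes are coded 1, 2, or 3.
    """
    snps = sorted(rng.sample(range(1, n_snps + 1), order))
    genotypes = [rng.randint(1, 3) for _ in range(order)]
    return snps + genotypes

def initial_population(size, order, n_snps, seed=0):
    """Generate `size` random chromosomes with a reproducible seed."""
    rng = random.Random(seed)
    return [random_chromosome(order, n_snps, rng) for _ in range(size)]

population = initial_population(size=50, order=3, n_snps=26)
print(population[0])
```

Drawing the SNP indexes with `sample` rather than independent random draws is what guarantees that no SNP appears twice in one barcode.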

Fitness Function Evaluation

The aim of SNP barcode identification is to detect relevant differences between cases and controls. To optimize the protective effect of the SNP combination, a fitness function is required for comparing cases and controls. A high difference between cases and controls indicates a high probability of detecting relevant SNP barcodes. In the proposed GA, a chromosome is measured by the fitness function shown in equation 4.

$F(C_i) = \mathrm{number}(\mathrm{control} \cap C_i) - \mathrm{number}(\mathrm{case} \cap C_i)$ (4)

where number is the total number of elements in a set, control denotes the controls, case denotes the cases, and $C_i$ is the ith chromosome. Thus, the number of intersections between the ith chromosome and the controls is calculated by $\mathrm{number}(\mathrm{control} \cap C_i)$, and the number of intersections between the ith chromosome and the cases is calculated by $\mathrm{number}(\mathrm{case} \cap C_i)$. Thereafter, we calculate the difference between $\mathrm{number}(\mathrm{control} \cap C_i)$ and $\mathrm{number}(\mathrm{case} \cap C_i)$ as the fitness value at $C_i$.
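
A small sketch may make the fitness function concrete. It assumes that each subject is stored as a list of genotype codes (1 to 3), one per SNP, and that a chromosome is a flat list of 1-based SNP indexes followed by the matching genotypes; the helper names and the toy data are hypothetical.

```python
def count_matches(samples, chromosome):
    """Count samples that carry every (SNP, genotype) pair of the barcode."""
    half = len(chromosome) // 2
    snps, genotypes = chromosome[:half], chromosome[half:]
    return sum(
        all(sample[snp - 1] == g for snp, g in zip(snps, genotypes))
        for sample in samples
    )

def fitness(chromosome, controls, cases):
    """Equation 4: matching controls minus matching cases."""
    return count_matches(controls, chromosome) - count_matches(cases, chromosome)

# Toy data: each row is one subject, each column one SNP (genotypes 1-3).
controls = [[1, 3, 2], [1, 3, 1], [2, 3, 1], [1, 1, 1]]
cases = [[1, 3, 3], [2, 2, 2], [3, 1, 1], [1, 2, 1]]
barcode = [1, 2, 1, 3]  # SNP 1 with genotype 1 and SNP 2 with genotype 3
print(fitness(barcode, controls, cases))
```

Here two controls and one case carry the barcode, so the fitness is 1. A positive value means the barcode is more frequent among controls, which is the direction the study treats as protective.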

Selection Operation

In the selection operation, a random tournament selection scheme is used to pick each pair of parents from the population [29]. In tournament selection, two chromosomes are randomly selected to compare their individual fitness values. The chromosomes with better fitness values are inserted into the mating pool. According to the mechanism of tournament selection, the probability that the average fitness value of solutions in the mating pool is better than the average fitness value of the parent population is high. Chromosomes in the mating pool are selected for the crossover operation and used to produce offspring. Textbox 2 provides the pseudocode of tournament selection. The selection operation is repeatedly executed until the maximum mating pool size is achieved.

Tournament selection procedure.

Input: population, the list of chromosomes to select from
Input: chromosome1, the first randomly selected chromosome from population
Input: chromosome2, the second randomly selected chromosome from population
Input: fitness1, the fitness value of the first chromosome
Input: fitness2, the fitness value of the second chromosome
Output: winner, the chromosome with the better fitness value in the tournament
Output: mating pool, the list of chromosomes reserved for the crossover operation

# Tournament selection
begin
for i ← 1 to size of mating pool do
     Randomly select two chromosomes from population
     if fitness1 > fitness2 then
          winner ← chromosome1
     else
          winner ← chromosome2
     end if
     put winner into mating pool
end for
end

Textbox 2. Tournament selection procedure.
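
The tournament procedure translates almost line for line into Python. The sketch below assumes that a higher fitness is better and that the two contestants are distinct; the tie rule and the function name are assumptions, since the textbox does not specify them.

```python
import random

def tournament_selection(population, fitness_values, pool_size, rng):
    """Fill the mating pool with the winners of random two-way tournaments."""
    mating_pool = []
    for _ in range(pool_size):
        first, second = rng.sample(range(len(population)), 2)
        # Ties go to the second pick here; the textbox does not specify tie handling.
        winner = first if fitness_values[first] > fitness_values[second] else second
        mating_pool.append(population[winner])
    return mating_pool

rng = random.Random(1)
population = ["a", "b", "c", "d"]
fitness_values = [5, 1, 9, 3]
pool = tournament_selection(population, fitness_values, pool_size=4, rng=rng)
print(pool)
```

Because the weakest chromosome loses every tournament it enters, it can never reach the mating pool, which is why the pool's average fitness tends to exceed that of the parent population.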
Crossover Operation

After the selection operation, the crossover operation is implemented to create high-performing individuals. Two chromosomes are sequentially selected from the mating pool as a pair of parents, and then, the crossover operation is executed on them. The crossover operation uses a uniform crossover. Each bit in a chromosome is randomly generated as 0 or 1, and for 1, points are swapped between parent organisms; otherwise, points are not swapped. The encoding schemes establish a single point as an SNP locus with a corresponding genotype locus at the j 2 + 1 position, where j = 1, 2, …, n/2 is the index in the chromosome and n is the SNP barcode order. Therefore, n/2 bits are randomly generated, and both the j 2 + 1 genotype locus and jth bit representing an SNP are swapped in the parent organisms.
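
The uniform crossover described above can be sketched as follows. The layout is an assumption made for illustration: the SNP at index j (0-based) has its genotype at index j + half, where half is the number of SNPs in the barcode, and the two loci are always swapped together. The sketch does not repair duplicate SNP indexes that a swap may introduce, because the text does not say how these are handled.

```python
import random

def uniform_crossover(parent1, parent2, rng):
    """Swap whole (SNP, genotype) pairs wherever a random mask bit is 1."""
    half = len(parent1) // 2
    child1, child2 = list(parent1), list(parent2)
    mask = [rng.randint(0, 1) for _ in range(half)]
    for j, bit in enumerate(mask):
        if bit == 1:
            # Assumed layout: SNP at index j, its genotype at index j + half.
            for pos in (j, j + half):
                child1[pos], child2[pos] = child2[pos], child1[pos]
    return child1, child2, mask

rng = random.Random(7)
p1 = [1, 10, 17, 1, 2, 1]
p2 = [8, 9, 22, 3, 1, 2]
c1, c2, mask = uniform_crossover(p1, p2, rng)
print(mask, c1, c2)
```

Swapping the SNP and its genotype as one unit keeps every genotype attached to the SNP it describes, so a child never inherits a genotype that was evaluated for a different locus.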

Taguchi Operation

An orthogonal array exhibits Q design factors, each with two levels. An orthogonal array $L_n(2^{n-1})$ exhibits $n-1$ columns and $n$ individual experiments corresponding to $n$ rows, where $n = 2^k$ and $Q \le n-1$; $k$ is a positive integer >1 that is used for adjusting the number of experimental runs.

The SNR (η) is the mean square deviation of the fitness function. Let $y_i$ be the function evaluation value of experiment $i = 1, 2, 3, \ldots, n$, where $n$ is the number of experiments. In the case of a fitness function that is maximized (larger-the-better), the two values of η are $\eta_i = (y_i)^2$ and $\eta_i = -(y_i)^2$ (where $y_i$ is negative). The effect of factor $f$ is defined as follows:

$E_{fl}$ = sum of $\eta_i$ for factor $f$ at level $l$ (5)

where i is the experiment number, f is the factor name, and l is the level number.
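
The Taguchi operation can be illustrated with a short sketch. It rests on several assumptions that follow the usual hybrid Taguchi-genetic construction but are not spelled out in the text: each chromosome position is one factor, level 1 takes the element from the first parent and level 2 from the second, and the offspring keeps, for every factor, the level with the larger effect. The toy fitness function is hypothetical.

```python
L8 = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 2, 2, 2, 2],
    [1, 2, 2, 1, 1, 2, 2],
    [1, 2, 2, 2, 2, 1, 1],
    [2, 1, 2, 1, 2, 1, 2],
    [2, 1, 2, 2, 1, 2, 1],
    [2, 2, 1, 1, 2, 2, 1],
    [2, 2, 1, 2, 1, 1, 2],
]

def signed_square(y):
    """eta_i = y_i^2, or -(y_i^2) when y_i is negative (larger-the-better)."""
    return y * y if y >= 0 else -(y * y)

def taguchi_offspring(parent1, parent2, fitness, array=L8):
    """Pick, factor by factor, the parent whose level has the larger effect."""
    n_factors = len(parent1)
    assert n_factors <= len(array[0]), "need Q <= n - 1 columns"
    parents = {1: parent1, 2: parent2}
    etas = []
    for row in array:
        # One matrix experiment: mix the parents according to the row's levels.
        trial = [parents[row[f]][f] for f in range(n_factors)]
        etas.append(signed_square(fitness(trial)))
    child = []
    for f in range(n_factors):
        effect = {1: 0.0, 2: 0.0}
        for row, eta in zip(array, etas):
            effect[row[f]] += eta  # equation 5: E_fl
        child.append(parents[1 if effect[1] >= effect[2] else 2][f])
    return child

def toy_fitness(chromosome):
    """Hypothetical separable fitness: the number of positions equal to 1."""
    return sum(1 for gene in chromosome if gene == 1)

print(taguchi_offspring([1, 0, 1, 0], [0, 1, 0, 1], toy_fitness))
```

With only eight fitness evaluations the sketch recovers the best mix of the two parents for this toy function, instead of testing all 16 possible combinations of four two-level factors.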

Mutation Operation

The mutation operation aims to prevent the population from falling into local optima. Each element of every offspring has a chance to undergo mutation. For each element, a random number in (0, 1) is generated; if the number is less than the mutation probability pm at the ith element of an offspring, the ith element is replaced by a randomly generated admissible value.
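
A minimal sketch of this element-wise mutation follows. It assumes the flat chromosome layout used for illustration elsewhere (SNP indexes in the first half, genotype codes 1 to 3 in the second) and draws a replacement SNP only from indexes not already in the barcode; that uniqueness rule and the function name are assumptions.

```python
import random

def mutate(chromosome, n_snps, pm, rng):
    """Replace each element with probability pm by a random admissible value."""
    half = len(chromosome) // 2
    mutant = list(chromosome)
    for i in range(len(mutant)):
        if rng.random() < pm:
            if i < half:
                # SNP slot: draw an index not already used in this barcode.
                unused = [s for s in range(1, n_snps + 1) if s not in mutant[:half]]
                if unused:
                    mutant[i] = rng.choice(unused)
            else:
                mutant[i] = rng.randint(1, 3)  # genotype slot
    return mutant

rng = random.Random(3)
original = [1, 10, 17, 1, 2, 1]
print(mutate(original, n_snps=26, pm=0.05, rng=rng))
```

With the study's setting of pm = 0.05, most offspring pass through unchanged, and only an occasional element is perturbed.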

Replacement Operation

The replacement operation replaces the weakest individuals in the population with fitter ones. After the completion of the aforementioned operations, the offspring are added to the population, and all the parents and offspring are ranked based on their fitness values. Subsequently, the top p chromosomes are selected as the new population for the next generation, where p is the population size.
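
The ranking-based replacement amounts to a sort followed by a truncation, as in this small sketch (the function name and the one-element toy chromosomes are hypothetical):

```python
def replace_population(parents, offspring, fitness, size):
    """Keep the `size` fittest chromosomes from parents and offspring combined."""
    ranked = sorted(parents + offspring, key=fitness, reverse=True)
    return ranked[:size]

def first_gene(chromosome):
    """Toy fitness for the example: the value of the first element."""
    return chromosome[0]

parents = [[3], [7], [1]]
offspring = [[9], [2], [5]]
print(replace_population(parents, offspring, first_gene, size=3))
```

Because parents compete with offspring, the best chromosome found so far is never lost between generations.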

Termination Condition

The HTGA operation is repeated in successive iterations until the stopping criterion is met. In this study, a maximum number of iterations was used to terminate HTGA operations.

Parameter Setting

This study compared the search effectiveness of the HTGA with that of standard GA, particle swarm optimization (PSO) [30], and chaotic PSO (CPSO) [31] methods. PSO is a swarm intelligence algorithm that simulates the social behavior of organisms. In PSO, each individual is a particle that represents a potential solution in the swarm population. In CPSO, chaotic theory is incorporated into PSO to increase the search space and enhance PSO performance. PSO and CPSO parameters include population size, iteration size, minimum and maximum inertial weights, and learning factors. In each method, the number of iterations was set to 1000, and the population size was 50 for the test data set. In PSO and CPSO, the minimum and maximum inertial weights were 0.4 and 0.9, respectively. The learning factors c1 and c2 were both set to 2. In the tested GA and the proposed HTGA, the probability of crossover (pc) was 0.3 and the probability of mutation (pm) was 0.05.

Statistical Analysis

The OR was used to evaluate the risk of an SNP barcode [32], and it was defined as follows:

OR = (TP × TN) / (FP × FN) (6)

where TP represents the number of true positives, TN represents the number of true negatives, FN represents the number of false negatives, and FP represents the number of false positives.
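
As a worked example of equation 6, the sketch below computes the OR for one barcode together with a 95% CI. Two points are assumptions: TP is taken as the cases that carry the barcode (so FP is the controls that carry it), and the interval uses the common log-based (Woolf) method, which the text does not name.

```python
import math

def odds_ratio(tp, fp, fn, tn, z=1.96):
    """Return (OR, lower, upper) using equation 6 and a log-based (Woolf) interval."""
    or_value = (tp * tn) / (fp * fn)
    se = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    lower = math.exp(math.log(or_value) - z * se)
    upper = math.exp(math.log(or_value) + z * se)
    return or_value, lower, upper

# 2-SNP PSO barcode from Table 3: 816 cases and 941 controls carry it;
# 4184 cases and 4059 controls do not.
or_value, lower, upper = odds_ratio(tp=816, fp=941, fn=4184, tn=4059)
print(round(or_value, 3), round(lower, 2), round(upper, 2))
```

With these counts the sketch gives an OR of about 0.841 and an interval of roughly 0.76 to 0.93, in line with the corresponding row of Table 3.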


Data Sets

A set of 26 SNPs related to growth factor genes was selected to simulate a data set. Several growth factor–related breast cancer genes (EGF, IGF1, IGF1R, IGF2, IGFBP3, IL10, TGFB1, and VEGF), including 26 SNPs, were used as simulation data to evaluate existing algorithms and the proposed HTGA. The data set only provided the genotype frequencies of each SNP without the original raw data of genotypes. Table 2 presents the SNPs and genotype distributions. The simulated frequencies of SNPs were acquired from the literature [33]. The SNPs in the original data were genotyped in different numbers of individuals; therefore, the sample size of every SNP had to be normalized to the same number. The new data were randomly generated according to the frequencies of the original data, and all SNP data from the data source were adjusted to 5000 samples for all genotype distributions. For example, for SNP1 (gene, EGF; dbSNP ref. rs2237054), the total number of the three genotypes (ie, TT, TA, and AA) in the controls was 2273 (2008 + 259 + 6). The percentage for each genotype in SNP1 was calculated as “original data*/sum (%)” (ie, 2008/2273, 88.3% for TT; 259/2273, 11.4% for TA; and 6/2273, 0.3% for AA), where the symbol “*” indicates that the original data were derived from the SNP data set before normalization. On the basis of these percentages, the modified data for SNP1 were calculated by multiplying each percentage by the size of the complete data set (SNP number adjusted to 5000) (ie, 88.3% × 5000 [n=4418] for TT; 11.4% × 5000 [n=569] for TA; and 0.3% × 5000 [n=13] for AA). Therefore, the modified data for SNP1 were adjusted to a total of 5000 (4418 + 569 + 13 = 5000). Thus, 5000 simulation samples of SNP genotypes were randomly generated following this fixed distribution.
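
The resampling step can be sketched in a few lines. This is an illustration only: it draws each simulated genotype independently with probabilities proportional to the original counts, so the simulated totals fluctuate around the expected values, and the function name and seed are hypothetical.

```python
import random
from collections import Counter

def simulate_genotypes(original_counts, n_samples, rng):
    """Draw n_samples genotypes with probabilities proportional to the counts."""
    genotypes = list(original_counts)
    weights = [original_counts[g] for g in genotypes]
    return rng.choices(genotypes, weights=weights, k=n_samples)

rng = random.Random(2020)
# Control counts for SNP1 (rs2237054) before normalization, as quoted above.
control_counts = {"TT": 2008, "TA": 259, "AA": 6}
simulated = Counter(simulate_genotypes(control_counts, 5000, rng))
print(simulated)
```

Repeating the draw for every SNP, separately for cases and controls, yields a genotype matrix of 5000 simulated subjects per group.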

Table 2. Estimated effect from individual single-nucleotide polymorphisms of 26 growth factor–related genes for the occurrence of breast cancer.
| SNP^a (gene) | SNP type | Case number/normal number | OR | 95% CI | P value |
|---|---|---|---|---|---|
| 1. rs2237054 (EGF) | 1-TT | 4408/4418 | N/A^b | N/A | N/A |
| 1. rs2237054 (EGF) | 2-TA | 570/569 | 1 | 0.89-1.14 | .97 |
| 1. rs2237054 (EGF) | 3-AA | 22/13 | 1.7 | 0.85-3.37 | .18 |
| 2. rs5742678 (IGF1) | 1-CC | 2797/2866 | N/A | N/A | N/A |
| 2. rs5742678 (IGF1) | 2-CG | 1844/1837 | 1.03 | 0.95-1.12 | .52 |
| 2. rs5742678 (IGF1) | 3-GG | 359/297 | 1.24 | 1.05-1.46 | .01 |
| 3. rs1549593 (IGF1) | 1-CC | 2924/2970 | N/A | N/A | N/A |
| 3. rs1549593 (IGF1) | 2-CA | 1753/1771 | 1.01 | 0.93-1.09 | .92 |
| 3. rs1549593 (IGF1) | 3-AA | 323/259 | 1.27 | 1.07-1.50 | .008 |
| 4. rs6220 (IGF1) | 1-AA | 2643/2698 | N/A | N/A | N/A |
| 4. rs6220 (IGF1) | 2-AG | 1933/1951 | 1.01 | 0.93-1.10 | .80 |
| 4. rs6220 (IGF1) | 3-GG | 424/351 | 1.23 | 1.06-1.44 | .007 |
| 5. rs2946834 (IGF1) | 1-CC | 2295/2336 | N/A | N/A | N/A |
| 5. rs2946834 (IGF1) | 2-CT | 2171/2150 | 1.03 | 0.95-1.12 | .53 |
| 5. rs2946834 (IGF1) | 3-TT | 534/514 | 1.06 | 0.93-1.21 | .43 |
| 6. rs1568502 (IGF1R) | 1-AA | 2914/2955 | N/A | N/A | N/A |
| 6. rs1568502 (IGF1R) | 2-AG | 1840/1807 | 1.03 | 0.95-1.12 | .46 |
| 6. rs1568502 (IGF1R) | 3-GG | 246/238 | 1.05 | 0.87-1.26 | .65 |
| 7. IGF1R-10 (IGF1R) | 1-AA | 3169/3201 | N/A | N/A | N/A |
| 7. IGF1R-10 (IGF1R) | 2-Aa | 1545/1582 | 0.99 | 0.91-1.08 | .77 |
| 7. IGF1R-10 (IGF1R) | 3-aa | 286/217 | 1.33 | 1.11-1.60 | .003 |
| 8. rs2229765 (IGF1R) | 1-GG | 1523/1429 | N/A | N/A | N/A |
| 8. rs2229765 (IGF1R) | 2-GA | 2533/2489 | 0.96 | 0.87-1.05 | .33 |
| 8. rs2229765 (IGF1R) | 3-AA | 944/1082 | 0.82 | 0.73-0.92^c | .001 |
| 9. rs8030950 (IGF1R) | 1-CC | 2737/2745 | N/A | N/A | N/A |
| 9. rs8030950 (IGF1R) | 2-CA | 1902/1917 | 1 | 0.92-1.08 | .92 |
| 9. rs8030950 (IGF1R) | 3-AA | 361/338 | 1.07 | 0.92-1.25 | .41 |
| 10. rs680 (IGF2) | 1-GG | 2538/2451 | N/A | N/A | N/A |
| 10. rs680 (IGF2) | 2-GA | 2074/2183 | 0.92 | 0.85-1.00 | .04 |
| 10. rs680 (IGF2) | 3-AA | 388/366 | 1.02 | 0.88-1.19 | .79 |
| 11. rs3741211 (IGF2) | 1-TT | 1936/1971 | N/A | N/A | N/A |
| 11. rs3741211 (IGF2) | 2-TC | 2367/2269 | 1.06 | 0.98-1.16 | .17 |
| 11. rs3741211 (IGF2) | 3-CC | 697/760 | 0.93 | 0.83-1.05 | .28 |
| 12. IGF2-05 (IGF2) | 1-AA | 2651/2694 | N/A | N/A | N/A |
| 12. IGF2-05 (IGF2) | 2-Aa | 1955/1952 | 1.02 | 0.94-1.11 | .69 |
| 12. IGF2-05 (IGF2) | 3-aa | 394/354 | 1.13 | 0.97-1.32 | .12 |
| 13. IGF2-06 (IGF2) | 1-AA | 2160/2162 | N/A | N/A | N/A |
| 13. IGF2-06 (IGF2) | 2-Aa | 2237/2284 | 0.98 | 0.90-1.07 | .66 |
| 13. IGF2-06 (IGF2) | 3-aa | 603/554 | 1.09 | 0.96-1.24 | .21 |
| 14. rs2132571 (IGFBP3) | 1-GG | 2415/2407 | N/A | N/A | N/A |
| 14. rs2132571 (IGFBP3) | 2-GA | 2163/2157 | 1 | 0.92-1.09 | .99 |
| 14. rs2132571 (IGFBP3) | 3-AA | 422/436 | 0.97 | 0.83-1.12 | .65 |
| 15. rs2471551 (IGFBP3) | 1-GG | 3225/3284 | N/A | N/A | N/A |
| 15. rs2471551 (IGFBP3) | 2-GC | 1591/1515 | 1.07 | 0.98-1.17 | .13 |
| 15. rs2471551 (IGFBP3) | 3-CC | 184/201 | 0.93 | 0.76-1.15 | .54 |
| 16. rs2854744 (IGFBP3) | 1-AA | 1538/1469 | N/A | N/A | N/A |
| 16. rs2854744 (IGFBP3) | 2-AC | 2487/2475 | 0.96 | 0.88-1.05 | .39 |
| 16. rs2854744 (IGFBP3) | 3-CC | 975/1056 | 0.88 | 0.79-0.99 | .03 |
| 17. rs2132572 (IGFBP3) | 1-GG | 2908/3027 | N/A | N/A | N/A |
| 17. rs2132572 (IGFBP3) | 2-GA | 1805/1728 | 1.09 | 1.00-1.18 | .051 |
| 17. rs2132572 (IGFBP3) | 3-AA | 287/245 | 1.22 | 1.02-1.46 | .03 |
| 18. rs3024496 (IL10) | 1-TT | 1218/1235 | N/A | N/A | N/A |
| 18. rs3024496 (IL10) | 2-TC | 2533/2549 | 1.01 | 0.92-1.11 | .90 |
| 18. rs3024496 (IL10) | 3-CC | 1249/1216 | 1.04 | 0.93-1.17 | .49 |
| 19. rs1800872 (IL10) | 1-CC | 3059/3017 | N/A | N/A | N/A |
| 19. rs1800872 (IL10) | 2-CA | 1660/1722 | 0.95 | 0.87-1.03 | .25 |
| 19. rs1800872 (IL10) | 3-AA | 281/261 | 1.06 | 0.89-1.27 | .53 |
| 20. rs1800890 (IL10) | 1-TT | 1703/1701 | N/A | N/A | N/A |
| 20. rs1800890 (IL10) | 2-TA | 2455/2508 | 0.98 | 0.90-1.07 | .63 |
| 20. rs1800890 (IL10) | 3-AA | 842/791 | 1.06 | 0.95-1.20 | .32 |
| 21. rs1554286 (IL10) | 1-CC | 3400/3446 | N/A | N/A | N/A |
| 21. rs1554286 (IL10) | 2-CT | 1431/1410 | 1.03 | 0.94-1.12 | .54 |
| 21. rs1554286 (IL10) | 3-TT | 169/144 | 1.19 | 0.95-1.49 | .15 |
| 22. rs1800470 (TGFB1) | 1-TT | 1850/1914 | N/A | N/A | N/A |
| 22. rs1800470 (TGFB1) | 2-TC | 2372/2399 | 1.02 | 0.94-1.11 | .62 |
| 22. rs1800470 (TGFB1) | 3-CC | 778/687 | 1.17 | 1.04-1.32 | .01 |
| 23. rs699947 (VEGF) | 1-CC | 1236/1273 | N/A | N/A | N/A |
| 23. rs699947 (VEGF) | 2-CA | 2511/2463 | 1.05 | 0.95-1.16 | .33 |
| 23. rs699947 (VEGF) | 3-AA | 1253/1264 | 1.02 | 0.91-1.14 | .73 |
| 24. rs1570360 (VEGF) | 1-GG | 2278/2341 | N/A | N/A | N/A |
| 24. rs1570360 (VEGF) | 2-GA | 2214/2132 | 1.07 | 0.98-1.16 | .13 |
| 24. rs1570360 (VEGF) | 3-AA | 508/527 | 0.99 | 0.87-1.13 | .92 |
| 25. rs2010963 (VEGF) | 1-GG | 2354/2279 | N/A | N/A | N/A |
| 25. rs2010963 (VEGF) | 2-GC | 2133/2157 | 0.96 | 0.88-1.04 | .31 |
| 25. rs2010963 (VEGF) | 3-CC | 513/564 | 0.88 | 0.77-1.01 | .07 |
| 26. rs3025039 (VEGF) | 1-CC | 3744/3741 | N/A | N/A | N/A |
| 26. rs3025039 (VEGF) | 2-CT | 1160/1174 | 0.99 | 0.90-1.08 | .81 |
| 26. rs3025039 (VEGF) | 3-TT | 96/85 | 1.13 | 0.84-1.52 | .47 |

^a SNP: single-nucleotide polymorphism.

^b N/A: not applicable.

Comparison of Cases and Controls in Terms of the Effect of a Single SNP

Table 2 compares patients and normal subjects in terms of effect (OR and 95% CI) at a single SNP for growth factor–related genes. Two SNPs within two genes (rs2229765-AA [IGF1R] and rs2854744-CC [IGFBP3]) showed significant protection associations (rs2229765-AA: OR 0.82, P=.001; rs2854744-CC: OR 0.88, P=.03) for breast cancer. The minimum and maximum protection associations exhibited ORs of 0.82 and 0.88, respectively, and the other SNPs showed nonsignificant protection associations for breast cancer.

Comparison Between the Proposed HTGA and Existing Algorithms

We compared PSO [34], CPSO [35], and the GA [24] with the HTGA for 2-SNP to 7-SNP barcodes with protection associations (Table 3). ORs (<1) indicate the impact of the protection association of SNP barcodes for the occurrence of breast cancer. A high difference between cases and controls in the SNP barcodes represents informative protection associations, and P<.05 indicates a significant difference for the SNP barcode between cases and controls. The identified 3-SNP to 7-SNP barcodes showed that the HTGA provided values with a greater degree of difference as compared with PSO, CPSO, and the GA, indicating that the HTGA identified relevant SNP barcodes with protection associations more effectively (Table 3). HTGA-identified SNP barcodes showed ORs ranging from 0.755 to 0.870 (P=.003) for protection associations with breast cancer. The 2-SNP and 3-SNP barcodes in PSO, CPSO, and the GA showed significant differences between cases and controls (2-SNP: P=.003, P=.001, and P=.03, respectively; 3-SNP: P=.04, P=.04, and P=.002, respectively). The 4-SNP barcodes in CPSO and the GA showed significant differences (P=.04 and P=.004, respectively), and the 5-SNP barcode in the GA also showed a significant difference (P=.03). Although CPSO and the GA provided better ORs as compared with the HTGA in all SNP barcodes, the degrees of difference indicated that the SNP barcodes identified by the HTGA were superior to those identified by other methods, and P values >.05 indicated that these differences revealed by the models were not significant.

Optimization algorithms have been widely applied to detect relevant high-order SNP barcodes in disease and cancer studies [24,25,34]. Differences between cases and controls are often applied to evaluate the values of SNP barcodes in terms of their fitness function design. As indicated in Table 3, the HTGA effectively identified the relevant protection associations of SNP barcodes for breast cancer. The logistic regression analysis results were strongly validated by the outstanding performance of the HTGA in breast cancer SNP barcode identification. The SNP barcodes detected by the proposed HTGA are simply associations between a barcode and disease, and this type of analysis does not support the inference of causality.

Table 3. Estimation of the best protection single-nucleotide polymorphism barcodes for the occurrence of breast cancer as determined by particle swarm optimization, chaotic particle swarm optimization, the genetic algorithm, and the hybrid Taguchi-genetic algorithm.
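| Order | Method | Combined SNP^a | SNP genotypes | Control number | Case number | Difference | OR | 95% CI | P value |
|---|---|---|---|---|---|---|---|---|---|
| 2-SNP | PSO^b | 1,8 | 1-3 | 941 | 816 | 125 | 0.841 | 0.76-0.93 | .001 |
| 2-SNP | PSO | N/A^c | Other | 4059 | 4184 | N/A | N/A | N/A | N/A |
| 2-SNP | CPSO^d | 1,8 | 1-3 | 941 | 816 | 125 | 0.841 | 0.76-0.93 | .001 |
| 2-SNP | CPSO | N/A | Other | 4059 | 4184 | N/A | N/A | N/A | N/A |
| 2-SNP | GA^e | 1,10 | 1-2 | 1926 | 1823 | 103 | 0.916 | 0.85-0.99 | .03 |
| 2-SNP | GA | N/A | Other | 3074 | 3177 | N/A | N/A | N/A | N/A |
| 2-SNP | HTGA^f | 10,17 | 2-1 | 1309 | 1179 | 130^g | 0.870 | 0.79-0.95 | .003 |
| 2-SNP | HTGA | N/A | Other | 3691 | 3821 | N/A | N/A | N/A | N/A |
| 3-SNP | PSO | 8,9,22 | 3-1-2 | 269 | 225 | 44 | 0.829 | 0.69-0.99 | .043 |
| 3-SNP | PSO | N/A | Other | 4731 | 4775 | N/A | N/A | N/A | N/A |
| 3-SNP | CPSO | 3,8,9 | 1-3-1 | 371 | 319 | 52 | 0.850 | 0.73-0.99 | .04 |
| 3-SNP | CPSO | N/A | Other | 4629 | 4681 | N/A | N/A | N/A | N/A |
| 3-SNP | GA | 1,8,15 | 1-3-1 | 624 | 527 | 97 | 0.826 | 0.73-0.94 | .002 |
| 3-SNP | GA | N/A | Other | 4376 | 4473 | N/A | N/A | N/A | N/A |
| 3-SNP | HTGA | 1,10,17 | 1-2-1 | 1158 | 1035 | 123^g | 0.866 | 0.79-0.95 | .003 |
| 3-SNP | HTGA | N/A | Other | 3842 | 3965 | N/A | N/A | N/A | N/A |
| 4-SNP | PSO | 4,8,14,22 | 2-3-1-2 | 99 | 76 | 23 | 0.764 | 0.57-1.03 | .08 |
| 4-SNP | PSO | N/A | Other | 4901 | 4924 | N/A | N/A | N/A | N/A |
| 4-SNP | CPSO | 10,17,21,23 | 2-1-1-1 | 268 | 223 | 45 | 0.824 | 0.69-0.99 | .04 |
| 4-SNP | CPSO | N/A | Other | 4732 | 4777 | N/A | N/A | N/A | N/A |
| 4-SNP | GA | 1,10,17,21 | 1-2-1-1 | 795 | 692 | 103^g | 0.850 | 0.76-0.95 | .004 |
| 4-SNP | GA | N/A | Other | 4205 | 4308 | N/A | N/A | N/A | N/A |
| 4-SNP | HTGA | 1,10,17,21 | 1-2-1-1 | 795 | 692 | 103^g | 0.850 | 0.76-0.95 | .004 |
| 4-SNP | HTGA | N/A | Other | 4205 | 4308 | N/A | N/A | N/A | N/A |
| 5-SNP | PSO | 5,6,8,9,26 | 1-1-3-2-1 | 91 | 75 | 16 | 0.821 | 0.60-1.12 | .21 |
| 5-SNP | PSO | N/A | Other | 4909 | 4925 | N/A | N/A | N/A | N/A |
| 5-SNP | CPSO | 2,4,8,11,18 | 1-2-3-1-2 | 44 | 32 | 12 | 0.726 | 0.46-1.15 | .17 |
| 5-SNP | CPSO | N/A | Other | 4956 | 4968 | N/A | N/A | N/A | N/A |
| 5-SNP | GA | 1,4,15,17,21 | 1-1-1-1-1 | 657 | 585 | 72 | 0.876 | 0.78-0.99 | .03 |
| 5-SNP | GA | N/A | Other | 4343 | 4415 | N/A | N/A | N/A | N/A |
| 5-SNP | HTGA | 1,10,17,21,26 | 1-2-1-1-1 | 603 | 520 | 83^g | 0.846 | 0.75-0.96 | .009 |
| 5-SNP | HTGA | N/A | Other | 4397 | 4480 | N/A | N/A | N/A | N/A |
| 6-SNP | PSO | 4,8,15,19,22,24 | 1-3-2-2-1-3 | 2 | 0 | 2 | 0.333 | 0.04-3.20 | .34 |
| 6-SNP | PSO | N/A | Other | 4998 | 5000 | N/A | N/A | N/A | N/A |
| 6-SNP | CPSO | 3,4,12,16,20,24 | 1-1-1-2-2-3 | 28 | 21 | 7 | 0.749 | 0.43-1.32 | .32 |
| 6-SNP | CPSO | N/A | Other | 4972 | 4979 | N/A | N/A | N/A | N/A |
| 6-SNP | GA | 1,2,4,6,15,18 | 1-1-1-1-1-2 | 276 | 247 | 29 | 0.900 | 0.75-1.06 | .19 |
| 6-SNP | GA | N/A | Other | 4724 | 4753 | N/A | N/A | N/A | N/A |
| 6-SNP | HTGA | 1,10,15,17,21,26 | 1-2-1-1-1-1 | 394 | 327 | 67^g | 0.818 | 0.70-0.95 | .01 |
| 6-SNP | HTGA | N/A | Other | 4606 | 4673 | N/A | N/A | N/A | N/A |
| 7-SNP | PSO | 5,8,11,13,14,24,25 | 1-1-3-1-1-2-1 | 6 | 3 | 3 | 0.500 | 0.13-2.00 | .33 |
| 7-SNP | PSO | N/A | Other | 4994 | 4997 | N/A | N/A | N/A | N/A |
| 7-SNP | CPSO | 10,12,16,17,19,22,26 | 2-2-2-1-2-2-1 | 27 | 20 | 7 | 0.740 | 0.41-1.32 | .31 |
| 7-SNP | CPSO | N/A | Other | 4973 | 4980 | N/A | N/A | N/A | N/A |
| 7-SNP | GA | 1,2,6,7,10,14,15 | 1-1-1-3-2-1-1 | 38 | 25 | 13 | 0.656 | 0.40-1.09 | .10 |
| 7-SNP | GA | N/A | Other | 4962 | 4975 | N/A | N/A | N/A | N/A |
| 7-SNP | HTGA | 1,10,13,15,17,21,26 | 1-2-2-1-1-1-1 | 185 | 141 | 44^g | 0.755 | 0.60-0.94 | .01 |
| 7-SNP | HTGA | N/A | Other | 4815 | 4859 | N/A | N/A | N/A | N/A |

^a SNP: single-nucleotide polymorphism.

^b PSO: particle swarm optimization.

^c N/A: not applicable.

^d CPSO: chaotic particle swarm optimization.

^e GA: genetic algorithm.

^f HTGA: hybrid Taguchi-genetic algorithm.

^g The best results in the n-SNP barcodes.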


Principal Findings

Many breast cancer studies have identified the associations among the effects of important related genes [36-42], including genes related to DNA repair [43,44] and estrogen-response genes [45]. In this study, we introduced a HTGA to identify the SNP barcodes among 26 breast cancer–related SNPs. The HTGA-generated SNP barcodes were examined to determine their protective effects against breast cancer. The results suggest that nonrelevant SNPs might cumulatively reduce the risk of breast cancer, as indicated by the HTGA-generated preventive SNP barcodes. A search space consisting of SNP barcode combinations can generate numerous local optima in multiple regions. These local optima raise challenges for optimization algorithm search operations, because the heuristic and stochastic properties of such optimization algorithms can easily cause searches to become trapped in local optima. A GA population can be updated by referring to other chromosomes to determine the next position in the search space. However, GA operations can result in stagnation if the chromosomes are similar; points of stagnation in a search space are referred to as local optima. The computational processes and comparisons are shown in Figure 2. A Taguchi system is a nonlinear system with deterministic dynamic behavior owing to its ergodic and stochastic properties. Taguchi methods are used to enhance GA crossover operations, and these methods can be remarkably helpful for avoiding population entrapment in local optima because improved solutions can be found through experimentation. Because the population learns from experience, it can be said to exhibit population intelligence. The HTGA can converge quickly to excellent fitness values for SNP barcodes, whereas the GA is very slow to converge and the results are worse than those of the HTGA (Figure 2), indicating that the GA can very easily result in stagnation in regions that may not include any global optima. 
However, the population is effectively improved in the HTGA, and Figure 2 shows that the fitness values of chromosomes clearly increase over the iterations, indicating that the proposed Taguchi method can improve GA performance in identifying SNP barcodes. Moreover, our results support the ability of this Taguchi-based GA to solve SNP barcode identification problems. The parameters of the HTGA could be further analyzed to enhance its ability to detect SNP barcodes. These parameters include the probabilities of crossover and mutation, and further investigation with more data sets is required to determine their optimal values. Moreover, the selection, crossover, mutation, and replacement operations can be analyzed to determine the best operation strategy for enhancing the ability of our HTGA to detect potential SNP barcodes. If the HTGA is applied to clinical data, we suggest using permutation testing to examine the relevance of the results obtained. For each trial in permutation testing, the case/control labels would be scrambled, and the algorithm would then search for an optimal solution. After numerous trials, we would be able to determine the number of times a solution at least as good as the one from the original data is found and thereby determine whether the algorithm is simply fitting the data or identifying underlying associations.
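The permutation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: `best_barcode_score` is a hypothetical stand-in for the HTGA search (an exhaustive scan of 2-SNP barcodes scored by the absolute difference between case and control carriers), genotypes are assumed to be coded 0/1/2, and the add-one correction in the P value is a common convention that the text does not specify.

```python
import random
from itertools import combinations, product


def best_barcode_score(genotypes, labels, order=2):
    """Hypothetical stand-in for the HTGA search (assumption, not the paper's
    fitness function): scan every `order`-SNP barcode and return the largest
    |cases - controls| among the samples that carry it."""
    n_snps = len(genotypes[0])
    best = 0
    for snps in combinations(range(n_snps), order):
        for pattern in product((0, 1, 2), repeat=order):
            cases = controls = 0
            for row, label in zip(genotypes, labels):
                if all(row[s] == g for s, g in zip(snps, pattern)):
                    if label == 1:
                        cases += 1
                    else:
                        controls += 1
            best = max(best, abs(cases - controls))
    return best


def permutation_p_value(genotypes, labels, n_trials=200, seed=0):
    """Scramble the case/control labels n_trials times and count how often
    the search finds a solution at least as good as on the original data."""
    rng = random.Random(seed)
    observed = best_barcode_score(genotypes, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_trials):
        rng.shuffle(shuffled)  # break any genotype-phenotype association
        if best_barcode_score(genotypes, shuffled) >= observed:
            hits += 1
    # Add-one correction (assumption) so the estimate is never exactly 0.
    return observed, (hits + 1) / (n_trials + 1)
```

A small P value would suggest that the barcode found on the original labels is unlikely to arise from fitting noise alone. Because the search is rerun for every permutation, the cost grows linearly with the number of trials.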

Figure 2. Comparison of improvements to fitness values between the genetic algorithm (GA) and hybrid Taguchi-genetic algorithm (HTGA). The images on the left compare GA and HTGA search results for 1000 iterations. The images on the right present the fitness values of an HTGA population at specific iterations. SNP: single-nucleotide polymorphism.

Conclusions

An HTGA was proposed to effectively identify relevant SNP barcodes among genes related to breast cancer. The study results demonstrated that the HTGA could effectively detect SNP barcodes in problems with numerous high-order SNP barcode combinations. The Taguchi method, integrated after the GA crossover operations, systematically optimizes the offspring chromosomes and thereby improves the ability of the GA to identify high-dimensional SNP barcodes. Moreover, the HTGA can effectively converge to a promising region within the problem space and provide excellent SNP barcode identification. In this study, large data sets were used to evaluate and compare the performances of the GA, PSO, CPSO, and HTGA, and the results indicated that the HTGA can effectively identify relevant high-order SNP barcodes in breast cancer.
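As a rough illustration of how a Taguchi step after crossover can systematically combine two chromosomes, the sketch below treats each gene as a two-level factor (level 0 takes the gene from the first parent, level 1 from the second) and uses the standard L4(2^3) orthogonal array to estimate which parent should supply each gene. The three-gene chromosome, the function name `taguchi_offspring`, and the use of raw fitness in place of a signal-to-noise ratio are simplifying assumptions and may differ from the configuration used in this study.

```python
# Standard two-level orthogonal array L4(2^3): 4 experiments, 3 factors.
L4 = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]


def taguchi_offspring(parent_a, parent_b, fitness):
    """Build one offspring from two 3-gene parents using L4 main effects.

    `fitness` is any larger-the-better function of a chromosome; using it
    directly instead of a signal-to-noise ratio is a simplification."""
    parents = (parent_a, parent_b)
    n_factors = len(L4[0])
    assert len(parent_a) == len(parent_b) == n_factors

    # effect[i][level] accumulates the fitness of every experiment in which
    # factor i was set to that level.
    effect = [[0.0, 0.0] for _ in range(n_factors)]
    for row in L4:
        trial = [parents[level][i] for i, level in enumerate(row)]
        score = fitness(trial)
        for i, level in enumerate(row):
            effect[i][level] += score

    # For each gene, keep the parent whose level has the larger total effect.
    best_levels = [0 if e[0] >= e[1] else 1 for e in effect]
    return [parents[level][i] for i, level in enumerate(best_levels)]
```

Only 4 of the 8 possible gene combinations are evaluated, yet the main effects can point to a combination that was never tested directly, which is why an orthogonal-array step is cheaper than an exhaustive comparison of all recombinations.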

Acknowledgments

This work was supported by the National Science Council, Taiwan (108-2221-E-214-019-MY3 and 108-2221-E-992-031-MY3).

Conflicts of Interest

None declared.

  1. Sharma R. Breast cancer incidence, mortality and mortality-to-incidence ratio (MIR) are associated with human development, 1990-2016: evidence from Global Burden of Disease Study 2016. Breast Cancer 2019 Jul;26(4):428-445.
  2. Liu F, Lin H, Kuo C, See L, Chiou M, Yu H. Epidemiology and survival outcome of breast cancer in a nationwide study. Oncotarget 2017 Mar 07;8(10):16939-16950.
  3. Abubakar M, Sung H, Bcr D, Guida J, Tang TS, Pfeiffer RM, et al. Breast cancer risk factors, survival and recurrence, and tumor molecular subtype: analysis of 3012 women from an indigenous Asian population. Breast Cancer Res 2018 Sep 18;20(1):114.
  4. Visser LL, Elshof LE, Schaapveld M, van de Vijver K, Groen EJ, Almekinders MM, et al. Clinicopathological Risk Factors for an Invasive Breast Cancer Recurrence after Ductal Carcinoma In Situ: A Nested Case–Control Study. Clin Cancer Res 2018 Apr 23;24(15):3593-3601.
  5. Park S, Han W, Kim J, Kim MK, Lee E, Yoo T, et al. Risk Factors Associated with Distant Metastasis and Survival Outcomes in Breast Cancer Patients with Locoregional Recurrence. J Breast Cancer 2015 Jun;18(2):160-166.
  6. Marsaux CF, Celis-Morales C, Livingstone KM, Fallaize R, Kolossa S, Hallmann J, et al. Changes in Physical Activity Following a Genetic-Based Internet-Delivered Personalized Intervention: Randomized Controlled Trial (Food4Me). J Med Internet Res 2016 Feb 05;18(2):e30.
  7. Shin SJ, You SC, Park YR, Roh J, Kim J, Haam S, et al. Genomic Common Data Model for Seamless Interoperation of Biomedical Data in Clinical Practice: Retrospective Study. J Med Internet Res 2019 Mar 26;21(3):e13249.
  8. Yang C, Chuang L, Lin Y. Epistasis Analysis Using an Improved Fuzzy C-Means-Based Entropy Approach. IEEE Trans. Fuzzy Syst 2020 Apr;28(4):718-730.
  9. Yang C, Moi S, Ou-Yang F, Chuang L, Hou M, Lin Y. Identifying Risk Stratification Associated With a Cancer for Overall Survival by Deep Learning-Based CoxPH. IEEE Access 2019;7:67708-67717.
  10. Moi S, Lee Y, Chuang L, Yuan SF, Ou-Yang F, Hou M, et al. Cumulative receiver operating characteristics for analyzing interaction between tissue visfatin and clinicopathologic factors in breast cancer progression. Cancer Cell Int 2018;18:19.
  11. Kotredes KP, Razmpour R, Lutton E, Alfonso-Prieto M, Ramirez SH, Gamero AM. Characterization of cancer-associated IDH2 mutations that differ in tumorigenicity, chemosensitivity and 2-hydroxyglutarate production. Oncotarget 2019 Apr 12;10(28):2675-2692.
  12. Šolman M, Ligabue A, Blaževitš O, Jaiswal A, Zhou Y, Liang H, et al. Specific cancer-associated mutations in the switch III region of Ras increase tumorigenicity by nanocluster augmentation. Elife 2015 Aug 14;4:e08905.
  13. Derouet M, Wu X, May L, Hoon Yoo B, Sasazuki T, Shirasawa S, et al. Acquisition of anoikis resistance promotes the emergence of oncogenic K-ras mutations in colorectal cancer cells and stimulates their tumorigenicity in vivo. Neoplasia 2007 Jul;9(7):536-545.
  14. Petros JA, Baumann AK, Ruiz-Pesini E, Amin MB, Sun CQ, Hall J, et al. mtDNA mutations increase tumorigenicity in prostate cancer. Proc Natl Acad Sci U S A 2005 Jan 18;102(3):719-724.
  15. Weich N, Ferri C, Moiraghi B, Bengió R, Giere I, Pavlovsky C, et al. GSTM1 and GSTP1, but not GSTT1 genetic polymorphisms are associated with chronic myeloid leukemia risk and treatment response. Cancer Epidemiol 2016 Oct;44:16-21.
  16. Fu OY, Chang H, Lin Y, Chuang L, Hou M, Yang C. Breast cancer-associated high-order SNP-SNP interaction of CXCL12/CXCR4-related genes by an improved multifactor dimensionality reduction (MDR-ER). Oncol Rep 2016 Sep;36(3):1739-1747.
  17. Tang J, Chuang L, Hsi E, Lin Y, Yang C, Chang H. Identifying the association rules between clinicopathologic factors and higher survival performance in operation-centric oral cancer patients using the Apriori algorithm. Biomed Res Int 2013;2013:359634.
  18. Yang C, Wu K, Dahms H, Chuang L, Chang H. Single nucleotide polymorphism barcoding of cytochrome c oxidase I sequences for discriminating 17 species of Columbidae by decision tree algorithm. Ecol Evol 2017 Jul;7(13):4717-4725.
  19. Yang C, Lin Y, Chuang L. Class Balanced Multifactor Dimensionality Reduction to Detect Gene–Gene Interactions. IEEE/ACM Trans. Comput. Biol. and Bioinf 2020 Jan 1;17(1):71-81.
  20. Ou-Yang F, Lin Y, Chuang L, Chang H, Yang C, Hou M. The Combinational Polymorphisms of ORAI1 Gene Are Associated with Preventive Models of Breast Cancer in the Taiwanese. Biomed Res Int 2015;2015:281263.
  21. Yang P, Ho JW, Yang YH, Zhou BB. Gene-gene interaction filtering with ensemble of filters. BMC Bioinformatics 2011 Feb 15;12(S1).
  22. Chuang L, Lane H, Lin Y, Lin M, Yang C, Chang H. Identification of SNP barcode biomarkers for genes associated with facial emotion perception using particle swarm optimization algorithm. Ann Gen Psychiatry 2014;13(1):15.
  23. Yan R, Cao J, Song C, Chen Y, Wu Z, Wang K, et al. Polymorphisms in lncRNA HOTAIR and susceptibility to breast cancer in a Chinese population. Cancer Epidemiol 2015 Dec;39(6):978-985.
  24. Chang W, Fang Y, Chang H, Chuang L, Lin Y, Hou M, et al. Identifying association model for single-nucleotide polymorphisms of ORAI1 gene for breast cancer. Cancer Cell Int 2014 Mar 31;14(1):29.
  25. Chen J, Chuang L, Lin Y, Liou C, Lin T, Lee W, et al. Genetic algorithm-generated SNP barcodes of the mitochondrial D-loop for chronic dialysis susceptibility. Mitochondrial DNA 2014 Jun;25(3):231-237.
  26. Yang C, Moi S, Lin Y, Chuang L. Genetic Algorithm Combined with a Local Search Method for Identifying Susceptibility Genes. Journal of Artificial Intelligence and Soft Computing Research 2016;6(3):203-212.
  27. Holland J. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. Cambridge, Massachusetts: MIT Press; 1992.
  28. Bendell A, Disney J, Pridmore W. Taguchi Methods: Applications in World Industry. Berlin, Germany: Springer Verlag; 1989.
  29. Miller B, Goldberg D. Genetic algorithms, tournament selection, and the effects of noise. Complex Systems 1995;9(3):193-212.
  30. Eberhart R, Kennedy J. A new optimizer using particle swarm theory. In: MHS'95. Proceedings of the Sixth International Symposium on Micro Machine and Human Science. 1995 Presented at: Sixth International Symposium on Micro Machine and Human Science; October 4-6, 1995; Nagoya, Japan p. 39-43.
  31. Chuang L, Chang H, Lin M, Yang C. Chaotic particle swarm optimization for detecting SNP–SNP interactions for CXCL12-related genes in breast cancer prevention. European Journal of Cancer Prevention 2012;21(4):336-342.
  32. Mechanic LE, Luke BT, Goodman JE, Chanock SJ, Harris CC. Polymorphism Interaction Analysis (PIA): a method for investigating complex gene-gene interactions. BMC Bioinformatics 2008 Mar 06;9:146.
  33. Pharoah PD, Tyrer J, Dunning AM, Easton DF, Ponder BA, SEARCH Investigators. Association between common variation in 120 candidate genes and breast cancer risk. PLoS Genet 2007 Mar 16;3(3):e42.
  34. Wu S, Chuang L, Lin Y, Ho W, Chiang F, Yang C, et al. Particle swarm optimization algorithm for analyzing SNP-SNP interaction of renin-angiotensin system genes against hypertension. Mol Biol Rep 2013 Jul;40(7):4227-4233.
  35. Chuang L, Chang H, Lin M, Yang C. Chaotic particle swarm optimization for detecting SNP–SNP interactions for CXCL12-related genes in breast cancer prevention. European Journal of Cancer Prevention 2012;21(4):336-342.
  36. Chen L, Li W, Zhang L, Wang H, He W, Tai J, et al. Disease gene interaction pathways: a potential framework for how disease genes associate by disease-risk modules. PLoS One 2011;6(9):e24495.
  37. Yin J, Lu K, Lin J, Wu L, Hildebrandt MA, Chang DW, et al. Genetic variants in TGF-β pathway are associated with ovarian cancer risk. PLoS One 2011;6(9):e25559.
  38. Ricceri F, Guarrera S, Sacerdote C, Polidoro S, Allione A, Fontana D, et al. ERCC1 haplotypes modify bladder cancer risk: a case-control study. DNA Repair (Amst) 2010 Feb 04;9(2):191-200.
  39. Cauchi S, Meyre D, Durand E, Proença C, Marre M, Hadjadj S, et al. Post genome-wide association studies of novel genes associated with type 2 diabetes show gene-gene interaction and high predictive value. PLoS One 2008 May 07;3(5):e2031.
  40. Lin G, Tseng H, Chang C, Chuang L, Liu C, Yang C, et al. SNP combinations in chromosome-wide genes are associated with bone mineral density in Taiwanese women. Chin J Physiol 2008 Feb 29;51(1):32-41.
  41. Yang C, Lin Y, Yen C, Chuang L, Chang H. A systematic gene-gene and gene-environment interaction analysis of DNA repair genes XRCC1, XRCC2, XRCC3, XRCC4, and oral cancer risk. OMICS 2015 Apr;19(4):238-247.
  42. Zheng SL, Sun J, Wiklund F, Smith S, Stattin P, Li G, et al. Cumulative Association of Five Genetic Variants with Prostate Cancer. N Engl J Med 2008 Feb 28;358(9):910-919.
  43. Conde J, Silva SN, Azevedo AP, Teixeira V, Pina JE, Rueff J, et al. Association of common variants in mismatch repair genes and breast cancer susceptibility: a multigene study. BMC Cancer 2009 Sep 25;9:344.
  44. Han W, Kim K, Yang S, Noh D, Kang D, Kwack K. SNP-SNP interactions between DNA repair genes were associated with breast cancer risk in a Korean population. Cancer 2012 Feb 01;118(3):594-602.
  45. Yu J, Hsiung C, Hsu H, Bao B, Chen S, Hsu G, et al. Genetic variation in the genome-wide predicted estrogen response element-related sequences is associated with breast cancer development. Breast Cancer Res 2011 Jan 31;13(1):R13.


Abbreviations

CPSO: chaotic particle swarm optimization
GA: genetic algorithm
HTGA: hybrid Taguchi-genetic algorithm
OA: orthogonal array
PSO: particle swarm optimization
SNP: single-nucleotide polymorphism
SNR: signal-to-noise ratio


Edited by C Lovis; submitted 04.11.19; peer-reviewed by BT Luke, HW Chang; comments to author 15.12.19; revised version received 09.02.20; accepted 08.04.20; published 17.06.20

Copyright

©Li-Yeh Chuang, Cheng-San Yang, Huai-Shuo Yang, Cheng-Hong Yang. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 17.06.2020.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.