A Novel Approach in Determining Neural Networks Architecture to Classify Data With Large Number of Attributes

One of the challenges in the successful implementation of deep neural networks (DNN) lies in the determination of their architecture, in terms of the number of hidden layers and the number of neurons per hidden layer. In this research, a new approach is proposed to determine a neural network architecture, specifically a Multi-Layer Perceptron (MLP), which is then used as a machine learning method to classify data with a large number of attributes. A new approach is needed because previous approaches are no longer applicable as general guidelines for determining neural network architecture. The proposed approach determines the number of hidden layers by using principal component analysis (PCA), while the number of neurons for each hidden layer is determined by using K-Means clustering. The determined architecture is used to classify data with a large number of attributes: the Gas Sensor Array Drift dataset, which has 128 input attributes and six output classes, and the Parkinson's Disease Classification dataset, which has 754 input attributes and two output classes. The results indicate that the best-performing architecture for the first dataset uses one hidden layer, with a PCA cumulative variance of 69.7%, while the best-performing architecture for the second dataset uses three hidden layers, with a PCA cumulative variance of 38.9%. Increasing the number of hidden layers does not always improve the performance of neural networks; therefore, it is essential to determine the appropriate number of hidden layers and neurons to achieve good performance. The use of PCA and K-Means clustering is expected to provide guidelines for determining neural network architectures with good performance.


I. INTRODUCTION
As one of the techniques in artificial intelligence, artificial neural networks have been widely applied in various fields, such as medicine, engineering, and finance, due to their ability to approximate arbitrary functions. Over the past 30 years, published results have indicated that artificial neural networks (in the form of feedforward networks with one hidden layer) can approximate arbitrary functions [1]-[7]. One important aspect of designing neural networks is choosing an architecture with good generalization capability [8], [9]. In the common case of a fully connected network with a predetermined set of inputs and outputs, the architecture is determined by the number of hidden layers and the number of neurons in each hidden layer. Former literature mostly discusses how to determine the number of hidden neurons (assuming one hidden layer), but rarely discusses how to determine the optimal number of hidden layers. This is due to the assumption that a network with only one hidden layer is sufficient to universally approximate almost all functions [3]-[7]. However, several studies have shown that in some cases the use of two hidden layers provides better performance than one [3]-[7].
With the increasing power of computers, the application of neural networks with more than one hidden layer has become attractive to researchers, especially through the use of deep neural networks (DNN) to solve problems in various fields. DNN is defined as a technique that applies neural networks with numerous hidden layers between the input and output layers [3], [10]; in other words, machine learning with deep neural networks. One of the challenges in the successful implementation of deep neural networks lies in the determination of the network architecture, which is closely related to the number of hidden layers and hidden neurons. The choice of hidden layers and hidden neurons plays a significant role in the performance of deep neural networks [8], [9]. However, determining these two quantities manually (usually through trial and error) to obtain a reasonably optimal architecture is a time-consuming process.
Deep neural networks have achieved great success in practice, with a large influence on the machine learning and artificial intelligence literature [11], [12]. Despite this practical success, the theoretical properties of neural network architecture determination are still under investigation. For example, existing methods for finding the number of hidden layers and hidden neurons in deep neural networks are not based on established theory [8], [12], [13]. Neural networks should also be designed as simply as possible because of computational time [14]. Therefore, it is necessary to find a practical method to determine neural network architecture.
The approach in this paper proposes a way to determine neural network architecture, that is, the number of hidden layers and the number of neurons for each hidden layer. The resulting architecture is applied to classify data with a large number of attributes. An increasing number of attributes in a dataset leads to a sparser data space, smaller distances between objects, and reduced performance of distance-based machine learning [15]. The data used in this research have more than one hundred attributes. This paper only discusses the determination of neural network architecture for classification purposes, especially for data with a large number of attributes. Data with a ''large number of attributes'' are data with between one hundred and one thousand input attributes. We use this term to differentiate it from ''high-dimensional data'', in which the number of input attributes (p) outnumbers the number of samples (N), i.e., p ≫ N [16].
The neural network architecture referred to in this paper is the Multi-Layer Perceptron (MLP), a neural network in which each neuron in one layer is connected to all neurons in the next layer. MLP was chosen because it is the most fundamental architecture for deep learning [17]. Consequently, the proposed approach for determining MLP architecture is expected to serve as a guide for determining the number of hidden layers and neurons in other types of neural network architecture. This paper is organized as follows. Section II explains the related works. Section III discusses the proposed new approach to determine the architecture of neural networks. Section IV presents the stages of experimentation. Section V explains the experimental results and discussion. Finally, section VI presents conclusions and future works.

II. RELATED WORKS
Prior researchers have proposed various methods to determine the number of hidden neurons and have compared the use of one and two hidden layers; determining both the number of hidden layers and the number of neurons per layer has been investigated mainly since the rise of deep neural networks. In general, research on determining the number of hidden layers and neurons can be divided into: research that only determines the number of neurons, assuming the network has one hidden layer; research that compares the use of one and two hidden layers; and research that determines both the number of hidden layers and the number of neurons per hidden layer. In the early stages of using neural networks in machine learning, most researchers focused only on determining the number of neurons, assuming that the network has one or two hidden layers. Several studies on determining the number of hidden neurons are described below chronologically. Arai (1993) proposed a method to determine the number of hidden neurons, where I−1 to I/3 hidden neurons were sufficient for binary classification [18]. Masters (1993) argued that there are no theoretical reasons, and likewise no practical reasons, to use more than two hidden layers in a neural network. For determining the number of hidden neurons, he proposed the geometric pyramid rule, which states that the number of neurons in successive hidden layers forms a pyramid, decreasing from input to output [19]. Tamura and Tateishi (1997) developed a method based on the Akaike Information Criterion: the number of hidden neurons in a three-layer neural network is N−1 and in a four-layer neural network is N/2 + 3, where N is the number of inputs minus the number of outputs [20].
Further, Islam and Murase (2001) proposed using a large number of hidden neurons in a network with one hidden layer to improve performance, but the ability of the network to generalize can be degraded when the number of neurons is too large, resulting in spurious connections [21]. Zhang et al. (2003) applied a set covering algorithm (SCA) in a three-layer neural network based on the number of inputs; theoretically, the number of hidden neurons is estimated by random search [22]. Doukim et al. (2010) proposed a technique to find the number of neurons in MLP neural networks using a search technique applied to skin detection, consisting of binary search and sequential search. In this implementation, 30 networks were trained to find the lowest mean squared error, and a sequential search was then performed to find the best number of hidden neurons [23]. Similarly, Sheela and Deepa (2013) proposed a method to calculate the optimal number of neurons by making a list of 101 convergent functions and simulating them one by one [24]. A related study calculated the optimal number of neurons by making a list of 151 convergent functions and simulating them one by one [2].
Research comparing the use of one or two hidden layers has been conducted on univariate and multivariate functions [4]-[7], [25]. Nakama (2011) compared the performance of one and two hidden layers for linear and quadratic functions, showing that one hidden layer reached convergence faster [25]. Thomas et al. (2016) obtained different results, namely that two hidden layers gave better performance for predictive functions [4], [5]. Other studies concluded that one hidden layer is less suitable for approximating multivariate functions, and that two hidden layers provide better performance [6], [7]. To determine the number of hidden layers and neurons, previous studies broadly divided the approaches into automatic and manual determination [26], [27]. Automatic determination is performed either with or without a model, while manual determination relies on certain formulas or on trial and error. Automatic determination without a model includes grid search (GS) and random search (RS). GS is generally used to optimize DNN hyper-parameters only when the number of candidate values is low, by selecting a range of values to explore; a DNN is then trained for each specification along with the other hyper-parameters. Random search is an alternative to GS that reaches convergence faster. RS is not adaptive, meaning it is not dynamically updated during the experiment (earlier solutions do not affect the search). There are also hybrid algorithms that combine RS with other techniques to improve performance, such as manual updates provided by experts [28]-[30]. One model-based algorithm applies particle swarm optimization (PSO), an optimization technique that iteratively improves candidate solutions with respect to a quality measure [3], [29]. This algorithm optimizes the problem by moving particles (potential solutions) in the problem space according to functions of the particles' positions and velocities.
The movement of each particle is influenced by the best solutions found so far, generally obtained from other particles. This set of particles is called a swarm, and eventually the swarm moves toward the best solution. Another method that can be applied in practice uses data mining techniques, including clustering and regression [31], [32]. However, that study does not clearly explain how the number of clusters and the feature extraction are used in the machine learning process, so subsequent researchers have had difficulty replicating the method. Although some researchers have proposed methods to determine the number of hidden layers and neurons at the same time, these methods have rarely been adopted by other researchers because they are difficult to apply and do not fit all types of datasets. Thus, many researchers still prefer the trial and error method [9].

III. PROPOSED APPROACH FOR NEURAL NETWORKS ARCHITECTURE DETERMINATION
Previous research has proposed approaches to determine neural network architecture, such as trial and error, rules of thumb, and other specific techniques. However, no general approach has emerged that other researchers can refer to as a guideline for other datasets or cases, since the previously proposed approaches are not generally applicable. The following subsections explain the rationale for applying principal component analysis and clustering to determine neural network architecture.

A. RATIONALE
This section explains the steps that form the rationale of the proposed approach. The approach uses principal component analysis to determine the number of hidden layers in a neural network: based on the cumulative variance of the components, the number of components to retain can be found, which serves as the basis for the number of hidden layers. For each retained component, clustering is then carried out to obtain the optimal number of clusters, which serves as the basis for the number of neurons in the corresponding hidden layer.

1) PRINCIPAL COMPONENT ANALYSIS (PCA)
Principal Component Analysis is a multivariate analysis that transforms the correlated original variables into new uncorrelated variables, reducing the number of variables to a smaller dimension while still accounting for most of the variance of the original variables. The number of principal components formed (PC_i, where i runs from 1 to p, the number of original variables) equals the number of original variables, as in Figure 1 (number of original variables = 5). Each principal component is a linear combination of the weighted original variables and is uncorrelated with the other principal components. Dimension reduction (simplification) is performed using the criterion of the cumulative percentage of variance (a%) of the data explained by the first q principal components, or based on the eigenvalues of the principal components (Figure 1, top section, marked with dashed lines). In Figure 1, two representative components are taken, indicated by the box notation.
Furthermore, the data on the original variables are projected onto the selected principal components (PC_i). The first principal component is the component with the highest variance, the second principal component has the second-highest variance, and so on until the last component, which has the lowest variance. PC_i is then denoted Y_i. Each component is a linear combination of the original variables and weights:

Y_i = w_i1 x_1 + w_i2 x_2 + ... + w_ip x_p

where Y_i is the i-th principal component (i runs from one to the number of input attributes p), w_ij is the weight of the j-th attribute in the i-th component, and x_j is the j-th attribute. The number of principal components (Y_i) equals the number of original attributes. For example, if a dataset consisting of 5 input attributes is given, the PCA process will produce five principal components.
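As a concrete illustration of this linear combination, the sketch below (not from the paper; it uses a small hypothetical two-attribute dataset) derives the weights w_1j of the first principal component from the eigen-decomposition of a 2×2 covariance matrix and projects the centered data onto it:

```python
import math

# Hypothetical dataset: 2 attributes, 5 samples.
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0)]

n = len(data)
mean = [sum(col) / n for col in zip(*data)]
centered = [(x - mean[0], y - mean[1]) for x, y in data]

# Sample covariance matrix [[a, b], [b, c]] of the centered data.
a = sum(x * x for x, _ in centered) / (n - 1)
c = sum(y * y for _, y in centered) / (n - 1)
b = sum(x * y for x, y in centered) / (n - 1)

# Eigenvalues of the 2x2 symmetric covariance matrix.
disc = math.sqrt((a - c) ** 2 + 4 * b * b)
lam1 = (a + c + disc) / 2   # largest eigenvalue -> first principal component
lam2 = (a + c - disc) / 2

# Unit eigenvector = weights (w_11, w_12) of the first principal component.
vx, vy = b, lam1 - a
norm = math.hypot(vx, vy)
w = (vx / norm, vy / norm)

# Each sample projected onto PC1 is the linear combination Y_1 = w_11*x_1 + w_12*x_2.
scores = [w[0] * x + w[1] * y for x, y in centered]

# Fraction of the total variance explained by PC1.
explained = lam1 / (lam1 + lam2)
print(round(explained, 3))
```

The same construction generalizes to p attributes, where the eigen-decomposition is done numerically rather than with the closed 2×2 formula.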
If y is the vector of principal components, a is the weight matrix, and x is the vector of original attributes, then the principal components can be written compactly in matrix notation as y = a x. After the original data are transformed using the equation of each component, the transformed data take values within a certain range (minimum to maximum). Data within such a range are grouped into several clusters or classes using techniques such as the K-Means method.
To obtain the optimal number of clusters, the Elbow criterion is then applied. Clustering is performed on the data that have been transformed by the corresponding principal component equation, so that each component is divided into a number of clusters, where each cluster has its own value range. The clustering results for each component are indicated by the small circle notation within each component in Figure 1.
For example, for the clustering of the first component with an optimal number of k clusters, the results can be seen in Table 1: the first cluster has a value range from c_1 to c_2, the second cluster from c_3 to c_4, and so on, where c_1 < c_2 < c_3 < ... < c_{2k−2} < c_{2k−1} < c_{2k}.
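The cluster ranges of Table 1 can be sketched as follows; the sketch (ours, not the paper's WEKA workflow) runs a plain 1-D K-Means on hypothetical projected scores of one component and reports each cluster's [c_min, c_max] range. In one dimension, nearest-center assignment makes each cluster an interval, so the ranges are ordered and disjoint:

```python
import random

# Hypothetical 1-D scores of one principal component (after projection):
# two well-separated groups of values.
random.seed(0)
scores = [random.gauss(0, 1) for _ in range(40)] + [random.gauss(6, 1) for _ in range(40)]

def kmeans_1d(values, k, iters=50):
    """Plain 1-D K-Means with deterministic spread-out initialization (assumes k >= 2)."""
    s = sorted(values)
    centers = [s[int(i * (len(s) - 1) / (k - 1))] for i in range(k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each value to its nearest center.
            clusters[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        # Recompute each center as the mean of its cluster (keep old center if empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d(scores, k=2)

# Each cluster covers a value range [c_min, c_max], as in Table 1.
ranges = sorted((min(c), max(c)) for c in clusters if c)
print(ranges)
```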

2) MULTILAYER NEURAL NETWORKS
This specific architecture of a multilayer neural network is referred to as a feed-forward network, because successive layers feed each other in the forward direction from input to output. The default architecture assumes that every node (neuron) in one layer is connected to every neuron in the next layer. If the neural network contains p_1, ..., p_k neurons in its k hidden layers, then the outputs of these layers are represented as column vectors h_1, ..., h_k with dimensions p_1, ..., p_k. The number of neurons in each layer is therefore called the dimension of that layer. Each neuron represents the input values at a certain level of complexity [33], [34]. As an example, the bottom of Figure 1 shows a neural network with 5 input attributes, 2 hidden layers, and one output layer. According to Aggarwal (2018), hidden layers closer to the output compute more complex patterns by using the functions of the previous layers; thus, the complexity of a layer is a concatenation or accumulation of the complexity of the previous layers [35], as shown in Figure 2, where layer i is the i-th hidden layer of the neural network (i from 1 to k) and X is the input. So, complexity(layer k) > complexity(layer k−1) > ... > complexity(layer 1).
For example, a neural network with k hidden layers has k+1 computational layers (including the output layer). The k+1 weight matrices between successive layers are denoted W_1, ..., W_{k+1}. Let x be a column vector of dimension d corresponding to the input attributes, h_1, ..., h_k the column vectors of the hidden layers, and o a column vector of dimension m representing the output. The recurrence equations for a multilayer neural network are:

h_1 = Φ(W_1 x)
h_{i+1} = Φ(W_{i+1} h_i), for i = 1, ..., k−1
o = Φ(W_{k+1} h_k)

Figure 2. Hidden layer and pattern complexity [35].
If all the activation functions Φ above are identity functions, the entire network computes only a linear function [35].
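The recurrence above can be sketched directly; the weight values and layer sizes below are hypothetical, and the sigmoid stands in for a generic activation Φ:

```python
import math

def matvec(W, v):
    """Matrix-vector product for a weight matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(v):
    """Element-wise logistic activation, standing in for Phi."""
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

# Hypothetical weights: 3 inputs -> 4 hidden -> 2 hidden -> 1 output.
W1 = [[0.1] * 3 for _ in range(4)]
W2 = [[0.2] * 4 for _ in range(2)]
W3 = [[0.3] * 2 for _ in range(1)]

x = [1.0, 0.5, -0.5]

# Recurrence: h1 = Phi(W1 x), h2 = Phi(W2 h1), o = Phi(W3 h2).
h1 = sigmoid(matvec(W1, x))
h2 = sigmoid(matvec(W2, h1))
o = sigmoid(matvec(W3, h2))
print(o)
```

Replacing `sigmoid` with the identity function collapses the three layers into a single linear map, which is the observation made above.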

3) MAPPING FROM PRINCIPAL COMPONENT ANALYSIS TO NEURAL NETWORKS
A greater number of hidden layers in a neural network represents increasingly complex features [33]. A deep architecture can detect features arranged hierarchically at each layer: lower layers (closer to the input) detect simpler features, whereas layers closer to the output detect more complex features [34]. In PCA, higher information content is represented by the principal components with high variance. In other words, the increasingly complex features of the hidden layers are aligned with the principal components of higher variance. Based on this rationale, the number of hidden layers in the neural network corresponds to the number of components retained in principal component analysis. Therefore, in this research, the number of hidden layers is determined based on the number of principal components formed through PCA.
Here PC_i is the i-th principal component of PCA and h_i is the i-th hidden layer of the neural network. Each neuron in a given layer, despite producing a different value, is assumed to have a similar level of complexity; however, neurons in the same layer have different characteristic values. Values with similar characteristics belong to the same group, while values with different characteristics belong to different groups. This leads to the assumption that each principal component produced by PCA is aligned with a hidden layer of the neural network, where clustering is carried out on each principal component. The optimal number of clusters formed in each principal component is assumed to represent the number of distinct values across the neurons within the corresponding hidden layer. The mapping from PCA and clustering to the neural network is depicted in Figure 1.
Based on equation 7, the cumulative variance in PCA is aligned with the cumulative complexity in the neural network, and based on Figure 2, hidden layers with higher complexity are those closer to the output. This implies that principal components with higher variance represent hidden layers closer to the output. Therefore, after determining the number of principal components to be used based on the cumulative variance, they are ordered as follows:
• The first principal component (highest variance) represents the hidden layer closest to the output (i.e., the k-th hidden layer).
• The second principal component (second-highest variance) represents the (k−1)-th hidden layer.
• And so on, until the retained principal component with the lowest variance represents the first hidden layer (closest to the input).
Based on equations 1 and 5, both are compatible equations: each is a linear combination of weights and input attributes. For instance, for a neural network with one hidden layer, W_1 is the weight matrix connecting the input to the neurons of the first hidden layer (in Figure 1, each weight is symbolized by a line connecting an input to a neuron). The linear combination of the input attributes and the weight matrix yields the neurons h_11, h_12, ..., h_1I, each with a certain value: the first neuron of the first hidden layer, the second neuron, and so on until the last neuron. The clustering results of the first component in Table 1, when connected with each neuron in the first hidden layer, can be seen in Table 2.
For instance, the linear combinations of the W_1 weight matrix and the input attributes whose values fall in the range c_1..c_2 correspond to the first hidden neuron of the first hidden layer (h_11), those in the range c_3..c_4 correspond to the second hidden neuron of the first hidden layer (h_12), and so on.

B. STAGES OF DETERMINING THE PROPOSED NEURAL NETWORKS ARCHITECTURE
Processing of the available datasets consists of three stages:
1. Analyze the dataset using PCA to obtain the significant principal components.
2. Perform clustering with the K-Means method on each principal component, varying the number of clusters.
3. Determine the optimal number of clusters for each principal component using the Elbow criterion.

The principal component analysis algorithm is performed as follows:
1. Calculate the mean of all samples: μ = (1/N) Σ_i x_i.
2. Subtract the mean from all data samples: x_i ← x_i − μ.
3. Calculate the covariance matrix: Σ = (1/N) Σ_i (x_i − μ)(x_i − μ)^T.
4. Calculate the eigenvectors V and eigenvalues λ of the covariance matrix Σ.
5. Sort the eigenvectors by their corresponding eigenvalues.
6. Select the eigenvectors with the largest eigenvalues, W = {v_1, ..., v_k}; the selected eigenvectors W represent the PCA projection space.
7. Project all samples onto the lower-dimensional PCA space: y_i = W^T (x_i − μ).

The clustering algorithm with the K-Means method is conducted as follows:
1. Determine the number of clusters k.
2. Choose k cluster centers at random.
3. Calculate the distance of each data point to each cluster center: d(x_i, μ_k) = ‖x_i − μ_k‖.
4. Allocate each data point to the cluster with the closest center.
5. Recalculate each cluster center from the data in that cluster.
6. Repeat steps 3 through 5 until no data point changes cluster. {The total within-cluster variation is wss = Σ_k Σ_{x_i ∈ C_k} ‖x_i − μ_k‖², where x_i is the i-th data point in a particular cluster and μ_k is the center of that cluster.}

The Elbow criterion is applied as follows:
1. Run a clustering algorithm (e.g., K-Means) for different numbers of clusters k, for example k from 1 to n.
2. Calculate the wss value for each k.
3. Plot wss against the number of clusters k.
4. Find the location where the curve bends (the ''elbow''), which is generally considered an indicator of the correct number of clusters.
In this research, the criterion was modified: the optimal number of clusters is reached when three successive wss values are relatively unchanged.
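The modified criterion can be sketched as follows; the wss values and the tolerance `tol` are hypothetical placeholders (the paper does not specify a numeric threshold for "relatively unchanged"):

```python
# Hypothetical wss values for k = 1..10 (wss decreases sharply, then flattens).
wss = {1: 500.0, 2: 260.0, 3: 150.0, 4: 90.0, 5: 62.0,
       6: 60.5, 7: 60.0, 8: 59.7, 9: 59.5, 10: 59.4}

def optimal_k(wss, tol=0.05):
    """Modified Elbow criterion: the optimal number of clusters is the first
    k at which three successive wss values are relatively unchanged
    (relative drop between neighbors below `tol`)."""
    ks = sorted(wss)
    for i in range(len(ks) - 2):
        k0, k1, k2 = ks[i], ks[i + 1], ks[i + 2]
        d1 = (wss[k0] - wss[k1]) / wss[k0]
        d2 = (wss[k1] - wss[k2]) / wss[k1]
        if d1 < tol and d2 < tol:
            return k0
    return ks[-1]  # no flat stretch found: fall back to the largest k tried

print(optimal_k(wss))
```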

IV. EXPERIMENTATION STAGES
The stages of the experiment consist of: preparing a dataset, normalizing, setting parameters, determining the architecture, training and testing, and calculating performance.

A. DATASET PREPARATION
The proposed determination of neural network architecture can be applied to data from various real-life fields, such as finance, medicine, marketing, and the chemical industry. It can also be applied to many forms of datasets (image, text, audio, tabular, etc.) and to various types of datasets (numeric or non-numeric). In this research, two datasets were used as examples: from the chemical industry, the Gas Sensor Array Drift dataset, and from the medical field, the Parkinson's Disease Classification dataset, both downloaded from the Machine Learning Repository website [36], [37]. Both datasets are tabular, suited to the architecture to be determined, which is MLP; in fact, any dataset form or type can be converted to tabular form or numeric type. The Gas Sensor Array Drift dataset consists of 13,910 samples. Its input attributes are data from 16 chemical sensors, each with 8 attributes, while the output attribute is a concentration level divided into 6 classes. The dataset originally consists of 10 batches; in this study, the ten batches were combined into one. The Parkinson's Disease Classification dataset consists of 754 samples with 754 input attributes, and the output attribute consists of two classes.

B. NORMALIZATION
Normalization is performed to bring the attribute values onto an appropriate scale. Min-Max normalization with values between 0 and 1 is used:

v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

where v' is the new value, v the old value, min_A the smallest value of an attribute, max_A the largest value of an attribute, new_min_A the smallest value of the new attribute (= 0), and new_max_A the largest value of the new attribute (= 1). After normalization, all input and output attributes of the dataset have values between 0 and 1.
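The Min-Max formula can be sketched as a small helper (the sample column values are hypothetical):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-Max normalization: v' = (v - min)/(max - min) * (new_max - new_min) + new_min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

# Hypothetical attribute column.
col = [12.0, 18.0, 15.0, 30.0]
print(min_max(col))  # smallest value maps to 0.0, largest to 1.0
```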

C. PARAMETER SETTING
Besides the number of hidden layers and hidden neurons, several parameters should be considered when determining the structure of a neural network, namely the activation function, learning rate, momentum, etc. However, the number of hidden layers and hidden neurons plays the most significant role in the performance of deep neural networks [8], [9]. Therefore, this research only examines changes in accuracy as the architecture changes, i.e., the number of hidden layers and the number of neurons per hidden layer, while the other parameters are fixed. The machine learning process was carried out using WEKA 3.7 software, with the learning rate, momentum, and activation function set to the WEKA 3.7 default values.

D. DETERMINATION OF NEURAL NETWORKS ARCHITECTURE
As stated in section I, the neural network architecture to be determined in this research is the MLP, the quintessential architecture of neural networks and deep learning [17]. To date, there is no single guideline for determining neural network architecture, even for the most basic architecture such as MLP. Some methods proposed by previous researchers cannot be used as guidance by subsequent researchers because they are not generally applicable: they focus only on the number of input attributes or the number of outputs, without considering the variance of and relationships between attributes. Currently, determining an MLP architecture is therefore still usually performed by trial and error or by rules of thumb, and even then only for certain datasets. The proposed determination of neural network architecture is expected to become a general guideline, especially for MLP. In general, MLP can process many forms of data, such as images, tabular data, and text. However, in certain cases, such as complex image data or sequential text data, MLP does not perform well; derivative architectures such as the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN) were therefore developed. CNN, usually employed for complex image data, is a derivative of MLP that can capture spatial features of images: where MLP treats image input in a one-dimensional format, CNN treats it in a two-dimensional format. RNN, typically used for data with sequential order such as text, is a derivative of MLP in which each hidden-layer neuron has a loop that functions as feedback.
Since MLP is the most basic architecture and other architectures are derivatives of it, the guidelines for determining MLP architecture developed in this research can also serve as a basis for determining CNN, RNN, and several other architectures, with some adjustments and modifications. This research therefore only discusses the determination of MLP architecture. Figure 3 illustrates an MLP with four input attributes and two hidden layers; the first hidden layer has four neurons and the second has three. Input attributes, neurons, and output attributes are connected by lines representing certain weights, and δ represents the bias value for each neuron.
The neural network architecture is determined by using principal component analysis and K-Means clustering. Principal component analysis is used to determine the number of hidden layers: PCA yields principal components, each with a certain variance, and the number of components used is based on their cumulative variance. For example, for a multi-class classification objective, a cumulative variance of around 60% or 70% is used [38]; if three components are required to reach that cumulative variance, then three components serve as the benchmark for the number of hidden layers. Thus, the number of hidden layers is determined by the number of components needed to reach the specified cumulative variance. For each component, K-Means clustering is conducted while varying the number of clusters from 2 to 50, so that the optimal number of clusters can be selected.
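The mapping described above can be sketched end to end; the variance ratios, cluster counts, and the `architecture` helper below are illustrative assumptions, not part of the paper's WEKA workflow:

```python
def architecture(explained, clusters_per_pc, threshold=0.6):
    """Sketch of the paper's mapping: take principal components (sorted by
    decreasing variance) until the cumulative variance reaches `threshold`;
    the count gives the number of hidden layers. The optimal cluster count
    of each kept component gives the neuron count of one hidden layer, with
    the highest-variance component mapped to the layer closest to the output."""
    cum, kept = 0.0, 0
    for v in explained:
        cum += v
        kept += 1
        if cum >= threshold:
            break
    # Reverse so the lowest-variance kept component becomes the first hidden layer
    # and the highest-variance component becomes the layer closest to the output.
    return list(reversed(clusters_per_pc[:kept]))

explained = [0.35, 0.20, 0.15, 0.10, 0.08]  # hypothetical variance ratios
clusters = [19, 15, 12, 9, 7]               # hypothetical optimal cluster counts

# Three components reach the 60% threshold -> three hidden layers,
# with neuron counts ordered from input side to output side.
print(architecture(explained, clusters, threshold=0.6))
```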

E. TRAINING AND TESTING
Training is conducted with 70% of the data in the dataset, while testing uses the remaining 30%. The training and testing process for each architecture is repeated ten times, each time changing the initial weight values.
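The 70/30 split can be sketched as follows; this is a minimal numpy sketch, and the function name and the use of a seeded generator (so each of the ten repetitions can vary its initialization) are our assumptions.

```python
import numpy as np

def split_70_30(X, y, seed):
    """Shuffle then split the dataset: 70% for training, 30% for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(0.7 * len(X))
    tr, te = idx[:cut], idx[cut:]
    return X[tr], y[tr], X[te], y[te]
```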

F. PERFORMANCE CALCULATION
The performance of a neural network architecture is measured by its accuracy, calculated with the following formula:

Accuracy = (number of correct predictions) / (total number of predictions)
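The formula translates directly into code; this small helper is our illustration, not part of the paper.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```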
The performance of the architecture determined using the proposed approach will be compared with the method of Master, which uses the geometric pyramid rule [19]. This rule states that the numbers of neurons in the hidden layers form a pyramid, decreasing from the input side toward the output side. According to Master, there is no theoretical or practical reason to use more than two hidden layers. For a neural network with a single hidden layer, the number of hidden neurons is

Nh = √(n · m) (18)

where n is the number of input features and m is the number of outputs. For a neural network with two hidden layers, the numbers of neurons are calculated as

r = ∛(n / m) (19)
Nh1 = m · r² (20)
Nh2 = m · r (21)

where Nh1 is the number of neurons in the first hidden layer and Nh2 is the number of neurons in the second hidden layer.
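The geometric pyramid rule can be written down directly. The rounding choices below (nearest integer for the single-layer case, rounding r up for the two-layer case) are our inference, chosen so the formulas reproduce the comparison architectures reported later in Section V.

```python
import math

def pyramid_one_layer(n, m):
    # Eq. (18): Nh = sqrt(n * m), rounded to the nearest integer.
    return round(math.sqrt(n * m))

def pyramid_two_layers(n, m):
    # Eqs. (19)-(21): r = cbrt(n / m), Nh1 = m * r^2, Nh2 = m * r.
    # r is rounded up, which matches the reported architectures.
    r = math.ceil((n / m) ** (1 / 3))
    return m * r * r, m * r
```

For the Gas Sensor Array Drift dataset (n = 128, m = 6) this yields 28 neurons for one hidden layer and (54, 18) for two; for the Parkinson's Disease Classification dataset (n = 754, m = 2) it yields 39 and (128, 16).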

V. EXPERIMENT RESULT AND DISCUSSION
In this section, the results of processing both datasets are presented, including the results of PCA, K-Means clustering, and the application of the Elbow criterion, as well as performance comparisons.

A. GAS SENSOR ARRAY DRIFT DATASET
One of the conditions for a dataset to be analyzed by PCA is that all attributes must be numeric. Since all input attributes are already numeric, no conversion process is needed. Before carrying out the PCA analysis, clustering, and application of the Elbow criterion, all input attributes were normalized using equation 16 so that every attribute has a value between 0 and 1. The PCA results are presented in Table 4. In fact, there are as many components as input attributes, namely 128; however, Table 4 depicts only the first six principal components, for classification purposes: the required cumulative variance of around 60% or 70% is fulfilled by the first principal component alone, and the second to sixth components are displayed for comparison and analysis purposes. From each component formed by PCA, clustering is performed until the optimal number of clusters is found. Clustering is conducted with the K-Means method, while a modified Elbow criterion is used to determine the optimal number of clusters: the optimal number is reached when the wss value is relatively unchanged for at least three consecutive cluster counts. A sample of this determination for the first component is presented in Table 3, where the wss values stop changing from number of clusters = 19 onward. The clustering results with the Elbow criterion for each component are shown in column 4 of Table 4: the optimal number of clusters is 19 for the first component, 15 for the second, and so on. Table 4 also shows the number of hidden layers along with their respective numbers of neurons; as described above, these numbers follow from the cumulative variance of PCA, which for classification purposes is initialized at 60% or 70%.
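The modified Elbow criterion described above can be sketched as follows. The tolerance for "relatively unchanged" is our assumption, since the paper does not state a numeric threshold; the wss values would come from K-Means runs with k = 2 to 50.

```python
def elbow_k(wss, tol=0.01, run=3):
    """Smallest k whose wss stays relatively unchanged for `run`
    consecutive cluster counts. wss[i] is the within-cluster sum of
    squares for k = i + 2 clusters (k starts at 2)."""
    flat = 0
    for i in range(1, len(wss)):
        change = abs(wss[i] - wss[i - 1]) / wss[i - 1]
        flat = flat + 1 if change < tol else 0
        if flat >= run - 1:
            # The flat run began run-1 steps back; convert index to k.
            return (i - (run - 1)) + 2
    return len(wss) + 1  # no flat run found: fall back to the largest k tried
```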
The results obtained from PCA and clustering are used to determine the neural network architecture, that is, the number of hidden layers and the number of neurons in each hidden layer. For example, if one component is used and its optimal number of clusters is 19, then the architecture has one hidden layer with 19 neurons. If two components are used, the architecture has two hidden layers, where the first hidden layer has 15 neurons and the second has 19; and so on. Note that the principal component with the largest variance corresponds to the hidden layer closest to the output layer. The PCA analysis of the Gas Sensor Array Drift dataset shows that achieving a cumulative variance of around 60% or 70% requires only one principal component; therefore, the clustering process is shown in detail only for the first component. The results of clustering the first principal component, mapped into the first hidden layer of the neural network, are shown in Table 5. These mapping results serve as the guideline for determining the architecture shown in Figure 4: the Gas Sensor Array Drift dataset has 128 input attributes, processed by a neural network with one hidden layer of 19 neurons, in order to classify the gas into six output classes.
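The mapping rule above, where the component with the largest variance feeds the layer closest to the output, amounts to reversing the list of optimal cluster counts. A one-line sketch (the function name is ours):

```python
def layer_sizes(cluster_counts, n_layers):
    """Map the optimal cluster counts of the first `n_layers` principal
    components to hidden-layer widths, ordered from input to output.
    Component 0 (largest variance) becomes the layer nearest the output."""
    return list(reversed(cluster_counts[:n_layers]))
```

With the Gas Sensor Array Drift cluster counts this reproduces the two-layer example (15 neurons, then 19), and with the Parkinson's counts the three-layer architecture (22, 23, 28).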
For instance, the linear combination of the weight matrix W1 with each input attribute in the range c1..c2 corresponds to the first neuron of the first hidden layer (h11), the one in the range c3..c4 corresponds to the second neuron of the first hidden layer (h12), and so on.

2) PERFORMANCE COMPARISON
To assess the performance of the neural networks determined by PCA and clustering, the architectures are analyzed by their number of hidden layers and neurons per layer. Four architectures are compared, starting from architecture I: 19, that is, one hidden layer with 19 neurons; architecture II: 15, 19; and so on for architectures III and IV, following the optimal cluster counts in Table 4. Table 6 presents the accuracy of each architecture, where each architecture is trained with neural networks for 100 cycles and each experiment is repeated ten times with changing seeds. The column entitled Architecture indicates the number of hidden layers and neurons of each architecture; for example, column I refers to an architecture with one hidden layer of 19 neurons. The values in the table are accuracies (in %), and the bottom row presents the average accuracy (in %) over the 10 repetitions. Performance comparisons were made between architectures I, II, III, and IV.
Similarly, the experiments in Table 6 are also conducted for 200, 500, and 1000 cycles. A summary of the mean values for each number of cycles is presented in Table 7, and the corresponding graph in Figure 5. Figure 5 indicates that architecture I, which uses one hidden layer with 19 neurons, has a higher accuracy than the other architectures, followed by architectures II, III, and IV. The graph also shows that increasing the number of hidden layers does not always increase accuracy; in this case it even decreases it. This is consistent with the initial assumption for classification purposes, in which a cumulative variance of 60% is reached with one principal component of PCA; since only one component is required, the corresponding number of hidden layers is one (in line with Section V.A). Increasing the number of cycles up to a certain point also tends to increase accuracy for architectures I and II, with one exception for architecture I, whose accuracy decreases slightly at 1000 cycles. For architecture IV, more cycles tend to decrease accuracy.
One approach used as a comparison against the proposed approach is Master's approach, which uses the geometric pyramid rule with one or two hidden layers. For one hidden layer, the number of hidden neurons follows equation 18: √(128 · 6) ≈ 27.7, rounded to 28. For two hidden layers, equations 19, 20, and 21 give r = ∛(128/6) ≈ 2.77, rounded up to 3, so Nh1 = 6 · 3² = 54 and Nh2 = 6 · 3 = 18. Master's approach thus yields two architectures: architecture I: 28 and architecture II: 54, 18. Their accuracy is calculated in the same way as for the proposed method. Table 8 presents the average accuracy of each architecture determined by Master's method, where each architecture is trained for 100 cycles and each experiment is repeated ten times with changing seeds. Similarly, the experiments in Table 8 are also conducted for 200, 500, and 1000 cycles; a summary of the mean values for each number of cycles is presented in Table 9, and the corresponding graph in Figure 6. Figure 6 shows that architecture I, with one hidden layer of 28 neurons, tends to have higher accuracy than architecture II, with two hidden layers of 54 and 18 neurons. The graph again shows that increasing the number of hidden layers does not always increase accuracy. Increasing the number of cycles up to a certain limit also tends to increase the accuracy of architecture I, and to decrease it beyond that limit. To compare the performance of Master's architecture and the proposed architecture, the best-performing architecture was selected from each; for both, the architecture with one hidden layer performed best.
The comparison of the best-performing architectures can be seen in Figure 7, which indicates that the proposed approach generally has better accuracy than Master's method; only at 1000 cycles was the accuracy of the proposed method slightly lower. The highest accuracy, 98.11%, is achieved by the proposed approach at 1500 cycles.

B. PARKINSON'S DISEASE CLASSIFICATION
One condition for a dataset to be analyzed by PCA is that all attributes must be numeric. In the Parkinson's Disease Classification dataset, one input attribute is not numeric: the gender attribute, which is categorical. For this reason, it is first converted to numeric form with the One Hot Encoding (OHE) method [39]. Before performing the PCA analysis, clustering, and application of the Elbow criterion, all attributes are normalized using equation 16 so that every attribute has a value between 0 and 1. The objective function applied to this dataset is binary classification, with only two output classes. The PCA results are presented in Table 11. In fact, there are as many components as there are input attributes; however, Table 11 depicts only the first six principal components. Binary classification requires a lower cumulative variance than multi-class classification, around 40%; the second to sixth components are displayed for comparison and analysis purposes.
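The two preprocessing steps can be sketched as below. The names `one_hot` and `min_max` are our illustrative choices, and we take equation 16 to be standard min-max scaling, which is the usual way to bring attributes into [0, 1].

```python
import numpy as np

def one_hot(column):
    """One Hot Encoding of a categorical column (e.g. the gender attribute):
    one binary column per distinct category."""
    categories = sorted(set(column))
    return np.array([[1.0 if v == c else 0.0 for c in categories]
                     for v in column])

def min_max(X):
    """Scale every attribute to [0, 1] (standard min-max normalization)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)
```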
From each component formed by PCA, clustering is performed until the optimal number of clusters is found. Clustering is conducted with the K-Means method, while the modified Elbow criterion is used to determine the optimal number of clusters, which is reached when the wss value is relatively unchanged for at least three consecutive cluster counts. A sample of this determination for the first component is presented in Table 10. The clustering results with the Elbow criterion for each component are shown in Table 11, indicating that the optimal number of clusters is 28 for the first component, 23 for the second, and so on.
From Table 11, the number of hidden layers and the number of neurons in each layer can be determined. As before, the numbers are based on the cumulative variance of PCA, which determines the number of components to be used; for the binary classification objective function, the cumulative variance is initialized at about 40%.
The results obtained from PCA and clustering are used to determine the neural network architecture, that is, the number of hidden layers and the number of neurons in each hidden layer. For example, if one component is used and its optimal number of clusters is 28, then the architecture has one hidden layer with 28 neurons. If two components are used, the architecture has two hidden layers, where the first hidden layer has 23 neurons and the second has 28; and so on. Note that the principal component with the largest variance corresponds to the hidden layer closest to the output layer.
The PCA analysis of the Parkinson's Disease Classification dataset shows that achieving a cumulative variance of around 40% requires three principal components. The mapping from the clustering of the PCA principal components to the neurons in the hidden layers is done in the same way as for the Gas Sensor Array Drift dataset, and is not shown in this paper.

2) PERFORMANCE COMPARISON
To assess the performance of the neural networks determined by PCA and clustering, the architectures are analyzed by their number of hidden layers and neurons per layer. Four architectures are compared: architecture I: 28; architecture II: 23, 28; architecture III: 22, 23, 28; and architecture IV: 24, 22, 23, 28. For example, architecture I uses one hidden layer with 28 neurons; architecture II uses two hidden layers, where the first hidden layer has 23 neurons and the second has 28; and so on. Table 12 presents the accuracy of each architecture determined by PCA and clustering, where each architecture is trained for 100 cycles and each experiment is repeated ten times with changing seeds. The column entitled Architecture indicates the number of hidden layers and neurons of each architecture; for example, column I refers to an architecture with one hidden layer of 28 neurons. The values in the table are accuracies (in %), and the bottom row presents the average accuracy (in %) over the 10 repetitions. Similarly, the experiments in Table 12 are also conducted for 200, 500, and 1000 cycles; a summary of the mean values for each number of cycles is presented in Table 13, and the corresponding graph in Figure 8. Figure 8 indicates that architecture III has higher accuracy and tends to be more stable than the other architectures; it also achieves the highest accuracy among them, namely 85.77%, at 1500 cycles. The graph also shows that increasing the number of hidden layers does not always increase accuracy. This is in line with the initial hypothesis that, for binary classification purposes, the cumulative variance of the PCA principal components needed is around 40%.
The architecture with three hidden layers corresponds to the three principal components, which have a cumulative variance of 38.9%.
As before, Master's approach with the geometric pyramid rule is used as the comparison, with one or two hidden layers. For one hidden layer, equation 18 gives √(754 · 2) ≈ 38.8, rounded to 39. For two hidden layers, equations 19, 20, and 21 give r = ∛(754/2) ≈ 7.2, rounded up to 8, so Nh1 = 2 · 8² = 128 and Nh2 = 2 · 8 = 16. Master's approach thus yields two architectures: architecture I: 39 and architecture II: 128, 16. Their accuracy is calculated in the same way as for the proposed method. Table 14 presents the average accuracy of each architecture determined by Master's method, where each architecture is trained for 100 cycles and each experiment is repeated ten times with changing seeds.
Similarly, the experiments in Table 14 are also conducted for 200, 500, and 1000 cycles. A summary of the mean values for each number of cycles is presented in Table 15, and the corresponding graph in Figure 9. Figure 9 indicates that both architectures tend toward instability; architecture II achieves the highest accuracy, 83.52%, at 1500 cycles. Although less stable, the architecture using two hidden layers with 128 and 16 neurons generally has higher accuracy than the architecture using one hidden layer of 39 neurons. Increasing the number of cycles does not always increase the accuracy of either architecture.
To compare the performance of Master's architecture and the proposed architecture, the best-performing architecture is selected from each. For Master's approach this is the architecture with two hidden layers, while for the proposed approach the architecture with three hidden layers performs best. The comparison is presented in Figure 10, which shows that the performance of the proposed approach is generally better, in terms of accuracy, than Master's approach; only at 100 cycles is the accuracy of the proposed method slightly lower. The highest accuracy, 85.77%, is achieved by the proposed approach at 1500 cycles.

VI. CONCLUSION AND FUTURE WORK
This paper has proposed the determination of neural network architecture by using principal component analysis (PCA) and K-Means clustering. PCA is used to determine the number of hidden layers, while K-Means clustering determines the number of neurons in each hidden layer. The number of hidden layers is based on the cumulative variance of the PCA components: for multi-class classification purposes the required cumulative variance is around 60% to 70%, while for binary classification it is around 40%. The architectures obtained with the proposed approach were used to classify data with large numbers of attributes, namely the Gas Sensor Array Drift and Parkinson's Disease Classification datasets.
A performance comparison is conducted between the proposed architecture and Master's architecture. In general, the proposed architecture performs better, both for the Gas Sensor Array Drift dataset and for the Parkinson's Disease Classification dataset. For the Gas Sensor Array Drift dataset, the best performance of the proposed architecture is achieved with one hidden layer, in line with a PCA cumulative variance of 69.7%, while for the Parkinson's Disease Classification dataset it is achieved with three hidden layers, in line with a cumulative variance of 38.9%. The comparison makes evident that adding hidden layers does not guarantee increased accuracy when classifying data with large numbers of attributes; an appropriate number of hidden layers is needed to obtain a high level of accuracy.
In future research, before conducting the PCA analysis and clustering, it is encouraged to analyze the variation of the input and output attributes and the relationship or correlation between each input attribute and the output attribute. By considering attribute variation and correlation, it is expected that neural network architectures with better performance can be designed. A comparison of the computation times of each architecture should also be made in future research. The choice of datasets is also expected to be more varied in form and type, for example image, text, or other non-numeric datasets, drawn from various fields such as medicine, economics, and finance. The determination of neural network architectures using the proposed approach is expected to serve as a preliminary guide for determining other deep neural network or deep learning architectures that are derivatives of the MLP, such as CNN and RNN; if that can be achieved, it is hoped that specialized forms of those architectures will follow, for example U-Net (a special form of CNN) or attention networks (a special form of RNN). The determination of the number of hidden layers and neurons, as well as of other parameters in the structural design of neural networks, such as the learning rate and momentum, would also be interesting to study in future research.

JUDHI SANTOSO (Member, IEEE) received the B.Sc. degree in mathematics from Institut Teknologi Bandung (ITB), Indonesia, in 1987, the master's degree in computer science from Universitas Indonesia (UI), in 1991, and the Ph.D. degree in informatics from the School of Electrical Engineering and Informatics, ITB, in 2006. He is currently an Associate Professor of Computer Science with Institut Teknologi Bandung (ITB), Indonesia.
His primary research interests include modeling and simulation, soft computing, and numerical analysis.
KRIDANTO SURENDRO received the B.Eng. and master's degrees in industrial engineering from Institut Teknologi Bandung (ITB), in 1987 and 1991, respectively, and the Ph.D. degree in computer science from Keio University, Japan, in 1999. He is currently an Associate Professor of Computer Science with Institut Teknologi Bandung (ITB), Indonesia. His research interests include data science, soft computing, and IT governance. VOLUME 8, 2020