Self-Organizing Maps Neural Networks Applied to the Classification of Ethanol Samples According to the Region of Commercialization

Physical-chemical analysis data were collected from 998 samples of automotive ethanol commercialized in the northern, midwestern and eastern regions of the state of Paraná. The data were presented to self-organizing map (SOM) neural networks, which classified the samples according to those regions. The best SOM configuration had a 45 x 45 topology and 5000 training epochs, with a final learning rate of 6.7 × 10⁻⁴, a final neighborhood relationship of 3 × 10⁻² and a mean quantization error of 2 × 10⁻². This neural network produced a topological map showing three separate groups, each corresponding to samples from the same region of commercialization. Four maps of weights, one for each parameter, were presented. The network indicated that pH was the most important variable for classification and electrical conductivity the least important. The application of self-organizing maps allowed the segmentation of the ethanol samples, thereby identifying them according to the region of commercialization.


INTRODUCTION
Fuels derived from crude oil hydrocarbons, whose use is still prevalent, lack one advantage of ethanol: it can be obtained from resources that are more evenly distributed worldwide [1]. Ethanol has increasingly attracted the attention of researchers, companies and governments, owing to the prospect of depletion of non-renewable fossil fuel sources, competition on fuel prices, and environmental concerns related to the emission of harmful substances [2]. This biofuel is, to date, the only one able to meet the growing global demand for renewable energy with low cost and low polluting potential. Emissions from burning ethanol are smaller than those from burning gasoline, and part of the released CO₂ is reabsorbed by the sugarcane itself [3,4].
Ethanol remains the most widely used biofuel in Brazil [5]. The country has a large territory and climatic diversity, and is currently positioned on the international scene as the largest producer and exporter of sugarcane and the largest producer and consumer of ethanol. It is also the only country to use ethanol on a large scale as an alternative fuel to petroleum derivatives. In the 1970s, Brazil launched the National Alcohol Program (PROALCOOL) and developed technology to use ethanol extensively in vehicles previously powered by gasoline. It is currently one of the most technologically advanced countries in the production and use of ethanol as fuel. The production process used in Brazil is almost exclusively the fermentation of the must, which consists of sugarcane juice and molasses [4,6].
Other raw materials are used around the world to obtain ethanol. One alternative source of glucose is sugar beet [7]. It is also possible to extract starch from vegetables, such as corn [8], and transform it into glucose for subsequent fermentation. In this case, enzymes liquefy and saccharify the starch to produce glucose, which is fermented by Saccharomyces cerevisiae to produce ethanol [9]. Second-generation ethanol is emerging from technology that converts cellulose into glucose, which is then fermented to ethanol [10].
Chemical characterization and quality control are also issues related to the use of ethanol. Brazil, as the pioneer in its use, has relevant standards for the regulation of marketing and quality control. In the country, the National Agency of Petroleum, Natural Gas and Fuel (ANP) exercises this supervision, establishing quality control criteria for ethanol and other fuels through specific resolutions. Resolution No. 19/2015 defines fuel ethanol as a biofuel derived from the fermentation of renewable biomass, intended for use in internal combustion engines, whose main component is ethanol, and which is specified in the forms of anhydrous ethanol fuel and hydrous ethanol fuel. Hydrous ethanol fuel is defined as ethanol fuel intended for direct use in internal combustion engines, unlike anhydrous ethanol, which is used in blends with gasoline. Resolution No. 19/2015 also specifies which laboratory test methods and chemical parameters are to be considered for the quality control of ethanol. The standard methods indicated may come from the Brazilian Technical Standards Association (ABNT), identified with the NBR prefix, as well as from the American Society for Testing and Materials (ASTM), the European Committee for Standardization (ECS) and the International Organization for Standardization (ISO). The methods are generally the same; they are only translated and formatted differently according to the regulatory authority [11]. However, several of these analytical methods were developed for other matrices, such as oil and water, and not specifically from tests with ethanol.
Other methods of physical and chemical analysis for ethanol, although not covered by national and international bodies and legislation, have been presented and discussed in the literature, because knowledge of the different chemical species that may be present is relevant. Substances that have toxic effects on the environment and on life, such as heavy metals, could also be measured and controlled. Studies using spectrometric, chromatographic and electrochemical techniques have detected various organic and inorganic compounds, notably lead, aluminum, cadmium, esters and aromatic aldehydes. The high reliability and low detection limits of these instrumental methods have enabled changes and additions to the official standards for the quality control of fuel ethanol [12].
When evaluating the results of chemical quality control parameters, it is possible to draw inferences about the previous storage conditions of the samples, contamination by water and acids, and the presence of metal ions, particularly iron and copper [13,14]. Brazil has more than 400 producers of sugarcane ethanol [15], and the characteristics of the ethanol may differ according to its producers or distributors. The chemometric classification of samples is quite complex in most cases, especially if the patterns are described by a large number of independent variables; when this occurs, automated systems are necessary. One such system is the artificial neural network, which tries to model, albeit primitively, the logical operations by which the brain performs various tasks [16,17].
Over thousands of years, owing to the evolution and adaptation of the human central nervous system, pattern recognition and classification have become naturally performed tasks [18,19]. Artificial neural networks have emerged as an alternative means of performing these tasks by computational methods. In silicon circuits, electrical impulses occur on the order of nanoseconds (10⁻⁹ s), while in the human brain they occur on the order of milliseconds (10⁻³ s). The human brain compensates for its lower rate of operation with a larger number of neurons linked by dense connections, whereas artificial systems are typically configured with fewer neurons [19].
Artificial neural networks resemble the brain in that they acquire knowledge through a learning process and store it as synaptic weights, namely, the connection strengths between neurons [19]. Each artificial neuron can receive multiple input stimuli and generate output stimuli, and this flow spreads through the network connections as in the biological model. The intensity of the propagation through a particular connection is set by its weight. The weighted input signals are summed and the result is modified by a transfer function [20].
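As a minimal illustration of the weighted sum and transfer function described above, the following Python sketch computes the output of a single artificial neuron. The logistic transfer function, the bias term and all numeric values are assumptions chosen for illustration; they are not taken from this study.

```python
import math


def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by a logistic transfer function."""
    # Each input stimulus is scaled by the weight of its connection.
    net = sum(x * w for x, w in zip(inputs, weights)) + bias
    # The logistic (sigmoid) function is one common transfer function;
    # it is an assumption here, not the function used in the study.
    return 1.0 / (1.0 + math.exp(-net))


# Hypothetical stimuli and connection weights, for illustration only.
print(neuron_output([0.5, -1.0, 2.0], [0.8, 0.2, -0.4]))
```

A larger weight makes the corresponding input contribute more strongly to the neuron's response, which is what learning adjusts.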
There are several types of neural networks, including multilayer perceptrons, radial basis function networks and self-organizing maps (SOM), among others [19,21].
In chemistry and agronomy, artificial neural networks have been used in cheminformatics research [22], in the discrimination of coffee samples according to the region of growth [23,24], in the study of the profile of adulterated gasoline and of gasoline marketed in different regions [25,26], in the segmentation of soybean samples by determining the content of inorganic compounds [27] and in the classification of ethanol samples by distillery of origin [28].
Self-organizing maps (SOM), specifically, have been applied in the humanities and applied social sciences to measure welfare in a society [29]. Another study established similarities among countries by grouping them according to social indicators [30]. In economics, SOM can relate finances and society [31]. In behavioral studies, neural networks can learn and predict human actions in situations such as driving a car [32]. In medicine, they have been used in the study of intestinal absorption of insulin [33] and in the representation of brain activity patterns [35]. In oceanography, they can assess phytoplankton biomass growth according to the characteristics of the sea [36]. In geology, they have been used to verify clay structures in rocks [37]. In astronomy, neural networks enable the mining of large data sets [38].
Self-organizing maps (SOM), or Kohonen neural networks, use only the input variables to find similarities among samples. There is no backpropagation of output signals to detect errors and adjust weights; that is, they are unsupervised networks, unlike perceptron networks. Their learning is competitive, since the neurons compete with one another for the output. They are able to build a map from a set of data in an input space, represented by a finite set of neurons arranged in a one-dimensional or two-dimensional array, which makes them suitable for the task of feature selection [20]. They work by compressing the variables onto a two-dimensional plane, resulting in a topological map on which clusters of neurons that refer to the samples can be observed [29]. Clusters are identified visually: samples close to each other indicate a neighborhood relationship and similar characteristics [20]. Self-organizing maps also allow the generation of a map of weights for each input variable, which gives a detailed understanding of how each factor studied influences the observed segmentation [39].
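The competitive, unsupervised learning described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MATLAB implementation used in this work: the grid is rectangular rather than hexagonal, the learning rate and neighborhood radius decay linearly, the neighborhood function is Gaussian, and the function names, toy data and map size are hypothetical.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def best_matching_unit(weights, x):
    """Grid position of the winner neuron, i.e. the one closest to x."""
    return min(weights, key=lambda pos: squared_distance(weights[pos], x))


def train_som(data, rows, cols, epochs, lr0=0.1, radius0=None, seed=0):
    """Train a small SOM; returns a dict {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    if radius0 is None:
        radius0 = max(rows, cols) / 2.0  # assumed initial neighborhood
    # One weight vector per neuron, initialised at random in [0, 1).
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)  # shrinking neighborhood
        for x in rng.sample(data, len(data)):      # shuffled presentation
            win_r, win_c = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                # Neurons near the winner on the grid are updated the most.
                d2 = (r - win_r) ** 2 + (c - win_c) ** 2
                h = math.exp(-d2 / (2.0 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


# Toy, already-normalised samples (hypothetical values, not study data).
toy = [[0.1, 0.2, 0.1, 0.3], [0.2, 0.1, 0.2, 0.2],
       [0.9, 0.8, 0.9, 0.7], [0.8, 0.9, 0.8, 0.9]]
som = train_som(toy, rows=4, cols=4, epochs=50)
for sample in toy:
    print(sample, "->", best_matching_unit(som, sample))
```

No class labels enter the training loop; the region labels are only used afterwards, when the winner neurons of the samples are inspected on the map.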
Predicting the region of commercialization of ethanol samples is a suitable problem for experimentation with chemometric techniques [28,40]. Artificial neural networks can therefore examine an ethanol database and indicate whether there is a pattern inherent to those samples.

Samples
The database was composed of 998 samples of hydrous ethanol fuel: 326 samples commercialized in the northern region of Paraná state, analyzed at the Laboratory of Fuel Research and Analysis of the Universidade Estadual de Londrina; 420 from the eastern region, analyzed at the laboratory Chronion Análises Químicas Ltda.; and 252 samples marketed in the midwestern region of the state, analyzed at the fuels laboratory of the Universidade Estadual do Centro Oeste. The samples had been tested for pH, alcohol content, electrical conductivity and density at 20 °C. The analysis results were presented to the self-organizing map neural networks.

Alcohol content and specific mass
To determine the alcohol content and the specific mass, the standard followed was the NBR 5992 [41], equivalent to the ASTM D4052-11 [42].

pH
The pH was determined according to the method NBR 10891, by measuring the potential difference between electrodes [43].

Electrical conductivity
The electrical conductivity was determined according to the NBR 10547 standard, equivalent to ASTM D1125-11 [44].

Artificial neural networks
The neural network module of MATLAB R2007 was used, and the parameters were entered in the following order: specific mass at 20 °C, alcohol content, pH and electrical conductivity.

Processing
All experimental results were processed on a computer with an Intel Core i7-4790 processor at 3.60 GHz and 32 GB of RAM.

RESULTS AND DISCUSSION
In order to evaluate the profile of fuel ethanol commercialized in the northern, eastern and midwestern regions of Paraná state, physicochemical analysis data of ethanol samples were collected. Figure 1 shows the data of the 998 samples in chronological order by region, for specific mass, pH, alcohol content and electrical conductivity. For pH, the results for the midwestern region are more dispersed than those of the eastern and northern regions. The horizontal lines mark the upper and lower compliance limits for the parameters used [45]. The compliance values for the specific mass are between 807.6 and 811.0 kg m⁻³, for the pH between 6.0 and 8.0, for the alcohol content between 92.5 and 93.8 % v/v and for conductivity a maximum of 300 μS m⁻¹. In the northern region, one sample is above the upper limit for the specific mass, two are above the upper limit for the conductivity and two are under the lower limit for the pH. For the alcohol content, one sample from the midwest showed a result above the upper limit. For conductivity, two samples from each region exceeded the upper limit. Table 1 gives the minimum and maximum values of each parameter for the 998 samples, as well as the averages and standard deviations, according to the region of commercialization. The standard deviations are highest for conductivity, whose results tend to be more dispersed around the average value.
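The compliance check behind the horizontal lines of Figure 1 can be expressed as a short Python sketch. The limits are those quoted in the paragraph above; the dictionary keys, the function name and the example sample are hypothetical and serve only as an illustration.

```python
# Compliance limits quoted in the text as (lower, upper); None = no bound.
LIMITS = {
    "specific_mass_kg_m3": (807.6, 811.0),
    "pH": (6.0, 8.0),
    "alcohol_content_pct": (92.5, 93.8),
    "conductivity_uS_m": (None, 300.0),
}


def out_of_spec(sample):
    """Return (parameter, reason) pairs for values outside the limits."""
    failures = []
    for name, (low, high) in LIMITS.items():
        value = sample[name]
        if low is not None and value < low:
            failures.append((name, "below lower limit"))
        elif high is not None and value > high:
            failures.append((name, "above upper limit"))
    return failures


# Hypothetical sample, not taken from the study's database.
sample = {"specific_mass_kg_m3": 809.1, "pH": 5.8,
          "alcohol_content_pct": 93.0, "conductivity_uS_m": 120.0}
print(out_of_spec(sample))
```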
The data of all samples were presented to the self-organizing maps (SOM) module available in the neural networks toolbox of MATLAB R2007 in order to segment the ethanol samples according to their region. The SOM network transforms an incoming signal pattern of arbitrary dimension into a discrete two-dimensional map, performing this transformation in a topologically ordered way [20].
A network with a 35 x 35 hexagonal topology was trained for 7000 epochs, with an initial learning rate of 0.1 and an initial neighborhood relation of 17. The average quantization error stabilized at 0.05 after 5000 training epochs. The learning rate decreased to 9.1 × 10⁻⁵ and the final value of the neighborhood relation was 3 × 10⁻³. With this topology there were 50 cases in which samples from different regions occupied the same neuron, by superposition; that is, the mesh of neurons was not large enough to separate the groups.
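For reference, the average quantization error is commonly defined as the mean distance between each sample and the weight vector of its winner neuron. The Python sketch below follows that common definition, which is assumed here because the text does not state the formula used by the toolbox; the function name and values are hypothetical.

```python
import math


def mean_quantization_error(data, weight_vectors):
    """Mean Euclidean distance from each sample to its closest neuron."""
    total = 0.0
    for x in data:
        # Distance to the winner neuron is the smallest of all distances.
        total += min(
            math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w)))
            for w in weight_vectors
        )
    return total / len(data)


# Hypothetical two-dimensional example with two neurons.
print(mean_quantization_error([[0.0, 0.0], [1.0, 1.0]],
                              [[0.0, 0.0], [1.0, 0.0]]))
```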
According to Boishebert et al. [46], the topology should be chosen so that it does not contain too few neurons in relation to the number of samples, so that the samples do not overlap. However, it cannot have so many neurons that the samples become overly dispersed, forming more groups than desired. Moreover, a larger topology demands more computational processing.
Therefore, a larger neural network with a 45 x 45 hexagonal topology was trained for 5000 epochs, with an initial learning rate of 0.1 and an initial neighborhood relation of 22. The error stabilized after 4000 training epochs, as shown in Figure 2; the final quantization error was 0.02, the learning rate decreased to 6.7 × 10⁻⁴ and the final value of the neighborhood relation was 0.03. Figure 3 shows the sample distribution for the 45 x 45 topology, where it can be observed that the samples of the eastern region (L) are located at the top, the northern region (N) samples in the middle and the midwestern region samples (C) at the bottom. Three groups can therefore be visually identified, with only 26 cases of samples from different regions overlapping on the same neuron, indicating that this network had a lower classification error than the network with the 35 x 35 topology.
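The overlap count used to compare the two topologies can be computed from the winner neuron and region label of each sample. The Python sketch below counts the neurons that hold samples from more than one region; this is one possible reading of the overlap cases reported above, and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def count_mixed_neurons(winners, labels):
    """Number of neurons occupied by samples from more than one region."""
    regions_by_neuron = defaultdict(set)
    for neuron, label in zip(winners, labels):
        regions_by_neuron[neuron].add(label)
    return sum(1 for regions in regions_by_neuron.values()
               if len(regions) > 1)


# Toy winners (row, column) and region labels, not study data.
winners = [(0, 0), (0, 0), (1, 2), (1, 2), (3, 3)]
labels = ["L", "N", "C", "C", "N"]
print(count_mixed_neurons(winners, labels))  # neuron (0, 0) mixes L and N
```

A larger map offers more neurons for the same number of samples, so fewer samples from different regions are forced onto the same neuron.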
The topological map obtained showed the clustering of neurons referring to samples from the same regions. Its configuration originated from the interaction among the four chemical parameters analyzed in the ethanol samples. Weight ranges for each variable can be superimposed on the topological map. Thus, a map of weights can be generated for each parameter, in which ranges of values are overlaid on the neurons of the sample distribution map. Each sample falls within a range of weights, and samples belonging to the same group have similar weights, which indicates similarity among the samples for that variable. The weight ranges are designated by different shades. If ethanol samples from the same region fall in the same or nearby weight ranges, that parameter was significant in setting those samples apart from the others.
The map of weights for the specific mass (Figure 4) indicates that the samples of the midwest (C) predominate in the red and orange area, with higher weights, which distinguishes them from the other two regions. However, it was not possible to distinguish the eastern (L) and northern (N) regions, since they show results in similar weight ranges, namely the blue and green areas of the map. Therefore, the specific mass was an important parameter only for differentiating the samples of the midwest from the others.
Figure 5 shows the map of weights for the alcohol content. This variable was also important for separating the samples from the midwest (C): although some samples from the eastern (L) group are placed at the lower weights, most of group C is located there, that is, in the blue areas. However, the samples of the northern (N) and eastern (L) regions are distributed heterogeneously on the map, in all weight ranges.
The electrical conductivity is represented in the weight map shown in Figure 6, which grouped samples from the midwest (C) and north (N) in an area with higher weights, indicating similarity between these regions. In the blue areas of the map, corresponding to lower weights, samples from the east (L) are concentrated.
Figure 7 presents the map of weights for the pH. Group L is situated in the red area of the map, corresponding to higher weights. Group C is situated in the darker blue area of the map, with lower weights. Group N is located in a region of intermediate weights. In the areas around the neutral pH of 7.0, samples from the northern (N) region predominate. Besides indicating the formation of three groups, this parameter was also important for explaining the distance between groups L and C, as they are found in very different weight ranges. Thus, the pH proved to be important for the classification of the samples according to their region.
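A map of weights is obtained by taking, for every neuron, the single component of its weight vector that corresponds to one input variable. The Python sketch below illustrates this on a toy 2 x 2 map, together with a simple per-region summary; the weight values, the function names and the summary itself are hypothetical, and the variable order follows the order of entry stated in the methods (specific mass, alcohol content, pH, electrical conductivity).

```python
def component_plane(weights, rows, cols, index):
    """Grid of the weight component `index` for every neuron."""
    return [[weights[(r, c)][index] for c in range(cols)]
            for r in range(rows)]


def mean_weight_by_region(weights, winners, labels, index):
    """Average weight of one variable over the winner neurons per region."""
    sums, counts = {}, {}
    for neuron, label in zip(winners, labels):
        sums[label] = sums.get(label, 0.0) + weights[neuron][index]
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


# Toy 2 x 2 map with hypothetical weights (not values from the study).
# Component order: specific mass, alcohol content, pH, conductivity.
toy_weights = {
    (0, 0): [810.5, 92.6, 6.2, 150.0],
    (0, 1): [809.0, 93.0, 6.9, 140.0],
    (1, 0): [808.8, 93.1, 7.1, 90.0],
    (1, 1): [808.5, 93.3, 7.8, 80.0],
}
PH = 2  # position of pH in each weight vector
print(component_plane(toy_weights, 2, 2, PH))
print(mean_weight_by_region(toy_weights,
                            [(0, 0), (0, 1), (1, 1)], ["C", "N", "L"], PH))
```

Regions whose samples sit in clearly different weight ranges for a variable are the ones that variable helps to separate.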

CONCLUSION
It was possible to separate samples of hydrous ethanol marketed in the northern, midwestern and eastern regions of Paraná by means of neural networks. The most relevant factor for separation was the pH, followed by the specific mass and then the alcohol content. The choice of the 45 x 45 topology enabled the best separation of the samples, and 4000 training epochs proved to be sufficient.
Self-organizing maps emerge as a tool capable of segmenting ethanol samples and of verifying which chemical parameters are the most relevant for this comparison and which are less important. The success in segmenting samples with this technique opens the possibility of using it to separate ethanol samples according to factors other than their region of commercialization, such as those relating to their production process or their feedstock.

Figure 1. Data for (a) specific mass, (b) pH, (c) alcohol content and (d) conductivity, for the ethanol samples.

Figure 2. Quantization error in relation to the number of epochs.

Figure 3. Training graph for the 45 x 45 topology showing the distributions of samples according to the winner neuron.

Figure 4. Map of weights for the specific mass.

Figure 5. Map of weights for the alcohol content.

Figure 6. Map of weights for the electrical conductivity.

Figure 7. Map of weights for pH.

Table 1. Minimum values, maximum values, average and standard deviation (SD) of the samples by region of commercialization.