Efficient CNN Architecture on FPGA Using High Level Module for Healthcare Devices

Modern wearable healthcare devices require resource-efficient technologies that combine high performance, low energy consumption and diagnostic accuracy. In the field of artificial intelligence, the convolutional neural network (CNN) has proven to be an effective algorithm, and field-programmable gate arrays (FPGAs) have been extensively utilised to construct hardware accelerators for CNNs. This paper proposes an accelerator for a dedicated 1-D CNN that classifies the electrogram (ExG). The ExGs used here include electrocardiogram, electroencephalogram and electromyography. The pipelined structure is designed with a register in the middle to facilitate easy data transfer. Implemented on the Xilinx Zynq xc7z045 platform, the 1-D CNN accelerator for categorising ExG signals outperforms FPGA peer applications on the same platform by 1.14× in terms of speed. In addition, the proposed accelerator operates very efficiently owing to the use of a tristate buffer in the multiplexer and the substitution of a shift for the multiplier, resulting in a resource-efficient accelerator with 161 GOP/s/W energy efficiency and 28 GOP/s/KLUT, an improvement of 1.67× over the previous model. Finally, the performance of the accelerator on a Xilinx Zynq xc7z045 FPGA operating at 442.948 MHz was calculated as 1.145 TFLOP/s.


I. INTRODUCTION
Nowadays, using wearable Internet of Things (IoT) sensors to observe people's behaviour and assess their health is an important topic. Wearable sensors for tracking are used in the medical field, and IoT assists in data collection via decision-making tools [1], [2]. Illnesses are often diagnosed using a cloud computing platform. However, the huge amount of data kept and shared by many health research institutions around the world makes it hard for humans to find important information in medical data. As a result, the existing medical system still requires a remarkable amount of time and effort from the overwhelming majority of people to provide a good medical diagnosis. The development of a wearable healthcare device capable of high-precision medical diagnostics is therefore urgently required to address this issue [3]-[6].
Artificial intelligence (AI) models, surgical gadgets and mixed-reality applications may be used to diagnose and treat illnesses more effectively than ever before. Consequently, the clinical decision support system achieves particular goals such as the detection of electroencephalograms (EEG) and electromyography (EMG). AI diagnosis is also more accurate than manual diagnosis. Moreover, machine learning (ML)-based models outperform human pathologists and imaging specialists in terms of precision. With smart diagnosis, a patient's current health status and ailment severity may be precisely identified, so that a tailored treatment plan can be established [7], [8].
Electrocardiogram (ECG) is used to visualise the heart's electrical activity. ECG contains a wealth of information due to the simplicity with which it may be monitored. It may offer vital information about the progression of a myocardial infarction, the presence of various cardiac arrhythmias and the effect of hypertension [9]. Moreover, in standard EEG, electrical activity in the cortex is captured through scalp electrodes and shown as a waveform. Implanting electrodes directly on the exposed surface of the brain is necessary to capture electrical activity from the cerebral cortex. This process includes expressions of subcortical areas in cortical regions. In addition, EEG may be used to monitor the brain's functional integrity because it reflects the functional condition of the brain and as a result, aids doctors in identifying a wide range of neurological diseases. It is also critical to establish a link between scalp potentials and the underlying neurophysiology. It may detect pathological conditions such as ordinary headaches and dizziness, epilepsy, brain tumours and multiple sclerosis as well as sleep problems and movement irregularities [10]- [12].
EMG is a method for recording and analysing signals produced by the electrical activity of skeletal muscles that has been around for some time. These signals are also called myoelectric signals. EMG is utilised as an assessment tool in a variety of domains such as applied research, physiotherapy, sports training and other similar fields of study. Surface EMG provides information on the general function and conduction of muscles. To obtain myoelectric signals, electrodes are placed on the surface of the skin. When EMG is conducted on particular muscles, such as shoulder and upper trunk, or when it is recorded in monopolar mode, it is often polluted by ECG. This cardiac artefact in imaging cannot be prevented. Consequently, to extract suitable, qualitative information from an EMG signal, removing cardiac artefacts from the signal and deriving a pure EMG signal from it are necessary [11], [13], [14].
Many algorithms were previously built on morphological features and traditional signal processing techniques [15]-[24]. Fixed features used in such algorithms cannot properly discriminate between various forms of ExG because the ExG waveform and its morphological qualities fluctuate greatly depending on the situation and the patient. Deep learning (DL) methods were recently developed to extract features automatically and improve ExG classification accuracy. DL approaches have been shown to be very adaptable and precise in the classification of ExG amongst other applications [25]. For instance, Cimtay et al. [26] demonstrated emotion recognition using facial expressions and EEG, and achieved 91.5% maximum accuracy and 53.8% mean accuracy. Oh et al. [27] proposed a hybrid model that consists of a convolutional neural network (CNN) and long short-term memory (LSTM) to increase the accuracy of arrhythmia detection. This model can detect common arrhythmias. With variable-length data, it achieved good classification accuracy (98.10%), sensitivity (97.5%) and specificity (97.5%). Alam et al. [28] proposed a new concept design with regard to the detection model portion. Human psychophysiological data were acquired using EMG, electro-dermal activity and ECG sensors, and processed using a CNN to detect the hidden emotional state.
Preprocessing, feature extraction and classification methods are often used in conjunction to achieve ExG classification. Although LSTM is excellent at processing time domain data such as EMG, it is only capable of basic feature extraction in the context of EMG classification. When using LSTM, sophisticated preprocessing algorithms and classification are required to achieve high classification accuracy, which makes it challenging to reduce resource usage in a hardware implementation [29]-[31]. In contrast to LSTM, a CNN does not need a sophisticated preprocessing technique because of its remarkable self-learning capabilities and flexibility. Thus, building a CNN-based classification model is more efficient in terms of hardware use. In this paper, a 1-D CNN structure for the detection and classification of ExG signals (ECG beats, EMG and EEG signals) is presented. The 1-D CNN is implemented using an efficient hardware design. The developed 1-D CNN model and hardware architecture may also be used for other time series applications, such as blood pressure and diabetes monitoring. The following are the most substantial contributions of this work:
1) This paper is the first to design a hardware architecture using three biomedical signals from ExG on a field-programmable gate array (FPGA) platform for facilitating CNN acceleration. Additionally, the proposed architecture can compute convolution for any size of input and modify the stride value.
2) This work discusses a 1-D CNN structure tailored to an embedded application. By using Python, it produces an ensemble of output from 1-D CNN layers for each ExG with 99% accuracy, and the 1-D CNN recognises the ExG signal.
3) This design maximises the use of hardware resources whilst minimising the accuracy loss of ExG categorisation.
4) This work designs a pipelined processing unit array to achieve great performance and efficiency. It also includes a sign bit in each processor unit, which not only minimises power consumption but also lowers the cost of hardware resources. As a result, the proposed design achieved a high performance of 1.145 TFLOP/s at 442.948 MHz and 1.068 KLUT resource utilisation.
5) The design accurately identifies the ExG signal using a Xilinx FPGA and attains a higher speed than classification of just one type.
The remainder of this paper is arranged as follows: Section II discusses the features of ECG, EEG and EMG signals as well as the history of CNN algorithm and its applications. Section III presents the proposed accelerator of 1-D CNN and explains in detail how to design the new architecture. Section IV focuses on the proposed structure of the 1-D CNN and other algorithms. Section V discusses the results and compares them with those of other models. Section VI summarises the conclusion.

II. RELATED WORKS
CNNs have seen much success in deep learning in recent years [7]. Using CNNs as the basis for the vector convolution calculation approach, new concepts for learning with high classification accuracy have been discovered. This section includes several relevant research works, such as EEG, EMG and ECG signals as well as classification learning research methods based on CNN and other algorithms, which can differentiate between CNN and other algorithms in terms of architecture, accuracy, speed of diagnosis and classification in healthcare monitoring.

A. EEG SIGNAL
Social integration can be substantially improved by using EEG-based emotion ratings for patients with early-stage Alzheimer's disease and neurological disorders. Moreover, emotions have traditionally been classified via the use of software running on computers and working without an internet connection. However, these classifiers can be worn, which is important to enhance the social life of patients. This level of wearability must be achieved by the deployment of low-power hardware resources that enable near real-time classification and long durations of operation. Gonzalez et al. [31] proposed a hardware CNN called BioCNN that is employed to optimise EEG-based emotion detection and other biomedical applications. They used Digilent Atlys Board in conjunction with a low-cost Spartan-6 FPGA to accomplish the training technique. Their results [31] revealed 100 MHz speed, 26.229 KLUT resource consumption and 0.0629% resource consumption efficiency.
Two key study issues in the field of EEG categorisation are the detection of epileptic seizures and the identification of emotional states. Through the development of a realtime, energy-efficient processor to conduct on-device seizure detection, informing others in the immediate vicinity or immediately preventing the seizure by providing instant stimulation will be feasible. Accurate seizure detection requires precision to be safe [25]. Sahani et al. [32] illustrated that scalp multichannel signalling and electroencephalography are both effective methods to detect epileptic seizures in real time. Additionally, they developed a novel architecture to extract additional features with great accuracy and speed, and implemented it using FPGA platform Virtex-5. The architecture [32] showed that it can detect and identify epileptic episodes in a steady, reliable manner. Speed was 86.73 MHz, and resource usage was 11.963 KLUT.

B. EMG SIGNAL
In recent years, EMG processors have received a great deal of interest because they are often employed in gesture recognition applications. To maximise classification accuracy whilst minimising power consumption, wearable devices are generally used for gesture recognition. In [33], a low-power embedded EMG acquisition and gesture recognition system was proposed. Software and hardware multilevel design optimisation was emphasised. In addition to EMG sensors, inertial and pressure sensors have been utilised to increase gesture identification and motion tracking accuracy.
The development of specialised CNN accelerators has opened up new possibilities for edge healthcare and biomedical applications [34]. Franco et al. [35] proposed a set of readings to analyse surface electromyography (sEMG) to study how the nervous system modulates muscle activity. The researchers built an FPGA-based real-time non-negative matrix factorisation (NMF) processor that extracts muscle synergies from 8-electrode EMG recordings and feeds them to a support vector machine (SVM) classifier. Their results [35] revealed 87.74 MHz speed, 38.836 KLUT resource consumption and 7.2 W power consumption. Mostafa et al. [36] proposed an architecture for estimating the desired clench strength of the hand from the EMG signal and implemented it on Xilinx's XC7Z020 platform. They also showed that the proposed architecture can be used for any application related to prosthetics. According to [36], speed was 388.20 MHz, resource usage was 4.379 KLUT and power consumption was 0.344 W.

C. ECG SIGNAL
In medicine, an ECG device records the electrical activity of the heart to diagnose various forms of cardiac disease. Electrodes are placed on the patient's body to acquire the ECG [37], [38]. Wearable sensors developed for healthcare monitoring can read and analyse different parts of the ECG [39]. Real-time monitoring systems also use many different methods to identify ECGs [40]. In recent years, IoT has been used to monitor patients and their health remotely. By training on a data set such as cardiology data, AI algorithms are also used to classify and identify diseases with high speed and accuracy. Moreover, appropriate processors are designed for these data to expedite disease diagnosis [41]-[43]. Jiahao et al. [44] suggested a hardware design for integrated ECG classification using a 1-D CNN with global average pooling. They built the efficient hardware design on a Xilinx Zynq FPGA and delivered an average rate of 25.7 GOP/s at 200 MHz with 1538 LUT resource usage, improving resource efficiency to more than three times that of the non-optimised scenario. The completely pipelined processing unit array is meant to boost calculation speed. The accuracy of ECG beat classification was 99.10%.
The design in a study by Guo et al. [45] used the specifications of Angel-Eye, a programmable, adaptable CNN accelerator architecture that includes a data quantisation technique and a compilation tool. The data quantisation approach helps lower the bit width from 16 bits to 8 bits whilst maintaining minimal accuracy loss. The compilation tool efficiently adapts a given CNN model to the available hardware. According to testing on the Zynq XC7Z045 platform, compared with its equivalent FPGA running on the same platform, Angel-Eye operated at a faster hardware speed of 150 MHz with resource utilisation of 183 KLUT and power usage of 9.63 W. In [45], the model provided comparable performance with up to 16% better energy efficiency. Gong et al. [46] developed a novel FPGA-based accelerator architecture that operated synchronously as a pipeline of instructions. The model also generated focused optimisation, which may be used to reach the highest possible level of computing efficiency. The CNN model was used to test the performance of this accelerator on a variety of platforms, including Xilinx Zynq-7020 and Virtex FPGA. The model obtained 200 MHz speed with 2.15 W power consumption and 38.136 KLUT.

D. CNN MODEL AND FPGA
Pattern classification and data mining studies have been enhanced by the success of neural networks (NNs). Many ML tasks that previously depended largely on handcrafted feature engineering have lately been transformed by end-to-end deep learning models such as CNN [47]. A 1-D CONV layer is generated from numerous computational layers organised as directed acyclic graphs. Each layer extracts a feature map, which is an abstraction of the data supplied by the preceding layer. The output result $y_n$ is given by
$$y_n = \sum_{k} w_{nk}\, x_k + b_n \tag{1}$$
where $w_{nk}$ and $x_k$ are the kernel weights and the feature map data, respectively, $b_n$ is the bias and $k$ runs over the feature maps provided. Using the output result $y_n$ as a starting point, output $Y_k$ can be represented as
$$Y_k = f(y_n) \tag{2}$$
where $f$ represents the activation function, which is commonly the rectified linear unit (ReLU) in CNNs.
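As an illustration of the multiply-accumulate form of the CONV layer output followed by the activation $f$, the following is a minimal software sketch, not the hardware design of this paper. The function names, the example signal, the kernel values and the "valid" (no padding) windowing are assumptions made for illustration only.

```python
def conv1d_output(weights, inputs, bias):
    """Weighted sum of one input window plus bias: sum_k w_k * x_k + b."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def relu(y):
    """Rectified linear unit, a common choice for the activation f."""
    return max(0.0, y)


def conv1d_layer(signal, kernel, bias, stride=1):
    """Slide the kernel over the signal (no padding) and apply ReLU."""
    k = len(kernel)
    outputs = []
    for start in range(0, len(signal) - k + 1, stride):
        window = signal[start:start + k]
        outputs.append(relu(conv1d_output(kernel, window, bias)))
    return outputs


# Hypothetical example values, not taken from the ExG data sets.
signal = [1.0, 2.0, -1.0, 0.5, 3.0]
kernel = [0.5, -1.0, 2.0]
print(conv1d_layer(signal, kernel, bias=0.1))
```

With a stride of 1, a signal of length 5 and a kernel of length 3 produce three outputs; a larger stride simply skips window positions.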
Convolution, pooling and fully connected layers are the most popular layers. 1-D CNNs have been frequently utilised in medical and healthcare applications. A 1-D CNN employs convolutional layers, which use spatial filters to promote spatial invariance. Pooling layers downsample input feature maps spatially by dividing them into subregions and merging the values of each subregion into a single value. The max-pooling and average-pooling operators use the maximum and the average value of each subregion, respectively. The fully connected (FC) layers link all the neurons in the previous layer to every single neuron in the following layer, forming a network of connections. To build a relational representation of these characteristics for each class in the classifier detection set, FC layers join the features retrieved by the CONV layers and combine them into a single entity. Finally, a SoftMax classifier is applied to the outputs of the last FC layer to obtain normalised class probabilities for the different classes in the final layer.
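The pooling and SoftMax steps described above can be sketched in a few lines. This is an illustrative software sketch under stated assumptions: non-overlapping subregions, any incomplete trailing subregion dropped, and hypothetical function names and example values.

```python
import math


def pool1d(feature_map, size, mode="max"):
    """Reduce each non-overlapping subregion of length `size` to one value."""
    if mode == "max":
        reduce_fn = max
    else:
        reduce_fn = lambda region: sum(region) / len(region)
    return [reduce_fn(feature_map[i:i + size])
            for i in range(0, len(feature_map) - size + 1, size)]


def softmax(scores):
    """Normalise raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


feature_map = [1, 3, 2, 8, 5, 4]          # hypothetical feature map
print(pool1d(feature_map, 2, "max"))      # [3, 8, 5]
print(pool1d(feature_map, 2, "average"))  # [2.0, 5.0, 4.5]
print(softmax([2.0, 1.0, 0.1]))
```

The largest score always receives the largest probability, so the predicted class is unchanged by the normalisation.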
In addition, FPGA acceleration for CNNs has received much interest. With an appropriately designed FPGA accelerator for CNN, the full potential of parallelism, low latency and high speed can be achieved, which meets the high-performance, high-speed and low-power needs of various applications. FPGAs are widely used as cost-effective options in many industries. Furthermore, the reconfigurability of FPGAs enables them to adapt quickly to new CNN designs [48]. Compared with a CPU or GPU, an FPGA has better energy efficiency. Making a high-performance FPGA accelerator is a lengthy process that typically entails many steps, including parallel architectural discovery, memory bandwidth optimisation, area and timing tuning, and software-hardware interface creation. Consequently, automatic compilers were developed for FPGA CNN accelerators, in which the hardware description of target accelerators can be generated automatically based on parametric templates, and design space exploration can be simplified into parameter optimisation with respect to network structure and hardware resource constraints [49]. In this work, an FPGA accelerator is designed for CNN using three ExGs. The FPGA accelerator is designed with less hardware and lower complexity, which increases classification speed and accuracy. Consequently, power consumption decreases.

III. THE PROPOSED ACCELERATOR OF 1-D CNN
A signal flow graph is a technique for discrete-time modelling of systems that illustrates the iterative process of data processing or classification. This approach is used to explain the 1-D CNN CONV layer and how the processing element (PE) is designed in relation to the iterations of data processing and classification, as depicted in Figure 1. The solution achieved by the multiplication and addition operations of the 1-D CNN CONV layer is shown in Figure 1. This approach is represented by the variables $w_{nk}$ and $h_{ki}$, which are the concatenation of the kernel weights and the feature map of the input data, respectively, with a bias $b_n$. To reduce the amount of hardware required for the 1-D CNN, the multiplication is replaced by a left shift. Successive accumulation operations follow, which necessitates the use of a register (R). After the iterative data processing cycle in R is completed, the accumulated value is added to $b_n$. The output $y_k$ is the final result.
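The shift-accumulate-bias flow described above can be sketched in software as follows. This is a hedged illustration and not the register-transfer design of this paper: it assumes integer (fixed-point) inputs and weights restricted to signed powers of two, each stored as a hypothetical pair (sign bit, shift amount). The XOR of the two sign bits stands in for the sign selector of the PE.

```python
def shift_multiply(x, weight_sign, weight_shift):
    """Compute x * w for w = (-1)**weight_sign * 2**weight_shift without a multiplier."""
    x_sign = 1 if x < 0 else 0
    magnitude = abs(x) << weight_shift   # left shift replaces the multiplication
    result_sign = x_sign ^ weight_sign   # XOR of the sign bits selects the output sign
    return -magnitude if result_sign else magnitude


def processing_element(inputs, weights, bias):
    """Accumulate shifted inputs in a register, then add the bias."""
    register = 0                          # plays the role of register R
    for x, (sign, shift) in zip(inputs, weights):
        register += shift_multiply(x, sign, shift)
    return register + bias                # bias is added after the accumulation


# Hypothetical weights: +2, -4 and +1 encoded as (sign bit, shift amount).
weights = [(0, 1), (1, 2), (0, 0)]
print(processing_element([3, -2, 5], weights, bias=1))  # 3*2 + (-2)*(-4) + 5*1 + 1 = 20
```

Restricting weights to powers of two trades some numerical precision for the removal of multipliers; how the weights of the actual design are partitioned is defined by the equations of this section rather than by this sketch.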
In addition, the developed architecture of 1-D CNN reduces the complexity of multiplication by using shift operations. The shift in Figure 2 is represented by the symbol im, which can be stated as shown in the equations below based on Eq. (1).
where $i$ denotes the $i$-th partition of the kernel weights of the 1-D CNN and the shift length is $m = k/i$. In the next step, Eq. (5) is substituted into Eq. (1), which yields Eq. (6).
The final form of the proposed architecture is determined by Eq. (6). The PE in Figure 2 consists of an XOR gate that serves as a selector to examine the sign bit. It employs a tristate buffer instead of other gates in the multiplexer to decrease the number of devices needed. Figure 3 depicts the architecture of the proposed CNN accelerator, in which data are stored in off-chip memory prior to the start of classification. Kernels and weights are very large in terms of data size, and the on-chip buffer is often inadequate for caching all of the parameters and data of state-of-the-art 1-D CNNs. As a consequence, off-chip memory is utilised in our implementation to store all of the network parameters as well as the results of each layer. The on-chip buffer approach was chosen to feed data effectively to the PE arrays for two reasons. First, by preloading kernels and weights from off-chip memory to the on-chip buffer using the data bus, PE arrays can access the required data at high speeds. Second, this approach loads a collection of data from off-chip memory rather than a single datum at a time, which maximises memory bandwidth utilisation. The PE array is connected to the off-chip memory via the on-chip buffer using a data bus. This enables parallelisation of data input/output and computation. Additionally, the output buffer provides interim results to the PE array and max pooling if an output channel requires more than one cycle of computation. This work introduces a 1-D CNN description interface for data management, which allows the user to make full use of the on-chip buffer.
In this study, the 1-D CNN is initially trained using 32-bit floating-point data to determine the data and parameter ranges of each layer. The bit width used in the proposed architecture is 16 bits, and the stride is 1. The data are divided into three categories in the buffer prior to being inserted into the PE array. The PE array uses a pipeline structure in which a shift register is placed between the layers of the PE array. The shift registers also function as a form of local memory for previously obtained values. The pipeline structure is used in the proposed architecture to achieve high classification efficiency and increase speed. After storage in the PE array, the data are transmitted to the pipelined adder tree, where they are utilised to complete the calculation. The adder tree is selected because it produces a high-quality output with a low critical path latency.
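To illustrate why an adder tree keeps the critical path short, the following sketch sums values pairwise and counts the number of addition stages. It is a software model under stated assumptions (zero padding of odd-sized levels, a hypothetical function name) and says nothing about the exact tree used in the proposed design.

```python
def adder_tree(values):
    """Sum values pairwise; return (total, number of addition stages)."""
    level = list(values)
    if not level:
        return 0, 0
    stages = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)               # pad odd-sized levels with a zero operand
        # Every pair in a stage could be added in parallel in hardware.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        stages += 1
    return level[0], stages


print(adder_tree([1, 2, 3, 4, 5, 6, 7, 8]))  # (36, 3)
```

Eight operands need three stages in the tree, whereas a sequential chain of adders would place seven additions on the critical path. Inserting a register after each stage is what turns such a tree into a pipeline.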
The design of max pooling and its connection to the output buffer is shown in Figure 3. A MUX is placed in front of the max pooling to route the selected input data to the associated max pool, which supports varying convolution strides and pool sizes. In the suggested design of the pooling, the tristate buffer MUX structure is utilised instead of other gates to decrease the amount of hardware used, and a selector (enable) is employed to control the output data centrally. A comparator and a feedback register are used in the max pool to save the final comparator output. According to the suggested architecture of the 1-D CNN, the results for all layers are selected by max pooling and then sorted.
Figure 4 depicts the workflow followed throughout the system building phase. The collected data set is first saved in a database for simple retrieval and analysis. Then, preprocessing procedures such as padding, reshaping and resampling are performed on the stored data. Next, the data are divided into two sections, 1) testing data and 2) training data, which are employed in the model-building step. The model creation phase has two parts, 1) model evaluation and 2) model training. The training data are used to train the model, and the testing data are used to evaluate the model's performance during the evaluation phase. Then, the ensemble is applied to the three models to unify the findings and output, and identify which model is the best. To select the best model, this step is performed 10 times using various ML algorithms. After the optimum model is stored, the system is ready to accept new data samples for classification. Finally, the algorithm is implemented on FPGAs and used to create an accelerator for classification.

B. UTILISED DATA
In this paper, the ExG signal, which comprises ECG, EEG and EMG signals, was utilised because the signal characteristics of these three types are very similar, as shown in Figure 5. The ECG data set from the UCI Machine Learning Repository was utilised. This data set contains several variables as well as a target condition of having or not having heart disease. It has 76 attributes, but all published studies use only 14 of them. Examples are age in years; sex (1 = male, 0 = female); type of chest pain; resting blood pressure; serum cholesterol; fasting blood sugar (1 = true, 0 = false); resting electrocardiographic results; maximum heart rate achieved; exercise-induced angina (1 = yes, 0 = no); ST depression induced by exercise relative to rest; slope of the peak exercise ST segment; thal, where 3, 6 and 7 denote normal, a fixed defect and a reversible defect, respectively; and target, which indicates whether the subject has the disease (1 = yes, 0 = no).
The EEG data were taken from the brainwave data set in [50], [51], which was recorded with dry electrodes and labels each positive, negative or neutral state that participants encountered. In this paper, sentiments were classified as melancholy (negative), joy (positive) and neutral (regular) by the symbols 1, 2 and 3, respectively. Movie excerpts from various films were used to elicit each of the three emotional states. The EMG data set utilised in this paper was recorded to monitor human activity, with volunteers performing normal and aggressive physical actions whilst eight channels were attached to their bodies. For simplicity of classification, the channels were encoded from 0 to 6 in this paper, with channel number 7 for the data carrier sensor.

V. RESULTS AND DISCUSSION
The proposed model was tested with three signal data sets (ECG, EEG and EMG). As mentioned above, the proposed model was tested with four different algorithms to validate it and select the best algorithm performance through the highest possible classification accuracy of the three signal data sets of ExG. The four techniques utilised were stochastic gradient descent (SGD), naïve Bayes (NB), support vector machine (SVM) and CNN. In this section, the performance of the proposed model for each algorithm is also evaluated, and the best algorithm for implementation on hardware FPGA is identified. The FPGA implementation model including a 1-D CNN algorithm that is the most effective in this model for ExG signal processing is discussed in the second part.

A. ANALYSIS OF TRAINING AND EVALUATION
In this section, the main parameters of each algorithm used, the structure of the model and the metrics used to evaluate the performance of the proposed model are discussed. Accuracy measures how often a model successfully classifies data samples, as shown in the equation below for the evaluation of the module of 1-D CNN and other algorithms based on true negative (TN), true positive (TP), false positive (FP) and false negative (FN).

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
The accuracy attained by each algorithm based on the NN of the proposed model is shown in Figure 6, and the 1-D CNN achieved a maximum accuracy of 99%. The accuracies of the four models were also compared, as shown in Figure 7. The graph clearly shows that the 1-D CNN is more accurate than the other models. This method was used with 100 iterations with 20 trains.
The main classification metrics were calculated after collecting the confusion matrix values using the classification report consisting of recall, F1-score and precision.
Recall is the proportion of positive samples accurately recognised from the real positives, as shown in Eq. (9). The models used have lower scores than
$$\text{Recall} = \frac{TP}{TP + FN} \tag{9}$$
F1-score is the harmonic mean of precision and recall, as stated in Eq. (10).
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{10}$$
Each model was assessed with regard to F1, as demonstrated in Figure 8(b), indicating that the proposed models are excellent, but the 1-D CNN is the best.
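The metrics of this subsection follow directly from the four confusion matrix counts. The sketch below computes them for a binary case; the function name and the example counts are hypothetical and are not results from this paper, and the precision formula TP / (TP + FP) is the standard definition.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)            # standard definition of precision
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts chosen only to exercise the formulas.
for name, value in classification_metrics(tp=90, tn=85, fp=15, fn=10).items():
    print(f"{name}: {value:.4f}")
```

The sketch assumes that at least one sample is predicted positive and at least one is truly positive; otherwise a denominator is zero and the corresponding metric is undefined.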
The receiver operating characteristic (ROC) curve and the area under the curve (AUC) are used to compare the TP rate with the FP rate at various thresholds through a graphical representation of sensitivity against 1 − specificity, as shown in Figure 8(a). AUC indicates the classifier's ability to discriminate between distinct classes. The AUC has three main reference values: a value close to 1 indicates that the classifier is performing well, a value close to 0 indicates that the model is 100% incorrect and classifying in reverse, and a value close to 0.5 indicates that the classifier is only guessing. The relationship between FP rate (1 − specificity) and TP rate (sensitivity) is depicted in Figure 9. All curves close to 1 indicate that the proposed model classifier performs well, except for the SGD model, which is only guessing because its curve is close to 0.5.
FIGURE 9. The receiver operating characteristic curve of four models.
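How an ROC curve and its AUC are obtained from classifier scores can be sketched as follows. This is a generic illustration, not the evaluation code of this paper: it assumes binary labels (1 = positive, 0 = negative), at least one sample of each class and distinct scores, and the function name and example values are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the trapezoidal rule."""
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]                 # (FP rate, TP rate) pairs
    for _, label in ranked:
        # Lowering the threshold past this sample adds one predicted positive.
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2  # trapezoid between adjacent points
    return auc


print(roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # 1.0: perfect separation
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6]))  # 0.75
print(roc_auc([0, 0, 1, 1], [0.9, 0.8, 0.3, 0.1]))  # 0.0: classifying in reverse
```

The three printed values correspond to the cases discussed above: a good classifier, an intermediate one and one that ranks the classes in reverse.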

B. ANALYSIS OF IMPLEMENTATION ON HARDWARE
The proposed architecture of the 1-D CNN was implemented on the Xilinx Zynq xc7z045 platform. The same assessment carried out on the testing set on the PC platform was also performed on hardware. Table I compares the performance and characteristics of the proposed FPGA-based 1-D CNN accelerator with those of other accelerators, grouped by the kind of data processed on the accelerator. Because studies on 1-D CNN accelerators for EMG and EEG are limited, 2-D CNN and other models were used in the comparison instead, and the selected studies are listed in Table I. The proposed accelerator achieved a high speed of 442.948 MHz compared with the highest speed of 388.29 MHz [36] prior to this work, as shown in Figure 10(a).
Jiahao et al. [44] proposed a CNN-based accelerator layout that achieved the highest resource efficiency reported so far, 16.71 GOP/s/kLUT. The architecture proposed in this work improves resource efficiency to 28 GOP/s/kLUT, which is now the highest value. The results demonstrate that our solution can reach an energy efficiency of 161 GOP/s/W on the Zynq xc7z045 device, which is much better than the previously reported results for other techniques. Several ExG biosignals, namely EMG, EEG and ECG, were described in this paper. The approach was implemented on an FPGA platform and enabled multiclass learning and classification of the signals. Several methods were also employed to decrease the number of devices used. For example, a tristate buffer with an enable signal for central control was used in the multiplexer to reduce resource utilisation. As a result, this work has the lowest resource utilisation, 1.067 KLUT, as shown in Figure 10(b). In addition, the sign bit was taken into account by configuring the XOR gate as a selector that controls the sign bit.
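As a quick arithmetic check of the reported improvement factors, the ratios can be worked out from the figures quoted in this subsection. That the reported 1.67× and 1.14× factors were derived from exactly these pairs of values is an assumption made here for illustration.

```latex
\[
\frac{28\ \text{GOP/s/kLUT}}{16.71\ \text{GOP/s/kLUT}} \approx 1.676,
\qquad
\frac{442.948\ \text{MHz}}{388.29\ \text{MHz}} \approx 1.141
\]
```

Both values are in line with the reported factors of 1.67× for resource efficiency and 1.14× for speed.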
The bandwidth requirement of the off-chip memory is directly proportional to the access data width of the memory. When accessing 32-bit data, the bandwidth requirement of the memory is 1.8 GB/s, but when accessing 512-bit data, it rises to 28.4 GB/s. In this work, the access data width is set to 512 bits in order to optimise memory bandwidth. Kernels and weights have been carefully structured to accommodate the 512-bit access mode operating at 442.948 MHz.
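The proportionality between access width and bandwidth can be checked with a short calculation. The sketch assumes one memory access per clock cycle at 442.948 MHz and 1 GB = 10^9 bytes; both are assumptions made for illustration, and the function name is hypothetical.

```python
def memory_bandwidth_gb_per_s(access_width_bits, clock_hz):
    """Bytes moved per access times accesses per second, expressed in GB/s."""
    bytes_per_access = access_width_bits / 8
    return bytes_per_access * clock_hz / 1e9


CLOCK_HZ = 442.948e6  # operating frequency reported for the accelerator

for width in (32, 512):
    bandwidth = memory_bandwidth_gb_per_s(width, CLOCK_HZ)
    print(f"{width}-bit access: {bandwidth:.2f} GB/s")
```

Under these assumptions the results are approximately 1.77 GB/s and 28.35 GB/s, close to the 1.8 GB/s and 28.4 GB/s stated above, and the 16-fold increase in width gives a 16-fold increase in bandwidth.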
Finally, the peak single-precision throughput (TFLOP/s) of mainstream FPGA devices is measured by the method proposed in [52]. The proposed design has a peak single-precision throughput of 1.145 TFLOP/s, which is higher than that of the earlier designs [44], [45], which had 358.907 GFLOP/s and 2.262 GFLOP/s, respectively.

VI. CONCLUSION
This paper proposes a new model for classifying aggregated ExG signals consisting of ECG, EEG and EMG. When the proposed model is evaluated with four algorithms to ensure that it works properly, the average accuracy of the 1-D CNN algorithm reaches 99%, which is higher than that of the other models. The 1-D CNN, which has the highest average accuracy, is therefore implemented as an accelerator to classify ExG signals in wearable healthcare devices. The pipelined structure is designed with a register in the middle to facilitate easy data transfer. The proposed 1-D CNN accelerator works very efficiently owing to the use of a tristate buffer in the multiplexer and the replacement of the multiplier by a shift, which results in a resource-efficient accelerator with 161 GOP/s/W energy efficiency and 28 GOP/s/KLUT, an improvement of 1.67× over the previous model. In addition, the floating-point processing parts have a peak rate of 1.145 TFLOP/s, whilst the off-chip memory has a bandwidth of 28.4 GB/s. The 1-D CNN accelerator for classifying ExG signals implemented on the Xilinx Zynq xc7z045 platform outperforms FPGA peer applications on the same platform by 1.14× in terms of speed.