CVs Classification Using Neural Network Approaches Combined with BERT and Gensim: CVs of Moroccan Engineering Students

Deep learning (DL)-oriented document processing is widely used in different fields for extraction, recognition, and classification processes from raw corpora of data. This article examines the application of deep learning approaches based on different neural network methods, including the gated recurrent unit (GRU), long short-term memory (LSTM), and convolutional neural networks (CNNs). The compared models were combined with two different word embedding techniques, namely Bidirectional Encoder Representations from Transformers (BERT) and Gensim Word2Vec. The models are designed to evaluate the performance of neural network architectures for the classification of CVs of Moroccan engineering students at ENSAK (National School of Applied Sciences of Kenitra, Ibn Tofail University). The dataset comprises CVs collected from engineering students at ENSAK in 2023 for a project on the employability of Moroccan engineers, in which new approaches were applied, especially machine learning, deep learning, and big data. In total, 867 resumes were collected from five specialties of study: Electrical Engineering (ELE), Networks and Systems Telecommunications (NST), Computer Engineering (CE), Automotive Mechatronics Engineering (AutoMec), and Industrial Engineering (Indus). The results showed that the proposed models based on the BERT embedding approach were more accurate than models based on the Gensim Word2Vec embedding approach. The CNN-GRU/BERT model achieved slightly better accuracy, at 0.9351, than the other hybrid models. Single learning models also performed well, especially those based on BERT embeddings, where CNN had the best accuracy at 0.9188.


Introduction
The use of artificial intelligence (AI), including deep learning (DL) and machine learning (ML) approaches, to process large corpora, with the main objectives of information extraction, discovery of hidden patterns, and classification and categorization, has gained popularity in certain fields [1]. In many areas of the healthcare sector, clinical decision-making based on scanned images and laboratory samples has produced good results in classification, interpretation, and even prediction for patients [2,3]. Several machine learning approaches have also been proposed for evaluating performance in different fields, such as educational systems and the assessment of student performance [4,5], as well as for sentiment analysis, which can be applied in a variety of fields such as politics [6] and tourism [7].
The use of machine learning can therefore be an innovative approach to improve the employability of students, support their career prospects, and identify their skills and weaknesses based on labor market requirements [8]. Different classification approaches can be applied to capture relevant information and discover hidden patterns by analyzing data related to students' activities [9]. A variety of data sources have been used, including professional social networks (e.g., the LinkedIn platform), CV and resume platforms, university student profiles, and even raw digital documents [10,11]. Even though CVs are considered unstructured and noisy documents, student CVs remain important documents that contain relevant professional and personal information and are directly oriented towards employability [12]. Accordingly, our research aims to create an employability model for Moroccan engineering students using combined machine learning approaches to extract relevant information from their CVs.
When processing data from the previous sources to obtain targeted results, a supervised or unsupervised machine learning approach can be used, including classification and clustering algorithms and even neural network approaches, where data can be with or without labels [13,14]. Recently, models based on deep learning techniques, namely neural network approaches such as the gated recurrent unit (GRU), long short-term memory (LSTM), and convolutional neural network (CNN), have generally been chosen for their good performance on sequential tasks such as NLP and text classification [15,16].
In our case, the data under analysis consist of student CVs containing a variety of information, including personal and professional information, hard skills, soft skills, and academic background. Deep learning techniques based on neural network approaches have shown good performance and accuracy in text classification [15].
Accordingly, for the classification of ENSAK's engineering students' CVs, we propose two sets of models combining two embedding techniques, namely BERT and Gensim, with three neural network architectures, namely GRU, LSTM, and CNN. The results obtained from this study can serve as a basis for an in-depth analysis of the labor market and skills of Moroccan engineers. Additionally, the application of deep learning could contribute to the development of recommendation systems based on unstructured data associated with candidates and job offers, with the aim of reducing unemployment.
Previously, different approaches have been proposed that use data such as job postings as a basis for building recommendation systems based on web-mined job postings and job seekers' skills, including classification and recommendation for job seekers and job providers [17]. However, the use of student documents, notably CVs, and the classification of resumes are still little explored, particularly in the Moroccan context and with new approaches such as BERT and neural network architectures.
The paper is organized as follows. The second section presents related works that have been applied to identify relevant information based on intelligent approaches and machine learning models, especially for employability in the Moroccan context. The third section presents the methodology adopted in this paper, the data collection method, and the machine learning models used for both architectures (BERT and Gensim). Finally, the fourth section details the findings and the interpretation of the results obtained with the different approaches adopted.

Related Works
The application of machine learning and deep learning to subjects related to students and graduates is no exception, since these new approaches have shown good results in supporting decision-making in the education system [4]. These models are based on data from various sources and systems related to the academic process, including Learning Management Systems (LMS), MOOCs, Student Information Systems (SIS), and Intelligent Teaching Systems (ITS) [4]. Accordingly, DL and NLP have been introduced into the education system in various areas, such as interpreting student behavior, detecting a lack of student motivation, analyzing the level of interest in lessons, predicting academic results, and even preventing school dropout [5,18].
Several studies have been carried out in the Moroccan educational context. The authors of [19] proposed a model with an accuracy of 71% based on data from the MASSAR Scholar Management System, targeting baccalaureate students. Based on the same experience, the prediction of achievement is very important for this group of students, since the decision to support or reinforce courses benefits both students and the educational system [19]. Furthermore, the author of [20] focused on using massive open online course (MOOC) data to classify students and predict dropout; the objective is a predictive model for a setting where the dropout rate reaches 90%. The accuracies of the compared machine learning models were as follows: Support Vector Machines (85.2%), K-Nearest Neighbors (83.9%), Decision Trees (77%), Naive Bayes (85.5%), and Logistic Regression (86.8%), with a combinatorial approach based on voting reaching 92% [20].
In addition, the author of [21] conducted research into the prediction of on-time graduation rates among Moroccan students using Support Vector Machines, Decision Trees, Naive Bayes, Logistic Regression, and Random Forest. Random Forest's accuracy of 77% was found to be the highest among the models when analyzing academic success factors in Moroccan universities. Moreover, Ouatik [22] proposed a model based on a Big Data architecture (Hadoop and MapReduce) using neural networks, Naive Bayes, and K-Nearest Neighbors to classify student orientation. Based on the results, Naive Bayes was the most accurate and efficient for on-time data processing, with an accuracy reaching 96%. Further, based on a systematic review of models proposed for the use of Big Data and machine learning for employability, the author of [23] proposed an intelligent system for employability based on Big Data and machine learning in the Moroccan context using different data sources.
Meanwhile, machine learning and deep learning have also gained great interest for the period after graduation, especially for predicting employability for different stakeholders, including the education system, employers, and graduates [24,25]. To build such models, different factors are taken into account, including technical skills, soft skills, personality traits, demographics, extracurricular activities, and internships [26]. Different sources are considered, including recruitment platforms, professional social media platforms, CV platforms, and unstructured CV files [26]. The targeted results can be binary, such as employability (employable or unemployable), or multi-class, corresponding to the labor market situation, according to different approaches (SVM, ANN, LR, AdaBoost, etc.) [26].
The first approach involves extracting relevant information from different corpora and detecting features by using Named Entity Recognition (NER) and Named Entity Normalization (NEN) [27]. Resume and CV classification based on DL is widely used, since the hiring process involves the analysis of each candidate's documents, which are usually unstructured documents with very noisy data. Extracting entities such as attributes and skills is considered a challenging task for recruiters, hiring managers, and intelligent models [28]. These approaches therefore aim to develop an automated model that can extract relevant information, such as skills and personal characteristics, for job matching and recommendation (Table 1). For the NER approach, many researchers have used controlled vocabularies in the form of folksonomies and taxonomies, such as ESCO (European) and O*NET (American), which classify European and American skills, competencies, qualifications, and professions [29,30]. However, in the Moroccan context, the application of machine learning to employability prediction is still in its infancy. Nevertheless, some research has given a roadmap in this direction by analyzing the skills needed within the local professional market and extracting different features and classifications. In particular, the experiment conducted in [34], which applied an LSTM neural network to both demands and offers using word embedding vectors supplemented by the ESCO taxonomy database, produced relevant classifications for different decision makers based on the priority of different features (explicit skills, soft skills, demographic and geographic information, experiences, etc.).
On the other hand, ref. [35] combined the Word2Vec word embedding technique with a deep neural network (DNN) approach for the classification of IT resumes (Web/software development, Network engineering, Embedded software engineering, Testing engineering, Business intelligence, Big data development, Data science, Information systems management, Database administration) and the labor market in the Moroccan context. In addition, these approaches can be applied to recommendation systems: [36] proposed a recommendation system using data from Moroccan e-recruitment platforms (Rekrute, Emploi.ma, LinkedIn, etc.) based on a classification approach including a Weighted Semantic Network and a Vector Space Model (VSM).

Sampling and Data Collection Method
The dataset of CVs was collected as part and continuation of a previous study carried out on students' skills and their perceptions of labor market demands [37], which aimed to uncover students' perceptions of their market-ready skills, including their assessment of the importance of each skill, such as technical and soft skills. The present study uses CVs collected from the same sample of ENSAK students (Table 2) and applies machine learning approaches to examine the performance of each model for the automated and intelligent classification of their CVs. The ENSA network (Écoles Nationales des Sciences Appliquées) is the largest engineering school network in Morocco, with eleven establishments in the main regions. The network has thousands of students in various modern fields of study relevant to the professional workplace. The survey was conducted over a six-month period, between January 2023 and June 2023; the students' responses were collected using the Google Forms application, with consideration for the confidentiality of personal information in the data collected. The dataset contained 867 CVs from five departments: Electrical Engineering (ELE), Networks and Systems Telecommunications (NST), Computer Engineering (CE), Automotive Mechatronics Engineering (AutoMec), and Industrial Engineering (Indus). Accordingly, the main objective of the current research is to study the relevance of models built using deep learning, namely neural network approaches and word embedding techniques, to the topic of student employability using information extracted from their CVs.

Experiment and Problem Definition
In the first step, the experiment started by loading the collected dataset, which was composed of unstructured files of different formats (Table 3). Next, we generated a global CSV file by extracting the text from each document, with the data labeled according to the specialty of study. After loading the final CSV file and removing the noisy data (symbols, punctuation, numbers, etc.), as outlined in the preprocessing and cleaning sections, the generated dataframe can be used for the experiment. The data were then loaded into the different models to generate the vector representation of the text based on the BERT and Gensim embedding techniques. Finally, the output of each embedding technique was used as input for the different neural networks adopted in the research, including GRU, LSTM, and CNN.
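As an illustration of the cleaning step, the following is a minimal sketch of how symbols, punctuation, and numbers might be stripped from extracted CV text. The function name `clean_cv_text` and the exact rules are assumptions made for this example; the cleaning rules actually applied in the study are not specified beyond the categories of noise listed above.

```python
import re

def clean_cv_text(raw):
    """Remove digits, punctuation, and symbols from extracted CV text.

    Hypothetical helper: the rules below are assumptions, not the
    study's actual preprocessing code.
    """
    # Drop digit runs (phone numbers, dates, postal codes).
    text = re.sub(r"\d+", " ", raw)
    # Replace punctuation, symbols, and underscores with spaces.
    # \w is Unicode-aware, so accented French letters are kept.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

print(clean_cv_text("Ingénieur R&D, 2023 - Kénitra!"))  # Ingénieur R D Kénitra
```

Accented characters are preserved deliberately, since most of the CVs are written in French.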

Architecture of the Proposed Solution
The proposed models consider the set of CVs C = {c1, ..., cn} as the input and the student's specialty S = {s1, ..., sk} as the targets. The models classify and measure the probabilities {(c1, s1), ..., (cn, sk)}. The inputs to the classifier are the training data, which are a finite sequence of S × C pairs. The output of the classifier is the function f: C → S that predicts s ∈ S for new samples in C (Figure 1). The basis of the models is to construct neural network architectures based on interconnected neurons in different layers, following the metaphor of the human brain for information processing [38]. After the common text preprocessing step, the embedding layer of the first set of models is generated based on Gensim tokenization, and the target output (specialty of the student) is represented using one-hot encoding. For the second set of models, which are based on the BERT architecture, the input comes from the embedding layer, where the text has gone through feature preparation and tokenization in line with the pre-trained models, generating the required tokens such as [ids_token], [mask_token], and [padd_token] while adding special tokens including [CLS] and [PAD]. The text classification system can be decomposed into the following four stages: feature extraction, dimension reduction, classifier selection, and evaluation.
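The mapping f: C → S and the one-hot target encoding can be made concrete with a short sketch. The ordering of the five specialties in `SPECIALTIES` and the helper names are assumptions for illustration only; a trained model would supply the probability vector.

```python
# Class order is an assumption; any fixed order works as long as it is
# used consistently for encoding targets and decoding predictions.
SPECIALTIES = ["ELE", "NST", "CE", "AutoMec", "Indus"]

def one_hot(label):
    """Encode a specialty label as a 5-dimensional one-hot target."""
    index = SPECIALTIES.index(label)
    return [1 if i == index else 0 for i in range(len(SPECIALTIES))]

def predict_specialty(probabilities):
    """Realize f: C -> S by picking the most probable specialty."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return SPECIALTIES[best]

print(one_hot("CE"))                                        # [0, 0, 1, 0, 0]
print(predict_specialty([0.05, 0.10, 0.70, 0.10, 0.05]))    # CE
```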

Long Short-Term Memory (LSTM)
Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture introduced in 1997 [39]. This model has proven successful in solving the vanishing error problem that arises in standard RNN models, as it offers the possibility to use a Constant Error Carousel (CEC), where memory cells are used instead of non-linear activation functions such as Tanh or Sigmoid. This solution enables LSTM models to store and transmit information over the long term (Figure 2). In the standard LSTM formulation, σ is the sigmoid function, tanh is the hyperbolic tangent function, and i, f, o, C, and Ĉ are the input gate, forget gate, output gate, the content of the memory unit, and the content of the new memory unit, respectively. The sigmoid function is used to form three gates in a memory cell, while the tanh function is used to improve the performance of a specific memory cell.
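To make the roles of i, f, o, C, and Ĉ concrete, the following is a minimal sketch of one LSTM time step in its commonly used formulation, reduced to scalar input and scalar state for readability. The parameter layout `p` (a weight, recurrent weight, and bias per gate) is a hypothetical convention for this example and does not reflect the configuration used in the experiments.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step with scalar input and scalar state.

    p maps each name ("i", "f", "o", "c") to a (w, u, b) triple:
    input weight, recurrent weight, bias (hypothetical layout).
    """
    def pre(name):
        w, u, b = p[name]
        return w * x + u * h_prev + b

    i = sigmoid(pre("i"))         # input gate
    f = sigmoid(pre("f"))         # forget gate
    o = sigmoid(pre("o"))         # output gate
    c_hat = math.tanh(pre("c"))   # content of the new memory unit
    c = f * c_prev + i * c_hat    # updated memory cell content
    h = o * math.tanh(c)          # hidden state passed to the next step
    return h, c
```

The additive update `c = f * c_prev + i * c_hat` is what lets the cell carry information across many time steps without repeatedly squashing it through a non-linearity.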

Gated Recurrent Unit (GRU)
The Gated Recurrent Unit (GRU) was introduced in 2014 as a solution to LSTM's complexity and to the vanishing gradient problem [41,42]. Moreover, by implementing gating mechanisms within their networks, GRU and LSTM can capture and propagate information over long sequences. GRU consists of the following components (Figure 3):
Update Gate (zt): This calculates how much information from the past should be carried forward.
Reset Gate (rt): This determines how much information should be forgotten about the past.
Current Memory Content (ht): This value represents the network's current state.
The update gate and reset gate are responsible for controlling the information flow, and their values are learned adaptively during the training process. The GRU architecture is more computationally efficient than LSTM because it has fewer parameters. However, both GRU and LSTM are widely used in natural language processing, speech recognition, and various sequence modeling tasks.
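The three components can be sketched as a single GRU time step with scalar input and state. This is an illustrative sketch under assumptions: the parameter layout `p` is hypothetical, and the convention used here lets the update gate weight the past state, matching the description above; some formulations swap the roles of z and 1 - z.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU time step with scalar input and scalar state.

    p maps "z", "r", "h" to (w, u, b) triples (hypothetical layout).
    """
    wz, uz, bz = p["z"]
    wr, ur, br = p["r"]
    wh, uh, bh = p["h"]
    z = sigmoid(wz * x + uz * h_prev + bz)                # update gate
    r = sigmoid(wr * x + ur * h_prev + br)                # reset gate
    h_cand = math.tanh(wh * x + uh * (r * h_prev) + bh)   # candidate state
    # Blend: z keeps the past state, (1 - z) admits the candidate.
    return z * h_prev + (1.0 - z) * h_cand
```

In this scalar toy setting a GRU step needs three parameter triples, whereas an LSTM step in the usual formulation needs four (three gates plus the candidate memory), which illustrates why GRU has fewer parameters.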

The Convolutional Neural Network (CNN)
The convolutional neural network (CNN) architecture is a model of deep neural networks used primarily to analyze visual data [43]. CNNs are used effectively in various deep learning tasks, including image classification, object detection, and text classification. Based on input data, they can learn spatial hierarchies of features automatically and adaptively [43]. Important elements of a CNN include:
Convolutional Layers: These layers operate on the input data using convolutional operations. To detect patterns or features in the input data (image, text, etc.), convolution applies a filter, also called a kernel (Conv1d, Conv2d, and Conv3d). A convolutional layer helps the network learn to represent features hierarchically.
Pooling Layers: Convolutional layers are typically followed by a pooling layer. In pooling, the width and height of the feature maps are reduced, but their depth (the number of channels) is maintained. Pooling operations can be performed in several ways, including max pooling, which retains the maximum value in a region, and average pooling, which retains the average value.
Activation Functions: To introduce non-linearity into the model, non-linear activation functions, such as ReLU (Rectified Linear Unit), Softmax, and Sigmoid, are applied. In this way, the neural network is able to learn more complex patterns and relationships in the data.
Fully Connected Layers: A convolutional network typically contains several convolution and pooling layers, followed by one or more fully connected layers. These layers let the network make final predictions; each neuron in a fully connected layer is connected to every neuron in the previous layer.
Flattening: Convolutional and pooling outputs are flattened into one-dimensional vectors before the fully connected layers.
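For text, the one-dimensional case is the relevant one. The sketch below chains a single-channel 1D convolution, ReLU, and max pooling in plain Python; the function names and the toy signal are illustrative assumptions, not the layers or hyperparameters used in the study.

```python
def conv1d(signal, kernel):
    """Valid (no padding) 1D cross-correlation, as used in CNN layers."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def relu(values):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

signal = [1.0, -2.0, 3.0, 0.0, 2.0, -1.0]
feature_map = conv1d(signal, [1.0, -1.0])    # [3.0, -5.0, 3.0, -2.0, 3.0]
features = max_pool1d(relu(feature_map), 2)  # [3.0, 3.0]
print(features)
```

With a single channel the pooled output is already one-dimensional; with several filters, the per-filter outputs would be concatenated (flattened) before the fully connected layers.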

The Bidirectional Encoder Representations from Transformers (BERT)
The Bidirectional Encoder Representations from Transformers (BERT) model builds on the Transformer architecture that A. Vaswani et al. introduced in "Attention Is All You Need" in 2017 at Google [44]. The model provides an embedding layer of pre-trained bidirectional representations learned from a large unlabeled text corpus, including Wikipedia and BookCorpus [45]. The related BERT models are BERT-Base and BERT-Large. Transformer encoding and decoding layers are based on multiple attention "heads". The main features and mechanisms of the BERT model are as follows:
Contextual Word Embeddings: Unlike traditional embedding techniques such as Word2Vec and GloVe, BERT produces contextual word embeddings; the representation of each word depends on the context of the sequence or sentence in which it appears. The model is based on a bidirectional encoder: rather than reading the input text sequentially (left to right or right to left), it attends to the whole sequence at once.
Transformer Architecture: To analyze a sequential data stream, BERT uses the Transformer neural network architecture. The original Transformer has two main components: an encoder, responsible for reading the input, and a decoder, responsible for the prediction task; BERT relies on the encoder part. BERT is a DL model that is pre-trained using two objectives: Next Sentence Prediction (NSP) and Masked Language Model (MLM). The mechanism used by the BERT model also relies on specialized tokenization. BERT uses specialized markers during the learning process, such as [CLS] to indicate the beginning of the sequence; [SEP] to indicate the end of the sequence; [PAD] to pad if the sequence lengths are different; and [UNK] to indicate an unknown word in the sequence.
As shown in Figure 4, the first step for building the BERT-based embedding method is to import the required Hugging Face library and define the pre-trained BERT model to be used. In our case, two considerations guided the choice: first, the majority of resumes are written in French; second, the average length of the sequence.
The overall pipeline used in the majority of BERT models typically comprises the following steps: tokenization, padding, numericalization, and embedding [45]. During tokenization, the [ids_token], [mask_token], and other specialized tokens are generated, such as [CLS] marking the start of the sequence; [SEP] for the end of the sequence; [PAD] for padding when the sequences do not have the same length; and [UNK] for unknown words in sequences. Since most resumes are in French, we used the uncased BERT model architecture with 12 layers, a hidden size of 768, and 12 attention heads, with 110M parameters. Token representations are computed by the first-level encoder and then used by the second-level encoder. The whole process is repeated until the 12th encoder is reached, which is the final encoder. The output matrix had a size of 256 × 768, where 256 represents the number of tokens in the sequence and 768 represents the hidden size.
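The tokenization, padding, and numericalization steps can be illustrated with a toy sketch. It uses whole-word lookup in a hypothetical six-entry vocabulary; a real BERT tokenizer splits text into WordPiece subwords and uses its own vocabulary ids, so the names, ids, and `max_len` below are assumptions for illustration only.

```python
def encode(words, vocab, max_len):
    """Wrap words in [CLS]/[SEP], map them to ids, and pad to max_len.

    Returns (ids, mask); mask is 1 for real tokens and 0 for padding.
    """
    # Truncate so that the two special tokens still fit.
    tokens = ["[CLS]"] + words[: max_len - 2] + ["[SEP]"]
    # Words missing from the vocabulary fall back to [UNK].
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    mask = [1] * len(ids)
    padding = max_len - len(ids)
    ids += [vocab["[PAD]"]] * padding
    mask += [0] * padding
    return ids, mask

# Hypothetical toy vocabulary (ids are not those of any real model).
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "ingénieur": 4, "réseaux": 5}
ids, mask = encode(["ingénieur", "python", "réseaux"], vocab, 8)
print(ids)   # [2, 4, 1, 5, 3, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

With a maximum length of 256 and a hidden size of 768, each encoded CV would yield the 256 × 768 output matrix described above.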

Data Loading
In the experiment, the dataset was divided into two parts: the training data, which comprise 80% of the dataset, and the testing data, which comprise the remaining 20%. The training and testing datasets are loaded into the models using the data loader function, where the batch size and number of epochs are initialized. In our case, multi-process mode is used to iterate over the dataset, with 15 epochs and a batch size of 32 (Figure 5).
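The arithmetic behind these settings can be checked with a small sketch. The shuffling, the fixed seed, and rounding the training size down are assumptions; the study does not state how the split was drawn or whether it was stratified by specialty.

```python
import math
import random

def train_test_split(items, train_ratio=0.8, seed=0):
    """Shuffle and split; the training size is rounded down (assumption)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def batches(items, batch_size):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

train, test = train_test_split(range(867))       # 867 CVs in the dataset
steps_per_epoch = math.ceil(len(train) / 32)     # batch size 32
total_steps = steps_per_epoch * 15               # 15 epochs
print(len(train), len(test), steps_per_epoch, total_steps)  # 693 174 22 330
```

Under these assumptions, each epoch consists of 21 full batches and one final batch of 21 CVs.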

Experimental Settings
The experiments were conducted in the Google Colab Pro environment, which provides compute units (100) and powerful accelerators (TPU, GPU T4, GPU A100, and GPU V100). In particular, the BERT architecture requires substantial memory and GPU resources. The dataset used is the result of Extraction, Transformation, and Loading (ETL) techniques applied to raw documents, as shown in Table 3. The BERT tokenizer and associated model used are "dbmdz/bert-base-french-europeana-cased", since the majority of CVs were written in French. As a final step, we adopted the following parameters and architectures:

• TensorFlow and Keras libraries were used.

2. Pre-process data (cleaning and deleting noisy data).
3. Generate one-hot encoding for each class representing the field of study.
4. Split the dataset into two parts, training and testing, with a ratio of 80:20, respectively.
a. Tokenization step based on either the BERT pre-trained model or the Gensim embedding approach, where the tokenization was based on the unigram mode.
5. Add new tokens related to competencies and unknown vocabulary to the vocab.txt of the BERT models.
6. Create an embedding matrix for every word in the vocabulary.
7. Build a simple model or hybrid model based on a combination of CNN, LSTM, and GRU.
8. Dense (5 classes) layer with Softmax activation function.
10. Train the model on the training set.
11. Evaluate the model on the test set.
Note: For the Gensim algorithm, the fifth step is omitted.
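The final dense layer with Softmax activation turns five raw scores into a probability distribution over the five specialties. The sketch below shows the computation in plain Python; the example scores are invented for illustration and are not outputs of the trained models.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for the five classes of one CV.
probabilities = softmax([2.0, 1.0, 0.1, -1.0, 0.5])
predicted_class = probabilities.index(max(probabilities))
```

The predicted specialty is the class with the highest probability; here that is the first class, since it has the largest score.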

Results
In the first step of the experiment, we aim to verify whether the two classification methods are effective in the context of CVs, particularly for Moroccan students, where such experiments are still little explored. The second step consists of comparing the two approaches, BERT and Gensim, combined with neural network approaches (GRU, LSTM, CNN), and determining which models are the most accurate and precise for classification.
Based on the architecture adopted for this study (Figures 1 and 5), it is possible to test and discuss the results of simple and hybrid deep learning models, which are combinations of different neural network models with two different word embedding approaches: BERT and Gensim. The classifications of the students' resumes were evaluated using three metrics: accuracy, precision, and recall. The first observation relates to model execution time: BERT models require a little more time to train and predict than Gensim models due to their complexity. On the other hand, according to the three evaluation measures, BERT-based models achieve higher accuracy, precision, and recall (Table 4).
As shown in Table 4, this study's results were based on three evaluation metrics (accuracy, precision, and recall). The first observation is that the performance of the BERT-based models improved after including a vocabulary dictionary containing new words related to competencies and skills that were unknown in the original versions. Furthermore, compared to the Gensim-based approaches, models based on BERT text representation achieved the best accuracy and performance for the majority of approaches, especially the models based on CNN as a single layer or combined with the other neural networks, LSTM and GRU. The hybrid models based on BERT achieved the best accuracy compared to the single models from the two approaches, BERT and Gensim: CNN-GRU/BERT had the best accuracy at 0.9351, while GRU-LSTM/BERT had the highest precision at 0.9329. Finally, the best recall score, 0.9411, was achieved by the CNN-GRU model based on BERT.
Furthermore, the single learning models also have good metrics, especially those based on BERT embedding architectures, where CNN has the best accuracy at 0.9188. The highest precision is achieved by GRU at 0.9181, while LSTM/BERT achieved the best recall at 0.8963. For the models based on Gensim, the three metrics also showed good performance; the hybrid models performed best, with LSTM-CNN/Gensim achieving the best accuracy at 0.9025. Meanwhile, the CNN-LSTM/Gensim model has good precision at 0.8821, whereas the LSTM-GRU model achieved good recall at 0.7995.
Finally, for the single models based on Gensim, the CNN model reached a good accuracy and precision with 0.9021 and 0.8961, respectively, while GRU had a recall value of 0.7951.
The results show that the hybrid models built on BERT embeddings benefit from the combined capabilities of the GRU, LSTM, and CNN approaches. Training used several stacked layers, a defining feature of deep learning models, which improved the models' accuracy. The BERT hybrid model outperforms the Gensim hybrid model, as shown in Figure 6, which plots the training and validation history. The BERT model's training and validation curves track each other closely, whereas the Gensim models may show a larger gap, which can indicate under-fitting or over-fitting.
The confusion matrix is an essential visual tool for interpreting a model's performance (Figure 7). It displays the number of correct and incorrect predictions made by a machine learning model for each class, which gives a better understanding of the model's recall, accuracy, precision, and overall effectiveness in distinguishing between classes. According to the confusion matrix, CNN-GRU/BERT predicts certain classes with greater precision and recall than other models, as shown in Table 2. The model predicts Electrical Engineering (ELE) with the highest scores among the specialties, with a precision of 97.5% and a recall of 95.1%; in contrast, Computer Engineering (CE) received the lowest scores, with a precision of 90.3% and a recall of 88.6% (Table 5). The misclassification of Computer Engineering (CE) as other classes can be explained by the competencies it shares with the other specialties, especially IT-related hard skills such as "python", "java", and "C++".
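The way per-class precision and recall are read off a confusion matrix can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the matrix convention (rows are true classes, columns are predicted classes) is an assumption, and the 3-class toy matrix contains made-up counts that are unrelated to the values in Figure 7 or Table 5.

```python
def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall)} from a square confusion matrix.

    Assumed convention: matrix[i][j] counts samples whose true class is
    labels[i] and whose predicted class is labels[j].
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_positive = matrix[k][k]
        predicted_as_k = sum(row[k] for row in matrix)  # column sum
        actually_k = sum(matrix[k])                      # row sum
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        metrics[label] = (precision, recall)
    return metrics


def accuracy(matrix):
    """Share of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / total if total else 0.0


# Toy data for illustration only (not results from the study).
toy_labels = ["A", "B", "C"]
toy_matrix = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
toy_metrics = per_class_metrics(toy_matrix, toy_labels)
toy_accuracy = accuracy(toy_matrix)
```

A class whose row spreads into other columns, as class B does in the toy matrix, loses recall; a class whose column collects entries from other rows loses precision. This is the pattern the text describes for Computer Engineering.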
The present study also showed good results compared with previous work on CV classification. Ref. [47] obtained a 78.53% accuracy for resume classification with Linear SVM, against Logistic Regression (62.40%), Multinomial Naïve Bayes (44.39%), and Random Forest (38.99%). In contrast, [10], which also classified resumes with machine learning algorithms, reported lower results: Naïve Bayes (45%), SVM (60%), and Random Forest (70%).
In the Moroccan context, machine learning-based classification research is still very limited but is progressing. Ref. [48] proposed job offer classification models for the Moroccan job market based on SVM, Naïve Bayes, Logistic Regression, and BERT. The BERT model had the best accuracy at 94%; however, accuracy varied considerably by sector, with IT-related offers having the lowest rate (85%) compared with offers from other specialties.
Figure 8 shows a t-SNE visualization of the text representations produced by the two embedding methods. With the BERT embeddings, four distinct clusters are observed and properly grouped, unlike with the Gensim model, where the groups are not clearly separated.

Conclusions
This paper presented the application of simple and hybrid deep learning models to the classification of CVs, using a dataset of 867 CVs of ENSAK students. The approach was based on three neural network techniques, GRU, LSTM, and CNN, in single or combined mode, each paired with one of two text representation techniques: BERT and Gensim embeddings. The embedding outputs of both methods were fed as inputs to the neural networks, and the resulting models were compared on three metrics (accuracy, precision, and recall). CNN-GRU/BERT has the best accuracy with 0.9351, while GRU-LSTM/BERT has the highest precision with 0.9329. The best recall score, 0.9411, was achieved by the CNN-GRU model based on BERT. Additionally, the confusion matrix and t-SNE visualization indicated that the BERT models learned well, with especially good results for the classes Electrical Engineering (ELE) and Industrial Engineering (Indus), where precision reached 0.975 and 0.937, respectively. In contrast, Computer Engineering was misclassified as classes such as Networks and Systems Telecommunications and Electrical Engineering because of the common skills found in the related CVs, which were generally presented as IT competencies.

In conclusion, one reason BERT is effective in representing text data is that it groups related texts more closely. In general, hybrid deep learning models combined with text embedding methods such as BERT produce better results than single deep learning models. Finally, this study shows promising results for the automated classification and recommendation of job applications based on CV content. With the development of real-time automated classification systems, different entities can make student recruitment processes more efficient by using relevant information such as CVs.

3.4.5. Data Loading

In the experiment, the dataset was divided into two parts: training data (80% of the dataset) and testing data (20%). The training and testing datasets are loaded into the models using the data loader function, where the batch size and number of epochs are initialized. In our case, multi-process mode is used to iterate over the dataset, with 15 epochs and a batch size of 32 (Figure 5).

Process: classification based on different models
• Input: the resumes of students
• Output: a model trained on the CVs of students and one of the five pre-defined classes for each resume in the test dataset
1. Import dataset file (CVs.csv) into pandas data frame.
2. Pre-process data (cleaning and deleting noisy data).
3. Generate one hot encoding for each class representing the field of study.
4. Split dataset into two parts, training and testing dataset, with the ratio 80:20, respectively.
   a. Tokenization step based on either the BERT-obtained model or the Gensim embedding approach, where the tokenization was based on the unigram mode.
5. Add new token-related competencies and unknown vocabulary into the vocab.txt of the BERT models.
6. Create an embedding matrix for every word in the vocabulary.
7. Build a simple model or hybrid model based on a combination of CNN, LSTM, and GRU.
8. Dropout layer (0.2).
9. Dense (5 classes) layer with Softmax activation function.
10. Train the model on the training set.
11. Evaluate the model on the test set.
Note: The fifth step is omitted for the Gensim algorithm.
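Steps 3 and 4 of the process, together with the batching described for data loading, can be sketched in plain Python. This is a minimal sketch under stated assumptions rather than the authors' pipeline: the helper names (`one_hot`, `train_test_split`, `batches`), the fixed shuffle seed, and the placeholder resume texts are all hypothetical. Only the five class labels, the 867-resume count, the 80:20 ratio, and the batch size of 32 come from the text.

```python
import random

# Class labels as listed for the five specialties.
CLASSES = ["ELE", "NST", "CE", "AutoMec", "Indus"]


def one_hot(label, classes=CLASSES):
    """Step 3: encode a field of study as a one-hot vector."""
    return [1 if c == label else 0 for c in classes]


def train_test_split(samples, test_ratio=0.2, seed=42):
    """Step 4: shuffle a copy, then split 80:20 (seed is an assumption)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def batches(samples, batch_size=32):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Placeholder records standing in for the 867 resumes (texts are made up).
records = [(f"cv text {i}", CLASSES[i % len(CLASSES)]) for i in range(867)]
encoded = [(text, one_hot(label)) for text, label in records]

train_set, test_set = train_test_split(encoded)
train_batches = list(batches(train_set))
```

With 867 resumes, this split gives 694 training and 173 testing samples, so one epoch over the training set takes 22 batches of at most 32 resumes each.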

Figure 5. The overall process of the experiment.


Figure 6. (a) The loss function of training and validation of resumes dataset; (b) the accuracy of training and validation of resumes dataset for the model CNN-GRU/BERT with high accuracy.


Figure 7. Confusion matrix for CV classification using the CNN-GRU/BERT model. The misclassification of Computer Engineering as other classes can be explained by the proximity of these categories in their use of certain skills, in particular hard skills related to computer technologies.

Figure 8. The t-SNE visualization of text representation of resumes using BERT and Gensim embedding methods.


Table 1. Predicting employability models using different approaches of machine learning.

Table 2. Information and characteristics of respondents (sample = 867 students).

Table 3. File types that make up the dataset of student resumes.

Table 4. Evaluation neural network approaches based on BERT/Gensim.