PAKES: A Reinforcement Learning-Based Personalized Adaptability Knowledge Extraction Strategy for Adaptive Learning Systems

Advancements in adaptive educational technologies, specifically adaptive learning systems, have made it possible to automatically optimize the sequencing of pedagogical instructions according to the needs of individual learners. The crux of such systems lies in the instructional sequencing policy, which recommends personalized learning material based on the learner's learning experiences to maximize learning outcomes. However, the limited information available about learners, such as their cognitive and affective states and their competence levels on ongoing knowledge points, poses critical challenges to optimizing individual-specific pedagogical instructions in real time. Moreover, designing such a decision policy for every learner with a unique knowledge profile demands a trade-off between the learner's current knowledge and their curiosity to learn the next knowledge point. To address these challenges, this paper proposes a personalized adaptability knowledge extraction strategy (PAKES) using cognitive diagnosis and reinforcement learning (RL). We apply a general diagnostic model to track the current knowledge state of the learners. Subsequently, an RL-based Q-learning algorithm is employed to recommend optimal pedagogical instructions for individuals to meet their learning objectives while maintaining equilibrium between learner control and teaching trajectories. The results indicate that the learning analytics of the proposed framework deliver optimal pedagogical paths for the learners based on their learning profiles. A 62% learning progress score was achieved with the pedagogical paths recommended by PAKES, a 20% improvement over the baseline model.


I. INTRODUCTION
In recent years, educational technologies have gained considerable attention by transforming traditional classroom settings into online learning environments such as Khan Academy, massive open online courses, and Knewton. The main advantage of these learning systems over classroom learning is the adaptive guidance provided throughout the learners' learning trajectories. Adaptive learning refers to assisting learners with personalized instructional sequencing to maximize learning
progress. A learning system that provides such adaptiveness, allowing learners to learn according to their learning needs, is called an adaptive learning system (ALS). Implemented through adaptive algorithms, ALSs deliver personalized instructional sequencing tailored to individual learning attributes [1]. Contrary to the traditional ''one-size-fits-all'' teaching methodology, in which a single tutor teaches hundreds of students in a class with the same pedagogical action [2], [3], the ALS focuses on providing a personalized learning experience at the individual level. Currently, ALSs are employed in multiple educational institutes worldwide to train students. For instance, Cognitive Tutor mathematics courses [2] have been used to teach around 60,000 students annually, and a similar system called ASSISTment [3] has been employed to train 6,000 students annually at various schools. ALSs use academic information metrics, such as proficiency level, learning time, resulting grades, and response time, to uncover hidden patterns about the learner and recommend productive learning activities that optimize learning outcomes [4]. The core component of an ALS is a pedagogical strategy with various kinds of tracking models to represent the hidden knowledge state of the learner and to deliver new learning actions effectively. Pedagogical strategies are instructional sequencing policies that decide what to recommend in the next step based on the information currently available about both the learners and the learning material. In recent literature, a variety of approaches have been introduced to design these pedagogical strategies for ALSs, employing rule-based [5], machine learning (ML)-based [6], and reinforcement learning (RL)-based methods [7], [8]. In a rule-based approach, a domain expert defines a teaching sequence and corresponding learning actions, and the learner acquires knowledge by following them. The drawbacks of such an approach are its high cost and domain dependence. ML models for ALSs have proven to be a significant development for handling many students with diverse needs in a dynamic environment [9], [10]. ML models can deliver personalized learning actions to learners by analyzing their learning experiences. However, ML algorithms such as the recurrent neural network and long short-term memory have limitations when designing and implementing strategies for ALSs, as they require an enormous amount of knowledge records and incur high computational overhead. Significant advances in RL models play a vital role in the development of intelligent pedagogical strategies because RL can optimize the learning gains of students within an uncertain environment [11]-[13]. However, it remains an open issue to empower ALSs to make recommendations while giving learners the right to explore and review their knowledge structure during the learning session. A further open question is how to optimize the learning sequence with limited knowledge about the learner's cognitive and affective states while estimating the learner's ongoing competence level on a specific knowledge point.
One of the emerging research interests in RL-aided ALSs is to incorporate psychological theories along with RL models to tackle the existing challenges [14], [15]. ALSs with such hybrid approaches comprise various modules for adaptive teaching, such as the assessment model, the learning model, and the student model. The task of the assessment model is to question the learners and evaluate their performance to predict their knowledge levels. The learning model receives predictions from the assessment model and develops a relationship between the learning material and the assessment results. Thereafter, the learning material is delivered to the learners in accordance with their learning abilities to optimize learning engagement. Finally, the student model is updated to reflect the current knowledge levels of the learners. A learning model that makes task recommendations possible for the best learning path was proposed in [16].
To enhance the learning outcomes when recommending the next activity, the learning model integrates the learning material with the learners' proficiency levels. Similarly, the studies in [17] and [18] on competency-based recommendation systems cover hybrid approaches that combine data-driven RL models with psychometric assessment models. Such hybrid approaches can efficiently track the learner's knowledge (performance, experience, and degree of achievement) and deliver the best learning items. In these recommendation systems, the assessment model deduces the latent knowledge states of the learners, and the RL models play an essential role in making the best decision for personalized task selection. However, the problems of designing an adaptive strategy that handles the dynamic nature of the environment and of customizing the learning path to the learner's performance remain open. Numerous studies on the potential of ALSs highlight that a robust relationship between student learning and the learning material is required to develop an optimal pedagogical strategy [19]. A critical challenge for ALSs is therefore to design an optimal strategy that tracks the learners' ongoing knowledge proficiency in an online fashion and recommends learning material according to the learner's psychological needs, while operating with limited cognitive and affective state information.
In this paper, a personalized adaptability knowledge extraction strategy (PAKES) is proposed for adaptive assessment and personalization in ALSs to address the aforementioned open research issues. The term adaptability in learning analytics and educational systems refers to providing learners with a choice to personalize their learning materials according to their learning styles, leading toward more learner control [20]. We summarize the contributions of this paper as follows:
• The proposed framework detects the learners' effective knowledge states and recommends suitable learning materials to achieve intrinsic cognitive learning improvements. It incorporates cognitive psychology and RL-induced policies for optimal learning path recommendations using the Markov decision process (MDP).
• For the proposed system, a general cognitive model is used to predict the hidden knowledge states of the learners, providing ongoing personalized estimates during learning sessions to enhance learning progress even when only restricted information is available.
• To deliver the best instructional sequencing corresponding to the learners' exclusive knowledge attributes estimation, an RL-based Q-learning algorithm is employed.
• To improve the learners' engagement and effective education, the adaptability concept has been utilized to achieve a balance between learner control and system teaching control.
• Experiments reveal that the proposed PAKES system outperforms traditional approaches by enabling each learner to learn at their own pace, by their own choice, and with an intrinsic desire for learning, even when a new student with limited information joins. Hence, above-average learners can learn as they desire without wasting time, while below-average learners can continuously work toward a higher level with motivation.
The rest of the paper is organized as follows: Section II presents an overview of the empirical study of the related work. The proposed PAKES algorithm, along with the systematic mechanism, is described in Section III. Section IV demonstrates the experiment analysis to investigate the performance of the proposed framework. Finally, Section V concludes this article and discusses possible future avenues.

II. RELATED RESEARCH WORK
This section presents an overview of the relevant empirical studies on ALSs in artificial intelligence in education. Recent work addressing ML techniques, including RL approaches, together with content related to cognitive assessment models, is briefly discussed.
In the literature, several approaches to designing personalized ALSs have been presented that aim to recommend learning paths that improve the overall learning outcomes of the learners. A comprehensive study on ALSs was presented in [21], where the authors classified ALSs into five clusters based on RL approaches. The authors indicate that RL-induced instructional sequencing proves successful when a combination of learning science and cognitive psychology theories is employed. This classification suggests integrating data-driven and theory-driven methods in ALSs for optimal learning instructions.
One study [22] reported an adaptive teaching algorithm for ALSs called the plain vanilla strategy. The strategy infers the mastery and nonmastery knowledge levels of the learners for suitable learning path recommendations. The plain vanilla system encodes a student's understanding of a query as a probability of success to predict whether the student has mastered specific knowledge components. The goal of the plain vanilla strategy is to optimize the objective function that estimates the entire gain of the learning process using a multi-armed bandit (MAB) framework with the Gittins index approach. Extensive research has been conducted on employing MABs for ALSs, including the contextual bandit [23] and the adversarial bandit [24]-[28]. The Gittins index is one of the most primitive methods for the MAB problem and only proves promising when no assessment error exists and the skills in the ALS are independent of each other [29]. An adaptive oracle strategy was proposed in [29], where the information of the transition model is considered unknown for the learner, as in real-world practice. In particular, an RL model is employed to design optimal pedagogical policies rather than the Gittins index. The oracle strategy uses a dynamic cognitive diagnostic model and a Q-learning algorithm to track the hidden knowledge states of the learners and to estimate the effectiveness of the learning material on knowledge and skill, respectively. However, a critical limitation of the oracle strategy is that it does not consider early stopping within the fixed time horizon, which makes it infeasible for fast learners [30]. Fast learners must therefore wait out the entire trajectory, which becomes a source of frustration. Moreover, the assessment design in the oracle strategy merely uses a single point of information to estimate the knowledge profiles of the learners.
Several novel instructional strategies for ALSs that incorporate deep RL approaches with cognitive assessment models have been proposed [30], [31]. The deep Q-network (DQN) recommendation strategy uses a deep Q-learning algorithm to design optimal learning paths. In particular, a feed-forward neural network was employed to generate an adaptive learning design by approximating the objective function values to optimize sequential policies [30]. The work in [31] further extended the DQN recommendation strategy of [30] with a policy network and introduced a new predictive model for inferring the learner's learning curiosity. The system stores student learning experiences and then employs a feed-forward neural network to predict a curiosity reward. The key issue with such a curiosity-driven recommendation strategy (CDRS) is that it requires a certain amount of student learning experience data to make accurate recommendations, which is only available after a long session of interaction with the ALS. During initialization, the lack of sufficient data in the CDRS leads to low-quality decisions. This discourages the learners during their interaction with the system, eventually leading them to abandon it [32]. Various studies demonstrate that the accuracy of latent knowledge state estimation affects the performance of the recommendation strategies [33], [34].
In ALSs, designing adequate learning material alone is insufficient for the best learning path recommendations [35]. A strong assessment model is also required to evaluate individual-specific knowledge levels and needs [36], [37]. An excellent assessment model serves as an integral part of the ALS by inferring the hidden learning capabilities of the students within specific knowledge points. Therefore, it is vital to develop an active instructional policy for ALSs that can accommodate the learners' needs on knowledge points over time. The main challenge in tracking the knowledge competencies of the learners is that a predictive model requires the right amount of information to assess their potential knowledge states. Most currently available ALSs rely on large datasets containing thousands of learners, which makes them challenging to apply in settings where only restricted information is obtainable.
To address this problem, we propose a pedagogical strategy called PAKES that employs a general diagnostic model online to make precise predictions within a finite time and with limited information. After conducting an assessment, it captures the below-average knowledge competencies of the learners and recommends the best learning material to promote overall progress during the learning process. The goal of PAKES is to enable ALSs to remain robust in any event, even when little information is accessible.

III. PROPOSED FRAMEWORK

A. SYSTEM MODEL
PAKES aims to assess the multiple competence levels of the learners in specific knowledge components and to provide personalized pedagogical instructions to individual learners. To achieve this goal, PAKES deals with three significant challenges: (1) the knowledge attributes of the learners change over time as they progress, (2) the same evaluation method is not applicable to each learner under study, and (3) the essential information about the learner under study must be captured from binary (yes or no) responses. This section presents the proposed strategy to optimize learning schedules by quantifying the latent attributes of the learners. The proposed framework integrates an online learning approach and a cognitive measurement model with adaptive psychometric tests to address these challenges.
In an ALS, we assume there is a learning course with $K$ knowledge components and a latent concept knowledge state $s(t)$ of the learner. The set of possible states based on the $K$ knowledge components can be given as $S = \{s_1, s_2, \ldots, s_{2^K}\}$. The hidden knowledge states of the learner are not directly observable, whereas the learner's performance corresponding to those states is observable. Therefore, a knowledge state of the learner is represented as a sequence of competence levels $C_{s(t)} = (c_1, \ldots, c_{n_{s(t)}})$, where $n_{s(t)}$ indicates the total number of competencies involved in this state. These competencies represent the set of skills and proficiency levels of the learner. With this central fact, the first step is to estimate the knowledge state by modeling the degree of proficiency of the learners. The proposed system conducts adaptive testing to uncover the knowledge states and capture the necessary knowledge concepts. A general multidimensional adaptive (GenMA) model [38] is used as an assessment model to reveal the learning states of the learner by forecasting learner performance. This assessment model uses task difficulty together with the required competencies in the corresponding knowledge components as an adaptive test. During adaptive testing, if the learner answers a question correctly, the corresponding competence entry is set to 1; otherwise, it is 0. The adaptive assessment module returns a list of passed and failed competencies $\hat{C}_{s(t)} = \{C^P_{s(t)}, C^F_{s(t)}\}$, which represents the estimated ability levels of the learner with respect to the corresponding knowledge state. Here, $C^P_{s(t)}$ refers to the total passed competencies, and $C^F_{s(t)}$ represents the failed competencies of the learner. Table 1 presents a summary of the notation used in this paper. Fig. 1 presents an overview of the learning path recommendation process of the proposed system for adaptive learning. The learner moves into the next succeeding state after learning the recommended material proposed through personalized learning actions (PLAs), which is depicted as follows:

$$s_i(t) \xrightarrow{\text{PLA}} s_j(t+1)$$

where $s_i(t)$ refers to the learner's $i$th current state at time $t$, and $s_j(t+1)$ indicates the $j$th next latent knowledge state. Based on the assessment results of the GenMA model, the proposed system ranks the competencies failed by the learner, starting with the most-failed competence on top. The adaptability recommendation strategy takes a suitable action from the action set to recommend the learning material corresponding to the current understanding of the individual learner.
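For concreteness, a minimal Python sketch of this bookkeeping is given below; the class and function names are hypothetical and serve only to illustrate how a state is split into passed and failed competencies and how the failed ones are ranked, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class KnowledgeState:
    """Binary mastery vector over the K knowledge competencies
    (c_b = 1 if competence b is mastered, else 0)."""
    mastery: list

    def split_competencies(self):
        """Return (passed, failed) competence indices, i.e., C^P and C^F."""
        passed = [b for b, c in enumerate(self.mastery) if c == 1]
        failed = [b for b, c in enumerate(self.mastery) if c == 0]
        return passed, failed

def rank_failed_competencies(failed, fail_counts):
    """Order failed competencies with the most-failed one on top,
    mirroring the ranking step described above."""
    return sorted(failed, key=lambda b: fail_counts.get(b, 0), reverse=True)

# Example: a 16-competence profile where competencies 0 and 3 are mastered.
state = KnowledgeState(mastery=[1, 0, 0, 1] + [0] * 12)
passed, failed = state.split_competencies()
ranked = rank_failed_competencies(failed, fail_counts={1: 3, 2: 5})
```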
In this paper, we formalize the adaptability recommendation learning process as an MDP [39], where the adaptability recommendation strategy works as an intelligent agent. The adaptability agent interacts with the learners in a sequential decision-making procedure to optimize the pedagogical strategy. According to the MDP, the successor state $s_j(t+1)$ of the learner depends only on the current state $s_i(t)$ rather than the entire prior learning history $\{s_1, s_2, s_3, \ldots, s_{i-1}\}$. In this stochastic process, when a learner interacts with the system during learning episodes, the learning records of the learner are generated as a sequence of estimated competencies, actions, and rewards $\{\hat{C}_1, a(1), r(1), \hat{C}_2, a(2), r(2), \ldots, \hat{C}_{T}, a(T), r(T)\}$. In each episode, the proposed system takes action $a(t)$ to recommend the learning materials corresponding to the instructional policy $\pi(a_t \mid s_t) = P[A = a_t \mid S = s_t]$. The learners select the material of their own choice to master a specific task, and the agent receives a reward $r(t)$.

FIGURE 1. Framework for the personalized adaptability-recommendation system: (1) the adaptive assessment model GenMA estimates the knowledge state $s_i(t)$ by modeling the learner's degree of competencies as $\hat{C}_{s(t)}$; (2) following the policy $\pi$, the system recommends learning material by taking action $a(t)$ from the personalized learning action space; and (3) after receiving reward $r(t)$ and taking learning actions, the system transits the learner into the next state $s_j(t+1)$.
The goal of the agent is to estimate the value $V^{\pi}(s)$ of the recommendation under the unknown state transition model for individual learners and then maximize it using the optimal value $V^{\pi^*}(s) = \max_{\pi} V^{\pi}(s)$ [40], [41]. To accomplish this, we integrate the GenMA model with the Q-learning algorithm to predict the success rate and maximize the learning gain over time in an uncertain environment. Moreover, the proposed strategy works in an online fashion and does not depend on students' previous learning experience data to represent knowledge proficiency. Overall, the goal of the proposed system is to recommend the optimal personalized action to the learners, given the current estimated knowledge state, to increase their learning outcomes on the following adaptive test. The adaptability recommendation strategy aims to empower the learners to improve their knowledge competencies while using the ALS.

B. Q-LEARNING TECHNIQUE
For a long time, the Q-learning algorithm and its variants have been proven to be robust RL approaches for solving MDP problems in the ML community [42]-[46]. In addition, Q-learning is an off-policy technique, applicable to any scheme running in the MDP framework. In the Q-learning method, a parametric mechanism is used to approximate the Q-function of the current control strategy. Then, the equivalence principle is applied to improve the procedure, from which the optimal policy is selected for the action network. The Q-learning algorithm approximates the value function as follows:

$$Q^{N}(s_t, a_t) = Q^{o}(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q^{o}(s_t, a_t) \right] \tag{2}$$

where $Q^{o}(s_t, a_t)$ is the old value and $Q^{N}(s_t, a_t)$ is the newly calculated value in the Q-table. Moreover, $\alpha$ ($0 < \alpha \leq 1$) is a nonnegative step-size coefficient called the learning rate, $r_t$ is the immediate reward after applying action $a_t$ in state $s_t$, and $\gamma$ is the discount factor used to measure the significance of the future reward. Finally, $\max_{a} Q(s_{t+1}, a)$ is used to estimate the optimal value for the successor state. The Q-table is similar to a lookup table, where rows represent states and columns indicate actions. It stores the state-action pair values, called Q-values, which are updated during training.
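As an illustration, the following minimal Python sketch implements the tabular update of (2); the toy state and action sizes are placeholders, and the defaults mirror the parameter values adopted later in the paper (α = 0.9, γ = 0).

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.9, gamma=0.0):
    """One tabular Q-learning update following Eq. (2):
    Q_N(s,a) = Q_o(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q_o(s,a)]."""
    td_target = r + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s, a]   # this difference is Delta Q in Eq. (5)
    Q[s, a] += alpha * td_error
    return td_error

# Example: a toy Q-table with 16 states and |A| = 20 learning materials.
num_states, num_actions = 16, 20
Q = np.zeros((num_states, num_actions))
q_update(Q, s=0, a=5, r=0.1, s_next=1)
```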

C. GENERAL MULTIDIMENSIONAL ADAPTIVE ASSESSMENT MODEL
The GenMA model is a hybrid adaptive assessment model that incorporates the general diagnostic model [47] and the multidimensional item response theory model [48], [49]. In [38], the authors proposed GenMA by using a general diagnostic model for partial credit data, which is defined as follows:

$$\Pr(x_{ij} = 1) = \Phi\left(\beta_i + \sum_{b=1}^{B} q_{jb}\,(\theta_{ib} - d_{jb})\right) \tag{1}$$

where $\Pr(x_{ij})$ is the probability of the proficiency of learner $i$, $\Phi$ is the logistic function, $\beta_i$ represents the aptitude of learner $i$, and $B$ is the total number of knowledge competencies included in the adaptive test. Moreover, $\theta_{ib}$ is the ability of learner $i$ for the required competence $b$, and $q_{jb}$ is the cell $(j, b)$ entry in the Q-matrix. Table 2 presents the Q-matrix and describes the knowledge competencies corresponding to Tatsuoka's fraction subtraction dataset [50] and the hierarchical knowledge map of the leading online learning platform, Khan Academy. The competencies involved in the pedagogical testing are represented by TRUE, whereas FALSE indicates their absence. In this paper, an expert-defined Q-matrix, extended with additional skills (C = 16), was used, as depicted in Table 2. The parametric value $d_{jb}$ is the difficulty of test question $j$ for competence $b$, calibrated using historical data and the Metropolis-Hastings Robbins-Monro (MH-RM) algorithm [51]. A comprehensive overview of the GenMA model at the parametric level is given in Table 3.

Fig. 2 presents an overview of the systematic implementation of adaptive test generation and the evaluation of learner performance using the GenMA model. Furthermore, to cope with the learner's failed competencies and to provide choice control over different learning activities, the adaptive test is updated based on present performance. In the proposed mechanism, the learners go through an adaptive benchmark assessment, where their responses to questions are encoded as a success rate. Each item in the adaptive test belongs to a specific knowledge component and predicts the learners' knowledge competencies for the concepts they endeavor to master during the learning sessions. The assessment module selects the next question using several types of parametric input information, such as item difficulty, knowledge components, and the learner's knowledge attributes. Afterward, based on the achievements, the system passes the learner to a pedagogical decision-process module that recommends learning resources. Based on the ongoing evaluation, the system identifies the most unsatisfactory skill for each learner. To improve the overall learning experience, the system then suggests learning material that corresponds to the learner's understanding, needs, and style. The Q-table is updated according to the knowledge state of the learner, which represents the individual learning characteristics. The core components of the proposed system are the GenMA model and the adaptability recommendation module. Contrary to ordinary adaptive tests, in the GenMA model the maximum number of tasks asked during the test is defined in advance to avoid frustrating the learners. In PAKES, a total of eight questions (j = 8) were asked during adaptive testing using the GenMA model to obtain interactive information for the prediction rate. The pseudocode of the PAKES framework for the selection and recommendation of the optimal material is depicted in Algorithm 1.
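Before walking through Algorithm 1, a minimal numerical sketch of the GenMA success probability may help; it assumes the parameterization reconstructed in (1), and all function and variable names are illustrative rather than taken from the authors' (R-based) implementation.

```python
import numpy as np

def genma_success_probability(beta_i, theta_i, q_j, d_j):
    """Success probability of learner i on item j under the form
    assumed in Eq. (1).

    beta_i : scalar aptitude of learner i
    theta_i: (B,) ability per competence
    q_j    : (B,) binary Q-matrix row for item j (TRUE/FALSE as 1/0)
    d_j    : (B,) per-competence difficulty of item j
    """
    logit = beta_i + np.sum(q_j * (theta_i - d_j))
    return 1.0 / (1.0 + np.exp(-logit))  # logistic function Phi

# Example: an item testing competencies 0 and 2 out of B = 16.
B = 16
q_j = np.zeros(B)
q_j[[0, 2]] = 1.0
p = genma_success_probability(beta_i=0.3, theta_i=np.full(B, 0.5),
                              q_j=q_j, d_j=np.full(B, 0.4))
```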

Algorithm 1 Q-Learning Algorithm for PAKES Mechanism
Input: knowledge profile state s(t), action set A, learning rate α, discount factor γ, exploration probability ε
Output: optimal Q-function value
Initialize weights W_a(t)
Initialize the Q-table with zeros
while learning do
  for t = 0, ..., T − 1 do
    Estimate competencies Ĉ_s(t) = {C^P_s(t), C^F_s(t)} using the assessment model GenMA
    Select action a_t ← argmax_a Q(s_t, a) with probability 1 − ε; otherwise explore using the system estimation
    Recommend learning material a_t according to the estimated failed competencies C^F_s(t)
    Learners select material a_t as per their desire: adaptability selection
    Transit into the next state; the qualifying failed competencies become the adaptability reward r(t) = L^{s(t)}_{a(t)}
    Update the Q-table using (2)
  end for
end while
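A runnable sketch of the loop in Algorithm 1 is given below; `assess`, `recommend_pool`, and `transition` are hypothetical stand-ins for the GenMA assessment, the material-pool lookup, and the learner transition model, so the sketch shows the control flow rather than the authors' implementation.

```python
import numpy as np

def pakes_episode(Q, assess, recommend_pool, transition, T=40,
                  alpha=0.9, gamma=0.0, epsilon=0.1, rng=None):
    """One learning episode of Algorithm 1 (sketch).

    assess(s)            -> (passed, failed) competence index lists
    recommend_pool(s, f) -> candidate actions covering failed competencies
    transition(s, a)     -> (s_next, r) from the environment/learner
    """
    rng = rng or np.random.default_rng()
    s = 0  # initial knowledge-profile state
    for _ in range(T):
        _passed, failed = assess(s)               # Ĉ = {C^P, C^F}
        actions = recommend_pool(s, failed)       # materials for C^F
        if rng.random() < epsilon:                # explore
            a = int(rng.choice(actions))
        else:                                     # exploit argmax_a Q
            a = max(actions, key=lambda act: Q[s, act])
        s_next, r = transition(s, a)              # learner studies material
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
    return Q
```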

IV. EXPERIMENT ANALYSIS

A. SIMULATION SETUP
We developed our simulation environment in the Python and R programming languages on a computer equipped with an Intel Core i7 processor (3.6 GHz), 16 GB of memory, and a 64-bit Ubuntu 18.04.5 operating system. The GenMA model was developed using the R-based mirt package, the actor-critic algorithm was built using the PyTorch framework, and the Q-learning algorithm was built in Python. As mentioned in Section III-C, we utilized Tatsuoka's fraction subtraction dataset [50] with some additions to the skill set. The dataset includes dichotomous responses of 536 school students over 20 questions, and 5-fold cross-validation was performed as suggested by the authors in [51]. Table 4 lists the parameters and their corresponding values used in the simulation environment.

B. BASELINE MODEL
A robust adaptive learning model [31] was used as a baseline to compare the effectiveness of the learning path recommendations made by PAKES. In [31], the authors proposed a CDRS using an RL technique with cognitive diagnostic models. The CDRS follows the three-parameter logistic (3PL) model from multidimensional item response theory to measure the knowledge state of the learner at the initial stage. In the later stage, it applies a data-driven RL actor-critic model [52] to recommend personalized learning material. For the CDRS, a predictive model is developed using a feed-forward neural network to return a prediction error. This error serves as a curiosity reward of the learner for the corresponding knowledge state. Moreover, this curiosity reward is used by the RL model to approximate the objective function and design an optimal recommendation strategy. After a certain number of iterations over the learner's interaction trajectories, the CDRS tracks the learner's knowledge attributes and recommends adaptive material to the individual.

C. EVALUATION METRICS
A simple final exam prototype was employed to measure the efficacy of the proposed framework for the ALS. The evaluation setup adopted in this work proceeds identically to the one defined in [31]. The simple promotion of knowledge competencies is employed as an evaluation metric to compare the learning path recommendations; a similar evaluation strategy is employed by several educational institutes to evaluate student performance [53], [54]. The performance metric to evaluate the learning rate of the students is given as follows:

$$L_{ps} = \frac{|C^P_{s(T)}|}{n_{s(T)}} \times 100\% \tag{3}$$

where $s(T)$ refers to the final state of the learner after interaction with the ALS, and $L_{ps}$ indicates the learning progress score of the learner in this state. The $L_{ps}$ is employed to quantify the performance gains offered by PAKES over the baseline model. The effectiveness of the proposed framework is elaborated using a combination of empirical graphs and parametric analysis.
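As a concrete illustration, the sketch below computes $L_{ps}$ from a final binary mastery vector, assuming the fraction-of-passed-competencies reading of (3); the function name is illustrative.

```python
def learning_progress_score(final_state):
    """Learning progress score L_ps of Eq. (3): the share of competencies
    mastered in the final state s(T), expressed as a percentage.
    Assumes binary mastery entries, as in the paper's state vectors."""
    return 100.0 * sum(final_state) / len(final_state)

# Example: 10 of 16 competencies mastered after the final adaptive test.
s_T = [1] * 10 + [0] * 6
print(learning_progress_score(s_T))  # 62.5
```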

1) PARAMETER EVALUATION
It is important to discuss the role of the transition model employed in the proposed framework, which specifies the learning progress of the learners through their learning experiences during each episode. Learning materials that correspond to the action space and their respective weights (W a(t) ) for all knowledge competencies are defined in Table 5.
The transition model employed in this work follows the guidelines described in [31], with a slight variation in the evaluation parameters. The transition model considers the current knowledge state $s(t)$ and the probability of success to predict the next knowledge state $s(t+1)$. The probability of success is defined as the learner's ability to acquire the next knowledge state. The proposed framework maximizes the probability of success for each learner by identifying which failed competencies the learner should address first and taking the corresponding adaptability action $a(t)$. The output of the transition model is written as follows:

$$s(t+1) = s(t) + \xi \, L^{s(t)}_{a(t)} \tag{4}$$

where $\xi \sim \chi^2_m$ denotes a chi-squared random variable with $m$ degrees of freedom. The value of $m$ specifies the type of learners with different learning profiles and is used in various transition models for individual learners [30], [31]. In this work, $m = 2$ was employed. In addition, $L^{s(t)}_{a(t)}$ indicates the acquired failed competencies of the learner at state $s(t)$ after the learning adaptability recommendation $a(t)$. From (4), it can be inferred that, at the initial stage, the learners feel comfortable and make quick progress, as their knowledge state is low and they are assigned easy initial tasks. However, it becomes more challenging for the learners to progress in the later stage, as their learning proficiency increases and they struggle to show significant improvement in learning outcomes [55]. When the learner transits from one knowledge state to the next, the progress in learning proficiency is indicated by $\Delta LP$.
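A brief simulation sketch of this transition is given below; it assumes the reconstructed form of (4) together with binary competence vectors, so it illustrates the mechanics (chi-squared progress scaling and the resulting $\Delta LP$) rather than the authors' exact model, and all names are illustrative.

```python
import numpy as np

def simulate_transition(state, acquired, m=2, rng=None):
    """Sample one knowledge-state transition under the assumed Eq. (4):
    progress on the acquired failed competencies L^{s(t)}_{a(t)} is
    scaled by a chi-squared draw xi ~ chi^2_m and clipped to [0, 1]."""
    rng = rng or np.random.default_rng()
    xi = rng.chisquare(df=m)
    state = np.asarray(state, dtype=float)
    next_state = np.clip(state + xi * np.asarray(acquired, dtype=float), 0.0, 1.0)
    delta_lp = float(next_state.sum() - state.sum())  # progress Delta LP
    return next_state, delta_lp
```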
To show the efficacy of the proposed framework, we consider a learner with an initial knowledge state $s(1) = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)$ along with a sequence of $n_{s(1)} = 16$ competencies, as estimated by the GenMA model. The learners gain knowledge of their failed and unsatisfactory competencies within finite time episodes (T = 40). The Q-matrix depicted in Table 2 is employed to instruct the learners based on their personalized educational proficiency. An action space A with a total of 20 actions, constituting 20 learning materials, was created to train the learners. Among the 20 learning materials, four were linked with one correlated knowledge competence, six with two competencies, six with three competencies, and the remaining four with four competencies. The learning actions with their knowledge competencies and corresponding k-dimensional training weights ($W_{a(t)}$) are depicted in Table 5. The episodic learning progress of the learners after interacting with the proposed framework is calculated using (4). The proposed PAKES analyzes the learners' educational experiences, predicts the competency state $\hat{C}_{s(t)}$ by employing the GenMA model with test questions (j = 8), and delivers the adaptability actions.
We study the performance of the proposed framework under different parameter settings. Two weighted tuning parameters (α, γ) were used to adjust the learning of the recommending agent in the Q-learning algorithm, as defined in (2). The values of α and γ, lying in the range 0.0 to 1.0, are described for the proposed framework in terms of learning progress. The parameter α governs how the policies recommended by the proposed framework are learned. The value of the learning rate α depends on the learner's progress in knowledge attributes from the learning activities. It determines how much of the change in the Q-value between states ($s_{t+1} - s_t$) is accepted. A lower learning rate with a larger change in the Q-value indicates an easy task for a learner to progress. A learning rate of α = 0 indicates no learning for the recommending agent, whereas α > 0 implies improvement in the recommendation policy. In practice, learning rate values from 0.1 to 0.9 are adopted. The value α = 1 directs the recommending agent to ignore prior knowledge and consider only the current information, meaning no use is made of prior knowledge. The proposed framework was simulated by varying the learning rate over the conventionally used values of 0.7 to 0.9. To show the effect of a lower learning rate, α = 0.1 was also investigated. The parameter γ indicates the importance of the future reward for the learner. For the proposed framework, γ = 0 was employed, which designates that the reward is based only on the ongoing performance of the learners.
The performance of the proposed PAKES in terms of episodic learning is depicted in Fig. 3. The system performance is evaluated using the above-mentioned parameters. To analyze the effect of variation in α and in the number of training epochs on system performance, the proposed framework was simulated for α = 0.1 and 0.7 to 0.9, and for 1 to 1000 training epochs. The trend with increasing α and increasing training epochs can be clearly observed in the episodic learning progress of the learner, where the learning progress increases from 0 to 0.1 as the number of episodes T increases from 1 to 40. The episodic learning of the proposed system for α = 0.1 to 0.9 is presented in Fig. 3(a)-(d), respectively. In Fig. 3(a), with α = 0.1, the value of the learning progress ΔLP increases with the number of episodes. In addition, an increase in the number of training epochs results in an increased value of learning progress ΔLP after completion of the 40 episodes. However, with a lower α value, the system is prone to errors, as the proposed framework appears to make random learning recommendations. In particular, from episode T = 3 to 40, the learning path recommended at 20 epochs exhibits a higher learning progress ΔLP value than that at 100 epochs. This trend is regarded as an error, because more training epochs should result in a better learning rate for the learners. A similar trend can be observed for T = 3 to 15 and at T = 29, where ΔLP is higher at 20 epochs than at 1000 epochs. For higher values of α = 0.7 to 0.9, the results indicate that the learning path recommendations of the proposed framework suggest the best learning activities to the learners and tend to push them toward their learning targets. Fig. 3(b) and 3(c), with α = 0.7 and 0.8, illustrate the improvement in the recommendation policies at different epochs. In particular, the proposed framework after training for 20, 100, and 1000 epochs shows less randomness and fast convergence. However, a similar error trend remains dominant, especially in the results for 20 and 100 epochs, where the lower epoch number exhibits better learning progress.
To observe the effect of an increased learning rate on the error, Fig. 3(d) illustrates the episodic learning progress of the learners at α = 0.9. The analysis depicted in Fig. 3(d) indicates that, with a higher value of the learning rate α, the proposed framework fine-tunes its policies for the adaptability of the learning path recommendations. This analysis demonstrates that, at the very beginning, the proposed framework attempts to learn the hidden attributes of the learner's knowledge state in order to recommend an optimal learning path; this is the fundamental reason why, at one epoch, the proposed framework does not recommend the best learning path. The proposed framework performs better with ΔLP = 0.04 at episode T = 30 for one epoch compared to the results with other learning rate values, such as ΔLP = 0.03 at episode T = 30. Fig. 3(d) shows that the proposed framework with α = 0.9 exhibits the minimum error compared to the results in Fig. 3(b) and Fig. 3(c) with α = 0.7 and 0.8. In summary, the results demonstrate that the proposed framework makes the best learning path recommendations with a higher learning rate of α = 0.9 and 1000 training epochs. These results also illustrate that, during a long learning session, the proposed framework allows the learner to explore and review more learning concepts for overall learning progress.

2) LEARNING CONVERGENCE
In this section, we analyze the temporal difference (TD) error made by the proposed framework during its interaction with the learners over time to study convergence and to further confirm the chosen learning rate α. The target is to achieve good convergence with a lower number of training epochs so that the computational overhead during the online phase can be reduced. The TD error follows from (2) and is computed as follows:

$$\Delta Q = \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) \right] - Q^{o}(s_t, a_t) \tag{5}$$

where $r_t + \gamma \max_{a} Q(s_{t+1}, a)$ indicates the TD target of the newly calculated Q-value, whereas $Q^{o}(s_t, a_t)$ is the previously recorded Q-value for state $s_t$ in the Q-table. Moreover, $\Delta Q$ denotes the change in the Q-value and is estimated by subtracting the previous value from the target value.
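For reference, the TD error of (5) can be computed directly from the Q-table, as in the short sketch below (names illustrative); logging this value per episode yields the kind of convergence curves analyzed next.

```python
import numpy as np

def td_error(Q, s, a, r, s_next, gamma=0.0):
    """Delta Q of Eq. (5): the TD target minus the previously stored
    Q-value Q_o(s, a). Tracking its magnitude over episodes and epochs
    gives the convergence diagnostic discussed in this section."""
    return (r + gamma * np.max(Q[s_next])) - Q[s, a]
```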
To observe the effect of the lower value α = 0.1 on the error rate, the TD error values for the proposed framework are summarized in Fig. 4(a). This analysis identifies the reasons for the lower convergence rate of the proposed framework with a lower learning rate of α = 0.1 as the number of training epochs increases. Fig. 4(a) demonstrates that, with an increase in training epochs, the TD error tends to decrease over the episodic learning progress. Higher epoch values exhibit lower TD error values compared to lower epoch values. To elaborate on this decrease in the TD error, Fig. 4(b)-(e) presents the TD error for training epoch values of 1, 20, 100, and 1000, respectively. The TD error exhibits a decreasing trend from $9 \times 10^{-2}$ to $5 \times 10^{-3}$ as the training epochs increase from 1 to 1000. However, this reduction in the TD error values is still not significant, resulting in a higher learning progress ΔLP at a lower epoch value, particularly for epoch values 20 and 100, as shown in Fig. 3(a), where the TD error has similar values. To elaborate on the better convergence of the proposed framework when a higher learning rate value of α = 0.9 is employed, the TD error is summarized in Fig. 5. Compared to Fig. 4, the results in Fig. 5 indicate a significant decrease in the TD error for higher training epoch values. Fig. 5(b)-(e) presents the TD error for training epoch values of 1, 20, 100, and 1000, respectively. Contrary to the results presented in Fig. 4(b)-(e), the results in Fig. 5(b)-(e) indicate that the TD error reduces for every training epoch value, and the change is more dominant at the higher learning rate α. The TD error reduces from $9 \times 10^{-2}$ to $7 \times 10^{-18}$ as the training epochs approach 1000. It is interesting to observe that, for every increment in training epochs, the TD error exhibits a significant decrease, which was not the case for the lower learning rate value in Fig. 4. This decrease in the TD error over the episodic learning progress justifies the learning progress result in Fig. 3(d). The higher training epoch values demonstrate better results than the lower ones. The experimental observations indicate that, with a higher learning rate of α = 0.9 and 1000 training epochs, PAKES becomes less erroneous than in the other parametric scenarios and makes qualitatively better personalized learning path recommendations to improve the overall learning outcomes.
In summary, the convergence analysis presented in Figs. 4 and 5 justifies our choice of a higher learning rate of 0.9 and 1000 training epochs for the proposed framework. With the optimized parameter values, the proposed framework exhibits better convergence within a few episodes than the conventional approaches. In the following subsection, we present the performance comparison of the proposed framework, equipped with the optimized parameters, with the traditional CDRS framework.

3) ADAPTABILITY EFFECT OF LEARNING WITH COMPARISON
Finally, we studied the effect of PAKES on the learning improvement of the learners and compared the results with the CDRS. After selecting the optimal parametric configuration (α = 0.9, 1000 epochs), the learning progress score of the learners is calculated using (3). The experimental results in Fig. 6 illustrate the relationship between the percentage learning progress score $L_{ps}$ and the number of training epochs. Learners interacting with the proposed framework achieve an $L_{ps}$ from 23.99% to 62.28% over 1 to 1000 training epochs, whereas the CDRS achieves only 18.89% to 41.39%. Moreover, the proposed framework attains $L_{ps}$ = 53.23% within 200 training epochs, whereas the highest $L_{ps}$ of the learners interacting with the CDRS is 41.39% at 1000 epochs. The comparative numerical analysis of the proposed and baseline systems is presented in Table 6. This analysis illustrates that, compared to the CDRS, the proposed framework gains in $L_{ps}$ from 5.1% to 20.89% over 1 to 1000 epochs. The maximum gain in $L_{ps}$ is 22.89%, obtained by PAKES with $L_{ps}$ = 60.13% compared to the CDRS at 600 epochs. These results demonstrate that, compared to the CDRS, the proposed framework offers optimal learning path recommendations to the learners according to their personalized educational experiences, learning desires, and understanding. It adapts to the learners' different learning requirements to enhance the overall learning gain throughout the learning process. Most notably, these graphs show that exploiting a sound diagnostic model is more productive than relying on the RL approach alone to optimize learning instructions. This is the principal reason why the proposed framework, with the GenMA model and the Q-learning algorithm, surpasses the results claimed by the CDRS with its 3PL and actor-critic algorithms. Moreover, the proposed framework also employs the concept of affective states, a growing research area in educational data mining [56], [57]. By detecting the affective knowledge states of the learners through the GenMA model, the proposed framework provides personalized instructional sequencing and allows learners to learn at their own pace. The experiment illustrates that the proposed framework not merely outperforms the CDRS but also delivers personalized adaptive learning instructions.

V. CONCLUSION AND FUTURE WORK
In this work, we investigated the effect of RL-aided ALSs on the learning performance of learners by combining data-driven and theory-driven approaches. A knowledge extraction strategy called PAKES was proposed for ALSs to recommend personalized pedagogical sequences. The proposed framework ensures an optimal balance between learner-control and system-control policies. Equipped with a GenMA model, PAKES can predict ongoing learning performance to model the latent knowledge states of learners. The proposed framework employs a Q-learning algorithm to recommend the best learning paths for learners. It can efficiently handle individual learners based on their learning curves by exploiting the adaptability methodology from learning analytics. The experimental results demonstrate that PAKES outperforms the baseline strategy (CDRS), improving the learning progress score $L_{ps}$ from 18.89% to 23.99% (a 5.1% gain) at 1 epoch and from 41.39% to 62.28% (a 20.89% gain) at 1000 epochs. Moreover, the performance gain achieved in current ALSs by employing the optimal learning paths recommended by PAKES reflects the learners' progress over the overall learning trajectory. Future extensions of this work can explore advanced methods for capturing hidden learner information and adapting RL-based policies to optimize learning outcomes in a real-time implementation within an educational system.