Integrated High-Workload Services for E-Learning

E-learning platforms are valuable tools to support and enhance the learning and teaching process during online classes. Even though these platforms are already well-known and have been widely used for a long time, transitioning to fully online activities with a large number of users can lead to an exponential growth in resource usage. Moreover, the availability of these platforms becomes critical since the learning activities cannot be interrupted, and the vital information (assignments, grades, and shared resources) cannot be lost. Thus, the system administrators must implement a scalable solution to accommodate many users and activities. Determining the most suitable architecture is a non-trivial task that must consider the various tools, services, and interconnect options that are available. This paper presents our approach to setting up an integrated high-workload e-learning environment that is used for online classes at University Politehnica of Bucharest and can serve as a reference for other universities. The e-learning platform must accommodate around 35,000 users and 11,000 courses; thus, its management is not trivial. We describe how we integrate all the services necessary for a responsive online platform and how we automate e-learning management operations such as class creation and user course enrolments. Moreover, we present our testing methodology and the deployment of the proposed architecture, as well as the results we have obtained in production, especially during the time when all teaching activities were performed using online platforms.


I. INTRODUCTION
E-learning platforms have long been used as auxiliary teaching tools. However, they have become vital in recent years. Various events may require online teaching tools to ensure a high quality of the educational process (e.g. fully online classes, or a hybrid approach with both face-to-face and online teaching activities).
Through the e-learning platforms, assignments, grades and teaching materials are shared between teachers and students, making the learning management systems (LMS) an important part of the educational infrastructure of an institute. Nevertheless, their configuration can be challenging since it must be correct, scalable and easily adaptable to changes. Entities must choose a suitable infrastructure and services architecture (from hardware to software) to accommodate numerous users and activities (e.g., universities can have thousands of users) at all times.
The associate editor coordinating the review of this manuscript and approving it for publication was Chin-Feng Lai.
VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Like many universities, University Politehnica of Bucharest (UPB) from Romania uses various services to accommodate the needs of around 30,000 students and 5,000 teachers, who are part of some 11,000 courses. This paper will focus on the following services:
• an identity management system where each person (teacher or student) affiliated with the university has a unique account to access online services;
• an e-Learning platform where users can be enrolled into courses, where class materials are stored, and where various forms of evaluation can be performed;
• a live communication platform (Microsoft Teams) that can be used to hold live courses and laboratories.
Even though the technical resources were available and used for many years in the university as auxiliary tools to enhance the educational process, transitioning to predominantly online activities brings to the surface some issues that must be addressed:
• data management scalability: the infrastructure should automatically create accounts for everyone associated with the university and the courses, and map teachers and students to specific classes based on their work or study contracts. Since many users (around 35,000 students and teachers) must be able to access the resources, manual user management is not an option. Furthermore, considering the number of courses (11,000) and students per course (from 30 up to 200), course creation and enrolment must also be automated.
• performance scalability: the platforms must handle a large number of concurrent sessions. The classes are scheduled between 08:00 and 20:00, from Monday to Friday. Since the students are divided unevenly across these intervals, we must ensure that the platform can handle all requests, even when the load is high. Moreover, the platform must be available during weekends, breaks and outside office hours since students may submit assignments and teachers may update teaching materials.
• integration between the platforms: all platforms must have compatible representations of resources (e.g. a course that appears on the e-Learning platform should also have a correspondent on the live communication platform). Furthermore, it is essential to have a pipeline that integrates all the components. The pipeline collects information from identity management databases (students, teachers, study contracts, learning plans) and then generates mappings between user accounts and user roles based on the information it gathers. For the e-learning platform, this translates to enrolling students and teachers into courses.
This paper presents the importance of an e-learning environment, by analysing the literature in this domain (Section II). It then proposes a model for an integrated and scalable e-learning environment that other schools or universities can replicate (Sections III and IV).
The proposed architecture was tested and adjusted for almost two years and is, at the time of writing, in use at University Politehnica of Bucharest. Nevertheless, when dealing with many users and resources, there are some issues that school or university system administrators may encounter (e.g. slow queries, code issues that generate misbehaviour, a large amount of data that requires backup, unresponsive services). This paper also presents the various issues that were encountered in this period and describes how to address them (Section V), giving the community several examples of how to deal with complex architectures.
The paper presents a methodology to test and validate the proposed architecture and its configuration and the results that were obtained in production (Section VI). The last section presents this paper's conclusions and plans to improve the current results.

II. LITERATURE REVIEW
This section presents a literature review of the learning management systems and an overview of the impact of sudden transitioning to online teaching in universities.

A. OVERVIEW ON LMS
A Learning Management System (LMS) [1], [2] is an e-learning tool that helps students and teaching staff to improve the educational process. It allows the teachers to publish teaching materials for better dissemination, to give and grade assignments or create quizzes for their classes.
The LMS can be self-hosted (i.e. on the institution's premises) or deployed in a public cloud infrastructure, delegating the infrastructure management to the vendor. Moreover, an LMS can be open source (i.e. the source code is publicly available and can generally be used by anyone) or closed source (i.e. source code access is limited and usually a subscription or licence must be acquired). Blackboard, Canvas, Chamilo, Google Classroom, Microsoft Teams, Moodle and Sakai are a few examples of LMSs.
Learning management systems have been used for a long time, especially in universities, and are an important part of the educational process. They are widely used for different use cases; to name a few:
• technology-related topics such as learning a programming language [3] or teaching a Mechanical Engineering Education class [4];
• non-technical topics such as learning a foreign language [5];
• medical fields [6], [7].
In some extreme circumstances (e.g. the SARS-CoV-2 virus, earthquakes, floods etc.), universities and schools from many countries may need to shift their activities online [8], [9], [10], [11], [12], [13], [15], [17]. Thus, they must rely on different learning management systems and online tools to continue the educational process. Some of the e-learning platforms that are commonly used for online classes are:
• Google Classroom, part of G Suite, provides a platform for meetings, giving assignments and quizzes, grading and sharing resources. It is free for institutions associated with the Google Workspace for Education Fundamentals program. In Georgia [8], [9], universities use Google Hangouts Meet for meetings and Google Calendar to schedule their classes.
• Blackboard Learn is a specialised e-learning platform that contains various tools for enhancing the learning process. For example, during the pandemic, the majority of Saudi universities chose Blackboard Learn as their LMS [11], [17].
• Moodle is a widely used open-source and free e-learning tool. It can be customised and has various tools and plugins that can be added. Many institutions [12], [14], [15], [16], [18] use Moodle for online classes.
• live communication tools (WhatsApp, Zoom) were used as a replacement for an e-learning platform.
A study [10] shows that the majority of schools in India prefer using WhatsApp and Zoom for communication during online classes. Table 1 shows an overview of the methods used by institutions for online classes. The table shows that the following components are necessary for e-learning:
• synchronous communication for live classes and exams;
• asynchronous communication for teaching materials, assignments, quizzes or other resources.
Table 1 highlights that even though institutes used dedicated e-learning platforms, there was also a need for a reliable tool for live communication. We observe that the standalone e-learning platforms lack synchronous communication mechanisms.

B. OPTIMISED MOODLE SETUP AND INFRASTRUCTURE
The process of setting up an LMS instance may seem easy since each LMS tool's documentation provides installation and configuration instructions. Nevertheless, the system administrators may face different challenges in adapting the infrastructure to meet each university's needs [14], [15], [18], [19], [20].
This subsection presents related work in the field and how other system administrators approached optimising their Moodle setup.
The University of Mercu Buana's Moodle platform has around 30,000 students and 1,000 lecturers. In [18], the authors propose a load balancing and clustering deployment for Moodle to handle the load when the number of students who simultaneously use the platform increases. Their solution, which minimised the performance issues of their Moodle instance, involved the use of:
• HAProxy and keepalived to ensure high availability and to act as a proxy for platform access;
• multiple database nodes that are synchronised using Galera Cluster;
• distributed Moodle instances (4 nodes for application servers).
In [19], the authors present the actions they took to increase the Moodle platform's service reliability. Their proposed solution uses:
• an HAProxy load balancer that distributes the work across several Moodle containers;
• several Moodle web server containers for redundancy;
• a GlusterFS distributed volume with multiple bricks to provide redundancy.
To test their implementation, the authors used the Apache JMeter tool.
Switching from face-to-face to online classes, the e-learning platform from the School of Economics and Business at the University of Sarajevo suffered from performance degradation [14]. The number of students that accessed it drastically increased in a short time. To increase the amount of physical resources, they moved their e-learning solutions to a cloud provider network. As an e-learning solution, they use the Moodle LMS alongside the BigBlueButton virtual classes tool.
Through their performance tests, they observed that BigBlueButton accepts fewer parallel connections than Moodle, so their approach was to distribute the BigBlueButton infrastructure. Moreover, they concluded that the database should be installed on a different server than other services and that an NFS server should be used for external storage. They used ScaleLite to load balance the workload on the BigBlueButton nodes. For a better user experience, they integrated the Moodle platform with the BigBlueButton platform through an add-on.
In [16], the authors offer a methodology for online classes implemented at the Universitat Politècnica de Catalunya during the Covid-19 pandemic.
Even before the pandemic, the university [15], [16] used Moodle as an LMS to facilitate the learning process (they have around 5,000 courses and 31,000 registered students). Thus, when the pandemic forced the university to go fully online, the Moodle platform, alongside other online learning tools such as Google Classroom, replaced the face-to-face activities for all faculties within the university.
The studies [15], [16] show a significant increase in the usage of the Moodle platform in the full-online teaching scenario as opposed to the face-to-face teaching scenario, when the LMS platform was used as an auxiliary tool for education.
It was also technically difficult [20] to ensure access for a large number of students to a Moodle activity (quizzes) when the learning activities moved fully online. Moreover, the authors wanted to improve the scalability of the Moodle Quiz Plugin so that the students are not affected during the exam period. An important aspect is that the Moodle platform architecture of the Universitat Politècnica de Catalunya (UPC) was designed to ensure scalability:
• move the database to a dedicated optimised server;
• use a load-balancing implementation to increase the number of simultaneous PHP requests. They use six front-end servers connected to the same storage through NFS;
• use a high-performance disk cabinet (cache, backup-restore functionality) exposed through NFS to the servers;
• deploy the servers in a cluster environment (i.e. use a virtual machine to run the services).
Their solution was to implement the Moodle Quiz Plugin as an external Moodle platform that follows the same architecture as the central one and dedicate it to quizzes. Their implementation does not duplicate the resources but acts as an external tool that is used exclusively for examinations.

C. DISCUSSION BASED ON THE LITERATURE
The purpose of using the e-learning platforms is to facilitate the educational process and to improve the learning and teaching experience for students and teachers respectively. Thus, these platforms must work seamlessly from an end user's point of view (without interruptions, slow response times, or errors). The system administrators aim to implement solutions that scale for many users, work without interruptions or data losses, and consequently must perform backups and satisfy various functional requirements.
Events that force a transition from face-to-face to online classes are usually unpredictable. As presented before, the transition to online classes has forced system administrators to find solutions that could scale in high workload scenarios. Thus, system administrators must optimise their e-learning platforms to accept high workloads by:
• using load balancing [14], [18] to reduce the load when there is a sudden increase in the number of simultaneous users;
• increasing the physical resources [14];
• distributing the services [19], [20] to reduce bottlenecks;
• fine-tuning service configurations [20];
• implementing dedicated tools [20] to reduce the load on critical components.
By analysing the architecture proposals, we conclude that each setup is different, depending on the preexisting infrastructure. Consequently, managing services is challenging due to the numerous issues that may arise.

III. PROPOSED ARCHITECTURE
In this section, a proposed architecture for e-learning platform management is presented. This proposal is implemented as the Learning Management System of University Politehnica of Bucharest (UPB).
Having multiple e-learning tools available, selecting the ones that suit an educational institution requires extensive analysis. As presented in Section II, both asynchronous and synchronous communication platforms are required. For the asynchronous communication platform, we chose Moodle, considering that it has already been used for many years. Moreover, it is open-source (unlike Google Classroom and Blackboard) and free (unlike Blackboard).
Because it is open-source, we can add custom or in-house plugins like other entities did [7], [16], [20]. In the previous years, students from UPB have developed plugins for Moodle or improved existing ones and helped the community. Moreover, having access to the source code can make debugging easier.
UPB is a technical university and some additional functionality that is required can be inserted through plugins. For example, some teachers rely on two Moodle plugins, Virtual Programming Lab and Code Runner, to test students' coding abilities.
For synchronous communication, Microsoft Teams was chosen since everyone in the university, both students and staff, already has an Active Directory account that can be used to obtain access to various Microsoft services. Additionally, a Moodle plugin can be used to ensure synchronisation between team memberships and Moodle course enrolments.
UPB has multiple faculties. Thus, it was necessary to decide whether each faculty would have a separate instance or whether a single university-wide instance was preferable. Based on our previous experience managing LMSs, having multiple instances increases the maintenance complexity, since each security patch, update or configuration change must be replicated on each instance. Thus, a single e-learning platform for all faculties within UPB was preferred.
As presented in Section II, the e-learning setup must be optimised for high workloads. Consequently, the proposed architecture relies on distributed components and fine-tuned configurations of the various parts, which are presented in this section and in Sections IV and V. Figure 1 presents the typical Moodle architecture that was implemented in UPB. The components of the extended Moodle architecture are the following:
• Moodle application server: runs the Moodle code. It uses a reverse proxy and PHP-FPM to handle requests from users;
• MariaDB database: contains the database used by the Moodle instance. For better performance, the database uses a dedicated server;
• Redis cache: caches information used by Moodle. For better performance, a dedicated server is used for memory caching;
• Prometheus and Grafana: monitoring tools to trace the Moodle activity. They mainly monitor the web server (NGINX), the database (MariaDB) and the PHP-FPM process;
• distributed storage: dedicated external storage for moodledata, which contains all files that are not related to Moodle code or configurations;
• LDAP server: contains data regarding students and teaching staff. It is used for login and security purposes (only specific users can access the LMS platform).
The servers (Moodle application, MariaDB, Redis) run inside virtual machines managed by Hyper-V. A few of the advantages of using virtualization are the following:
• better resource allocation: adding or removing resources such as virtual CPUs or memory is easier;
• live guest migration: a running virtual machine can be migrated to another cluster node when needed (physical system maintenance, system issues);
• snapshots: checkpoints of the virtual machines can be created to have a backup of the e-learning platform.
The components of the Moodle LMS (web application, database, cache server, LDAP server) are connected through an internal network that can only be accessed by administrators. The only server connected to the Internet (through a firewall) is the web application server, so that users can access it.
For improved connectivity between all services in UPB, the services visible in Figure 1 need to be integrated with the identity management system and the Microsoft Teams platform. Thus, the Moodle configuration (Figure 1) was extended, and other components that interact with the e-learning platform were added (Figure 2). Figure 2 is divided depending on the place where the servers are running (in the university's data centres or the cloud) and by isolation level (what can be exclusively accessed by system administrators and what is publicly available).
• user synchronisation: when a user's role for a course is updated (the user is added/removed, the user's role changes), these changes are propagated into the corresponding Microsoft Teams group;
• links for Moodle courses in Microsoft Teams / Microsoft Teams block in Moodle: a user can easily access the other e-learning platform.
To be considered adequate for teaching and evaluating student performance, the proposed architecture must be:
• able to accommodate a large number of users;
• easy to use.

IV. E-LEARNING PLATFORM SETUP
This section presents the proposed architecture for UPB's e-learning platform, optimised for online classes. Moreover, some of the tasks (e.g. Moodle user creation, course creation) are automated.

A. MOODLE CONFIGURATIONS FOR LARGE NUMBER OF USERS
On top of the previously described Moodle installation, we configured the following:
• installed and configured PHP-FPM to achieve better performance of the PHP processes on the web server instance;
• added security enhancements: enabled SELinux; for comprehensive security, we enabled and configured per-
Each of the installed plugins used for programming activities (Virtual Programming Lab and Code Runner) needs extended setup and configuration since they require external servers where the code is evaluated.
After installing and configuring the services, each of the important settings must be fine-tuned. We determined the fine-tuned options for MariaDB, Redis, NGINX and PHP-FPM based on the results of various automatic and manual tests and prior experience with the platform. These options may differ depending on the e-Learning platform's use cases. Some of the most notable updates are:
• increase PHP-FPM maximum execution time to allow some tasks to finish when processing requests during high system load (e.g., JMeter tests);
• increase PHP-FPM file upload limit and NGINX request body size to allow students to upload larger files for assignments;
• increase PHP-FPM worker count and use static allocation to create a large number of workers ahead of time for better performance;
• increase the maximum number of NGINX connections for an estimated maximum number of simultaneous clients (Equation 1; each user may have 2 connected devices).
The maximum number of connections (Equation 2) is set in NGINX using worker_processes, the number of NGINX worker processes that run, and worker_connections, the maximum number of connections that an NGINX process can accept. To balance the load across all the physical CPUs, we set the number of processes to the number of cores (Equation 3) and the maximum number of connections per process to the estimated maximum number of clients divided by the number of processes (Equation 4).
$$\mathit{max\_con} = \mathit{worker\_processes} \times \mathit{worker\_connections} \tag{2}$$
$$\mathit{worker\_processes} = N_{cores} \tag{3}$$
To simplify the Moodle configuration process, we created some tasks that run on-demand or are scheduled through RunDeck. They synchronise the student databases with the Moodle platform and ensure that the courses are created according to the study plans. Then, they enrol students in their classes according to their corresponding contracts. Figure 3 depicts the automation process described in the following paragraphs.
Since there are more than 11,000 courses from all faculties within the university, it is not feasible to create them manually. Thus, we use some in-house MySQL scripts that extract all classes for the current year from the internal databases based on faculty, year of study, and semester. Using this information and additional Linux shell scripts, we create a preview of the courses; we then verify the study programs' structure and upload the courses in the e-Learning platform using Moodle's upload courses feature. This operation is critical, and since there are many courses, it cannot be easily reverted if any issues are discovered (the process of deleting courses is complex and requires a long time to complete; because of this, recreating courses is resource intensive). We run this task on-demand at the beginning of each university year.
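The course preview step described above can be sketched as follows. This is a minimal illustration, not the in-house scripts: the input records, the shortname scheme and the use of the faculty code as category identifier are all assumptions. The column names `shortname`, `fullname` and `category_idnumber` are fields accepted by Moodle's upload courses feature.

```python
import csv
import io


def build_course_csv(courses):
    """Render course records as CSV text for Moodle's upload courses feature."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["shortname", "fullname", "category_idnumber"],
        lineterminator="\n",
    )
    writer.writeheader()
    seen = set()
    for course in courses:
        # Hypothetical naming scheme: faculty, study year, semester, course code.
        shortname = "{faculty}-{year}-S{semester}-{code}".format(**course)
        # Deleting courses is expensive, so duplicates are rejected before upload.
        if shortname in seen:
            raise ValueError(f"duplicate course shortname: {shortname}")
        seen.add(shortname)
        writer.writerow(
            {
                "shortname": shortname,
                "fullname": course["name"],
                # Assumption: each faculty maps to a Moodle category ID number.
                "category_idnumber": course["faculty"],
            }
        )
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [
        {"faculty": "ACS", "year": 1, "semester": 1, "code": "PC",
         "name": "Computer Programming"},
    ]
    print(build_course_csv(sample))
```

Failing on a duplicate shortname before anything is uploaded reflects the point made above: a faulty upload is costly to revert, so it is cheaper to validate the preview first.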
Similarly, it is impossible to manually enrol a very large number of students (around 30,000) on their specific courses (each student has around 5-7 courses each semester). Considering this, the enrolment process must also be automated.
To enrol a student on courses, they must have a Moodle account first. However, since we use an external authentication system, the database does not contain any users immediately after the platform's deployment. Their profiles are normally created when they first log in. The first step in the automation process is to create a profile for each student. Furthermore, teaching staff members also need to have their profiles created. Using custom MySQL scripts, we extract information from the University's database with the profiles of the users that have an active study or teaching contract. Then, using Moodle's upload users functionality, we create or update each user's profile. This step also updates extra fields that are required for each user: faculty, department, role, group, study year.
Having the users added to the platform, we then extract the course list for each student based on their study contract using SQL queries. Afterwards, we enrol the students on their specific courses using the same upload users functionality.
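A minimal sketch of how such an enrolment file could be assembled is shown below. The input structure and the custom profile field `profile_field_faculty` are assumptions made for illustration; the `courseN` and `roleN` columns are the fields that Moodle's upload users functionality uses for enrolments.

```python
import csv
import io


def build_enrolment_csv(users, role="student"):
    """Render users and their course lists as CSV text for Moodle's upload users feature."""
    # The header needs one course/role column pair per course of the busiest user.
    max_courses = max((len(user["courses"]) for user in users), default=0)
    header = ["username", "firstname", "lastname", "email", "profile_field_faculty"]
    for index in range(1, max_courses + 1):
        header += [f"course{index}", f"role{index}"]

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for user in users:
        row = [
            user["username"],
            user["firstname"],
            user["lastname"],
            user["email"],
            user["faculty"],  # hypothetical custom profile field
        ]
        for shortname in user["courses"]:
            row += [shortname, role]
        # Pad users with fewer courses so every row matches the header width.
        row += [""] * (len(header) - len(row))
        writer.writerow(row)
    return buffer.getvalue()
```

Because the same file both creates or updates the profile and lists the enrolments, a nightly run of such a task is idempotent from the administrator's point of view: unchanged contracts produce unchanged rows.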
A student's study contract does not change often. However, we run the students' user creation and enrolment tasks every night (using jobs in RunDeck) to ensure that the students can access the online e-Learning resources in the shortest time possible in case any changes to contracts are made.

V. MAINTENANCE OPERATIONS
Creating a stable and scalable Moodle instance required solving various issues generated by the increased platform usage when the number of concurrent users suddenly increased. The problems encountered were generally related to scalability: increasing hardware resources does not automatically increase the number of supported users.

A. NUMA AUTO LOAD-BALANCING EFFECT
A Moodle instance contains the Moodle application written in PHP, the SQL database and a storage partition where files are stored. We have split these services between two different nodes (highly available virtual machines running on Hyper-V nodes): one hosts the application and the storage partition, and the other hosts the SQL database. This way, we were able to adjust resources and configuration parameters based on the requirements of each service. In the following paragraphs, we focus on the application virtual machine.
As presented in Equations 3 and 4, to increase the performance and balance the load across multiple NGINX worker processes, we increased the maximum number of NGINX connections to max_con (Equation 2). Each of these connections sends requests to PHP-FPM (PHP FastCGI Process Manager) to serve the PHP content. PHP-FPM maintains a pool of worker processes.
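The sizing rule behind Equations 2 to 4 can be illustrated with a short calculation. This is a sketch under stated assumptions: the form used for the estimated number of clients (users multiplied by devices per user) is inferred from the remark that each user may have 2 connected devices, and the input figures are illustrative rather than the values deployed in production.

```python
import math


def nginx_sizing(expected_users, cores, devices_per_user=2):
    """Derive NGINX worker settings from an estimated number of simultaneous users."""
    # Assumed form of Equation 1: every user may connect from several devices.
    max_clients = expected_users * devices_per_user
    # Equation 3: one worker process per core.
    worker_processes = cores
    # Equation 4: spread the clients evenly over the workers, rounding up.
    worker_connections = math.ceil(max_clients / worker_processes)
    # Equation 2: total number of connections NGINX will accept.
    max_con = worker_processes * worker_connections
    return {
        "max_clients": max_clients,
        "worker_processes": worker_processes,
        "worker_connections": worker_connections,
        "max_con": max_con,
    }


if __name__ == "__main__":
    # Illustrative inputs only: 35,000 users on a 64-core application server.
    print(nginx_sizing(expected_users=35_000, cores=64))
```

Rounding up in the division guarantees that max_con is never smaller than the estimated number of clients.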
In practice, we observed that the load (i.e., the number of processes that actively use the CPU) of the application server was always high. Moreover, increasing the number of CPUs allocated to the server caused the load to increase even more. Varying the number of CPUs revealed that the load was always higher than the number of cores.
A high load usually indicates that the operating system is running many CPU-intensive processes. However, inspecting the processes (mostly PHP-FPM processes), we observed that:
• the processes were idling (i.e. no CPU-intensive tasks were executed);
• no I/O-intensive jobs were running;
• no user-space library calls or system calls that could explain this behaviour were executed.
Considering the observed user-space behaviour, we ran kernel profiling operations on the PHP-FPM processes. We created the flame graph presented in Figure 4. The graph indicates that almost all execution time is actually spent in page_fault calls that invoke do_swap_page and migration_entry_wait. However, the system was reporting that the swap space was not being used.
By studying the code of the migration_entry_wait function, we concluded that the kernel was moving the process' memory between NUMA nodes to optimise the memory access time. Further research on this topic showed that the behaviour is controlled by a setting called numa_balancing, which is configurable through /proc/sys/kernel/numa_balancing. After turning it off, the system load decreased significantly (i.e., for 64 cores, the load decreased from above 64 to below 5).
The NUMA Auto Load-balancing process was trying to move the memory of the running PHP-FPM processes between NUMA nodes, while the processes outnumbered the cores by far. Based on our experience, the recommendation is that when a system runs a small number of processes, comparable to the number of cores, the NUMA balancing should be left on. However, if the number of processes is much higher, as it was in our case, NUMA balancing may cause issues and should be turned off.
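The recommendation above can be turned into a small check. This is a sketch only: the ratio threshold is a hypothetical value chosen for illustration, not a figure from our measurements, and the helper merely reads the setting. Changing it requires administrator rights (for example by writing 0 to the same file).

```python
from pathlib import Path

NUMA_BALANCING_PATH = Path("/proc/sys/kernel/numa_balancing")


def read_numa_balancing(path=NUMA_BALANCING_PATH):
    """Return the current numa_balancing value, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # The file is missing on kernels without NUMA balancing support.
        return None


def should_disable_numa_balancing(process_count, core_count, ratio_threshold=4.0):
    """Suggest disabling NUMA balancing when processes far outnumber the cores.

    ratio_threshold is a hypothetical cut-off used only for illustration.
    """
    if core_count <= 0:
        raise ValueError("core_count must be positive")
    return process_count / core_count > ratio_threshold


if __name__ == "__main__":
    print("numa_balancing =", read_numa_balancing())
    # Example: a large static PHP-FPM pool on a 64-core virtual machine.
    print(should_disable_numa_balancing(process_count=2000, core_count=64))
```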

B. MOODLE CHAT
The Moodle e-learning platform's functionality is split between modules and plugins. One of the default plugins, called ''Chat'', offers live communication capabilities between course members.
For full-online courses, an asynchronous method of teaching and evaluating like Moodle is not enough. Moodle Chat provides an option for live chatting between students and teachers.
A problem with the Moodle Chat is that it has a very old code base without any significant updates in recent years. Its code is simple and not scalable. In the default mode, it issues database queries for each web page refresh, for each client (e.g., for 1,000 connected clients, there were tens of millions of queries to the database), which increased the load on the database and application servers.
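A rough back-of-the-envelope model shows how per-refresh polling reaches this order of magnitude. All parameters below (refresh interval, queries per refresh, session duration) are assumptions chosen for illustration and are not measurements from the platform.

```python
def estimate_chat_queries(clients, refresh_seconds, queries_per_refresh, duration_hours):
    """Estimate database queries generated by clients that poll on every refresh."""
    refreshes_per_client = (duration_hours * 3600) // refresh_seconds
    return clients * refreshes_per_client * queries_per_refresh


if __name__ == "__main__":
    # Assumed values: 1,000 clients, one refresh every 5 seconds,
    # 2 queries per refresh, over an 8-hour teaching day.
    print(estimate_chat_queries(1_000, 5, 2, 8))
```

The query count grows linearly with the number of clients and with the polling frequency, which is why polling does not scale for full-online courses.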
The Moodle community recommended switching the processing mode to an option that created a thread for each client to reduce the number of queries to the database. This setting worked, and the database was within normal parameters again. However, the application server's load increased and this approach failed to scale when the number of clients increased to more than 5,000.
Another indirect issue that affected the platform was that opcache_reset was called whenever certain pages under https://Moodle_URL/admin were accessed. This meant that opcache_reset was being called even for unauthenticated users. The function was invalidating the PHP opcache (i.e. a cache containing a bytecode representation of PHP scripts) for all currently running PHP processes, which was triggering the process manager to restart worker processes. In our case, this meant that all running PHP-FPM processes were restarted, making the students and teachers lose connectivity to the chat application. To mitigate this issue, we created access rules to deny access to the /admin path if the requests are not coming from a trusted IP address.
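The decision logic of such an access rule can be modelled as follows. This is a sketch of the rule's behaviour and not the configuration that was deployed; the trusted networks in the example are hypothetical placeholders taken from documentation address ranges.

```python
import ipaddress


def is_request_allowed(path, remote_ip, trusted_networks):
    """Allow every request except those to /admin from untrusted addresses."""
    if not (path == "/admin" or path.startswith("/admin/")):
        return True
    address = ipaddress.ip_address(remote_ip)
    # An address only matches networks of the same IP version.
    return any(address in ipaddress.ip_network(network) for network in trusted_networks)


if __name__ == "__main__":
    trusted = ["192.0.2.0/24"]  # hypothetical administrator network
    print(is_request_allowed("/admin/index.php", "192.0.2.10", trusted))
    print(is_request_allowed("/admin/index.php", "203.0.113.5", trusted))
    print(is_request_allowed("/course/view.php", "203.0.113.5", trusted))
```

Filtering on the path prefix before any PHP code runs means unauthenticated requests can no longer trigger the cache reset.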
Unfortunately, these changes only partially solved the problem. In the end, the solution to this issue was to disable the Chat plugin and opt for an integration with a dedicated live communication platform, Microsoft Teams.
The integration between the two platforms facilitates access to resources stored in Moodle, while also providing real-time communication facilities such as chat groups and video meetings. These interactions enable easier transitions between the learning platform and Teams. Of particular interest in our case was translating Moodle courses and user roles to Teams groups and group memberships. The most important technical aspect was ensuring that every course has an associated team in Microsoft Teams. Moreover, all users enrolled in a course must also be able to access the associated team.

Microsoft offers various integration options between the
To perform the synchronisation operations, Moodle must be registered as an OpenID Connect client in the Microsoft Azure Active Directory portal. This allows Moodle's service account to send requests to the Microsoft APIs to create, update or delete groups and group memberships. It also allows users to authenticate and access Moodle resources directly from a tab in their Teams client. All group operations are performed through the Microsoft Graph API, and the requests are authorised through the client's service account. Consequently, any issues with the client, such as expired credentials, will cause the synchronisation requests to fail.
The Moodle plugin handles synchronisation by managing resources in the Microsoft registry through API requests. The managed resources refer to groups associated with each course and group memberships that reflect course enrolments. The resources are usually updated in one of two cases:
• a course is added to the pool of synchronised courses;
• an event with a handler implemented in one of the Office 365 plugins is triggered in Moodle.
The first case will trigger updates when a new course has been created (when we set up all classes' synchronisation) or when the administrator has manually added the course to the list of synchronised courses (when we set up only a subset of all classes to be synchronised). To create the associated groups, the Moodle plugin periodically queries the database for any course without an associated group, attempts to create the group, and adds all enrolled users to it. A downside of this approach is that each course is only checked until an associated group is created, to avoid keeping track of every possible change; this downside is mitigated through the use of event handlers.
We encountered an issue caused by the fact that Microsoft has two types of groups with different meanings: regular groups and groups with an associated team. The regular group is seen as an email distribution list, where members can send emails to their group, but it does not show in the Microsoft Teams interface. On the other hand, the group with an associated team functions similarly, but it is also displayed in the Microsoft Teams interface as an available team. From the plugin's perspective, the two differ in their creation requirements, as teams must have an owner, while regular groups do not. The owners are identified by having the appropriate capabilities in Moodle (e.g., a teacher). They must be trusted users since they can perform more advanced operations on the team, such as adding or removing members or deleting the team.
When we first deployed the plugin, its default behaviour was to create teams for the courses that had users with ownership rights and regular groups for the remaining courses. This meant that any course that did not have a user capable of being an owner when the group creation scheduled task ran would be associated with a regular group instead of a team. Moreover, the group was never upgraded to a team even if users with ownership rights were added later, increasing user confusion. To avoid this issue, we changed the plugin to skip all courses that did not have any owners and to create only teams. We also made a feature request for this functionality to the plugin maintainers.
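The selection rule we adopted can be illustrated with a short sketch. The `Course` structure and the function name are hypothetical and do not correspond to the plugin's actual code; the sketch only shows the filter: a course is a candidate for team creation when it has no associated group yet and has at least one user who can act as owner.

```python
from dataclasses import dataclass, field


@dataclass
class Course:
    course_id: int
    owner_ids: list = field(default_factory=list)  # users able to own a team
    has_group: bool = False                        # a group or team already exists


def courses_ready_for_team(courses, batch_size):
    """Return up to batch_size courses that lack a group and have an owner."""
    ready = [c for c in courses if not c.has_group and c.owner_ids]
    return ready[:batch_size]
```

Courses without owners are simply left for a later run, so they receive a team once a teacher is enrolled rather than being bound to a regular group.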
An additional problem on the first deployment was caused by the extremely long runs of the process that was tasked with creating the teams. Despite the group creation and user enrolment processes being quite short (i.e., under 10 seconds for most courses), only a few teams (under 100) were created each hour. Even when we selected only a few thousand courses to be synchronised, a rough estimate was that the entire process would require a few days to complete. While inspecting the synchronisation task's output, we observed that the job would stall for a few minutes before creating teams on each run. By default, the script was scheduled to handle five teams for each run, causing the initial stall to far outweigh the team creation process itself. A temporary solution was to forcibly increase the number of courses handled by each run to 50, thus reducing the impact of the initial stall. Through an issue opened on the project's GitHub page, we confirmed with the project's maintainer that the code that delayed the process was part of a legacy implementation and performed checks for some actions that were no longer required. The code causing this issue has since been removed.
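The effect of the batch size can be estimated with a simple model. The sketch below assumes that runs follow each other back to back, that the initial stall lasts 180 seconds and that each course takes 10 seconds; these figures are illustrative values chosen to be consistent with the observations above ("a few minutes" and "under 10 seconds"), not measurements.

```python
def teams_per_hour(stall_s: float, per_course_s: float, batch_size: int) -> float:
    """Throughput when every run pays a fixed stall before handling a batch."""
    run_s = stall_s + batch_size * per_course_s
    return batch_size * 3600 / run_s


# Assumed values: 180 s stall, 10 s per course.
small_batch = teams_per_hour(180, 10, 5)    # default batch of five courses
large_batch = teams_per_hour(180, 10, 50)   # increased batch of fifty courses
print(f"{small_batch:.0f} vs {large_batch:.0f} teams per hour")
```

Under these assumptions the default batch stays below 100 teams per hour, in line with what we observed, while the larger batch amortises the stall over ten times as many courses and more than triples the throughput.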
The second scenario is critical to keeping courses synchronised with the Microsoft Teams platform. In this scenario, the plugin handles various events in Moodle, such as updating some course parameters or adding, removing and updating user enrolments. For example, whenever a user is added, a team enrolment request is automatically sent through the REST API to also add the user to the team; conversely, if a user is removed from the course, their team association is also automatically removed. Furthermore, when a user's enrolment is updated, such as assigning the teacher role (a team owner) to a user that was previously enrolled as a student, their team membership is upgraded or downgraded as needed. A side effect of our automatic enrolment scripts that periodically update user enrolments is that Moodle generates user enrolment updates, which also add the users back to the teams in case they were accidentally removed.
Some other minor issues were also discovered and have been fixed, such as improper checks on user updates that could be affected by race conditions and lead to user removals, teams not being removed when a course was deleted, and team names not being updated to reflect changes to course names.

D. SLOW SQL QUERIES
The Microsoft Office 365 integration also provides the ability to authenticate users using the Microsoft Single Sign-On service in Moodle. In our case, account creation using this authentication method was disabled; instead, it only allowed the authentication of existing users since users are automatically created and enrolled on courses using external scripts.
Since the Microsoft SSO authentication method was set as optional, and most users opted to authenticate using the traditional username and password form, the number of users using this type of authentication did not increase quickly. However, Moodle started displaying sudden performance drops after this number had reached a few hundred. By investigating the data processed by the Prometheus monitoring tool, we observed that the database server began reporting a number of slow queries. An output from the Grafana dashboard with status information extracted from the database server can be seen in Figure 5.
By examining the database logs, we identified that the auth_oidc_token table, which is used by the Office 365 plugin to track which users use the Microsoft Single Sign-On and their authentication tokens, was not queried efficiently. When querying the database to extract information on user tokens, the userid and username fields are used to identify rows. Because of this, these fields are used frequently, and searches that use them should be optimised. Relational databases can optimise searches using table indexes; by default, a table's primary keys are automatically indexed, but administrators can also set other fields as indexes. Since a server's memory resources are limited, only critical fields should be declared as indexes; however, their impact can be drastic: in our case, after indexing these fields, the amount of time required for each query decreased from a few seconds to around 0.1 seconds.
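The effect of such an index on the query plan can be reproduced on a small scale. The sketch below uses an in-memory SQLite database in place of the production database server, and a simplified table whose columns other than userid and username are assumptions; the index names are also hypothetical. It only illustrates how the plan changes from a full table scan to an index search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the plugin's table; the real schema has more columns.
conn.execute(
    "CREATE TABLE auth_oidc_token ("
    " id INTEGER PRIMARY KEY, userid INTEGER, username TEXT, token TEXT)"
)
conn.executemany(
    "INSERT INTO auth_oidc_token (userid, username, token) VALUES (?, ?, ?)",
    [(i, f"user{i}", f"token{i}") for i in range(1000)],
)

QUERY = "SELECT token FROM auth_oidc_token WHERE userid = ?"


def query_plan(connection, sql, params):
    """Return the human-readable plan SQLite reports for a query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[3] for row in rows)  # the fourth column is the detail text


plan_before = query_plan(conn, QUERY, (42,))
conn.execute("CREATE INDEX idx_token_userid ON auth_oidc_token (userid)")
conn.execute("CREATE INDEX idx_token_username ON auth_oidc_token (username)")
plan_after = query_plan(conn, QUERY, (42,))
print(plan_before)
print(plan_after)
```

Before the index is created the plan reports a scan of the whole table; afterwards it reports a search that uses the index on userid. The exact wording of the plan differs between SQLite versions.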

E. MOODLE BACKUP
To mitigate some common issues that may arise from mistakenly altering courses (e.g., editing or even deleting quiz questions or activities), backing up the courses at regular intervals is recommended. Moodle offers this functionality by default in its administration dashboard, where the administrator can configure various backup parameters:
• the day(s) of the week the backup should be performed (i.e., the day when the backup procedure should begin; it is not guaranteed to finish within the same day, as presented further in this section);
• the location where backup files should be stored (i.e., in the Moodle data directory, or a separate directory that can be placed on an external drive or volume);
• whether attachments, such as images, should be backed up;
• the minimum and the maximum number of backups that should be kept for each course.
To create a backup of any resource, Moodle will create a compressed archive containing all useful files that can be attributed to that resource and a database dump with the data associated with the resource. Because this process requires copying and compressing data, the amount of time the backup creation takes is proportional to the number and size of the files that must be included. The data in each course is isolated from the data in other courses. Thanks to this, multiple processes can be created to reduce the time required to create backups for all courses on the platform, where each process handles a different course.
While the initial setup was easy to perform, we have encountered a few issues along the way that were partly caused by our custom setup for managing courses and user enrolments.
Our deployment of Moodle serves a wide range of users, and because of this, it must be able to store data, such as courses and student assignment submissions, in various formats. Not all formats are equally compressible. For example, a quiz containing multiple-choice questions or text answers submitted directly into a text box in the web interface is highly compressible because the amount of data is usually small (on the order of kilobytes) and the character sets are small. However, in some cases, the submission files include images or even short videos. Multimedia files are usually not as compressible because most image and video formats are already stored in a compressed binary encoding; additionally, they are generally larger (on the order of megabytes) depending on various factors, such as format and resolution.
The backup process has highlighted discrepancies in how different users are using the platform. Some teachers prefer using other platforms to upload content or grade student assignments and use Moodle to add links to that content. Other teachers upload all content to Moodle and require students to upload assignments in various formats, including text assignments or scanned files and screenshots. Consequently, some course backups consisted of a few files and were very small (a few megabytes). In contrast, other backups were huge and contained many files (tens of gigabytes).
We chose to perform backups over a Network File System (NFS) mounted on the Moodle server, which we used as a storage volume. Over time this has proven to be a non-optimal decision because of limited throughput and a low number of concurrent operations supported by the disk sharing mechanisms employed. The backup process must perform a large number of operations to list the files on the filesystem, copy the course files and compress them to create the backup. We have settled for a maximum of 5 concurrent backup operations; above this number, even listing the contents of the filesystem slowed considerably, and the speed of the overall process did not improve.
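The concurrency cap can be illustrated with a bounded worker pool. This is a minimal sketch of the idea rather than Moodle's own scheduling code: it uses threads where Moodle uses separate processes, and `backup_course` is a hypothetical callable that backs up one course and returns True on success.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_BACKUPS = 5  # the limit we settled on for the NFS-backed volume


def run_backups(course_ids, backup_course, max_workers=MAX_CONCURRENT_BACKUPS):
    """Back up every course, never running more than max_workers jobs at once."""
    course_ids = list(course_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with course_ids.
        results = list(pool.map(backup_course, course_ids))
    return dict(zip(course_ids, results))
```

Because the pool owns a fixed number of workers, a slow course delays only its own worker; the remaining workers keep draining the queue.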
Moodle is expected to be available and highly responsive at least in the 8:00-20:00 interval of every workday, when classes are held. With around 5,000 actively used courses (half of the total courses, which correspond to the classes held each semester) that had to be backed up with each iteration, the backups were scheduled to begin at 23:00 each Friday and were expected to end before 08:00 on the following Monday. However, there were times when this did not happen, and the platform began lagging, prompting us to stop the backup process. There were also several failed backups reported for some courses.
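The time budget implied by this schedule can be worked out directly. The sketch below assumes that all 5,000 courses need a backup in the same weekend and that five jobs run concurrently for the whole window; the calendar date is an arbitrary example of a Friday.

```python
from datetime import datetime

# Any Friday works; 2021-03-05 is used purely as an example date.
start = datetime(2021, 3, 5, 23, 0)  # Friday 23:00
end = datetime(2021, 3, 8, 8, 0)     # following Monday 08:00

window_s = (end - start).total_seconds()
courses = 5000
concurrent_jobs = 5  # assumed to be busy for the entire window

budget_per_course_s = window_s * concurrent_jobs / courses
print(window_s / 3600, budget_per_course_s)
```

The window is 57 hours long, which leaves an average of roughly 205 seconds per course. A handful of courses with tens of gigabytes of data can therefore consume a large share of the window on their own.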
To inspect the failed backups, the debugging process began by starting the backup for one of the courses that did not have a backup and observing the output. From the output, it appeared that Moodle had lost connection to the database while the backup ran, but no database service interruptions were logged. After reading some of the Moodle code that handles backups and researching possible causes, we found that the issue affected courses with substantial amounts of data, whose backup ended after a very long delay. The issue was caused by a parameter that manages the time a connection to the database should remain open before timing out. The backup process kept an open connection to the database, which was only closed when the backup ended. Since the connection was being kept alive for the duration of the backup, for long running jobs it would eventually time out and an error would be recorded at the end of the backup, despite the backup being properly created on disk. The solution was setting a timeout interval of 3600 seconds for connections to the database, which allowed longer running tasks to complete successfully.
Our custom deployment was partially responsible for the fact that the backups did not finish over the weekend. The settings offer the option to skip backups for courses that have not been modified in the past days. Despite using this option, some courses had backups created even when their content had not changed recently (e.g., some courses from the first semester were backed up well after the second semester had started, when users were no longer expected to update them). Although the contents were not changed, the scripts that handle student enrolments kept running, causing updates to the enrolments, which are also considered changes by the code that decides whether backups must be performed. In this case, the solution was adding an extra check in the code to ignore enrolment updates and only consider new enrolments or removals as relevant for backups.
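The extra check can be sketched as a filter over the events logged for a course. The event labels, the function name and the data layout below are illustrative assumptions and do not reproduce Moodle's event class names or backup code.

```python
from datetime import datetime, timedelta

# Illustrative event labels, not Moodle's actual event class names.
IGNORED_EVENTS = {"enrolment_updated"}


def needs_backup(events, now, max_age_days):
    """Decide whether a course changed recently enough to need a backup.

    events is an iterable of (event_name, timestamp) pairs for one course.
    Enrolment updates are ignored; new enrolments and removals still count.
    """
    cutoff = now - timedelta(days=max_age_days)
    return any(
        name not in IGNORED_EVENTS and timestamp >= cutoff
        for name, timestamp in events
    )
```

With this filter, a course whose only recent activity comes from the periodic enrolment scripts is treated as unchanged.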
However, we could not solve an issue that causes the number of simultaneous backup processes to increase beyond the set limit. Despite setting a limit of at most 5 backup jobs through multiple options, this limit is sometimes exceeded. The cause may again be very long-running jobs, but since we did not manage to reproduce the error reliably, we have not yet been able to confirm it.

VI. TESTING AND RESULTS
Even though the university has used Moodle for a long time, we must validate each change before deploying the platform in production. Any issue that the changes create means that the platform may be unavailable or vulnerable. However, the end goal is to have a fully operational platform for online teaching. Hence, the proposed architecture must be tested before being deployed for public use.
The platform setup does not receive any major changes over the university year unless there is a security or critical update. When taking this into consideration, architectural changes to the platform must be proposed and validated in advance. There is a short time frame (around one week) when we must migrate the previous e-learning platform instance to a server with fewer resources and prepare the platform for the next year on the production servers. Considering the small time frame, we must have everything ready for production before the actual deployment. Moreover, we usually cannot perform separate tests on the production systems since they may affect the end-users.
This section presents the methodology used for validating the proposed architecture. However, the results that matter the most are from a production environment. Thus, we present how the proposed architecture worked in the 2020-2021 university year at the University Politehnica of Bucharest.

A. PROPOSED CONFIGURATION VALIDATION
To validate the configuration before moving it to the production servers, we create a deployment of the e-learning platform, similar to the one proposed for production, in a test cloud environment based on OpenStack. A battery of custom tests is then run using Apache JMeter for Moodle. The results can then be scaled depending on the available resources in production. However, since the cloud environment is also used for other purposes, we cannot permanently reserve the same amount of resources we have in production (e.g., we cannot dedicate 256GB of RAM to the web server in the cloud instance).
OpenStack is an open-source cloud solution used to manage a cluster's resources. Users can create virtual machines based on various templates. Figure 6 presents the Moodle network topology in OpenStack. It resembles the simple architecture we presented in Figure 1, having the webserver, Redis and MariaDB servers connected to an internal network. The web server is also connected to an external network that makes it accessible from the internet, through the Net224 network. Apache JMeter runs in a different virtual machine in the same cloud.
Besides the primary platform functionality, we want to stress-test the platform to observe its behaviour. Since our platform should support around 30,000 users, we must use test plans that use a large number of users. Thus, the small test plans are not suitable for our test case.
While testing, we concluded that some of the test plans were not suitable for us because they did not match the expected behaviour of our university's students. For example, one of the tests creates 1,000 assignments for a single course. While this can provide valuable data about the load on the servers (it generates a significant load when accessing the course), this test fails to emulate reality properly (i.e., a teacher is unlikely to create more than a few tens of activities in a single course). Another test we have chosen to skip while testing is the one that stress-tests a forum activity because it does not match the profile of normal forum activities. The tests simulate many users who log in, access the same forum activity, post a message, and then log out, resulting in forum activities with thousands of replies.
For testing, we decided to simulate a scenario that resembles the normal activity in our university. We have many courses, each accessed by a relatively small number of students (100-200 students) who perform some regular activities: check the forum for updates, fetch teaching resources and submit assignments. Thus, we chose to generate 50 medium Apache JMeter test plans for Moodle, each adjusted to use a different course and different users. Then, we ran these tests in parallel multiple times to see if the deployment could handle the load.
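Running the test plans in parallel can be scripted in a few lines. The sketch below is one possible way of doing so, not the exact tooling we used; the file names and helper functions are hypothetical. It relies only on JMeter's standard command-line options for non-GUI mode (`-n`), the test plan file (`-t`) and the results log (`-l`).

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def jmeter_command(plan_path, results_path):
    # -n: non-GUI mode, -t: test plan file, -l: results log file
    return ["jmeter", "-n", "-t", plan_path, "-l", results_path]


def run_plans(plan_paths, runner=subprocess.run):
    """Run all plans concurrently and return their exit codes in input order."""
    commands = [jmeter_command(p, p.replace(".jmx", ".jtl")) for p in plan_paths]
    with ThreadPoolExecutor(max_workers=len(commands) or 1) as pool:
        finished = list(pool.map(runner, commands))
    return [result.returncode for result in finished]
```

For example, `run_plans([f"course_{i:02d}.jmx" for i in range(50)])` would start 50 JMeter processes at once, one per course; with 100-200 simulated students per plan this corresponds to roughly 5,000-10,000 simulated users.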

B. RESULTS FROM PRODUCTION
With the proposed configuration validated, we proceeded to deploy it in production. Table 2 shows the resources that were allocated for each component in our production environment. The web server uses the external storage to store the moodledata directory, which contains the uploaded or newly created resources, students' assignments and other data. The proposed resources were selected to allow a large number of simultaneous user connections (based on Equations 1, 2, and 3; by increasing the N cores , we can increase the maximum number of connections). Moreover, the storage capacity was considered appropriately sized for at least two full system backups.
Using these resources (Table 2), we have configured the proposed architecture presented in section III and set up the platform according to the steps described in section IV. Figures 7, 8 and 9 present results collected while our e-learning platform was based on the proposed architecture. The results depict the load on the Moodle platform in three scenarios: during regular classes (March-June), during exam sessions (June and September) and during summer break (July and August).
All three graphs extracted from Grafana (Figures 7, 8 and 9) can be analysed together since they reflect the load on various parts of the system. The graphs show the peak number of simultaneous connections, requests and MySQL connections, respectively. We can use this information to observe that:
• the platform is used mainly during workdays; in University Politehnica of Bucharest, workdays are Monday-Friday, between 08:00-20:00 (or 22:00 depending on faculty and study program);
• the platform is used during weekends and Easter holidays (around 1st of May, 2021); however, the load is lower than on a workday, as there are no classes. There is still some load on the system since the students can access the resources or submit assignments, and the teaching staff can create or grade assignments;
• the platform is used more during the exam session (June 2021); the peaks differ according to the number of exams that took place simultaneously;
• the platform is barely used during the summer break (July and August); however, there are some registered connections since the information is kept online and students and teaching staff can access their classes anytime;
• the load starts to increase during the fall exam session, but it is far lower than during the summer exams.
We can corroborate the data from Figures 5 and 7 and observe that the issue described in section V-D appeared only when the number of connections reached an all-time high peak. Moreover, we can see that after the issue was mitigated, there were no more slow queries, even though the number of connections had reached similar peaks.

VII. CONCLUSION AND FURTHER WORK
E-learning platforms enhance the educational process as they offer both students and teaching staff options for online resource sharing, facilitate communication and provide reliable evaluation tools. Their usage has increased dramatically over the last years as the world had to rely on online tools to continue educational activities.
Managing many users and courses is not simple from a system administrator's point of view: the exponential growth of users and resource requirements affected the server functionality. Thus, the setup of LMSs must be able to accommodate all users at all times.
In this paper, we proposed an e-learning architecture that is used in University Politehnica of Bucharest that accommodates around 35,000 users and 11,000 courses. We presented issues that arose since March 2020 and how we have addressed them. We pointed at some fine-tuning options that we consider helpful based on our experience with Moodle and system configuration. This information could help other system administrators improve their e-learning architectures or set up one if it is not yet deployed. Moreover, as presented in section VI, the proposed solution was exclusively used in University Politehnica of Bucharest during the 2020-2021 university year, when all the learning activities were online. System administrators can use the validation methodologies presented in Section VI to test their services, regardless of their areas of expertise. We have also proposed a deployment methodology that relies on virtual machines in a cloud environment for processing, and on Grafana and Prometheus for system monitoring.
At the time of writing, the proposed solution works well; it is in use at University Politehnica of Bucharest and is set to remain in use, since it builds a bridge between teachers and students that improves the learning process. Even though the presented results are reasonable, we plan to refine the implementation further to improve the platform's performance:
• custom Apache JMeter tests to test the e-learning environment better. While the approach this paper presents to test the proposed architecture was effective, we want to have an Apache JMeter test suite that replicates the common use-cases in our University. We plan to profile the activities for each course and create custom test suites to emulate them. This approach will allow us to fully prepare the platform and stress-test its functionality to anticipate situations where the platform may become unavailable.
• a distributed, highly-available Moodle infrastructure to balance the load. As stated, the e-learning platform is a critical resource in the e-learning environment, and we must ensure its availability even in unwanted situations (e.g. data centre failure).
• dynamic scaling of resources. In a cluster and grid environment, resource management is essential. The hardware resources must not be wasted. The current architecture can be only scaled up or down by changing the configuration parameters of the various virtual machines. This operation is also normally limited to winter or summer breaks to avoid service disruptions during intensive workloads. Thus, we want to set up an architecture that dynamically scales the resources based on system load.

He also coordinates the activity at the level of the data centers with the Politehnica University of Bucharest, ensuring the functioning of the eLearning and high performance computing (HPC) services necessary for the teaching and research components. His research interests include computer architecture, cluster and grid computing, parallel architectures, parallel computing, distributed systems, and networking.