1 Introduction

Research shows that learning modes leveraging remote settings impact teaching and learning positively (Gilbert and Flores-Zambada, 2011; Morris, 2014). For instance, blended learning (Garrison and Kanuka, 2004; Graham, 2013) enriches traditional class lectures with virtual learning environments (Dillenbourg et al., 2002) to provide increased convenience, flexibility, exchange of materials, individualized learning, and feedback (Chou and Liu, 2005). These prospects have led online education to become an established feature of higher education (Puška et al., 2021), a trend accelerated by the recent coronavirus pandemic (Dhawan, 2020). However, this practice usually leads to an untenable grading burden, as larger cohorts require extra effort to assess all assignments (Piech et al., 2013; Gernsbacher, 2015).

Generally, the course faculty is responsible for evaluating student performance, since individualized feedback is an integral part of education. Given the considerably large number of students that need to receive feedback in open and distance learning modes such as Massive Open Online Courses (MOOCs), research has proposed methods to provide feedback analogous to the feedback activities of a traditional classroom (see Suen (2014) for further discussion). Among these methodologies, we highlight automated formative assessment (Gikandi et al., 2011) and peer feedback (Reinholz, 2016). Recent studies show that automated assessment frameworks can foster student performance, perception, and engagement with valuable learning experiences (Núñez-Peña et al., 2015; Sangwin and Köcher, 2016; Moreno and Pineda, 2020). However, several disadvantages of online assessment, such as the software costs of developing educational content and supporting infrastructure, may hinder its use (Tuah and Naing, 2021). Researchers widely use peer assessment techniques to save time and resources during formative assessment by offloading part of the grading work to students (Mahanan et al., 2021).

Peer assessment consists of a set of activities through which individuals judge the work of others (Reinholz, 2016). By playing the roles of both assessor and assessed to grade and comment on each other’s work (Na and Liu, 2019), students can also reflect and discover new understandings by examining the differences between others’ work and their own (Chang et al., 2020). Recent meta-analyses confirm that peer assessment is essential in teaching and learning today (Zheng et al., 2020; Li et al., 2020; Yan et al., 2022). In particular, the formative assessment literature shows that peer assessment has great potential to improve students’ academic performance (Black and Wiliam, 2009; Yan et al., 2022) through pedagogical activities that facilitate learning (Adachi et al., 2018; Double et al., 2020) and promote social-affective development (Li et al., 2020). Researchers have demonstrated that this methodology provides positive results at all educational levels (Li et al., 2020; Yan et al., 2022) and can be a useful pedagogic tool for blended learning and MOOCs across a broad range of disciplines (Liu and Sadler, 2003; Price et al., 2013). Peer assessment fosters student autonomy and self-regulation capabilities (Li et al., 2020; Reinholz, 2016) by allowing students to develop a habit of reflection and a constructive critical spirit as evaluators (Panadero et al., 2013). Furthermore, it encourages more robust engagement in the overall learning process, since the awareness of an audience promotes more thoughtful authoring (Wheeler et al., 2008). Nevertheless, peer assessment is susceptible to several pitfalls and hindrances that may introduce noisy data into the evaluation results (Adachi et al., 2018).

State-of-the-art research has tackled limitations including reliability, perceived expertise, and power relations (Liu and Carless, 2006; Li et al., 2018; Panadero et al., 2019). However, recent studies highlight the need to understand how other idiosyncratic factors may explain the variance observed in peer assessment environments (Schmidt et al., 2021). For instance, individual differences may lead students to follow distinct scoring tactics (Chan and King, 2017) and perceive user-generated content differently (Shao, 2009), yielding biased and unreliable grades. Educational psychology recognizes that individual differences play a significant role in learning and achievement (An and Carr, 2017). However, there is an empirical gap regarding the impact of psychological constructs such as personality on peer assessments (Chang et al., 2021; Rivers, 2021). Recent studies focus on whether personality influences perceptions of peer assessment (Cachero et al., 2022; Rod et al., 2020) or on how a target’s personality traits predict how peers rate them (Martin and Locke, 2022). Nevertheless, to the best of our knowledge, little research has studied how the assessor’s personality traits affect their peer assessments, even though researchers have discussed personality traits as a possible factor in the quality of peer assessment techniques (Murray et al., 2017; Na and Liu, 2019).

Our goal is to address this research gap by investigating the impact of personality on peer assessment in distance learning. If personality traits explain a substantial part of the variance, peer assessments may hinge more on the assessor’s personality than on the work being graded, warranting further inquiry. Our work analyzes the role of personality in peer assessment using a real dataset collected from a university course. This research may serve practitioners as a first step towards including personality factors in the development of peer assessment environments by (a) helping them predict how personality introduces variance in the grading process and (b) helping them choose the set of student evaluators most appropriate to minimize the noise introduced by personality.

The remainder of the paper is organized as follows: Sect. 2 explains the conceptual model of our study. Section 3 reviews the literature on peer assessment and how past research used personality to understand the impact of individual differences in the peer assessment process. Then, Sect. 4 elaborates on the experimental design and details its execution. Next, Sect. 5 covers the analysis of the results, and Sect. 6 draws the implications of our work for the education community. Finally, Sect. 7 outlines the main conclusions and proposes future research directions.

2 Conceptual Model

We draw on an existing theoretical model of personality to empirically assess how an individual’s personality traits affect their grading process in a peer assessment environment. We also discuss the metrics to evaluate peer assessment environments that interest our study.

2.1 Personality

The American Psychological Association defines personality as the enduring configuration of characteristics and behavior that comprises an individual’s unique adjustment to life, including major traits, interests, drives, values, self-concept, abilities, and emotional patterns. There is a broad range of theories and models, each with differing perspectives on particular topics when defining personality constructs (Corr and Matthews, 2009). Among them, several models have been developed based on different personality theories, such as the Five-Factor Model (FFM) (McCrae and John, 1992; Costa and McCrae, 2008), the HEXACO model (Lee and Ashton, 2004), and Eysenck’s model (Eysenck, 1963; Carducci, 2015).

Among the first trait-based personality models, Eysenck (1963) created the PEN model, which divides personality into three dimensions: Psychoticism, Extraversion, and Neuroticism. However, during the 1980s, many researchers came to agree that five broad, roughly independent dimensions better summarize personality variation. These five dimensions led to the creation of the FFM (McCrae and John, 1992). This model is a hierarchical organization of personality traits into five fundamental dimensions: Neuroticism, Extraversion, Openness to Experience, Agreeableness, and Conscientiousness. Each dimension is then divided into six subdimensions (facets). The model is usually referred to as the OCEAN model (an acronym of the five dimensions) or simply as the Big Five. The HEXACO model (Lee and Ashton, 2004) shares four traits with the FFM and adds two new personality traits; it comprises Honesty-Humility (H), Emotionality (E), eXtraversion (X), Agreeableness (A), Conscientiousness (C), and Openness to Experience (O).

Several personality researchers agree that the FFM personality traits capture cross-cultural differences in normal behavior, and studies have replicated this taxonomy in a diversity of samples (Digman, 2003; Chamorro-Premuzic and Furnham, 2014). These findings led the FFM to be considered one of the theories that best represents the personality construct (Avia et al., 1995; Feldt et al., 2010) and to dominate the personality research landscape (Terzis et al., 2012; Cruz et al., 2015). More importantly, researchers have applied the FFM in educational settings (Bergold and Steinmayr, 2018; Rivers, 2021), and the community considers it a helpful model for predicting both achievement and behavior. Costa and McCrae (2008) derived the FFM from the lexical hypothesis, which claims that language encodes the important personality characteristics: the more relevant a trait is, the more likely words for that trait appear in the dictionary (Kortum and Oswald, 2018). The FFM describes personality through five general dimensions, as follows:

  1. Neuroticism

    is often referred to as emotional instability, addressing the tendency to experience mood swings and negative emotions such as anxiety, worry, fear, anger, frustration, envy, jealousy, guilt, depressed mood, and loneliness (Thompson, 2008). Highly neurotic people are more likely to experience stress and nervousness and are at risk for the development and onset of common mental disorders (Jeronimus et al., 2016). In contrast, people with lower neuroticism scores tend to be calmer and more self-confident, but, at the extreme, they may be emotionally reserved.

  2. Extraversion

    measures a person’s tendency to seek stimulation in the external world and the company of others, and to express positive emotions. Extroverts tend to be more outgoing, friendly, and socially active, which may explain why they are more prone to boredom when alone. Introverts are more comfortable in their own company and appreciate solitary activities such as reading, writing, using computers, hiking, and fishing.

  3. Openness to Experience

    measures a person’s imagination, curiosity, seeking of new experiences, and interest in culture, ideas, and aesthetics. People high on Openness to Experience tend to have a greater art appreciation, devise novel ideas, hold unconventional values, and willingly question authority (Costa and McCrae, 2008). On the contrary, those with low openness to experience tend to be more conventional, less creative, and more authoritarian.

  4. Agreeableness

    addresses how much a person focuses on maintaining positive social relations. On the one hand, people with high agreeableness scores tend to be friendly and compassionate but may find it difficult to tell the hard truth. On the other hand, disagreeable people show signs of negative behavior, such as manipulation and competing with others rather than cooperating.

  5. Conscientiousness

    assesses the preference for an organized approach to life compared to a spontaneous one. People high on conscientiousness are more likely to be well-organized, reliable, and consistent. Individuals with low conscientiousness are generally more easy-going, spontaneous, and creative.

2.2 Peer Assessment Factors

There are several scopes through which we can observe the quality of peer assessment and, thus, how personality may affect a selection of metrics that reflect such quality. For instance, in a peer assessment process, students are assigned an arbitrary number of peer reviews to complete. We can measure two typical performance metrics in this process. One approach to measuring how effective students are in their peer assessment is to calculate the percentage of completed peer reviews per student out of all those assigned to them (peer assessment efficacy). Another promising performance variable is how long students take to complete a peer assessment (peer assessment efficiency).

Previous research has extensively used the differences between grades provided by students and faculty members as metrics to assess the quality of the peer assessment environment (AlFallay, 2004; Kulkarni et al., 2013; Yan et al., 2022). We also consider it relevant to tackle how accurate and fair the grades given by students are. We decided to consider three variables for this assessment: (i) The grade that the student provides to a specific post, (ii) The difference between the grade that the student provided to a specific post and the grade resulting from weighing all the peer grades for that post (grade difference from final), and (iii) The difference between the grade that the student provided to a specific post and the grade the faculty provided for that post (grade difference from faculty). These three measures allow us to validate if there is a perceiver effect solely based on the grader or if it generalizes compared to the remaining grades for a post and the faculty assessment.

Finally, the additional constructive feedback students provide to their peers improves the quality of the peer assessment process (Dochy et al., 1999; Basheti et al., 2010). Some students may not provide feedback to their peers, or may lack the creativity to elaborate their feedback beyond the basic, formal requirements of the submission. To analyze the feedback that students provide in their peer assessments, we decided to count the number of characters present in the feedback (feedback length). Although it does not allow us to assess the quality of the feedback itself, this approach provides a broad indication of whether students were engaged in the process.

3 Related Work

We present a literature review related to our work.

3.1 Peer Assessment Validity and Reliability

Nowadays, practitioners use peer assessment as both an assessment and a learning tool. However, there are several drawbacks to consider that may affect the validity and accuracy of the assessment. For instance, experimental results show that peer grades usually correlate with faculty-assigned grades (Liu et al., 2004), but the former may be slightly higher (Kulkarni et al., 2013). Furthermore, students may give lower grades than the faculty to the best-performing students (Sadler and Good, 2006), and peer assessment completed by undergraduate students may not be as reliable (Gielen et al., 2011). These effects usually stem from students having less grading experience than faculty. Educators have often relied on rubrics to counter this lack of experience, thus assisting assessors in judging the quality of student performance (Panadero and Jonsson, 2020). However, even when the faculty lists specific criteria to support the assessment, students’ different backgrounds and knowledge levels call their eligibility and grading accuracy into question (Kulkarni et al., 2013; Na and Liu, 2019), leading course participants to sometimes challenge peer assessment itself as a valid form of assessment (Glance et al., 2013; Suen, 2014).

Student perceptions may also hinder the peer assessment process (Adachi et al., 2018; To and Panadero, 2019). In particular, student attitudes and perceptions directly affect the reliability of peer assessment, since it depends heavily on objective and honest evaluation. Papinczak et al. (2007) explored the social pressure that makes students hesitate to criticize their peers and score them honestly. Kaufman and Schunn (2011) reported that students often regard peer assessment as unfair and question their peers’ ability to evaluate their work. Hoang et al. (2016) also found that students may occasionally not fully invest in peer assessment activities, sharing little knowledge in their feedback.

Other factors affecting peer assessment validity and reliability are the complexity of the assessment task and the reviewer load. Tong et al. (2023) found that more complex assessment tasks had lower validity. The authors also showed that increasing reviewer load can decrease or improve single-rater reliability depending on task complexity. Researchers can counteract this subjectiveness bias with several strategies, such as anonymity, multiple reviews per peer (where the final peer grade averages the individual student grades), and training (Bostock, 2000; Glance et al., 2013; Hoang et al., 2022). Nevertheless, other individual factors, such as personality, are more pervasive and still under-researched (Chang et al., 2021; Rivers, 2021). We consider that personality deserves a deeper understanding relevant to the design of peer assessment environments, as it may affect the fairness, accuracy, and reliability of peer assessment (Vickerman, 2009).

3.2 Personality Factors in Peer Assessment Environments

Few studies have focused on understanding the role of personality in peer assessment environments (Chang et al., 2021; Rivers, 2021). Recently, Martin and Locke (2022) focused on the association between peer ratings, personality traits, and self-ratings. The authors leveraged the HEXACO personality model to test whether team members’ personality traits could predict their peer ratings. Results showed that conscientiousness predicted higher peer ratings, suggesting practitioners may want to assign one highly conscientious person to every team in this setting.

In another example, Cachero et al. (2022) used machine learning techniques to study the influence of personality and modality on perceptions of peer assessment. The authors measured personality through the FFM. They found that agreeableness was the best predictor of peer assessment usefulness and ease of use, extraversion of compatibility, and neuroticism of intention of use. Furthermore, individuals with low conscientiousness scores may be more resistant to the introduction of peer assessment processes in the classroom, while the perceived value of peer assessment improved the positive feelings of those scoring high on neuroticism. These studies demonstrate that personality may bias not only how the assessor perceives the work of their peers (Martin and Locke, 2022) but also how one perceives the whole peer assessment process.

Besides these two factors, we hypothesize that personality may also affect how one conducts the peer grading process. For instance, students more prone to emotional instability may be more anxious both about being peer assessed and about assessing other students. Despite personality’s influence on how individuals perceive user-generated content (Shao, 2009), little to no research has focused on the dynamics of personality factors in students’ behavior within a peer assessment environment. Only AlFallay (2004) conducted a study evaluating the effect of the self-esteem, classroom anxiety, and motivational intensity personality traits on the accuracy of self- and peer assessment in oral presentation tasks. The author found interesting results, with learners possessing the positive side of a trait being more accurate than those with its negative side, except for students with high classroom anxiety. Nevertheless, the traits that AlFallay (2004) focused on are more relevant for oral, physical presentation settings than for remote learning environments. Considering our domain, we believe that leveraging personality traits from well-researched models such as the FFM may provide a more substantial and broader understanding, helping us integrate personality-based design guidelines into the development of peer assessment mechanisms for remote learning.

4 Research Method

As we mentioned in the previous sections, we need to carefully design a peer assessment environment that not only acts as an effective formative assessment tool (Wanner and Palmer, 2018) but also allows us to understand whether personality traits can bias the peer assessment process. This section presents the context of the remote learning course, our research question and hypotheses, the research procedure, and the data analysis.

4.1 Research Context

To accomplish the primary goal of this study, which is understanding whether personality factors affect the peer assessment environment in a remote learning course, we need to collect data from a course that supports both remote learning and peer assessment. In our study, we leverage a course named Multimedia Content Production (MCP) designed for MSc students in Information Systems and Computer Engineering. In particular, there is already an extensive body of research based on this course (e.g., Barata et al., 2016; Barata et al., 2017; Nabizadeh et al., 2021). As such, we believe that the MCP course provides a stable, well-designed course system where we can implement a peer assessment environment to study our research question.

In this edition of MCP, 69 (\(86.25\%\)) of the 80 students who enrolled completed the course. Although MCP is traditionally a blended learning course, the COVID-19 pandemic led the faculty to run the course in a remote setting, with students attending theoretical lectures and practical laboratories through a videoconference platform. While in the theoretical lectures students learn about different media formats (e.g., audio, video, and image) from an engineering standpoint (e.g., compression algorithms and encoding formats), the laboratory classes focus on creating high-quality media using the Processing programming language. In addition to these classes, students also join the discussions and complete online assignments via Moodle.

Among the various educational components in MCP, students can earn grade points through a Skill Tree. The Skill Tree is a selection of learning activities students can complete autonomously during the semester. It is a precedence tree where each node represents a skill (see Fig. 1). Students start the semester with a set of unlocked skills that they can complete right away. Subsequent nodes are unlocked when students complete a subset of preceding skills, which thus act as requirements. To complete a skill, students submit their work on Moodle. Then, a member of the faculty grades their work and provides feedback. The student completes the skill and earns grade points if the grade is above a fixed threshold. If students receive a poor grade, they may resubmit their work up to three times to complete the skill.

Fig. 1: Visual depiction of the Skill Tree used in MCP

This Skill Tree submission system is public, and every student in the course can see it. Moreover, it has been a constant feature in MCP where students invest significant effort in completing the course (Barata et al., 2016). As such, this component contains the appropriate characteristics to deploy a peer assessment environment.

4.2 Research Question and Hypotheses

Our work focuses on whether personality affects the peer assessment dynamics of a remote learning course. Thus, we formulate our research question as follows:

RQ: How does personality affect students’ behavior in the peer assessment process of a semester-long remote learning course?

Since peer assessment efficacy is intrinsically related to each student’s level of personal organization, we believe that the conscientiousness personality trait may play a mediating role regarding this metric. As we mentioned, conscientiousness reflects the self-imposition of socially prescribed restraints that facilitate goal completion, following norms and rules, and prioritizing tasks. Personality Psychology research has highlighted how this trait is decisive in behavioral control. The general rule is that individuals with high levels of conscientiousness show stronger task performance than those low on conscientiousness (Barrick and Mount, 1991; Witt et al., 2002), and these performance qualities transfer to other domains such as health (Booth-Kewley and Vickers, 1994; Courneya and Hellsten, 1998) and gaming (Liao et al., 2021). Therefore, we predict that students with higher levels of conscientiousness will complete a larger percentage of the peer assessments assigned to them than their counterparts. Thus, we formulate our first hypothesis as follows:

H1: Conscientiousness positively affects peer assessment efficacy.

Besides peer assessment efficacy, the conscientiousness trait may also influence the time students take to complete a peer assessment. In particular, we expect an effect similar to that on efficacy, i.e., individuals with higher levels of conscientiousness will complete their assigned peer reviews faster than those low on conscientiousness. In other words, people with higher conscientiousness scores will be more efficient, as they take less time to complete the process. As such, we formulate our second hypothesis as follows:

H2: Conscientiousness positively affects peer assessment efficiency.

The agreeableness trait may help explain the grades students give their peers, since it distinguishes cooperation from competition. Recent research (Yee et al., 2011) has shown how this trait modulates behavior in games, with agreeable people preferring non-combat gameplay such as exploration and crafting, and disagreeable individuals focusing more on the competitive and antagonistic aspects of gameplay. Therefore, we believe that the agreeableness personality trait will create a bias in the peer assessment process, acting as a moderator of interpersonal conflict (Jensen-Campbell and Graziano, 2001). In particular, we expect that agreeable students will provide higher scores to their peers, while more competitive individuals will assess submissions with lower grades. We formulate our third hypothesis as follows:

H3: Agreeableness positively affects peer assessment grades.

Finally, it is essential to leverage personality traits that explain how students interact with the outer world, i.e., their peers, in the overall peer assessment environment. As such, we focus on the traits of extraversion and openness to experience, as these constructs positively affect creativity (Sung and Choi, 2009; Filippi et al., 2017). We formulate our fourth hypothesis as follows:

H4: Extraversion and openness to experience positively affect the amount of feedback in peer assessment.

4.3 Data Collection Tools

We defined the variables leveraged in this study in Sect. 2. The personality variables of the FFM were collected with the IPIP-BFM-50 (Goldberg, 1999; Goldberg et al., 2006; Oliveira, 2019). The International Personality Item Pool (IPIP) is a large-scale collaborative repository of public-domain personality items for measuring personality constructs. In particular, the IPIP-BFM-50 provides measures of the FFM personality traits, including the ones we target in this study. It has 50 items, with ten items per trait. Each item presents an assertion semantically connected to a behavior, rated on a five-point Likert scale of agreement ranging from very inaccurate to very accurate; we score each personality trait by summing its items’ responses according to each item’s direction of scoring.
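
To make the scoring procedure concrete, the sketch below sums Likert responses per trait while flipping reverse-keyed items. The item-to-trait mapping and reverse-keying shown here are illustrative placeholders, not the actual IPIP-BFM-50 key, which must be taken from the IPIP documentation.

```python
LIKERT_MIN, LIKERT_MAX = 1, 5  # "very inaccurate" .. "very accurate"

# Hypothetical key: trait -> list of (item_index, reverse_keyed) pairs.
# The real IPIP-BFM-50 assigns ten items per trait.
SCORING_KEY = {
    "extraversion": [(0, False), (5, True)],
    "agreeableness": [(1, True), (6, False)],
    "conscientiousness": [(2, False), (7, True)],
    "emotional_stability": [(3, True), (8, True)],
    "openness": [(4, False), (9, False)],
}

def score_traits(responses):
    """Sum Likert responses per trait, flipping reverse-keyed items."""
    scores = {}
    for trait, items in SCORING_KEY.items():
        total = 0
        for index, reverse in items:
            value = responses[index]
            if reverse:  # reverse-keyed: 1 becomes 5, 2 becomes 4, ...
                value = LIKERT_MIN + LIKERT_MAX - value
            total += value
        scores[trait] = total
    return scores
```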

Regarding the metrics to measure the peer assessment process, we created a Moodle plug-in to implement the peer assessment environment. The Moodle plug-in is responsible for assigning the reviewers to a post and for storing the data. We can aggregate all peer reviews that students performed and compute for each student their (i) Peer assessment efficacy, (ii) Peer assessment efficiency, (iii) Average peer assessment grade, (iv) Average grade difference from final, (v) Average grade difference from faculty, and (vi) Average feedback length.
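
As a rough sketch of this aggregation, and assuming each review record stores a completion flag, the minutes taken, the grade, the feedback text, and the reviewed post (field names are ours, not the plug-in's), the six per-student metrics could be computed as follows:

```python
from statistics import mean

def student_metrics(reviews, final_grades, faculty_grades):
    """Compute the six per-student peer assessment metrics.

    `reviews` holds one student's assigned reviews; `final_grades` and
    `faculty_grades` map post ids to the two grade baselines.
    """
    done = [r for r in reviews if r["completed"]]
    return {
        "efficacy": len(done) / len(reviews),                  # % completed
        "efficiency": mean(r["minutes_taken"] for r in done),  # avg minutes
        "avg_grade": mean(r["grade"] for r in done),
        "avg_diff_final": mean(
            r["grade"] - final_grades[r["post_id"]] for r in done),
        "avg_diff_faculty": mean(
            r["grade"] - faculty_grades[r["post_id"]] for r in done),
        "avg_feedback_length": mean(len(r["feedback"]) for r in done),
    }
```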

4.4 Research Procedure

The Skill Tree was open for the whole duration of the course. Whenever a student posts a submission for a skill in Moodle, the plug-in pseudo-randomly assigns five other students to peer review it. We follow a pseudo-random approach to even out the number of assigned peer reviews per student. At half the semester, we removed from the pool of potential graders the students who had given up or shown no activity in the course.
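
A minimal sketch of such a load-balanced pseudo-random assignment is shown below; the plug-in's actual logic may differ, and the names are ours.

```python
import heapq
import random

def assign_reviewers(author, eligible, load, k=5):
    """Pseudo-randomly pick k reviewers, favoring the least-loaded students."""
    pool = [s for s in eligible if s != author]  # never review your own post
    random.shuffle(pool)                         # random tie-breaking
    chosen = heapq.nsmallest(k, pool, key=lambda s: load[s])
    for s in chosen:
        load[s] += 1                             # update assigned-review counts
    return chosen
```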

Each reviewer has two days to grade an assigned submission. The reviewer must provide a grade from 0 to 5, where a higher grade represents better work, and may write as much feedback as they want. The peer assessment is single-anonymized, i.e., the reviewer knows which student they are grading, but the assessed student does not know who graded them. We followed this approach to allow reviewers to critique submissions without any influence exerted by the authors. Although the double-anonymized method offers more advantages (Tomkins et al., 2017), it was impossible to apply, since students submit their work publicly and it is available to the whole course before being graded. In that case, reviewers could check Moodle at any time and find out the author, which would render the double-anonymized method no better than the single-anonymized approach. Furthermore, the faculty could grade the submission at any time. However, the grading and feedback would only become available either (i) after a minimum of three student reviewers completed the peer assessment or (ii) after the two-day limit had passed since the original submission.

After either of these conditions is satisfied, the student who submitted some work could see the feedback and grades from both the faculty and the peers that reviewed their work (see Fig. 2). In particular, each submission presents the grade that the faculty attributed as well as a weighted peer assessment grade, which is the average grade from all the peer assessments of that post. In the example of Fig. 2, the weighted peer assessment grade is 3.6 based on the pool of peer assessments (3, 3, 4, 4, 4).

Fig. 2: Example of a container with the peer assessments for a given post

In addition, MCP contains a training feature where students can check and grade test examples for each skill to gain the initial level of knowledge necessary to assess their peers’ submissions. This feature aims to minimize the subjectiveness bias related to knowledge levels (Kulkarni et al., 2013) and provide a scaffold on which students can base their evaluations. A student must first complete the training for a skill to be eligible to grade a post on it. This enforcement strategy aims to minimize rogue reviews, which increase with the anonymous nature of the review as well as with a decreased feeling of community affiliation (Hamer et al., 2005; Lu and Bol, 2007). By the end of the semester, we asked all native Portuguese-speaking students to complete the Portuguese version of the IPIP-BFM-50. Finally, students received extra credit for participating in this experiment.

4.5 Data Analysis

First, we removed from the dataset all peer assessments that contained no feedback related to the post. We then aggregated all peer reviews performed by the students who filled in the personality questionnaire and computed the peer assessment environment metrics. Finally, we merged these data with the personality data to produce the final dataset. Our final dataset includes information from 806 posts and 2688 peer grades, an average of 3.33 peer assessments per post. Regarding our participants, the dataset contains personality information from 45 students.

Table 1 presents the descriptive statistics of our study variables, including a \(95\%\) confidence interval and a Shapiro-Wilk normality test. In addition, Fig. 3 illustrates the distribution of the personality data used in our study. Emotional stability and extraversion present a wider interquartile range than the other traits. Moreover, we can observe two outliers in the openness to experience distribution. Although outliers may distort the statistical analysis, such a small share of our sample (\(4\%\)) can be neglected.

Table 1: Descriptive statistics of the study variables

We ran correlation tests to study the relationships between our quantitative variables. We decided which method to run based on the Shapiro-Wilk test and a preliminary visual inspection. We ran a Pearson’s product-moment correlation if both variables presented normal distributions and no significant outliers. When the data failed these assumptions, we ran a Spearman’s rank-order correlation.
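
A sketch of this decision procedure with SciPy appears below; the visual inspection of scatterplots for linearity and outliers remains a manual, preliminary step, as in the study.

```python
from scipy import stats

def correlate(x, y, alpha=0.05):
    """Pearson if both samples pass Shapiro-Wilk normality, else Spearman."""
    normal = (stats.shapiro(x)[1] > alpha) and (stats.shapiro(y)[1] > alpha)
    if normal:
        r, p = stats.pearsonr(x, y)
        return "pearson", r, p
    r, p = stats.spearmanr(x, y)
    return "spearman", r, p
```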

Fig. 3: Distributions of personality traits of the FFM from the sample

5 Results

This section describes the results of the experiment. It starts by analyzing the effect of personality on peer assessment efficacy and efficiency, continues with the grades and the feedback that students provide their peers, and closes with additional findings.

5.1 Peer Assessment Efficacy

As we mentioned, peer assessment efficacy refers to the percentage of assigned peer reviews completed by each student. As Table 1 shows, peer assessment efficacy has a non-normal distribution. Additionally, a preliminary analysis showed that the relationship is not linear, as assessed by visual inspection of a scatterplot (Fig. 4). The right-hand half of the locally estimated scatterplot smoothing (LOESS) line (Jacoby, 2000) presents a distorted sine wave without a general trend in the data. As such, we ran a Spearman’s rank-order correlation, which showed a statistically non-significant, very weak positive correlation between peer assessment efficacy and conscientiousness scores, \(r_s(45) = .196, p = .198\). In this light, we cannot accept H1, since we found no significant effect of conscientiousness on peer assessment efficacy.

Fig. 4: Scatterplot showing the relationship between peer assessment efficacy (x axis) and conscientiousness scores (y axis)

5.2 Peer Assessment Efficiency

Next, we want to verify whether the same personality trait influences each student’s peer assessment efficiency, i.e., the average time students take to complete a peer assessment. In particular, students are more efficient as their average minutes to complete a peer assessment decrease. Besides peer assessment efficiency not showing a normal distribution, a preliminary visual inspection of the scatterplot (Fig. 5) showed a non-linear relationship. In particular, the LOESS line roughly mirrors the peer assessment efficacy line, which may hint at a correlation between peer assessment efficiency and efficacy. Indeed, a Spearman’s rank-order correlation showed a statistically non-significant, weak negative correlation between peer assessment efficiency and conscientiousness scores, \(r_s(45) = -.216, p = .154\). We likewise cannot accept H2, since we found no significant effect of conscientiousness on peer assessment efficiency.

Fig. 5: Scatterplot showing the relationship between peer assessment efficiency (x axis) and conscientiousness scores (y axis)

5.3 Peer Assessment Grades

Regarding the grades that students provide to their peers, we started by exploring how the level of agreeableness models these grades without considering any other factor. All variables of interest show a normal distribution. Moreover, a preliminary analysis through visual inspection showed the relationship to be linear and without outliers (Fig. 6). Therefore, we opted for a Pearson’s product-moment correlation, which showed a statistically significant, weak positive correlation between peer assessment grades and agreeableness scores, \(r(45) = .358, p = .016\). Therefore, we accept H3, since we found that agreeableness is positively associated with peer assessment grades.

Fig. 6: Scatterplot showing the relationship between peer assessment grades (x axis) and agreeableness scores (y axis)

We followed up our analysis by considering two different baselines for the grades that students provide to their peers. Given that the average differences from the final and the faculty grades also presented normal distributions (Table 1), we continued with Pearson’s product-moment correlations. First, we checked for each student whether agreeableness also regulated the average difference between the grade the faculty attributed to a post and the grades the student gave to that post (Fig. 7). Again, we found a statistically significant, weak positive correlation between these variables, \(r(45) = .360, p = .015\). In addition, we found a statistically significant, moderate positive correlation between agreeableness scores and the average difference between a post’s final grade and the student’s grades, \(r(45) = .490, p = .001\); see Fig. 8 for a visual depiction of this relationship. Both results support the acceptance of H3, since agreeableness consistently influenced how students rated their assigned peer reviews, independently of the baseline.

Fig. 7: Scatterplot showing the relationship between the average grade difference from the faculty grade (x axis) and agreeableness scores (y axis)

Fig. 8: Scatterplot showing the relationship between the average grade difference from the final grade (x axis) and agreeableness scores (y axis)

5.4 Peer Assessment Feedback

Finally, we want to check whether personality affects the feedback students provide in their reviews. Since extraversion and openness to experience did not present normal distributions (see Table 1), we opted for Spearman correlations for these relationships. In addition, a preliminary visual inspection showed that both relationships presented a sine-like pattern (Figs. 9 and 10). We found a statistically non-significant, very weak correlation between the amount of feedback that students provide and extraversion scores, \(r_s(45) = .017, p = .913\). In contrast, there was a statistically significant, weak positive correlation between the amount of feedback that students provide and openness to experience scores, \(r_s(45) = .299, p = .046\). As such, we cannot accept H4: although openness to experience is positively associated with peer assessment feedback, extraversion is not.

Fig. 9: Scatterplot showing the relationship between average characters in feedback (x axis) and extraversion scores (y axis)

Fig. 10: Scatterplot showing the relationship between average characters in feedback (x axis) and openness to experience scores (y axis)

5.5 Additional Findings

As mentioned in Sects. 5.1 and 5.2, the results hinted at a relationship between peer assessment efficacy and efficiency. As both variables have non-normal distributions (see Table 1), we used a Spearman correlation to study their relationship. A statistically significant, strong negative correlation existed between peer assessment efficacy and efficiency, \(r_s(45) = -.717, p < .001\). This result shows that students who complete more of their assigned peer reviews also complete them faster than their counterparts (Fig. 11).

Fig. 11: Scatterplot showing the relationship between peer assessment efficacy (x axis) and peer assessment efficiency (y axis)

6 Discussion

In this section, we answer our research question by summarizing the significant results, deriving design implications, and discussing the limitations of this work.

6.1 Answering the Research Question

Results have shown that personality traits can bias the peer assessment process. The most noteworthy effect was how the agreeableness trait modulated the grades of the peer assessments. As mentioned, this psychological construct measures the disposition to maintain positive social relations (Halko and Kientz, 2010). As expected, people leaning more towards helping others assessed their peers’ submissions with, on average, higher grades than people more prone to compete. This raises a severe drawback for the overall reliability of the peer assessment process: if we consider the faculty’s grade as the true score, students’ grades depend less on the value of the submission and more on the grader. Although most grade differences amount to less than a full grade point, a pool of reviewers containing students with similar agreeableness scores may exacerbate the grading disparity. For instance, if the majority of the reviewers score high on agreeableness, the weighted peer assessment grade may be higher than it should be, making the grading system less fair. We also conducted another correlation test to compare the peer assessment grades with a different baseline. The association with agreeableness was still present when we compared the average difference between a post’s weighted peer assessment grade and the one a student provides. As we mentioned, the accuracy of peer grades is often questioned (AlFallay, 2004; Gielen et al., 2011; Kulkarni et al., 2013; Yan et al., 2022), and some students believe peer assessment to be unfair because of these grade discrepancies (Kaufman and Schunn, 2011). Our results confirm previous findings that peer grades run slightly higher than those given by professors (Kulkarni et al., 2013). Our findings show that agreeableness is associated with, and may be responsible for, biasing the peer assessment.

Openness to experience also showed a significant association with the feedback length students provide to their peers. In particular, the sine-like LOESS curve reflects an increase in feedback characters at higher values of openness. This result is in line with our expectations, since people high on openness to experience tend to have a greater appreciation for art (Dollinger, 1993; Rawlings and Ciancarelli, 1997) as well as for new or unusual ideas (Halko and Kientz, 2010). Moreover, it shows that our training feature, which let students practice peer assessment and see which type of feedback they should provide, did not equalize the effort students applied in the overall process. These results may shed more light on the findings of previous work, since several studies reported a lack of engagement from students in peer assessment. Our findings suggest that openness to experience may play a significant role in this relationship.

Nevertheless, some personality traits we investigated did not affect the peer assessment. Regarding extraversion, research has shown that extroverts desire social attention and tend to display positivity (Bowden-Green et al., 2020). In addition, extroverts are more likely to use social media, spend more time on one or more social media platforms, and regularly create content (Bowden-Green et al., 2020). In our case, we found contrasting results, with extraversion having no significant effect on the feedback length students provided each other. Taking into account the remote setting of the course and the features that Moodle shares with a social media platform, we expected extroverts to be more forthcoming in their feedback. We hypothesize that the lack of a significant relationship may stem from the other pole of extraversion: the remote setting and the ability to participate in the course without much exposure may have led introverts to be as motivated as extroverts to provide feedback to their peers.

Moreover, in contrast to other researchers (Huels and Parboteeah, 2019; Joyner et al., 2018), conscientiousness, the degree to which one prefers an organized life over a spontaneous one (Halko and Kientz, 2010; Morizot, 2014), did not have a significant effect on the students’ behaviors. In particular, we found that this trait had no significant effect on the efficacy or the efficiency of the peer assessment, although the results hinted at weak correlations (positive for efficacy, negative for efficiency). There are several explanations for these results. For instance, the faculty periodically reminded the students in the theoretical classes and on Moodle to complete their peer assessments, which provided an exterior stimulus to the students’ behavior patterns and may have led them to complete the peer assessments independently of their conscientiousness scores. Another factor may be the reward that students received for completing peer assessments. Either way, these factors acted as normalizing strategies that may have dampened the conscientiousness effect. Nevertheless, our findings show that agreeableness and openness to experience play a role in peer assessment. As such, practitioners must devise strategies to reduce the polarizing effect of personality traits to promote a stable and reliable peer assessment environment.

6.2 Design Implications

Based on our results, we devised a set of implications that can be useful for the pipeline of peer assessment systems, particularly for learning and training. Reminders to complete peer assessment assignments are positive. Results suggest a non-significant relationship between conscientiousness scores and efficiency and efficacy in the peer assessment process. Although people with high conscientiousness scores are more likely to complete their assignments on time, we cannot expect the same from people with lower scores. Tooltips in the form of reminders that appear after a certain period of the assignment may help the latter individuals, together with in-class reminders and grade incentives to complete the peer assignments. These tooltips may contain targeted reminder texts, since state-of-the-art research has shown that different personality types have distinct preferences for persuasive messages (Halko and Kientz, 2010; Anagnostopoulou et al., 2018).

Personality may jeopardize the fairness and reliability of the peer assessment. Although the previous strategy of persuasive periodic reminders to prompt an impartial grading process may also work for the agreeableness effect, there are other strategies to keep in mind. For instance, the system can provide examples next to the peer assessment input screen to prompt students to look at them and refresh their reviewing skills against the baseline set by the faculty. As another example, when the system picks the students who should grade a post, the selection can leverage agreeableness to balance the number of agreeable and disagreeable students in the subset, as sketched below. This balancing strategy may reduce the effect of agreeableness by giving equal weight to graders who tend to undervalue and to overvalue a submission.
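
As an illustration of this balancing idea, the earlier assignment sketch could be extended to alternate between the most and least agreeable eligible students; this is a hypothetical strategy we did not evaluate, and the names are ours.

```python
def balanced_reviewers(author, eligible, agreeableness, k=5):
    """Pick reviewers by alternating between the most and least agreeable
    eligible students, so lenient and strict graders offset each other."""
    pool = sorted((s for s in eligible if s != author),
                  key=lambda s: agreeableness[s])
    chosen = []
    while len(chosen) < k and pool:
        chosen.append(pool.pop())       # most agreeable remaining
        if len(chosen) < k and pool:
            chosen.append(pool.pop(0))  # least agreeable remaining
    return chosen
```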

The peer assessment system should empower closed-to-experience individuals. Since these individuals tend to be more conventional and less creative, peer assessment systems should provide features that help people with lower scores in openness to experience give feedback on their peers’ submissions. For instance, the system can contain predefined fields covering a range of criteria that the student can quickly fill in. The student can then provide complete feedback through auto-generated sentences based on their input options, as sketched below.
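
A minimal sketch of such predefined fields feeding auto-generated feedback follows; the criteria and phrasing are invented for illustration.

```python
# Hypothetical predefined criteria a less open student can quickly fill in.
TEMPLATES = {
    "clarity": "The submission is {} to follow.",
    "creativity": "The concept feels {}.",
    "technical": "The technical execution is {}.",
}

def generate_feedback(choices):
    """Turn option picks, e.g. {'clarity': 'easy'}, into full sentences."""
    return " ".join(TEMPLATES[c].format(v) for c, v in choices.items())

# Example: generate_feedback({"clarity": "easy", "technical": "solid"})
# -> "The submission is easy to follow. The technical execution is solid."
```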

6.3 Limitations

Although we found interesting trends in our results, some important factors may explain the lack of significance observed in some of them. For instance, MOOC students are strangers to their peers, whereas the MCP course is part of a Master’s degree, which increases the chance of students already being familiar with each other. As such, they may have created social bonds, inevitably biasing the peer grading. As previously discussed, a double-anonymized assessment could suppress this effect. However, the course does not have enough students to guarantee that the reviewers and the assessed student do not talk with one another while the review process is active, which would produce a flawed double-anonymized process. Moreover, a larger number of participants would have supported conclusions with a more substantial impact and allowed us to investigate whether the relationship between the participants, or any combination of both participants’ personality factors, affected the interactions. Another bias to consider is that our sample mainly comprises Portuguese individuals, which may introduce a cultural bias into our results.

7 Conclusions and Future Work

Our work aimed to investigate whether personality traits from the FFM influence how students perform their peer assessment in a semester-long online course. The results show that the personality traits of agreeableness and openness to experience are associated with how students evaluate and provide feedback to their peers. Therefore, these results indicate that, in the context of distance learning, the reliability and fairness of peer assessment may be compromised by students’ personality traits.

Future work includes investigating whether personality constructs with finer granularity, such as the facets of the FFM, play a role in this context. Indeed, facets such as trust and cooperation from the agreeableness trait, or assertiveness and friendliness from the extraversion trait, could play a decisive role in peer grading. Other psychological constructs may also be interesting to consider in this type of setting, such as creativity (Amabile, 2018) and the Locus of Control (LoC) (Rotter, 1954, 1966). In addition, we would like to conduct another study with a larger sample. Another factor to consider is the remote setting; we would like to study whether our results transfer to in-person courses or blended learning environments. Finally, our findings can be applied in the design pipeline of peer assessment systems as well as in systems that simulate the training and education of people, such as GIMME (Gomes et al., 2019). Furthermore, we should investigate how to account fairly for inexperienced graders, since grading patterns evolve with experience and knowledge of the subject matter. Leveraging how personality constructs affect the dynamics of these environments can empower researchers to increase the expressiveness of their models while taking a more human approach to integrating people into learning environments.