A MOOC Video Viewing Behavior Analysis Algorithm

Introduction
MOOCs [1] are massive open online courses, usually provided by universities and shared over the Internet [2,3]. Coursera, Udacity, and edX [4] are currently the three largest providers in the world. The Chinese University MOOC [5], launched by the Higher Education Agency of China, is the largest operator in China.
MOOCs [2-4] tend to have a large number of learners. These learners gain a new learning experience and a chance to take part in higher education. At the same time, feedback from the learners can help universities improve their courses.
MOOCs are developing rapidly, but they still face many problems, such as high dropout rates, low resource utilization, and the lack of an effective profit model [6]. Sometimes the number of enrolled learners reaches tens of thousands, but only 1% of them complete the course. Take Advanced Mathematics in the Chinese University MOOC as an example: it had 4317 students in 2017, while only 18 learners watched more than 50% of the videos. The utilization of course videos is strikingly low. How to make MOOC videos more attractive is therefore an important research topic.
Mathematical Problems in Engineering
Philip J. Guo [7] studied the influence of teachers' images, video length, and speech speed on learners. Based on data analysis and questionnaires, this research shows that videos of no more than 6 minutes work best. Many researchers explore the rules of online learning by analyzing learning behavior data, and video viewing behavior data is a research hotspot in which scholars have done much valuable work. Tanmay Sinha [8] operationalizes students' video lecture clickstreams into cognitively plausible higher-level behaviors. The results illustrate how such a metric, inspired by cognitive psychology, can help answer critical questions regarding students' engagement, their future click interactions, and the participation trajectories that lead to in-video and course dropouts. Nan Li et al. [9] present a video interaction analysis that provides empirical evidence on how interaction patterns relate to video difficulty. They find that speed decreases, frequent and long pauses, and infrequent seeking with a high amount of skipping and rewatching indicate a higher level of video difficulty. Geza Kovacs [10] analyzes how users interact with in-video quizzes and how in-video quizzes influence users' lecture viewing behavior. Through data analysis, this work identifies the peak period during which students think about the questions. Juho Kim et al. [11] identify five student activity patterns that can explain peaks: starting from the beginning of new material, returning to missed content, following a tutorial step, replaying a brief segment, and repeating a nonvisual explanation. Christopher G. Brinton [12] studied student video-watching behavior and quiz performance. Some of these behaviors are significantly correlated with changes in the likelihood that a student will be Correct on First Attempt in answering quiz questions, and in ways that are not necessarily intuitive. Qing Chen et al. [13] introduce a comprehensive visualization system called PeakVizor. This system enables course instructors and education experts to analyze the peaks, that is, the video segments that generate numerous clickstreams.
These studies provide many methods for analyzing video viewing behavior and propose options for optimizing video design. However, they are limited to preliminary statistics and data processing and do not analyze the functional relationship in depth. The purpose of our research is to build a mathematical model that describes the impact of video length and release time on a video's attraction.
The parameter $r$ is defined in this article as the ratio of average viewing duration to video length. Its value reflects the popularity of the video: the larger the value of $r$, the higher the popularity of the video. Our research shows that there are significant negative linear correlations between $r$ and video length, and between $r$ and release time. Overly long videos tend to have small values of $r$, because learners cannot stay interested in a video for a long time. At the same time, learners' interest in the later stage of a course is lower than in the initial stage. However, there is a threshold for this effect. When the number of videos is below a certain threshold, the learners' interest in the course videos does not fluctuate significantly.
Multiple regression is used to analyze the video viewing data in this article. Comparing the standardized regression coefficients shows that the release time has a much greater effect on the ratio $r$ than the video length. If the length of a single video is reduced, the total number of videos increases. So it is unwise to rely only on video segmentation to shorten the videos; a balance must be found between the number of videos and their length. The algorithm proposed in this paper can predict the attractiveness of course videos and help providers optimize video segmentation.

Data Description and Preprocessing
Mining MOOC learning behavior data [14,15] has become an important research field. Our data are provided by the Chinese University MOOC, which records the total time each learner spends watching each video. Only the impact of video length and video number on attraction is studied here. Although the content of each video differs and the teacher's performance fluctuates, these differences are not considered in this paper. The Advanced Mathematics course is chosen as the research object, and the data were collected from 2016.09.26 to 2016.11.26. Table 1 lists the basic information of the course. The course site is https://www.icourse163.org/course/NUDT-9004. The course contains 129 videos, and 27664 students enrolled in it. Many of these students did not watch any videos, and the data contain 41033 video viewing records.
The sampling period for the data is 10 seconds, so each viewing duration is an integer multiple of 10 seconds. The data collection method records the total duration for which a learner watches a video, so each learner has one record per video. Each learner's record is therefore an $n$-dimensional vector, where $n = 129$. Let the number of learners be $m$, where $m = 4805$. The viewing duration data then form a matrix $T = (t_{ij})_{m \times n}$. Since a very short viewing time is not enough for learning, the data must be preprocessed to remove invalid records. A histogram of the 41033 records is drawn to analyze these invalid behaviors.
As shown in Figure 1, many learners watched a video for a total of less than 60 seconds (red bar). These learners may have just browsed the video or opened it accidentally. Such records do not reflect effective learning behavior and should be filtered out.
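The filtering rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the record layout `(learner_id, video_id, seconds)`, the function name, and the sample values are assumptions made for this example; only the 60-second threshold comes from the text.

```python
MIN_VALID_SECONDS = 60  # threshold from the text: shorter total viewing is treated as invalid


def filter_valid_records(records, min_seconds=MIN_VALID_SECONDS):
    """Keep only records whose total viewing time is at least min_seconds.

    Each record is a hypothetical (learner_id, video_id, seconds) tuple.
    """
    return [rec for rec in records if rec[2] >= min_seconds]


# Hypothetical sample; durations are multiples of the 10-second sampling period.
records = [
    ("u1", 1, 10),
    ("u1", 2, 640),
    ("u2", 1, 50),
    ("u2", 2, 60),
    ("u3", 1, 1200),
]
valid = filter_valid_records(records)
print(len(records), "->", len(valid))
```

Keeping the threshold as a parameter makes it easy to check how sensitive later statistics are to the choice of 60 seconds.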

Ratio of Viewing Duration to Video Length
To analyze the relationship between the viewing duration and the length of the video, we calculate the average viewing duration of each video after removing the invalid data. Let $\bar{t}_j$ be the average viewing time of the $j$-th video, that is, the mean of the valid viewing durations $t_{ij}$ of the $j$-th video over the learners $i = 1, \ldots, m$, for $j = 1, \ldots, n$, where $m = 4805$ and $n = 129$.
The average viewing durations of the 129 videos and the video lengths are plotted in Figure 2. The blue curve in Figure 2 represents the learners' average viewing duration, and the red curve represents the video length. In Figure 2, most of the average viewing times are less than the video length. The average viewing time is sometimes greater than the length of the video, which is caused by learners viewing the same video repeatedly. Because the length of the video affects the average viewing duration, it is unreasonable to compare the average viewing times directly to determine the popularity of a video. The ratio of the average viewing duration to the length of the video is a better indicator of the students' preference for the video. Let the length of the $j$-th video be $V_j$ and the average viewing duration of the $j$-th video be $\bar{t}_j$. The ratio of the $j$-th average viewing duration to the video length is
$$r_j = \frac{\bar{t}_j}{V_j}, \quad j = 1, 2, \ldots, n.$$
Two rules can be found based on Figure 2. One is that the longer the video, the smaller the ratio $r$. The other is that the ratio $r$ in the later period of the course is smaller than in the early stage. Because the video number represents the order of release, it can represent the stage of the course. So we can conclude that $r$ is negatively related to the video length and the video number. However, these judgments are only qualitative and require more accurate quantitative analysis.
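As a small illustration of the ratio of average viewing duration to video length, the sketch below computes both quantities per video. The dictionary layout, the function name, and the sample numbers are assumptions for this example; the computation follows the definition in the text (mean of the valid durations, divided by the video length).

```python
def average_duration_and_ratio(durations_by_video, video_lengths):
    """Return {video_id: (mean_valid_duration, ratio)}.

    durations_by_video: {video_id: [valid viewing durations in seconds]}
    video_lengths:      {video_id: video length in seconds}
    """
    result = {}
    for vid, durations in durations_by_video.items():
        if not durations:
            continue  # no valid record for this video, so no average is defined
        mean_t = sum(durations) / len(durations)
        result[vid] = (mean_t, mean_t / video_lengths[vid])
    return result


# Hypothetical data: video 1 is short and often rewatched, video 2 is long.
durations_by_video = {1: [300, 600, 900], 2: [200, 400]}
video_lengths = {1: 600, 2: 1200}
stats = average_duration_and_ratio(durations_by_video, video_lengths)
for vid, (mean_t, ratio) in sorted(stats.items()):
    print(vid, mean_t, ratio)
```

A ratio above 1 is possible when learners rewatch a video, which matches the observation about repeated viewing.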

Correlation Analysis
The Pearson correlation coefficient, which has been widely used in many research fields [16], is a statistical indicator of the strength of the linear relationship between variables. In our study, the correlation coefficient is used to analyze the relationship between video length, video number, and the ratio $r$. This analysis also verifies the rationality of the regression analysis. Let
$$\rho_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{D(X)}\,\sqrt{D(Y)}},$$
where $\operatorname{Cov}(X, Y)$ is the covariance of $X$ and $Y$, and $D(\cdot)$ denotes the variance. When $|\rho_{XY}|$ is large, the mean squared error of the best linear fit is small, so $X$ and $Y$ show a strong linear relationship.
In order to analyze the relationship between $r$ and the video length and the video number, scatter plots are drawn. Figure 3 is a plot of $r$ against video length. The abscissa is $r$, and the ordinate is the video length. There is a clear negative linear correlation between them. The Pearson correlation coefficient is equal to −0.758, a significant negative correlation at the 0.01 level of significance ($p$-value < $10^{-6}$).
Because the video number corresponds to the release order, it can be used to reflect the progress of the course. The relationship between the video number and $r$ can be observed in Figure 4. The abscissa of Figure 4 is $r$, and the ordinate is the video number. The values of $r$ at the end of the course are smaller than at the beginning. The Pearson correlation coefficient is equal to −0.646, which shows a significant negative correlation between $r$ and video number ($p$-value < 0.05). Because the absolute value of the correlation coefficient represents the strength of correlation, comparing the two coefficients (−0.758 and −0.646) shows that the video length has a stronger linear correlation with $r$ than the video number. Figure 5 is a scatter plot of video length and video number. There is no visible linear relationship between them. In fact, the length of a video is determined by its content, so the video length and video number should be independent. Figure 6 shows a three-dimensional scatter plot of $r$, video length, and video number. The vertical coordinate is $r$, the abscissa is the video length, and the ordinate is the video number. The distribution of points is nearly linear. Therefore, multiple linear regression [17] can be used to establish the functional relationship between $r$ and the video length and video number.
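For readers who want to reproduce this kind of correlation analysis, the following is a minimal pure-Python sketch of the Pearson correlation coefficient. The function name and the sample values are hypothetical; they are not the course data and will not reproduce the coefficients reported above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # co-variation
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values: longer videos tend to have a smaller ratio.
video_length = [300, 600, 900, 1200]   # seconds
ratio = [0.9, 0.8, 0.6, 0.5]           # average viewing duration / video length
print(round(pearson(video_length, ratio), 3))
```

The sign of the result gives the direction of the linear relationship and its absolute value gives the strength, which is how the two coefficients in the text are compared.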

Multiple Linear Regression
Algorithm.
The ratio $r$ is affected by the length of the video and the time of its release. From Figures 3 and 4, it can be seen that $r$ exhibits a significant negative linear correlation with each of them. Here, $r$ is used as the dependent variable $y$. The independent variables are the video length $x_1$ and the video number $x_2$.
The dependent variable and the independent variables can be expressed as a three-dimensional vector. There are $n$ sets of data, and the $i$-th set of data is recorded as $(y_i, x_{i1}, x_{i2})$. When there is a linear relationship between $y$ and $x_1$, and between $y$ and $x_2$, a multiple linear regression model can be used to describe the relationship between them:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i, \quad i = 1, 2, \ldots, n,$$
where $\beta_0$ is a constant coefficient and $\beta_j$ ($j = 1, 2$) is a partial regression coefficient.
$\varepsilon_i$ represents the residual. The residual is the difference between the actual measured data and the estimated value of the dependent variable, which is not determined by the independent variables. The following expression calculates the sum of squared differences between the predicted and measured values:
$$Q = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} \right)^2.$$
The coefficients $\beta_0$, $\beta_1$, and $\beta_2$ are solved so that the sum of squared residuals $Q$ takes its minimum value. From Figures 3 and 4, it can be found that there are several abnormal data points. These abnormal data may influence the calculation of the regression coefficients, so an algorithm is needed to filter them out. According to the Durbin-Watson test theory [18], the points whose residuals fall outside the interval $(\tilde{\mu} - 3\tilde{\sigma}, \tilde{\mu} + 3\tilde{\sigma})$ are filtered out, where $\tilde{\mu}$ is the mean of the residuals and $\tilde{\sigma}$ is the standard deviation of the residuals. The process can be divided into the following 5 steps.

Regression Coefficient and Residual Test.
After the previous 5 steps, the abnormal data are filtered out. Multiple linear regression is performed using the filtered data sets $\{y'_i\}$, $\{x'_{i1}\}$, and $\{x'_{i2}\}$. Let the regression function be
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2.$$
It is not enough to calculate the regression equation; the correctness of the model must also be verified. The verification method is to test the residuals $\{\varepsilon_i\}$, $i = 1, 2, \ldots, n$, for independence. When the residuals obey the normal distribution $N(0, \sigma^2)$, the model approximates the data well. The Kolmogorov-Smirnov test [19] can be used here.
Note that the constant coefficient $\beta_0$ here is 0, since the regression in this step uses standardized variables. The results are shown in Table 2. Comparing the two standardized regression parameters in Table 2 leads to a clear conclusion: compared to the length of the video, the video number has a greater influence on the ratio $r$. The coefficients in Table 2 cannot be used for data fitting. Their purpose is to calculate residuals and find abnormal points. Table 3 lists the residual statistics. The mean of the residuals is less than $10^{-7}$ and the variance is 0.1053. According to the previously proposed algorithm, abnormal data are removed. In this course, only one data set needs to be removed. After filtering, there are 128 sets of data, which are marked $\{y'_i\}$, $\{x'_{i1}\}$, and $\{x'_{i2}\}$. They are used in the multiple regression analysis to obtain new coefficients.
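The fit, filter, and refit procedure described above can be sketched as follows. This is an illustrative reading under stated assumptions, not the authors' code: the function names and the synthetic data are hypothetical, the residual band uses the mean plus or minus three standard deviations, and the fit is done on raw rather than standardized variables. For a least-squares fit with an intercept, standardizing only rescales the residuals by a constant, so the three-sigma decision is the same.

```python
import math


def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ols(rows):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2; rows are (y, x1, x2) tuples."""
    design = [(1.0, x1, x2) for _, x1, x2 in rows]
    ys = [y for y, _, _ in rows]
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(3)] for i in range(3)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(3)]
    return solve_linear(xtx, xty)  # normal equations


def residuals(rows, coef):
    b0, b1, b2 = coef
    return [y - (b0 + b1 * x1 + b2 * x2) for y, x1, x2 in rows]


def drop_outliers(rows):
    """One pass: fit, then drop rows whose residual lies outside mean +/- 3 std."""
    res = residuals(rows, fit_ols(rows))
    mean = sum(res) / len(res)
    std = math.sqrt(sum((e - mean) ** 2 for e in res) / len(res))
    return [row for row, e in zip(rows, res) if abs(e - mean) <= 3 * std]


# Synthetic data on an exact plane, with one injected abnormal point.
rows = []
for i in range(30):
    x1 = 120 + 60 * ((i * 7) % 13)      # hypothetical video length in seconds
    x2 = i + 1                          # video number
    y = 1.2 - 0.0005 * x1 - 0.005 * x2  # hypothetical ratio
    if i == 15:
        y += 0.5                        # abnormal point
    rows.append((y, x1, x2))

cleaned = drop_outliers(rows)
coef = fit_ols(cleaned)                 # refit on the filtered data
print(len(rows), "->", len(cleaned))
print(coef)
```

Solving the normal equations directly keeps the sketch dependency-free; for real data a numerically more robust least-squares routine from a linear algebra library would be a reasonable substitute.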

Calculate the Regression Coefficient.
The unstandardized regression coefficients can be used to predict the ratio $r$, while the standardized regression coefficients can be used to compare the influence of video length and video number on $r$. Here, both are calculated.
The data in Table 4 show that the standardized coefficients of the video number and video length variables are −0.622 and −0.155, respectively. For this course, the video number had about four times the influence of the video length on $r$. The regression equation can be used to calculate the predicted ratio $\hat{r}$. The blue line in Figure 7 is calculated from the data, and the red line is the curve predicted by the regression equation. As can be seen from Figure 8, the predicted value reflects the change trend of $r$. The regression equation can be used for the video design of similar courses. Given the video length and video number, the values of $r$ can be predicted, and the predicted values $\hat{r}$ can serve as an important reference. The video producer can optimize the design by segmenting or merging videos to increase their attractiveness. Because the appeal of a video is influenced by the course content, it is important to emphasize that the parameters used for prediction should come from similar courses.
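The comparison of influence can be made explicit with a short worked calculation. The first line is the usual textbook relation between unstandardized and standardized coefficients; the symbols $b_j$, $s_{x_j}$, and $s_y$ are notation assumed here and do not appear in the original. The second line uses only the two standardized coefficients quoted from Table 4.

```latex
% b_j: unstandardized coefficient; s_{x_j}, s_y: sample standard deviations (assumed notation)
\beta_j^{*} = b_j \,\frac{s_{x_j}}{s_y}, \qquad j = 1, 2

% Relative influence of video number versus video length (values from Table 4)
\frac{\left|\beta_{\text{number}}^{*}\right|}{\left|\beta_{\text{length}}^{*}\right|}
  = \frac{0.622}{0.155} \approx 4.01
```

Because standardized coefficients are unit-free, this ratio compares the two predictors even though one is measured in seconds and the other is an ordinal index.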

Residual Test.
The residual independence test is a very important step in the algorithm. When the residuals satisfy the normal distribution [20] $N(0, \sigma^2)$, the accuracy of the mathematical model can be guaranteed. Figure 9 is a histogram of the residuals, which is graphically very close to the normal distribution. The Kolmogorov-Smirnov test [19] is used to examine whether the residuals are normally distributed. The result shows that the residuals follow a normal distribution at a significance level of 0.05. The mean of the residuals is less than $10^{-7}$, and the standard deviation is equal to 0.12374. This means that the mathematical model approximates the data closely.
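A one-sample Kolmogorov-Smirnov check of normality can be sketched as below. This is a simplified illustration, not the procedure used to produce the reported result: the function names and sample data are hypothetical, the normal parameters are estimated from the sample itself, and the critical value is the standard asymptotic approximation for a fully specified distribution. With estimated parameters this makes the check conservative.

```python
import math
from statistics import NormalDist


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic_normal(sample):
    """Largest gap between the empirical CDF and a normal CDF fitted to the sample."""
    n = len(sample)
    mu = sum(sample) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in sample) / (n - 1))
    d = 0.0
    for i, x in enumerate(sorted(sample)):
        f = normal_cdf(x, mu, sigma)
        # compare with the empirical CDF just after and just before x
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_critical_value(n, alpha=0.05):
    """Asymptotic critical value sqrt(-ln(alpha / 2) / 2) / sqrt(n)."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) / math.sqrt(n)


n = 128
# Hypothetical residuals: evenly spaced quantiles of N(0, 0.12^2).
normal_like = [NormalDist(0.0, 0.12).inv_cdf((i + 0.5) / n) for i in range(n)]
# Hypothetical strongly skewed residuals for contrast.
skewed = [math.exp(i / 8.0) for i in range(n)]

crit = ks_critical_value(n)
print(ks_statistic_normal(normal_like) < crit)  # normality not rejected
print(ks_statistic_normal(skewed) < crit)       # normality rejected
```

Normality is not rejected at level alpha when the statistic stays below the critical value, which corresponds to the conclusion stated for the residuals in the text.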

Conclusion
This article analyzes the relationship between video length, video number, and the ratio $r$ of average viewing duration to video length. Our research shows that there is a significant negative linear correlation between video length and $r$, and between video number and $r$. The video length and video number are independent. Based on these results, a multiple linear regression equation is used to establish the functional relationship between them. The Durbin-Watson test theory [18] and the 3σ principle helped us remove abnormal data effectively. Because abnormal data have a significant effect on small samples, removing them is necessary. The standardized regression coefficients in Tables 2 and 4 both indicate that the video number has a greater influence on $r$.
To reduce the length of the videos, their content needs to be decomposed. However, decomposition increases the number of videos. Therefore, reducing the video length and reducing the number of videos cannot be achieved at the same time, and a balance must be found between the two.
We also found an interesting rule. When the number of course videos is small, video appeal is hardly affected by the release time. For courses with fewer than 40 videos in total, the value of $r$ is hardly affected by the video number. This means that learners' "patience" with the number of videos has a threshold.
Two courses of the Chinese University MOOC, Game Theory and First Aid Knowledge, were chosen to illustrate this phenomenon. Game Theory has 38 videos, and First Aid Knowledge contains 19 videos. It can be seen from Figure 9 that $r$ is hardly affected by the release time. There is a threshold for learners' patience with the number of videos, so the number of videos for a course should be kept within this threshold. However, we do not claim that the threshold is exactly 40; determining it requires more in-depth study. When the content of a course is really large, it can be split into sub-courses that are released separately. Each sub-course is assessed and certified independently, which can help sustain learners' interest.
So the following suggestions are given for video design.
(i) Do not publish videos that are too long. The teaching content must be streamlined to reduce the length of the video. Philip J. Guo [7] suggests that videos of no more than 6 minutes work best.
(ii) Try to keep the number of videos within the learners' patience threshold. When this limit is exceeded, the release time of a video seriously affects the ratio $r$.
(iii) Do not publish long videos late in the course. The value of $r$ for such videos is often very low.
(iv) After the videos are recorded, they must be edited to balance the number and length of the videos. When there are too many videos, consider splitting them into several courses and publishing them separately.
(v) The method proposed in this article can predict $r$. MOOC operators can use this method to establish multiple linear regression models for various courses. This will help video producers optimize the video design and make the videos more attractive. To ensure accuracy, the regression equation used for prediction should come from similar courses.
In future work, we will study the best number of videos for a course. The courses studied in this article are all taught in Chinese. Whether the teaching language influences the regression coefficients is another issue that concerns us. We expect that researchers with such data sets will be able to give answers.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.