Experimental Measurement and Markov Chain Modelling of Stroke Order Intuition

As a feature of Chinese characters, stroke order plays an important role in Chinese character education, Chinese character recognition and handwriting identification. In recent years, however, research on stroke order has been limited to elaborating and supplementing the existing stroke order and its rules, and no study has focused on the principle behind stroke order. To quantify the internal regularity of stroke order, we start with the stroke order intuition in actual writing. Because stroke order differs between individuals, we select a group of students with small differences as testers. In this paper, we design an algorithm to generate random strokes and collect timing data while the testers copy the strokes. After pre-processing the collected data, a Markov chain is introduced to model stroke order intuition, which divides the stroke sequences into three states. We then classify the sequences into three situations according to the stroke distribution and compare them with some well-known stroke order rules. The results show that neither the probability distribution of strokes in the different situations nor the relations between strokes are always consistent with the empirical rules.


Introduction
Stroke order includes the writing direction of a single stroke and the writing order between strokes, as defined in the document [1] released by the Chinese Ministry of Education in 2021. Stroke order is the sequential expression of Chinese characters; because it is hidden in the actual writing process, it is often ignored. Research on stroke order is usually carried out together with research on Chinese characters. In traditional Chinese character recognition, stroke order is only an auxiliary feature that assists the implementation of algorithms. In education, whether in teaching Chinese as a foreign language or in pre-school education, stroke order is one of the foundations for getting started with Chinese characters. Although it is very helpful for learning Chinese characters, teachers simply require students to memorize the standardized stroke order [2]. In legal circles, stroke order is an important feature used in handwriting identification, but more attention is paid to the structural features of characters [3]. The above research is mostly based on the existing stroke order and its rules, without in-depth study of the internal regularity of stroke order. For in-depth research on Chinese characters, stroke order deserves more attention.
How can stroke order be analyzed? We know from the definitions in Huang's study [4] that there are four stroke orders. To stay close to the actual situation, we start with one of them, the stroke order intuition, which people form naturally in daily writing. According to the research on handwriting identification in legal circles, we should also not ignore the individual differences and group commonality of stroke order. Therefore, we select a group with relatively small differences, namely college students. To further reduce the differences, we need not only to consider the stroke order rules but also to simplify the complex strokes. For this reason, we do not use existing Chinese characters in the experiment; instead, we use a random algorithm to generate random lines that follow certain rules.
On the other hand, because the stroke order can be regarded as a continuous random stroke sequence, the Markov chain can be introduced for modelling [5]-[6]. Then, the complex stroke order process can be divided into the initial state, the steady state and the transition matrix. The initial state and the transition matrix can be obtained by statistical analysis of the experimental data, and the steady state needs to be calculated according to the matrix equation. Corresponding conclusions can be drawn by analyzing the differences between the three states obtained from the stroke order intuition and the well-known stroke order rules.
To simplify and summarize the different types of strokes, we omit the fold (乛) and replace the point (丶) with the right diagonal (㇏). Then, we take the midpoint of each stroke as the origin and encode the stroke in four directions. To define the direction vector, we need the start point and the end point of a single stroke. As shown in figure 1, the four most commonly used strokes, horizontal (一) / vertical (丨) / left diagonal (丿) / right diagonal (㇏), can be coded as 0 (-) / 2 (|) / 1 (/) / 3 (\). Therefore, the process of stroke order is simplified to writing and connecting the strokes on the two-dimensional plane.
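The four-direction encoding described above can be sketched as follows. This is only an illustrative sketch, not the program used in the experiment: the function name `encode_direction` is hypothetical, the y axis is assumed to point upwards (so that "/" rises to the right), and a stroke is assumed to snap to the nearest multiple of 45 degrees regardless of the direction in which it is drawn.

```python
import math


def encode_direction(start, end):
    """Map a stroke to the codes 0 (horizontal), 1 (left diagonal),
    2 (vertical) or 3 (right diagonal).

    Assumption: y grows upwards. With screen coordinates (y grows
    downwards) the codes 1 and 3 would be swapped.
    """
    dx = end[0] - start[0]
    dy = end[1] - start[1]
    if dx == 0 and dy == 0:
        raise ValueError("a stroke needs two distinct points")
    # Fold the direction into [0, 180): a line and its reverse share one code.
    angle = math.degrees(math.atan2(dy, dx)) % 180.0
    # Snap to the nearest multiple of 45 degrees; 180 wraps back to horizontal.
    return round(angle / 45.0) % 4
```

For example, under these assumptions a stroke from (0, 0) to (10, 0) is coded as 0 and a stroke from (0, 0) to (0, 10) is coded as 2.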

Experimental design
After simplifying the strokes, it is necessary to eliminate the interference of existing Chinese characters. We therefore use a random algorithm to generate random strokes instead of reproducing existing Chinese characters. Considering individual differences and group commonality, we selected students from the same class at the same university as testers, because universities have relatively strict requirements on writing and their students have similar educational levels. The experimental process is shown in figure 2.
The experiment is divided into two parts. In the first part, the random algorithm generates random strokes; in the second part, the testers copy the generated strokes. We collect the timing data during the second part of the experiment. We also set several variables to control the two parts of the experiment separately.
The first part of the experiment is the random generation algorithm with the following variables:
• 1. The slope between strokes should be greater than s degrees;
• 2. The length of strokes ranges from 0 pixels to r pixels;
• 3. The number of strokes, which is called variable n, should satisfy the distribution of Chinese characters.
The second part of the experiment is the copying process, which has its own variables (t, l and m). After testing about 100 random samples in different circumstances, we adjusted the variables as follows.
We set variable s to 4 pixels to avoid coinciding strokes and variable r to 600 pixels to keep the strokes within a certain range. According to the study by Dao Fu [8], variable n should be taken from 2 to 12. To check whether pairs of strokes and multiple strokes behave differently, we build two experiment cases: case 1 sets variable n to 2, and case 2 takes variable n from 3 to 12.
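A minimal sketch of such a generator is given below. It is not the algorithm used in the experiment; it only illustrates how the variables s, r and n could interact. Several points are assumptions: s is treated as a minimum angle in degrees between any two strokes, stroke midpoints are drawn uniformly from an r-by-r square, a lower length bound of 1 pixel avoids degenerate strokes, and candidates are simply rejected when they are too close in angle to an earlier stroke. The names `generate_strokes` and `line_angle_gap` are hypothetical.

```python
import math
import random


def line_angle_gap(a, b):
    """Smallest angle in degrees between two undirected lines."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)


def generate_strokes(n, s=4.0, r=600.0, rng=None, max_tries=10000):
    """Return n strokes as ((x0, y0), (x1, y1)) pairs.

    Every pair of strokes differs in angle by more than s degrees and
    no stroke is longer than r pixels (assumed reading of s and r).
    """
    rng = rng or random.Random()
    angles, strokes = [], []
    for _ in range(max_tries):
        if len(strokes) == n:
            break
        angle = rng.uniform(0.0, 180.0)
        # Reject candidates that nearly coincide in angle with an earlier stroke.
        if any(line_angle_gap(angle, a) <= s for a in angles):
            continue
        length = rng.uniform(1.0, r)  # assumed lower bound of 1 pixel
        cx, cy = rng.uniform(0.0, r), rng.uniform(0.0, r)  # midpoint
        hx = 0.5 * length * math.cos(math.radians(angle))
        hy = 0.5 * length * math.sin(math.radians(angle))
        # Endpoints are placed symmetrically around the midpoint and may
        # extend beyond the r-by-r square.
        strokes.append(((cx - hx, cy - hy), (cx + hx, cy + hy)))
        angles.append(angle)
    if len(strokes) < n:
        raise RuntimeError("could not place all strokes; relax s or raise max_tries")
    return strokes
```

Under these assumptions, case 1 would correspond to `generate_strokes(2)`, and case 2 to calls with n between 3 and 12. How n is drawn to follow the distribution of Chinese characters is left open here.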
We also set variable t to 5 and variable l to 2. These two values enable testers to complete the experiment cases in an appropriate time, with a copying length that is neither too long nor too short. We set variable m to 100 cases and 50 cases in the two experiment cases respectively.
After adjusting the variables, we distributed the experimental program to 114 and 119 testers for the two experiment cases respectively. The testers were asked to copy the random strokes generated by the algorithm several times, staying within the remaining length and the remaining time.
After the experiment, we collected 8191 valid data records. The data can be converted into a set of stroke sequences, each composed of a series of coordinate points and a time series.
Moreover, the stroke order sequence is continuous, and the next state $X^{(t+1)}$ is only related to the current state $X^{(t)}$. The past states are just a reminder that these strokes have already been written. So we can calculate the probability of the next state as follows:

$$P\left(X^{(t+1)} = x_{t+1} \mid X^{(t)} = x_t, X^{(t-1)} = x_{t-1}, \ldots, X^{(0)} = x_0\right) = P\left(X^{(t+1)} = x_{t+1} \mid X^{(t)} = x_t\right) \tag{1}$$

Here, $x_0, \ldots, x_{t-1}, x_t, x_{t+1} \in S$, and $S$ is the state space in which strokes are encoded in four directions.
As a result, the regularity in the past state cannot affect the next state, consistent with the memorylessness of the Markov chain.
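Under this first-order assumption, the initial distribution and the transition matrix can be estimated from coded sequences by simple counting. The sketch below is illustrative only: the function name `estimate_markov` is hypothetical, and the pre-processing used in the experiment may differ.

```python
from collections import Counter

STATES = (0, 1, 2, 3)  # horizontal, left diagonal, vertical, right diagonal


def estimate_markov(sequences):
    """Estimate the initial distribution and transition matrix by counting.

    `sequences` is a list of coded stroke sequences, e.g. [[0, 2, 1], ...].
    """
    first = Counter(seq[0] for seq in sequences if seq)
    total_first = sum(first.values())
    if total_first == 0:
        raise ValueError("at least one non-empty sequence is required")
    pairs = Counter()
    for seq in sequences:
        # Count every transition from one stroke to the next.
        pairs.update(zip(seq, seq[1:]))
    initial = [first[s] / total_first for s in STATES]
    matrix = []
    for i in STATES:
        row_total = sum(pairs[(i, j)] for j in STATES)
        # A state that never occurs as a predecessor gets an all-zero row.
        matrix.append([pairs[(i, j)] / row_total if row_total else 0.0
                       for j in STATES])
    return initial, matrix
```

Each non-empty row of the returned matrix sums to 1, so row i can be read as the distribution of the stroke that follows stroke i.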
The transition diagram for this process is drawn in figure 3 and has the following properties:
• 1. Irreducibility: Any one of the four basic strokes can return to itself repeatedly in the process. The strokes are connected in pairs, so every state is recurrent and the diagram has no transient state.
• 2. Aperiodicity: For any one of the four basic strokes, the greatest common divisor of the numbers of steps needed to return to that stroke is 1.
Taking state 0 as an example, a return may be (0 -> 1 -> 3 -> 0) or (0 -> 2 -> 0), which take three and two steps respectively. Therefore, state 0 is aperiodic, and the whole transition diagram also satisfies the definition of aperiodicity.
Figure 3. The transition diagram of the example.
From the above two properties, it can be inferred that the example random stroke sequence has a steady state on a long time scale. The example uses a subset of the four basic strokes. Therefore, it can be inferred that any sequence consisting of the four basic strokes is also a stationary Markov chain.
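The two properties can also be checked numerically. A finite chain is irreducible and aperiodic exactly when some power of its transition matrix has only positive entries, and for n states it is enough to test powers up to (n - 1)^2 + 1 (Wielandt's bound). The sketch below applies this test; the function name `is_primitive` and the example matrix, in which every stroke moves to each of the other three with equal probability, are hypothetical and are not taken from the experimental data.

```python
def is_primitive(matrix):
    """Return True if the chain is irreducible and aperiodic.

    Only the zero pattern of the matrix matters, so the powers are
    computed on a boolean reachability matrix.
    """
    n = len(matrix)
    adj = [[matrix[i][j] > 0 for j in range(n)] for i in range(n)]
    power = adj
    # Check the powers 1, 2, ..., (n - 1)^2 + 1.
    for _ in range((n - 1) ** 2 + 1):
        if all(all(row) for row in power):
            return True
        power = [[any(power[i][k] and adj[k][j] for k in range(n))
                  for j in range(n)]
                 for i in range(n)]
    return False


# Hypothetical example: no self-transitions, all other transitions allowed.
EXAMPLE = [[0.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
```

In this example, state 0 can return in two steps (0 -> 2 -> 0) or three steps (0 -> 1 -> 3 -> 0), so `is_primitive(EXAMPLE)` holds, whereas a chain that only alternates between two states fails the test.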

Experimental results
Having established the Markov properties of the stroke sequence, we classify the collected stroke sequences by the types of strokes they contain. There are four main strokes, so the number of cases in each situation is as follows:
• Situation 4: The sequence contains four basic strokes, and there is $\binom{4}{4} = 1$ case;
• Situation 3: The sequence contains three basic strokes, and there are $\binom{4}{3} = 4$ cases;
• Situation 2: The sequence contains two basic strokes, and there are $\binom{4}{2} = 6$ cases.
The situation in which the sequence contains only one stroke is omitted because it shows no regularity.
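The classification above amounts to counting the distinct codes in a sequence, and the case counts are the corresponding binomial coefficients. A short sketch, with the hypothetical helper names `situation` and `cases_per_situation`:

```python
from itertools import combinations
from math import comb

STATES = (0, 1, 2, 3)  # horizontal, left diagonal, vertical, right diagonal


def situation(sequence):
    """Number of distinct basic strokes in a coded sequence (1 to 4)."""
    return len(set(sequence))


def cases_per_situation():
    """List the stroke subsets that make up situations 4, 3 and 2."""
    return {k: list(combinations(STATES, k)) for k in (4, 3, 2)}
```

For instance, the sequence [0, 2, 0, 2] contains only horizontal and vertical strokes and therefore falls into situation 2, in the case (0, 2).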
Therefore, for a stroke sequence $X = \{X^{(0)}, X^{(1)}, \ldots, X^{(n)}\}$, its Markov properties allow us to work with the conditional probabilities $P(X^{(j)} \mid X^{(i)})$. Here, $i \in (1, 2, \ldots, n)$, $j \in (1, 2, \ldots, n)$, and $n$ is the length of the stroke sequence. After classifying by the types of strokes included, we know exactly which situation a stroke sequence belongs to. That is to say, we need to label all stroke sequences and count the occurrences behind the probabilities $P(X^{(j)} \mid X^{(i)})$ in each sequence. Summarizing these statistics for each situation gives the transition matrix of that situation, which is shown in table 2:
Table 2. Transition matrix of three situations.
Therefore, the steady state distribution can be calculated using the matrix properties by the following equation:

$$\pi P = \pi, \quad \text{i.e.} \quad \pi (P - E) = 0, \quad \sum_k \pi_k = 1$$

Here, $\pi$ is the stationary matrix, $P$ is the transition matrix, and $E$ is the unit matrix. After computation, we know that the initial state will reach a steady state on a long time scale, which can be drawn in

Data Analysis and discussion
In all situations, the initial state converges to the steady state after the transition matrix is applied several times, so we discuss the results in terms of these three states. In addition, among the widely accepted stroke order rules [4], only the following rules are categorized and discussed:
• Initial rules: From left to right, from top to bottom.
• Relation rules: First horizontal and then vertical, first left diagonal and then right diagonal.

Initial state
In table 3, all situations have a similar distribution of the initial state. We can still divide the four main strokes into two levels:
• The first level: horizontal (一), vertical (丨) and right diagonal (㇏). Their initial probabilities are relatively high, and their priority differs slightly under different circumstances.
• The second level: left diagonal (丿). Its initial state probability is the lowest, 4% to 14% lower than that of the other strokes.
It can be seen from the initial state distribution that there are priorities among the four basic strokes. The left diagonal (丿) has the lowest probability although it does conform to the empirical rules. In situation 4, horizontal (一) has the highest probability, and the other two are close; in situation 3, the probabilities are relatively equal; in situation 2, vertical (丨) has the lowest probability, and the other two are close. Table 2 shows the relationship between strokes. Situation 4 is similar to situation 3, and the differences between their transition matrices range from 0% to 2%. However, both of them differ from situation 2, with differences fluctuating from 0% to 13%.

Transition matrix
Despite the differences between the three situations, the strokes still show commonalities: a stroke tends to convert into another stroke rather than stay in the current state, especially in situation 2, where the differences range from 54% to 84%.
Horizontal (一) and vertical (丨) have similar probabilities of converting into other strokes, while left diagonal (丿) and right diagonal (㇏) tend to convert into each other rather than into the others.

Steady state
In table 4, all situations have a similar distribution of the steady state.
The probabilities of the four basic strokes are relatively equal, with differences ranging from 0% to 4%. When the number of strokes reaches twelve, all situations have reached a steady state in which the probabilities of the four basic strokes are nearly equal, with no obvious difference between them.
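The convergence described above can be illustrated by applying a transition matrix repeatedly to an initial distribution and comparing the result after twelve strokes with the stationary distribution. The sketch below uses power iteration; the matrix `EXAMPLE_P` is a hypothetical placeholder and does not reproduce the values of table 2, and the function names are illustrative.

```python
def step(dist, matrix):
    """One transition: the row vector dist multiplied by the matrix."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def distribution_after(initial, matrix, steps):
    """Distribution over strokes after the given number of transitions."""
    dist = list(initial)
    for _ in range(steps):
        dist = step(dist, matrix)
    return dist


def stationary(matrix, tol=1e-12, max_iter=10000):
    """Approximate the stationary distribution by power iteration."""
    n = len(matrix)
    dist = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = step(dist, matrix)
        if max(abs(a - b) for a, b in zip(dist, nxt)) < tol:
            return nxt
        dist = nxt
    raise RuntimeError("power iteration did not converge")


# Hypothetical transition matrix (each row sums to 1); not the measured data.
EXAMPLE_P = [
    [0.10, 0.30, 0.35, 0.25],
    [0.30, 0.10, 0.20, 0.40],
    [0.35, 0.25, 0.10, 0.30],
    [0.25, 0.40, 0.25, 0.10],
]
```

Calling `distribution_after([1.0, 0.0, 0.0, 0.0], EXAMPLE_P, 12)` and `stationary(EXAMPLE_P)` on this placeholder gives two distributions that agree closely, which mirrors the observation that the chain has settled by the twelfth stroke.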

Conclusion
This paper starts from the stroke order intuition when writing strokes and considers the writing commonalities of groups with small differences. We design a program that generates random strokes to further reduce the differences introduced by existing Chinese characters, collect the timing data in the experiment, introduce the Markov chain for modelling, and obtain the quantitative regularity under different situations. The regularity drawn from the data is not the same as the empirical stroke order rules.
In the initial state, the writing order of a single stroke is fixed. In this experiment, people usually start from horizontal (一), vertical (丨) and right diagonal (㇏), which is consistent with the initial rules rather than the relation rules. This is because, if we establish a coordinate system from the initial rules, these three strokes lie within this system, which leaves the left diagonal (丿) with a lower probability.
As for the relationship between strokes, the matrices show a strong correlation between the left diagonal (丿) and the right diagonal (㇏), and a weak correlation between the horizontal (一) and the vertical (丨). The results are partly in line with the relation rules, and at the same time an extra rule is revealed: a stroke tends to convert into another stroke rather than stay in the current state. This means that the relations are not limited to two pairs of strokes but show group transitions.
When the state stabilizes, that is, after about twelve strokes, each stroke has approximately the same probability. This trend may appear in most Chinese characters because most Chinese characters have no more than twelve strokes [8].
The above conclusions may be due to the shortest path principle and conformity to right-handed physiology, which differs from the actual situation. We only consider four well-known stroke order rules; the others call for further in-depth research on stroke order.