Non-Hierarchical Clustering as a method to analyse an open-ended questionnaire on algebraic thinking

[...] 1ARc. This strategy is based on arithmetic, combined with a geometrical approach. The student draws a rectangle with dimensions (x; (x + 26)) and, in order to find the requested result, calculates the rectangle's area. He/she still proceeds by trial and error: (1 + 26) * 1; (2 + 26) * 2; (3 + 26) * 3; ... ; (11 + 26) * 11; (12 + 26) * 12 = 456. When he/she obtains the value 456, he/she finds that the value of x, i.e. the son's age, is 12.
1ALa. The student formalises the question in algebraic language and writes the formula representing it: (x + 26) * x = 456, where x represents the son's age. He/she solves this equation by using one of the algebraic methods he/she knows. This strategy highlights some understanding of symbolism and abstraction, and the explicit use of the x variable could suggest the presence of some form of algebraic thought.
1ALb. The student formalises the question in algebraic language. He/she writes a system of equations representing the question and solves it by using one of the algebraic methods he/she knows. In the system, the x variable is the father's age and the y variable is the son's. This strategy highlights the presence of algebraic thought and good abstraction skills in the student.
2ARa. The student tries to answer the question with a series of approximations: 25 * 1; 25 * 2; 25 * 3; ... ; 25 * 8; 25 * 9; 25 * 10 = 250. Once he/she has arrived at this result, he/she reads the question again and performs the subtraction 250 – 10 = 240. This is the actual cost of the soccer balls in the exercise. The student therefore decides that the number of balls actually bought by the football club is 10. In fact, (€25 * 10) – €10 = €240.
2ARb. The student tries to repeatedly add the cost of the soccer balls: 25 + 25 = 50; 25 + 25 + 25 = 75; ... ; 25 + 25 + 25 + 25 + 25 + 25 + 25 + 25 + 25 + 25 = 250. With a €10 discount on the total price, the football club was able to buy 10 soccer balls.
2ARc. The student tries to repeatedly subtract the cost of a ball from the total amount spent: 240 – 25 = 215; 240 – 25 – 25 = 190; ... ; 240 – 25 – 25 – 25 – 25 – 25 – 25 – 25 – 25 – 25 = 15. Once he/she has arrived at this result, the student thinks about the discount: with a €10 discount on the total price, the football club is able to buy one more soccer ball. In fact, €15 + €10 discount = €25, i.e. the cost of a soccer ball. Therefore, the football club can buy 10 soccer balls.
2ARd. The student solves the problem by thinking about 'unitary cost', 'total cost' and 'discount'. He/she takes into consideration the arithmetic expression (240 + 10), involving the total cost (€240) plus the discount (€10), and divides the result by the unitary cost of the balls (€25). Therefore, with the calculation (240 + 10) / 25 = 10, he/she finds the total number of soccer balls bought by the team.
2ARe. The student formalises the problem and obtains the equation 25x – 10 = 240. However, he/she solves it by a trial and error procedure on the x value, following an arithmetic procedure.
2ALa. The student formalises the problem and obtains the equation 25x – 10 = 240. He/she solves it by using one of the algebraic methods he/she knows and finds x, representing the number of soccer balls bought.
2ALb. The student formalises the problem and writes an algebraic proportion, highlighting the cost of a single soccer ball and the total cost. He/she, therefore, writes: 25 (cost of one soccer ball) : 1 (one soccer ball) = T (total cost) : x (number of balls bought). However, the student is not able to find the right value of T and so he/she cannot find x.
12 Di Paola, Battaglia, Fazio
2ALc. The student formalises the problem and writes an algebraic proportion, highlighting the cost of a single soccer ball and the total cost. He/she, therefore, writes: 25 (cost of one soccer ball) : 1 (one soccer ball) = T (total cost) : x (number of balls bought). In order to find T, the student adds the discount (€10.00) to €240.00. He/she then calculates x by using the proportion rules.
3ARa. The student does not follow a specific logical line in choosing the triad of numbers required by the question. In particular, he/she does not choose three consecutive numbers and goes on more or less by chance, eventually finding the right result.
3ARb. The student tries to answer the question by several attempts. He/she first tries the triad 1, 2, 3 and verifies whether it fits the requirements of the question. As this is not the case, he/she tries again with 2, 3, 4 and then with 3, 4, 5. In this last case, the sum of the squares is 50, so the student finds his/her answer. After this, he/she does not check whether other triads of numbers satisfy the question requirements.
3ARc. The student follows the steps described in strategy 3ARb, but chooses negative, consecutive numbers. So, he/she first tries the triad (-3, -2, -1), then (-4, -3, -2), and finally finds that (-5, -4, -3) fits the question requirements. After this, he/she does not check whether other triads of numbers satisfy the question requirements.
3ALa. The student formalises the problem and writes the formula: x² + y² + z² = 50. However, no relationships between x, y and z are found, and so he/she is not able to solve the problem.
3ALb. The student formalises the problem and writes the formula: x² + (x + 1)² + (x + 2)² = 50. In order to solve it, he/she uses a trial and error, arithmetic procedure.
3ALc. The student formalises the problem and writes the formula: x² + (x + 1)² + (x + 2)² = 50. He/she solves it by using one of the algebraic methods he/she knows. By following this procedure, the student finds all the possible triads of consecutive integers that solve the problem: (3, 4, 5) and (-5, -4, -3).
3ALd. The student formalises the problem, writes a system of 3 equations with 3 variables and solves it by using one of the algebraic methods he/she knows.
4ARa. The student answers "no", without further explanation.
4ARb. The student tries to solve the problem by a trial and error, arithmetic procedure, randomly searching numbers.
4ARc. The student decides to proceed by successive approximations. He/she starts from 1 and performs the calculations described in the text: he/she adds 4 to 1 and then multiplies the result by 80, verifying that the obtained value is less than 2360. He/she continues with numbers greater than 1 until he/she finds that, by using 25, the result is 2320, i.e. (25 + 4) * 80 = 2320, but by using 26, the result is 2400, which is greater than the required value (2360). As there are no other integers between 25 and 26, the student concludes that the answer to the question is "no".
4ARd. The student draws a rectangle ((x + 4); 80) and bases his/her reasoning on the fact that the area of this rectangle, according to the question data, must be 2360. He/she goes on by a trial and error, arithmetic procedure: (1 + 4) * 80; (2 + 4) * 80; (3 + 4) * 80; ... ; (24 + 4) * 80; (25 + 4) * 80 = 2320; (26 + 4) * 80 = 2400 > 2360. As there are no other integers between 25 and 26, the student concludes that the answer to the question is "no".
4ALa. The student formalises the problem and writes the equation: x + 4 * 80 = 2360. He/she solves it, but finds the wrong result.
4ALb. The student formalises the problem and writes the equation: (x + 4) * 80 = 2360. He/she tries to solve it by a trial and error procedure but does not find a result.
4ALc. The student formalises the problem and writes the equation: (x + 4) * 80 = 2360. He/she solves it, but does not properly use the distributive property of multiplication over addition.
4ALd. The student formalises the problem and writes the equation: (x + 4) * 80 = 2360. He/she solves it by using one of the algebraic methods he/she knows.
5ARa. The student tries to solve the algebraic expression by successive approximations on the x variable. He/she proceeds by trial and error, randomly choosing values. In this way, after many calculations, the student finds the value 0, and considers it the only correct solution.
South African Journal of Education, Volume 36, Number 1, February 2016 13
5ARb. The student tries to solve the algebraic expression by successive approximations on the x variable. He/she proceeds by trial and error, randomly choosing values. In this way, after many calculations, he/she finds both solutions.
5ARc. The student tries to solve the algebraic expression by successive approximations on the x variable. He/she proceeds by trial and error, choosing values in ascending order (0, 2, 7, ...). In this way the student finds the value 0 and considers it the only correct solution.
5ARd. The student tries to solve the algebraic expression by successive approximations on the x variable. He/she proceeds by choosing values in ascending order (0, 1, 2, 3, ...). In this way the student finds both solutions.
5ALa. The student tries to simplify the algebraic expression, but fails to do so. He/she then uses an arithmetic approach and solves the problem.
5ALb. The student solves the algebraic expression, blindly performing all the calculations.
5ALc. The student sees that it is possible to rewrite the expression in a more synthetic way. He/she does so, and therefore easily solves the equation.
6ARa. The student tries to solve the algebraic expression by successive approximations on the x variable. He/she proceeds by trial and error, randomly choosing values. In this way, after many calculations, the student finds the value 0, and considers it the only correct solution.
6ARb. The student tries to solve the algebraic expression by successive approximations on the x variable. He/she proceeds by trial and error, randomly choosing values. In this way, after many calculations, he/she finds both solutions.
6ARc. The student tries to solve the algebraic expression by successive approximations on the x variable. He/she proceeds by trial and error, choosing values in ascending order (0, 2, 7, ...). In this way the student finds the v
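The numerical results quoted in the strategies for the first four questions can be checked with a short Python sketch. The function names, default arguments and search bounds are illustrative assumptions, and the problem data are inferred from the strategy descriptions rather than taken from the questionnaire text.

```python
# Brute-force checks mirroring the trial-and-error strategies described above.
# All names and search limits are hypothetical choices made for this sketch.

def son_age(product=456, gap=26, limit=100):
    """First positive integer x with (x + gap) * x == product, or None."""
    for x in range(1, limit + 1):
        if (x + gap) * x == product:
            return x
    return None


def balls_bought(spent=240, unit_cost=25, discount=10):
    """Integer x with unit_cost * x - discount == spent, or None."""
    total = spent + discount
    return total // unit_cost if total % unit_cost == 0 else None


def consecutive_triads(target=50, search=range(-20, 21)):
    """All triads (x, x + 1, x + 2) whose squares sum to target."""
    return [(x, x + 1, x + 2) for x in search
            if x ** 2 + (x + 1) ** 2 + (x + 2) ** 2 == target]


def integer_solution_q4(result=2360, addend=4, factor=80):
    """Integer x with (x + addend) * factor == result, or None if there is none."""
    if result % factor != 0:
        return None
    return result // factor - addend


print(son_age(), balls_bought(), consecutive_triads(), integer_solution_q4())
```

The last function returns None for the data of the fourth question, which agrees with the answer "no" reached in strategies 4ARc and 4ARd.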


Introduction
Extensive qualitative research involving open-answer questionnaires, together with standardised multiple-choice tests, provides instructors with tools to probe students' conceptual knowledge of various fields of science and mathematics. In recent years, some papers have tried to develop detailed models of the reasoning competences of the student populations tested, or to subdivide a sample of students into intellectually similar subgroups, by using quantitative or qualitative analysis methods (Ayene, Kriek & Damtie, 2011; Bao & Redish, 2006; Cohen, Manion & Morrison, 2000; Fazio & Spagnolo, 2008; Prediger, Bikner-Ahsbahs & Arzarello, 2008; Springuel, Wittmann & Thompson, 2007; Walsh, Howard & Bowe, 2007). In this paper, we discuss the application of a quantitative non-hierarchical clustering analysis method known as k-means (Everitt, Landau, Leese & Stahl, 2011), to make sense of answers given by 118 Tenth Grade students (14-15 years old) from Palermo, Italy, to six open-ended questions on algebraic thinking. It is worth noting that research papers using quantitative analysis methods to study student responses to open-ended questionnaires can be found in physics education (Springuel et al., 2007; Wittmann & Scherr, 2002), but the same cannot be said for research in mathematics education, with some notable exceptions (Gras, Suzuki, Guillet & Spagnolo, 2008).
In this paper, we chose to discuss the use of quantitative analysis methods in the specific domain of Algebra because, as is well known, the problem of studying the reasoning of students tackling mathematics problems in algebraic contexts is relevant in mathematics education. There are many results in the literature devoted to this subject that are obtained by means of qualitative analysis methods (e.g. Arzarello, Bazzini & Chiappini, 2002; Kieran, 2004; Sfard, 1995). They can be compared with our results, in order to verify the real efficacy of the quantitative, non-hierarchical clustering analysis methods we propose.
In particular, we discuss here the results of an empirical study aimed at quantitatively finding the typical behaviours of students in tackling the algebraic resolution of word problems (Bednarz & Janvier, 1996; Boero, 2001; Clement, 1982) and, at the same time, at understanding how students semantically and syntactically control questions containing symbolic algebraic expressions (Filloy & Rojano, 1989; Radford & Puig, 2007).
Our decision to refer to word problems, in line with the Programme for International Student Assessment (PISA), allows us to study student literacy in using algebra (Bohlmann, Straehler-Pohl & Gellert, 2014) and in the transition from arithmetic to the modelling of problems expressed in a non-symbolic language that, following Arzarello et al. (2002), we call natural language.
In the next section, we discuss some of the research results obtained in this field. These results will then be useful for understanding the results of our quantitative analysis. The main hypothesis of our research is that an analysis of student answers based on the k-means method allows the researcher to safely partition students into groups that can be characterised by common traits in their answers, without any prior researcher knowledge of what form those groups would take. For this reason, we did not perform an a-priori analysis of the student behaviour as is done in other types of research (Brousseau, 1987). Rather, we conducted an a-posteriori analysis that was based on the answering strategies actually used by the students when tackling the problems proposed by the researchers.
The choice to specifically use the k-means method is also due to the fact that this method allows the researcher to visualise the student behaviour in a Cartesian graph that can be quickly and easily read and discussed.

Theoretical Framework on Algebraic Thinking
The complexity in defining the meaning of algebraic thinking is evident. Although many studies done by mathematics educators and historians (Bagni, 2000; Rogers, 2002) have made important contributions to this question (e.g. Arzarello, Robutti & Bazzini, 2005; Boero, 2001; Carraher, Schliemann, Brizuela & Earnest, 2006; Lins & Kaput, 2004; Ursini & Trigueros, 2001), we still do not have a sharp, concise and shared definition of the concept of algebraic thinking. For example, according to Schoenfeld and Arcavi (1988), algebraic thinking is a particular form of mathematical reflection. In the following, we briefly report some related literature results.
Some research studies concerning the didactics of algebra discuss how learning to solve problems using symbolic algebraic language can be hard for students (Bohlmann et al., 2014; Palm, 2009). Students often have difficulty in working with algebraic equations, and it is hard for them to learn the ways in which the symbols should be manipulated to reach solutions, even in simple equations.
Considering also the cognitive processes used by students to solve problems containing symbolic expressions, other studies underline students' lack of awareness of both the structural and operational aspects (Meyer, 2013) related to this kind of algebraic symbolisation. In this sense, Arzarello et al. (2002) have shown that symbolising is a game of interpretation where, through a continuous and lengthy process, more sophisticated conceptual structures are activated, until the student's stream of thought defines its temporal, spatial and logical features into an act of autonomous thought. A key aspect of this process is the relationship between the signs and terms of an algebraic expression (Radford, 2010).
According to the results reported in the literature, solving a non-algebraic problem with the help of algebra requires a student to represent and re-code this problem with algebraic symbols, and this implies the activation of different paths of reasoning with respect to the resolution of the problem itself (Arzarello et al., 2002). Other researchers have shown, in fact, that in the case of problems expressed in non-symbolic language, such as word problems, students often have difficulty representing the given information in symbolic language.
Many factors have been found to contribute to these difficulties. Several research studies have identified contextual and grammatical features of word problems that affect students' success in solving them (Bednarz & Janvier, 1996; Chiappini & Lemut, 1991).
Given our specific subject, algebraic thinking, we finally referred to literature results on the problem of the transition between arithmetic and algebra, and on the potential errors that may emerge from this crucial pairing. The passage between arithmetic and algebra is, in fact, another problematic aspect of algebraic thinking (Kieran, 1992). According to Sfard, the content of an algebraic expression is often a generalisation of an arithmetical narrative (Caspi & Sfard, 2012; Sfard, 1995). Thus, the strength of symbolic language lies not only in being able to address arithmetic generalisations, but also in being able to address a pattern or structure by which one can solve types of problems. This forms the core of algebraic thinking, but often it is not mastered by secondary school students, especially in the resolution of problems that are not expressed in a symbolic way, such as word problems.

Sample and Questionnaire
The research we describe here is based on the analysis of the answers given by 118 Tenth Grade students from Palermo, Italy, to six open-ended questions on the use of algebraic thinking. The questionnaire, already validated in previous research (Benfanti, Di Paola & Raimondi, 2005), was answered by students in a maximum of 45 minutes. It was administered to the students at the beginning of the school year, before any discussion about algebra had taken place. The questionnaire is shown in Appendix A.
Following Clement (1982), as well as Franco de Sá and Fossa (2012), the questionnaire is composed of problems expressed in two different languages, namely natural language and symbolic language, as typified by algebra.
The first four problems are expressed in natural language, i.e. they present a succession of information given in informal, everyday language. Their aim is to evaluate the skills of students in translating a word problem into a symbolic language (Arzarello et al., 2002; Bednarz & Janvier, 1996; Boero, 2001). More specifically, the first two problems have a narrative structure. The third and fourth problems are still expressed in natural language, but are synthetically and explicitly stated.
According to Arzarello et al. (2002), this kind of question could lead the students to not use algebra at all, persisting in the exclusive use of arithmetic methods (i.e. to solve the problems with trial-and-error, numeric methods).
The last two problems are expressed in symbolic language. They are two rather classic algebraic problems, used to study student semantic and syntactic control (Radford & Puig, 2007).

Quantitative Analysis
The quantitative analysis methods that we use in this study are based on clustering techniques. They allow us to partition the students into sub-groups on the basis of their typical behaviour, with respect to the way they tackle the questionnaire.
Cluster Analysis (Everitt et al., 2011) aims at classifying subject behaviours in different groups, or clusters. These can be analysed in order to deduce their distinctive characteristics and to point out similarities and differences between them. The clustering techniques can be divided into two main families, namely hierarchical and non-hierarchical (Everitt et al., 2011). Here, we will discuss only the use of a specific non-hierarchical clustering method, called k-means. We start from the definition of a parameter that can be used to define the "likeness" (or the unlikeness) of the elements in the sample we want to analyse, in our case, the students. As the k-means method is based on geometric considerations, it is natural to use a metric to measure the likeness between two elements. In the next sections we will discuss how to build a correlation coefficient between the elements, and how it can be used to define a distance between students.
Many other techniques are used in the literature to study the likeness of elements in a set. We cite here the likelihood index, first proposed by Lerman (1993), which is at the basis of the Likelihood Linkage Analysis, as well as of the Statistical Implicative Analysis, better known as SIA (Gras et al., 2008). In a way similar to ours, this analysis method allows the researcher to define the likeness of students when answering the questionnaire, and also to build implications between the different answering strategies used by the students.

Categorisation and Codification of Student's Answers
Due to the open-ended nature of the questions, the researchers separately analysed the answers given by each student, trying to examine patterns and trends so as to find common themes emerging from them. Each researcher found typical "answering strategies" used by students when responding to the questions. Then the researchers compared and contrasted their findings, and reached a consensus on a common table of student answering strategies to be used for the subsequent analysis.
As a result of coding and categorisation, a set of M data (the answering strategies) was produced for each of the sample subjects (the N students answering the questionnaire). As a consequence, each subject, i, can be identified by an array, ai, composed of M components equal to 1 or 0, where 1 means that the subject used a given answering strategy to respond to a question, and 0 means that he/she did not use it. Then, an M x N binary matrix (matrix of answers), modelled on the one shown in Table 1, is built. In it, the columns report the N student arrays, ai, and the rows represent the M components of each array, i.e. the M answering strategies.
Table 1 Matrix of data: the N students are indicated as S1, S2, …, SN, and the M answer strategies as AS1, AS2, ..., ASM
For example, let us say that student S1 used answering strategies AS1, AS2 and AS5 to respond to the questionnaire questions. The result of this will be that the S1 column in Table 1 will contain the binary digit 1 in the three cells corresponding to these strategies, while all the other cells will be filled with 0.
The matrix depicted in Table 1 contains all the information needed to describe the sample behaviour with respect to the questionnaire. In our case, M = 43 answering strategies were found for the whole set of answers given to the six questions (see Appendix B).
The answers of each student were coded in a 43-dimension array, showing the specific answering strategies used by each student.In order to indicate whether a student used a given strategy to answer a question or not, 1s, or 0s, were respectively placed in the array cells.
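The binary coding described above can be illustrated with a minimal Python sketch. The function name `answer_matrix` and the small two-student sample are hypothetical; only the S1 example (strategies AS1, AS2 and AS5) comes from the text.

```python
def answer_matrix(strategies, students):
    """Build the M x N binary matrix: rows are strategies, columns are students."""
    return [[1 if code in used else 0 for used in students.values()]
            for code in strategies]


# Hypothetical toy data: 5 strategies instead of the 43 found in the study.
strategies = ["AS1", "AS2", "AS3", "AS4", "AS5"]
students = {
    "S1": {"AS1", "AS2", "AS5"},  # the example given in the text
    "S2": {"AS3", "AS4"},         # invented second student
}

matrix = answer_matrix(strategies, students)
column_s1 = [row[0] for row in matrix]  # the array a1 of student S1
print(column_s1)
```

Storing each student's strategies as a set keeps the coding step independent of the order in which strategies were recorded.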

Distance Index
In order to analyse the data, we correlated the student answers by means of a modified Pearson coefficient, Rm, and calculated a 'distance' between each student and all the others by using Gower metrics (Gower, 1966).
If we want to deal with two elements identified by non-numerical variables (for example, the arrays ai and aj containing the binary coding of the answers of students i and j, respectively), we can use a modified form of the Pearson coefficient, Rm, defined in terms of the properties of the elements (i.e. the numbers of 1s and 0s in the array). A possible definition that we have put forward on the basis of the one used in the field of Econophysics (Tumminello, Micciché, Dominguez, Lamura, Melchiorre, Barbagallo & Mantegna, 2011) is as follows:

Equation 1

where np(ai), np(aj) are the number of properties of ai and aj that we want to take into account, respectively (the numbers of 1s or 0s in the arrays ai and aj, respectively), M is the total number of properties to be studied (in our case, the M possible answering strategies) and np(aiaj) is the number of properties common to both ai and aj (the common number of 1s or 0s in the arrays ai and aj).
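Equation 1 itself is not reproduced in this text. As a minimal sketch, one form consistent with the symbols defined above is the ordinary Pearson correlation of two binary arrays written in terms of counts. This is an assumption, not necessarily the authors' exact expression, and the function name `modified_pearson` is hypothetical.

```python
from math import sqrt


def modified_pearson(a_i, a_j):
    """Pearson correlation of two equal-length binary arrays, from counts of 1s.

    Assumed form (not confirmed by the source):
        Rm = (np(aiaj) - np(ai) * np(aj) / M)
             / sqrt(np(ai) * (1 - np(ai)/M) * np(aj) * (1 - np(aj)/M))
    """
    m = len(a_i)
    n_i = sum(a_i)                                # np(ai): number of 1s in ai
    n_j = sum(a_j)                                # np(aj): number of 1s in aj
    n_ij = sum(x & y for x, y in zip(a_i, a_j))   # np(aiaj): common 1s
    denom = sqrt(n_i * (m - n_i) * n_j * (m - n_j))
    if denom == 0:
        raise ValueError("undefined for an array made only of 0s or only of 1s")
    # Same expression as in the docstring, multiplied through by M.
    return (m * n_ij - n_i * n_j) / denom


print(modified_pearson([1, 0, 1, 0], [1, 0, 1, 0]))  # identical arrays
print(modified_pearson([1, 0, 1, 0], [0, 1, 0, 1]))  # complementary arrays
```

Under this assumed form, identical arrays give a coefficient of 1 and complementary arrays give -1, so the coefficient behaves as a correlation between answering profiles.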
The choice of the type of metrics to use for the distance calculations is often complex, and depends on many factors. If we want two negatively correlated elements ai and aj to be more dissimilar than two elements that are positively correlated (as is often advisable in research in education), a possible definition for the distance between ai and aj, making use of the modified correlation coefficient, Rm, between them, is:

Equation 2
We chose to use this definition because it is a Euclidean metric (Gower, 1966), as required by the k-means method.
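Equation 2 is not reproduced in this text. A form commonly associated with Gower (1966) for turning a correlation coefficient into a Euclidean distance is d = sqrt(2 (1 - Rm)); the sketch below assumes that form, and the function name `gower_distance` is hypothetical.

```python
from math import sqrt


def gower_distance(r_m):
    """Distance from a correlation coefficient, assuming d = sqrt(2 * (1 - Rm))."""
    if not -1.0 <= r_m <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    return sqrt(2.0 * (1.0 - r_m))


# Negatively correlated students end up farther apart than positively
# correlated ones, as the text requires.
for r in (1.0, 0.0, -1.0):
    print(r, gower_distance(r))
```

With this form the distance ranges from 0 (perfectly correlated profiles) to 2 (perfectly anti-correlated profiles).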
By following Equation 2 we can then build a new N x N matrix containing all the mutual distances between the students. The main diagonal of this matrix is composed of 0s (the distance between a student and him/herself is zero). Moreover, it is symmetrical with respect to the main diagonal. In fact, our subjects can be represented as points in an N-dimensional space, and each subject, j, is represented as a point whose coordinates are related through Equation 2 to the values in the array, aj.
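The construction of the symmetric N x N matrix with a zero diagonal can be sketched as follows. The `distance` argument stands for whatever function implements Equation 2; the Hamming count used in the usage lines is only a hypothetical placeholder so that the sketch runs on its own.

```python
def distance_matrix(arrays, distance):
    """Return the N x N matrix of mutual distances between the given arrays."""
    n = len(arrays)
    d = [[0.0] * n for _ in range(n)]      # main diagonal stays at zero
    for i in range(n):
        for j in range(i + 1, n):
            # Fill both triangles at once, so the matrix is symmetric.
            d[i][j] = d[j][i] = distance(arrays[i], arrays[j])
    return d


def hamming(a, b):
    """Placeholder distance (NOT the one used in the paper): differing cells."""
    return sum(x != y for x, y in zip(a, b))


d = distance_matrix([[1, 0], [0, 1], [1, 1]], hamming)
print(d)
```

Only the upper triangle is computed, which halves the number of distance evaluations.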

Non-Hierarchical Cluster Analysis
The k-means clustering method was used to study the clusters that can originate from the data space. This method was first proposed by MacQueen in 1963 (MacQueen, 1967). Its starting point is the choice of the number of clusters one wants to populate and of an equal number of 'seed points', randomly selected in the two-dimensional space representing the data. It is therefore first necessary to define a procedure to find two Cartesian coordinates for each student, starting from the N distances between that student and all the students (including the zero distance from him/herself). This procedure consists of a linear transformation between an N-dimensional vector space and a two-dimensional one, and is well known in the specialised literature as multidimensional scaling (Borg & Groenen, 1997). The subjects are then grouped on the basis of the minimum distance between students and the seed points.
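The step from mutual distances to two Cartesian coordinates can be sketched with classical (Torgerson) multidimensional scaling. Whether the authors used this variant is not stated, so it is an assumption here; the function name, the power-iteration eigen-solver and its parameters are illustrative choices.

```python
import random
from math import sqrt


def classical_mds_2d(d, iters=300, seed=0):
    """Map a symmetric N x N matrix of mutual distances to N points in the plane."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    # Double centring of the squared distances: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    axes = []
    for _ in range(2):                      # the two leading eigenpairs of B
        v = [rng.random() for _ in range(n)]
        for _ in range(iters):              # power iteration
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n))   # Rayleigh quotient
        axes.append([sqrt(max(lam, 0.0)) * x for x in v])
        # Deflation: remove the component just found before the next pass.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return list(zip(axes[0], axes[1]))


# Usage: four corners of a 3 x 1 rectangle, known only through their distances.
pts = [(0.0, 0.0), (3.0, 0.0), (3.0, 1.0), (0.0, 1.0)]
dist = [[sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) for q in pts] for p in pts]
xy = classical_mds_2d(dist)
```

The recovered coordinates reproduce the mutual distances only up to rotation, reflection and translation, so the axes carry no meaning of their own.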
Starting from an initial classification, subjects are transferred from one cluster to another, or are swapped with subjects from other clusters, until no further improvement can be made. The subjects belonging to a given cluster are used to find a new point, representing the average position of their spatial distribution. This is done for each cluster and the resulting points are defined as the cluster centroids. This process is repeated, and ends when the new centroids coincide with the old ones. The spatial distribution of the set elements is represented in a two-dimensional space.
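The assign-and-update loop described above can be sketched in a few lines of Python. This is a generic batch (Lloyd-style) version for points in the plane, not the authors' C implementation; the function name and the `max_iter` safeguard are assumptions.

```python
def kmeans(points, seeds, max_iter=100):
    """Cluster 2D points, starting from the given seed points."""
    centroids = [tuple(s) for s in seeds]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: (p[0] - centroids[k][0]) ** 2
                      + (p[1] - centroids[k][1]) ** 2)
                  for p in points]
        # Update step: each centroid moves to the mean position of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append((sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members)))
            else:
                updated.append(old)         # an empty cluster keeps its centroid
        if updated == centroids:            # stop when the centroids stop moving
            break
        centroids = updated
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, seeds=[(0, 0), (10, 10)])
print(labels, centroids)
```

The stopping rule matches the one in the text: the loop ends when the new centroids coincide with the old ones.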
The k-means method requires the number of clusters to be defined arbitrarily at the beginning of the procedure. A specifically designed function, the Silhouette Function (Rousseeuw, 1987), was used to solve this issue. The values of this function allow us to decide whether the partition of our sample subjects into q clusters is adequate, how dense a cluster is, and how well it is differentiated from the others. In other words, this function allows us to understand how well each student array lies within a cluster and, therefore, to decide the number of clusters best fitted to the data distribution. This particular number of clusters corresponds to the maximum of the average value of the silhouette function for the given data distribution.
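The silhouette value of a point compares its mean distance a to the other members of its own cluster with the smallest mean distance b to the members of any other cluster, as s = (b - a) / max(a, b) (Rousseeuw, 1987). A minimal sketch of the mean silhouette follows; the function name is an assumption and at least two clusters are required.

```python
from math import dist


def mean_silhouette(points, labels):
    """Average silhouette value of a labelled set of points (two or more clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)              # convention for single-member clusters
            continue
        # a: mean distance to the other members (dist(p, p) = 0 adds nothing).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of another cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = mean_silhouette(points, [0, 0, 0, 1, 1, 1])
bad = mean_silhouette(points, [0, 1, 0, 1, 0, 1])
print(good, bad)
```

Repeating the clustering for several candidate numbers of clusters and keeping the one with the largest mean silhouette reproduces the selection rule described above.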
It is also well known (Stewart, Mille, Audo & Stewart, 2012) that in cluster analysis, the initial position of the centroids critically influences the final results. Different values of a centroid's initial position could lead to different cluster populations.
For this reason, we repeated the cluster calculations for several values of the initial position of each centroid, selecting the configuration that gave the absolute minimum of the sums of the distances between the centroid and its cluster points. One hundred thousand iterations were performed for each cluster configuration, each with different initial conditions, from which the best one was chosen. In other words, we obtained a minimum of the sums of the distances between the centroid and its cluster points for each iteration, and chose the minimum value amongst them.
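The multi-start selection described above can be sketched as a wrapper around any clustering routine that returns labels and centroids. Drawing the seeds from the data points is an assumption of this sketch, and `nearest_seed` is a hypothetical stand-in for a full k-means run, included only to keep the example self-contained.

```python
import random
from math import dist


def total_distance(points, labels, centroids):
    """Sum of the distances between each point and the centroid of its cluster."""
    return sum(dist(p, centroids[lab]) for p, lab in zip(points, labels))


def best_of_restarts(points, k, cluster_fn, restarts=100, seed=0):
    """Run cluster_fn from many random seedings and keep the lowest-cost result."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        seeds = rng.sample(points, k)       # assumption: seeds are data points
        labels, centroids = cluster_fn(points, seeds)
        cost = total_distance(points, labels, centroids)
        if best is None or cost < best[0]:
            best = (cost, labels, centroids)
    return best


def nearest_seed(points, seeds):
    """Stand-in clustering step: assign each point to its nearest seed."""
    labels = [min(range(len(seeds)), key=lambda k: dist(p, seeds[k]))
              for p in points]
    return labels, list(seeds)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cost, labels, centroids = best_of_restarts(points, 2, nearest_seed, restarts=200)
print(cost, labels)
```

Passing the clustering routine as an argument keeps the restart logic separate from the details of the k-means loop.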
At the end of the calculations, each cluster can be defined by a point representing the centre of the spatial distribution of the elements in the cluster, called the cluster centroid (Leisch, 2006). Our analysis allowed us to find an array for each centroid, of the same form as the ones describing the students' answering strategies. We used these arrays to characterise the clusters, as it can be demonstrated that they contain the answering strategies recurring with the maximum frequency in the cluster elements (the students). In fact, we can start from the consideration that the centroid is a geometrical point in our data space that minimises the sum of its distances from all the points (the student profiles) included in the cluster defined by the centroid itself. Minimising this sum means maximising the correlation coefficients between the centroid and the student points (see Equation 2). As a consequence of the definition of the correlation coefficient, this happens when the centroid array is made up of the answering strategies recurring with the maximum frequency in the cluster.
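One simple reading of "the answering strategies recurring with the maximum frequency" is to keep, for each question, the strategy code used most often by the members of a cluster, retaining ties. The sketch below follows that reading, which is an assumption rather than the authors' documented procedure; the function name and the toy cluster are hypothetical, and the question is taken from the first character of each code (e.g. "1ARa" belongs to question 1).

```python
from collections import Counter, defaultdict


def typical_strategies(members):
    """Most frequent strategy code(s) per question among the cluster members."""
    counts = Counter(code for used in members for code in used)
    by_question = defaultdict(list)
    for code, n in counts.items():
        by_question[code[0]].append((n, code))
    result = []
    for question in sorted(by_question):
        top = max(n for n, _ in by_question[question])
        # Keep every code that reaches the top frequency (ties are retained).
        result.extend(sorted(code for n, code in by_question[question] if n == top))
    return result


# Invented three-student cluster, for illustration only.
cluster = [{"1ARa", "2ARc"}, {"1ARa", "2ARb"}, {"1ARb", "2ARc"}]
print(typical_strategies(cluster))
```

Retaining ties allows a centroid array to list two strategies for the same question.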

Results
All calculations were performed by using custom software written in C language. The graphical representations were obtained by using the MATLAB software.
By using the method described above, we calculated the values of the silhouette function (see Figure 1), and found that the maximum of its mean value (0.71) is obtained for a partition of our sample into three clusters. For this reason, our data set can be best partitioned, in our analysis, into three clusters. In the graph, each horizontal bar represents a student and the values of the silhouette function are reported on the horizontal axis.
Figure 2 shows the three clusters that best partition our data set and the related centroids. Each point in the Cartesian plane represents a student. Each point is placed according to the calculated mutual distances between the students and by using the multidimensional scaling procedure.
The axes' only function is to show the scale used to place each point in the Cartesian plane, taking into account all the mutual distances between them. In other words, the Cartesian coordinates (x, y) depend on the mutual distance between the students, and do not have a particular meaning. It is worth noting that some points may be placed in the vicinity of different clusters, and may actually represent students that exhibit mixed behaviours. In particular, this happens for some points in the C1 cluster and some others in the C2 cluster. However, the k-means method still assigns each of these students to a specific cluster and associates them with the general, typical behaviour of the cluster elements. The k-means method should, therefore, be understood as giving global-type information, and must not be considered as a way to study the characteristics of each student in detail.
As previously mentioned, the three clusters can be characterised by their related centroids, Ck (k = 1, …, 3), which are the three points in the graph. If we associate each centroid Ck with an array ck, this array contains (as demonstrated in the previous section) the answering strategies most frequently applied by subjects in the related cluster (see Table 2). The codes used refer to the answering strategies for the questionnaire described in Appendix B.
We will discuss the pedagogical meaning of these results in the next section.

Discussion and Conclusion
The k-means analysis allowed us to find three clusters that are represented by the three centroids, which describe the average behaviour of the students of the clusters. In the following, we discuss the analysis of the typical behaviour of the students on the basis of the answering strategies found in the centroid arrays. As previously noted, these strategies were not defined a-priori, and are not to be considered as the ideal profiles of students (Fazio, Battaglia & Di Paola, 2013), but are obtained as a consequence of the analysis performed by means of the k-means method.
The cluster represented by centroid C2 is characterised by the following array of answering strategies: 1ARa, 1ARb; 2ARc; 3ARb; 4ARa; 5ARa; 6ARb, found as described above. Upon examination, these strategies indicate the students' consistent use of 'low-level' strategies, mostly marked as "a" and "b". The 37 students in the C2 cluster could be defined as purely arithmetic (Di Paola & Spagnolo, 2010; Malisani, 1992). They appear to be 'weak' students who use the tools and methods of arithmetic even when these are not well suited to the question or are formally incorrect, as can be seen from the use of strategies 1ARa, 1ARb, 2ARc, 3ARb and 4ARa. This student behaviour, found here by means of quantitative analysis, is in good accordance with the results found qualitatively by Arzarello et al. (2002) and Meyer (2013) and discussed in the Theoretical framework section. In particular, with reference to Meyer (2013), we find a lack of student awareness of the procedures used, which remain mainly arithmetical. We also find aspects related to the difficulty of translating natural language into symbolic language, as reported by Arzarello et al. (2002). Another example of this behaviour is the use of arithmetic strategies (5ARa, 6ARb) for the last two questions of the questionnaire, namely the ones posed in symbolic form. These students appear to stay hooked to an arithmetic trial-and-error approach, even when they must solve algebraic expressions. This result is also well described by Sfard (1995) by means of a qualitative analysis, and is typical of algebraic thinking.
The centroid strategies of C2, all arithmetical, show that the transition from arithmetic to algebra is difficult for the students in this cluster. In their qualitative research, Benfanti et al. (2005), Cusi, Malara and Navarra (2011) and Malara and Navarra (2003) find this kind of behaviour, and describe these students as not having reached even pre-algebraic thinking.
The cluster represented by centroid C3 is the smallest of the three we found (14 students). It groups the few students who demonstrate well-defined algebraic thinking. The centroid is characterised by the following array of answering strategies: 1ALa; 2ALc; 3ALc; 4ALd; 5ALc; 6ALd. All these strategies are algebraic, and most are 'high-level' (marked as "c" or "d"). The students in this cluster make use of algebra in order to model the proposed word problems. Strategies 1ALa, 2ALc, 3ALc and 4ALd show that these students appear to be able to translate natural language into symbolic language (Arzarello et al., 2002; Caspi & Sfard, 2012). These strategies also show that students in cluster C3 do not seem to have many difficulties in controlling, unlike the results reported by Chiappini and Lemut (1991).
The students also show some confidence when answering Questions 5 and 6, as evidenced by strategies 5ALc and 6ALd. Following Caspi and Sfard (2012), we can say that the students in this cluster show a good mastery of algebra. Strategies 5ALc and 6ALd highlight the absence of the difficulties found by Bohlmann et al. (2014) and Palm (2009) in students manipulating algebraic symbols.
Finally, the array defining the C1 centroid has the following components: 1ARc; 2ARb; 3ARb; 4ALb; 5ALa; 6ALc. This is the largest cluster (67 students), and it groups students who apply mixed arithmetic and algebraic answering strategies. This can be seen by analysing the components listed above, which include the use of arithmetic strategies to deal with the first three questions, and the use of algebraic ones (5ALa and 6ALc) for the last two questions, whose algebraic formulation should suggest an algebraic solution. Strategies 5ALa and 6ALc are in good accordance with the results of Bohlmann et al. (2014) and Palm (2009), as discussed in the Theoretical framework.
The fourth problem is solved by using an algebraic strategy, although a low-level and incorrect one (4ALb). In fact, these students write the expression symbolically, but then go on to solve it numerically with a trial-and-error procedure, and do not arrive at the correct solution. This may be due to imperfect mastery of the skills required to translate between natural and symbolic language, as also observed in the literature (Bednarz & Janvier, 1996; Benfanti et al., 2005; Boero, 2001).
We also note a coherence in the use of strategies (1ARc, 2ARb) in centroid C1 with respect to questions 1 and 2 (which are similar), and a lack of coherence in the strategies (3ARb and 4ALb) used to answer problems 3 and 4. In fact, the third question, although having the same form as the fourth, was tackled in a completely different way, with the use of arithmetic-type strategies. This last result does not seem to be in good accordance with the results discussed in the literature by Arzarello et al. (2002) on the transition from natural language to algebraic language. An interpretation of these results calls for a deeper analysis, which might combine qualitative and quantitative analysis.
It is worth noting that the cardinality of the cluster defined by C2 is not negligible. This result underlines the complexity, widely discussed in the literature (Arzarello et al., 2002; Sfard, 1995), of the didactical aspects related to teaching/learning algebra at this school grade.
In conclusion, we want to underline that the k-means method we used here allowed us to characterise the common traits in the student answers, giving us the opportunity to safely partition them into groups. These groups are characterised by centroids that, as we said before, represent the answering strategies given with maximum frequency by the students who are part of the cluster.
The results we reported here were obtained without any prior researcher knowledge of what form those groups would take, and are largely coherent with the ones already reported in the literature, which were obtained by means of qualitative methods. For this reason we can, at least, consider non-hierarchical cluster analysis a valid tool to complement qualitative analysis when studying the way a set of students can be partitioned with respect to how they answer a questionnaire.

Notes
i. In our use of the term "symbolic", we mean expressions containing equations, inequations, simultaneous equations, etc.
ii.
3ARa. The student does not follow a specific logical line in choosing the triad of numbers required by the question. In particular, he/she does not choose three consecutive numbers and goes on more or less by chance, eventually finding the right result.
3ARb. The student tries to answer the question by several attempts. He/she first tries the triad 1, 2, 3 and verifies whether it fits the requirements of the question. As this is not the case, he/she tries again with 2, 3, 4 and then with 3, 4, 5. In this last case, the sum of the squares is 50, so the student finds his/her answer. After this, he/she does not check whether other triads of numbers satisfy the question requirements.
3ARc. The student follows the steps described in strategy 3ARb, but chooses negative, consecutive numbers. So, he/she first tries the triad (-3, -2, -1), then (-4, -3, -2), and finally finds a triad, (-5, -4, -3), that fits the question requirements. After this, he/she does not check whether other triads of numbers satisfy the question requirements.
3ALa. The student formalises the problem and writes the formula: x^2 + y^2 + z^2 = 50. However, no relationships between x, y and z are found, and so he/she is not able to solve the problem.
3ALb. The student formalises the problem and writes the formula: x^2 + (x + 1)^2 + (x + 2)^2 = 50. In order to solve it, he/she uses a trial-and-error, arithmetic procedure.
3ALd. The student formalises the problem and writes a system of 3 equations with 3 variables and solves it by using one of the algebraic methods he/she knows:
4ARa. The student answers "no", without further explanation.
4ARb. The student tries to solve the problem by a trial-and-error, arithmetic procedure, randomly searching numbers.
4ARc. The student decides to proceed by successive approximations. He/she starts from 1 and performs the calculations described in the text: he/she adds 1 to 4 and then multiplies the result by 80, verifying that the obtained value is less than 2360. He/she continues with numbers greater than 1 until he/she finds that, by using 25, the result is 2320, since (25 + 4) * 80 = 2320, but by using 26, the result is 2400, which is greater than the required value (2360). As there are no other integers between 25 and 26, the student concludes that the answer to the question is "no".
4ARd. The student draws a rectangle ((x + 4); 80) and bases his/her reasoning on the fact that the area of such a rectangle, according to the question data, is to be 2360. He/she goes on by a trial-and-error, arithmetic procedure:
5ARa. The student tries to solve the algebraic expression by successive approximations on the x variable. He/she proceeds by trial and error, randomly choosing values. In this way, after many calculations, the student finds the value 0, and considers it the only correct solution.
5ARb. The student tries to solve the algebraic expression by successive approximations on the x variable. He/she proceeds by trial and error, randomly choosing values. In this way, after many calculations, he/she finds both solutions.
5ARc. The student tries to solve the algebraic expression by successive approximations on the x variable. He/she proceeds by trial and error, choosing values in ascending order (0, 2, 7, ...). In this way the student finds the value 0 and considers it the only correct solution.
5ARd. The student tries to solve the algebraic expression by successive approximations on the x variable. He/she proceeds by choosing values in ascending order (0, 1, 2, 3, ...). In this way the student finds both solutions.
5ALa. The student tries to simplify the algebraic expression, but fails to do so. He/she then uses an arithmetic approach and solves the problem.
5ALb. The student solves the algebraic expression, blindly performing all the calculations.
5ALc. The student sees that it is possible to rewrite the expression in a more synthetic way. He/she does so, and therefore easily solves the equation.
6ARa. The student tries to solve the algebraic expression by successive approximations on the x variable. He/she proceeds by trial and error, randomly choosing values. In this way, after many calculations, the student finds the value 0, and considers it the only correct solution.
6ARb. The student tries to solve the algebraic expression by successive approximations on the x variable. He/she proceeds by trial and error, randomly choosing values. In this way, after many calculations, he/she finds both solutions.
6ARc. The student tries to solve the algebraic expression by successive approximations on the x variable. He/she proceeds by trial and error, choosing values in ascending order (0, 2, 7, ...). In this way the student finds the value 0 and considers it the only correct solution.
6ARd. The student tries to solve the algebraic expression by successive approximations on the x variable. He/she proceeds by choosing values in ascending order (0, 1, 2, 3, ...). In this way the student finds both solutions.
6ALa. The student tries to simplify the algebraic expression, but fails to do so. He/she then uses an arithmetic approach and solves the problem.
6ALb. The student solves the algebraic expression, blindly performing all the calculations.
6ALc. The student sees that it is possible to rewrite the expression in a more synthetic way. He/she does so and solves it, but finds only one of the two solutions.
6ALd. The student sees that it is possible to rewrite the expression in a more synthetic way. He/she does so and solves it, finding both solutions.
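The arithmetic behind some of the strategies for questions 3 and 4 can be checked with a short sketch. It assumes, as the strategy descriptions suggest, that question 3 asks for three consecutive integers whose squares sum to 50, and that question 4 asks whether an integer x satisfies (x + 4) * 80 = 2360. The function names and search ranges are hypothetical choices made for illustration only.

```python
def consecutive_triads(target, low=-10, high=10):
    """Consecutive integer triads (n, n+1, n+2) whose squares sum to target."""
    return [
        (n, n + 1, n + 2)
        for n in range(low, high + 1)
        if n**2 + (n + 1) ** 2 + (n + 2) ** 2 == target
    ]


def integer_solutions_q4(total=2360, factor=80, offset=4, max_x=100):
    """Positive integers x with (x + offset) * factor == total."""
    return [x for x in range(1, max_x + 1) if (x + offset) * factor == total]


# Question 3: both the positive triad (3ARb) and the negative one (3ARc) appear.
print(consecutive_triads(50))
# [(-5, -4, -3), (3, 4, 5)]

# Question 4: the two bracketing values found in strategy 4ARc.
print((25 + 4) * 80, (26 + 4) * 80)
# 2320 2400

# No integer gives exactly 2360, which matches the answer "no".
print(integer_solutions_q4())
# []
```

An exhaustive search of this kind reproduces what the trial-and-error strategies do by hand, and it also shows why stopping at the first triad found (as in 3ARb and 3ARc) misses the second solution.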

Figure 1. Silhouette values for the whole sample. Horizontal and vertical axes represent students and silhouette values, respectively. C1, C2 and C3 represent the three centroids of the three clusters formed. The silhouette average value is 0.71.
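For readers unfamiliar with silhouette values, the following is a minimal sketch of the standard definition s(i) = (b - a) / max(a, b), where a is the mean distance from point i to the other members of its own cluster and b is the smallest mean distance from point i to the members of any other cluster. The sketch assumes Euclidean distance and uses hypothetical toy points; the distance measure actually used in the study may differ.

```python
import math


def silhouette_values(points, labels):
    """Silhouette s(i) = (b - a) / max(a, b) for each point (Euclidean distance)."""
    clusters = sorted(set(labels))
    values = []
    for i, p in enumerate(points):
        # Mean distance from point i to the members of each cluster (excluding itself).
        mean_dist = {}
        for c in clusters:
            dists = [
                math.dist(p, q)
                for j, q in enumerate(points)
                if labels[j] == c and j != i
            ]
            mean_dist[c] = sum(dists) / len(dists) if dists else None
        a = mean_dist[labels[i]]
        if a is None:
            # Singleton cluster: the silhouette is conventionally set to 0.
            values.append(0.0)
            continue
        b = min(d for c, d in mean_dist.items() if c != labels[i] and d is not None)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Hypothetical toy data: two well-separated groups of two points each.
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
toy_labels = [0, 0, 1, 1]
toy_silhouettes = silhouette_values(toy_points, toy_labels)
print(sum(toy_silhouettes) / len(toy_silhouettes))
```

Values close to 1 indicate points that sit well inside their cluster, while values near 0 indicate points lying between clusters.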

Figure 2. K-means graph. Each point in this Cartesian plane represents a student. Points labelled C1, C2, and C3 are the centroids.

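Table 2
An overview of results obtained by the k-means method

Cluster   Centroid array                              Number of subjects
C1        1ARc, 2ARb, 3ARb, 4ALb, 5ALa, 6ALc          67
C2        1ARa, 1ARb, 2ARc, 3ARb, 4ARa, 5ARa, 6ARb    37
C3        1ALa, 2ALc, 3ALc, 4ALd, 5ALc, 6ALd          14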