How to Discover Textual Groups

Multivariate analysis (MVA) can be applied to the New Testament textual tradition in order to investigate grouping among its witnesses. This article applies certain MVA methods to a number of example data sets. Each method operates on a matrix that tabulates distances between pairs of items in a data set. The simple matching distance, which is the proportion of disagreements, can be used as a metric for calculating distances between New Testament witnesses. Analysis methods called classical multidimensional scaling (CMDS) and divisive clustering (DC) are useful for revealing group structure when it is well defined. However, they are less useful when grouping is not very distinct. A method called partitioning around medoids (PAM) provides another way to divide a data set into groups. Local maxima in a plot of a statistic called the mean silhouette width (MSW) indicate preferred numbers of groups. Statistical analysis of a data set allows upper and lower critical limits to be defined for the distance between a pair of witnesses. Distances between these limits are not significant in the sense that the same range of distances is expected to occur for generated pairs whose states are randomly chosen from the available pool. Distances that are either less than or greater than these critical limits are not likely to happen by chance. A distance less than the lower critical limit indicates an adjacent relationship while one greater than the upper limit implies an opposite relationship. Applying CMDS, DC, and PAM analysis to data for the Gospel of Mark reveals interesting features of the textual landscape. Witnesses tend to form groups that have points of contact with conventional categories such as the “Alexandrian,” “Byzantine,” “Western,” and “Eastern” types identified by prior generations of researchers. Multivariate analysis can also be used for novel purposes such as identifying group representatives, group cores, and readings useful for classification purposes. 


Introduction
Every book from antiquity which has survived in multiple copies exhibits textual variation. Sites where extant witnesses differ can be identified by manual or computer-assisted comparison. Once the state of every witness has been recorded at every variation site where its text is discernible, a text-to-text distance can be calculated for each pair of witnesses. Multivariate analysis of these distances provides a way to discover textual groups among the witnesses.
The boundaries of a variation site, the list of associated textual states, and the lists of witnesses that attest to the states constitute a structure called a variation unit.
The boundaries of a variation site might be determined by a collation algorithm or through editorial discretion. Once the boundaries are defined, the alternative states of text found among the witnesses can be listed for that site. The term reading is often used to refer to the textual state of a witness. A reading may be classified as: substantive, affecting meaning; orthographic, affecting the surface form but not meaning; or erroneous, for clear blunders. A state that constitutes a substantive difference is often called a variant.
A compact structure called a data matrix is a useful starting point for multivariate analysis of textual variation. ("Analysis of textual variation" [Finney 2011] describes how to prepare data extracted from a critical apparatus for analysis and compares results obtained when various MVA techniques are then applied; "Ancient witnesses," [Finney 1999] applies multivariate analysis to full transcriptions of early manuscripts of the Epistle to the Hebrews; "Mapping textual space" [Finney 2010] presents analysis results based on data extracted from a critical apparatus of Hebrews. J. C. Thorpe, "Manuscript classification," [Thorpe 2002] provides another introduction to multivariate analysis of data relating to New Testament textual variation.) There is a row for each witness, a column for each variation site, and a code representing which state each witness preserves at each variation site where it is well defined. A distinct code such as NA (not available) is used when a witness is not well defined. There are various reasons why this might be so: the witness could be illegible at the place in question or, if a translation, might support more than one of the alternative states in the original language.
A data matrix contains the information required to construct a second structure called a distance matrix using some metric to record the dissimilarity of every pair of witnesses being examined. One suitable metric, the simple matching distance (SMD), is the relative frequency of disagreement between two witnesses. Given a set of variation sites where both witnesses have well-defined readings, the SMD is calculated by counting the number of disagreements and dividing by the number of sites. The resulting distance is dimensionless, having no unit, because it is the ratio of two pure numbers. Its magnitude varies from a minimum of zero for perfect agreement to a maximum of one for perfect disagreement.
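The calculation just described can be sketched in a few lines. The article's own scripts are written in R; the Python below is only an illustrative sketch, and the function names and the use of `None` to stand in for the NA code are assumptions made for this example.

```python
NA = None  # assumed stand-in for the NA (not available) code


def simple_matching_distance(a, b):
    """Return (distance, sites compared) for two rows of a data matrix.

    Only variation sites where both witnesses are well defined are compared.
    Returns (None, 0) when the witnesses share no well-defined site.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not NA and y is not NA]
    if not shared:
        return None, 0
    disagreements = sum(1 for x, y in shared if x != y)
    return disagreements / len(shared), len(shared)


def distance_matrix(data):
    """Build a nested dict of SMD values from a dict of witness -> states."""
    names = list(data)
    return {p: {q: simple_matching_distance(data[p], data[q])[0] for q in names}
            for p in names}


# Two hypothetical witnesses: three shared sites, one disagreement.
w1 = [1, 2, 1, NA, 2]
w2 = [1, 1, 1, 2, NA]
smd, sites = simple_matching_distance(w1, w2)
```

Returning the number of sites compared alongside the distance is a design choice made here because that count is what determines the sampling error of the distance.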
To keep sampling error below a tolerable level, it is advisable to impose a constraint: a witness is eliminated from the distance matrix if its inclusion would result in any distance being calculated from fewer than a minimum acceptable number of variation units where the states of both members of the pair are well defined. Witnesses whose inclusion would violate this constraint can be eliminated by an iterative procedure that drops the least well-defined witness at every step until all remaining witnesses satisfy the constraint. For this article, the minimum acceptable number is fifteen.
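A minimal Python sketch of such a vetting procedure follows. It is not the article's R script. It assumes that "least well-defined" means the witness with the fewest well-defined variation sites, and the names `vet_witnesses` and `shared_sites` are hypothetical.

```python
from itertools import combinations

NA = None  # assumed stand-in for the NA (not available) code


def shared_sites(a, b):
    """Count variation sites where both witnesses are well defined."""
    return sum(1 for x, y in zip(a, b) if x is not NA and y is not NA)


def vet_witnesses(data, min_sites=15):
    """Drop the least well-defined witness until every pair shares enough sites."""
    kept = dict(data)
    while len(kept) > 1:
        if all(shared_sites(kept[p], kept[q]) >= min_sites
               for p, q in combinations(kept, 2)):
            break
        # Assumption: the least well-defined witness has the fewest defined sites.
        worst = min(kept, key=lambda w: sum(1 for x in kept[w] if x is not NA))
        del kept[worst]
    return list(kept)


# Hypothetical data: C is defined at only five of twenty sites.
example = {
    "A": [1] * 20,
    "B": [2] * 20,
    "C": [1] * 5 + [NA] * 15,
}
survivors = vet_witnesses(example, min_sites=15)
```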
The textual landscape can be explored by applying multivariate analysis techniques such as classical multidimensional scaling (CMDS) and divisive clustering (DC) to a distance matrix. Also, statistical analysis of a distance matrix allows one to establish what range of distances between witnesses is expected to occur by chance, thereby providing a criterion for deciding whether two witnesses share a significant level of agreement. Although useful, these techniques are not always suitable for indicating how many groups exist, especially when grouping is not well defined.
Another mode of multivariate analysis called partitioning around medoids (PAM) provides a robust way to divide a set of witnesses into a chosen number of groups.
A statistic called the mean silhouette width (MSW), which can be generated during PAM analysis, indicates how many groups are in the data.

Examples
Four examples will be used to introduce a number of multivariate analysis techniques which can be used to explore grouping:
1. actual distances between thirty cities
2. an artificial construct which contains four well-defined groups of three items
3. textual variants in the Gospel of Mark as recorded in the UBS4 apparatus
4. textual variants in the Gospel of Mark as recorded by the INTF.
Analysis of the example data sets is performed with scripts written in R, a statistical computing language (R Core Team 2016). The data sets, scripts, and analysis results are available at a data archive associated with this article (Finney 2018). Truncated versions of data and distance matrices are presented in the text but a complete version of each matrix can be downloaded as a comma-separated values (CSV) file by clicking on its caption. Once downloaded, the file can be imported into a spreadsheet program for inspection. Classical multidimensional scaling results are presented as static images but animated renditions can be retrieved by clicking the image captions.
Finney: How to Discover Textual Groups Art. 7, page 6 of 99

Distances between Cities
The distance matrix of the first example records distances between thirty busy airports identified by their IATA codes. The distances in Table 1 were obtained with a formula that calculates tunnel distances from latitude and longitude coordinates.
The formula is only accurate to 0.5% because the earth is not a perfect sphere.
Resulting distances are rounded to the nearest kilometre.
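The article does not reproduce the formula, so the following Python sketch is only one plausible way to obtain a tunnel distance, that is, the straight-line chord through the Earth between two surface points. It assumes a perfectly spherical Earth with a mean radius of 6,371 km; the function name and the radius constant are choices made for this illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth


def tunnel_distance_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Straight-line (chord) distance between two points given in degrees."""

    def to_xyz(lat, lon):
        # Convert latitude and longitude to Cartesian coordinates on the sphere.
        phi, lam = math.radians(lat), math.radians(lon)
        return (radius * math.cos(phi) * math.cos(lam),
                radius * math.cos(phi) * math.sin(lam),
                radius * math.sin(phi))

    return math.dist(to_xyz(lat1, lon1), to_xyz(lat2, lon2))


# Antipodal points on the equator are one diameter apart.
across = tunnel_distance_km(0.0, 0.0, 0.0, 180.0)
```

Because a chord is measured rather than a great-circle arc, the largest possible distance is the diameter of the sphere rather than half its circumference.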
This distance matrix has a number of characteristic features common to all distance matrices of the kind employed in this article:
1. it is square, because it has the same number of rows and columns with one row and column per case (i.e. city, in this example)
2. it is symmetrical, because the distance from any case A to another case B is the same as the distance from B to A
3. its diagonal consists entirely of zeros, because the distance from any case to itself is zero.
Classical multidimensional scaling (CMDS) produces a geometrical construct where cases are represented as points in a space. The procedure uses a least squares method to minimise differences between actual distances, as found in the distance matrix, and corresponding distances between points of the construct. The result is the best representation achievable using the available number of dimensions and chosen method of minimising differences. In this article, all CMDS results are restricted to three dimensions. Achieving a perfect spatial representation of a distance matrix may require any number of dimensions up to one less than the number of cases.
The coefficient of determination or R-squared value indicates how well the construct reproduces the actual distances. It ranges from zero to one, with one indicating a perfect representation. (The R language and R-squared value have independent etymologies.) The CMDS map for the first example seems to chart an alien world. Compared to a conventional globe, the construct in Figure 1 is reflected and rotates around a different axis. This is not unexpected, as the distance matrix contains no information concerning orientation or reflection. The R-squared value of 1.00 means that the map perfectly represents the distance matrix. If three dimensions had not been sufficient to reproduce all of the information in the distance matrix then the value would have been less than one. In this case three dimensions are sufficient because the original distances relate to a three-dimensional world. The axis scales also conform to the data set, with the distance between opposite points (e.g. Singapore and Miami) corresponding to the Earth's diameter of approximately 12,750 km.
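The following Python sketch illustrates classical multidimensional scaling and an R-squared check using only the standard library. It is not the article's R code. Two assumptions are made: the leading eigenvectors of the double-centred matrix are found by simple power iteration with deflation (adequate only when the largest eigenvalues are positive and reasonably well separated), and R-squared is taken to be the squared Pearson correlation between actual and mapped distances, a definition the article does not spell out. The names `cmds` and `r_squared` are hypothetical.

```python
import math
import random
from itertools import combinations, product


def cmds(dist, dims=3, iterations=500, seed=0):
    """Classical MDS sketch: return one list of coordinates per case."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centring of the squared distances (the matrix is symmetric).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _ in range(iterations):  # power iteration towards the top eigenvector
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * bv[i] for i in range(n))
        if eigenvalue <= 1e-12:
            break  # no further usable dimension; remaining axes stay at zero
        for i in range(n):
            coords[i][k] = v[i] * math.sqrt(eigenvalue)
        for i in range(n):  # deflate so the next pass finds the next eigenvector
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords


def r_squared(dist, coords):
    """Squared correlation between actual and mapped pairwise distances."""
    pairs = list(combinations(range(len(dist)), 2))
    actual = [dist[i][j] for i, j in pairs]
    mapped = [math.dist(coords[i], coords[j]) for i, j in pairs]
    ma, mm = sum(actual) / len(pairs), sum(mapped) / len(pairs)
    cov = sum((a - ma) * (m - mm) for a, m in zip(actual, mapped))
    var_a = sum((a - ma) ** 2 for a in actual)
    var_m = sum((m - mm) ** 2 for m in mapped)
    return cov * cov / (var_a * var_m)


# Hypothetical cases: the eight corners of a 4 x 2 x 1 box.
corners = list(product((-2, 2), (-1, 1), (-0.5, 0.5)))
d = [[math.dist(p, q) for q in corners] for p in corners]
fit_3d = r_squared(d, cmds(d, dims=3))
fit_1d = r_squared(d, cmds(d, dims=1))
```

With distances that come from a genuinely three-dimensional arrangement, three dimensions should reproduce them almost exactly, while a one-dimensional map cannot, which mirrors the point made about the city example.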
Divisive clustering (DC) begins with a single group and ends with individual cases. The relevant program documentation describes the clustering algorithm as follows, where observation refers to a case (Maechler et al. 2017, "Cluster analysis basics and extensions," "diana" method of the "cluster" package):
At each stage, the cluster with the largest diameter is selected. (The diameter of a cluster is the largest dissimilarity between any two of its observations.) To divide the selected cluster, the algorithm first looks for its most disparate observation (i.e. which has the largest average dissimilarity to the other observations of the selected cluster). This observation initiates the "splinter group." In subsequent steps, the algorithm reassigns observations that are closer to the "splinter group" than to the "old party." The result is a division of the selected cluster into two new clusters. (32)
Divisive clustering analysis produces a dendrogram that shows "heights" at which groups divide into sub-groups. The associated divisive coefficient ranges from zero to one, with larger values indicating more clearly defined grouping. A DC dendrogram is not a genealogical tree of the type produced by phylogenetic analysis. Instead, it merely shows a reasonable way to progressively subdivide an all-encompassing group until every sub-group is composed of a single case. For examples of phylogenetic analysis results see Spencer, Wachtel, and Howe's "Greek vorlage of the Syra Harclensis" (2002).
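The splitting step in the quoted description can be illustrated with a short Python sketch. It is not the code of the R "cluster" package. It follows one common reading of the procedure, in which the observation moved at each step is the one whose average dissimilarity to the rest of the old party most exceeds its average dissimilarity to the splinter group. The function names are hypothetical.

```python
def mean_dissimilarity(dist, item, others):
    """Average dissimilarity from item to the other members of a group."""
    others = [o for o in others if o != item]
    if not others:
        return 0.0
    return sum(dist[item][o] for o in others) / len(others)


def diameter(dist, cluster):
    """Largest dissimilarity between any two members of a cluster."""
    return max((dist[i][j] for i in cluster for j in cluster), default=0.0)


def split_cluster(dist, cluster):
    """Divide one cluster into a splinter group and the remaining old party."""
    old_party = list(cluster)
    # The most disparate observation initiates the splinter group.
    first = max(old_party, key=lambda i: mean_dissimilarity(dist, i, old_party))
    old_party.remove(first)
    splinter = [first]
    while len(old_party) > 1:
        # Positive gain: on average closer to the splinter group than to the old party.
        gains = {i: mean_dissimilarity(dist, i, old_party)
                    - mean_dissimilarity(dist, i, splinter)
                 for i in old_party}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        old_party.remove(best)
        splinter.append(best)
    return splinter, old_party


# Hypothetical cases on a line at positions 0, 1, 2, 10 and 11.
positions = [0, 1, 2, 10, 11]
dist = [[abs(a - b) for b in positions] for a in positions]
splinter, old_party = split_cluster(dist, range(len(positions)))
```

A full analysis would repeat this step, each time choosing the cluster with the largest diameter, until every cluster holds a single case.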
The DC dendrogram for the first example (Figure 2) splits at a "height" of just over 12,000 kilometres, corresponding to the diameter of the entire point cloud.
The left-hand branch splits into North American and European groups at a height of about 9,000 km, which is the approximate distance between the centres of these two groups of cities. The North American group splits into eastern and western branches at a height of about 4,000 km, corresponding to the width of the continental USA. Sydney (SYD) and Dubai (DXB) are the first to split from the right-hand branch due to their relative isolation. The remaining cities in this branch split into East and Southeast Asian branches at a height of about 5,500 km.
If a case is on the border between two groups then a slight change in the distance matrix can cause it to switch from one branch of the dendrogram to another. A case in point is Dubai, which is about halfway between the European and Asian groups. If this city were somehow to migrate closer to Europe then its location in the dendrogram would eventually shift out of the Asian group into the European one.

Well-Defined Groups
The second example introduces two new elements: the data matrix and the control data set. A data matrix records the states of a set of cases for a set of variables. The data matrices used here adopt a convention where rows relate to cases and columns relate to variables. Each row records the state of a case for each variable where it is well defined. If the data relates to textual variation then there is a row for each witness, a column for each variation site, and the states are codes which represent alternative readings or, in the case of substantive variation, variants. If the state of a case is not well defined for a particular variable then it is given the code NA (not available).
This example's data matrix (Table 2) is an artificial construct in which the states of fifteen binary variables (V1-V15) have been chosen to produce four well-defined groups among twelve cases (A1-A12), with three cases per group. A binary variable is one with only two possible states, here represented by the symbols 1 and 2.
Cases within a group are similar while those of differing groups are dissimilar. The corresponding distance matrix (Table 3) was produced using the simple matching distance (SMD) to quantify the dissimilarity of every pair of cases. The simple matching distance is obtained by counting the number of disagreements between two cases and dividing by the number of places where they are compared. Both cases must be defined at each place where they are compared.
As the states in this example have been chosen to produce well-defined groups, within-group distances should be small relative to between-group distances.
Inspecting the distance matrix confirms that this is so, with cases in the same group having a distance of 2/15 (0.133) while distances between cases in differing groups are either 6/15 (0.400) or 8/15 (0.533). Inspection also shows that the distance matrix has the previously mentioned characteristics of being square, symmetrical, and having a diagonal composed entirely of zeros. As in all distance matrices based on the simple matching distance, every distance has a value in the range between zero, representing perfect agreement, and one, representing perfect disagreement.
The CMDS result for this distance matrix (Figure 3) reveals the four well-defined groups, placing dissimilar cases apart and similar ones together. The R-squared value of 0.85 obtained during the analysis indicates that the three dimensions allowed for the analysis result are not sufficient to perfectly reproduce the actual distances. On one hand, the CMDS result represents the data set well, giving a useful indication of the distance between dissimilar groups. On the other hand, it hides differences between cases in the same group. The four groups occupy the apexes of a regular tetrahedron because each one is equidistant from the others, as determined in advance when the corresponding data matrix was constructed. Given a different number of artificial groups, the resulting map would present a different picture; if, say, there had been three groups, they would have occupied the apexes of a triangle.
In Figure 4, the DC dendrogram produced from this distance matrix also reveals the four groups. An initial all-encompassing group decomposes into four groups at a height of 0.533, which is the most common distance between cases in dissimilar groups. Each group splits into individual items at a height of 0.133, which is the distance between cases in the same group. In this example, DC is better than CMDS with respect to indicating distances between cases in the same group.
This dendrogram has the characteristic pattern produced by data with well-defined groups: long "branches" tipped by tight bunches of "leaves." The divisive coefficient associated with this dendrogram is 0.75.
The control data set is based on random data and shows what analysis results look like in the absence of groups. The control data matrix in Table 4 was produced by a script that generates the required number of cases by randomly selecting between two possible states for the required number of variables. Cases are labelled with an R (for random) prefixed to a numeral, variables with a V prefix, and states as 1 and 2.
The control data sets of the following sections were produced by the same script and have the same labelling system. Cases and variables from different controls are distinct even though their labels may coincide.
The controls are intended to show what analysis results look like for a data set which is comparable to the primary example yet does not contain any real groups.
The generating script is supplied with three parameters:
1. the required number of cases
2. the required number of variables
3. the desired mean distance between cases.
For each case produced, the script randomly selects a state for every variable in a manner that aims to produce the desired mean distance between cases. A distance matrix produced from the resulting data matrix will have approximately the same mean distance between cases as desired. The function that performs random selection needs to be supplied with the probability of the first state being selected. This is calculated using the expression $p = (1 + (1 - 2d)^{1/2})/2$, where $p$ is the probability of the first state and $d$ is the desired mean distance. In the limit of an infinite number of cases, using this probability would produce the desired mean distance. The data matrix of the present control was produced using parameters that correspond to the primary example: twelve cases, fifteen variables, and a desired mean distance between cases of 0.424.
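The expression for the probability can be understood as follows: two independently generated cases disagree at a binary variable with probability 2p(1 - p), and setting this equal to d and solving for p gives the formula above. The Python sketch below shows how a control data matrix might be generated on that basis. It is not the article's R script, and the function names and the `seed` parameter are assumptions made for illustration.

```python
import math
import random


def first_state_probability(d):
    """Probability of state 1 that yields an expected mean distance of d."""
    if not 0.0 <= d <= 0.5:
        raise ValueError("with two states the mean distance must lie in [0, 0.5]")
    return (1.0 + math.sqrt(1.0 - 2.0 * d)) / 2.0


def generate_control(n_cases, n_variables, mean_distance, seed=None):
    """Return a dict of case label -> list of randomly chosen states (1 or 2)."""
    rng = random.Random(seed)
    p = first_state_probability(mean_distance)
    return {
        f"R{i + 1}": [1 if rng.random() < p else 2 for _ in range(n_variables)]
        for i in range(n_cases)
    }


# Parameters matching the control for the second example.
p = first_state_probability(0.424)
control = generate_control(12, 15, 0.424, seed=1)
```

With only twelve cases of fifteen variables each, the realised mean distance can differ noticeably from the target, which is the behaviour the article reports for this control.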
The distance matrix for the control data set, seen in Table 5, is again obtained by calculating the simple matching distance for every pair of cases. The mean distance between the randomly generated cases is 0.481. While this is some way off the desired value of 0.424, the difference is not unexpected given that a random process generated the cases. As with all of the controls, agreement between a pair of randomly generated cases is a purely random phenomenon. Apart from having the same two states among which to choose for every variable, none of these cases is related. In view of this, it may be surprising to find that some, such as R4 and R8, are relatively close to each other while others, such as R1 and R5, are relatively far apart.
Due to the nature of random processes, if enough random cases were produced then the distances between pairs would encompass the full range of possible values, with a minimum of zero and a maximum of one. The frequencies with which various distances occur would vary, extreme values being less common than others. In Figure 5, the CMDS map of these unrelated cases is roughly spherical. There are variations in the density of cases across different volume elements within the space, but these are merely random fluctuations. This map illustrates an important point: random agreement can mimic group structure even when none actually exists. The R-squared value is 0.64, meaning that the map conveys less than two-thirds of the information contained in the distance matrix. Nevertheless, this is still the best representation possible when the analysis is forced to work in only three dimensions. If more dimensions were allowed then more congruent results would follow. However, the problem remains of how to convey such results when our spatial perception is limited to three dimensions. The corresponding DC dendrogram in Figure 6 has a divisive coefficient of 0.51, not zero as might be expected but still a good deal less than the value of 0.75 obtained for the primary example.
A dendrogram can be "cut" at a certain height to partition the cases into a number of groups. This is achieved by choosing a height, drawing a horizontal line across the dendrogram at that height, and then grouping cases that belong to each sub-branch thus defined. Any height might be chosen, one possibility being the mean distance between pairs of cases. Cutting this dendrogram at a height equal to the mean distance of 0.481 produces the partition seen in Table 6.
This illustrates another important point: a method of analysis can be used to partition a data set even when its cases are unrelated. Cutting a dendrogram at some height can produce any number of groups between one and the total number of cases.
At what height should a dendrogram be cut to produce groups where members are actually related? For this dendrogram, an appropriate height would be something less than 0.2, the minimum distance between these randomly generated cases. Cutting at such a height produces a more sensible partition with only a solitary case per group.

Textual Variants in Mark (UBS4)
This example relates to textual variations in the Gospel of Mark as recorded in the fourth edition of the United Bible Societies Greek New Testament (Aland, et al. 1983 = UBS4). Richard Mallett deserves thanks for performing the exacting task of manually constructing the data matrix by encoding the variants found in the UBS4 apparatus. The resulting matrix, seen in Table 7, has 229 rows, one per witness, and 142 columns, one per variation unit presented for the Gospel of Mark. A numeral is used to encode the variant supported by a witness when its state is known at a variation site. If the state is not well defined then the code NA is used. Minuscule 2427 has been left in the data matrix even though it is now regarded as spurious.
Adding or omitting a single witness seldom has much effect on results obtained by the analysis methods used in this article.
The corresponding distance matrix, seen in Table 8, has only 65 rows and columns, one of each per witness that has survived the vetting process required to reduce sampling error to a tolerable level. The other 164 witnesses have been dropped because including any one of them would result in at least one distance being calculated from fewer than the minimum acceptable number of variation units.
Subjecting this distance matrix to classical multidimensional scaling produces a map that may be described as tetrahedral, having three lobes of relatively high witness density diverging away from, or converging towards, a dense concentration of Byzantine witnesses. Regions between the three lobes are practically devoid of witnesses, as can be seen in Figure 7. In this data set at least, it is rare to find a text that lies between two non-Byzantine varieties. There are a few exceptions to this rule, including Old Latin Codex Bobbiensis (it-k) and Codex Koridethi (038): Bobbiensis stands about the same distance from what some would call "Western" and "Alexandrian" groups; Koridethi is between the "Western" group and a complex which includes the Sinaitic Syriac (syr-s), Armenian (arm), and Georgian (geo) versions. The R-squared value for this map is 0.87, indicating that a three-dimensional treatment preserves 87% of the information contained in the distance matrix.
Divisive clustering analysis produces a dendrogram, seen in Figure 8, that, if cut at a height of 0.6, divides the witnesses into approximately the same groups as found in the CMDS map. However, the dendrogram might just as well be cut at another height to produce another number of groups. The associated divisive coefficient is 0.74. Its significance will not become apparent until compared with the divisive coefficient produced through DC analysis of a comparable distance matrix derived from random data.
The same script used to produce all of the controls in this article generated the control data matrix for this example, presented here in Table 9. Parameters supplied to the script are those required to produce a comparable distance matrix once it has been calculated from the generated data matrix. To be comparable, the generated data matrix needs to produce 65 cases with 142 variables per case while aiming for a mean distance between cases of 0.471.
The distance matrix calculated from the control data matrix (Table 10) turns out to have a mean distance between cases of 0.467. The distances between pairs of cases range from 0.338 to 0.613, excluding the distance of zero obtained when each case is compared with itself. This gives a sense of the normal range of distances to be expected for 65 unrelated cases of 142 variables each and a mean distance between cases of about 0.471.
Analysing this distance matrix to produce a CMDS map, seen in Figure 9, results in a spherical point cloud with a number of density fluctuations that might be misinterpreted as groups. This shows what kind of map to expect when a data set of this size and mean distance between cases contains no groups. Any apparent groups are spurious. The R-squared value is 0.15, indicating that the map conveys less than one sixth of the information contained in the distance matrix. By contrast, the figure for the UBS4 data set is 0.87. Apparently, squeezing the distance information into three dimensions is far easier for the UBS4 data than for analogous random data.
At first glance, the DC dendrogram for the control, seen in Figure 10, is not unlike that of the primary data set for this example. There are a couple of significant differences, however. Firstly, the heights at which branches form range from 0.34 to 0.61, the same range found in the control distance matrix. By contrast, the range of heights in the dendrogram of the primary data set is broader, varying between 0.008 and 0.86. Distances between the real cases tend to greater extremes than expected of data where states have been chosen at random. Sometimes the distances are smaller than normal, corresponding to a tendency for some witnesses to have similar sets of readings. Elsewhere the distances are larger than normal, consistent with a process which acted to drive certain texts apart. (Here, normal means what is expected of texts whose readings have been randomly selected from two possible states.) Secondly, the divisive coefficient for the control dendrogram is 0.34. Now the significance of the primary example's divisive coefficient of 0.74 can be appreciated.
The magnitude of this grouping indicator is much greater for the UBS4 data set than for an analogous data set that has no groups. These numbers indicate that grouping among New Testament witnesses is a real phenomenon.

Textual Variants in Mark (INTF)
The fourth example is based on textual variation data collected by the INTF for the Parallel Pericopes installment of the ECM (Strutwolf and Wachtel 2011). The data matrix in Table 11 was generated from an electronic file made available by the INTF at their website. (A complete description is in the "Introduction," 5*-7*. The electronic file is located at http://intf.uni-muenster.de/PPApparatus/.) It records the states of 333 texts at 503 variation sites. Most texts of this data set relate to the first hand of a Greek manuscript. The corresponding distance matrix in Table 12 retains only 151 of those texts, the other 182 having been eliminated to reduce sampling error. Its distances range from 0.002 to 0.413 and have a mean value of 0.159.
Analysing this distance matrix produces a CMDS map, found in Figure 11, with a similar appearance to the one obtained for the UBS4 data on Mark's Gospel.
Both maps have three lobes of relatively high witness density diverging away from, or converging towards, a dense concentration of Byzantine witnesses. Once again, regions between the three non-Byzantine lobes are practically vacant. The R-squared value for this map is 0.73, implying that it accounts for almost three quarters of the information contained in the distance matrix.
In the DC dendrogram extracted by analysis of the INTF distance matrix (Figure 12), manuscripts 05 and 032 split away first to form solitary branches. The remaining witnesses split three ways at a group-to-group distance of about 0.35.
One group contains a number of manuscripts often styled "Alexandrian." Another is Soden's Iβ (1279, 1528, …, 752). A number of the dendrogram branches correspond to regions of higher witness density found in the associated CMDS map. The divisive coefficient for this dendrogram is 0.8.
The control data matrix, found in Table 13, was produced by configuring the generating script to make 151 cases with 503 variables per case while aiming for a mean distance between cases of 0.159. The distance matrix obtained from this data matrix, represented here in Table 14, has values ranging from 0.099 to 0.235, less than a third of the distance range found in the primary data set of this example. The mean distance between cases hits the mark of 0.159 that the generating script aimed to produce. Analysis of this distance matrix produces a CMDS map with a roughly spherical inner core surrounded by numerous outliers, shown here in Figure 13. The control DC dendrogram, seen in Figure 14, has a divisive coefficient of 0.37, considerably less than the corresponding value of 0.8 obtained for the primary example. The contrast between the INTF data set for Mark's Gospel and the analogous data set composed of randomly produced cases again points to the existence of grouping among New Testament witnesses.

Comparing CMDS and DC Results for UBS4 and INTF Data
The CMDS maps obtained for the Gospel of Mark using the UBS4 and INTF data sets have a number of similarities. However, there are conspicuous differences as well. Two of the four regions of higher witness density found in the two maps can be identified with each other. What to call each region presents a problem but conventional labels will do for now. In the UBS4 map, the regions of higher density may be labelled as follows: 1. "Byzantine" 2. "Alexandrian" (e.g. 01, 03, 04, 019) 3. "Western" (e.g. 05, it-a, it-b, it-d) 4. "Family 1" (e.g. f-1, 28, 205).
The correspondence between the two maps for regions labelled as "Byzantine" and "Alexandrian" is plain enough to require no further comment. As for differences, the INTF map does not have a counterpart for the "Western" group of the UBS4 map but, surprisingly, puts Codex Bezae (05) in the vicinity of Family 1. Also, the UBS4 map does not have a counterpart for the Family 13 group found in the INTF map. Instead, the entity that represents Family 13 in the UBS4 apparatus (f-13) is located near the Family 1 group.
There is an explanation for these differences. Each CMDS map reveals groups found in the corresponding data set. The UBS4 data set has only a single entity to represent Family 13 (i.e. f-13) and the INTF data set has only a single representative of the "Western" family of texts (i.e. 05). If the data sets incorporated more witnesses of the respective families, then CMDS analysis results would contain corresponding groups. It seems that in the absence of multiple representatives of a group, CMDS analysis can place a solitary case closer to other groups than would occur if more members of its tribe were included. Perhaps the difference in location, which would be expressed if more members of a group were included, is being pushed into higher dimensions than those presented in a three-dimensional analysis result.
There is also consensus concerning the membership of a number of other branches when witnesses present in both data sets are considered. … (022) belongs to the "Byzantine" complex while the next closest six (2193, 205, 209, 28, 1, 1582) are all members of Family 1. These factors help to explain why 032 is solitary in the INTF dendrogram but shares the same branch as Family 1 in the UBS4 dendrogram.
Rather than being contradictory, both dendrograms reveal actual characteristics of 032.
Another difference relates to Family 13, which constitutes a separate branch of the INTF dendrogram. By contrast, the entities that represent Families 1 and 13 in the UBS4 apparatus (i.e. f-1 and f-13) occupy the same branch in the UBS4 dendrogram. The distance between these entities in the UBS4 distance matrix is 0.360; in the INTF distance matrix, minuscules 1 and 13 are a distance of 0.209 apart. By comparison, the mean witness-to-witness distance is 0.471 for the UBS4 distance matrix and 0.159 for the INTF distance matrix. That is, the distance from f-1 to f-13 is less than the mean distance for the UBS4 data set while the distance from minuscule 1 to minuscule 13 is greater than the mean distance for the INTF data set. This suggests an inconsistency between the two data sets that affects Family 1 or 13 or both. Perhaps one of the entities that represent these families in the UBS4 apparatus does not adequately represent its family? The disparity might also occur if minuscules 1 and 13 were not central members of their respective families.
Yet another difference is that the UBS4 dendrogram puts minuscule 28 in the same branch as the entity which represents Family 13 (f-13) while the INTF dendrogram locates 28 in the Family 1 branch. According to the UBS4 distance matrix, the closest three items to 28 are f-13, 205, and f-1. For the INTF distance matrix, fifteen of the nineteen closest witnesses to 28 are members of Family 1 or 13, with Family 1 members tending to precede those of Family 13. These fifteen include all but one of the members of Families 1 and 13 identified by the relevant branches of the INTF dendrogram, minuscule 983 being the only one left out. Thus, both dendrograms accurately reflect the situation of minuscule 28 relative to Families 1 and 13 implied by the associated distance matrices. The INTF data set, which is more comprehensive with respect to Greek manuscripts, shows that minuscule 28 is more closely related to Family 1 than Family 13.
Comparing these analysis results has been instructive. The cases of Codex Bezae and Family 13 show how sensitive results can be to the mix of witnesses selected for inclusion in a data set. The case of Codex Bezae also shows that an apparent affiliation indicated by one analysis method should be regarded with suspicion if not confirmed by other methods. Recourse to the distance matrix often provides a better understanding of cases for which analysis results are puzzling.
Both the UBS4 and INTF data sets exhibit weaknesses with respect to representing the New Testament textual tradition of Mark's Gospel. The UBS4 data set suffers from a relative lack of variation sites and Greek manuscripts, and there may be a problem with the entities it uses to represent Families 1 and 13. At the same time, the INTF data set lacks early versions and patristic citations, which offer a valuable context for understanding affiliations among the Greek manuscripts.

Jerome's Early Manuscripts
Jerome says in his prologue to the Vulgate version of the Four Gospels:
For if we are to pin our faith to the Latin texts, it is for our opponents to tell us which; for there are almost as many forms of texts as there are copies. If, on the other hand, we are to glean the truth from a comparison of many, why not go back to the original Greek and correct the mistakes introduced by inaccurate translators, and the blundering alterations of confident but ignorant critics, and, further, all that has been inserted or changed by copyists more asleep than awake? … I therefore promise in this short …
The CMDS map for textual variants in Mark (UBS4) shows that Jerome's revision (vg) lies close to a trajectory which runs between a cluster of Old Latin texts such as Vercellensis (it-a), Veronensis (it-b), Colbertinus (it-c), and Bezae (it-d) at one end and "Byzantine" texts at the other. If these Old Latin texts represent the Latin exemplars used by Jerome, it seems that the "early" Greek manuscripts he used to revise the Latin text of Mark were of the Byzantine variety.

Significant Distances
As a first step towards establishing what constitutes a significant distance between two witnesses, one might consider the number of readings per variation unit. … half of the variation sites have only two readings, about three quarters have three or fewer, and only about one quarter have four or more when many New Testament manuscripts are compared. These numbers are dependent on the editorial policy used to define variation site boundaries and therefore apply only to the data upon which the ECM is based. Nevertheless, they show that there are usually only a few alternatives at each place where the text varies. If this is the case when many manuscripts are compared, it is reasonable to expect that the numbers of alternative readings at variation sites known to ancient readers or scribes would have been even fewer. Accordingly, when a reader or scribe knew there were alternatives, he or she would usually have known of only two, sometimes three, rarely more.
The manuscript evidence shows that the copying process was inherently conservative. Klaus Wachtel (2011) writes: The … figures impressively demonstrate the degree of coherence between New Testament manuscripts… This evidence enforces the conclusion that the efforts of scribes to copy their exemplar as precisely as possible were, on the whole, successful. A chain of closely related copies connects the single manuscript texts with the source of the tradition, the initial text. (221) However, the evidence also shows that scribes and readers regularly marked up manuscripts with alternative readings, deleting a phrase here and adding one there.
If a scribe copied a manuscript that included such mark-up, a decision concerning how to deal with alternatives was required at every place where they occurred.
(Nothing has changed!) When faced with such a choice, the scribe might choose one of the options or combine more than one to produce a conflation. This is not the only way that alternative readings entered the text. A reader or copyist could also create a novel reading without any manuscript authority, perhaps in an attempt to repair an apparent corruption or to "improve" the text where there was a perceived difficulty. Then there were unconscious alterations: involuntary additions, substitutions, or omissions that occurred in the process of a copyist reading the exemplar, remembering its words, then writing them down in the copy. These actions sometimes created nonsense readings, which would subsequently attract the attention of a reader or copyist seeking to repair faults in the copy.
Considering the variations alone, a copying event can be modelled as a sequence of choices between readings at a series of variation sites. Not every reading at a variation site would have had the same chance of being selected in a particular copying event. One reading might have stood out as preferable for doctrinal, stylistic, or parochial reasons. Then again, none might have been favoured. It is impossible to say with confidence which alternative was more likely to be chosen by a copyist, although there does seem to have been a preference for readings found in near relatives of the manuscript at hand. As Gerd Mink writes (2004,22), "In a dense tradition, it is typical of contamination that a witness shares most of its variants with its closest relative and if it deviates from this relative the variants concerned can be found in other close relatives." While there is no way to determine the probability that a given reading would have been chosen by a copyist working at a particular place and time, it is possible to make an estimate based on the relative frequency of the reading among extant witnesses. A refinement would be to consider the relative frequency of a reading among closely related witnesses. Yet another approach would be to assume equal probabilities among readings that are relatively common, excluding rarities altogether.

A Simple Model
Adopting the last approach and assuming the common case of only two possible readings per variation unit results in a particularly simple model where each copying event is represented by a sequence of trials, each trial comprised of selecting one alternative from two equally probable states. The model applies to a copyist selecting a series of readings from an exemplar whose variation sites each have only two readings with apparently equal merit. From a statistical perspective, the model is equivalent to a series of coin tosses using an unbiased coin. This equivalence allows a minimum standard to be established for what constitutes a statistically significant level of disagreement between two witnesses. If there are two equally probable states (i.e. readings) for each trial (i.e. selection of a reading at a variation site), the chance of disagreement at each place where a choice has to be made is one half. This is because there are four possible combinations of two states chosen in two trials, half of which constitute disagreement. To illustrate, if the two states are represented by the numerals 1 and 2 then the four possible combinations are (1, 1), (1, 2), (2, 1), and (2, 2), the second and third of which disagree.
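The enumeration behind the figure of one half can be checked mechanically. The following is a minimal Python sketch, not one of the article's scripts; the function name `disagreement_probability` is hypothetical, and the extension to three states assumes that all states remain equally probable.

```python
from itertools import product

def disagreement_probability(states):
    """Proportion of ordered pairs of states that disagree.

    Assumes every state is equally likely to be chosen in each of
    the two selections, as in the simple model.
    """
    pairs = list(product(states, repeat=2))
    disagreements = [pair for pair in pairs if pair[0] != pair[1]]
    return len(disagreements) / len(pairs)

# The four combinations for two states: (1, 1), (1, 2), (2, 1), (2, 2)
print(list(product([1, 2], repeat=2)))
print(disagreement_probability([1, 2]))     # two equally probable readings
print(disagreement_probability([1, 2, 3]))  # three equally probable readings
```

With more than two equally probable readings the chance of disagreement rises above one half, which is one reason why the two-state model gives only a minimum standard.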
The binomial distribution applies to the outcomes of multiple independent trials when the outcome of each trial can have only two states and the respective probabilities of the two states are the same for each trial. By convention, the two states are labelled success and failure. Given a particular number of trials and a fixed probability of success, the binomial distribution describes how frequently each number of successes occurs. Using this distribution, it is possible to obtain critical limits, which are the upper and lower bounds of a confidence interval that specifies the range of numbers of successes that can be confidently attributed to chance. Before obtaining the limits, it is necessary to select an alpha value, which represents an acceptable level of error. While any number of successes between zero and the number of trials can occur, only a central range of numbers of successes is likely. Over many repeats of an experiment consisting of a set number of trials, numbers of successes outside this central range will occur with a relative frequency equal to the alpha value. If the alpha value is small enough, then it is reasonable to assert that a number of successes outside the range defined by the confidence interval is not due to chance. However, such an assertion is expected to be wrong in the proportion of cases corresponding to the alpha value. For this article, an alpha value of 5% is used, producing a 95% confidence interval. Given this alpha value, one expects to be wrong only 5% of the time when asserting that a value outside the 95% confidence interval is not due to chance. (An alpha value of 5% is common for work where the consequences of false positives are not too dire.) Dividing a number of successes by the total number of trials produces a proportion of success. The table gives 95% confidence intervals for the proportion of success expected to occur by chance for various numbers of trials where each trial has a probability of success equal to one half.
Each interval uses the notation [lower, upper], where lower and upper are the inclusive limits of the range. The intervals given in the table relate to the simple model where each trial consists of two random selections from two equally probable states. As a success corresponds to a disagreement between two randomly selected states, the proportion of successes corresponds to the proportion of disagreements, which is the simple matching distance.
These intervals only apply to the special case of each variation unit having two equally probable readings. Nevertheless, the table illustrates some important points:
1. Calculating a distance from too few variation units is a futile exercise because any distance thus obtained is reasonably attributable to chance. In this case, if five or fewer variation units are being compared, then no distance is outside the range expected to occur when states are randomly selected.
2. The relative size of the confidence interval bounded by the upper and lower critical limits decreases as the number of trials increases. Here, the relative size of the interval is 100% for five, just under 50% for fifteen, and 20% for one hundred trials.
3. Just as a distance less than the lower bound of a confidence interval is statistically significant, so is one greater than the upper bound. To use the example of one hundred trials given in this table, a distance between two witnesses which is larger than 0.6 is just as unexpected as one less than 0.4.
(The limits are obtained using an R expression of the form qbinom(c(0.025, 0.975), n, p)/n, where n is the number of trials and p is the probability of success. The values 0.025 and 0.975 are the lower and upper quantiles, which specify the probability of a random variable being less than the corresponding limit. The difference between these quantiles is 0.95, the complement of the alpha value.)
A more realistic approach establishes critical limits by reference to the distribution of witness-to-witness distances found in the data set being studied.
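The interval sizes quoted for five, fifteen, and one hundred trials can be reproduced without R. The sketch below is a plain Python stand-in for the qbinom call, written for illustration only; the function names `qbinom` and `critical_limits` are chosen here to mirror the R expression and are not taken from the article's scripts.

```python
from math import comb

def qbinom(q, n, p):
    """Smallest k with P(X <= k) >= q for X ~ Binomial(n, p).

    This is the same definition of a quantile that R's qbinom uses.
    """
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += comb(n, k) * p**k * (1 - p)**(n - k)
        if cumulative >= q:
            return k
    return n

def critical_limits(n, p, alpha=0.05):
    """Lower and upper critical limits expressed as proportions."""
    lower = qbinom(alpha / 2, n, p) / n
    upper = qbinom(1 - alpha / 2, n, p) / n
    return lower, upper

# Simple model: two equally probable readings, so p = 0.5
for n in (5, 15, 100):
    lower, upper = critical_limits(n, 0.5)
    print(n, round(lower, 3), round(upper, 3), round(upper - lower, 3))
```

The last column printed is the relative size of the interval, which shrinks as the number of trials grows.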
The histograms in Figures 15 and 16 show the distribution of witness-to-witness distances. A binomial curve is superimposed on each histogram to show the range of distances expected to occur for unrelated pairs of witnesses. These curves were generated by an R expression that uses the following parameters to obtain relative frequencies from the binomial distribution:
1. the number of trials, as estimated by the rounded mean number of variation units from which entries in the distance matrix were calculated
2. the probability of success, as estimated by the mean value of distances in the distance matrix.
The R script named groups-scripts-hist.R was used to produce the histogram and binomial curve in each case. The mean number of variation units is rounded because the number of trials used to define a binomial distribution must be an integer. Each binomial curve has been scaled up to the maximum height of the corresponding histogram; if not scaled, the vertical height is much less, making it more difficult to see the horizontal limits implied by the binomial distribution. Due to the scaling, the vertical scale does not give probability values for the binomial curves.
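The construction of a scaled binomial curve can be sketched as follows. This is an illustrative Python outline and not the contents of groups-scripts-hist.R; the function names, the mean number of variation units (123.4), and the histogram peak (40) are hypothetical, while 0.471 is the mean distance reported for the UBS4 distance matrix.

```python
from math import comb

def binomial_curve(n, p):
    """Relative frequency of each possible distance k/n under Binomial(n, p)."""
    return [(k / n, comb(n, k) * p**k * (1 - p)**(n - k)) for k in range(n + 1)]

def scale_to_histogram(curve, histogram_peak):
    """Rescale the curve so its highest point equals the tallest histogram bar."""
    curve_peak = max(freq for _, freq in curve)
    return [(distance, freq * histogram_peak / curve_peak)
            for distance, freq in curve]

mean_units = 123.4     # hypothetical mean number of variation units
mean_distance = 0.471  # mean distance of the UBS4 distance matrix
n = round(mean_units)  # the number of trials must be an integer

curve = scale_to_histogram(binomial_curve(n, mean_distance), histogram_peak=40)
```

After scaling, the heights no longer sum to one, which is why the vertical axis cannot be read as a probability for the curve.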
A binomial curve generated from mean values for the number of variation units and distance between witnesses shows the typical range of distances which is expected to occur by chance. However, there are reasons why such a range should not be expected to apply to all pairs of witnesses in the data set. First, the readings of a witness might not be well defined at every variation unit so the number of variation units used to calculate a distance could vary from one pair of witnesses to the next. Second, the expected distance between witnesses may vary. Fortunately, the curves are not particularly sensitive to changes in these parameters. Nevertheless, if the values for the entire data set are far different to those appropriate for a particular set of witnesses then it is better to use the particular values to obtain the corresponding expected range of distances. To illustrate, if one were interested in a fragmentary witness, then the number of trials would be constrained by the number of variation units at which its readings are defined. As another example, if one were studying a closely related subset of witnesses, then the probability of success would be the mean distance between members of that subset.
The critical limits for each data set were obtained with the same R script as used to produce the histograms and binomial curves. The relevant R expression is the same as was used to obtain critical limits for the simple model. There, the probability value was 0.5; here, it is the mean distance.
Any distance between the upper and lower limits of an interval is normal in the sense of not being unexpected if readings are randomly chosen. A distance outside this range is not expected to happen by chance: if less than the lower limit, then the relevant pair of witnesses are adjacent, being closer together than normal; if greater than the upper limit there is an opposite relationship, the pair being further apart than normal. An active process is implied for distances outside the normal range, one that has either driven texts closer together or further apart than would be expected if readings had been chosen at random. For the UBS4 data set, the lower limit is 0.382, corresponding to a percentage agreement of 61.8%. Thus, for this data set, a percentage agreement greater than 61.8% is significant in a statistical sense. For the INTF data set, a distance less than 0.127, corresponding to a percentage agreement greater than 87.3%, is statistically significant.
We are now in a position to provide another answer to the question posed earlier: "At what height should a dendrogram be cut to produce groups where members are actually related?" A reasonable value to use for this purpose is the lower bound of the range of distances that are likely to occur by chance for a given data set. To illustrate, the control for the well-defined groups example is composed of twelve cases with fifteen variables per case and a mean distance between cases of 0.481. The corresponding lower critical limit is 0.267. (The limit is obtained using the R expression qbinom(alpha/2, n, p)/n, where the alpha value is 0.05, the number of trials n is 15, and the probability of success p is 0.481.) Cutting the relevant dendrogram at this height succeeds in assigning most cases to solitary groups, as is appropriate because the cases are unrelated. However, cutting at that height also produces two spurious groups, one containing R4 and R8, the other R5 and R11. This serves as a reminder that pairs of randomly generated cases can occasionally be closer together or further apart than expected. In the long run average of many trials, the frequency of such cases approaches the alpha value.
It is also reasonable to use the upper critical limit in the context of dendrograms.
Branches obtained by cutting a dendrogram at the upper limit represent superstructures that are more dissimilar than would be expected if their members were comprised of randomly chosen states. Using the UBS4 dendrogram as an example, cutting at the upper critical limit (0.561) produces the partition seen in Table 18.

Sampling Error
The sampling error of a distance between two texts is the difference between the actual distance that would be obtained by examining the entire population of variation sites and the estimated distance obtained from a sample of the places where the two texts vary. The actual distance is a parameter obtained from the entire … of information is supplied by a statistical analysis of the sample to produce a confidence interval that probably contains the actual distance. The interval's limits are calculated from the binomial distribution using the estimated distance as the probability of success and the number of compared variation units as the number of trials. To illustrate, Table 19 gives a distance estimate and 95% confidence interval obtained when variables are randomly selected from two of the cases found in one of the example data sets.
(The R script named groups-scripts-sampling.R was used to produce these values. It randomly selects the requested number of variables then compares them for the two cases specified. The estimated distance is the simple matching distance between the two cases and is obtained by counting the number of disagreements and dividing by the number of places compared. The limits are obtained with the expression qbinom(c(alpha/2, 1 - alpha/2), n, p)/n, where alpha, n, and p are the alpha value (i.e. 0.05), number of trials (i.e. the number of places compared), and probability (i.e. the estimated distance). Three decimal places are given for all of the values even though this level of precision may be unwarranted in view of the confidence interval.)
Another way to express a confidence interval is by giving an estimated value and a margin of error. The lower margin of error is the difference between the lower critical limit and the estimate while the upper margin of error is the difference between the estimate and upper critical limit. If the interval is symmetrical with respect to the estimate then the upper and lower margins are the same and the expression estimate ± margin can be used to specify the estimate and its confidence interval.
The magnitude of the margin of error for simple matching distances and the binomial distribution is approximately $t\,(p(1-p))^{1/2}/n^{1/2}$, where t is the appropriate t-distribution value, p the probability, and n the number of trials. The value of t tends towards the corresponding z value for the normal distribution (e.g. 1.96 for an alpha value of 0.05) as the number of trials increases. For smaller values of n, the t value may be obtained using the R expression qt(1 - alpha/2, df = n - 1). Taking the seventh row of the table as an example, when n is 50 and the alpha value is 0.05 then the t value is 2.01. Using the estimated distance (0.22) as the probability produces a margin of error estimate of $2.01 \times (0.22 \times 0.78)^{1/2}/50^{1/2}$, which is 0.118. This agrees quite well with the margins of 0.10 (lower) and 0.12 (upper) obtained with the binomial distribution.
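The arithmetic of this example can be repeated in a few lines. The sketch below is illustrative Python with a hypothetical function name; because the standard library has no t-distribution quantile, the t value of 2.01 is taken from the text and only the limiting z value is computed.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(p, n, t):
    """Approximate margin of error t * sqrt(p * (1 - p) / n)."""
    return t * sqrt(p * (1 - p) / n)

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)  # limiting value of t for large n

# Estimated distance 0.22 from 50 places compared, t = 2.01 as in the text
print(round(margin_of_error(0.22, 50, t=2.01), 3))  # 0.118
# The same estimate using the normal approximation instead of t
print(round(margin_of_error(0.22, 50, t=z), 3))
```

Using z instead of t gives a slightly narrower margin, the difference becoming negligible as the number of places compared increases.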
As can be seen from the table, estimates based on only a few places of comparison are unreliable because the range of values expected to occur for an estimate (i.e. the confidence interval) covers a large part of the range of possible values. Increasing the number of places compared makes the relative size of the confidence interval decrease. It is therefore desirable to use as many variation units as possible when estimating distances between witnesses. However, one is sometimes forced to use fewer, as when fragmentary witnesses are involved. What then is an acceptable lower limit for the number of variation units? There is no absolute guide. In this article, a distance is only used if calculated from a minimum of fifteen variation units where both witnesses are defined. In order to satisfy this standard, the script used to calculate distance matrices for this article (i.e. groups-scripts-dist.R) goes through an iterative process, dropping the least well-defined member of the least well-defined pair at every step until all remaining distances are calculated from at least fifteen variation units.
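One way such an iterative process might work is sketched below in Python. It is not the code of groups-scripts-dist.R. It assumes that the "least well-defined pair" is the pair sharing the fewest variation units at which both witnesses are defined, and that the "least well-defined member" is the one with fewer defined readings overall; the function names and the toy witnesses A, B, and F are hypothetical.

```python
from itertools import combinations

def shared_units(a, b):
    """Number of variation units where both witnesses have a defined reading."""
    return sum(1 for x, y in zip(a, b) if x is not None and y is not None)

def defined_units(a):
    """Number of variation units where a witness has a defined reading."""
    return sum(1 for x in a if x is not None)

def drop_poorly_defined(data, minimum=15):
    """Drop witnesses until every remaining pair shares at least `minimum` units."""
    kept = dict(data)
    while len(kept) > 1:
        # Assumed meaning of "least well-defined pair": fewest shared units
        worst = min(combinations(kept, 2),
                    key=lambda pair: shared_units(kept[pair[0]], kept[pair[1]]))
        if shared_units(kept[worst[0]], kept[worst[1]]) >= minimum:
            break
        # Assumed meaning of "least well-defined member": fewest defined readings
        loser = min(worst, key=lambda name: defined_units(kept[name]))
        del kept[loser]
    return kept

# Toy data: None marks an undefined reading; F is a fragmentary witness
data = {
    "A": [1] * 20,
    "B": [1, 2] * 10,
    "F": [1] * 5 + [None] * 15,
}
result = drop_poorly_defined(data)
print(sorted(result))
```

Dropping one witness at a time, rather than every witness in an under-defined pair, keeps as many witnesses as possible in the final distance matrix.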

Ranking Witnesses by Distance from a Reference
Given a distance matrix, it is straightforward to rank witnesses by distance from a reference witness. Furthermore, a confidence interval can be established for each distance estimate using the number of variation units from which it is calculated. It is then possible to identify adjacent and opposite witnesses with respect to a reference witness. Those that are adjacent are less distant than the lower limit of the relevant interval, while those that are opposite are more distant than the upper limit. Table 20 ranks witnesses by distance from the entity which represents Family 1 in the UBS4 data set (i.e. f-1). This table shows that f-1 is adjacent to 205, 28, Lect, f-13, 1424, 1241, geo, and 1505. Many witnesses (1243, …, it-i) occupy the middle ground with distances from f-1 that are not statistically significant; there may be a relationship with f-1 in each case, but the sampling error associated with the available number of defined variation units in the data set is too large to allow a confident decision to be made on the matter. At the other end of the scale, f-1 is opposite to Delta (037), cop-bo, it-ff-2, it-b, it-c, it-r-1, 2427, it-a, Aleph, B, D, it-k, and it-d. This indicates that Family 1 is non-Western and non-Alexandrian in the Gospel of Mark. (One assumes that the entity used to represent Family 1 in the UBS4 apparatus (i.e. f-1) does represent that family of texts.)
(The R script named groups-scripts-rank.R produced this list. An asterisk marks any distance that is statistically significant for an alpha value of 0.05. The confidence interval used to decide whether each distance is statistically significant was calculated using an alpha value of 0.05, the number of variation units used to calculate the distance, and the rounded mean distance between all pairs of witnesses.)
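The ranking procedure can be outlined in Python as follows. This is a sketch rather than the contents of groups-scripts-rank.R: the function names are hypothetical, the witnesses W1 to W3 with their distances and numbers of variation units are invented for illustration, and 0.471 is the mean distance reported for the UBS4 distance matrix.

```python
from math import comb

def qbinom(q, n, p):
    """Smallest k with P(X <= k) >= q for X ~ Binomial(n, p)."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += comb(n, k) * p**k * (1 - p)**(n - k)
        if cumulative >= q:
            return k
    return n

def rank_from_reference(distances, mean_distance, alpha=0.05):
    """Rank witnesses by distance from a reference and label significant ones.

    `distances` maps a witness to (distance from the reference, number of
    variation units from which that distance was calculated).
    """
    rows = []
    for witness, (distance, n) in distances.items():
        lower = qbinom(alpha / 2, n, mean_distance) / n
        upper = qbinom(1 - alpha / 2, n, mean_distance) / n
        if distance < lower:
            label = "adjacent"
        elif distance > upper:
            label = "opposite"
        else:
            label = ""  # not statistically significant
        rows.append((witness, distance, label))
    return sorted(rows, key=lambda row: row[1])

# Invented distances from a reference witness, each from 100 variation units
distances = {"W3": (0.70, 100), "W1": (0.20, 100), "W2": (0.47, 100)}
ranking = rank_from_reference(distances, mean_distance=0.471)
for row in ranking:
    print(row)
```

Because the limits depend on the number of variation units behind each distance, a fragmentary witness needs a more extreme distance than a complete one before it is labelled.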

The Random Walk
The random walk is a statistical problem that considers how far from the starting point a thing will end up if every movement is random. The classic example is a drunk staggering along a gutter. The man is so drunk that a forward or backward step is equally likely. How far from the beginning will the drunk end up? If this scenario is extended into two dimensions (e.g. a level field) then the drunk could end up anywhere on a flat surface within a maximum distance of his beginning point, that maximum being the number of steps times the average step length. For three dimensions (involving stairs or ladders), the final location would be anywhere within a sphere of the same maximum radius. While possible for the drunk to take a step in the same direction every time, it is unlikely. In fact, the drunk will probably end up somewhere within a smaller distance of the origin, which distance is the order of the square root of the number of steps times the average step length.
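The square-root behaviour is easy to see in a simulation. The following Python sketch covers only the one-dimensional case (the drunk in the gutter) with unit step length; the function names, the number of walks, and the random seed are arbitrary choices made for illustration.

```python
import random
from math import sqrt

def final_distance(steps, rng):
    """Distance from the origin after a walk of equally likely +1 or -1 steps."""
    position = 0
    for _ in range(steps):
        position += rng.choice((-1, 1))
    return abs(position)

def rms_distance(steps, walks, seed=1):
    """Root mean square of the final distance over many independent walks."""
    rng = random.Random(seed)
    total = sum(final_distance(steps, rng) ** 2 for _ in range(walks))
    return sqrt(total / walks)

steps = 100
typical = rms_distance(steps, walks=2000)
print("maximum possible distance:", steps)
print("square root of number of steps:", sqrt(steps))
print("simulated typical distance:", round(typical, 2))
```

The simulated typical distance should lie close to the square root of the number of steps and far below the maximum possible distance.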
To the extent that the New Testament textual tradition can be modelled as random choices among readings, one might expect the diameter of the point cloud in a CMDS diagram to be about the same as obtained for a random data set of the kind generated for the controls presented above. As it happens, the diameters of a number of the major clusters in the CMDS maps of the UBS4 and INTF data sets are …
What might explain these larger than expected differences? One possibility is conscious selection among readings that resulted in distinctive texts, perhaps due to theological differences between the users (or promulgators) of those texts.
Another possibility is suggested by the apparent association of certain clusters with early versions, seen in Table 21. Perhaps the early versions were players in the New Testament's divergence into some of the major textual streams seen in the analysis results? It is not unreasonable to expect that a scribe copying a Greek manuscript in a region where a particular version prevailed would tend to make the Greek conform to a back-translation of that version.

Partitioning a Data Set
Analysis techniques such as classical multidimensional scaling and divisive clustering reveal how many groups exist when the groups are well defined. However, these techniques do not give clear guidance on the number of groups when grouping is poorly defined. As shown above, a classical multidimensional scaling map produced from randomly generated cases exhibits density fluctuations which might be mistaken for actual groups; also, divisive clustering can be used to partition a data set which does not contain any groups. (If grouping is particularly well defined then there is no need to apply multivariate analysis techniques because it is straightforward to identify the groups by inspection of the data or distance matrix. An abuse such as grouping unrelated cases by means of divisive clustering casts the researcher in a poor light, not the analysis method. Embarrassment might be avoided by knowing the data and the limitations of each analysis technique.) The problem of how to define a group is exacerbated by the phenomenon of mixture. Viral readings have leapt from text to text, making it harder to untangle the strands of textual transmission. Mixture blurs the boundaries of textual groups, causing each group to merge into its neighbours. In the case of the New Testament text, mixture is so ubiquitous and the number of copies so large that one cannot expect there to be vacant regions between groups. A chain of closely related witnesses can usually be found to connect even the most disparate ones. There is no reason to expect large gaps between families of witnesses. If such a gap does exist then it is quite possibly due to an accident of history whereby witnesses that once occupied the space are now lost.
Fortunately, there are modes of multivariate analysis that allow groups to be discovered even when mixture is present. One such technique called partitioning around medoids (PAM) divides the cases of a data set into a predetermined number of groups. A set of this many representative cases called medoids is then chosen so that the sum of all distances from cases to the selected medoids is a minimum. This technique is more robust than another popular partitioning technique called k-means clustering because it is less sensitive to noise (such as sampling error) and outliers (i.e. eccentric cases). There are two phases to the procedure (Maechler et al. 2016):
1. build: the algorithm selects a tentative set of medoids
2. swap: cases are swapped with tentative medoids until no further reduction in the sum of distances occurs.
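The two phases can be illustrated with a deliberately naive Python sketch. It is not the implementation found in the R cluster package of Maechler et al.: the build phase here is a simple greedy selection, the swap phase accepts any exchange that lowers the total, and the function names and six-point example are hypothetical.

```python
def total_cost(dist, medoids):
    """Sum of distances from every case to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def pam(dist, k):
    """Naive partitioning around medoids for a square distance matrix."""
    n = len(dist)
    # Build: greedily add the case that most reduces the total cost
    medoids = []
    for _ in range(k):
        candidates = [c for c in range(n) if c not in medoids]
        medoids.append(min(candidates,
                           key=lambda c: total_cost(dist, medoids + [c])))
    # Swap: exchange a medoid for a non-medoid while that lowers the cost
    improved = True
    while improved:
        improved = False
        for position in range(k):
            for c in range(n):
                if c in medoids:
                    continue
                trial = medoids[:position] + [c] + medoids[position + 1:]
                if total_cost(dist, trial) < total_cost(dist, medoids):
                    medoids, improved = trial, True
    # Each case is assigned to its nearest medoid
    groups = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return sorted(medoids), groups

# Six cases on a line forming two obvious groups
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, groups = pam(dist, k=2)
print(medoids, groups)
```

In this example the greedy build phase first picks a medoid that is slightly off centre, and the swap phase then replaces it with the central case of its group.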
It may seem preposterous to use a grouping technique that requires the number of groups to be specified beforehand. After all, the aim is to discover groups, not to make arbitrary decisions about how many there might be. Fortunately, a statistic called the silhouette width provides a way forward. A silhouette width approaching a value of one indicates that a case is in the correct cluster, a value approaching zero indicates that a case lies between clusters, and a negative value indicates that a case is in the wrong cluster. The mean silhouette width (MSW) is the average of all silhouette widths obtained when a particular number of groups is specified. The MSW tends to be greater when the preordained number matches how many groups are actually contained in the data. Consequently, peaks in a graph of MSW versus numbers of groups indicate how many groups actually exist. The MSW tends to decrease as the number of groups increases so it is worth considering more than just the first peak when trying to discern preferable numbers of groups for a data set.  The use of PAM analysis in conjunction with the mean silhouette width to discover how many groups exist will now be demonstrated by reference to the example data sets.
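The silhouette width of a case is conventionally defined as (b - a) / max(a, b), where a is the mean distance from the case to the other members of its own group and b is the smallest mean distance from the case to the members of any other group. The Python sketch below applies that standard definition to a toy distance matrix; the function names and data are hypothetical, singletons are given a width of zero by convention, and at least two groups are assumed.

```python
def silhouette_widths(dist, groups):
    """Silhouette width of every case, given a group label for each case."""
    labels = set(groups)
    widths = []
    for i, own in enumerate(groups):
        same = [j for j, g in enumerate(groups) if g == own and j != i]
        if not same:
            widths.append(0.0)  # convention for a singleton group
            continue
        # a: mean distance to the other members of the case's own group
        a = sum(dist[i][j] for j in same) / len(same)
        # b: smallest mean distance to the members of any other group
        b = min(
            sum(dist[i][j] for j, g in enumerate(groups) if g == other)
            / groups.count(other)
            for other in labels if other != own
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths

def mean_silhouette_width(dist, groups):
    """Average of all silhouette widths for one partition."""
    widths = silhouette_widths(dist, groups)
    return sum(widths) / len(widths)

points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
natural = mean_silhouette_width(dist, [0, 0, 0, 1, 1, 1])
arbitrary = mean_silhouette_width(dist, [0, 1, 0, 1, 0, 1])
print(round(natural, 3), round(arbitrary, 3))
```

Repeating the calculation for partitions with different numbers of groups and plotting the results gives the kind of MSW plot used to look for peaks.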

Distances Between Cities
Plotting the MSW versus number of groups for the first example produces the results seen in Figure 17. The MSW value is given for each number of groups from two up to one less than the number of cases in the data set. The tendency for the MSW to decrease as the number of groups increases is apparent. Local maxima occur for three, seven, eleven, and seventeen groups.
Using PAM to partition the data set into three groups produces the divisions shown in Table 22. Comparing with the corresponding CMDS map shows that this partition makes sense with respect to the geographical distribution of the cities, having isolated North American, Asian, and European groups. A similar partition is obtained by cutting the corresponding DC dendrogram at a height of 6,000 km although DXB (Dubai) and SYD (Sydney) form solitary branches when that is done.  The medoids identified by PAM analysis are DFW (Dallas and Fort Worth), HKG (Hong Kong), and FRA (Frankfurt), which stand near the geographical centres of the regions associated with the groups.
The next local maximum occurs for seven groups. The corresponding partition in Table 23 also makes sense when compared with the geographical data. The North American group is now split east-west, the Asian group is split north-south, and the two most isolated cases (DXB and SYD) form singletons. (A singleton is a set that contains only one element.) While the exercise could be continued with the other numbers of groups identified by the MSW plot, these two partitions suffice to show the merit of the approach. The example of distances between cities shows that sensible groupings are obtained even though there is no "correct" number of groups for the data set.

Well-Defined Groups
The well-defined groups example does have a "correct" number of groups, which is four. The MSW plots for this example's primary and control data sets can be seen in Figures 18 and 19. At first glance these two do not seem very different. Both exhibit a tendency for the mean silhouette width to decrease as the number of groups increases. However, there is a major difference. The highest peak for the primary example (0.728) is more than three times greater than the highest peak of the control (0.190). Given that the control is based on randomly generated cases, it is prudent to ignore any peak in the primary example if its magnitude is not much greater than the value obtained for the same number of groups in the control. The highest peak in the primary example's MSW plot correctly identifies the number of groups in the data set, and the corresponding partition correctly assigns the cases to their respective groups, seen here in Table 24.

The first plot has quite prominent peaks that exceed the noise level for three, six, eleven, twenty-four, thirty-four, forty-two or forty-three, and forty-nine groups. As far as grouping is concerned, this data set is like the example for inter-city distances rather than the one for well-defined groups in that there is no clear winner, no "correct" number of groups. Instead, certain numbers of groups have greater claim than others to be "natural" when partitioning the data set. For such a data set, peaks in the MSW plot are suggestive rather than emphatic.
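The comparison of a primary MSW plot with its control can be sketched as follows. The function names and the `ratio` threshold are assumptions made for illustration; the article says only that a peak should be "much greater" than the control value for the same number of groups, and the MSW values below are made up.

```python
def local_maxima(msw):
    """Numbers of groups whose MSW exceeds that of both neighbours.

    msw maps number of groups -> mean silhouette width. The first and
    last entries count as peaks if they exceed their single neighbour.
    """
    ks = sorted(msw)
    peaks = []
    for idx, k in enumerate(ks):
        left = msw[ks[idx - 1]] if idx > 0 else float("-inf")
        right = msw[ks[idx + 1]] if idx + 1 < len(ks) else float("-inf")
        if msw[k] > left and msw[k] > right:
            peaks.append(k)
    return peaks

def peaks_above_noise(primary, control, ratio=2.0):
    """Keep peaks at least `ratio` times the control MSW for the same k.

    `ratio` is a hypothetical threshold, not a value from the article.
    """
    return [k for k in local_maxima(primary)
            if primary[k] >= ratio * max(control.get(k, 0.0), 0.0)]

# Made-up MSW values for a primary data set and its randomized control.
primary = {2: 0.30, 3: 0.45, 4: 0.40, 5: 0.42, 6: 0.20, 7: 0.21, 8: 0.10}
control = {2: 0.15, 3: 0.16, 4: 0.15, 5: 0.14, 6: 0.13, 7: 0.12, 8: 0.11}
print(local_maxima(primary))
print(peaks_above_noise(primary, control))
```

A fixed ratio is only one way to express "exceeds the noise level"; comparing against several randomized controls would be a natural refinement.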
One is then left wondering what number of groups is best. In this case, the peak at twenty-four groups seems particularly conspicuous in view of the general tendency for the mean silhouette width to decrease as the number of groups increases. However, dividing the witnesses into so many groups tends to dissolve larger entities, which, though not as coherent as ones that remain together, are nevertheless important for comprehending the broad structure of the textual tradition. Partitions of the UBS4 data set based on three, six, eleven, and twenty-four groups will therefore be presented below.
Using PAM analysis to split the UBS4 data set into three groups produces the divisions in Table 25. Being the most central witness makes the medoid a useful representative of its group. In addition, the siglum of the medoid serves as a label.
There are reasons why it is better to use a medoid siglum rather than a conventional name such as "Alexandrian," "Byzantine," "Caesarean," "Western," "Family 1," or "Family 13" to label a group. For one thing, partitioning a data set into a large number of groups tends to split any structure for which a broad categorical label such as "Alexandrian" might be apt. For another, the most central witness of a textual family is often not the one that the family is named after. Sometimes, however, corresponding groups in different partitions do not have the same medoids. For example, the group with medoid 044 in the three-way partition of the UBS4 data set has the same core members as the one with medoid 03 in a six-way partition of the same data set. A group's medoid can change if even a single case is added or removed because another case can then become the most central one. Consequently, while the medoid does serve as a convenient and appropriate label for a group, it is not a reliable guide to identifying corresponding groups in different partitions of the same data set. A better approach for this purpose is to look for common constituents. If the medoid of a textual complex does change from one partition to the next, then the sequence of medoids that complex has for different numbers of groups can be chained together to form a label.
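Looking for common constituents and chaining medoid sigla can be sketched as below. The helper names are hypothetical, the matching rule (largest number of shared members) is one plausible reading of "common constituents", and the memberships in the example are invented placeholders rather than the contents of the article's tables.

```python
def match_groups(partition_a, partition_b):
    """Pair each group in partition_a with the group in partition_b
    that shares the most members.

    Each partition maps a medoid siglum to the set of member sigla.
    """
    matches = {}
    for medoid_a, members_a in partition_a.items():
        matches[medoid_a] = max(
            partition_b, key=lambda m: len(members_a & partition_b[m]))
    return matches

def chained_label(medoids):
    """Chain the distinct medoids of a complex, in order, into a label."""
    distinct = []
    for medoid in medoids:
        if medoid not in distinct:
            distinct.append(medoid)
    return "Gr " + "/".join(distinct)

# Invented partitions of the same toy data set into two and four groups.
two_way = {"m1": {"m1", "a", "b", "c", "d"}, "m2": {"m2", "e", "f"}}
four_way = {"a": {"a", "b", "m1"}, "c": {"c", "d"},
            "m2": {"m2", "e"}, "f": {"f"}}

matches = match_groups(two_way, four_way)
for medoid, successor in matches.items():
    print(chained_label([medoid, successor]))
```

With the article's own example, `chained_label(["044", "03"])` gives the label `Gr 044/03`.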
From this point forwards, the medoid siglum will be used to label its group. If groups that have the same medoid but are from different partitions need to be distinguished, then the number of groups in the relevant partition will be added to the label in parentheses. If the medoid of a group changes for different partitions of the same data set, then the sequence of medoids will be chained together to form a label. For example, Gr 044 refers to a group whose medoid is 044, Gr Byz (3) to the group with medoid Byz in a three-way partition, Gr Byz (6) to the one with medoid Byz in a six-way partition, and Gr 044/03 to the group whose medoid changes from 044 to 03 in different partitions of the same data set. (Frederik Wisse [1982] introduced group labels comprised of a "Gr" prefix and manuscript siglum.)

The groups that emerge from a three-way partition of the UBS4 data set are in some respects similar to traditional categories: Gr 044 contains a number of "Alexandrian" witnesses, Gr Byz is mainly comprised of "Byzantine" ones, and Gr it-i includes a number of "Western" texts. However, the groups also contain witnesses that are not normally associated with the conventional categories. Some of the witnesses that seem out of place are indeed out of place. In a situation analogous to hammering square pegs into round holes, they do not fit their assigned places. When deciding how to partition a data set, numbers of groups with larger values of the mean silhouette width are preferable. Although the average value of the silhouette widths may be relatively large, individual silhouette widths might be small. Indeed, a case can have a negative value for the statistic, indicating a particularly poor fit to its assigned group.

Dividing the witnesses into six groups gives poorly fitting witnesses the freedom to migrate into new groups where they are more at home. Other witnesses stay in the remnants of groups from the three-way partition, as can be seen in Table 26.
In this partition, Gr 03, Gr Byz, and Gr it-ff-2 are reminiscent of traditional "Alexandrian," "Byzantine," and "Western" categories. Gr vg is centred on the entity that UBS4 uses to represent the Latin Vulgate. While one might expect Augustine's quotations and a number of Latin manuscripts to be here, it is surprising to find 038, the Ethiopic (eth), and the Palestinian Syriac (syr-pal) included as well. Two of these, 038 and the Ethiopic, have negative silhouette widths to indicate that they are not a good fit.
Nevertheless, this partition suggests that 038, the Ethiopic, and the Palestinian Syriac are primarily translations of the Vulgate! According to Metzger (1977, 82), "the text of the Palestinian Syriac version agrees with no one type of text, but embodies elements from quite disparate families and texts." Rochus Zuurmond (1995, 146) writes, "Whatever the vicissitudes of the Eth may have been, and granted that influences from non-Greek sources may have played their role already at an early stage, the Eth is an immediate descendant of the Greek textual tradition." As for 038, also known as Θ or Codex Koridethi, a glance at the corresponding CMDS map shows that it does seem drawn towards the region of textual space associated with the Latin Vulgate. Using the UBS4 distance matrix to rank witnesses by distance from 038 confirms that a number of its closest neighbours are of the Latin Vulgate kind (e.g. Augustine, vg, it-l), as seen in Table 27.
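Ranking witnesses by distance from a chosen reference is a simple operation on the distance matrix. The sketch below is not the article's R script; the function name, the nested-dictionary layout, the sigla, the distances, and the critical limits are all assumptions chosen for illustration. A distance is flagged when it falls below the lower critical limit or above the upper one, which corresponds to a distance that is unlikely to occur by chance.

```python
def rank_by_distance(dist, reference, lower=None, upper=None):
    """Rank witnesses by distance from `reference`.

    dist maps witness -> {witness: distance}. Returns a list of
    (witness, distance, flagged) tuples sorted by increasing distance,
    where flagged is True if the distance lies outside the critical limits.
    """
    rows = []
    for witness, d in dist[reference].items():
        if witness == reference:
            continue
        flagged = ((lower is not None and d < lower)
                   or (upper is not None and d > upper))
        rows.append((witness, d, flagged))
    return sorted(rows, key=lambda row: row[1])

# Invented distances and limits for three placeholder witnesses.
dist = {
    "X": {"X": 0.0, "Y": 0.31, "Z": 0.62, "W": 0.48},
    "Y": {"X": 0.31, "Y": 0.0, "Z": 0.55, "W": 0.44},
    "Z": {"X": 0.62, "Y": 0.55, "Z": 0.0, "W": 0.50},
    "W": {"X": 0.48, "Y": 0.44, "Z": 0.50, "W": 0.0},
}
for witness, d, flagged in rank_by_distance(dist, "X", lower=0.35, upper=0.60):
    print(witness, d, "*" if flagged else "")
```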
Returning to the six-way partition, all eight members of Gr arm and Gr 205 fall into a category which Streeter (1924, 27) regarded as an "Eastern type," having subvarieties associated with the provincial capitals of Syria and Palestine. The witnesses that Streeter regards as primary, secondary, tertiary and supplementary members of each sub-variety are listed in his chart of MSS and local texts (1924, 108), summarized in Table 28.
Larry Hurtado (1981) challenges the view that 032, also known as Codex W or Washingtonensis, has a "Caesarean" text: "If Codex Θ is a good representative of the 'Caesarean text,' the poor and unexceptional agreement of Codex W with Θ makes it highly unlikely that W is related in any special way to this text-type" (83). The textual nature of 032 is thought to change part of the way through the Gospel of Mark, which change Hurtado (1981, 19) locates in the vicinity of Mark 5.6. Streeter (1924, 69) regarded the latter part of 032 as "Caesarean." Ranking witnesses by distance from 032 in Mark chapters 6-16 helps to reveal the manuscript's character in this block (see Table 29). (This ranked list was produced using the R script named groups-scripts-rank.R. It uses a distance matrix constructed from a data matrix that only includes Mark chapters 6-16. Any distance marked by an asterisk is unlikely to occur for pairs of cases whose states have been randomly selected from those available.) Hurtado is right to say that agreement between 032 and 038 is poor. In fact, the texts of 032 and 038 have an opposite relationship, being further apart than would be expected to happen by chance. Even so, all of the seven closest witnesses to 032 belong to Streeter's "Eastern" branch, although none is adjacent to 032 in the sense of being closer than expected by chance. However, as mentioned before, lack of statistical significance for a distance does not imply lack of relationship between two witnesses. Instead, an adjacent or opposite relationship may exist but it is not possible to say so confidently without analysis of a more comprehensive data set. In the context of the UBS4 data set for Mark, 032 does not have any close neighbours and might therefore be described as an eccentric text. Nevertheless, it remains true that the closest witnesses to 032 in this data set are members of Streeter's "Eastern" category.
How are we to explain that 032 is closest to texts of Streeter's "Eastern" variety yet is unlike 038? The six-way partition is not inconsistent with Streeter's identification of a distinct textual variety that includes 032, Family 1, 28, the Sinaitic Syriac, Armenian, and Georgian. At the same time, the relevant CMDS map shows that 038 and 565 lie on a trajectory between the Armenian and Georgian versions at one end and "Western" witnesses at the other. Ironically, it seems that the two manuscripts Streeter regarded as primary authorities for the "Caesarean" subvariety of his "Eastern" branch are mixtures of "Eastern" and "Western" readings in the Gospel of Mark. Others have already noticed the "Western" leanings of 038 and 565. Hurtado (1981, 88) writes, "The quantity of Western readings in Θ and its allies (565, 700) is so great that the present writer would suggest that perhaps the text represented by these MSS is a form of the Western text as it was shaped in the East." Stephen C. Carlson (2004, 20-21) writes, "The practice of anchoring the 'Caesarean' label on the branch containing Θ and 565 now appears unwise, since Θ and 565 come from a family that originated as a late mixture of Branch gamma (to which Origen's text belongs) and a Western text substantially similar to D" (Carlson includes P45, W, Families 1 and 13, 28, Codex Bobbiensis (i.e. Old Latin k), and Origen's text in his "Branch gamma").
The relevant medoids of the six-way partition, namely the Armenian version and minuscule 205, are better representatives of the textual complex that Streeter called the "Eastern type," and 032 is closer to these than the other medoids. Whether Gr arm and Gr 205 should be associated with Antioch and Caesarea remains an open question.

The next preferable number of groups is eleven, represented in Table 30. When a data set is split into two different numbers of groups, first a smaller then a larger number, it sometimes happens that groups present in the first partition remain substantially unchanged in the second. Such groups are more coherent than others, tending not to fragment. Less coherent groups lose members or split into pieces. Cases that have migrated out of one group might combine with others to form a new group in the second partition while other cases form singletons. Examples of these phenomena are seen by comparing the six- and eleven-way partitions. Gr Byz, Gr it-ff-2, Gr 205, and Gr vg are coherent, remaining much the same when the data set is split into a larger number of groups, although they do lose some of their constituents. Gr 03 (6) and Gr arm (6) (i.e. Groups 03 and arm of the six-way partition) fragment in the eleven-way partition. Gr 03 (6) splits into three parts: a smaller Gr 03 (11) with the same core as its parent; Gr 037, which picks up 04 as well; and Gr cop-bo, which gains 892. Gr arm (6) loses 032 and 565, leaving behind syr-s, arm, and geo. More eccentric witnesses such as 032 and it-k are the first to form singletons. Gr 038, comprised of 038, 565, and syr-pal, forms from cases that have migrated out of other groups. There is a good deal of overlap between Streeter's "Eastern" category and Gr 038, Gr 205, and Gr arm of the eleven-way partition.
Using PAM analysis to divide the UBS4 data set into the next preferable number of twenty-four groups produces the result shown in Table 31. This partition has a claim to being the most "natural" one because the corresponding MSW plot has a large magnitude for this number of groups despite the general tendency for MSW values to decrease as the number of groups increases. If this is the best partition then it is reasonable to describe the UBS4 data set as comprised of many small groups and singletons.

The process of division into ever-larger numbers of groups could be continued although not much would be gained by doing so. Examining partitions with smaller numbers of groups has already revealed the main contours of the data set's group structure.

Variants in Mark (INTF)
The MSW plot for the INTF data set, seen in Figure 22, displays a series of peaks, each indicating a preferable number of groups for partitioning. The plot for the corresponding control, shown in Figure 23, indicates the noise level for each number of groups. Comparison shows that each peak in the primary example's MSW plot is worth considering except the one associated with 151 groups. As in the UBS4 data set, division into too many groups produces fairly uninteresting partitions comprised mainly of small groups or singletons. Much of the group structure is revealed by partitions based on the first four peaks in the MSW plot, which occur at two, four, seven, and seventeen groups. A two-way partition produces Gr A and Gr 1339, represented here in Table 32.

The medoid of the first group is the synthetic Ausgangstext (initial text) that is printed in editions such as the ECM, Nestle-Aland's Novum Testamentum Graece, and the UBS Greek New Testament. It is joined by a number of manuscripts that might be described as some flavour of "Alexandrian," although not many would place 05 (Codex Bezae) in this traditional category. As it happens, 05 has a negative silhouette width for this partition indicating a poor fit. Gr 1339 is broad, containing many witnesses often characterized as "Byzantine" along with others that are not.
In a four-way partition, shown in Table 33, Gr A remains static while Gr 1339 spawns Gr 209 and Gr 826. These last two largely correspond to Families 1 and 13, respectively. It is interesting to see that Gr 209 contains 032 and that Gr 826 includes 038 and 565. Codex Bezae (05) remains in Gr A but again has a negative silhouette width to indicate a poor fit.
In a seven-way partition, shown in Table 34, Gr A, Gr 209, and Gr 826 recur almost unchanged. Codex Bezae (05) is still located in Gr A, and still has a negative silhouette width to indicate a poor fit. Codex 032 is again included in Gr 209 (i.e. Family 1). Gr 1339 retains a coherent core and continues to produce new groups, namely Gr 041, Gr 517, and Gr 1528.

[…] a poor fit for 1071 and 1273. Looking at the corresponding CMDS map shows that the singletons for this partition, namely 05, 032, 28, and 792, are somewhat isolated in textual space. Many of the poorly classified manuscripts would move into other groups or form singletons if the data set were divided into yet more pieces. A number of these complexes are already known, as shown in Table 36.

Finally, dividing into a really large number of parts reveals group cores (see Table 37). Singletons are omitted to leave only those sets with more than one member. It is interesting to see that seven of the ten core members of Gr 1339 are from the Saba monastery near Jerusalem. In this group at least, there is a strong correlation between locality and text. (Strutwolf and Wachtel [2011, 5*] suggest that the Saba manuscripts [Gregory-Aland numbers 1328-1348] "may help to answer whether common location results in textual similarity.")

Slices of a Data Set
Analysis has so far focused on entire data sets, excluding only as many cases (i.e. witnesses) as necessary to maintain a tolerable level of sampling error. Sometimes, however, it is worthwhile to narrow the scope of analysis to a subset or slice of an original data set. Such a slice might consist of a subset of cases, variables, or both.
In terms of a data matrix based on New Testament textual information, a case-wise

Fragmentary Witnesses
In order to reduce sampling error to a tolerable level, the analytical procedures used in this article drop a witness if including it would cause any distance to be calculated from fewer than fifteen places of comparison. Unfortunately, this policy causes certain witnesses whose readings are not defined at every variation unit to be excluded from consideration. In the present context of New Testament data sets, a number of circumstances can cause the reading of a witness to be undefined at a variation site:
• a manuscript may be illegible at the relevant place
• the text preferred by a Church Father may not be discernible due to absence or ambivalence of relevant patristic evidence at the site, or
• the Greek reading supported by a version may not be discernible because a back-translation of the relevant passage is consistent with more than one of the Greek alternatives.
(The distance matrices were calculated by the R script named groups-scripts-dist.R. The witness to be retained is called the reference, in this case P45. Rather than select only those variation units where the reference is defined, the script detects the two least well-defined witnesses on every iteration. It normally drops the least well-defined but will retain it if it is the reference witness, dropping the other instead. If the witness to be retained is not defined for the minimum number of variation units then the procedure is abandoned at the outset.)
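The vetting procedure with a retained reference witness can be sketched as follows. This is an interpretation of the description above, not a port of groups-scripts-dist.R: the function names, the use of `None` for an undefined reading, and the stopping rule (stop once every remaining pair shares at least the minimum number of defined places) are assumptions.

```python
from itertools import combinations

def shared_sites(a, b):
    """Number of variation units where both witnesses have a defined reading."""
    return sum(1 for x, y in zip(a, b) if x is not None and y is not None)

def vet(data, reference=None, min_sites=15):
    """Drop poorly defined witnesses until every pair is comparable.

    data maps witness -> list of readings (None where undefined).
    If `reference` is given it is never dropped.
    """
    kept = dict(data)

    def defined(witness):
        return sum(reading is not None for reading in kept[witness])

    # Abandon at the outset if the reference itself is too fragmentary.
    if reference is not None and defined(reference) < min_sites:
        raise ValueError("reference is defined at too few variation units")

    def too_sparse():
        return any(shared_sites(kept[a], kept[b]) < min_sites
                   for a, b in combinations(kept, 2))

    while len(kept) > 1 and too_sparse():
        # Find the two least well-defined witnesses; drop the worst
        # unless it is the reference, in which case drop the other.
        worst, second = sorted(kept, key=defined)[:2]
        del kept[second if worst == reference else worst]
    return kept

# Invented data: 20 variation units, two complete and two partial witnesses.
full = ["a"] * 20
data = {
    "w1": full,
    "w2": full,
    "ref": ["a"] * 15 + [None] * 5,
    "other": [None] * 4 + ["a"] * 16,
}
print(sorted(vet(data)))                   # reference not protected
print(sorted(vet(data, reference="ref")))  # reference retained
```

Protecting the reference changes which witness is sacrificed, so the resulting slice is tailored to the witness under study.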
One such witness is P45, a third-century papyrus manuscript, which, due to its fragmentary state, has not yet appeared in the analysis results presented in this article.

Ranking witnesses by distance from P45 using the INTF distance matrix produces the ordered list in Table 42. All of these distances are larger than the lower critical limit of the confidence interval of distances that can be attributed to chance. That is, P45 is eccentric by the standard of manuscripts included in the INTF data set, being a long way from any of them. Its nearest neighbours are the Ausgangstext (A) followed by 04 from Gr A, 032 from Gr 209, then an array of manuscripts from Gr 1339 of the relevant four-way partition. In the context of the INTF data set, it seems reasonable to describe P45 as an isolated text which is roughly the same distance from Gr A, Gr 209, and Gr 1339.

Origen's text is another witness that falls victim to the vetting process, and it too can be included by restricting analysis to an appropriate slice of the data set, seen here in Table 43. The MSW plot indicates that a four-way partition is preferable. When the data set is divided into four, shown in Table 44, the group occupied by Origen's text includes 038, 28, 565, the Sinaitic Syriac, Armenian, and Georgian. Streeter, aware of the tendency for Origen's citations of the Gospel of Mark to agree with these witnesses, concluded that this kind of text was already established in Caesarea when Origen moved there in 231 and thus provides a "fixed point for the history of the text of the New Testament" (1924,.

Block Mixture
Sometimes the textual affiliation of a witness changes from section to section. One way such block mixture might have occurred is through partial correction of one text to another. A corrector might have begun to "improve" his or her copy by reference to a second text then have lost interest or run out of time before the work was completed. If the corrected text was later copied, then that copy and its descendants would contain as many of the second text's readings as had been transferred by the corrector then retained by the copyist. Another way block mixture might have occurred is through replacement of damaged leaves by ones copied from a different variety of text.
Shifts in the textual character of a witness can be studied by first dividing a data matrix into consecutive blocks with approximately equal numbers of variation units, producing a distance matrix for each block, then performing multivariate analysis on each block's distance matrix. There is a trade-off concerning how many blocks to use. On one hand, shifts that occur in a small section of text have a better chance of being detected if the data set is divided into correspondingly small blocks. On the other hand, sampling error increases as the number of blocks increases because of the decreasing number of variation units in each block. Consequently, increasing the number of blocks also increases the number of fragmentary witnesses that must be excluded to maintain the integrity of analysis results. Every distance in a distance matrix based on incomplete data is a mere estimate of the actual distance that would be obtained if every place of variation between two witnesses were compared. As it happens, the sampling error in the number of disagreements is roughly equal to the square root of the number of places sampled. Thus, if two witnesses are compared at one hundred places and disagree at fifty then the estimated simple matching distance is 50/100 (0.5) and the associated sampling error in the count of disagreements is approximately equal to the square root of one hundred, which is ten. Consequently, in this example it would not be unlikely for the actual distance to be anywhere between (50-10)/100 (i.e. 0.4) and (50 + 10)/100 (i.e. 0.6).
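The block division and the rough error estimate described above can be sketched as follows. The function names are hypothetical and the figure of 140 variation units is only an illustrative total. The error term uses the article's square-root rule of thumb; for a proportion near one half, the square root of the number of places corresponds to about two binomial standard deviations of the count, which is why the resulting range behaves like an approximate confidence interval.

```python
import math

def split_into_blocks(n_units, n_blocks):
    """(start, stop) index pairs for consecutive blocks of near-equal size."""
    base, extra = divmod(n_units, n_blocks)
    bounds, start = [], 0
    for block in range(n_blocks):
        stop = start + base + (1 if block < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def distance_with_error(a, b):
    """Simple matching distance and its rough sampling error.

    Only places where both readings are defined (not None) are compared.
    The error is sqrt(n) / n, following the rule of thumb in the text.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        raise ValueError("no places of comparison")
    disagreements = sum(x != y for x, y in pairs)
    return disagreements / n, math.sqrt(n) / n

# The worked example from the text: 100 places, 50 disagreements.
a = ["x"] * 100
b = ["x"] * 50 + ["y"] * 50
d, err = distance_with_error(a, b)
print(d - err, d + err)

# Quartering an illustrative data set of 140 units doubles the relative error.
for start, stop in split_into_blocks(140, 4):
    print(start, stop, math.sqrt(stop - start) / (stop - start))
```

Because the relative error is one over the square root of the number of places, quartering the data set doubles it.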
In the CMDS maps, sampling error causes plotted witness locations to be randomly displaced from where they would be given a larger sample. In the DC dendrograms, sampling error can cause witnesses to jump between branches, especially if those witnesses have mixed texts that stand between the textual varieties associated with the branches. One must search for genuine textual shifts against this noisy background, and this dividing of data sets to search for block mixture only makes matters worse.
Dividing data sets into blocks also creates difficulties for PAM analysis. MSW plots are unlikely to suggest the same number of groups for every block. However, it is desirable to use a single number of groups when comparing group membership across blocks. The chosen number of groups should not be so small that actual shifts will be missed or so large that spurious shifts will occur through sampling error. Despite these difficulties, useful analysis results can be obtained provided that each block retains a sufficient number of variation units. Dividing the UBS4 data matrix into four consecutive blocks of approximately equal size results in each one having about thirty-five variables. This is enough to avoid too many witnesses being dropped due to the constraint on the minimum number of variables required to calculate a distance, which in this study is set to fifteen. The sampling error in distances between witnesses for each block is roughly twice that of distances calculated using the undivided data set covering all of Mark. The factor of two occurs because the relative sampling error is inversely proportional to the square root of the number of variation units, and each block has about one quarter as many units as the whole data set. The results shown in […]

Comparing the CMDS maps shows that the overall structure of the plots is similar across blocks. However, some texts exhibit substantial shifts in their relative locations. To name a few, Family 1 (f-1), 038, 565, Old Latin Codex Corbeiensis II (it-ff-2), the Ethiopic (eth), and Sahidic Coptic (cop-sa) do not seem to have stable locations across blocks. Each of these shifts is consistent with partial correction of an ancestral text to another standard. For example, the Sahidic is near "Eastern" texts in the first block but near "Alexandrian" ones in the other three blocks. Perhaps the initial text of this Coptic version had an "Alexandrian" flavour throughout but was then partially revised so that it became more "Eastern" in the first few chapters of Mark?

The DC dendrograms also exhibit a similarity of analysis results across blocks with the exception of a few texts that tend to shift from branch to branch. The exceptions are generally the same texts that undergo substantial shifts in the CMDS maps. Codex 032, which is thought to be more "Western" in the initial chapters of Mark, occupies the "Western" branch of the DC dendrogram for the first block.
Partitioning into a single number of groups is desirable for comparison across blocks using PAM analysis. However, the MSW plots for these blocks do not agree on a single preferable number. To choose what seems a reasonable compromise, partitioning into five groups returns a fairly high MSW value for each block while providing enough slots to allow some differentiation but not so many that like texts are liable to fall into different slots through sampling error.
Tables 49, 50, 51, and 52 give five-way partitions of the four blocks. The medoids of some groups change from block to block. Nevertheless, corresponding groups in different blocks can be identified because certain witnesses recur within them. In what follows, groups will be labelled by chaining together the sigla of the associated medoids, using a final ellipsis in cases that have three or more medoids across corresponding groups (e.g. Gr geo/205/…). A few texts such as 04, 044, the Sinaitic Syriac (syr-s), Old Latin k (it-k), and Bohairic Coptic (cop-bo) are not present in every division due to the vetting procedure exercised to reduce sampling error. Nothing can be said about their character in divisions that omit them. (Nevertheless, the script that constructs a distance matrix can be instructed to retain a specific case so it is often possible to obtain analysis results for omitted witnesses if desired.) It is prudent to also treat poorly classified witnesses (i.e. those with a negative silhouette width) as absent. Excluding these from consideration, a comparison of corresponding groups across blocks helps to identify witnesses that exhibit textual shifts, seen here in Table 53.
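The comparison of corresponding groups across blocks can be sketched as below. The function name and data layout are assumptions, and the sketch presumes that group labels have already been harmonized across blocks (for example by chaining medoid sigla). Witnesses with a negative silhouette width in a block are treated as absent from that block, as recommended above. The sigla, labels, and widths in the example are invented.

```python
def shifting_witnesses(block_groups, block_widths=None):
    """Find witnesses whose group changes from block to block.

    block_groups: one dict per block mapping witness -> harmonized group label.
    block_widths: optional, one dict per block mapping witness -> silhouette
    width; a witness with a negative width is skipped for that block.
    Returns witness -> list of group labels in block order, for witnesses
    that appear in more than one distinct group.
    """
    seen = {}
    for index, groups in enumerate(block_groups):
        for witness, group in groups.items():
            if (block_widths is not None
                    and block_widths[index].get(witness, 0.0) < 0):
                continue  # poorly classified: treat as absent
            seen.setdefault(witness, []).append(group)
    return {w: labels for w, labels in seen.items() if len(set(labels)) > 1}

# Invented two-block example with placeholder witnesses and group labels.
blocks = [
    {"w1": "G1", "w2": "G2", "w3": "G1"},
    {"w1": "G1", "w2": "G1", "w3": "G2"},
]
widths = [
    {"w1": 0.4, "w2": 0.3, "w3": 0.2},
    {"w1": 0.5, "w2": 0.2, "w3": -0.1},
]
print(shifting_witnesses(blocks))
print(shifting_witnesses(blocks, widths))
```

A witness flagged this way is only a candidate for block mixture; one that sits midway between two groups can change group through sampling error alone.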
Some of these shifts may be due to witnesses being located midway between adjacent groups. When a witness is in such a position, slight perturbations of the distance matrix may cause a jump from one group to another. Inspection of the CMDS maps for the four blocks shows that 04, 892, and 1342 are approximately equidistant from the centres between which they shift. That leaves 037, 038, f-1, 28, 205, 565, Latin codices Monacensis (it-q) and Corbeiensis II (it-ff-2), the Sahidic Coptic, and Ethiopic as texts which appear to exhibit block mixture.
Certain texts adhere to one standard in the first and last blocks but another in between. For example, the entity that represents Family 1 in the UBS apparatus (i.e. f-1) starts and ends as Gr Byz but belongs to Gr geo/205/… in the middle; 038 starts and ends as Gr geo/205/… but changes allegiance to Gr it-d/it-ff-2 or Gr vg in the central parts of Mark. Such witnesses may have resulted from partial efforts to make one text conform to another, cosmetic renovations that affected the outer leaves of a book but left the interior untouched.
PAM analysis does not identify 032 (i.e. Codex W) as a shifting text but instead classifies it as a member of Gr geo/205/… in all four blocks. This is surprising because the first chapters of Mark in 032 have long been considered to preserve a "Western" text. Henry A. Sanders (1912, 73) thought the initial part of Mark (up to 5.31) was the Greek equivalent of the Old Latin version, agreeing with Codex Palatinus (it-e) in particular, and noticed an increasing number of agreements with "Syriacising" manuscripts such as Family 1, Family 13, 28, and 565 in the remainder (i.e. after 5.31). Streeter (1924, 598-600) compared the part of 032 following Mark 5.31 with members of his "Caesarean" text (i.e. 038; Families 1 and 13; minuscules 28, 565, and 700) and concluded that it is "a member of the Θ family, the text of which has suffered, but not too greatly, from Byzantine revision." (Streeter also gives an imaginative account of how 032 might have acquired the mixture of texts it preserves.) The DC dendrogram for Mark 1.1-4.24 places 032 in a branch that is safely described as "Western." Table 54, which lists the ten nearest neighbours of 032 in each block, also reveals the "Western" tendency of 032 in the first few chapters of Mark. Codices Palatinus (it-e), Veronensis (it-b), and Bezae (05) are among the nearest neighbours of 032 in the first block. In the remainder of Mark, the nearest neighbours of 032 are predominantly members of Gr geo/205/…. This group, which corresponds to Streeter's "Eastern" text, includes the Armenian, Georgian, and Sinaitic Syriac. It is possible that these versions share the blame for the distinctive text which Greek members of this group, including 032, bear.
Streeter's statement about Byzantine revision needs to be qualified. While there is Byzantine influence in the second block, as attested by the proximity of 02, 011, and 1243, this component disappears in the third and fourth blocks where members of Gr 03/2427 appear instead. None of the listed texts is unexpectedly close to 032, as indicated by lack of asterisks attached to distances. The text of 032 in Mark seems to be isolated yet cosmopolitan, with varying "Eastern," "Western," "Byzantine," and "Alexandrian" components in consecutive blocks.

Medoids as Representatives
There is such a great cloud of New Testament witnesses that it is often a practical necessity to restrict those retained in a summary of the evidence to the few that sufficiently represent the many that have to be omitted. The medoids identified by PAM analysis constitute suitable representatives, although an alternative witness must be selected to represent a group when its medoid is absent at some variation sites or is affected by block mixture. The selection procedure begins with a data set that includes as many witnesses as practicable. PAM analysis is performed for every possible number of groups to plot mean silhouette width versus the number of groups. Next, the plot is used to identify a suitable number of groups. PAM analysis is then applied again to partition the data set into the chosen number of groups, identifying the medoids in the process. Just how many groups are selected depends on the purpose. If the aim is to produce a compact apparatus, then a smaller number would be chosen; a larger number would be appropriate when presenting a comprehensive survey of, say, the Byzantine textual complex. To give an example, the medoids of the eleven-way partition of the UBS4 data set presented above would serve as a useful starting point if seeking representative texts for a compact summary of the textual situation in Mark. As another example, medoids found by a many-way partition of the INTF data set for Mark would be suitable candidates for witnesses that represent the entire spectrum of extant Greek textual varieties of that Gospel.
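The selection procedure can be sketched as follows. This is a minimal illustration under stated assumptions rather than the analysis used in this article. The six witnesses and their readings are hypothetical. Because the toy data set is tiny, the medoids are found by exhaustive search over all candidate sets instead of by the build and swap steps of a full PAM implementation; both aim to minimise the same total distance. The sketch also simply takes the number of groups with the highest mean silhouette width, whereas in practice the choice among local maxima depends on the purpose.

```python
from itertools import combinations

# Hypothetical toy data: encoded readings of six witnesses at six variation
# sites (None would mark a site where a witness is not extant).
WITNESSES = {
    "A1": [1, 1, 1, 1, 1, 1],
    "A2": [1, 1, 1, 1, 1, 2],
    "A3": [1, 1, 1, 1, 2, 1],
    "B1": [2, 2, 2, 2, 2, 2],
    "B2": [2, 2, 2, 2, 2, 1],
    "B3": [2, 2, 2, 2, 1, 2],
}


def simple_matching_distance(a, b):
    """Proportion of disagreements at sites where both witnesses are extant."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_table(witnesses):
    """Distance for every ordered pair of witness names."""
    return {(p, q): simple_matching_distance(witnesses[p], witnesses[q])
            for p in witnesses for q in witnesses}


def partition(names, dist, k):
    """Pick the k medoids that minimise the total distance of every witness
    to its nearest medoid (exhaustive search), then form the groups."""
    def cost(medoids):
        return sum(min(dist[n, m] for m in medoids) for n in names)

    medoids = min(combinations(names, k), key=cost)
    groups = {m: [] for m in medoids}
    for n in names:
        # The second key component keeps a medoid in its own group on ties.
        nearest = min(medoids, key=lambda m: (dist[n, m], m != n))
        groups[nearest].append(n)
    return medoids, groups


def mean_silhouette_width(groups, dist):
    """Average silhouette width over all witnesses."""
    widths = []
    for medoid, members in groups.items():
        for n in members:
            if len(members) == 1:
                widths.append(0.0)  # usual convention for a singleton group
                continue
            a = sum(dist[n, o] for o in members if o != n) / (len(members) - 1)
            b = min(sum(dist[n, o] for o in others) / len(others)
                    for m, others in groups.items() if m != medoid)
            widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


names = sorted(WITNESSES)
dist = distance_table(WITNESSES)

# Mean silhouette width for each candidate number of groups (2 .. n - 1).
msw = {}
for k in range(2, len(names)):
    _, groups = partition(names, dist, k)
    msw[k] = mean_silhouette_width(groups, dist)

best_k = max(msw, key=msw.get)
medoids, groups = partition(names, dist, best_k)
print(best_k, medoids, groups)
```

For this toy data the preferred number of groups is two, with A1 and B1 as the medoids. With real data a witness that is fragmentary or affected by block mixture would still need to be replaced by hand, as noted above.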

Multiple Correspondence Analysis
A multivariate analysis technique called correspondence analysis produces a biplot that simultaneously displays both the cases and variables of a data matrix. This allows variables that are useful for differentiating between cases to be identified, thus providing a basis for classification. If the variables consist of categorical data, as is the case for the New Testament textual data analysed in this article, then a technique called multiple correspondence analysis (MCA) should be used. (Concerning categorical data, the only meaningful basis for comparison is equivalence. For example, when comparing the encoded readings of two witnesses at a variation site, one can only say whether they are the same or different; it does not make sense to say that one reading is more or less than the other in this context. Even though New Testament textual data is treated as categorical in this article, the readings of a variation unit can have an inherent order. In fact, the INTF's Coherence-Based Genealogical Method relies on there being a discernible order of development for the readings of a variation unit.) The biplot in Figure 41 is produced by applying MCA to the first block of UBS4 data for Mark.
The plot is cluttered by inclusion of a label for every variant of every variation unit in the data set. Nevertheless, it does illustrate the potential of the technique to identify variants that are useful for classification purposes. The best variants to use are those whose vectors (indicated by red arrows in this diagram) have a large magnitude. Some vectors point directly at known groups, making the associated variants especially suitable for identifying members of that group. To illustrate, the correspondence of the witness label 03 (representing Codex Vaticanus) and the vector pointing to the variant labelled Mk.2.22.2.1 (representing the first variant listed at the second variation unit of Mark 2.22 in the UBS4 apparatus) indicates that this variant is a useful test for the kind of text found in this manuscript. If the analysed data contains so many variants that the biplot is impossible to use, then the coordinates of witnesses and variants can be printed out to identify variation sites useful for classifying texts. The location of an unknown witness with respect to known groups could be quickly established by comparing the variants it supports with those supported by group medoids at variation sites identified by this procedure.
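The final step, locating an unknown witness by comparing its readings with those of group medoids at selected variation sites, can be sketched as follows. This is only an illustration: the group names, site labels (s1 to s5), and readings are hypothetical, and the sites are simply assumed to have been identified as discriminating by a prior analysis such as MCA.

```python
# Hypothetical readings of three group medoids at five variation sites that
# are assumed to have been selected as useful for classification.
MEDOIDS = {
    "group A": {"s1": 1, "s2": 1, "s3": 2, "s4": 1, "s5": 3},
    "group B": {"s1": 2, "s2": 1, "s3": 1, "s4": 2, "s5": 1},
    "group C": {"s1": 3, "s2": 2, "s3": 1, "s4": 1, "s5": 2},
}
# Hypothetical unclassified witness; None marks a site where it is lacunose.
UNKNOWN = {"s1": 2, "s2": 1, "s3": 1, "s4": None, "s5": 3}
SITES = ["s1", "s2", "s3", "s4", "s5"]


def distance_at_sites(x, y, sites):
    """Simple matching distance restricted to sites where both are extant.

    Returns None when the two witnesses share no extant site.
    """
    shared = [s for s in sites if x.get(s) is not None and y.get(s) is not None]
    if not shared:
        return None
    return sum(x[s] != y[s] for s in shared) / len(shared)


def rank_groups(unknown, medoids, sites):
    """Return (distance, group) pairs sorted from nearest to farthest."""
    scored = [(distance_at_sites(unknown, readings, sites), group)
              for group, readings in medoids.items()]
    return sorted((d, g) for d, g in scored if d is not None)


ranking = rank_groups(UNKNOWN, MEDOIDS, SITES)
for distance, group in ranking:
    print(f"{group}: {distance:.2f}")
```

Here the unknown witness stands nearest to the medoid of group B. A witness roughly equidistant from two medoids would be a candidate mixed text rather than a clear member of either group.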

Conclusion
The quest to discover textual groups among New Testament witnesses has a long history. Although an interesting challenge in its own right, the quest is often motivated by the need for a compact, comprehensive, and comprehensible apparatus for a critical text. As Frederik Wisse (1982) says:
"Ideally, a critical apparatus gives all pertinent MS evidence necessary for the establishment of the best possible text, and nothing more. Since the number of MSS used in an apparatus must be kept within reasonable limits, it is clear that only a fraction of the total number of Greek MSS of the NT can be included. This could easily lead to arbitrariness - and it often has - unless somehow true representation could be assured. Selection is defensible only if the user of the apparatus can be convinced that the number of MSS presented spans and represents the whole tradition in text, date, and, insofar as this is known, provenance." (6)
It may be that the number of witnesses required to be presented cannot be significantly reduced because the tradition is so diverse that few witnesses are adequately represented by others. However, if presenting a less-than-comprehensive picture is acceptable, as, say, in a concise summary, then a great reduction in the number of witnesses that need to be presented is certainly achievable. To reduce the clutter and thereby make the presented information more comprehensible, it is necessary to identify witnesses to represent the various complexes that populate the textual landscape. To cover the entire landscape, it is also necessary to include mixed texts that stand between groups and eccentric ones that stand apart.
Prior attempts to find groups, identify representatives, and classify as yet unclassified witnesses of the New Testament often fall into one of two categories. The first is of "quantitative" methods, which count agreements with selected witnesses to discover where an unclassified witness lies in relation to them. A weakness of this approach is that the selected witnesses are typically chosen by an ad hoc method, often based on a survey of prior studies. If the text being classified does not belong to any of the groups represented by the selected witnesses, then it is likely to be misclassified. The second is of "profile" methods, which search for combinations of shared readings. These methods are inductive, relying on an initial phase where many texts are compared to identify variation sites that seem to be useful for discriminating between groups. This approach succeeds in identifying witnesses that belong to groups identified during the initial phase. However, if the initial phase does not cover all extant witnesses, then there is a chance that important variation sites for group classification will not be discovered.
Both approaches suffer from a crisis of definition whereby it is unclear how to establish group membership on anything but an arbitrary basis. In profile methods, one has to set a standard for the proportion of group readings that must be supported by a manuscript before it is regarded as a group member. Typically, a proportion (e.g. two-thirds) is simply stated but no analytical basis for the criterion is provided. Similarly, quantitative methods declare critical values that must be satisfied for a witness to be regarded as a group member without performing the statistical analysis necessary to decide which values are appropriate for the data set.
For example, Colwell and Tune (1969, 59) propose that "the quantitative definition of a text-type is a group of manuscripts that agree more than 70 per cent of the time and is separated by a gap of about 10 per cent from its neighbours." As shown above, what constitutes a statistically significant level of agreement varies from one data set to the next. Also, mixture among texts and the great number of surviving witnesses mean that one cannot expect to find large gaps in levels of agreement between members of neighbouring textual groups.
Others have noted practical weaknesses in Colwell's definition of a group. Richards (1977, 43) writes, "If one were to use 70 percent, let us say, as a minimum percentage for showing a text-type, then we would have to conclude that there is no such thing as a distinction between the Byzantine and Alexandrian text-types." Using the example of percentage agreements with Codex Sinaiticus, Richards (1977, 53) goes on to say, "As far as the 10 percent gap is concerned, there is no noticeable gap at all below the 70 percent line." According to Klaus Wachtel (2003, 39), Colwell's criteria for defining a text-type, including that group members should share exclusive group readings, are very unlikely to be met when the analysis is based on comprehensive evidence. Colwell and Tune's quantitative definition of a text-type may have been appropriate for the data they were using but it is not suitable for general application. Studies which have used this definition or a variation upon it (e.g. requiring less than 70% agreement) may have reached erroneous conclusions about grouping among witnesses, either missing statistically significant levels of agreement or asserting that relationships exist when the associated levels of agreement can be attributed to chance.
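The point that a fixed threshold such as 70 per cent agreement cannot serve every data set can be illustrated with a small simulation. The sketch below is a Monte Carlo illustration under stated assumptions and is not the procedure used in this article to derive critical limits. Each pseudo-witness draws its reading at every site at random from the pool of readings observed there, and the limits are taken as the 2.5 and 97.5 percentiles of the resulting distances. The two pools, the number of sites (40), the number of trials, and the seed are all hypothetical choices.

```python
import random


def critical_limits(pools, trials=2000, alpha=0.05, seed=0):
    """Estimate lower and upper critical limits of the simple matching
    distance between two randomly generated witnesses.

    `pools` holds, for each variation site, the list of readings observed
    there (with repeats, so common readings are drawn more often).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    distances = []
    for _ in range(trials):
        differing = sum(rng.choice(pool) != rng.choice(pool) for pool in pools)
        distances.append(differing / len(pools))
    distances.sort()
    lower = distances[round(trials * alpha / 2)]
    upper = distances[round(trials * (1 - alpha / 2)) - 1]
    return lower, upper


# Hypothetical data set 1: readings fairly evenly spread at each of 40 sites.
diverse_pools = [[1, 1, 1, 2, 2, 3]] * 40
# Hypothetical data set 2: one reading dominates at each of 40 sites.
uniform_pools = [[1, 1, 1, 1, 1, 2]] * 40

low_diverse, high_diverse = critical_limits(diverse_pools)
low_uniform, high_uniform = critical_limits(uniform_pools)

# A pair agreeing 70 per cent of the time has a distance of 0.30.
observed = 0.30
print("diverse:", low_diverse, high_diverse, observed < low_diverse)
print("uniform:", low_uniform, high_uniform, observed < low_uniform)
```

In the first data set a distance of 0.30 falls below the lower limit and so indicates an adjacent relationship; in the second the same distance lies between the limits and could easily arise by chance.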
The multivariate analysis methods used in this article provide robust ways to identify where a text lies in relationship to others and are useful for comprehending the broad outlines of the textual space constituted by the witnesses. Classical multidimensional scaling reveals groups of varying scope and density formed by extant texts. It locates member texts within, mixed texts between, and eccentric texts outside extant groups. Divisive clustering gives a complementary presentation of what Bengel called the companies, families, tribes, and nations formed by New Testament witnesses. Partitioning around medoids allows a data set to be divided into groups, and the mean silhouette width indicates which numbers of groups are the more natural. PAM analysis also identifies a medoid for each group, namely that witness standing nearest to the centre of its group. As such, the medoid is often a suitable representative of its group although an alternative may need to be found if, say, the medoid is fragmentary or is not a Greek manuscript. PAM analysis thus provides a way to identify preferable numbers of groups, partition witnesses into those numbers of groups, and identify a representative of each group.
These multivariate analysis techniques are rooted in sound statistical reasoning. If they give a vague result concerning some question then it is possible that the data set being analysed does not contain the information required to give a more definite answer. However, analysis of a more comprehensive data set may still leave the question unanswered. For example, no amount of further information will help decide which group a text belongs to if it has a mixture of readings taken from differing groups. All that can be said is where such a text lies in relation to those groups whose readings it contains.
In their "Introduction," Westcott and Hort (1881, 40) say "all trustworthy restoration of corrupted texts is founded on the study of their history, that is, of the relations of descent or affinity which connect the several documents." The multivariate analysis techniques used in this article do not seek to discover relations of descent among the textual states represented by the various witnesses. However, they do show how texts relate to one another with respect to affinity. A comparison of data sets based on variation sites among New Testament witnesses with control data sets that have no relationships among their cases shows that the New Testament textual tradition has a definite group structure that is consistent with there being a number of textual varieties. It is not easy to say how many varieties there are, however. The lack of a clearly preferable number is implied by lack of a clearly preferable peak in a typical plot of mean silhouette width versus the number of groups. Nevertheless, some numbers of groups are better than others; they reflect more "natural" divisions of a data set.
Partitioning the UBS4 data set for the Gospel of Mark into a small number of groups results in divisions that correspond fairly well to conventional "Byzantine," "Alexandrian," and "Western" types. There is also a distinct group corresponding to Streeter's "Eastern" type that counts the Old Syriac, Armenian, Georgian, P45, W, Family 1, minuscule 28, and Origen's quotations of Mark among its affiliates. Codex Koridethi (038) and minuscule 565, which Streeter regarded as primary authorities for his "Caesarean" text, actually seem to be mixtures of the "Western" and "Eastern" varieties. Another cluster corresponds to Jerome's Vulgate version of Mark's Gospel. It is located directly between a group of Old Latin texts and the "Byzantine" cluster, suggesting that the "early" Greek manuscripts Jerome used to revise the Latin text were of the "Byzantine" variety.
An interesting feature of the analysis results based on the UBS4 data set is the collocation of early versions of the New Testament with some of the major textual branches: Coptic versions are associated with the "Alexandrian" branch; Latin versions with the "Western;" and the Old Syriac, Armenian, and Georgian with the "Eastern." This might be construed as evidence that these early versions played a part in the textual tradition's divergence into the associated varieties.
The INTF data set is more comprehensive with respect to Greek manuscripts. Analysis of this data set reveals a number of the same groups found when the UBS4 data set is analysed, although there are notable differences as well. A number of the groups identified by PAM analysis of the INTF data set correspond to ones that have been noticed in prior studies. Core members are revealed by partitioning the data set into a large number of groups. In the group core that has minuscule 1339 as its medoid, seven out of ten manuscripts are from the same monastery. In this case at least, the textual character of a group correlates with the provenance of its members.