Studies in Semitic Vocalisation and Reading Traditions

EDITED BY AARON D. HORNKOHL AND GEOFFREY KHAN
This volume brings together papers relating to the pronunciation of Semitic languages and the representation of their pronunciation in written form. The papers focus on sources representative of a period that stretches from late antiquity until the Middle Ages. A large proportion of them concern reading traditions of Biblical Hebrew, especially the vocalisation notation systems used to represent them. Also discussed are orthography and the written representation of prosody.


AN EXPLORATORY TYPOLOGY OF NEAR
The present study is a codicological and linguistic classification of 296 Torah codices in the Genizah collections of Cambridge University Library that have nearly all of the characteristics of 'modelʼ codices 2 and that have standard and non-standard Tiberian vocalisation patterns. Such a study is warranted due to multiple gaps in modern scholarship on the codicology and vocalisation of the Hebrew Bible.
In previous scholarship in the field, attention has been focused on the most codicologically-sophisticated manuscripts. There has not been sufficient differentiation and study of Bibles that are sophisticated, but lack the full range of the features associated with exemplar manuscripts, such as Codex Leningradensis. 3 In previous scholarship, descriptions of 'model' codices generalised specific feature groupings that, in fact, appear to be distinct from each other, hiding important differentiation in manuscript features. For example, Yeivin states:
The majority of older texts and Geniza fragments are beautifully written and "complete" (that is, masoretic notes and vowel and accent signs were systematically added). They were written on parchment, with great care taken over the forms of the letters and over corrections, and they contain the Mm, Mp, and vowel and accent signs. They were written with two or three columns to a page. 4
In this article I introduce a new category of Torah codex: the 'near-model' codex, and I show how the different feature patterns in this type of codex fall into statistically-verifiable subtypes. Near-model codices have nearly all, but not the complete range, of the codicological and textual features that exemplar Tiberian Bibles have. Because none of these exemplar codices have fewer than three columns, I question Yeivin's grouping two-column manuscripts with the most complete, model Bibles, and I consider two-column codices with masoretic notes, vocalisation, and cantillation to be near-model. Moreover, there are many three-column manuscripts that fall just shy of the 'complete' criteria that Yeivin lists above. These I also consider near-model and show to be statistically distinct from their two-column peers.
1 I wish to thank Prof. Geoffrey Khan for his support and comments;
3 By exemplar, I mean specifically specimens such as Codex Leningradensis, the Aleppo Codex and the Cairo Codex of the Prophets.
4 Yeivin (1980, 11).
Within all of the Torah manuscripts that have Tiberian vocalisation there is a substantial group of manuscripts that use Tiberian vowels in non-standard ways. There have been some studies of this type of Tiberian vocalisation, which is referred to by a variety of terms, the most common being  In such studies, however, there has not been sufficient attention on the diversity of non-standard vocalisation patterns that exist in Genizah manuscripts. In this article I show that there were many non-standard Tiberian (hereafter, NST) 6 patterns, and I delineate an exploratory typology of these patterns in Genizah Torah manuscripts using statistical methods.
5 The best literature reviews of this subject are found in Fassberg (1991, 55); Saenz-Badillos (2008, 92-94); Blapp (2017, 8-32); Khan (2017, 265-266). This kind of vocalisation is generally characterised in scholarship by an 'extended' use of dagesh and rafe, the vowel interchanges of pataḥ/qameṣ and segol/ṣere, and the non-standard placement of shewa and ḥaṭef vowels.
6 Blapp (2017) was the first to introduce the term 'non-standard Tiberian' (or NST) outside of the Davis-Outhwaite catalogues. I follow Blapp here in using this term to delineate any pattern of deviation from the standard Tiberian (ST) of Codex Leningradensis that uses Tiberian vowel signs.
Another gap in scholarship on the Hebrew Bible that this study addresses is the lack of communication between codicological and textual studies on manuscripts. 7 In preliminary case studies of the corpus I observed that not only do there appear to be sub-types of NST, but that various codicological features present in near-model codices also appear to be arranged into definite subtypal patterns. Moreover, it seemed that NST subtypes tended to correlate with these codicological subtypes. The aim of this study is to map NST diversity onto near-model Torah codicology in order to demonstrate (statistically) that the correspondence is not completely random.

Terminology, Structure, and Hypotheses
The key descriptors of codices that I am using in this paper are as follows:
7 Yeivin (1980, 11-12)
patterns. This seems to suggest that NST was part and parcel of sophisticated Bible codex production in the main Genizah period (ninth-twelfth centuries CE).
parchment manuscripts without Masoretic notes still retained a high degree of careful execution. It seems, therefore, that greater column numbers can be associated with a higher level of codicological sophistication, but this is not the case with the lack of Masoretic notes. Lack of Masoretic notes is not a de-sophisticating factor for three-column Torahs. It is, however, a major de-sophisticating factor for two-column Torahs. 9
The present research is guided by two hypotheses that are tested through statistical, codicological, and linguistic analysis:

1. Near-model Torah parchment manuscripts with two or three columns in the Genizah have distinguishable patterns in their codicological features that indicate the presence of sub-groups in the manuscript corpus. Moreover, column number is a major factor in distinguishing these subgroups, because near-model manuscripts with two columns are codicologically distinct from near-model manuscripts with three columns.

2. There are statistically distinguishable patterns in the NST vocalisation of these manuscripts, indicating sub-groups of NST vocalisation. These patterns can be linguistically validated. Moreover, these patterns tend to correlate with the codicological patterns of hypothesis 1.
The findings can be summarised as follows: first, a tentative, yet statistically-sound, typology of near-model manuscripts can be established and subtypes within this typology can be identified. Second, NST is not a monolithic phenomenon, but contains significant subtypes. These subtypes reflect regional patterns of scribal activity comprising various streams of diversity in pronunciation traditions and in the application of Tiberian vowel signs to represent the pronunciation. Finally, subtypes of NST map onto codicological features in a broad sense. This indicates that there is a linkage between the codicology of a manuscript and the features of the written text that it contains.
9 There is not space here to analyse the large population of two-column parchment codices without Masoretic notes; they are addressed in my PhD thesis.

The Evidence Threshold
As a general rule, predictive statistical tests are considered significant if they have a probability value (p-value) of less than 0.1. This indicates that there is less than a 10 percent probability that the particular statistical relationship tested for happened by chance. However, p-values are not meant in this study to be used as a definitive marker of typology: a p-value which approaches significance, but which fails the full test, is still treated as meaningful and placed on a spectrum alongside the significant results. 10
10 The current attitude of researchers towards p-values is that they should be interpreted on a continuum indicating weakness or strength in the results, not treated as categorical, black-and-white measures of the subject being studied (Amrhein, Greenland, and McShane, 2019). This is the approach that I embrace in the present research.
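The graded reading of p-values described in this section can be sketched in Python. This is only an illustration: the 0.1 threshold comes from the text, while the function name `evidence_label` and the other band boundaries are assumptions introduced here and are not taken from the study.

```python
def evidence_label(p_value: float, threshold: float = 0.1) -> str:
    """Place a p-value on a graded scale rather than giving a pass/fail verdict."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value < threshold / 10:   # assumed band: far below the threshold
        return "strong"
    if p_value < threshold:        # passes the 0.1 threshold named in the text
        return "significant"
    if p_value < threshold * 2:    # assumed band: approaches significance
        return "approaching significance"
    return "weak"


# Hypothetical p-values, used only to show the four bands.
for p in (0.004, 0.06, 0.13, 0.45):
    print(f"p = {p:.3f}: {evidence_label(p)}")
```

A result labelled "approaching significance" would, on this reading, still be reported and weighed alongside the significant results rather than discarded.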

Sampling Strategy
The data in this study consist of fragments of two- or three-col-
In total, 296 two- and three-column fully dimensioned fragments meet the aforementioned conditions for the study. This is an estimated 98-99 percent of manuscripts with these codicological features in Cambridge (as always, it is possible that some manuscripts may have been overlooked, so I do not assume complete comprehensiveness). The research is therefore representative for the Genizah collections in Cambridge.

Palaeography
A cautious approach was taken regarding palaeographic assessment. Each of the manuscripts in the corpus which had NST vocalisation was assigned a general palaeographic identification, with a focus on determining the provenance rather than on pinpointing an exact date. The assessments involved establishing the palaeographic type of script on the basis of comparative samples and estimating a date spanning two centuries. 11 Below are the categories used as general palaeographic descriptors for region:
• 'Oriental': manuscripts with a 'Northeastern' or 'Southwestern' 12 Oriental script style.
• 'Palestinian-Byzantine': manuscripts with a script style that is characteristic of manuscripts produced in a region ranging from the Levant to Asia Minor.
• 'Italian-Byzantine': manuscripts with a script style that is characteristic of manuscripts produced in a region ranging from Italy to Asia Minor.
• 'Sephardi': manuscripts with a clear Sephardi style of script.
The regional labels I attach to specific scripts should be seen as approximations rather than fixed assessments. The mobility of scribes and the variability of script styles in the Genizah often make the exact pinpointing of regions and dates problematic. For purposes of this typology, the regional labels should be taken as wide estimations rather than exact diagnoses.
11 It is fully expected that further research may (and should) correct and clarify some of the palaeographic assertions made in this study. The palaeographic estimations were based on comparative sources and used the methods developed in the following scholarly resources: Birnbaum (1971); Beit-Arie, Engel and Yardeni (1987); David (1990); and Yardeni (2002). Judith Olszowy-Schlanger also assisted in the assessment of a number of the manuscripts and provided me with methodological insight and feedback.

Statistical Procedures
The statistical approach taken in this study was non-experimental and relied mainly (but not exclusively) on non-parametric statistical tests (meaning that no statistical prediction/probability was involved). Data were stored in an SQL database which I created especially for the research. In collecting linguistic data, only one page (single or conjoined) was read per manuscript in order to avoid assigning multiple-page manuscripts greater weight than single leaves (multiple pages of a manuscript generate more linguistic data and this could bias the statistics against single-leaf manuscripts).
The general descriptive statistics (basic distributions of features) are reported first. Then three kinds of clustering algorithms were run on the data (k-means, k-modes, and mean-shift clustering), because their different mechanisms elucidate different aspects of the data. The computer ran each algorithm up to ten times: the data were clustered and re-clustered by the computer until the numerical distance between each group was optimal. 13 Codicological and linguistic features were assessed separately. The results of the codicological clustering are given in section 4, and the results of the linguistic clustering are given in section 5. In the conclusion of the study, the results of the codicological and linguistic clusters are compared: the major finding is that manuscripts that cluster together in the codicology also tend to cluster together in the linguistic groups.
13 See section 4.2 for a more in-depth explanation of clustering algorithms and relevant literature.
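The idea of clustering and re-clustering over repeated runs, keeping the best grouping, can be sketched with a minimal k-modes routine for categorical features. This is a simplified illustration, not the software used in the study: the function names, the three example features, and the six records are all hypothetical, and the restart count of ten simply mirrors the "up to ten times" mentioned above.

```python
import random
from collections import Counter


def mismatches(a, b):
    """Count the positions at which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(rows, k, restarts=10, max_iter=100, seed=0):
    """Cluster categorical records; keep the restart with the lowest total mismatch."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        modes = rng.sample(rows, k)          # random initial modes
        labels = None
        for _ in range(max_iter):
            new_labels = [
                min(range(k), key=lambda c: mismatches(row, modes[c]))
                for row in rows
            ]
            if new_labels == labels:         # assignments stable: converged
                break
            labels = new_labels
            for c in range(k):
                members = [r for r, lab in zip(rows, labels) if lab == c]
                if members:                  # mode = most frequent value per feature
                    modes[c] = tuple(
                        Counter(col).most_common(1)[0][0] for col in zip(*members)
                    )
        cost = sum(mismatches(r, modes[lab]) for r, lab in zip(rows, labels))
        if best is None or cost < best[2]:
            best = (labels, list(modes), cost)
    return best


# Hypothetical records: (format, margins, script region).
rows = [
    ("portrait", "even", "Oriental"),
    ("portrait", "even", "Oriental"),
    ("portrait", "wide", "Oriental"),
    ("landscape", "narrow", "Italian-Byzantine"),
    ("landscape", "narrow", "Italian-Byzantine"),
    ("landscape", "wide", "Italian-Byzantine"),
]
labels, modes, cost = k_modes(rows, k=2)
print(labels, modes, cost)
```

Repeating the run from different starting points matters because a single run can settle into a poor grouping; keeping the lowest-cost result guards against that.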

Textual and Linguistic Analysis
The textual data of the manuscripts were compared with photo-

SIS: CODICOLOGY AND LINGUISTIC FEATURES
The following report on the feature distributions of codicology concerns all 296 manuscript fragments which are the subject of this study. The report on linguistic feature distributions concerns the 55 NST manuscript fragments which were found in the corpus of the whole 296.
14 National Library of Russia, I Firkovitch Evr. I B 19a.
15 Blapp (2017)
(45 manuscripts total had this feature = 29 percent) than the three-column group (26; 18.3 percent). As a group, the two-column manuscripts tended to have more variation in margin width than the three-column group, which was more homogeneous.

Illumination and Decoration
Extra-textual decoration was rare for both groups. Differential results: 

Dimensions
The distributions of leaf length and width differ for the two groups:
Length: The three-column group has a distribution that somewhat resembles a normal 20 distribution:
of the whole group. The interquartile range is triple that of the three-column group, meaning more manuscripts vary in their length from the average. The extremely low result of the Shapiro-Wilk test indicates that the data are far from normally distributed. These results indicate that there are smaller sub-groups of similarly-sized manuscripts within this heterogeneous data set.
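As a brief illustration of the interquartile range used here to compare spread, the following Python sketch computes it for two sets of leaf lengths. The measurements are hypothetical values invented for the example, not data from the corpus, and the helper name `interquartile_range` is an assumption.

```python
from statistics import quantiles


def interquartile_range(values):
    """Return Q3 - Q1, the spread of the middle half of the values."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical leaf lengths in cm (not measurements from the corpus).
homogeneous_group = [28, 29, 30, 30, 31, 31, 32, 33]
heterogeneous_group = [15, 18, 22, 26, 30, 34, 38, 42]

print(interquartile_range(homogeneous_group))
print(interquartile_range(heterogeneous_group))
```

A larger interquartile range means that the central half of the manuscripts is spread over a wider band of lengths, which is what one would expect if several differently-sized sub-groups are mixed together.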

Width:
The difference in distribution of widths between groups is noteworthy.
Three-column: Differential results: There are many more Italian-Byzantine NST manuscripts in the two-column group (9; 40.9 percent). The three-column group has significantly fewer Italian-Byzantine specimens (4; 12.1 percent). Oriental manuscripts (both Northeastern and Southwestern) predominate in the three-column group (29; 87.9 percent) and form a smaller majority in the two-column group (13; 59 percent). In the charts below, 'Egyptian-Palestinian' indicates scripts with a 'Northeastern' Oriental script style (which had spread to the Levant and to Egypt: see footnote 12).

Discussion of Descriptive Codicological Statistics
The descriptive statistical findings indicate three levels of codicological feature distribution, viz. common, less common, and infrequent features (but not necessarily all in the same manuscript in all three levels of occurrence).
Common features in both groups include a portrait format, no evident pricking holes, regular/even margins, minimal decoration, Masoretic line breaks, a square and professional script that is balanced in size and with an 'Oriental' (either Northeastern or Southwestern) palaeography, an ST vocalisation, dimensions of 23-33 cm long x 20-30 cm wide, and 20-23 lines.
Less common features include square manuscripts, wider margins, a greater amount of decoration, a small and professional script that is Byzantine or Italian, NST vocalisation, and more variation in size and number of lines. It is likely that there are multiple sub-groups of Bible types indicated by these data that can be uncovered through correlational statistics and clustering.
Finally, infrequent features include a landscape format, pricking on both margins, narrow or unbalanced margins, very late Oriental or Italian scripts, complex illumination, no line breaks, no vocalisation, and extremes in size and number of lines.
The most important finding of these descriptive statistics is that they clarify the differences and similarities between Torahs with two and three columns. The two groups of manuscripts had at least one significant difference in the distribution of features for each feature presented above. For example, there are many more Italian-Byzantine near-model Bibles with two columns, while more Oriental near-model Bibles tend towards three columns (§3.1.10). Ultimately, the data show that the two- and three-column manuscripts are related on many points, but distinct in a significant number of ways.
The most noteworthy trend regards dimensions. Two-column Bibles are more heterogeneous in terms of dimensions and line number, which indicates that multiple sub-groups may be more clearly defined in the corpus. Three-column manuscripts, on the other hand, are much more homogeneous, which means that while sub-groups exist, they may be less distinct.
Ultimately, while two- and three-column 'near-model' Torah codices can be grouped together in terms of average shared features, it is clear that we should not conflate them based on their commonalities; they are better characterised as close sisters within the same family.

Descriptive Statistics: Linguistic Features
Within the corpus, the three-column group contains 33 manuscripts with NST vocalisation, and the two-column group contains 22 manuscripts with NST vocalisation (55 total NST manuscripts). By comparing these manuscripts with Codex Leningradensis (hereafter, L), I identified 103 distinct types of variation in all of the manuscripts. Of the total of 103 types of variation, 76 are relevant to the present study. 24 The two-column group had fewer distinct vocalisation or diacritical features (60) than the three-column group (92). The general distributional trends of these features are presented below.
24 Features such as plene and defective spellings, qere in place of ketiv, and textual differences were not incorporated into the statistics presented here. Rafe was also not a factor in the statistics due to the unpredictability of its usage. As Blapp (2017) points out, all the exemplar codices use rafe in a different way, and "this observation suggests that rafe has not been standardised, which makes it necessary to study rafe in each manuscript" (223).

Feature Frequency Distributions
There are three kinds of distributions of NST features in the corpus of manuscripts:

A. Infrequent occurrences: There are a significant number of features in both groups that occur once or at most twice in a manuscript. Either the feature is the only deviation from L present in the manuscript, or the feature is the result of a larger pattern of more complex phonological changes in the pronunciation of the vowels in the text.

B. Even distributions: some features occur evenly through a spread of multiple manuscripts. For example, the feature 'dagesh in an ʾalef' occurs at regularly increasing intervals between one and fifty times in two-column manuscripts. These kinds of distributions are rare, making up at most 10 percent of the data. They indicate that the feature is generally common for that group.

C. Uneven distributions: These are distributions in which a particular feature occurs infrequently in many manuscripts,
• Missing mappiq (10 times)
The above list indicates the NST features that predominate in the corpus and that seem to play the most critical roles in the patterns of NST vocalisation. There are, however, many other deviations from L that occur at lower frequencies, but that are still important for shaping differences in sub-groups of vocalisation.

Discussion
These data complement findings stated in previous scholarship on NST vocalisation. Blapp is indeed correct when he states "we have to be aware that the degree of non-standardness of all the manuscripts [in his thesis] varies". 27 This applies also to the present corpus. Blapp noted, furthermore, that some manuscripts in his corpus, for example, T-S A13.18, contain very few NST features. 28 Likewise, in the present study, there are specific groups of features that occur once or twice in an otherwise fully ST manuscript.
He noted, in addition, extensive non-standard use of dagesh.
Apart from the interchanges of ḥolem/qameṣ and ḥireq/shewa, all of these features predominate to a high degree in my larger corpus of 55 manuscripts.

Methodology Review
The statistical methodology was chosen with the aim of exploring meaningful patterns within the dataset and was therefore non-experimental. The main focus was upon finding patterns using appropriate clustering algorithms and then verifying their linguistic and codicological meaningfulness. The general methodology took three steps:

1. Three clustering algorithms, k-means, k-modes, and mean-shift (defined in section 4.2), were run on the data in order to establish the initial boundaries of large patterns in codicological and linguistic data. The clustering algorithms assessed all of the manuscripts and grouped them based on which features (codicological and linguistic, respectively) certain manuscripts share, and how often those features occur per manuscript in the group. The results of the algorithms are lists of manuscripts that share features.

2. These patterns were analysed in order to identify the most critical factors and to refine the clustering process by identifying and removing distracting variables.

3. Where applicable, traditional tests of significance (ANOVA, Chi-Squared, etc.) were run to clarify the strength of correlations between specific codicological or linguistic features that were unearthed by the clustering results.
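The third step can be illustrated with a chi-squared test of independence on a two-by-two table. The sketch below uses the script counts for NST manuscripts reported in the descriptive statistics (two-column: 9 Italian-Byzantine, 13 Oriental; three-column: 4 Italian-Byzantine, 29 Oriental). It is an illustration of the technique only: it is not a reproduction of any test result reported in the study, it applies no continuity correction, and the function name `chi_squared_2x2` is an assumption.

```python
from math import erfc, sqrt


def chi_squared_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    # With one degree of freedom the upper-tail probability has a closed form.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Rows: two-column, three-column. Columns: Italian-Byzantine, Oriental.
counts = [[9, 13], [4, 29]]
statistic, p_value = chi_squared_2x2(counts)
print(f"chi-squared = {statistic:.3f}, p = {p_value:.4f}")
```

The closed-form p-value holds only for a two-by-two table; larger tables would need the general chi-squared distribution, for which a statistics library is the more practical choice.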

Cluster Analyses
Statistical clustering is a branch of unsupervised machine learning that is targeted towards data mining and towards establishing the shape of patterns in large-scale data. 29 It is, therefore, an appropriate strategy for identifying patterns in Torah manuscripts in the Genizah. 30 Different clustering algorithms group the data together based on similarities, which, when compared in person by the researcher, allow for cross-validation and a more complete picture of patterns within the dataset.
K-means is the most commonly used algorithm, because it works with the mean (average) of numeric data of a manuscript
29 An explanation of the statistical processes used in this research can be found in the following introductory volume: James, Witten, Hastie, and Tibshirani (2015). More technical papers are cited in the footnotes below.
30 In one instance, the computer found separate leaves of the same manuscript and placed them together in the same cluster. This was confirmed by Zina Cohen, who kindly performed her microscopic reflectography method on some of the manuscripts in this corpus (Cohen, Ol-
computer considered too many outlying variables, two manuscripts which shared many codicological features would be artificially separated on the basis of an inconsequential difference.
On the whole, it is better to test on fewer, more critical features, rather than many. Controlling the number of variables produces the best results and can sometimes find the most critical features in the typology. Whilst this method may be susceptible to bias, I was careful to avoid bias by investigating outliers and outlier clusters separately. It, therefore, does not increase the risk of missing out on rare features, because manuscripts which lack the more common, tested features are placed by the computer in an 'outlier' group. This allows the researcher to further investigate and find the rare features that set them apart.
Therefore, avoiding the inclusion of rare features and reducing the number of different factors for the computer to analyse results in clearer groups. Most notably, features that are not included in the clustering, if they truly are part of a pattern, will self-organise around the features that are tested, and the researcher will catch important details.
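The variable-reduction step described above can be sketched as follows: features shown by too few manuscripts are withheld from clustering, and any manuscript that has none of the tested features is set aside as an outlier for separate inspection. The shelfmarks, feature names, the 50 percent cut-off, and both function names are hypothetical choices made for this example.

```python
from collections import Counter


def split_features(records, min_share):
    """Separate features into those tested in clustering and those held back as rare."""
    counts = Counter(f for feats in records.values() for f in feats)
    n = len(records)
    tested = {f for f, c in counts.items() if c / n >= min_share}
    rare = set(counts) - tested
    return tested, rare


def outlier_manuscripts(records, tested):
    """List manuscripts that share none of the tested features."""
    return sorted(ms for ms, feats in records.items() if not feats & tested)


# Hypothetical manuscripts and features.
records = {
    "MS-A": {"portrait", "even margins"},
    "MS-B": {"portrait", "even margins"},
    "MS-C": {"portrait", "wide margins"},
    "MS-D": {"landscape"},
}
tested, rare = split_features(records, min_share=0.5)
print(sorted(tested), sorted(rare), outlier_manuscripts(records, tested))
```

The rare features are not lost: they remain attached to each record and can be examined once the outlier group has been isolated.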

Codicological Cluster Analysis and Results
After the cluster analyses, the next step was to identify the major factors that distinguished the clusters. As some features were identified as biasing the clustering results, they were removed and the clustering was re-performed. The critical features that were included in the final round of codicological clustering were: format, pricking location, margin width, illumination, script size,

Codicological Manuscript Sub-Groups 35
The following subtypes are selected representatives of the full thirty subtypes found across the 296 manuscripts that were clustered.
Small Italian-Byzantine Codex 36 (Two-column)
This was the smallest and most homogeneous group in the typology.

Average Monumental Oriental 42 (Two-column)
This group is the most informal of all the groups represented in the two-column corpus. This is due mainly to the fact that most of them are either re-written in a very clumsy hand, or the hand is not very sophisticated. Regardless, these manuscripts still contain sophisticated codicological features.

Oriental-Byzantine Landscape 46 (Two-column)
This is the smallest group identified by the algorithms, containing only a few manuscripts. These manuscripts, however, are distinct from any other group in that they have a landscape format (width longer than the length). No correlational statistics could be run to test the strength of their features since they all are so alike.

Discussion of Clustering Results
Though only a few of the thirty total groups found in the research are presented here, the results indicate two main findings. Egyptian; at the other end are groups containing mainly (or only) Egyptian manuscripts. This indicates that some codicological formats were perhaps regional, while others were more widespread.
Most importantly, the manuscripts are also visually similar to the others within their respective groups.

A LINGUISTIC TYPOLOGY OF NON-STANDARD TIBERIAN VOCALISATION: THE PRESENTATION OF THE CLUSTERING RESULTS
The linguistic findings presented below were clustered using the three clustering algorithms discussed above. Then the clusters were assessed by a thorough linguistic analysis. The results of the clustering generally fit into the schema that appears below, which was developed independently from the statistical analysis, through rigorous linguistic analysis of the data. 48 Due to limited space, I have chosen to prioritise the presentation of the linguistic results of the clustering analysis over the specific statistical details behind the results.
The findings are organised first by presenting the manuscripts of the main groups established by the clustering and linguistic analysis. Then, manuscripts which are connected to the main groups, but which are outliers in some way, are presented separately and the reason for their uniqueness is described. Furthermore, the two-column group had a small subgroup of individual outliers which did not connect clearly with any main group; these are summarised in footnote 49.
In the schema below, there are two hierarchies of vowel interchange. Patterns X and Y are notational, while the numbered patterns 1 and 2 (and the subtypes) may reflect phonetic changes induced by language contact.

Two-column manuscripts: NST Linguistic Typology
The results below describe the language features of selected manuscripts within all of the clustering groups found (alongside their corresponding schema patterns). Not all manuscripts within the groups are presented here. The full lists of manuscripts are in the corresponding footnotes for each group. Note that specific vowel interchanges are reported with the vowel that appears in the manuscript first, and the vowel which appears in L second, after a hyphen. For example, a pataḥ for segol interchange is written:
There were a few main groups established by the cluster-
(4) a group of manuscripts exhibiting a three-way interchange between ṣere, segol, and pataḥ. 49
49 There also were four manuscripts which were found by the computer to be unique individual outliers unconnected to these four main groups. These are: T-S NS 248.5, which has the Byzantine trio with a more ex-
The following collection of two-column manuscripts contains a clear pattern which I have called the 'Byzantine trio of features'. This pattern was found solely by the computer clustering. The Byzantine trio is as follows:
• Dagesh/Mappiq 51 occurs in consonantal ʾalef, contrasting with rafe on mater lectionis ʾalef and on historical spellings of ʾalef that have no consonantal pronunciation. Its function is to differentiate consonantal and non-consonantal ʾalefs, thereby ensuring that consonantal pronunciation is preserved. Mappiq is typically also extended from word-final heh to word-initial and word-medial heh and has the same function of marking the heh as consonantal.
• Extended use of dagesh to certain 'weak' consonants after a vowelless consonant: mainly lamed, mem, and nun, but occasionally on sibilants such as sin, shin, and samekh, and the emphatics ṭet, ṣade, and qof. In some manuscripts in the group, these consonants without the dagesh take rafe.
• The presence of a silent shewa on word-final ʿayin and ḥet. This has the function of ensuring a word-final guttural is pronounced by explicitly marking that the consonant closes the syllable.
While these features can independently appear in manuscripts from other groups, they occur together in this trio only in manuscripts with Italian/Byzantine or distinct Palestinian scripts.

ST Codices with Lexically-Specific NST features (No Schema Pattern)
This group is the most standard of the two-column manuscripts.
It consists of those manuscripts which contain a few one-off NST features that do not form a particular pattern, alongside one NST feature that occurs in a lexically-specific pattern on only one word throughout. This feature is the placement of shewa for ḥaṭef segol on the word אֱלֹהִים 'God, gods'. This probably does not rep-
This manuscript is connected to the above three-way interchange group in that it has pataḥ and segol interchanges, but is an outlier because it lacks any interchange with ṣere, making it unique. Like the previous group, it lacks qameṣ interchange and has a high level of non-phonetic sign interchange. Vowel interchanges:

Three-column Manuscripts: Non-Standard Linguistic Typology
The main difference between the two-column manuscript data and the three-column data is that manuscripts in the two-column corpus tend to have small, discrete counts of features with a moderate amount of vowel interchange. The three-column corpus has a few manuscripts with extremely high counts of one or two types of vowel interchange. It also has manuscripts with complex patterns of vowel interchange, while the two-column corpus tends to have simpler interchange patterns. Because of these outliers and complexity, I relied only on the k-modes algorithm, as it is less affected by high or low feature counts.

Concluding Discussion: Linguistic Typology
The above typology for two- and three-column NST near-model
• A striving to reproduce the pronunciation of ST, but doing so by using Tiberian vowel graphemes in a non-standard way (orthoepy).
• Lexically-specific NST features that occur in otherwise ST manuscripts, which are probably learned spellings particular to the scribe or to the community that produced the text.
• Sign interchange (specifically, shewa and ḥaṭef vowels, or vocalic shewa and pataḥ), which is only notational and does not represent a phonetic shift in vowels.
• Vocalic interchange patterns of varying degrees of complexity, often occurring alongside the non-standard use of diacritics such as dagesh or silent shewa, and which are likely to reflect pronunciations influenced by Aramaic or Arabic.
The most crucial finding uncovered by the clustering algorithms was that feature frequencies differ between the two- and three-column manuscripts. This affected not only which clustering algorithm was most appropriate for each group, but also the typology itself. Two-column manuscripts had the following general features:
• They exhibited, on average, a moderate amount of vocalic interchange, and the outlier manuscripts could usually be clearly tied to a specific group (or to more than one specific group).
• Many of the manuscripts belonged to either the Southwestern Oriental (Palestinian-Byzantine) or the Italian-Byzantine group.
• The pronunciation behind the vocalic interchange seemed to reflect Aramaic language contact, as seen in the schema patterns.
• Orthoepic features that reinforced ST pronunciation in a non-standard way are associated with the two-column group.
The three-column group had the following different general features:
• Within this group were manuscripts with extreme counts of NST features, or extremely complex patterns of vocalic interchange, including the manuscript with the most NST features (T-S NS 72.1).
• The extremity of the outlying features indicated that only the k-modes algorithm was appropriate for assessing the group statistically, because other clustering algorithms would be biased by the outliers.
• Patterns with extended use of dagesh were associated with the three-column group.
• The majority of the manuscripts in this group were clearly Oriental (Egypt and Palestine, especially twelfth-century Egypt).

The Correlation between Codicological and Linguistic Subtypes
In general, the linguistic patterns found above were distinct not only in the differences between two- and three-column manuscripts, but also in that manuscripts with similar linguistic patterns tended to group together in either the same codicological subgroup or in related codicological subgroups.
The results of these general correlations show that, while linguistic features do co-occur in patterns alongside codicological subtypes, these co-occurrences form wider regional swaths of similarity. It should also be noted that the specific date of the scripts was not a major factor in this study, apart from a few late manuscripts that grouped together. Further analysis may refine these correlational findings by clarifying the palaeographic date of the manuscripts. It can safely be said, however, that subtypes of NST can be regionally defined and generally correlate with regional patterns of codicology.
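A comparison of this kind can be sketched as a simple cross-tabulation of two sets of cluster labels. The sketch below is illustrative only: the label names and the assignments are hypothetical and do not reproduce the subgroups or results of this study. For each linguistic cluster it reports the codicological subgroup in which most of its manuscripts fall, together with the share of the cluster found there.

```python
from collections import Counter, defaultdict

# Hypothetical labels for eight manuscripts (same order in both lists).
linguistic = ["A", "A", "A", "B", "B", "B", "B", "C"]
codicological = [
    "2col-I", "2col-I", "2col-II",
    "3col-I", "3col-I", "3col-I", "2col-II",
    "3col-II",
]

# Cross-tabulation: linguistic cluster -> counts per codicological subgroup.
table = defaultdict(Counter)
for ling, cod in zip(linguistic, codicological):
    table[ling][cod] += 1

def dominant_share(subgroup_counts):
    """Return the most common subgroup and the fraction of the cluster in it."""
    total = sum(subgroup_counts.values())
    subgroup, n = subgroup_counts.most_common(1)[0]
    return subgroup, n / total

summary = {ling: dominant_share(c) for ling, c in table.items()}

for ling, (subgroup, share) in sorted(summary.items()):
    print(f"linguistic cluster {ling}: mostly {subgroup} ({share:.0%})")
```

A high share indicates that a linguistic cluster sits largely within one codicological subgroup. A cluster split between related subgroups, like the hypothetical cluster A here, corresponds to the looser, regional kind of agreement described above.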

Final Conclusions
The analysis in this paper is, to date, the most comprehensive assessment of a large number of manuscripts on both codicological and linguistic grounds. It has introduced a new methodology that allows the researcher to analyse effectively thousands of individual data points across 296 manuscript fragments.
The results clarify our understanding of near-model and NST vocalisation phenomena in the Genizah.
Firstly, it can be affirmed that near-model manuscripts exist as a conceptual category of codex type within the Genizah, and that, when considered as parts of larger groups, those with two columns are distinct, both codicologically and linguistically, from those with three columns. These kinds of manuscripts represent the threshold of the standard, exquisite Bibles, which have been the focus of scholarship, and show that rich diversity lies just below the surface of what has been analysed in the past.
Secondly, it has been demonstrated that codicology can be regionally defined and that styles of book-making practices and scribal habits differed slightly (and in a statistically verifiable way) from region to region in the Genizah. Most importantly, dimensions and line number are the most reliable measures for distinguishing differences in codicological styles across regions.
Thirdly, NST can be considered a hypernym for what is in fact an internally diverse phenomenon with distinct subtypes.
These subtypes can represent many things, ranging from an adherence to the pronunciation of the ST text (but non-adherence in notation), to a completely different phonological profile, which is most likely due to language contact and regional pronunciations of Biblical Hebrew in Egypt, the Levant, Asia Minor, and Italy.
Finally, this study has shown that language and codicological features complement each other and, when studied together, can aid the researcher in understanding the larger picture of the background of the manuscript. Since codicological styles varied by region, and since NST language features also varied by region, codicology and language can indeed be used to help clarify each other. This demonstrates that medieval Hebrew manuscripts are holistic entities, which, in order to be studied properly, must have both their physicality and their language features taken into account.
This study is a first, exploratory step in using the methodology that I have developed here. The methodology should be applied to other groups of manuscripts in order to refine it properly, to find pitfalls, and to calibrate it for further improvements of analysis. It has great potential to allow scholars to look at the wider picture of a corpus of manuscripts without sacrificing detail. Furthermore, statistical clustering puts the researcher above the data and allows for the prioritisation of the most critical data and details.
Avenues for future research include applying this same analysis to other groups of non-standard Hebrew Bible codices (which is the topic of my current PhD research 64), as well as refining the typology presented above by means of further investigation into specific aspects. These include patterns of Masorah, cantillation, or, especially, the extreme outliers identified in this paper. In any case, it is hoped that the present study has not only opened conceptual doors to further bolster our study of medieval Jewish manuscripts, but has also introduced a new methodology and set of tools by which to do so.
64 Working title: "A Codicological and Linguistic Typology of Non-standard Torah Codices from the Cairo Genizah."