Data-Driven Syllabification for Middle Dutch

The task of automatically separating Middle Dutch words into syllables is a challenging one. A first method was presented by Bouma and Hermans (2012), who combined a rule-based finite-state component with data-driven error correction. Achieving an average word accuracy of 96.5%, their system is certainly satisfactory, although it leaves room for improvement. Generally speaking, rule-based methods are less attractive for dealing with a medieval language like Middle Dutch, where not only does each dialect have its own spelling preferences, but there is also much idiosyncratic variation among scribes. This paper presents a different method for the task of automatically syllabifying Middle Dutch words, which does not rely on a set of pre-defined linguistic information. Using a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cells, we obtain a system which outperforms the rule-based method both in robustness and in the effort required.

1 Introduction

§1 The main aim of this study is to develop a tool for automatically syllabifying Middle Dutch words. It goes without saying that the best way to go about this task would be through a simple look-up query in a dictionary, where words are stored alongside their syllabified versions. This method, however, is unattainable for Middle Dutch for two main reasons:

1. Although a dictionary for Middle Dutch does exist (Verdam and Verwijs 1885), it lacks information about syllable boundaries.
2. More importantly, what we today call "Middle Dutch" is a container concept, used for the various dialects spoken in the Low Countries (today: Flanders and the Netherlands) between ca. 1150 and 1500. Since there is no standardized spelling yet in this period, the same word can be spelled in many different ways, depending on where or when a text was written.
Orthographic variation may even be present in the same text, written by one and the same scribe. For instance, the Middle Dutch word for "damsel" has the following spelling variants, among others: joncfrouwe, joncvrauwe, joncvrouwe, joncvrovwe, jonvrowe, ioncfrouwe, ionfrouwe, ioffrouwe. The extensive orthographic variation in Middle Dutch is also the subject of a paper by Van Halteren and Rem (2013), who noted that the lemma gelijk ("similarly") has 24 different word forms in the Corpus Van Reenen-Mulder.
Since there is no list available with all the different spelling variants of every Middle Dutch word, and since the existing dictionary does not contain syllabified versions of lemmas, one would like an automatic system that is able to correctly determine syllable boundaries, while dealing with this multitude of spelling variation in a flexible way. To achieve this, we propose a syllabification method that takes a preannotated list of syllabified Middle Dutch words as input for an RNN-tagger.

2 Rules for syllabification of Modern Dutch

§2 Before we discuss the results of the Middle Dutch syllabifier, it is important to gain insight into the rules that form the basis of correct syllabification for Dutch in general, and for Middle Dutch in particular. Syllable structure in Dutch has been the subject of various studies (Vennemann 1988; Booij 1999; Trommelen 2011). As it is beyond the scope of this paper to provide an exhaustive explanatory model, we will set out the general rules and principles that were followed when annotating the training data.

§3 To give an example of the task at hand, consider the Dutch word kerstavonden (IPA: [kɛrstaːvɔndən]; English translation: "Christmas Eves"). Naively, we can propose several different syllabifications, presented in Table 1. Intuitively, however, some of these candidates seem less likely than others. But then what are the requirements for a possible syllable in Dutch?

§4 In any language, a syllable consists of up to three elements: an onset, a nucleus, and a coda. The nucleus is indispensable, and is (at least in modern Dutch) always a vowel or a diphthong. Onset and coda, both optional, are the collections of consonants that respectively precede and follow the nucleus. Nucleus and coda combined form the syllable's rhyme.
Altogether, the internal structure of a syllable is represented as in Figure 1 (with a lower-case sigma [σ] as the standard symbol for a syllable in phonology studies).

§5 In order to obtain the correct syllabification of a Dutch word, we have to take into account a couple of principles and constraints that are imposed onto this template:

1. Sonority Ranking Hierarchy: The main constraint relates to the contents of onset and coda. The order of consonants in both clusters is determined by the Sonority Ranking Hierarchy (Figure 2). The sonority of consonants has to decrease towards the outer edges of a well-formed syllable. Consequently, a mirror effect can be perceived: the onset of the syllable will have the least sonorous consonants placed at the beginning, whereas in the coda, they have to be placed near the end (Selkirk 1982; Kager and Zonneveld 1986).

[...] (Booij 1999, 29-30). Or, in other words, syllable boundaries tend to follow morphological boundaries. Since kerstavonden is a compound [...]

1 Without going too deep into the matter, it should be noted that there are some exceptions to the Sonority Ranking Hierarchy. For Germanic languages, the most prominent one is the possibility of /s/ functioning as a sort of appendix, which, in some cases, can be added to the beginning of an onset, or to the end of a coda. This results in the possibility of having e.g. /str/, /skr/ and /spl/ as legitimate onsets in Dutch, and /lps/, /rks/, /rts/ as well-formed codas (Booij 1999, 26-29).
In conclusion, correct syllabification comes down to ticking boxes. Firstly, when syllable boundaries cross morpheme boundaries, one can immediately reject the proposed syllabification. Secondly, consonants are maximally assigned to the onset, while, thirdly, respecting the Sonority Ranking Hierarchy. For the proposed syllabifications of kerstavonden, ticking these boxes produces a set of possibilities as in [...]

[...] For instance, the letters 〈u〉 and 〈v〉 are often used interchangeably. In a word like huse/hvse ("house") they both have the sound value of the vowel /y/, yet in the word over/ouer ("about") they should be pronounced as a consonant /v/. Sometimes one word can even have both phonetic realizations: in geualueert ("evaluated"), the first 〈u〉 represents a consonant and the second one a vowel. When evaluating the syllabifier, we must pay special attention to cases like this, where there is an increased risk of incorrectly assigning a grapheme to the onset, coda or nucleus. Table 3 provides an overview of such "graphemic pitfalls" for Middle Dutch syllabification.

§7 Dealing with the orthographic variation also means making decisions, some more arbitrary than others, with regard to what will be considered a correct syllabification of a Middle Dutch word. One such decision concerns the graphemic cluster 〈ie〉. It is generally assumed that Middle Dutch 〈ie〉 was pronounced as monophthongal /iː/. [...]

[...] In the same text, we also find rhyme combinations such as philosophie : lije, paertije : normendie, indicating that in Maerlant's Spiegel Historiael the syllabified versions of these words should therefore be: [...]

[...] with Middle Dutch's orthographic variation (as outlined in Table 3). An example of such a rule is the one that has to prevent 〈u〉 from being recognized as a vowel (/y/) when it is actually a consonant (/v/):

In the sequences 〈aue〉, 〈eue〉, and 〈oui〉, 〈u〉 almost always functions as a 〈v〉.
Therefore, we replace such sequences with 〈aUe〉, 〈eUe〉, and 〈oUi〉, respectively, where we use 〈U〉 as the character that denotes a 〈u〉 functioning as a consonant (Bouma and Hermans 2012, p. 33).
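As a rough illustration of the quoted rule, the replacement could be sketched as follows. This is a minimal sketch and not Bouma and Hermans' implementation: the function name and the dictionary are hypothetical, and leuen is only an illustrative string. The sketch also shows how a form such as geualueert, whose first 〈u〉 is consonantal, falls outside the three listed contexts.

```python
# Hypothetical helper illustrating the quoted rewrite rule; not the original code.
CONSONANTAL_U_CONTEXTS = {"aue": "aUe", "eue": "eUe", "oui": "oUi"}


def mark_consonantal_u(word: str) -> str:
    """Replace <u> by <U> in the three contexts named in the quoted rule."""
    for pattern, replacement in CONSONANTAL_U_CONTEXTS.items():
        word = word.replace(pattern, replacement)
    return word


# "leuen" contains the context <eue>, so its <u> is marked as consonantal.
print(mark_consonantal_u("leuen"))
# "geualueert" contains none of the three contexts and is left unchanged,
# although its first <u> is a consonant.
print(mark_consonantal_u("geualueert"))
```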
Although the premise of this rule is legitimate, a significant risk is lurking: the prospect of being incomplete. And indeed, the above-mentioned rule disregards many other possible grapheme clusters. [...] Table 4).

Additionally, some inconsistent syllabifications were amended. For example, in their corrected word list, Bouma and Hermans sometimes syllabified the word ending 〈iaen〉 as (i)σ(aen)σ, yet at other times they left (iaen)σ as one syllable. Since we want a machine learning algorithm to learn from a list of syllabified words, this consistency is especially important. In total, 95 words from Bouma and Hermans' list were found to be inconsistently syllabified (for some examples, see Table 5). When manually reviewing the 43,710 syllabified words used in our experiment, we made sure that this consistency was maintained throughout the entire data set.
3 Some tokens in the CRM contain diacritic symbols to indicate abbreviations, clitic forms, or unclear parts in the original charter. Striving towards orderliness, such tokens were excluded when collecting the data.

Haverals et al: Data-Driven Syllabification for Middle Dutch, Art. 2, page 11 of 23

(3) Finally, several words were deleted from the list when the orthography did not match the phonetic realization. Roman numerals and ordinals such as cccxc, lxxvij, xxiiiisten or xxvjsten do not require syllabification, since they are not pronounced according to their graphemic representations.
Their pronunciations are respectively: driehonderdnegentig ("three hundred ninety"), zevenenzeventig ("seventy-seven"), vierentwintigste ("twenty-fourth") and zesentwintigste ("twenty-sixth").

[...] words, alphabetically ranging from a to zy-wer-des ("sideways"). Statistics on the average length of words, the average number of syllables per word and the average number of characters per syllable are provided in Table 6. The entire data set is also made freely available for exploration and research purposes (Haverals 2018).

Model

[...] layers, as illustrated in Figure 3. The original paper on LSTM machine learning is by Hochreiter and Schmidhuber (1997). One of the great benefits of the LSTM-model lies in its capability to take into account the larger context: by letting information flow throughout the entire sequence, the model learns not only from the immediately adjacent graphemes, but can also retain information about the entire sequence. This way, the model is especially efficient at learning about, e.g., the maximal saturation of the onset of a syllable and morpheme boundaries.

[...] of the total amount of data.

§20 Each word is presented to the model at the character level, i.e. as a sequence of graphemes. As is customary in models of this kind, we do not represent graphemes with a so-called one-hot encoding, but use embeddings to represent each grapheme (with a fixed dimensionality of 64). Additionally, special symbols are added to the beginning and end of each token (BOS and EOS).
Finally, words get padded to a standard length (i.e. the size of the longest training token + 2, to accommodate the BOS and EOS symbols) using a dedicated padding symbol (PAD). Naturally, the predictions for these dummy symbols were not included in the final evaluation. Characters that were not encountered in the training material receive a special encoding (UNK). The task of the model, then, is to predict either 0 or 1 for each grapheme. When 0 is predicted, there is no indication of a syllable boundary before this particular grapheme. The prediction of 1 means that a syllable boundary is detected right before this grapheme.

[Figure caption fragment: ... word "kerstauonde" as input and predicting the output to be "kerst-a-uon-de".]

§21 All code for this paper is available from GitHub; the code uses Python 3.6+ and has the following major dependencies: NumPy (Oliphant 2006), scikit-learn (Pedregosa et al. 2011), Keras (Chollet et al. 2015) and TensorFlow (Abadi et al. 2016).
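The input and output representation described in §20 can be sketched in plain Python. This is an illustrative sketch only and not the code from the GitHub repository: the function names, the string form of the special symbols, and the toy vocabulary are assumptions made for this example.

```python
# Assumed string forms for the special symbols named in the text.
BOS, EOS, PAD, UNK = "<BOS>", "<EOS>", "<PAD>", "<UNK>"


def to_labels(syllabified):
    """Turn 'kerst-a-uon-de' into graphemes plus one 0/1 label per grapheme.

    A label of 1 means that a syllable boundary directly precedes the grapheme.
    """
    graphemes, labels = [], []
    boundary_pending = False
    for ch in syllabified:
        if ch == "-":
            boundary_pending = True
        else:
            graphemes.append(ch)
            labels.append(1 if boundary_pending else 0)
            boundary_pending = False
    return graphemes, labels


def encode(word, vocab, max_len):
    """Wrap a word in BOS/EOS, map unseen graphemes to UNK, pad to max_len + 2."""
    symbols = [BOS] + [ch if ch in vocab else UNK for ch in word] + [EOS]
    symbols += [PAD] * (max_len + 2 - len(symbols))
    return symbols


def from_labels(graphemes, labels):
    """Rebuild the hyphenated form from graphemes and predicted labels."""
    pieces = []
    for ch, label in zip(graphemes, labels):
        if label == 1:
            pieces.append("-")
        pieces.append(ch)
    return "".join(pieces)


graphemes, labels = to_labels("kerst-a-uon-de")
print(labels)                          # one label per grapheme
print(from_labels(graphemes, labels))  # round trip back to the hyphenated form
print(encode("kerst", set("kerst"), max_len=7))
```

In an actual model the symbols would additionally be mapped to integer indices before being passed to the embedding layer; that step is omitted here.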

Results
§22 When evaluating the model, we make a distinction between word accuracy and hyphenation accuracy. Word accuracy is the percentage of fully correctly syllabified words, whereas hyphenation accuracy is the percentage of correctly inserted hyphens across all words and across all syllable boundaries. The latter can be calculated at the character level. An illustration of both concepts is provided in the word list shown in Table 7. Word accuracy in this fictitious example is fairly low at 60%, since only 3/5 words are correctly syllabified. With 9 out of 11 hyphens placed correctly, hyphenation accuracy is at 82%.

Table 7: Example word list with predictions and correct syllabifications. This list serves as an example for the purpose of explaining the difference between word and hyphenation accuracy. Examples are not actual predictions made by the model.

§23 The results presented in Tables 8 and 9 are obtained after a training regime of 30 epochs, with a batch size of 50 words. We used the cross-entropy loss in [...] after the embedding layer, between the recurrent layers, and before the final dense layer (Srivastava et al. 2014). The key idea here is that by randomly dropping units (along with their connections) during training, overfitting can be prevented.
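The difference between the two accuracy measures defined in §22 can be made concrete with a short sketch. The function names are hypothetical, and the sketch assumes that hyphenation accuracy is the share of gold-standard hyphens that the prediction reproduces at the same position; the text leaves the exact denominator open, so this is one possible reading. The toy list reuses word forms quoted elsewhere in this paper and does not reproduce Table 7.

```python
def boundaries(syllabified):
    """Return the set of grapheme indices that are directly preceded by a hyphen."""
    positions, index = set(), 0
    for ch in syllabified:
        if ch == "-":
            positions.add(index)
        else:
            index += 1
    return positions


def word_accuracy(gold, predicted):
    """Share of words whose predicted syllabification matches the gold one exactly."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def hyphenation_accuracy(gold, predicted):
    """Share of gold hyphens reproduced at the same position (assumed definition)."""
    hits = total = 0
    for g, p in zip(gold, predicted):
        gold_positions = boundaries(g)
        total += len(gold_positions)
        hits += len(gold_positions & boundaries(p))
    return hits / total


# Toy data for illustration only; not actual model output.
gold = ["domp-li-ke", "rijp-heyt", "kerst-a-uon-de"]
predicted = ["dom-pli-ke", "rij-pheyt", "kerst-a-uon-de"]

print(word_accuracy(gold, predicted))         # 1 of 3 words fully correct
print(hyphenation_accuracy(gold, predicted))  # 4 of 6 gold hyphens matched
```

A single misplaced hyphen makes the whole word count as wrong, while the other hyphens in that word still count as correct, which is why hyphenation accuracy is always at least as forgiving as word accuracy.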

Model inspection

§24 The best results are obtained with the two-layered model combined with 256 dimensions. Under these circumstances, our model yields a word accuracy of 97.55% (Table 8) and a hyphenation accuracy of 99.50% (Table 9) on the test set.
Overall, scores improve most when stepping up from a 1-layered to a 2-layered model. When adding a third layer, some minor improvements are also noticeable, but overall scores appear to have already stabilized in the 2-layered model. As to the number of dimensions, the difference in scores between 64 and 256 dimensions never exceeds an improvement of 1.00% on the word level and 0.20% on the hyphenation level.

§25 We compared the output of our best LSTM-model with the output of Bouma and Hermans' rule-based model (both on the same test set of 4,371 words).
From this comparison, we obtained the results shown in Table 10: on the level of word accuracy, the LSTM-model (97.55%) outperforms Bouma and Hermans' model (91.33%) by 6.22 percentage points. On the level of hyphenation accuracy, the improvement is more subtle, with an increase of 1.53 percentage points over the rule-based model.

[...] (Levenshtein 1966). Briefly put, this metric is defined as the number of edits required to convert one string into the other.⁴ In our case, we apply the Levenshtein distance in order to compare the correct, gold standard syllabifications to the predictions made by our model. As one can see in Table 10, the Levenshtein distance of our model is very low with an average of .04 edits, which is more than three times as low as the distance calculated for Bouma and Hermans' model (.17). With the F1-score, finally, we wanted to gain insight into the balance between precision and recall on the character level, because there is a significant imbalance between the two classification labels in our model. Here also, the score obtained for the LSTM-model is nearing the perfect score of 1.0.

§27 Surely, developing an automatic syllabifier is only really interesting if it can also be effectively deployed onto other corpora. In order to get a good understanding of the syllabifier's potential, we evaluated our model on an out-of-corpus sample of Middle Dutch words. To this end, we randomly selected 2,000 words from the Cd-rom Middelnederlands (1998). Unlike the legal and administrative character of the Corpus Van Reenen-Mulder, the Cd-rom Middelnederlands is a corpus of literary texts, both rhymed and prose. In order to make sure that our evaluation was carried out [...]

[...] Table 10 can therefore be an underestimation of the model's performance "in the wild".

4 As a clarifying example for the Levenshtein distance, consider the following two syllabifications: af-ter-wards and aft-er-ward-s. The Levenshtein distance here is 3, since it would require three edits in order to transform one sequence into the other (twice the deletion and once the insertion of a "-").
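The example in footnote 4 can be checked with a standard dynamic-programming implementation of the Levenshtein distance. The function below is a generic textbook sketch, not the evaluation code used for Table 10, and its name is chosen for this example only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


# The two syllabifications from footnote 4.
print(levenshtein("af-ter-wards", "aft-er-ward-s"))  # 3
```

Averaging this distance over all gold and predicted pairs gives a per-word figure comparable in kind to the averages reported above.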

Model criticism

§29 Where does it still go wrong? From an inspection of the mistakes made by both our LSTM-model and Bouma and Hermans' model, we learn the following (Table 12): (1) the LSTM-model is very accurate at respecting morpheme boundaries.
We notice this especially from adjectives ending in -heit and adverbs ending in -like.
In almost all cases, such words are syllabified correctly by the LSTM-model, which rightly treats the suffixes of these words as independent domains of syllabification (e.g. domp-li-ke, ern-stic-heyt, rijp-heyt, siec-he-de). In syllabifications produced by Bouma and Hermans' model, we notice that the final letter of the stem sometimes gets added to a morpheme that it does not belong to (e.g. dom-pli-ke, ern-sti-cheyt, rij-pheyt, sie-che-de). Prefixes like and-, ver- and on- are also kept intact by the [...]

6 Conclusion

§30 Essentially, there are two approaches to the task of automatic syllabification: rule-based and data-driven. An automatic syllabifier for Middle Dutch was first developed by Bouma and Hermans (2012), whose approach is fundamentally a rule-based one. The way they approach the task is very elegant and the scores they achieve are high. Nevertheless, one could argue that specifically for Middle Dutch, their model is not a very robust one. By heavily relying on a set of rules that describe possible nuclei, onsets and codas, their model underestimates the somewhat erratic nature of Middle Dutch orthography. Because the spelling of Middle Dutch allows a lot of variation in both diachronic and synchronic terms, it is risky to hard-code this information. The automatic syllabifier presented in this paper removes the need to define such rules explicitly, and thus offers more flexibility when it comes to spelling variation. Relying on a purely data-driven method, our model is highly effective at predicting syllable boundaries while respecting morpheme boundaries.
Using LSTM machine learning techniques, we obtain high results at the word level: 97.55% on the test set of the training material corpus, and 98.74% on an out-of-corpus sample. The results of the automatic syllabifier for Middle Dutch are therefore in line with comparative research on different syllabification methods, which finds that data-driven methods usually outperform rule-based techniques by large margins (Marchand et al. 2009).