Keys with nomenclatures in the early modern Europe

Abstract We give an overview of the development of European historical cipher keys originating from early Modern times. We describe the nature and the structure of the keys with a special focus on the nomenclatures. We analyze what was encoded and how and take into account chronological and regional differences. The study is based on the analysis of over 1,600 cipher keys, collected from archives and libraries in 10 European countries. We show that historical cipher keys evolved over time and became more secure, shown by the symbol set used for encoding, the code length and the code types presented in the key, the size of the nomenclature, as well as the diversity and complexity of linguistic entities that are chosen to be encoded.


Introduction
Despite the great fascination and public interest in historical ciphers, we know little about the usage, evolution, and development of ciphers throughout history. How did cipher keys evolve over the centuries? What linguistic entities were encrypted and how? How big was the vocabulary in keys, and what entities were chosen to be encoded in the nomenclatures of the keys? What kind of code structure was chosen and what symbol systems were used for encryption? What can we expect to be hidden in a ciphertext and how can we try to reconstruct the key and decrypt a ciphertext from a particular area and time period?
One reason behind the lack of studies on the evolution of keys has been the lack of large(r) collections of cipher keys from a wide range of geographic areas over the centuries. Today, we have digital collections that allow empirical studies on keys and ciphertexts, giving us the possibility to make comparisons across centuries.
In this paper, we examine over 1,600 cipher keys, with and without nomenclatures, that were collected from 10 European countries and were created and used during early modern times, from the 15th to the 18th centuries. We give a systematic overview of the key structure and present a typology for keys on the basis of what was encrypted and how. More specifically, we look at the structure of the plaintext elements and their corresponding code structure used to create the ciphertext.
Next, in Sec. 2, we exemplify historical keys, clarify some terms and explain how we use them throughout this paper. Then, in Sec. 3, we present previous work on the study of keys along with available collections of keys. In Sec. 4, we introduce the dataset that this study is based on and in Sec. 5, we define the typology of keys including the plaintext, ciphertext, and the encoding structure used in the keys. We then describe how we analyzed over 1,600 keys from various centuries and geographic areas. In Sec. 7, we present the results and in Sec. 8 we show tendencies over time and highlight some typical characteristics for selected geographic areas. We discuss our findings in Sec. 9, and conclude the paper in Sec. 10.

Historical keys and some terminology
Ever since humans began using writing as a form of communication, keys have been created to hide the content of messages from unintended eyes. The key defines the transformation of the plaintext into ciphertext: the message is encrypted by replacing the plaintext elements with codes as specified by the key. The plaintext elements can be, among others, alphabetical letters, double letters, syllables, morphemes, words, names, phrases or sentences. Codes can be represented by a wide range of symbols, of fixed or variable length, including alphabetic characters, digits or various kinds of graphic signs (Megyesi et al. 2021). Nullities signaling fake plaintext, cancelation signs marking which elements in the message shall be removed, and instructions can also be part of the key. We illustrate parts of keys from the 14th, 15th and 16th centuries in Figure 1.
Keys can typically be divided into different sections: one with the alphabet, and one with longer plaintext entities, such as syllables, words and other elements. In the literature, keys are sometimes called nomenclatures when they include larger plaintext elements than the alphabet. In our work, hereafter, we make a distinction between a key and what constitutes a nomenclature. By key we mean an entire key used for encryption, which contains at least a key alphabet and sometimes also nullities, cancelation signs, and/or a list of nomenclature elements. By key alphabet we refer to the section of a key that encodes the collection of individual letters which can be combined in order to create a message in the key's target language(s). The alphabet can also contain digraphs of two characters and, in some cases, even trigraphs of three characters, depending on the alphabet of the individual languages (e.g. "sz" and "dzs" in Hungarian). By contrast, the nomenclature contains elements which are two or more letters long and which are independent of the alphabet. These can be syllables, function words, lexical items, personal names, place names, or simply some combinations of alphabet letters. The nomenclature elements are typically marked in a special section different from the alphabet. In Figure 2 we illustrate the various parts of a key. In the first part, at the top of the manuscript page, we see the dating and the explanation about the key. The second part contains the Latin alphabet listed horizontally with the codes underneath, also in Latin letters, a classic example of a Caesar shift. The third part consists of three columns, listing personal names and two place names with capitalized Latin code letters, followed by other place names coded with digits. The first part of the third column lists military terms coded with metaphors, while the second part contains personal names coded with graphic signs. In the bottom right corner, two more personal names are given in cleartext, probably those of the persons involved in the correspondence.
Moreover, we make a distinction between various text types in the encrypted sources, following Megyesi, Blomqvist, and Pettersson (2019). Plaintext is defined as the unencrypted data element that is intended to be encrypted or decrypted as defined by the key. Cleartext, on the other hand, is unencrypted data that is not meant to be encrypted but gives information about the origin or the functioning of the key in clear text. Such texts typically indicate when the key was created, in which relation it was meant to be used (the name of the correspondent) or, in the case of long nomenclatures or codebooks, whether the key is used for encryption ("chiffrant") or for decryption ("déchiffrant"). Shorter or longer instructions that explain how to use the key, what is a nullity (when nulls are not merely listed) and how the key is supposed to be exchanged with the addressee are also written in what we call "cleartext." In Figure 2, the first part of the key stating the date and the instructions in the upper right corner would constitute cleartext.
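The division of a key into a key alphabet and a nomenclature can be made concrete with a small sketch. The Python code below is only an illustration under stated assumptions: the shift of 3, the modern 26-letter alphabet, the nomenclature entries, and the names caesar_alphabet and encrypt are all hypothetical and are not taken from any key in the dataset. It encodes a message by preferring the longest matching nomenclature entry and falling back to the key alphabet letter by letter.

```python
import string


def caesar_alphabet(shift: int) -> dict[str, str]:
    """Key alphabet: each letter maps to the letter `shift` places later."""
    letters = string.ascii_lowercase  # assumption: modern 26-letter alphabet
    return {p: letters[(i + shift) % len(letters)] for i, p in enumerate(letters)}


# Hypothetical key: a Caesar-shifted alphabet plus a tiny nomenclature whose
# codes use a capital letter, digits, and a metaphor, respectively.
KEY_ALPHABET = caesar_alphabet(3)
NOMENCLATURE = {"king of france": "A", "vienna": "12", "army": "garden"}


def encrypt(message: str, alphabet: dict[str, str], nomenclature: dict[str, str]) -> str:
    """Replace nomenclature entries first (longest match), then single letters."""
    text = message.lower()
    entries = sorted(nomenclature, key=len, reverse=True)
    codes = []
    i = 0
    while i < len(text):
        entry = next((e for e in entries if text.startswith(e, i)), None)
        if entry is not None:
            codes.append(nomenclature[entry])
            i += len(entry)
        else:
            # Characters outside the key alphabet (spaces etc.) are dropped.
            if text[i] in alphabet:
                codes.append(alphabet[text[i]])
            i += 1
    return " ".join(codes)


print(encrypt("Army to Vienna", KEY_ALPHABET, NOMENCLATURE))
```

The sketch matches nomenclature entries anywhere in the text, including inside longer words, which a human cipher clerk would not necessarily do; it is meant to show the two-part structure of a key rather than historical practice.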

Related work
In this section, we give a brief overview of previous studies and collections of keys in early modern times. After that, we briefly summarize online resources which provide digital collections of historic ciphers and cryptographic keys.

Previous studies
Early modern cipher keys have been the center of attention since the beginnings of historical cryptography research. Already in the 19th and the early 20th centuries, large collections of such keys were copied from the archives and published. In the last decade of the 19th century, Ludwig von Rockinger transcribed dozens of keys from a Bavarian collection (Rockinger 1892).
Shortly after the turn of the century, Aloys Meister edited two volumes, one on the 15th century cryptography of the Italian city states, and a second one on the late medieval and early modern practices of the papal court. Meister's volumes, which are still often cited by specialists, are not merely key transcriptions: he also edited a few classic, cryptography-relevant texts and provided his books with helpful introductions (Meister 1902, 1906). Half a century later, J. P. Devos copied and published the keys of Philipp II from the Spanish archives (Devos 1950). These publications have certainly induced a lot of scholarly interest and facilitated expert research. However, the text editions did not go very deep in analyzing the keys, nor did they try to carry out a quantitative analysis.
A much more substantial contribution is due to David Kahn who, in his grand oeuvre, "The Codebreakers" (Kahn 1996), offered a historical analysis of cryptography. Yet even he did not undertake a quantitative analysis; instead, he stressed the importance of such research in a later publication:
"This research requires merely examining the many nomenclatures in the archives of Italy and France and timing and quantifying the change. I suppose it will be tough, living in Europe for a year and having an aperitif after a day examining antique manuscripts. But somebody should do it!" (Kahn 2008, p. 58)
Nothing is known about the number of aperitifs consumed by the historians doing research in the archives, but Kahn's words were not in vain. The proposed project has been taken up by several scholars on the basis of the archival materials of various geographical areas: Láng for the Central European area (Láng 2015, 2018), Lasry, Megyesi, and Kopal for the early modern papal correspondence (Lasry, Megyesi, and Kopal 2021), and the authors of the present article, who provided a first larger quantitative analysis of 700 cipher keys (Megyesi et al. 2021) and a more recent, detailed study of the plaintext elements in 1,384 keys (Megyesi et al. 2022).
The temporal sequence of these studies also marks a gradual shift from the more qualitative to the predominantly quantitative approaches. Láng's monographic analysis (Láng 2018) was based on a sample of 400 cipher keys, of which he formed structural clusters, and described the typical representatives of each group. He traced a six-step evolution proceeding from the classical monoalphabetic ciphers using symbols, through those monoalphabetic ciphers which were completed with a short list of nomenclatures, to the "weak" homophonic ciphers where only the most frequent vowels are substituted, which were in turn complemented with nomenclatures, arriving finally at the mature homophonic ciphers with a large nomenclature table, which in turn gave way to the large code books. Láng also gave a typology of the various elements of the nomenclature table, the nullities, the grammatical functions and the letters of the alphabet.
Lasry et al. chose a particularly exciting context, that of the 16th-18th century papal correspondence (Lasry, Megyesi, and Kopal 2021). The papal court was certainly the most innovative in the field of cryptography in the 16th century, but it steadily declined and ceded its leading position to other diplomatic centers in the following two centuries. The lengthy article reconstructs a number of keys, solving previously unknown and unread ciphertexts with computerized methods. The various algorithms developed for decryption were also implemented in the open source software CrypTool 2 (https://www.cryptool.org/en/ct2/). The authors present sophisticated enciphering practices typical for the court of the Pope. These included practical but relatively unsafe polyphonic methods, and variable length homophonic ciphers which have made historical and modern decryption extremely hard. An important conclusion of this article is that no easy typology of cipher keys can be offered, as various innovations and methods intermingled in the diplomatic practice, and thus the evolution of keys was far from being a linear improvement.
Megyesi et al. (2021) undertook a task similar to that of the present article: a systematic analysis of 700 cipher keys collected at that time in the DECODE database (Megyesi, Blomqvist, and Pettersson 2019), focusing on the symbol system, languages, nulls, and code types. A year later, the authors continued with studies of the plaintext entities in nomenclatures encoded in 1,384 keys (Megyesi et al. 2022). The two articles can be seen as previous steps of the present research in two senses: the first was carried out on a limited set of keys, and the second investigated only part of the components of the keys. To our knowledge, the studies by Megyesi et al. (2021, 2022) were the first to use large scale statistics to analyze the trends and the morphology of the cipher keys, but neither of them involved the entire nomenclatures of a significantly bigger dataset, making the present study more representative and exhaustive than the previous ones.
The present paper builds on the earlier results, but is much more ambitious. First, the empirical basis of the research has grown considerably (from 700 to 1,600 keys) and has become more balanced over the centuries. Second, the scope includes not only the plaintext elements, as in Megyesi et al. (2022), but the entire nomenclatures as well.

Digital online collections of historical ciphers
Many historical cipher keys have been destroyed for security reasons, but we can find plenty of keys that are accessible after some time digging. Sometimes keys are indexed as ciphers or similar items, but oftentimes not, so we rely on a kind and hopefully enthusiastic librarian to uncover the right boxes. We can find single keys with or without their original plaintext and the applied ciphertext(s) stored together in a box, or we find an entire collection of keys from a particular time period kept together in a pile among other documents. Once they are found, we prefer digitized copies of the material to be able to study them.
Although many archives offer online services, such as online catalogues and digital libraries with image scans of their manuscripts, the majority of their holdings have not been digitized yet. Thanks to a handful of researchers, a few online resources dealing with ciphers as well as cryptographic keys are now available.
Antal's "Portal of Historical Ciphers" (Antal and Zajac, 2020) hosts a still small but growing database of original historical ciphers from the 17th up to the 21st century. Right now, their database contains 54 ciphertexts, of which 12 are unsolved.
Tomokiyo's private homepage "Cryptiana" is one of the largest online sources for ciphers and keys. This well organized website features collections of original material from all over the world. Besides images of the original ciphers and keys, the website also contains helpful material dealing with cryptanalysis of historical ciphers. The collection contains material from the 15th up to the 20th century.
Being a crypto expert and blogger, Klaus Schmeh collects all sorts of historical ciphers and keys and discusses these in his blog "Cipherbrain," presenting the material to a broader audience. With the help of his blog readers, many previously unsolved ciphers have now been solved.
Finally, the DECODE database (Megyesi, Blomqvist, and Pettersson 2019) is the biggest source for historical ciphers and keys today. At the time of writing, the database contains over 2,800 historical encrypted sources: 1,185 ciphers and 1,610 original historic keys. The present study is based on some of the keys stored in the DECODE database, which will be described next.

Sample of keys
For the structural description and empirical analysis of keys, we extracted 1,610 keys from the DECODE database (Megyesi, Blomqvist, and Pettersson 2019). They were collected and digitized from various archives in Europe and uploaded and registered to the database by June 14, 2021 at the latest. Images of the original documents and metadata such as current location, dating, and symbol set, among other features, are available and recorded in the DECODE database.
The keys that are analyzed in this study are deposited in archives from 10 European countries: Austria, Belgium, France, Germany, Hungary, Italy, Spain, the Netherlands, the UK, and the Vatican City State. Figure 3 illustrates the origin of the keys kept in various libraries and archives in Europe. The majority, over 900 keys, originate from Austria. Most keys are dated from the 17th century, followed by the 15th century. The sample is rather opportunistic, and is unfortunately not balanced as far as the geographical distribution of the source holding archives and libraries is concerned. We investigated what we could get from these holders. However, we believe the data is rather representative of the various centuries for the development of a structural description of keys. A major reason for this optimism about the representativeness of the data is that cipher keys held today in a specific archive (e.g. the Haus-, Hof- und Staatsarchiv in Vienna) originate from a large number of geographical areas, and do not merely represent what can be called the history of a certain country. In the case of Austria, for instance, the ciphers were used by the Habsburg chancellery, and the Habsburgs ruled in the Holy Roman Empire, Spain, Belgium, and parts of Italy. Their secretaries were in correspondence with various other courts of Europe. When sufficient metadata can be found on a specific key, it often turns out to be of Tuscan or French origin despite the fact that it is kept today in Vienna. The situation is quite similar with other major archival collections: their holdings go far beyond the history of the state in which they are kept, and due to the diplomatic and bilateral nature of cryptography, they carry information about many more geographic areas.
In the sample of keys, we find various types of symbols that were used in the codes, including graphic signs (G) such as zodiac or alchemical signs, digits (D), and alphabetical letters (A). Figure 4 illustrates the distribution of the various symbol types across centuries. Perhaps unsurprisingly, the graphic signs and alphabetical letters become less popular over time while the use of digits greatly increases in popularity.
The symbol types are often combined in the keys into symbol sets, mixing digits, alphabetical letters, and various types of graphic signs in the same key. Figure 5 illustrates the symbol sets used across centuries. The most frequent one in the 15th century was a mixture of alphabetical letters, digits, and graphic signs, which greatly decreased in frequency in the 16th century, when various combinations were more equally distributed. From the 17th century onwards, digit-based ciphers became dominant.
The underlying plaintext and cleartext languages represented in the keys vary in the collection, see Figures 6 and 7. However, the languages and their quantitative distribution indicated in the figures can only provide a rough tendency, as the languages in most of the keys have not yet been analyzed (77% of all keys lack a label for the cleartext language and 56% for the plaintext language). Nevertheless, what clearly can be seen is that there is a range of different languages, including international and local languages, used between the 16th and the 18th century. One other tendency coming to the fore is that French gains importance and is dominating as a plaintext language in the 18th century even though only a small share of all keys included in our dataset stems from a French archive. The dominance of French is not unexpected as French functioned as the "lingua franca" in courts, culture, as well as in diplomacy during that time period. The rise of French happens mainly at the expense of Latin, which drops in relative frequency compared to the vernacular languages. This tendency is even more significant for the cleartext languages.
The cleartext in keys serves to provide information about the sender and receiver, date, place, and any other non-encrypted text. The high rate of German among the cleartext languages is due to the fact that documents from the former Habsburg Empire now held in Austria, with German as the main language of correspondence, form by far the biggest part of our dataset.
The distribution of the above mentioned characteristics of keys over the centuries in Europe shows results similar to those of the pilot study based on a smaller sample of 700 keys described in Megyesi et al. (2021).

The structure of keys with nomenclatures
Following the morphological description of keys described in Megyesi et al. (2021), the features analyzed concern characteristics on three different levels: the plaintext elements, the ciphertext elements and the encoding structure. Here, we are particularly interested in the nomenclatures of the keys, and apply the three-level analysis to all parts of the key that are not part of the alphabet but belong to the nomenclature list: the plaintext elements and the ciphertext in the nomenclature list, as well as the encoding structure used for the nomenclatures.
With regard to the plaintext elements, we analyze what kind of linguistic entities were subject to encoding in the nomenclatures. These include named entities, such as names of persons or locations, content words, function words, syllables, morphological suffixes, phrases, sentences, numbers, and punctuation marks.
On the level of the nomenclature ciphertext, we characterize the symbol set used for encoding, including any systematic use of diacritics in the codes. Furthermore, the code length and type are specified and the possible use of lexical cleartext items is determined.
Lastly, our attention is directed to the encoding structure, which includes the size of the nomenclatures with respect to how many key-value pairs they contain, the layout arrangement of plaintext and nomenclature items (including a possible use of sections and headings), and the type of ordering system with regard to the content.
For features with binary values (i.e. whether the feature exists or not) we use a binary notation by assigning a value of 1 to the feature we are looking at if it is present in the nomenclature we analyze, and a 0 if it does not occur. For features with more than two values, we introduce a tag for each value to reduce inconsistencies among annotators. We provide details on how we describe each feature below.

Plaintext elements
We begin by analyzing and categorizing the different kinds of plaintext elements present in the nomenclature.We mention that, with the exception of the feature for named entities, all other features in this category are binary ones.
The first characteristic that we look into is whether or not the nomenclature contains any kind of named entities, which can be made up of one or more lexical units. Here we differentiate between three main categories, namely "people," "location" and "other," marked with "P," "L" and "O," respectively. By "people" we refer both to proper names and to titles that are specific enough to denote one singular specific entity (e.g. "The king of France" would be taken into account, but just "The king" would be too vague). "Location" denotes any geographical markers, mainly for settlements, such as names of cities, countries, geopolitical spaces etc., while the category of "Other" named entities encompasses a broad array of items, from bodies of water to political entities, or anything that does not fit within the scope of the first two categories. If no named entities are represented in the nomenclature, we mark this feature as 0.
The second item on our list is checking to see if the key contains any numbers. Under this category we count numerals in any form, cardinals or ordinals, whether they are represented as digits or by name (e.g. "1st," "first," "25," "centum," "mille"). Here we include even those keys that have a section for special markers for numbers, which occurs frequently in some Spanish keys (i.e. "Numeros"). If we encounter any of these cases, we mark this category as "1."
The next aspect we look into is whether or not the key contains any content words. By this we mean any word that carries semantic meaning and which can contribute to the meaning of the sentence it would be used in, such as nouns, verbs, adjectives, or adverbs. On the other hand, we also look at function words, which are lexical units that have very little semantic meaning in themselves, but which aid in providing grammatical context between the words of a sentence. Here we take into consideration, among others, pronouns, prepositions, or conjunctions.
Moving away from independent, self-sufficient lexical components, we look into whether or not our nomenclature encodes syllables, which are groups of two or more letters, out of which at least one is typically a vowel and which can be combined in order to build words. We therefore do not take into account clusters of double letters or other clusters which are part of the key alphabet and not the nomenclature itself, as specified in Sec. 2. Furthermore, another category of lexical units that contribute to word formation is that of morphological endings. We mark this category as "1" when there is a specific section for morphological endings, or they are clearly distinguishable within the key, for example as grammatical gender or number. We do not consider morphological endings to be present if the key just happens to have a syllable that could potentially also be used as a morphological ending.
Our feature set also includes phrases and sentences as potential components of a nomenclature. What these two categories have in common is the fact that both are clusters of two or more words that are not named entities. The difference here is that, while sentences are fully formed linguistic expressions that convey a complete thought, phrases are simply clusters of words that can carry contextual meaning but are not grammatically complete sentences.
Lastly, we check if any punctuation has been encoded in the nomenclature. Here we take into account punctuation represented in any form, either by the graphic sign or by the name (e.g. "." or "punctum").
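As an illustration of how the plaintext features described above could be recorded consistently, the following Python sketch stores one annotation per nomenclature. It is an assumption about a possible representation, not the format actually used by the annotators: the class name PlaintextFeatures, the field names, and the comma-separated tag string are all hypothetical. Named entities take a combination of the tags "P," "L" and "O" (or "0" if absent), and every other feature is binary.

```python
from dataclasses import dataclass

NAMED_ENTITY_TAGS = {"P", "L", "O"}  # people, location, other
BINARY_FEATURES = (
    "numbers", "content_words", "function_words", "syllables",
    "morphological_endings", "phrases", "sentences", "punctuation",
)


@dataclass
class PlaintextFeatures:
    """Hypothetical record for the plaintext level of one nomenclature."""
    named_entities: str = "0"  # "0" or a combination such as "P, L"
    numbers: int = 0
    content_words: int = 0
    function_words: int = 0
    syllables: int = 0
    morphological_endings: int = 0
    phrases: int = 0
    sentences: int = 0
    punctuation: int = 0

    def validate(self) -> None:
        """Raise ValueError if a value falls outside the annotation scheme."""
        if self.named_entities != "0":
            tags = [t.strip() for t in self.named_entities.split(",")]
            if len(set(tags)) != len(tags) or not set(tags) <= NAMED_ENTITY_TAGS:
                raise ValueError(f"invalid named entity tags: {self.named_entities!r}")
        for name in BINARY_FEATURES:
            if getattr(self, name) not in (0, 1):
                raise ValueError(f"{name} must be 0 or 1")


record = PlaintextFeatures(named_entities="P, L", numbers=1, syllables=1)
record.validate()  # passes silently for a well-formed record
```

Validating each record on entry is one way to reduce the kind of inconsistencies among annotators that the tag system is meant to prevent.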

Nomenclature ciphertext
We begin by analyzing the kinds of symbols used for encipherment. In terms of symbol set, we differentiate between "alphabet," "digits" and "graphic signs," represented as "A," "D" and "G," respectively. By "alphabet," we refer to those cases where the letters of the Latin alphabet are used as ciphertext. Any combination of two or more of these symbol sets is allowed as a valid representation for this feature. For example, if we encounter a nomenclature that uses an indexing system for encoding (i.e. a combination of letters and numbers, such as A1, … An, B1, … Bn), we mark "A, D" for symbol set.
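The symbol set feature can be sketched in code for keys that are already transcribed. The Python function below is a minimal sketch under a strong assumption that is ours and not the paper's: letters and digits are transcribed as ASCII characters, and any other non-space character stands for a graphic sign. The name symbol_set is hypothetical.

```python
def symbol_set(codes: list[str]) -> str:
    """Return the symbol set tags ("A", "D", "G") found in the given codes."""
    found = set()
    for code in codes:
        for ch in code:
            if ch.isascii() and ch.isalpha():
                found.add("A")  # Latin alphabet letter
            elif ch.isascii() and ch.isdigit():
                found.add("D")  # digit
            elif not ch.isspace():
                found.add("G")  # assumption: everything else is a graphic sign
    return ", ".join(tag for tag in "ADG" if tag in found)


print(symbol_set(["A1", "A2", "B1"]))  # an indexing system mixes letters and digits
```

Under this assumption, a transcription that renders letters with diacritics as non-ASCII characters would be counted as graphic signs, so the rule would need adjusting to the transcription conventions actually in use.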
In addition to the main symbol set, some nomenclatures make use of diacritics in order to alter and reuse already existing symbols. This is accounted for as a binary "0/1" feature in our analysis, where "1" signifies the presence of additional symbols used in a systematic manner in order to extend the symbol set. We note that if diacritics are only represented in the instructions for the nulls (e.g. "a code with a + on top becomes null"), we do not mark 1 for diacritics as we only analyze the nomenclatures.
Furthermore, we also assess the length of the codes used in the nomenclature. Here we distinguish between fixed, variable, or undefined length, represented as "F," "V" and "U," respectively. We say that a nomenclature is of fixed length when every single code employed in the encipherment is of the same length, such as a key that uses 3-digit codes exclusively in encoding the nomenclature. If there is any change in the length of the codes, we instead mark the nomenclature as "variable" in length. We count all graphic signs as being of length 1, and so a nomenclature that only uses graphic signs would be of fixed length. Lastly, the "undefined" parameter refers to those cases where entire words or phrases are used for encoding plaintext, as exemplified in Figure 8.
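The decision rule for code length can be written down as a short function. The sketch below is illustrative only: it assumes that each code has been transcribed as a sequence of symbols (so that a graphic sign counts as one symbol) and that the annotator supplies a flag saying whether words or phrases are used as codes. The names code_length and lexical are hypothetical.

```python
def code_length(codes: list[tuple[str, ...]], lexical: bool = False) -> str:
    """Classify code length as fixed ("F"), variable ("V") or undefined ("U").

    Each code is a tuple of symbols; `lexical` marks nomenclatures that
    use entire words or phrases as codes.
    """
    if lexical:
        return "U"
    if not codes:
        raise ValueError("at least one code is required")
    lengths = {len(code) for code in codes}
    return "F" if len(lengths) == 1 else "V"


# A nomenclature using 3-digit codes exclusively is of fixed length.
print(code_length([("1", "2", "3"), ("4", "5", "6")]))
```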
Another feature we look at when analyzing the ciphertext is the code type used for encryption. This feature can have one, or a mix, of the following values: "S" (simple substitution), "H" (homophonic substitution), and "P" (polyphonic substitution). In contrast to other features that accept several values irrespective of the order in which the annotator decides to mark them, this particular feature requires that the values be marked in order of their prevalence in the nomenclature. For example, a notation such as "S, H" would signify that the nomenclature uses mainly simple substitution, but it also contains some homophonic elements. On the other hand, if we see "H, S," this would mean that the nomenclature contains a significant amount of homophonic elements (approximately 30% or more). The same principle also applies if we have all three code types within one nomenclature, and so we would interpret "S, H, P" as a nomenclature that mainly uses simple substitution, but also has homophonic elements, and even fewer polyphonic ones. If the nomenclature encodes several variants of the same word under the same code, or even a word root along with several (semantically) different variants, as seen in Figure 9, we represent this as polyphonic substitution.
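To make the three code types concrete, the following Python sketch derives a code type label from a list of plaintext and code pairs. It is a simplification under stated assumptions and not the annotation procedure itself: a pair counts as polyphonic if its code is shared by several plaintext elements, as homophonic if its plaintext element has several codes, and as simple otherwise, and the tags are ordered purely by these counts. The approximate 30% guideline mentioned above is not modeled, and the name code_types is hypothetical.

```python
from collections import Counter


def code_types(pairs: list[tuple[str, str]]) -> str:
    """Label a nomenclature with "S", "H", "P" tags, most prevalent first.

    `pairs` lists unique (plaintext element, code) entries of the nomenclature.
    """
    codes_per_plaintext = Counter(plain for plain, _ in pairs)
    plaintexts_per_code = Counter(code for _, code in pairs)
    tally = Counter()
    for plain, code in pairs:
        if plaintexts_per_code[code] > 1:
            tally["P"] += 1  # one code stands for several plaintext elements
        elif codes_per_plaintext[plain] > 1:
            tally["H"] += 1  # one plaintext element has several codes
        else:
            tally["S"] += 1  # one-to-one substitution
    return ", ".join(tag for tag, _ in tally.most_common())


# Three simple entries and one element with two homophones.
print(code_types([("rex", "10"), ("dux", "11"), ("roma", "12"),
                  ("et", "20"), ("et", "21")]))
```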
Lastly, we have one final binary feature for the analysis of ciphertext, where we indicate the presence of lexical "cleartext" items, or metaphors, used for encoding. We mark this feature as "1" when we encounter the use of words or phrases used to encipher plaintext elements, similar to the example we give in Figure 8.

Encoding structure
Once we conclude the analysis of the plaintext and the nature of the ciphertext of the nomenclature, we proceed to investigate the encoding structure as well.
We begin the analysis of the encoding structure by first assessing the size of the nomenclature. We express this in intervals, according to the following list: 1-10, 11-20, 21-50, 51-100, 100+ items. Although some of the nomenclatures can even contain thousands of elements, we cannot account for intervals that would fall in the realm of thousands of entries. Due to the fact that our analysis was conducted manually, it would be rather difficult to accurately state which higher range these nomenclatures fall in, as a manual count of such a high volume of entries is prone to human error.
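The size intervals listed above translate directly into a small lookup. The Python sketch below is illustrative; the function name size_interval is hypothetical, and we assume that a nomenclature with exactly 100 entries belongs to the "51-100" interval, so that "100+" covers 101 entries and more.

```python
def size_interval(n_items: int) -> str:
    """Map the number of nomenclature entries to one of the size intervals."""
    if n_items < 1:
        raise ValueError("a nomenclature has at least one entry")
    for upper, label in ((10, "1-10"), (20, "11-20"), (50, "21-50"), (100, "51-100")):
        if n_items <= upper:
            return label
    return "100+"  # assumption: everything above 100 entries


print(size_interval(37))
```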
We then proceed to judge the visual layout of the nomenclature in terms of it being structured in different fields. We resort once again to the binary notation of "0" or "1" in order to establish if the nomenclatures contain sections or headings. By sections we refer to a visual segmentation in the body of the nomenclature, such as a gap between items starting with the letter "A" and items starting with the letter "B." Headings on the other hand are titles or explanations that let the reader know what the section contains, without the heading having any code assigned to itself (e.g. a capital "A" to mark the beginning of the section containing elements that start with the letter "A"). Since we do not consider nulls to be part of the nomenclature, we do not consider phrases such as "Nullas," "Nihil significantia" and the like to be headings in our analysis.
When analyzing the structure of the nomenclature, we also look at the plaintext and ciphertext arrangement. We assess these categories in terms of the orientation of the text on the page, which can be either horizontal, vertical or random (marked as "H," "V," and "R," respectively). This classification is not limited to only one of these possible values, as some nomenclatures switch their ordering from section to section (e.g. the syllables are arranged horizontally at the top of the document, while the rest of the nomenclature is organized vertically below said section).
In terms of plaintext arrangement, we also try to identify potential ordering systems. Here we differentiate between alphabetic ("A") and/or thematic ("T") ordering. If there is no identifiable ordering in the nomenclature, the feature value becomes "0." Whether we mark "A," "T," or both "A, T," that does not have to mean that the ordering system is perfect. It is common for nomenclatures not to follow an exact alphabetic ordering, either due to human error in creating the key or due to later additions to the nomenclature list, but our purpose is to mirror the general tendencies and the intention of the scribe.
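The idea of an imperfect but recognizable alphabetic ordering can be approximated in code, although the annotation itself relied on human judgment. The Python sketch below is one possible heuristic and not the criterion used in this study: it treats a list of plaintext entries as alphabetically ordered if only a small share of neighboring entries are out of order. The function name looks_alphabetic and the tolerance of 0.2 are hypothetical choices; thematic ordering is not covered because it requires semantic judgment.

```python
def looks_alphabetic(entries: list[str], tolerance: float = 0.2) -> bool:
    """Heuristic: True if at most `tolerance` of neighboring pairs are out of order."""
    keys = [entry.casefold() for entry in entries]
    if len(keys) < 2:
        return False  # too short to show any ordering
    out_of_order = sum(a > b for a, b in zip(keys, keys[1:]))
    return out_of_order / (len(keys) - 1) <= tolerance


# One late addition ("zona") does not hide the scribe's alphabetic intention.
print(looks_alphabetic(["amor", "bellum", "caesar", "dux", "zona", "exercitus", "fides"]))
```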

Investigating keys with nomenclatures
For the analysis of keys that include nomenclature elements, 226 of the 1,610 keys were excluded from the set, either because they contain only the key alphabet with no nomenclature elements, or because they are empty key templates, in the sense that they contain only the plaintext elements with no codes assigned to them.
The remaining 1,384 keys were annotated manually by the authors, an interdisciplinary research group with competence in cryptology, history, language technology, and linguistics. The annotation was carried out in a three-step procedure. First, we discussed possible features that we found to be relevant in our various areas in the analysis of keys. A feature set with attributes was compiled and preliminary definitions for the features were established (see Sec. 5). The feature set was then applied by each of us annotating the same key. On the basis of our findings, we revised the definitions of the features and their values. In a second step, groups of two researchers annotated the same 50 keys individually and compared their respective results. In total, 350 keys were annotated in this round, every researcher being part of two different groups. In this way we could combine our various areas of expertise pairwise. On the basis of the results and the problematic cases that came to the fore in the pair analyses, the group further developed the feature catalogue and refined the definitions of the individual features. Lastly, the remaining keys were allocated to the individual researchers and annotated individually. The annotations were collected in a shared spreadsheet and complicated cases were discussed in the group. Uncertain cases were marked in each round to be discussed, and were corrected according to the consensus in the end.
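The pairwise comparison in the second step amounts to listing, for every key annotated by both researchers, the features on which their values differ. A minimal sketch follows; the function name, the key identifiers, and the feature names are hypothetical and only illustrate the idea.

```python
def disagreements(first, second):
    """Return {key_id: [feature, ...]} where two annotators differ.

    `first` and `second` map a key identifier to a dict of feature values.
    Only keys annotated by both annotators are compared.
    """
    result = {}
    for key_id in sorted(first.keys() & second.keys()):
        a, b = first[key_id], second[key_id]
        differing = sorted(f for f in a.keys() | b.keys() if a.get(f) != b.get(f))
        if differing:
            result[key_id] = differing
    return result


# Made-up annotations of two keys by two researchers.
annotator_1 = {"key-1": {"sections": 1, "ordering": "A"},
               "key-2": {"sections": 0, "ordering": "T"}}
annotator_2 = {"key-1": {"sections": 1, "ordering": "A, T"},
               "key-2": {"sections": 0, "ordering": "T"}}
print(disagreements(annotator_1, annotator_2))  # {'key-1': ['ordering']}
```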

Nomenclatures
We present here the outcome of the annotation of the various features with regard to the nomenclature elements in the 1,384 keys, originating from the 15th to the 18th centuries.

The plaintext
Nomenclatures typically include named entities (NE), such as persons and locations (Megyesi et al. 2022). In the sample of this study, only 3.8% of the nomenclatures do not contain named entities. Figure 10 illustrates the proportion of named entities expressing locations, persons, and other named entities throughout the centuries. While nomenclatures include locations quite evenly across centuries, references to persons decrease over time and other named entities increase.
Looking at the usage of combinations of various types of NEs, shown in Figure 11, we can see that in the 15th century, encoding personal names alone or in combination with place names was most common, while in later periods, a diverse list of elements formed part of the named entities.
Even though 70% of the nomenclatures did not include any numbers listed as plaintext in keys, it is clear that encrypting numbers became more and more popular over time, see Figure 12.
Not surprisingly, content words commonly occur in nomenclatures and are quite evenly distributed over the centuries, as shown in Figure 13. Encoding function words, on the other hand, was popular in the 15th century but became less frequent in keys from the 16th century, to win popularity again later, in the 17th and 18th centuries, see Figure 14. Syllables were used in nomenclatures throughout the entire period, but became more frequent in the 17th century, when almost 70% of the keys contained syllables. The popularity of syllables reached its peak in the 18th century, when 90% of the keys included syllables in the nomenclatures, see Figure 15.
Encoding morphological endings was not common; such endings occur mostly in keys from the 17th century and partly also from the 18th century, as shown in Figure 16.
Phrases are not very frequent in nomenclatures but clearly occur in all time periods, most frequently in the 17th century, see Figure 17. The encoding of complete sentences is very rare: it could be found in only 21 keys (1.5%), and there is no clear chronological tendency.
Lastly, punctuation marks are rarely present in keys from the 15th century due to the lack of their usage in writing in general, but as they become more frequent in non-encrypted texts, they also increase in frequency in cipher keys, see Figure 18.

The ciphertext
In this section, we examine how the plaintext elements were encoded and what code structure was used. First, we look at the symbol types and how they are composed in the nomenclatures. As in the keys as a whole, we find a large number of graphic signs and Latin letters in the nomenclatures from the 15th century, while digits became more common later in the 16th and 17th centuries, to become standard in the 18th century, as illustrated in Figure 19. The use of combinations of various symbol types in nomenclatures was also common; graphic signs with Latin letters dominate in the 15th century, other combinations are more evenly distributed in the 16th century, while digits with Latin letters are more frequent in the 17th century, as shown in Figure 20.
Instead of introducing new symbols, diacritics could be used to distinguish between codes (see Figure 21). Diacritics appear in 40% of the nomenclatures from the 15th century to separate codes, but their usage became less frequent over time as digits became more popular for encoding. Diacritics occur mostly with graphic signs (G), followed by alphabetical letters (A), and are less likely to occur with digits (D), see Figures 22 and 23.
Code words could be another way to encrypt plaintext entities. Typically, person names and locations could be encoded by another name, often a metaphor for a physical person. Like diacritics, metaphors were most common in the 15th and 16th centuries and were rarely used later, as shown in Figure 24.
The code length in nomenclatures (whether it is fixed or varies depending on the type of nomenclature) differs across centuries, see Figures 25 and 26. While the more advanced, varied code length was most frequent throughout all centuries in general, the more easily broken, fixed length codes were also applied, mostly in the 16th and 17th centuries. This is in line with the findings by Lasry, Megyesi, and Kopal (2021) that the papal ciphers from the 15th century were more advanced and became less sophisticated in the 16th and 17th centuries.
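The distinction between fixed and varied code length, and the reason why fixed length codes are easier to break, can be illustrated with a short sketch. The function names and the sample codes are assumptions made for illustration only: once all codes are known to share one length, a continuously written ciphertext can be cut into codes mechanically.

```python
def code_length_type(codes):
    """Return "fixed" if all codes have the same length, else "variable"."""
    lengths = {len(code) for code in codes}
    if not lengths:
        raise ValueError("no codes given")
    return "fixed" if len(lengths) == 1 else "variable"


def segment_fixed(ciphertext, length):
    """Split a continuous ciphertext into codes of one known length."""
    if length <= 0 or len(ciphertext) % length:
        raise ValueError("ciphertext is not a whole number of codes")
    return [ciphertext[i:i + length] for i in range(0, len(ciphertext), length)]


print(code_length_type(["12", "34", "56"]))   # fixed
print(code_length_type(["1", "23", "456"]))   # variable
print(segment_fixed("123456", 2))             # ['12', '34', '56']
```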
Lastly, the occurrence of code types (whether the nomenclature uses simple, homophonic or polyphonic codes) shows great variation over time, see Figure 27. The dominant code type in the nomenclatures was simple substitution in general, but homophonic and polyphonic codes became more frequent later at the expense of simple substitution. The assignment of several codes to nomenclature elements to create homophonic substitution is not surprising, as it makes decryption more difficult. However, the increased usage of polyphonic codes, where several plaintext elements could share the same code, is rather surprising. This could be explained by the increasing size of the nomenclatures: the constructor of the key had difficulties keeping the codes in mind and simply made mistakes.
The entire list of code types with all possible combinations can be found in Figure 28. In the 15th century, simple substitution was standard and we found only a few cases with homophonic or polyphonic codes in the nomenclatures. In the 16th and 17th centuries, these combinations increased at the cost of simple substitution.
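The three code types can be stated operationally. The sketch below is an illustration under our own assumptions (the function name, the data layout, and the sample entries are hypothetical, and the annotation itself was done manually): it derives the set of code types present in a nomenclature from a mapping of plaintext elements to their codes.

```python
from collections import defaultdict


def code_types(key):
    """Return the set of code types present in `key`.

    `key` maps each plaintext element to the list of codes assigned to it.
    "S": an element with exactly one code that no other element shares.
    "H": an element with several codes (homophonic).
    "P": a code shared by several plaintext elements (polyphonic).
    """
    owners = defaultdict(set)          # code -> elements that use it
    for element, codes in key.items():
        for code in codes:
            owners[code].add(element)

    found = set()
    for element, codes in key.items():
        unique_codes = set(codes)
        if len(unique_codes) > 1:
            found.add("H")
        if any(len(owners[c]) > 1 for c in unique_codes):
            found.add("P")
        elif len(unique_codes) == 1:
            found.add("S")
    return found


# Made-up nomenclature entries for illustration.
sample = {"Papa": ["12"], "Rex": ["34", "35"], "pax": ["50"], "bellum": ["50"]}
print(sorted(code_types(sample)))  # ['H', 'P', 'S']
```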

The encoding structure
The size of the nomenclature in keys increased successively over time: nomenclatures of over 100 elements became more and more common, to finally become standard in the 17th and 18th centuries, as illustrated in Figure 29. In the 15th century, nomenclatures of between 21 and 100 elements were typically most frequent, and the lists greatly increased in size thereafter. Rather surprisingly, small nomenclatures of fewer than 10 elements rarely occurred overall.
The content of the nomenclatures could be organized either alphabetically, listing the plaintext in alphabetical order, or thematically, grouping persons, geographic areas, and other entities, or in a combination of the two. In Figure 30, we show the distribution of content arrangement types in nomenclatures over time. With larger nomenclature size, more order is needed to find the plaintext elements for encryption and decryption. This is indeed the case: nomenclatures without any order decrease over time, as shown in the dark blue bars. Also, small nomenclature lists from the 15th century were ordered thematically to a greater extent than alphabetically, while larger lists tended to be ordered alphabetically in combination with themes.
The nomenclature lists oftentimes contained sections, see Figure 31, mostly without separate headings, see Figure 32. However, the long nomenclatures which became popular in the 18th century also contained separate headings, making it easier to find the various elements.
The plaintext could be arranged vertically (V), horizontally (H), or randomly (R). Figures 33 and 34 illustrate the content arrangement over time. In the 15th century, we find all types, with some dominance of randomly ordered, shorter nomenclatures, followed by horizontally ordered lists. In the 16th century, randomly ordered plaintexts were replaced by vertically ordered lists at the expense of horizontally ordered plaintext elements. In the 17th and 18th centuries, most of the nomenclature lists were ordered vertically. The arrangement of the ciphertexts, or codes, see Figures 35 and 36, seems to follow the arrangement of the plaintext to a great extent. This means that if the nomenclature elements are randomly ordered, the codes also tend to be randomly assigned; if the plaintext is vertically ordered, the digit codes are also assigned vertically; and if the plaintext is horizontally ordered, the digit codes also follow horizontal order, often in increasing number sequences.

Tendencies
The final step of our study is to see how various features correlate with each other over time. We investigate how the size of the nomenclatures relates to certain aspects of the keys. We also look at the linguistic complexity of the keys and regional tendencies concerning various feature types. Lastly, we discuss some tendencies with respect to security issues.

Nomenclature size and structure
We concluded in Sec. 7.3 that the size of nomenclatures increased over time, see Figure 29. It turns out that we can find a clear correlation between nomenclature size and code types, including simple, polyphonic, and homophonic substitutions (S, P, and H, respectively), as shown in Figure 37 for the individual code types, and in Figure 38 in combinations. Simple substitution dominates in the nomenclature lists throughout all centuries, but encoding the same plaintext by several codes (H) increases as the size of the nomenclature grows. Interestingly, polyphonic codes also increase with nomenclatures of 100 items or more, probably unintentionally; it is more difficult to keep track of used codes in long lists and mistakes may occur.
The arrangement of the elements also changes with the increasing size of the nomenclatures. While shorter lists are mostly arranged randomly and vertically, more structure is required in longer lists, preferably with vertical order column-wise. Horizontal arrangement of the plaintext most commonly occurs with lists of size 10-100, see Figures 39 and 40. The longer the list of elements in the nomenclature, the more likely it is that the list is arranged in sections. Section headings, sometimes with and sometimes without codes, might also occur to make searching in the nomenclature lists easier, see Figures 41 and 42.

Linguistic complexity
The nomenclatures contain more or less complex linguistic entities. In all nomenclature lists, we expect to have nouns, and among them personal names and place names in the first place. However, nomenclatures also encode various other linguistic features, such as syllables, function words, or certain morphemes. We can even find specific codes assigned for grammatical morpheme types such as number (singular and plural) or case (nominative, accusative, dative). The longer the nomenclature, the more abstract linguistic elements we expect the nomenclature to contain. As illustrated in Figures 43 and 44, the number of keys including content words increases with the size of the nomenclature list, and the same is true for the keys with function words. Syllables are also listed in nomenclatures, above all in nomenclatures of size 50 or bigger, see Figure 45. In nomenclatures containing 100 elements or more, we find syllables in as many as 77%. Another interesting observation is that there is a hierarchy in the occurrence of the different linguistic elements. The more abstract elements such as grammatical morphemes are found clearly less frequently than the more concrete elements such as named entities. As has been shown in Sec. 5.1, named entities occur in 96.4% of all the keys with nomenclatures. The frequency of occurrence decreases the more abstract a linguistic entity is, see Figure 46.
This hierarchy of elements also comes to the fore in the co-occurrence of different linguistic elements in the same key. If more abstract entities such as grammatical morphemes are encoded, the same nomenclature is likely to contain more concrete linguistic elements as well. For instance, 95% of the keys with morphological endings also contain syllables and 97% of the keys that encode function words also list codes for content words. Moreover, only 1.4% of the keys including content words do not contain named entities. Conversely, it is not possible to assume that more abstract linguistic entities are also present when more concrete entities can be found in a key. For instance, in our data set only 8.3% of all the nomenclatures with named entities likewise contain content words. However, it has to be mentioned that the rate of co-occurrence for syllables and function words deviates somewhat from the linguistic hierarchy pattern shown above, as only 85% of the keys with syllables also list function words. Hence, it is possible that syllables occur in historical cipher keys even if other linguistic entities are absent.
Considering this hierarchy, it can be stated that the keys with the greatest diversity of linguistic elements and the highest degree of linguistic complexity are the ones that encode morphological endings. Such keys become more common and gain a wider geographic distribution in the 17th and 18th centuries, see Figure 47. In our data set, they occur as early as the 15th and 16th centuries only in Austria.
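The co-occurrence rates reported above are conditional shares: among the keys that contain one type of element, the proportion that also contains another. The following sketch shows the computation on a made-up toy sample; the function name, the feature labels, and the data are assumptions for illustration and are not taken from our data set.

```python
def cooccurrence_rate(keys, given, also):
    """Share of keys containing `given` that also contain `also`.

    `keys` is an iterable of sets of feature names; returns None when no
    key contains `given`.
    """
    with_given = [features for features in keys if given in features]
    if not with_given:
        return None
    return sum(also in features for features in with_given) / len(with_given)


# Invented toy sample: each set lists the element types found in one key.
toy_keys = [
    {"NE"},
    {"NE", "content"},
    {"NE", "content", "function"},
    {"NE", "content", "function", "syllables", "endings"},
]
print(cooccurrence_rate(toy_keys, "endings", "syllables"))  # 1.0
print(cooccurrence_rate(toy_keys, "NE", "endings"))         # 0.25
```

The two calls illustrate the asymmetry of the hierarchy: the presence of an abstract element predicts the presence of more concrete ones, but not the other way around.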

Regional tendencies
In the previous sections, we focused on the content and the structure of keys and how they developed over time. Here, we investigate possible similarities and differences across regions. Unfortunately, in many cases we have no information about the actual usage of the keys and how they were applied in practice in various regions by the users. However, we have detailed and reliable metadata about the whereabouts of the keys, i.e. the holder countries, which can shed some light on the regional differences, assuming that it is highly probable that the cipher keys are stored in the archives of the countries/regions where they were created or used.
We look for correlations across holder countries in the same set as described in the previous sections, with the exception of countries with a small sample of keys. Hence, we do not consider keys from the collections that are stored in France, Spain, and the Netherlands, as they contained fewer than 10 keys each.
Starting with the symbol set per area in Figure 48, several tendencies can be pointed out. It is in the Italian city states and in the Vatican that we see the largest quantity of purely alphabetic ciphers, which is due to the fact that these areas preserved the richest collections from the early epoch of cryptography in Europe, i.e. the 15th century, when short tables consisting of alphabets were used and digits were not yet widespread. The fairly uniform distribution in the case of Vienna indicates that the Austrian material is the most representative throughout the ages, and it shows well the technical evolution going through all possible variants. The dominance of digits and of the alphabet-digit combination in Germany and Hungary can be explained by the fact that the bulk of these collections is from the late 17th and 18th centuries, when purely alphabetical ciphers, being utterly impractical in large tables, were outdated.
The size of the nomenclature varies across the regions, as illustrated in Figure 49. Longer lists of 50 elements or more dominate in general. However, there are some regional differences: Italy shows fewer keys with more than 100 elements, while keys kept in Belgium tend to be longer. It is difficult to draw any conclusion about the regions as our sample is not balanced in terms of time period and region.
Looking at the distribution of the usage of various code lengths across regions, see Figure 50, we observe a relatively even geographical distribution of fixed and variable length codes. Codes of various lengths in the same key dominated in all regions with the exception of Italy, where a surprising peak of fixed length codes appears. The explanation can be found in our unbalanced sample; all the 16 Italian cipher keys included in our material originate from the 15th century. Interestingly, keys from the same area but a different region, namely the Vatican City state, show the opposite tendency. The keys from the Vatican City, also originating mainly from the 15th century, show variable length codes as the dominant code length.
Concerning the code types across regions, simple substitution clearly dominates in all regions, as shown in Figure 51. The most commonly used code type combined with simple substitution was homophonic substitution across all regions with the exception of the Vatican. As we mentioned before, the keys from the Vatican state use purely simple substitution codes. But instead of making decipherment more difficult for an attacker by varying the code types used, the Vatican chose to vary the lengths of the codes within a key to make the cipher more difficult to break. The cryptanalyst had to start with the troublesome segmentation of the code sequences before trying to find the corresponding plaintext for each code. In very few cases, mostly in longer nomenclatures, polyphonic codes also appear, especially in Austria and the UK.
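To give a rough idea of why varying the code length makes segmentation troublesome, the sketch below counts the ways in which a continuously written ciphertext of a given length could be cut into codes when only the possible code lengths are known. It is an illustration under simplifying assumptions (the function name is ours, and every substring of an allowed length is treated as a potential code, so the count is an upper bound for any concrete key).

```python
def count_segmentations(ciphertext_length, code_lengths):
    """Count the ways to cut a continuous string into codes of allowed lengths."""
    allowed = set(code_lengths)
    ways = [0] * (ciphertext_length + 1)
    ways[0] = 1  # the empty prefix has exactly one segmentation
    for end in range(1, ciphertext_length + 1):
        # A segmentation of the first `end` symbols ends with a code of length n.
        ways[end] = sum(ways[end - n] for n in allowed if 0 < n <= end)
    return ways[ciphertext_length]


print(count_segmentations(12, {2}))        # 1
print(count_segmentations(12, {1, 2}))     # 233
print(count_segmentations(12, {1, 2, 3}))  # 927
```

With a single code length there is exactly one way to segment twelve symbols, whereas allowing codes of one to three symbols already yields several hundred candidate segmentations.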

Security tendencies
The results presented throughout the sections clearly indicate that keys became increasingly secure over time, as shown by the symbol set used for encoding, the code length, the code types used in the key, the size of the lists of plaintext elements that are part of the nomenclature, and the types of plaintext that are chosen to be encoded. We list some tendencies with regard to the named features below.
Symbols: A visible indication of more secure (and easier to use) ciphers was the decline of the use of graphic and alphabetic symbols in favor of the use of digits to construct the code elements. Letters in the Latin and Greek alphabets, digits, graphic signs and diacritics in various combinations were replaced by digit-based codes allowing a large number of combinations of codes for encryption. The digit-based codes became more difficult to segment, and spaces marking word boundaries were replaced by scriptio continua (continuous script), making decryption more difficult. In the 15th century 16% of the nomenclatures were constructed by using digits, while in the 18th century nearly all nomenclatures (90%) were using digits only. The choice of the seemingly simple set of symbols, the 10 digits (0-9), allowed for a large number of code combinations in a flexible way. With three-digit codes, for example, we can create a thousand nomenclature elements, ranging from "000" to "999." The digit-based coding system could clearly lead to the development of large nomenclature tables which eventually became code books in the 19th and the beginning of the 20th century.
18th century 23% of the tables contain homophones and 17% contain polyphones, see Figure 27. Surprisingly, we have not found any examples of polyalphabetic keys in our manually annotated sample. The reason might be the fact that the advanced polyalphabetic systems were difficult to apply in practice.
Nomenclature size: The increasing security of keys can be seen by the growing size of the nomenclature tables over the centuries, as depicted in Figure 29. While in the 15th century the vast majority (80%) of the keys contained nomenclatures with less than 100 elements (and about 50% even fewer than 50 elements), in the 18th century more than 80% of the nomenclature tables were larger than 100 elements. The larger nomenclature tables became codebooks over time, oftentimes with two separate parts, one for encryption and another for decryption.
Plaintext: The increasing size of the nomenclatures over time can be correlated with more complex linguistic entities. Starting by listing a few personal and place names in the 15th century, the lists grew beyond nouns where verbs, adjectives, function words, phrases, as well as morphological entities and syllables could be encrypted with specific codes. For example, while in the 15th century only a small number (25%) of tables contained syllables, we find them in nearly all keys (90%) from the 18th century. The introduction of more complex plaintext elements in the nomenclature tables allowed for the construction of more sophisticated and secure ciphers.
Our study clearly indicates that the average security trend in Europe led to higher security standards of the keys from the 15th to the 18th century. There are, however, local, regional tendencies such as the sophisticated cipher keys from the 15th century used by the Vatican in the papal correspondence, which is in line with the findings of Lasry, Megyesi, and Kopal (2021). Further studies are needed to investigate regional differences in more detail. However, we can conclude that the more advanced key structures allowed the cipher clerks more varied encryption and smarter combinations of choices, thereby making cryptanalysis more difficult.
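The three-digit example given under Symbols can be verified by enumeration: ten digits in three positions yield 10^3 = 1,000 distinct codes. The generator below is a small sketch (the function name is an assumption for illustration).

```python
from itertools import product


def digit_codes(length):
    """Yield every code of `length` decimal digits, from 00..0 to 99..9."""
    for digits in product("0123456789", repeat=length):
        yield "".join(digits)


codes = list(digit_codes(3))
print(len(codes), codes[0], codes[-1])  # 1000 000 999
```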

Discussion
The analysis of over 1,600 cipher keys from 10 countries and four centuries showed some clear development of keys over time, as well as some regional differences. However, we would like to emphasize that the data analyzed in this paper is limited in various respects. An important issue is the uneven geographical distribution of the data sample. The reason for this distribution lies in the collection process of cipher documents within the DECRYPT project, which began with visits to the State Archive in Vienna and to the Vatican Secret Archives and the main library of the Vatican. Due to the coronavirus pandemic in 2020-2021, data collection on site had to be stopped and we relied on already digitized material available online, mostly at the websites of national libraries. We are also very grateful to the people who generously share their private collections with us or point us to specific documents in archives all over the world, and to the archivists and librarians who open the virtual doors to their treasury, being our hands and eyes, and who kindly digitize and send us the documents. It is nice to see the increasing generosity to share data, without which large-scale empirical studies, such as the one presented here, would not be possible. Our efforts to collect more ciphers and keys to enrich and update the DECODE database will continue. In the near future, we will include more material from the Netherlands and the UK, as well as from various archives in Italian medieval cities and Eastern Europe.
Once we have further increased the collection, we aim to study the regional tendencies century by century from the various aspects that we presented in the paper with respect to the plaintext, the code and the encoding structure, with some modifications. We realize that the size of the nomenclature could have been categorized in more detail, above all allowing for more granularity over 100 elements. Categories of 500 and 1,000 elements would further increase the quality of the study. The reason for stopping at 100 elements is the cumbersome work of manual counting; we aimed to avoid unnecessary errors. Adding the number of unique codes in keys could be an additional useful feature indicating the size of the nomenclature more precisely.
We believe that automatic document analysis provided by the image processing community could help to automate the description of the key structure and the further examination of the arrangements of different parts in the keys. A tool could help in recognizing the various parts of the key: the oftentimes horizontally listed alphabet, and the list of syllables or codes for nullities, along with the columns of the nomenclatures with the coded elements as lines.
Analyzing a large number of keys sheds light on the development of cryptographic knowledge over the centuries. However, if we do not investigate how the keys were applied, we cannot say much about their usage in practice. We know from the set of keys for which we have available ciphertext(s) that keys were not necessarily applied as they were intended by their creators. For example, the full homophonicity was not applied; only a few homophones were used instead of the full range, making the cipher less secure against cryptanalysis. An investigation of the actual usage of the keys would further increase our knowledge of historical cryptology, which, again, would need a large sample of ciphertexts and their original keys. This is not an easy endeavor given that these are typically not stored together.
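Where a key and its ciphertexts are both available, the degree to which the homophones of a key were actually used could be quantified along the following lines. This is a sketch under our own assumptions: the function name and the sample data are hypothetical, and the ciphertext is taken to be already segmented into codes.

```python
def homophone_usage(key, observed_codes):
    """Per element, the fraction of its assigned codes seen in ciphertext.

    `key` maps plaintext elements to their list of codes; `observed_codes`
    is the sequence of codes found in the segmented ciphertext(s).
    """
    seen = set(observed_codes)
    return {
        element: len(set(codes) & seen) / len(set(codes))
        for element, codes in key.items()
        if codes
    }


# Made-up key with four homophones for "e", of which only two were used.
sample_key = {"e": ["10", "11", "12", "13"], "a": ["20"]}
print(homophone_usage(sample_key, ["10", "20", "10", "11"]))
# {'e': 0.5, 'a': 1.0}
```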

Conclusion
In this study, we aimed at shedding some light on the development of cryptography from the 15th century to the 18th century in Europe. We analyzed over 1,600 cipher keys from 10 European countries. We investigated the structure of nomenclatures in over 1,300 keys to find out what was encoded and how. Given the importance of the nomenclatures in keys, we focused mainly on their structure, including the plaintext, the codes and the encoding structure, and studied their development over time in various regions in Europe. With regard to the plaintext, we found that nomenclatures typically contained named entities, in particular personal and place names. The nomenclatures grew in size over time and included a more diverse set of words beyond nouns, such as adjectives, prepositions, and verbs, or even parts of words, such as syllables and morphemes. Interestingly, phrases, sentences and punctuation marks were typically not part of the nomenclatures in historical keys.
We investigated the ciphertext and studied the codes in nomenclatures. We found that while we can see a large variation in the usage of various symbol sets (alphabets, digits and graphic signs in combination with diacritics) to encode words in earlier time periods, digits alone became the standard in later centuries. Similarly, coding words with other words, such as metaphors, appears in earlier centuries, but these types of codes are absent in the nomenclature tables from later time periods. The length of codes could typically vary in the nomenclatures in later centuries, while fixed length codes were more typical in earlier time periods, with some regional exceptions. The nomenclature elements could be encoded with simple, homophonic or even polyphonic codes; even though simple substitution with one-to-one mapping of nomenclature element and code was dominant throughout all centuries, the more advanced homophonic and polyphonic codes became more frequent only later.
The size of the nomenclature successively increased over time, and lists longer than 100 elements finally became standard in later centuries. It is worth noting that smaller lists of fewer than 10 elements rarely occurred at all. As the nomenclatures grew, they had to become more structured, as more ordering of the elements was needed. Small nomenclature lists from the 15th century were typically ordered thematically, listed horizontally or vertically, while longer lists were arranged alphabetically column-wise line by line in combination with themes divided into sections, oftentimes with headings.
We also showed a correlation between the ordering of elements and the ordering of the code digits; when elements were randomly ordered, the codes also tended to be randomly assigned, and if the plaintext was vertically ordered, the digit codes were also assigned vertically.
Finally, we looked into some regional differences with respect to the holder countries where the keys are currently kept. We could observe some interesting tendencies. Above all, we could conclude that the keys became more complex and more secure against attacks over time with respect to their usage of symbol set, code length, code type, the nomenclature size and the linguistic complexity. However, our data sample is unbalanced and rather opportunistic (we took what we could find); further empirical studies would be needed on a larger, more balanced data set with respect to various regions and countries.

Anna Lehofer is an economist in regional and urban planning who is writing her PhD on cryptology at Budapest University of Technology and Economics (Hungary). She is interested in historical cryptology and her research focuses on decryption methodologies for early modern ciphers, especially on hierarchical clustering. In the DECRYPT project she takes part in archival collection and historical analysis.
Karl de Leeuw is an intelligence historian and wrote his PhD on the History of Cryptology in the Netherlands at the University of Amsterdam. He published extensively about this subject in journals, such as Cryptologia (1993, 1995, 2001, 2003, 2013, 2015), The Historical Journal (1999), Diplomacy & Statecraft (1999), Intelligence & National Security (2015), Yearbook of the Grimmelshausengesellschaft (2014) and ISIS (2019). He acted as editor of the History of Information Security: A Comprehensive Handbook (2007). Karl de Leeuw passed away on July 14, 2022.
Michelle Waldispühl is an associate professor in German linguistics and language education at the Department of Languages and Literatures, University of Gothenburg, Sweden. She is a historical linguist specialized in philology, Germanic language history, and the linguistics of writing. Her current research interests include spelling variation, historical sociolinguistics and multilingualism with a particular focus on onomastics, runic studies, and cryptography.

Figure 3. Distribution of keys over centuries and regions.

Figure 4. Distribution of symbol types over centuries.

Figure 5. Distribution of symbol sets over centuries.

Figure 6. Distribution of plaintext languages over centuries.

Figure 7. Distribution of cleartext languages over centuries.

Figure 37. Code types by size.

Figure 38. Code type combinations by size.

Figure 43. Content words by size.

Figure 44. Function words by size.

Figure 47. Occurrence of linguistically most diverse keys in different regions and time periods.

Figure 50. Code length by region.

Figure 51. Code types by region.
The Netherlands, The Hague, Nationaal Archief (National Archives)
UK, Kew, The National Archives
Vatican City State, Vatican City, Biblioteca Apostolica Vaticana
Vatican City State, Vatican City, Archivio Apostolico Vaticano