Effects of Language Ontology on Transatlantic Automatic Speech Understanding Research Collaboration in the Air Traffic Management Domain

Abstract: As researchers around the globe develop applications for the use of Automatic Speech Recognition and Understanding (ASRU) in the Air Traffic Management (ATM) domain, Air Traffic Control (ATC) language ontologies will play a critical role in enabling research collaboration. The MITRE Corporation (MITRE) and the German Aerospace Center (DLR), having independently developed ATC language ontologies for specific applications, recently compared these ontologies to identify opportunities for improvement and harmonization. This paper extends the topic in two ways. First, this paper describes the specific ways in which ontologies facilitate the sharing of and collaboration on data, models, algorithms, metrics, and applications in the ATM domain. Second, this paper provides a comparative analysis of word frequencies in ATC speech in the United States and Europe to illustrate that, whereas methods and tools for evaluating ASRU applications can be shared across researchers, the specific models would not work well between regions due to differences in the underlying corpus data.


Broad Context of the Study
For more than a decade, researchers in the United States and Europe have been developing and proving the benefit of Automatic Speech Recognition (ASR) applications in the Air Traffic Management (ATM) domain. In the United States, in support of the Federal Aviation Administration (FAA), the MITRE Corporation (MITRE) has developed capabilities to use Air Traffic Controller (ATCo)-pilot voice communication information for operational purposes, such as notifying ATCos of unsafe situations or analyzing operations to identify opportunities for safety or efficiency improvements. In Europe, as part of the Single European Sky ATM Research (SESAR) program, the German Aerospace Center (DLR) has led the development and testing of prototypical applications to enhance ATCo automation interactions, reduce ATCo workload, and identify safety issues in real time. Both MITRE [1] and DLR [2] have investigated the potential for automatic detection of readback errors, which are pilot errors in reading back ATCo instructions.
Key to most applications of ASR is the semantic meaning of the words spoken and transcribed, specifically in the context of the application in which the information will be used. Thus, we use the term Automatic Speech Recognition and Understanding (ASRU) to describe the speech-to-text and the text-to-meaning processes as one. ASRU for the Air Traffic Control (ATC) domain needs to transcribe domain-specific words and phrases and then interpret their ATC meaning. For example, "lufthansa three twenty one one seventy knots until four contact tower eighteen four five" needs to be understood to capture the flight's callsign (DLH321) and the instructions it received (speed 170 knots until four miles from the runway; contact the tower on this radio frequency 118.450).
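The callsign portion of this text-to-meaning step can be sketched in a few lines of code. The airline designator table, number-word maps, and the three-digit flight-number assumption below are illustrative simplifications of ours, not the actual MITRE or DLR logic:

```python
# Hypothetical sketch of mapping a spoken callsign to ICAO form.
AIRLINE_DESIGNATORS = {"lufthansa": "DLH", "united": "UAL", "speedbird": "BAW"}
DIGITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
          "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def extract_callsign(words):
    """Map e.g. 'lufthansa three twenty one ...' to 'DLH321'.

    Assumes a known airline word followed by at most three digits
    (an illustrative simplification)."""
    airline = AIRLINE_DESIGNATORS.get(words[0])
    if airline is None:
        return None
    digits = ""
    i = 1
    while i < len(words) and len(digits) < 3:
        w = words[i]
        if w in TENS:
            value = TENS[w]
            # "twenty one" -> 21: fold a trailing units digit into the tens word
            if i + 1 < len(words) and words[i + 1] in DIGITS:
                value += DIGITS[words[i + 1]]
                i += 1
            digits += str(value)
        elif w in DIGITS:
            digits += str(DIGITS[w])
        else:
            break
        i += 1
    return airline + digits if digits else None

print(extract_callsign(
    "lufthansa three twenty one one seventy knots until four "
    "contact tower eighteen four five".split()))  # DLH321
```

A real system would additionally use surveillance context (the set of callsigns active on the frequency) to disambiguate, as discussed later in this paper.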
To represent the information contained in the speech, both the words and their semantic meaning in the ATC context, MITRE and European stakeholders, led by DLR, independently developed ATC language ontologies in support of ATM application development. A common ontology, used in both Europe and the US, could enable better sharing and reuse of data, models, algorithms, and software between the US and Europe.
In a recent paper [3], we described our collaboration to compare ontologies and identify opportunities for improvement and harmonization. This paper expands on that topic to discuss the impact of the ontology on future research and development collaboration, describing several ways that an ATC ontology is critical to facilitating collaboration between researchers and to appropriately evaluating ASRU applications in the ATM domain. This paper also examines the word-level differences between United States and European ATC speech to provide quantitative understanding of the corpus data that feed the ASRU models, informing their potential cross-use between regions. The analysis shows that whereas the methods and tools for developing and measuring ASRU performance can be shared across regions (e.g., between the US and Europe), the specific models built for the different regions would likely not work well across regions.

Structure of the Paper
This paper expands on the ontology study described in [3]. The remainder of the paper is organized as follows. Section 1.3 summarizes the uses of ASRU in ATC to date. Section 1.4 lays out the levels of an ontology in the context of the ATC domain. Section 2 presents two different concrete instantiations of ATC ontology and recalls examples presented in [3] that illustrate representations of ATC semantics using these ontologies. Section 3 describes the value of ATC ontology in facilitating collaboration between research groups and presents specific applications and the semantic representations they rely on. Section 4 presents a quantitative comparison of ATC speech at the word level between the United States and Europe. Finally, Section 5 presents our conclusions and next steps.

Background
Voice communications are an essential part of ATC because they are the primary means of communicating intention, situation awareness, and environmental context. Over the last decade, researchers have invested tremendous effort into advancing the accuracy and sophistication of in-domain ASR and Natural Language Understanding (NLU) capabilities to enable human-machine teaming that improves aviation safety and efficiency [4].
Early applications of ASR and NLU focused on simulation pilots for high-fidelity controller training simulators because these applications were in controlled environments with well-defined phraseology and a limited set of speakers [5][6][7]. Other examples of replacing pseudo-pilots in training environments come from the FAA [8,9], DLR [10], and DFS [11]. Later applications in lab settings expanded to simulation pilots for human-in-the-loop simulations in ATM research measuring workload [12]. With the adoption of electronic flight strips in ATC facilities, Helmke et al. [13] applied ASRU to demonstrate the effectiveness of speech assistants in reducing controller workload and improving efficiency. Prototypes demonstrating the use of ASRU to enhance safety in live operations also emerged. ASRU can support the detection of anomalous trajectories [14]. It can also support the detection of closed runway operations and wrong surface operations in the tower domain [15]. The efficacy of using ASRU to automatically detect readback discrepancies was analyzed in the US [1] and in Europe [2]. A safety monitoring framework that applied ASR and deep learning to flight conformance monitoring and conflict detection has been proposed by [16]. The growing prevalence of uncrewed aerial vehicles has also led to interest in ASRU applications for that domain.

An example of an event at the semantic level is a landing clearance, represented in an ontology as CTL RWY 27L or RW27L CLEARED_TO_LAND. Later in this paper, we discuss how these semantics are represented in the European and MITRE ontologies.
The conceptual level deals with a higher level of understanding that goes beyond the semantic level. It captures the bigger picture, which in the ATC domain can be bigger than the sum of the individual radio transmissions. An example of an event at the conceptual level is the concept of an aircraft being in the arrival phase of flight. For some applications, this is more important than knowing the particular set of altitude and speed reductions an ATCo issued. Another example is the speech associated with a go-around, which might involve a back-and-forth discussion between an ATCo and pilot followed by a series of ATCo instructions.
In this paper, the ontology instantiations we describe primarily address the lexical and semantic level described above. However, we believe ontologies can and should expand to cover any information that is relevant to the application using language interpretation.

A Comparison of Two ATC Ontologies
This section recaps the comparison of US and European ATC ontology instantiations described in [3].

Lexical Level
At the lexical level, MITRE's ontology specifies that both speech and non-speech sounds during ATC radio communications should be captured in the transcription. Furthermore, the transcription should closely represent the sounds present in the audio without additional annotation or meaning inference. This means speaker hesitation sounds such as "um" and "uh", partially spoken words, foreign words such as "bonjour" or "ciao", and initialisms such as "n d b" and "i l s" are transcribed as they sound. These rules were based on best practices in automatic speech recognition training corpus creation.
The European ontology at the lexical level requires that both speech and non-speech sounds be annotated in the transcription. Special annotation is associated with non-English words spoken in a radio transmission to indicate non-English content. Domain-specific acronyms and initialisms such as "NDB" and "ILS" are transcribed as words in the vocabulary. Special handling is associated with domain-specific synonyms such as "nine" and "niner", which are transcribed to a single lexical representation, "nine". Both ontologies restrict themselves to the standard 26 letters of the English alphabet, i.e., "a" to "z" in lower- and uppercase form. Diacritical marks such as the umlaut "ä" in German or the acute accent "é" in French are not supported.
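As an illustration of these lexical rules, the following sketch normalizes a raw transcript toward the European conventions described above. The synonym and initialism tables are small assumed samples, not the full ontology:

```python
# Illustrative normalization toward the European lexical rules.
SYNONYMS = {"niner": "nine"}                    # per the rule described above
INITIALISMS = {("i", "l", "s"): "ILS",          # assumed sample of initialisms
               ("n", "d", "b"): "NDB"}

def normalize_eu(transcript: str) -> str:
    """Apply synonym mapping, then join spelled-out initialisms into words."""
    words = [SYNONYMS.get(w, w) for w in transcript.split()]
    out, i = [], 0
    while i < len(words):
        matched = False
        for seq, repl in INITIALISMS.items():
            n = len(seq)
            if tuple(words[i:i + n]) == seq:
                out.append(repl)
                i += n
                matched = True
                break
        if not matched:
            out.append(words[i])
            i += 1
    return " ".join(out)

print(normalize_eu("descend niner thousand cleared i l s"))
# descend nine thousand cleared ILS
```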
The differences that we observed at the word level can be summed up as fitting into the following categories:

• How initialisms are handled (e.g., ILS versus i l s).
• Words with similar meaning but different pronunciations and spelling (e.g., nine versus niner).
• Words absent from one ontology or the other (e.g., the word altimeter does not occur in European ATC communications, and the corresponding ICAO term QNH is absent from US ATC communications) [28].
• Whether speech disfluencies and coarticulation are captured at the word level (e.g., cleartalan versus cleared to land).
• Words not represented in the US English language (e.g., the German word wiederhoeren for a farewell).
These differences can have an impact on ASR speed and accuracy performance and on the end user or downstream software application.

Semantic Level
At the semantic level, MITRE's ontology (SLUS) specifies a set of entities, attributes, and relationships that capture meaning at the command or clearance level. Figure 1 illustrates the ontology of SLUS in graph format. At the highest level, SLUS starts with a concept called Command Interpretation that represents an instruction, and it has a mandatory attribute called Command Type. The Command Type attribute declares the type of the instruction, such as an aircraft maneuver (e.g., "climb") or a clearance to fly a procedure (e.g., "cleared ILS two one approach").

Each Command Interpretation can have zero or more child concepts called Qualifiers and Parameters. Both characterize, modify, and/or add values to the instruction. Qualifiers disambiguate or characterize Parameters by representing value units that are lexically present in the transcript, e.g., "flight level", "heading", "knots", etc. Qualifiers can be nested to represent deeper, hierarchical relationships. For example, to represent the condition "until the dulles VOR", the highest-level Qualifier would represent the preposition "until", its child Qualifier would represent the waypoint type "VOR", and its child Parameter would represent the name of the waypoint "dulles".

Parameters represent the value payloads for instructions that require a value, such as a heading (in degrees) for a turn instruction or an altitude (in feet or flight level) for a climb instruction. A Parameter may exist without a Qualifier parent if the format of the Parameter value or the instruction's command type makes the Parameter inherently unambiguous. For example, in the instruction "climb three four zero", the command type "climb" allows us to infer that an altitude must be represented in the Parameter, and the three-digit value format allows us to infer that the altitude is a flight level even though a unit is not explicitly stated. Figure 2 illustrates the SLUS ontology as a block diagram for comparison with the semantic level of the European ontology in Figure 3.

In comparison, Figure 3 illustrates the semantic level of the European ontology (SLEU). At its highest level, SLEU starts with a concept called Instruction, which consists of a mandatory Callsign, a mandatory Command, and optional Conditions. If the Callsign cannot be extracted from the transmission, the Callsign is "NO_CALLSIGN". A Command concept always has a Type attribute that declares the type of instruction represented. When no Command is found in a transcript, a Command concept with Type "NO_CONCEPT" is created. Depending on the Type, no Value or one or more Values can follow. If a Value is available, the optional attributes Unit and Qualifier are possible. The optional Condition concept can be present for any Type, and more than one may be associated with one Command.

Type can consist of a subtype, as illustrated by the command CLEARED ILS. The Speaker attribute can have the values "ATCO" or "PILOT". If not specified, it is ATCO or can be derived from additional available context information. The Reason attribute is only relevant for pilot transmissions; then the values "REQ=REQUEST", "REP=REPORTING", or an empty value are possible. The empty value, i.e., the default value, in most cases marks a pilot's readback. The Reason attribute is motivated by the examples in Table 1.

The differences that we observed between SLUS and SLEU at the semantic level can be summed up as fitting into the following categories:

• How callsigns are represented.
• The extent of and representation of inferred and implied information in the semantic representations.
• The level of detail represented for advisory-type transmissions (e.g., traffic advisories, pilot call-in status information).
• Which less-common ATCo instructions have defined representations.
• How ambiguous ATCo instructions are represented.
For a detailed comparison of the semantic-level ontology overlap between the MITRE and European ontology instantiations, refer to Tables A1-A6 in Appendix A.
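To make the SLUS structure concrete, the nesting described above for "turn right heading three four zero until the dulles VOR" can be sketched with simple data classes. The class and field names follow the paper's terminology, but the code itself is our illustration, not MITRE's implementation:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Parameter:
    value: str                      # the value payload, e.g., "340" or "dulles"

@dataclass
class Qualifier:
    label: str                      # lexically present unit/preposition/type
    qualifiers: List["Qualifier"] = field(default_factory=list)   # nested
    parameters: List[Parameter] = field(default_factory=list)

@dataclass
class CommandInterpretation:
    command_type: str               # mandatory Command Type attribute
    qualifiers: List[Qualifier] = field(default_factory=list)
    parameters: List[Parameter] = field(default_factory=list)

# "turn right heading three four zero until the dulles VOR"
cmd = CommandInterpretation(
    command_type="turn right heading",
    parameters=[Parameter("340")],          # unambiguous heading payload
    qualifiers=[
        Qualifier("until", qualifiers=[     # condition, nested one level
            Qualifier("VOR", parameters=[Parameter("dulles")])
        ])
    ],
)
```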

Examples of Ontology Representations from ATC Communications
In reference [3], we presented several examples of word-level and semantic interpretation representations as defined by the European and MITRE ontology instantiations. We summarize them again below in Table 1 to illustrate the similarities and differences between the two ontology instantiations.

Quantifying the Differences
MITRE and DLR each exchanged 100 transmissions, with transcripts and semantic annotations, from the terminal area of a major US airport and a European hub airport. The US transcripts and annotations were manually transformed into the European format and vice versa. We assessed the word-level differences at the transcript level in terms of Levenshtein distance [29].
Out of 1554 total words in transmissions, 187 of them required modification to adhere to the other party's ontology, i.e., 12.0% of words were modified through substitution (89), deletion (35), and insertion (63). We omit uppercase to lowercase transformation from this measure. Figure 4 shows a sample transcript and its transformation.
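The word-level measurement described above can be reproduced with a standard dynamic-programming edit-distance computation that counts substitutions, deletions, and insertions separately. This is a generic sketch of the method, not the exact tooling used in the study:

```python
def word_edit_ops(ref, hyp):
    """Count (substitutions, deletions, insertions) turning ref into hyp,
    i.e., word-level Levenshtein distance with an operation breakdown."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match/substitution
    # backtrack to count each operation type
    subs = dels = ins = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins

print(word_edit_ops("i l s".split(), "ILS".split()))  # (1, 2, 0)
```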
In the following bullets, we list some of the most frequently occurring cases from the 200 transcripts that are represented differently at the word level in the MITRE and European ontologies, as sketched in Section 2.3:

• Separation and combination of words/letters: "ILS" vs. "i l s" (23 times); "southwest", etc., vs. "south west", etc. (19 times).
• Different spellings: "nine" vs. "niner" (9 times); "juliett" vs. "juliet" (6 times); "OK" vs. "okay" (4 times).
• Special sounds and their notation: "[unk]" vs. no transcription (7 times); "[hes]" vs. "uh" (7 times).

Type of Semantic Comparison                              Overlap of Concepts
Concept present in both ontologies before adaptation     82%
Corresponding concept after small adjustments            95%
Achievable match with existing model structures          100%

Impact of Ontology on Collaboration
Up to this point in the paper, we have described and compared two ontology instantiations that define simplified meaning representations for ATC communications. In the remainder of this paper, we describe how these ontologies assist collaboration, highlighting their benefits and shortfalls. Specifically, we examine the extent to which data, models, algorithms, and applications can be shared between research groups given operational and geographic differences, and how the differences manifested in ATC communications can be bridged with the help of ontologies.

Text Data
In a perfect world, there would exist only one ground truth transcript for a segment of speech audio. However, as the ontology differences summarized above show, even when there is agreement on what was spoken, the lexical representation of the spoken content can still differ. Although these differences in representation may seem superficial, they leave lasting impressions on models created using these lexical representations, can lead to artificially inflated error metrics if overlooked, and in some cases can increase the number of actual errors.
For example, consider the two nominal examples in Table 3, where the original ground truth transcripts are transcribed according to the European ontology rules and the automatically transcribed text is generated by a speech recognizer that has modeled language following the MITRE ontology rules. In Example 1, when the automatically transcribed text is assessed against the ground truth transcript using word error rate (WER), a common metric for assessing speech recognition accuracy, the WER evaluates to 12.5% because of one substitution error (three by four) and one deletion error (four is missing) against a total of 16 words in the ground truth. This WER is reasonable because in this scenario the ground truth and the speech recognizer have the same lexical representation for all words in the transcript.
In contrast, in Example 2, the three errors (one substitution and two insertions) resulting from differences in lexical representation ("i l s" instead of "ILS") compound the actual substitution error ("four" by "five") and result in a WER of 100%. In this scenario, lexical differences artificially inflate the true WER from 25% to 100%. Furthermore, if the semantic parser does not use the same lexical representation, the difference can lead to parse errors, which in turn lead to semantic errors.
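The inflation effect in Example 2 is easy to reproduce. The transcripts below are illustrative stand-ins in the spirit of Table 3, not the paper's exact strings:

```python
def wer(ref, hyp):
    """Word error rate = (S + D + I) / len(ref), via word-level Levenshtein distance."""
    m, n = len(ref), len(hyp)
    prev_row = list(range(n + 1))
    for i in range(1, m + 1):
        row = [i] + [0] * n
        for j in range(1, n + 1):
            row[j] = min(prev_row[j] + 1,                               # deletion
                         row[j - 1] + 1,                                # insertion
                         prev_row[j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
        prev_row = row
    return prev_row[n] / m

# Ground truth under European lexical rules; recognizer output under MITRE rules.
truth = "cleared ILS two four".split()
hypo = "cleared i l s two five".split()

print(wer(truth, hypo))  # 1.0: lexical mismatch compounds one true error

# After normalizing "i l s" to "ILS", only the true error remains:
print(wer(truth, "cleared ILS two five".split()))  # 0.25
```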
Thus, a mechanism for translation between different lexical representations is often required when sharing raw text data. By explicitly defining the rules for lexical representation, ontologies play a critical role in highlighting what is required of the translation process and facilitate its design without extensive data analysis and exploration. Because WER is an indicator of lexical representation mismatch, it can be repurposed to measure the effectiveness of the translation process.

Semantic Annotations
Semantic representation differences are often much more obvious than lexical representation differences, but they still require the same, if not more, attention to translation. The complexities of semantic representation make ontologies even more critical to the translation design process. Though an exhaustive comparison of ontology instantiations may seem daunting, it is still much easier than an exhaustive search for syntactic and semantic samples in raw text data. As in the case of lexical translation, a measure of semantic representation mismatch is needed to assess the effectiveness of the translation process. We outline below our simple scheme for measuring semantic translation accuracy; it is independent of semantic concept type or subcomponents and treats all semantic components with equal importance. These metrics can be used to compare semantic labels that have been mapped from one ontology representation to another and then back again to assess semantic content loss from the conversion. Table 4 lists definitions that are the building blocks for the accuracy metrics, and Table 5 defines the metrics and their formulas.

Consider the nominal example of semantic translation for the transcript in Table 6, "good day american seven twenty six descend three thousand feet turn right heading three four zero". We use this example to illustrate the metrics in action. In this example, there are 2 TPs (greeting and heading change), 1 FP (due to the missing altitude in the altitude change), 0 FNs, 0 TNs, and TA = 3. Table 7 summarizes the accuracy metrics calculated on this nominal example.

The range for all metrics, with the exception of the Command Recognition Error Rate (CRER), is between 0 and 1. The CRER can exceed 1 if (many) concepts not present in the ground truth are generated.
These metrics provide a general measure of the semantic coverage overlap between ontologies, i.e., when there is significant overlap, the CRER is low and when there is little overlap, the CRER is high. These same metrics can measure the extraction accuracy of a rules-based or deep neural network semantic parser in a general sense, but they should be modified and supplemented before use as a measure of application accuracy performance. We detail the rationale for and examples of application-specific metrics later in this section.
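The metrics can be sketched as follows. Because Tables 4 and 5 hold the authoritative definitions, the formulas below are conventional stand-ins (precision, recall) together with an assumed CRER definition, used here only to illustrate that the CRER can exceed 1:

```python
def semantic_metrics(tp, fp, fn, total_annotations):
    """Illustrative accuracy metrics from TP/FP/FN counts and total annotations (TA).

    precision and recall are the conventional definitions; the CRER formula is
    an assumption for illustration, not the paper's Table 5 definition."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    crer = (fp + fn) / total_annotations  # assumed definition; unbounded above
    return precision, recall, crer

# Nominal example from the text: 2 TPs, 1 FP, 0 FNs, TA = 3
p, r, c = semantic_metrics(2, 1, 0, 3)

# Many spurious concepts push this CRER definition above 1:
_, _, c_high = semantic_metrics(1, 5, 0, 2)
print(c_high)  # 2.5
```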

Automatic Speech Recognition Models
In today's world of large pre-trained models, automatic speech recognition models are usually robust enough to transplant into new geographic regions, environments, and domains with minimal fine-tuning. Some models can even adapt to language changes with little to no fine-tuning. However, there are idiosyncrasies in ATC language that can reduce a speech recognition model's performance if they are not addressed during transplantation between geographic regions or simply throughout prolonged use. Specifically, the quantity of airspace and region-specific, i.e., site-specific, proper nouns used during ATC radio communications requires special handling and maintenance when operating a speech recognition model in the ATC domain.
Much of the vocabulary that appears in ATCo-pilot communications consists of general-purpose words such as climb, descend, cleared, to, for, and, until, one, two, three, alfa, bravo, and charlie. These are simple to document at the word level of an ontology. However, depending on the quantity of airspace that the ASRU is intended to cover, a significant percentage (90% or more) of the vocabulary could be made up of names, such as those for airline callsigns, facility identifiers, location identifiers, navigational aids, and procedure identifiers.
The site-independent, general-purpose vocabulary is relatively static and short: a few hundred words cover most ATCo-pilot voice communications. Section 4 will show that 551 words cover 95% of the spoken words in the US data. The vocabulary of names that are used in ATCo-pilot voice communications is much larger (tens of thousands if covering the entire United States airspace) and subject to change to accommodate airspace and procedure revisions and airline and pilot callsign additions. This name list is disproportionately large compared with the general-purpose word list but not excessively large by ASR standards. More importantly, the list of names is much more dynamic, which creates a challenge.

Just as software can deteriorate over time (i.e., software rot), ASRU ontologies (and their associated models) can degrade over time if they are not maintained. For ASRU applications, an outdated word-level ontology is likely to result in out-of-vocabulary errors, which can negatively affect ASR accuracy and the accuracy of all downstream capabilities. The same applies to the sequence of words, i.e., ICAO phraseology and deviations from ICAO phraseology [28]. This is a serious lifecycle maintenance issue. It is a particularly large challenge for applications that need to be scaled up to cover multiple ATC sectors and facilities.

Newer ASR models, which transcribe at the letter level, and language model tokenizers, which tokenize at the subword level, may eliminate the problem of out-of-vocabulary words but not the challenge of correctly recognizing and interpreting these words given their low occurrence in the training data. Furthermore, the unconstrained vocabulary in these models presents its own problems for interpretation.
Changes on the ATC operations side are made on the 28-day AIRAC (Aeronautical Information Regulation And Control) cycle. The number of changes during any one AIRAC cycle is usually small, and the changes are known well in advance. Changes on the commercial airline side do not follow an official cycle but tend to be relatively uncommon. There are two subcategories of names that can present unique problems for ASRU: military callsigns and five-letter waypoint names.
Military callsigns are a challenge because they can be introduced ad hoc and are not always known in advance of a flight's departure. The FAA ATC handbook [26] states that military callsigns for "U.S. Air Force, Air National Guard, Military District of Washington priority aircraft, and USAF civil disturbance aircraft" consist of "[p]ronounceable words of 3 to 6 letters followed by a 1- to 5-digit number". These pronounceable words may be pilots' names or nicknames, and they might not otherwise appear in an ATC ontology and associated ASR model. For example, "honda five" and "maverick zero zero seven" are accepted military callsigns.
Five-letter waypoint names present a different challenge for ASRU. They are part of the AIRAC update cycle and are published in advance, but only the five-letter codes are published, not their pronunciations. In many cases, the waypoint codes correspond to obvious words or can be sounded out using a simple algorithm, but not always; consider GNDLF, YEBUY, and ISACE. Whereas for some applications pronunciation can be handled manually on a small scale by talking to a facility's ATC personnel, this approach does not easily scale to applications involving multiple ATC facilities or large amounts of airspace.
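A naive screen of this kind (our own heuristic, not an operational rule) might flag waypoint codes that defeat simple letter-to-sound rules. Note that it catches GNDLF but passes YEBUY and ISACE, whose pronunciations are still ambiguous, which underscores why manual review is needed and why it does not scale:

```python
import re

def needs_manual_pronunciation(code: str) -> bool:
    """Flag codes likely to defeat naive letter-to-sound rules: no vowel at
    all, or a run of three or more consecutive consonants. A heuristic only."""
    code = code.lower()
    return (not re.search(r"[aeiouy]", code)
            or bool(re.search(r"[^aeiouy]{3,}", code)))

for wp in ["GNDLF", "YEBUY", "ISACE"]:
    print(wp, needs_manual_pronunciation(wp))
# GNDLF True / YEBUY False / ISACE False
```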
In ASRU, there is a fundamental tradeoff between a vocabulary that is too small, resulting in out-of-vocabulary errors, and a vocabulary that is too large, resulting in confusion between similar-sounding words. An ASR built using a larger word-level ontology is not always better. Furthermore, it may not be possible to know and include the region-specific names in the vocabulary until the region where the model will be used is known. Thus, a word-level ontology may only specify the general-purpose vocabulary explicitly and define rules for how this vocabulary should be augmented with site-specific names before use. This issue contributes to the challenge of sharing ASR models trained and/or used between different ATC facilities or regions. Well-designed ASRU tools can simplify adding this site-specific information to the ontology and corresponding software.
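The augmentation rule described above can be sketched as follows; all word lists are illustrative samples:

```python
# A fixed general-purpose word list is augmented with site-specific names
# before deployment; out-of-vocabulary (OOV) words are then easy to audit.
GENERAL_VOCAB = {"climb", "descend", "cleared", "to", "for", "and", "until",
                 "one", "two", "three", "alfa", "bravo", "charlie"}

def build_site_vocab(general, site_names):
    """Augment the general-purpose vocabulary with site-specific names."""
    return general | {w.lower() for w in site_names}

def oov_words(transcript, vocab):
    """Return the transcript words missing from the vocabulary."""
    return [w for w in transcript.split() if w not in vocab]

site_vocab = build_site_vocab(GENERAL_VOCAB, ["dulles", "casino"])
print(oov_words("cleared to dulles climb three", site_vocab))  # []
print(oov_words("cleared direct casanova", site_vocab))  # ['direct', 'casanova']
```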

Semantic Parsing Algorithms
Semantic parse algorithms translate lexical representations into semantic representations by capturing and translating the syntactic relationships between words. The mechanism for semantic parsing could be a rules-based algorithm or a machine-learning-based neural network model. Both are sensitive to lexical representation changes because they operate so closely on lexical and syntactic relationships.
Rules-based semantic parse algorithms could be considered a part of the ontology at the syntactic level because they contain rules about which relationships between lexical representations are meaningful and how they can be interpreted to construct higher-level semantic concepts. As every acceptable permutation of words must be explicitly or implicitly specified for interpretation, rules-based parse algorithms inherently document the syntactic level of the ontology; however, they can be incredibly labor-intensive to create and maintain. Transplanting a rules-based semantic parse algorithm into a new region requires adapting the parse algorithm to regional lexicons, site-specific operational communications, and jargon. This inherently updates the syntactic level of the ontology as part of the model transition process.
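A toy rules-based parse in this spirit might look like the following; the two patterns are our own illustrations, not rules from either ontology's parser:

```python
import re

# Ordered pattern list: each rule maps a phrase shape to a command type.
RULES = [
    (re.compile(r"\b(climb|descend) (?:and maintain )?([a-z ]+?)(?: feet)?$"),
     "ALTITUDE"),
    (re.compile(r"\bturn (left|right) heading ([a-z ]+)$"),
     "HEADING"),
]

def parse(transcript):
    """Return the first matching command type and its captured groups."""
    for pattern, command_type in RULES:
        m = pattern.search(transcript)
        if m:
            return {"type": command_type, "groups": m.groups()}
    return None

print(parse("turn right heading three four zero"))
# {'type': 'HEADING', 'groups': ('right', 'three four zero')}
```

Every acceptable phrasing ("descend three thousand", "descend and maintain three thousand feet", ...) must be anticipated by some pattern, which is exactly what makes such rule sets self-documenting yet labor-intensive.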
Machine-learning-based models for semantic parsing learn the syntactic relationships from the hierarchies present in the semantic labels. In one sense, this eases the burden of rule creation, but it shifts it instead to data labeling, because the data labels must reflect the relationships between lexical entities in order for the model to learn them. Furthermore, as the syntactic rules are no longer explicitly stated as rules but hidden within the model weights, exact syntactic relationships can be difficult to discover and adjust for new model users, hampering reuse and even certification. In the absence of explicit syntactic rules, the semantic definitions of the ATC ontology become even more important as they capture and relay semantic hierarchies that might otherwise be overlooked without exhaustive data search and analysis.

Sharing and Reusing Applications
In ATC, there are common areas for improvement that come up again and again as possible avenues for ASRU application. As a result, the potential for application transition and reuse is high when an application is successful, even across geographic boundaries. In this section, we describe how ontologies facilitate application transition. We also discuss the importance of application-specific metrics and why they should be added to the ontology on an as-needed basis.

Examples of Application Specific Ontologies
Most applications incorporating ASRU are unlikely to use all the semantic concepts defined in an ATC ontology. Indeed, some of the applications prototyped between MITRE and DLR have only used a handful each. However, some semantic concepts appear across multiple applications, marking them as particularly important and worthy of focused research to improve extraction accuracy. Callsign is a recurring semantic concept that is relevant to multiple applications. Thus, both MITRE and DLR have special handling, such as context-based inference, to improve the detection accuracy of this concept. Table 8 summarizes different applications of ASRU prototyped by DLR and MITRE. The table indicates with an "X" which command semantics are used in each application. The applications are described in greater detail below the table, and references to published reports are provided where available.

Closed Runway Operation Detection (CROD)
MITRE prototyped and field tested a closed runway operation clearance detection system that uses ASRU to detect landing or takeoff clearances to runways that are designated as closed. The system relies purely on manual entry of runway closures and passive listening on the local controller radio channel to detect a clearance to a closed runway and issue an alert. For more information on this application, please see [15].

Wrong Surface Operations Detection (WSOD)
An expansion on the closed runway operation clearance detection system, this more advanced prototype combines ASRU on radio communications with radar data in real time to detect discrepancies between the landing clearance runway issued over the radio and the projected landing runway inferred from radar track data. When a discrepancy is detected, the system generates an alert to the tower ATCo. Separately, MITRE's voice data analytics capability was used to mine radio communications for approach clearances to inform a post-operational approach procedure utilization and conformance study [18]. The study used spoken approach clearances and radar tracks to detect trends in when and where flights received their approach clearances, correlation between aircraft equipage and approach clearance, and the effect of weather conditions on procedure utilization. The study was also able to use detected approach clearances to differentiate aircraft flying visual approaches from aircraft flying Required Navigation Performance (RNP) procedures and then analyze RNP procedure conformance. For more information on this application, please see [18].

Electronic Flight Strip in Multiple Remote Tower Environment (MRT)
In multiple remote tower operations, controllers need to maintain electronic flight strips for a number of airports. The manual controller inputs can be replaced by automatic inputs when using ASRU support. In the HMI interaction modes for the Airport Tower project, the tower/ground controller had to simultaneously take care of three remote airports. Their responsibilities included entering flight status changes triggered by issued clearances, such as pushback from gate, taxi via taxiways, line-up, runway clearances, etc., with an electronic pen into the flight strip system. When ASRU support was active, the flight status changes were automatically recognized from the controller utterances, entered into the flight strip system, and highlighted for review. If an automatically detected flight status change was not manually corrected by the controller within ten seconds of entry, the values were accepted by the system. The prototypic system was validated with ten controllers from Lithuania and Austria in 2022. More details can be found in the presentation "Understanding Tower Controller Communication for Support in Air Traffic Control Displays" given at the SESAR Innovation Days in Budapest in 2022 by Ohneiser et al.

Integration of ASRU with A-SMGCS for Apron Control at Frankfurt and Simulation Pilots in Lab Environment (SMGCS and SPA)
In June 2022, Frankfurt Airport (Fraport), together with DLR, ATRiCS Advanced Traffic Solutions GmbH, and Idiap performed validation trials with 15 apron controllers in Fraport's tower training environment under the STARFiSH project. An A-SMGCS (Advanced Surface Movement Guidance and Control System) was supplemented with ASRU to enable integration of recognized controller commands into the A-SMGCS planning process and simultaneously improve ASRU performance with the addition of context from A-SMGCS. Together with manual input from the ATCo, the A-SMGCS is able to detect potentially hazardous situations and alert the ATCo. The addition of ASRU reduces the burden on the ATCo to manually input issued clearances over the radio into A-SMGCS. Research results showed that up to one third of the working time of controllers is spent on these manual inputs, which is detrimental to overall efficiency because ATCos spend less time on the optimization of traffic flow. More details can be found in the presentation "Apron Controller Support by Integration of Automatic Speech Recognition with an Advanced Surface Movement Guidance and Control System" given at the SESAR Innovation Days in Budapest in 2022 by Kleinert et al. Table 8 contains two columns for this application. The column "SMGCS" corresponds to the support of the ATCo in this application, whereas the column "SPA" corresponds to the support of the simulation pilots by ASRU.

Workload Prediction for London Terminal Area (WLP)
Under the Highly Automatic Air Traffic Controller Working Position with Artificial Intelligence Integration (HAAWAII) project, DLR, together with NATS (the Air Navigation Service Provider of the United Kingdom), University of Brno, and Idiap developed a tool that determines an ATCo's workload in real time, based on input from ASRU. The radio communications between ATCos and pilots in the London TMA, for Heathrow Approach, were analyzed. Length of utterances, frequency usage rate, number of greetings, and number of miscommunications (say again, etc.) were evaluated for this purpose [30]. Callsign information is of minor importance here.

Integration of ASRU and CPDLC (CPDLC)
Under the HAAWAII project, DLR, together with NATS and Isavia ANS, evaluated the performance of ASRU and CPDLC integration. In the future, ATCos and pilots will communicate their intentions via both data link, e.g., CPDLC (Controller Pilot Data Link Communication), and radio communications. In this envisioned state, ASRU and CPDLC are not competitors but complementary tools. Current CPDLC applications are expected to advance with the advent of a lower-latency data link (LDACS). ASRU can reduce the number and complexity of mouse clicks required to create a CPDLC message. More details can be found in the deliverable D5.3 of the HAAWAII project, "Real Time Capability and Pre-Filling Evaluation Report".

Pilot Weather Reports (PWR)
MITRE performed a post-operational analysis on the quantity of weather-related pilot reports (PIREPs) that could be automatically detected and submitted as "synthetic PIREPs" by an ASRU-enabled capability [19]. One of the goals of this analysis was to see if synthetic PIREPs could supplement the manually submitted PIREPs present in the system today and better inform strategic and tactical planning of ATC operations throughout the US National Airspace System (NAS) while also easing the ATCo workload. This use case relied on the Callsign and Pilot Report semantic representations to generate a formatted synthetic PIREP. More details about the motivation, outcomes, and conclusions of this analysis can be found in [19].

Use of Visual Separation (VFR)
Pilot-to-pilot visual separation is an important component of NAS safety and efficiency because it allows aircraft to fly closer together, with the pilot assuming responsibility for separation. However, whether pilots were maintaining visual separation can only be determined from the voice communications between ATCo and pilot. ASRU can be used to detect traffic advisories (when an ATCo points out traffic to a pilot), the pilot reporting the traffic in sight, and the instruction for pilots to "maintain visual separation" in post-operations analysis. This information is critical to understanding the safety of a given encounter between aircraft. The information can therefore be used to better prioritize operations for safety assurance review. Visual separation information can also be used to inform efficiency-perspective analysis of operations (e.g., what percentage of flights are visually separated), because it informs the spacing between aircraft, which in turn informs throughput/capacity.

Simulation Pilots in Enroute Domain Controller Training (SPET)
MITRE designed and prototyped high-fidelity simulation training consoles to support controller training in the enroute domain [5]. To reduce training and simulation costs, these consoles included a real-time simulation pilot system that uses automatic flight management, ASRU, and text-to-speech technology to interact with controllers during training simulations. Automated simulation pilots can handle more aircraft workload, provide consistent performance and response times to controller instructions, and require less training than human simulation pilots. The success of this prototype led to other follow-up projects, such as terminal training applications, Human-In-The-Loop (HITL) simulations to support new technology prototyping, procedure and airspace design, and research studies in MITRE's Integration Demonstration and Experimentation (IDEA) Lab, and bilingual training consoles for international use.

Readback Error Detection for Enroute Controllers (RB-E)
Under the HAAWAII project, DLR, together with the Icelandic Air Navigation Service Provider Isavia ANS, University of Brno, and Idiap developed a readback error detection assistant and tested it on pilot and ATCo voice utterances directly recorded in the ops room environment of Isavia ANS [2].

Readback Error Detection for Tower Controllers (RB-T)
In 2016, MITRE conducted a feasibility study into the automatic detection of readback errors at the tower/local controller ATCo position using recorded live-operations audio [1]. The study focused on runway and taxiway use clearances and assessed the readiness of ASRU performance to support this type of application. Whereas automatic speech recognition performance was promising, the study found that more complex understanding logic was needed to differentiate acceptable readback discrepancies from alert-worthy readback errors. The study also identified the importance of detecting the nuances of dialogue between the ATCo and pilot during which the ATCo might have already taken corrective action and nullified the need for an alert.

Application-Specific Metrics
We previously described general semantic accuracy metrics for evaluating how well labeled concepts are extracted in general, irrespective of a downstream application. In an ideal world, we could have a single set of objective ASRU metrics that could be used to communicate accuracy and be meaningful across all applications. However, we cautioned that these general semantic metrics should be supplemented before use with a downstream application. In this section, we describe why metrics must be tailored to the application in order to be useful.
The first set of metrics to consider is the set that describes the accuracy performance of the application, i.e., the performance that is relevant to the end user (who could be an ATCo, pilot, data analyst, policy maker, etc.). The application accuracy is the ultimate measure of performance because the application's benefit is the ultimate measure of the utility of the capability.
However, there are situations where the application accuracy can diverge from the accuracy of the underlying ASRU. One case is when the application logic is such that an incorrect ASRU result can still produce the correct application output. Another case is when there is non-speech information used after ASRU processing that can improve wrong or missing ASRU output.
For example, consider the application described in Section 3.3.3, in which ASRU is used to detect the ATCo landing clearance and then surveillance track information is used to determine if the arrival is lined up for the correct runway. If the arrival is lined up for the wrong runway, the application issues an alert to the ATCo; if no landing clearance is detected for an arrival, the application does nothing.
Incorrect ASRU detection of the callsign will likely result in no alert because the system will not be able to compare the flight's track with a clearance. No alert will likely be the correct application response because most arrivals line up for the correct runway. Similarly, missing the landing clearance would also result in no alert. In other words, we are getting the right results but for the wrong reason.
In contrast, incorrect ASRU detection of the callsign could be corrected through use of other information, e.g., using the arrival's position in the landing sequence to fill in the gap in knowledge, resulting in correct application performance.
It is clear from these examples that although application performance is the ultimate measure of success, it obscures some detail of the ASRU accuracy. Detail of the ASRU accuracy can be critical for two reasons. One, it provides understanding of what kinds of application errors will result from ASRU errors. Two, it provides understanding of where ASRU accuracy can and should be improved.
Continuing the example of using ASRU to detect landing clearances that can be compared with arrival alignment to identify wrong surface alignment, ASRU errors in callsign recognition will result in ASRU failing to associate the landing clearance with the correct aircraft. Given that most aircraft line up correctly, this missed recognition will likely still result in a correct "no alert" response at the application level. On the other hand, ASRU errors in runway recognition could result in ASRU producing an incorrect assigned runway for the flight, which could then result in a false alert to the ATCo.
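The propagation described above can be sketched as a small decision function. The logic below is a hypothetical simplification of such an application, and the callsign "DLH42" and runway identifiers are invented for illustration.

```python
# Hypothetical decision logic for the wrong-surface alert application;
# the callsign "DLH42" and the runways are invented for illustration.

def application_output(asru_clearance, track):
    """asru_clearance: {'callsign', 'runway'} extracted by ASRU, or None
    if no landing clearance was detected; track: surveillance-derived
    callsign and projected landing runway."""
    if asru_clearance is None or asru_clearance["callsign"] != track["callsign"]:
        # Clearance cannot be associated with the flight: stay silent.
        return "no alert"
    if asru_clearance["runway"] != track["runway"]:
        # Cleared runway disagrees with projected runway: alert the ATCo.
        return "alert"
    return "no alert"

track = {"callsign": "DLH42", "runway": "25R"}

# A missed callsign yields "no alert" -- usually correct, but for the
# wrong reason; a misrecognized runway yields a false alert.
print(application_output({"callsign": "DLH42", "runway": "25L"}, track))
```

The two branches make the asymmetry concrete: callsign errors are silently absorbed, while runway errors surface as false alerts.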
Thus, for an application that aims to detect and alert on runway misalignment, the ASRU accuracy measures should be defined corresponding to the ontology concepts that need to be detected for the application: callsign, landing clearance, and runway. For each concept, detection accuracy can be evaluated using the metrics defined in Table 5.
These metrics should then be produced for each concept separately, such that callsign, landing clearance, and runway would each have several associated accuracy measures: recall, precision, etc. These metrics can then be used to identify and measure performance improvements in the ASRU. For example, they differentiate between missed landing clearances due to missed callsign detection and those due to missed landing clearance detection.
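A minimal sketch of producing these per-concept metrics follows; the TP/FP/FN counts are illustrative placeholders, not results from either organization.

```python
# Per-concept metric computation as suggested above; the detection
# counts are illustrative placeholders, not results from the paper.

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

# One set of detection counts per ontology concept.
concept_counts = {
    "callsign":          {"tp": 95, "fp": 2, "fn": 3},
    "landing_clearance": {"tp": 90, "fp": 1, "fn": 9},
    "runway":            {"tp": 97, "fp": 5, "fn": 1},
}

for concept, c in concept_counts.items():
    print(f"{concept}: precision={precision(c['tp'], c['fp']):.3f}, "
          f"recall={recall(c['tp'], c['fn']):.3f}")
```

Keeping the counts per concept is what allows the diagnosis described in the text, e.g., separating missed callsigns from missed landing clearances.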
Note that the concept detection accuracy can be rolled up into a single metric, producing an overall concept recognition error rate by combining the TP, FP, TN, and FN for all concepts. This overall concept recognition error rate provides a general measure of the ASRU accuracy, and improvement in this measure generally means better ASRU accuracy for the application, which in turn means better overall efficacy for the application. However, as the previous examples illustrate, rolling the detection of these concepts up into a single measure will obscure understanding about the effects of the errors on application performance or where ASRU improvements should be targeted. Using Figure 5, consider the following example.
Consider evaluation of ASRU performance on a set of 10 transmissions for this hypothetical application where all three basic concepts are needed to generate correct application output. A concept can be the callsign, the command type, the command value, etc. The Concept Error Rate (CER) measures the accuracy of the ASRU in detecting each concept, and a CER should be measured for each concept, not combined into a single metric covering the accuracy of detecting all semantic concepts. In contrast, the Command Recognition Error Rate (CRER), as defined in Table 5, measures the accuracy of the ASRU in detecting complete commands, which requires both the callsign and the instructions, which can in turn be composed of different concepts.
In Case A, ASRU produces fully correct concepts for nine of the ten transmissions but zero correct concepts for one transmission. The "combined" concept error rate (CER, 3/30 = 10%) and the application error rate (1/10 = 10%) are the same. In Case B, ASRU produces fully correct output for seven of the ten transmissions but two out of three correct concepts for the remaining three transmissions. The combined CER is still 3/30 = 10%, but the application error rate is now 3/10 = 30%. The CRER for Case A is 10%, whereas the CRER for Case B is 30%.
The application performance for Case A is clearly better than for Case B. It is clear from this example that the combined CER obscures important information. First, Case A will result in better application performance than Case B, despite the two having the same combined CER. Second, neither the combined CER nor the CRER tells us which concepts have room for improvement. For the example in Case A, the issue may be a systematic problem with a transmission that affects the recognition of all three concepts, such as bad audio or incorrect segmentation. For the example in Case B, did the system miss the callsign each time or one of the other concepts? Individual measures of precision and recall for each ontology concept (callsign, landing clearance, and runway in the example used above) are needed to fully assess the ASRU accuracy.
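The Case A/Case B arithmetic can be reproduced directly. The sketch below encodes each case as the number of concept errors per transmission, a representation chosen here purely for illustration.

```python
# Worked version of Case A and Case B: 10 transmissions, 3 concepts per
# transmission. The combined CER pools concept errors across the set,
# while the CRER counts a transmission as wrong if any concept is wrong.

def combined_cer(concept_errors, concepts_per_transmission=3):
    """Pooled concept error rate over all transmissions."""
    total = len(concept_errors) * concepts_per_transmission
    return sum(concept_errors) / total

def crer(concept_errors):
    """Command recognition error rate: any concept error fails the command."""
    return sum(1 for e in concept_errors if e > 0) / len(concept_errors)

case_a = [0] * 9 + [3]        # one transmission with all three concepts wrong
case_b = [0] * 7 + [1, 1, 1]  # three transmissions each missing one concept

print(combined_cer(case_a), combined_cer(case_b))  # identical combined CER
print(crer(case_a), crer(case_b))                  # diverging CRER
```

The identical combined CER against the diverging CRER is precisely the obscuring effect described above.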
As another example, if the application only requires one concept to be detected (e.g., the closed runway operation clearance detection application described in Section 3.3.2) and does not require a callsign, then a metric such as CRER is not appropriate because it incorporates unnecessary concepts into the metric.
In summary, there is not a single metric nor type of metric that is appropriate for all applications. Practitioners should develop metrics specific to the application, covering both the application level (i.e., the performance of the application from the user's perspective) and the ASRU level (i.e., the performance of the ASRU on individual concepts needed for the application). These application-specific metrics may expand beyond accuracy measures and incorporate requirements on computing and speed performance as applications come closer to being fielded in operational settings with specific resource constraints and demands on response time.

Quantitative Analyses with Applied Ontologies
Application-specific metrics thus assess overall application readiness for an operational setting and acceptability to the end user. In this capacity, they are as important, if not more so, than the lexical- and semantic-level ontology when applications are transplanted into new operational environments. The general semantic accuracy metrics we described previously help researchers evaluate data, algorithms, and models; however, application-specific metrics describe the end-user experience and how the user will be impacted by the addition of the application to the operational environment. For this reason, we recommend application-specific metrics be added to the conceptual-level definitions and rules of the ontology when an application is transitioned. These application-specific metrics can go beyond TN/TP/FN/FP and include metrics even more relevant to operations, such as false alerts per hour.
The following two subsections describe example applications and the types of ontology-related metrics needed to assess their accuracy performance.

Application-Specific Metrics for a Workload Assessment in the Lab Environment
This application is briefly described in Section 3.3.5. Table 9 summarizes the semantic concepts relevant to this application. The impact on workload and safety was measured in terms of the number of missing and incorrect radar label inputs when ASRU support was present and when it was not. Table 9 also summarizes the command detection accuracy when ASRU support was present during operations. Row "0.0%" shows the command detection performance with a perfect speech-to-text conversion, i.e., all incorrect detections come from errors in semantic extraction. Row "3.1%" shows the actual command detection performance during the validation trials with a speech-to-text engine that had an average WER of 3.1%.
For this use case, the application-specific metrics closely aligned with the semantic accuracy metrics described in Section 3.1.2 because the command detection accuracy translated directly into radar label entry accuracy. The number of correctly detected commands, or the command recognition rate (RcR), translated into how many entries the ATCo did not have to manually enter into the automation system. The number of incorrectly detected commands, or the command recognition error rate (CRER), translated into the number of safety risks introduced due to incorrect radar label inputs. The metric recall corresponded approximately to the command detection accuracy. They would be equal if TP+FP+FN+TN was equal to the total number of command samples (Total). The metric RER approximated 1-Prc. This correlation between RcR and Acc and the inverse correlation between RER and Prc was not present in our nominal example in Section 3.1.2 but was present in this experiment.

Application-Specific Metrics for a Post-Operations Pilot Report Analysis
The application itself is briefly described in Section 3.3.10. For the context of this paper, we discuss here the value of the application-level metrics used to measure the validity of this prototyped application's overall performance.
From the analyst perspective, the relevant metrics for this application were:
1. The number of correctly detected and accurately formatted pilot reports (PIREPs), i.e., correct PIREPs.
2. The number of correctly detected but incorrectly formatted PIREPs (incorrect PIREPs because they are incomplete, misleading, or both).
3. The number of PIREPs not detected or not mapped to a formatted PIREP (missed PIREPs).
The first quantity informs how much reliable supplemental information could be introduced into the US National Airspace System (NAS) by this capability. The second quantity informs how much supplemental information introduced might be misleading and potentially detrimental to planning. The final quantity informs how much potential supplemental information is being missed but would not negatively affect planning except by omission.
However, there is not a direct one-to-one correspondence between the semantic accuracy of the individual Callsign and Pilot Report concepts and the application metrics. Figure 6 illustrates how different errors in the automatic PIREP detection logic within the application affect the overall application performance. As the diagram shows, an error in Callsign extraction could lead to either an incorrect PIREP or a missed PIREP; an error in Pilot Report extraction could also independently lead to an incorrect PIREP or a missed PIREP, and only the combined accurate extraction of both the Callsign and Pilot Report semantics could lead to a correct PIREP. Table 10 recaps the concept metrics of the application originally published in [19]. The final output quantities show that even when a PIREP concept is correctly detected, it may not be fully and correctly encoded (i.e., the application-level success).
Using the sample results from Table 10, we define application-specific metrics for precision and recall. We define true positive PIREPs as those that are encoded with complete information and PIREPs that are encoded with correct but incomplete information, on the reasoning that some information is better than none; this is an application-specific consideration. Using that definition, we calculate precision as 88% = (79 + 26)/(79 + 26 + 14). Recall is then calculated as 63% = (79 + 26)/168.
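This calculation can be reproduced from the quoted counts (recall is 105/168 = 62.5%, rounded to 63% in the text):

```python
# Re-deriving the application-specific precision and recall from the
# counts quoted above (79 complete, 26 correct but incomplete, and 14
# incorrect encodings out of 168 PIREPs).

complete = 79
correct_but_incomplete = 26
incorrect = 14
total_pireps = 168

# Application-specific choice: correct-but-incomplete PIREPs count as
# true positives, on the reasoning that some information is better than none.
true_positives = complete + correct_but_incomplete

precision = true_positives / (true_positives + incorrect)
recall = true_positives / total_pireps

print(f"precision = {precision:.1%}, recall = {recall:.1%}")
```

Changing the application-specific choice, e.g., counting incomplete PIREPs as errors, would shift both numbers, which is why the definition must be stated alongside the metric.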
The complexity of the final application metrics is compounded by additional upstream probabilistic processes such as speech diarization, speech recognition, and text classification that could all introduce errors affecting the final result of the application. The interwoven effects of the different internal model and algorithm errors mean that no one model or algorithm is the most important and no individual model or algorithm accuracy metric could estimate overall application accuracy. Thus, application-specific metrics are invaluable for assessing the overall value of the prototype and its readiness for use in an operational setting.

European Word-Level Challenges and Statistics
As already described in Section 3.2.1, much of the vocabulary that appears in ATCo-pilot communications consists of general-purpose words such as climb, descend, cleared, etc. A significant percentage of the vocabulary, however, is made up of names, e.g., airline designators, facility identifiers, location identifiers, navigational aids, and procedure identifiers. These names occur seldom, i.e., training data might not be available in the quantities needed.
Table 11 shows the top 10 words in the two applications from the laboratory environment described in Sections 3.3.5 and 3.3.7. "# Spoken" shows how often the word was actually said. "Freq" shows how often this word occurred relative to the number of all words spoken. The Vienna data is based on 118,800 spoken words, whereas the Apron application is based on 125,800 spoken words. In light blue, we marked the words that are present in only one of the two top 10 lists. In Frankfurt, most of the taxiway names start with the letter "N", e.g., N1, N6, etc. Most of the flights to and from Vienna are operated by Austrian Airlines, whereas for Frankfurt it is Lufthansa. Table 12 shows the top 10 words for London TMA (Section 3.3.8) and for the enroute traffic managed by Isavia ANS (Section 3.3.9). The fact that "Reykjavik" is within the top 10 of Icelandic traffic control is unsurprising: Reykjavik is the capital of Iceland and the station name ATCos and pilots use. That "speed" is the sixth most frequently used word in London traffic might be surprising; however, knowing that "speed" is used both in speed commands and in the callsign "speed bird" (for British Airways) explains the high occurrence. The London data is based on 102,952 spoken words, whereas the enroute application is based on 73,980 spoken words. Investigating the statistics for all four ASRU applications, we get the values shown in Table 13. The ten digits make up the top 10. The digit "four" has the highest word error rate: it is often confused with "for", a problem which can be resolved afterwards at the semantic level. Table 14 shows the "Number of Words" evaluated for each of the four applications. For Vienna, 179 words were observed more than four times, i.e., at least five times. The 62 most frequently occurring words for Vienna already account for 95% of all the spoken words; 112 words are needed to cover 99% of all spoken words.
All in all, 347 different words were observed for the Vienna ASRU application (row "words for 100%"). The word statistics in Table 14 also show the difference between lab experiments and real-life data from the ops room: the number of words used is much larger in the ops room environment than in the lab environment. This is supported by the number of words occurring more than four times and also by the 95%, 99%, and 100% thresholds. In the Icelandic enroute airspace, English, Icelandic, and Norwegian words are used, which explains the high number of different words.
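The 95%, 99%, and 100% thresholds reported in Table 14 follow from a simple cumulative-frequency calculation over the word counts. The following is a minimal sketch of that computation; the function name and the toy token list are illustrative only and not drawn from the actual corpora:

```python
from collections import Counter

def coverage_stats(tokens, thresholds=(0.95, 0.99, 1.0)):
    """For each threshold, return the number of distinct words (taken in
    order of decreasing frequency) whose combined occurrences first reach
    that share of all spoken words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    result = {}
    cumulative = 0
    # Iterate words from most to least frequent.
    for rank, (word, n) in enumerate(counts.most_common(), start=1):
        cumulative += n
        for t in thresholds:
            if t not in result and cumulative / total >= t:
                result[t] = rank
    return result

# Toy corpus of 100 tokens (not the DLR/MITRE data).
tokens = ["two"] * 50 + ["one"] * 30 + ["descend"] * 15 + ["N1"] * 4 + ["wilco"]
print(coverage_stats(tokens))  # → {0.95: 3, 0.99: 4, 1.0: 5}
```

Applied to the Vienna corpus, this computation would yield 62 words at the 95% threshold, 112 at 99%, and 347 at 100%.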

US Word-Level Statistics
A similar analysis has been performed by MITRE, based on data from 70 ATC facilities across the US with a corpus of 1,248,436 words. Table 15 is the US counterpart of Table 14, which showed the European word-level statistics. Table 16 shows the top 10 word occurrences from the MITRE analysis. The 10 digits are also the most frequently used words in the US, with "one" and "two" again at the top. The word "nine" alone would rank only sixteenth by occurrence; however, when combined with "niner", the composite moves into the top 10 in terms of occurrence frequency. One surprising observation is that "nine" is used more often than "niner", although "niner" is the spoken form recommended for the digit by ICAO [28]. The European transcription ontology does not even distinguish between "nine" and "niner"; both words are mapped to "nine". Europe also does not distinguish between "five" and "fife" or between "three" and "tree". Manual transcribers may not even have been able to distinguish between them.
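The collapsing of ICAO variant pronunciations onto canonical digit words, as done in the European transcription ontology, can be sketched as a simple lookup-based normalization. The table and function below are illustrative and not part of either ontology's actual tooling:

```python
# Illustrative variant table: ICAO variant pronunciations mapped onto the
# canonical digit words used by the European transcription ontology.
ICAO_VARIANTS = {"niner": "nine", "fife": "five", "tree": "three"}

def normalize(tokens):
    """Replace variant digit pronunciations with their canonical form so
    that counts such as those for "nine" and "niner" can be merged before
    comparing corpora transcribed under different conventions."""
    return [ICAO_VARIANTS.get(t, t) for t in tokens]

print(normalize(["descend", "niner", "tree", "zero"]))
# → ['descend', 'nine', 'three', 'zero']
```

Such a normalization step would need to be applied to the US transcriptions before any word-level comparison with the European data, since the latter never records the variant forms.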
The word "oh" for the digit "zero", transcribed in Europe as a capital "O", is observed only 59 times in the European data, and only in the operational environment data sets from NATS and Isavia, a negligible percentage. In the US data, however, its more than 7000 occurrences constitute a significant percentage.
The 10 digits from "zero" to "nine" cover 42% of all words observed in the European DLR data set. In the MITRE data set, the same digits comprise only 29% of all spoken words, even when "niner", etc., are also counted. Our hypothesis for this gap is that ATCos and pilots do not limit themselves to the ten digits recommended by ICAO [28]; they also use group-form number words such as "ten", "twenty", "thirteen", "fourteen", "hundred", "thousand", etc. When these additional number words are summed together with "zero" through "nine", numerical words comprise 40% of all words spoken in the US data, as shown in Table 17. The words "hundred" and "thousand", which are recommended by ICAO, have nearly the same frequency in the MITRE and DLR data sets. The combined occurrences of the words for 11 through 90 are negligible in DLR's data set, summing to only 0.6% of the words spoken, whereas in the MITRE data sets they sum to over 10%. Furthermore, analysis of the US data set by speaker showed that ATCos and pilots used group-form numbers about equally often, so the difference in group-form word occurrence between the US and European data sets can be attributed to regional differences in word usage, not to speaker role.
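The digit and group-form shares discussed above can be reproduced from raw transcriptions by partitioning the vocabulary into two word sets and summing their counts. The sketch below assumes a flat token list; the group-form word list is an illustrative subset, not an exhaustive one:

```python
from collections import Counter

DIGITS = {"zero", "one", "two", "three", "four", "five",
          "six", "seven", "eight", "nine"}
# Group-form number words as discussed in the text (illustrative subset).
GROUP_FORM = {"ten", "eleven", "twelve", "thirteen", "fourteen", "twenty",
              "thirty", "forty", "fifty", "sixty", "seventy", "eighty",
              "ninety", "hundred", "thousand"}

def numeric_shares(tokens):
    """Return (digit share, digit + group-form share) of all spoken words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    digit_count = sum(n for w, n in counts.items() if w in DIGITS)
    group_count = sum(n for w, n in counts.items() if w in GROUP_FORM)
    return digit_count / total, (digit_count + group_count) / total

# Toy transmission of eight tokens, not real corpus data.
tokens = ["climb", "flight", "level", "three", "four", "zero", "ten", "thousand"]
print(numeric_shares(tokens))  # → (0.375, 0.625)
```

On the MITRE corpus, the first value corresponds to the reported 29% and the second to the 40% shown in Table 17.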
Also remarkable is how small a share of the overall vocabulary the most frequently occurring words represent. Table 18 summarizes the top occurring words that make up 95% of the words in the data set and the percentage of the vocabulary they represent. These 551 distinct words, which cover 95% of the corpus, include all the numbers and letters but not most of the airline, ATC facility, and waypoint names. This 551-word set is about 7.61% of the data set's 7236-word distinct vocabulary, which means that the remaining 92.39% of the distinct words account for only 5% of the corpus in terms of occurrence.
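The 7.61% figure follows directly from the two reported counts; as a quick arithmetic check (the function name is ours, for illustration only):

```python
def vocabulary_share(words_for_95, vocabulary_size):
    """Fraction of the distinct vocabulary needed to cover 95% of the corpus."""
    return words_for_95 / vocabulary_size

# Counts reported for the MITRE data set in the text.
share = vocabulary_share(551, 7236)
print(f"{share:.2%}")  # → 7.61%
```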
This last statistic illustrates one of the biggest challenges for ASRU in the ATC domain: the large variety of distinct waypoint, airline, and airport names relevant to understanding is hard to recognize correctly because these names occur rarely in the data set. They occur rarely because a training corpus for ASR or semantic parsing is often deliberately varied, i.e., collected from many facilities and regions, to improve robustness and reduce overfitting. The geographical spread of the audio data sources, while improving general robustness, dilutes the observation frequency of regional waypoint, airline, and facility names. This scarcity of a large percentage of the vocabulary in the training data leads to misrecognition of these words, and to misinterpretation, unless deliberate action is taken to correct or improve their detection.
The findings of this analysis lead to our conclusion that although the methods and tools for developing and measuring ASRU performance can be shared across regions (e.g., between the US and Europe), the specific models built for specific regions would likely not work well across regions.

Conclusions
This paper built on our comparative analysis of the two ontologies in [3] in two ways. First, it describes the impact of ontologies on collaboration on data, models, and applications. We described several ways in which an ATC ontology is critical to facilitating collaboration between researchers and to appropriately evaluating ASRU applications in the ATM domain, using specific applications as examples to illustrate how an ontology facilitates the development of metrics targeted at the application.
Second, this paper presents a word-level comparison of US and European ATC speech, specifically focusing on similarities and differences in the types of words. Although there are significant similarities (e.g., in both regions, digits make up the top 10 most spoken words), there are also significant differences (e.g., the frequency of group-form numbers). This analysis leads to our conclusion that whereas the methods and tools for developing and measuring ASRU performance can be shared across regions (e.g., between the US and Europe), the specific models built for the different regions would likely not work well across regions.
Future work is needed to make methods and tools more shareable between ontologies. This effort could involve modifying one or both ontologies and/or creating translation mechanisms to automatically convert data from one ontology to the other. Ultimately, research funding is critical to determining the effective and available paths forward.

Data Availability Statement: Data can be made available by contacting the corresponding author, provided the data is not protected by, e.g., GDPR or other contracts.
Acknowledgments: Many thanks to Yuan-Jun Wei and Weiye Ma for their curation of the MITRE ASR corpus. MITRE also acknowledges the support from the FAA for the research funding and for allowing access to recorded ATCo-pilot live-operations radio voice communications.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.