Challenges in Automating Maze Detection



Introduction
Assessing a child's linguistic abilities is a critical component of diagnosing developmental disorders such as Specific Language Impairment or Autism Spectrum Disorder, and of evaluating progress made with remediation. Structured instruments ("tests") that elicit brief, easy-to-score responses to a sequence of items are a popular way of performing such assessment. An example of a structured instrument is the CELF-4, which includes nineteen multi-item subtests with tasks such as object naming, word definition, reciting the days of the week, or repeating sentences (Semel et al., 2003). Over the past two decades, researchers have discussed the limitations of standardized tests and how well they tap into different language impairments. Many have advocated the potential benefits of language sample analysis (LSA) (Johnston, 2006; Dunn et al., 1996). The analysis of natural language samples may be particularly beneficial for language assessment in ASD, where pragmatic and social communication issues are paramount yet may be hard to assess in a conventional test format (Tager-Flusberg et al., 2009).
At present, the expense of LSA prevents it from being more widely used. Heilmann (2010), while arguing that LSA is not too time-consuming, estimates that each minute of spoken language takes five minutes to transcribe and annotate manually. At this rate, it is clearly impractical for clinicians to perform LSA on hours of speech. Techniques from natural language processing could be used to build tools that automatically annotate transcripts, thus facilitating LSA.
Here, we evaluate the utility of a set of annotated corpora for automating a key annotation in the de facto standard annotation schema for LSA: the Systematic Analysis of Language Transcripts (SALT) (Miller et al., 2011). SALT comprises a scheme for coding transcripts of recorded speech, together with software that tallies these codes, computes scores describing utterance length and error counts, among a range of other standard measures, and compares these scores with normative samples. SALT codes indicate bound morphemes, several types of grammatical errors (for example, using a pronoun of the wrong gender or case), and mazes, which are defined as "filled pauses, false starts, and repetitions and revisions of words, morphemes and phrases" (Miller et al., 2011, p. 48).
Mazes have sparked interest in the child language disorders literature for several reasons. They are most often analyzed from a language processing perspective, where the disruptions are viewed as a consequence of monitoring, detecting and repairing language, potentially including speech errors (Levelt, 1993; Postma and Kolk, 1993; Rispoli et al., 2008). Several studies have found that as grammatical complexity and utterance length increase, the number of mazes increases in typically developing children and children with language impairments (MacLachlan and Chapman, 1988; Nippold et al., 2008; Reuterskiöld Wagner et al., 2000; Wetherell et al., 2007). Mazes in narrative contexts have been shown to differ between typical children and children with specific language impairment (MacLachlan and Chapman, 1988; Thordardottir and Weismer, 2001), though others have not found reliable group differences (Guo et al., 2008; Scott and Windsor, 2000). Furthermore, beyond the potential usefulness of looking at mazes in themselves, mazes always have to be detected and excluded in order to calculate other standard LSA measures such as mean length of utterance and type or token counts. Mazes also must be excluded when analyzing speech errors, since some mazes are in fact self-corrections of language or speech errors.
Thus, automatically delimiting mazes could be clinically useful in several ways. First, if mazes can be automatically detected, standard measures such as token and type counts can be calculated with ease, as noted above. Automatic maze detection could also be a first processing step for automatically identifying errors: error codes cannot appear in mazes, and certain grammatical errors may be easier to identify once mazes have been excised. Finally, after mazes have been identified, further analysis of the mazes themselves (e.g. the number of words in mazes, and the placement of mazes in the sentence) can provide supplementary information about language formulation abilities and word retrieval abilities (Miller et al., 2011, pp. 87-89).
We use the corpora included with the SALT software to train maze detectors. These are the corpora that the software uses to compute reference counts. These corpora share several characteristics we expect to be typical of clinical data: they were collected under a diverse set of circumstances; they were annotated by different groups; the annotations ostensibly follow the same guidelines; and the annotations were not designed with automation in mind. We investigate whether we can extract usable generalizations from the available data, and explore how well the automated system performs, which will be of interest to clinicians looking to expedite LSA.

Background
Here we provide an overview of SALT and maze annotations. We are not aware of any attempts to automate maze detection, although maze detection closely resembles the well-established task of edited word detection. We also provide an overview of the corpora included with the SALT software, which are the ones we will use to train maze detectors.

SALT and Maze Annotations
The approach used in SALT has been in wide use for nearly 30 years (Miller and Chapman, 1985), and now also exists as a software package providing transcription and coding support, along with tools for aggregating statistics for manual codes over the annotated corpora and comparing them with age norms. The SALT software is not the focus of this investigation, so we do not discuss it further.
Following the SALT guidelines, speech should be transcribed orthographically and verbatim. The transcript must include and indicate: the speaker of each utterance, partial words or stuttering, overlapping speech, unintelligible words, and any nonspeech sounds from the speaker. Even atypical language, for example neologisms (novel words) or grammatical errors (for example 'her went'), should be written as such.
There are three broad categories of SALT annotations: indicators of 1) certain bound morphemes, 2) errors, and 3) mazes. In general, verbal suffixes that are visible in the surface form (for example -ing in "going") and clitics that appear with an unmodified root (so for example -n't in "don't", but not the -n't in "won't") must be indicated. SALT includes various codes to indicate grammatical errors including, but not limited to: overgeneralization errors ("goed"), extraneous words, omitted words or morphemes, and inappropriate utterances (e.g. answering a yes/no question with "fight"). For more information on these standard annotations, we refer the reader to the SALT manual (Miller et al., 2011).
Here, we are interested in automatically delimiting mazes. In SALT, filled pauses, repetitions and revisions are included in the umbrella term "mazes", but the manual does not include definitions for any of these categories. In SALT, mazes are simply delimited by parentheses; they have no internal structure, and cannot be nested. Contiguous spans of maze words are delimited by a single set of parentheses, as in the following utterance:

(1) (You have you have um there/'s only) there/'s ten people

To be clear, we define the task of automatic maze detection as taking unannotated transcripts of speech as input, and then outputting a binary tag for each word that indicates whether or not it is in a maze.
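As an illustration of the target output, the following minimal Python sketch converts a SALT-style utterance into per-word binary maze tags. It assumes that parentheses are used only as maze delimiters and that tokens are separated by whitespace; the function name maze_tags is hypothetical and is not part of the SALT software.

```python
def maze_tags(utterance: str) -> list[tuple[str, int]]:
    """Return (word, in_maze) pairs for a SALT-style utterance.

    Assumes parentheses only delimit mazes and are never nested.
    """
    tagged = []
    in_maze = False
    # Pad parentheses with spaces so "(You" and "only)" split cleanly.
    padded = utterance.replace("(", " ( ").replace(")", " ) ")
    for token in padded.split():
        if token == "(":
            in_maze = True
        elif token == ")":
            in_maze = False
        else:
            tagged.append((token, int(in_maze)))
    return tagged


example = "(You have you have um there/'s only) there/'s ten people"
tags = maze_tags(example)
for word, flag in tags:
    print(flag, word)
```

Applied to example (1), the first seven words are tagged 1 (in a maze) and the last three are tagged 0. A maze detector has to predict these tags from the words alone, without the parentheses.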

Edited Word Detection
Although we are not aware of any previous work on automating maze detection, there is a well-established task in natural language processing that is quite similar: edited word detection. The goal of edited word detection is to identify words that have been revised or deleted by the speaker, for example 'to Dallas' in the utterance 'I want to go to Dallas, um I mean to Denver.' Many investigations have approached edited word detection from what Nakatani et al. (1993) have termed a 'speech-first' perspective, meaning that edited word detection is performed with features from the speech signal in addition to a transcript. These approaches, however, are not applicable to the SALT corpora, because the corpora contain only transcripts. As a result, we must adopt a text-first approach to maze detection, using only features extracted from a transcript.
The text-first approach to edited word detection is well established. One of the first investigations taking a text-first approach was conducted by Charniak and Johnson (2001), who used boosted linear classifiers to identify edited words. Later, Johnson and Charniak (2004) improved upon the linear classifiers' performance with a tree adjoining grammar based noisy channel model. Zwarts and Johnson (2011) improved the noisy channel model by adding a reranker that leverages features extracted with the help of a large language model.
Qian and Liu (2013) have developed what is currently the best-performing edited word detector, and it takes a text-first approach. Unlike the detector proposed by Zwarts and Johnson, Qian and Liu's does not rely on any external data. Their detector operates in three passes. In the first pass, filler words ('um', 'uh', 'I mean', 'well', etc.) are detected. In the second and third passes, edited words are detected. The reason for the three passes is that, in addition to extracting features (mostly words and part of speech tags) from the raw transcript, the second and third steps use features extracted from the output of previous steps. An example of such features is adjacent words from the utterance with filler words and some likely edited words removed.

Overview of SALT Corpora
We explore nine corpora included with the SALT software. Table 1 gives a high-level overview of these corpora, showing where each was collected, the age ranges of the speakers, and the size of each corpus in terms of both transcripts and utterances. Note that only utterances spoken by the child are counted, as we discard all others.
Table 1 shows several divisions among the corpora. We see that one group of corpora comes from New Zealand, while the majority come from North America. All of the corpora, except for EXPOSITORY, include children at very different stages of language development.
Four research groups were responsible for the transcriptions and annotations of the corpora in Table 1. One group produced the CONVERSATION, EXPOSITORY, NARRATIVESSS, and NARRATIVESTORYRETELL corpora. Another was responsible for all of the corpora from New Zealand. Finally, the ENNI and GILLAMNT corpora were transcribed and annotated by two different groups. For more details on these corpora, how they were collected, and the annotators, we refer the reader to the SALT website at http://www.saltsoftware.com/resources/databases.html.
Some basic inspection reveals that the corpora can be put into three groups based on the median utterance lengths and the distribution of utterance lengths, following the groups in Figure 1, with the EXPOSITORY and CONVERSATION corpora in their own groups. Note that the counts in Figure 1 are of all of the words in each utterance, including those in mazes. We see that the corpora in Group A have a modal utterance length ranging from seven to ten words. There are many utterances in these corpora that are shorter or longer than the median length. Compared to the corpora in Group A, those in Group B have a shorter modal utterance length, and fewer long utterances. In Figure 1, we see that the CONVERSATION corpus consists mostly of very short utterances. At the other extreme is the EXPOSITORY corpus, which resembles the corpora in Group A in terms of modal utterance length, but which generally contains longer utterances than any of the other corpora.

Maze Detector
We carry out our experiments in automatic maze detection using a statistical maze detector that learns to identify mazes from manually labeled data using features extracted from words and automatically predicted part of speech tags. The maze detector uses the feature set shown in Table 2. This set of features is identical to the ones used by the 'filler word' detector in Qian and Liu's disfluency detector (2013). We also use the same classifier as the second and third steps of their system: the Max Margin Markov Network ('M3N') classifier in the pocketcrf toolkit (available at http://code.google.com/p/pocketcrf/). The M3N classifier is a kernel-based classifier that is able to leverage the sequential nature of the data in this problem (Taskar et al., 2003). We use the following label set: S-O (not in maze); S-M (single word maze); B-M (beginning of multi-word maze); I-M (in multi-word maze); and E-M (end of multi-word maze). The M3N classifier allows us to set a unique penalty for each pair of confused labels, for example penalizing an erroneous prediction of S-O (failing to identify maze words) more heavily than spurious predictions of maze words (all -M labels). This ability is particularly useful for maze detection because maze words are so infrequent compared to words that are not in mazes.

Table 2: Feature templates for maze word detection, following Qian and Liu (2013). We extract all of these features from both words and POS tags, albeit separately. t0 indicates the current word or POS tag, while t−1 is the previous one and t1 is the following one. The function I(a, b) is 1 if a and b are identical, and otherwise 0. y−1 is the tag predicted for the previous word.

Figure 1: Histograms of utterance length (including words in mazes) in SALT corpora. Panel (c): Others.
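The five-label scheme used by the maze detector can be derived mechanically from per-word binary maze tags. The sketch below is one way to do this in Python; the function name to_labels is hypothetical, and it assumes that adjacent maze words always belong to the same maze, consistent with the SALT convention that contiguous maze words share a single set of parentheses.

```python
def to_labels(in_maze: list[int]) -> list[str]:
    """Map binary maze flags to the S-O / S-M / B-M / I-M / E-M label set."""
    labels = []
    n = len(in_maze)
    for i, flag in enumerate(in_maze):
        if not flag:
            labels.append("S-O")  # word is not in a maze
            continue
        starts = i == 0 or not in_maze[i - 1]    # no maze word to the left
        ends = i == n - 1 or not in_maze[i + 1]  # no maze word to the right
        if starts and ends:
            labels.append("S-M")  # single-word maze
        elif starts:
            labels.append("B-M")  # beginning of multi-word maze
        elif ends:
            labels.append("E-M")  # end of multi-word maze
        else:
            labels.append("I-M")  # inside multi-word maze
    return labels


# "(and then it oh) and then it (um) put his wings out"
flags = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
print(to_labels(flags))
```

Collapsing every -M label back to 1 and S-O back to 0 recovers the binary tags, so the richer label set adds boundary information without changing the underlying task.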

Evaluation
We split each SALT corpus into training, development, and test partitions. Each training partition contains 80% of the utterances in the corpus, while the development and test partitions each contain 10% of the utterances. We use the development portion of each corpus to set the penalty matrix so that the system roughly balances precision and recall.
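A minimal Python sketch of an 80/10/10 partition is given below. The text does not say how utterances were assigned to partitions, so the seeded random shuffle at the utterance level is an assumption, as is the function name split_corpus.

```python
import random


def split_corpus(utterances, seed=0):
    """Split utterances into 80% train, 10% development, 10% test.

    Assumption: utterances are shuffled at random with a fixed seed;
    the actual assignment procedure is not described in the text.
    """
    items = list(utterances)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10  # integer arithmetic avoids float rounding
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # remainder goes to the test partition
    return train, dev, test


train, dev, test = split_corpus(range(100))
print(len(train), len(dev), len(test))
```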
We evaluate maze detection in terms of both tagging performance and bracketing performance, both of which are standard forms of evaluation for various tasks in the Natural Language Processing literature. Tagging performance captures how effectively maze detection is done on a word-by-word basis, while bracketing performance describes how well each maze is identified in its entirety. For both tagging and bracketing performance, we count the number of true and false positives and negatives, as illustrated in Figure 2. In tagging performance, each word gets counted once, while in bracketing performance we compare the predicted and observed maze spans. We use these counts to compute precision (P), recall (R), and F1 score:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R}$$

Note that partial words and punctuation are both ignored in evaluation. We exclude punctuation because punctuation does not need to be included in mazes: it is not counted in summary statistics (e.g. MLU, word count, etc.), and punctuation errors are not captured by the SALT error codes. We exclude partial words because they are always in mazes, and therefore can be detected trivially with a simple rule. Furthermore, because partial words are excluded from evaluation, the performance metrics are comparable across corpora, even if they vary widely in the frequency of partial words.
For both space and clarity, we do not present the complete results of every experiment in this paper, although they are available online (see footnote 3). Instead, we present the complete baseline results, and then report F1 scores that are significantly better than the baseline. We establish statistical significance by using a randomized paired-sample test (see Yeh (2000) or Noreen (1989)) to compare the baseline system (system A) and the proposed system (system B). First, we compute the difference d in F1 score between systems A and B. Then, we repeatedly construct a random set of predictions for each input item by choosing between the outputs of systems A and B with equal probability. We compute the F1 score of these random predictions, and if it exceeds the F1 score of the baseline system by at least d, we count the iteration as a success. The significance level is at most the number of successes divided by one more than the number of trials (Noreen, 1989).
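The two evaluation modes can be sketched in Python as follows. The sketch assumes that a predicted maze span counts as a true positive only if both of its boundaries exactly match a gold span, which is consistent with the example in Figure 2; all function names are hypothetical. The input is the Figure 2 utterance with punctuation already removed.

```python
def spans(flags):
    """Return the set of (start, end) index pairs of maximal maze runs."""
    found, start = set(), None
    for i, flag in enumerate(list(flags) + [0]):  # sentinel closes a final run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            found.add((start, i))
            start = None
    return found


def tagging_counts(pred, gold):
    """Word-level (TP, FP, FN) counts."""
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    return tp, fp, fn


def bracketing_counts(pred, gold):
    """Span-level (TP, FP, FN) counts under exact boundary match."""
    ps, gs = spans(pred), spans(gold)
    return len(ps & gs), len(ps - gs), len(gs - ps)


def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Words: and then it oh and then it um put his wings out
pred = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
gold = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
print(prf(*tagging_counts(pred, gold)))
print(prf(*bracketing_counts(pred, gold)))
```

On this utterance the tagging counts are 4 true positives and 1 false negative, while the bracketing counts are 1 true positive, 1 false positive, and 1 false negative: missing a single word at a maze boundary costs one word in tagging but invalidates the whole span in bracketing.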
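The randomized paired-sample test can be sketched in Python as below. Several details are assumptions rather than facts from the text: an "input item" is taken to be an utterance, F1 is computed at the word (tagging) level, and the reported estimate is (successes + 1) / (trials + 1), the conservative form usually attributed to Noreen (1989). The names f1_score and randomized_test are hypothetical.

```python
import random


def f1_score(pred, gold):
    """Word-level F1 over flat lists of binary maze tags."""
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def randomized_test(out_a, out_b, gold, trials=1000, seed=0):
    """Compare baseline A with system B; inputs are per-utterance tag lists.

    Returns the observed F1 difference d and an estimated significance level.
    """
    def flat(seqs):
        return [tag for seq in seqs for tag in seq]

    rng = random.Random(seed)
    gold_flat = flat(gold)
    f1_a = f1_score(flat(out_a), gold_flat)
    d = f1_score(flat(out_b), gold_flat) - f1_a
    successes = 0
    for _ in range(trials):
        # For each utterance, pick the output of A or B with equal probability.
        mixed = [a if rng.random() < 0.5 else b for a, b in zip(out_a, out_b)]
        if f1_score(flat(mixed), gold_flat) - f1_a >= d:
            successes += 1
    # Assumption: conservative estimate with +1 in numerator and denominator.
    return d, (successes + 1) / (trials + 1)


gold = [[1, 0], [0, 1], [1, 1]]
system_a = [[0, 0], [0, 0], [1, 1]]
system_b = [[1, 0], [0, 1], [1, 1]]
d, p = randomized_test(system_a, system_b, gold, trials=200)
print(d, p)
```

With only three utterances the test has almost no power, so the toy data is useful only for checking that the procedure runs; a real comparison would use the full development set and many more trials.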

Baseline Results
For each corpus, we train the maze detector on the training partition and test it on the development partition. The results of these runs are in Table 3, which also includes the rank of the size of each corpus (1 = biggest, 9 = smallest). We see immediately that our maze detector performs far better on some corpora than on others, both in terms of tagging and bracketing performance. We note that maze detection performance is not solely determined by corpus size: tagging performance is substantially worse on the largest corpus (CONVERSATION) than on the smallest (NZSTORYRETELL).

3 http://bit.ly/1dtFTPl

Generic Model
We train a generic model for maze detection on all of the training portions of the nine SALT corpora. We use the combined development sections of all of the corpora to tune the loss matrix for balanced precision and recall. We then test the resulting model on the development section of each SALT corpus, and evaluate in terms of tagging and bracketing accuracy. We find that the generic model performs worse than the baseline in terms of both tagging and bracketing performance on six of the nine corpora. The generic model significantly improves tagging (F1=0.925, p ≤ 0.0022) on the NZSTORYRETELL corpus, but the improvement in bracketing performance is not significant (p ≤ 0.1635). There is improvement of both tagging (F1=0.805, p ≤ 0.0001) and bracketing (F1=0.677, p ≤ 0.0025) performance on the NARRATIVESSS corpus. The generic model does not perform better than the baseline corpus-specific models on any other corpora.
The poor performance of the generic model is somewhat surprising, as it is trained with far more data than any of the corpus-specific models. In many tasks in natural language processing, increasing the amount of training data improves the resulting model, although this is not necessarily the case if the additional data is noisy or out-of-domain. This suggests two possibilities: 1) the language in the corpora varies substantially, perhaps due to the speakers' ages or the activity that was transcribed; and 2) the maze annotations are inconsistent between corpora.

Multi-Corpus Models
It is possible that the poor performance of the generic model relative to the baseline corpus-specific models can be attributed to systematic differences between the SALT corpora. We may be able to train a model for a set of corpora that share particular characteristics that can outperform the baseline models, because such a model could leverage more training data. We first evaluate a model for corpora that contain transcripts collected from children of similar ages. We also evaluate task-specific models, specifically a maze-detection model for story retellings, and another for conversations. These two types of models could perform well if children of similar ages or performing similar tasks produce mazes in a similar manner. Finally, we train models for each group of annotators to see whether systematic variation in annotation standards between research groups could be responsible for the generic model's poor performance.
We train all of these models similarly to the generic model: we pool the training sections of the selected corpora, train the model, then test on the development section of each selected corpus. We use the combined development sections of the selected corpora to tune the penalty matrix to balance precision and recall.
Again, we only report F1 scores that are higher than the baseline model's, and we test whether the improvement is statistically significant. We do not report results where just the precision or just the recall exceeds the baseline model performance, but not F1, because these are typically the result of model imbalance, favoring precision at the expense of recall or vice versa. Bear in mind that we roughly balance precision and recall on the combined development sets, not each corpus's development set individually.

Age-Specific Model
We train a single model on the following corpora: ENNI, GILLAMNT, NARRATIVESSS, and NARRATIVESTORYRETELL. As shown in Table 1, these corpora contain transcripts collected from children roughly aged 4-12. On three of the four corpora, the age-based model performs worse than the baseline. The only exception is NARRATIVESTORYRETELL, for which the age-based model outperforms the baseline in terms of both tagging (F1=0.794, p ≤ 0.0673) and bracketing (F1=0.679, p ≤ 0.0062).

Task-Specific Models
We construct two task-specific models for maze detection: one for conversations, and the other for narrative tasks. A conversational model trained on the CONVERSATION and NZCONVERSATION corpora does not improve performance on either corpus relative to the baseline. A model for narrative tasks trained on the ENNI, GILLAMNT, NARRATIVESSS, NARRATIVESTORYRETELL, NZPERSONALNARRATIVE and NZSTORYRETELL corpora only improves performance on one of these, relative to the baseline. Specifically, the narrative task model improves performance on the NARRATIVESSS corpus both in terms of tagging (F1=0.797, p ≤ 0.0005) and bracketing (F1=0.693, p ≤ 0.0002).

Research Group-Specific Models
There are two groups of researchers that have annotated multiple corpora: a group in New Zealand, which annotated the NZCONVERSATION, NZPERSONALNARRATIVE, and NZSTORYRETELL corpora; and another group in Wisconsin, which annotated the CONVERSATION, EXPOSITORY, NARRATIVESSS, and NARRATIVESTORYRETELL corpora. We trained research group-specific models, one for each of these groups.
Overall, these models do not improve performance. The New Zealand research group model does not significantly improve performance on any of the corpora they annotated, relative to the baseline. The Wisconsin research group model yields significant improvement on the NARRATIVESSS corpus, both in terms of tagging (F1=0.803, p ≤ 0.0001) and bracketing (F1=0.699, p ≤ 0.0001) performance. Performance on the CONVERSATION and EXPOSITORY corpora is lower with the Wisconsin research group model than with the corpus-specific baseline models, while performance on NARRATIVESTORYRETELL is essentially the same with the two models.

Discussion
We compared corpus-specific models for maze detection to more generic models applicable to multiple corpora, and found that the generic models performed worse than the corpus-specific ones. This was surprising because the more generic models were able to leverage more training data than the corpus-specific ones, and more training data typically improves the performance of data-driven models such as our maze detector. These results strongly suggest that there are substantial differences between the nine SALT corpora.
We suspect there are many areas in which the SALT corpora diverge from one another. One such area may be the nature of the language: perhaps the language differs so much between each of the corpora that it is difficult to learn a model appropriate for one corpus from any of the others. Another potential source of divergence is in transcription, which does not always follow the SALT guidelines (Miller et al., 2011). Two of the idiosyncrasies we have observed are: more than three X's (or a consonant followed by multiple X's) to indicate unintelligible language, instead of the conventional X, XX, and XXX for unintelligible words, phrases, and utterances, respectively; and non-canonical transcriptions of what appear to be filled pauses, including 'uhm' and 'umhm'. These idiosyncrasies could be straightforward to normalize using automated methods, but doing so requires that they be identified to begin with. Furthermore, although these idiosyncrasies may appear to be minor, taken together they may actually be substantial.
Another likely source of variation between corpora is the maze annotations themselves. SALT's definition of mazes, "filled pauses, false starts, and repetitions and revisions of words, morphemes and phrases" (Miller et al., 2011, p. 48), is very short, and none of the components is defined in the SALT manual. In contrast, the Disfluency Annotation Stylebook for Switchboard Corpus (Meteer et al., 1995) describes a system of disfluency annotations over approximately 25 pages, devoting two pages to filled pauses and five to restarts. The Switchboard disfluency annotations are much richer than SALT maze annotations, and we are not suggesting that they are appropriate for a clinical setting. However, given the stark contrast in detail between the two annotation systems' guidelines, and our finding that cross-corpus models for maze detection perform poorly, we recommend that SALT's definition of mazes and their components be elaborated and clarified. This would be of benefit not just to those trying to automate the application of SALT annotations, but also to clinicians who use SALT and depend upon consistently annotated transcripts.
There are two clear tasks for future research that build upon these results. First, maze detection performance can surely be improved. We note, however, that evaluating maze detectors in terms of F1 score may not always be appropriate if such a detector is used in a pipeline. For example, there may be a minimum acceptable level of precision for a maze detector used as a preprocessing step before applying SALT error codes, so that maze excision does not create additional errors. In such a scenario, the goal would be to maximize recall at a given level of precision.
The second task suggested by this paper is to explore the hypothesized differences within and between corpora. Such exploration could ultimately result in more rigorous, communicable guidelines for maze annotations, as well as other annotations and conventions in SALT. If there are systematic differences in maze annotations across the SALT corpora, such exploration could suggest ways of making the annotations consistent without completely redoing them.
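To make the alternative objective concrete, the Python sketch below selects, from a set of candidate detector settings, the one with the highest recall among those that meet a precision floor. The candidate settings and their precision and recall values are invented purely for illustration, and the function name best_operating_point is hypothetical.

```python
def best_operating_point(points, min_precision):
    """Pick the (setting, precision, recall) tuple with the highest recall
    among those whose precision is at least min_precision.

    Returns None if no candidate meets the precision floor.
    """
    eligible = [pt for pt in points if pt[1] >= min_precision]
    if not eligible:
        return None
    return max(eligible, key=lambda pt: pt[2])


# Hypothetical development-set results for three penalty settings.
candidates = [
    ("penalty=1", 0.95, 0.60),
    ("penalty=2", 0.90, 0.72),
    ("penalty=4", 0.80, 0.85),
]
print(best_operating_point(candidates, min_precision=0.90))
```

Under this criterion the setting with the best F1 score is not necessarily chosen: in the invented data above, the third setting has the highest F1 but falls below the precision floor.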

Figure 2: Tagging and bracketing evaluation for maze detection. TP = True Positive, FP = False Positive, TN = True Negative, FN = False Negative.

Pred.:  ( and then it ) oh and then it ( um ) put his wings out .
Gold:   ( and then it oh ) and then it ( um ) put his wings out .
Tag:    TP ×3, FN, TN ×3, TP, TN ×4
Brack.: FP, FN, TP

Table 1: Description of SALT corpora

Table 3: Baseline maze detection performance on development sections of SALT corpora: corpus-specific models