Interpreting Architecture: The ARCHINT Corpus

This article introduces the ARCHINT (Architecture in Interpreting) Corpus, a parallel, bilingual, unidirectional corpus consisting of ten video-recorded conference presentations in English, their simultaneous interpretations into Spanish, and their transcriptions. The goals of this paper are to describe the ARCHINT Corpus, to review its past and current applications, and to suggest future developments and research possibilities. In addition, the article situates ARCHINT within Corpus-based Interpreting Studies (CIS).


Introduction
ARCHINT is a parallel, bilingual, unidirectional corpus consisting of video recordings and transcripts of ten simultaneous interpretations in Spanish paired with their source speeches in English. It is an authentic, domain-specific, machine-readable corpus, consisting of almost twenty recorded hours and over 129,000 words. The speeches were recorded live at the third and fifth editions of the Encontros de Arquitectura, a three-day international conference on Architecture that was held in Spain every two years between 1999 and 2007. Acknowledged by the press as one of the most relevant cultural events in the Iberian Peninsula, the conference brought together academics, students and professionals alike, and became an international forum for the discussion of the ideas, narratives and motivations that lie behind some of the most acclaimed twenty-first century architectural masterpieces.
ARCHINT was created as part of a PhD research project exploring the concept of terminological accuracy in simultaneous interpreting from a theoretical and an empirical perspective (Cabrera, 2015). Currently, it is being used as an observational tool to identify factors that make accuracy difficult to achieve.
The goals of this paper are to describe the corpus, to review its past and current applications and to suggest future developments and research applications. The paper is organized as follows: Section 1 presents an overview of Corpus-based Interpreting Studies, a research paradigm that is essential for an understanding of the theoretical background behind the development of interpreting corpora. Sections 2 and 3 provide a synthesis of the compilation process and a description of the corpus. In Section 4, an overview of past and current applications is presented. Finally, Section 5 highlights the conclusions that can be derived from the compilation process and the corpus applications. In addition, it lists future developments and outlines plans for future research.

Corpus-based Interpreting Studies
Corpus-based Interpreting Studies (CIS) is a research paradigm developed over the past twenty years that has adopted corpus-based techniques and methodologies for the empirical study of interpreting.
Although in theory this paradigm applies to both the simultaneous and the consecutive mode, most studies developed to date relate to oral simultaneous interpreting.
The idea of using contemporary corpus-based techniques to study interpreting corpora with IT software was first put forward by Shlesinger (1998) under the influence of the work initiated by Baker (1995, 1996) and Laviosa (1998) in the applied field of Translation Studies. In her article titled Corpus-based Interpreting Studies as an Offshoot of Corpus-based Translation Studies, Shlesinger paved the way for the exploitation of large interpreting corpora with specialized software, synthesized the unique characteristics of this mode in terms of orality, multilingualism, situatedness and immediacy, and delineated the unique challenges of interpreting data transcription (Shlesinger, 1998: 4-5).
Over the last ten years, CIS has grown rapidly and steadily. This growth is reflected both in the increasing number of corpus-based interpreting studies that have been developed so far and in the proliferation of large-scale simultaneous interpreting corpora.
One of the largest machine-readable interpreting corpora to date is the European Parliament Interpreting or EPIC Corpus. This ready-made, parallel and trilingual corpus with texts in Italian, English and Spanish was developed by researchers at the University of Bologna, Italy (Bendazzoli and Sandrelli, 2005, 2009). It consists of European Parliament speeches and their simultaneous interpretations. Among other applications, it has been used to explore the effects of directionality and, following Baker's work on the study of Translation as a variety of language behavior (Baker, 1995), it has also been applied to the observation of the unique features of interpreting as a language variety.
The Interpreting Football Press Conferences or FOOTIE Corpus (Sandrelli, 2012) is one of the few domain-specific, specialized corpora to date. It comprises the recordings and transcripts of sixteen European football championship press conferences and their interpretations. It is highly homogeneous, as all the texts come from the same non-institutional setting and the same type of communicative event (UEFA, EURO 2008). It is also a multilingual corpus, with texts available in Italian, English, French and Spanish. Currently, FOOTIE is being used to look into specific features of press conferences that make them particularly challenging for interpreters, to perform studies on lexical aspects such as lexical density and variety, and to further explore interpreting strategies in this setting.
The Quality Evaluation in Simultaneous Interpreting or ECIS Corpus is also a large, ready-made, multilingual (English, Spanish, French, German) corpus available in audio-video and written format. It consists of European Parliament debates and their simultaneous interpretations. It was developed by researchers at the University of Granada, Spain, and it is being used as an observational tool in experimental research on quality and performance-related factors in simultaneous interpreting (Collados Aís et al., 2007; Collados Aís et al., 2011).
Finally, the Center for Integrated Acoustic Information Research or CIAIR Corpus is an experimental, parallel and bilingual corpus with texts in English and Japanese. The corpus was developed by researchers at the University of Nagoya in Japan and it is representative of non-technical academic lectures. Among other applications, it has so far been used as a tool to analyze aspects of the simultaneous interpreting process, such as patterns of segmentation or word-level variations in ear-to-voice span (Setton, 2011: 44). Currently, it is also part of a large-scale research effort towards creating an automatic simultaneous interpreting system (Shimitzu et al., 2014).
In its relatively short life, CIS has contributed positively to Interpreting Studies by providing access to reliable, readily available interpreting corpora in multiple formats, and also by facilitating the development of corpus-based studies. In spite of their heterogeneity, these studies share at least two common traits: the adaptation of corpus-based techniques and methodologies to the study of interpreting, and a research interest in elucidating what simultaneous interpreting is about, how it is performed and which factors are involved. In less than twenty years, the paradigm has reached a stage of maturity, as attested by the publication of specialized monographs (Sergio and Falbo, 2012) and complete overviews (Setton, 2011) and by the organization of specialized workshops (2015 Forli International Workshop), and it is making fast progress towards its consolidation in the field.

Compilation process
ARCHINT consists of ten conference presentations in English, ten simultaneous interpretations into Spanish, and their transcriptions. The conferences were recorded at the third and fifth editions of the Encontros Internacionales de Arquitectura, two three-day conferences on Architecture that were held in 2003 and 2007 respectively. The corpus provides both textual information (Examples 1 and 2) and contextual information (Images 1 and 2). The compilation process was divided into three stages: (i) speech selection, (ii) transcription and (iii) tagging. Each stage is discussed below.

Speech selection
Although access to the full recordings of the third and fifth editions of the conference had been provided by the conference organizers, only those speeches that met the English>Spanish language directionality were taken into consideration (Appendix 1).
As a way to ensure homogeneity in the resulting corpus, three selection criteria applicable to the source speeches were set forth: (i) international projection of the speaker, (ii) degree in Architecture or similar from a reputable institution in an English-speaking country, and (iii) use of English on a regular basis. This information was obtained by conducting a targeted search in personal web pages, public profiles and specialized monographs (e.g., El Croquis, International Architecture Magazine). This step was followed by a detailed description of the selected speeches and their simultaneous interpretations.
For the source speeches, descriptions included linguistic and extra-linguistic information such as register, grammar, and so forth. This information was collected through an observational analysis with the help of a rubric as an assessment and data collection tool (Appendix 2). The rubric listed six linguistic and extra-linguistic components to be graded on a seven-point scale. These were: accent (marked, unmarked), diction (unclear, clear), intonation (monotonous, non-monotonous), voice (unpleasant, pleasant), grammar (incorrect, correct), and style (informal, formal).
For the target speeches, the descriptions included the same type of linguistic and extra-linguistic information, which was collected by following the same procedure, and information on the interpreters' (i) specialized training, (ii) prior preparation and (iii) professionalism. The use of strategies such as reformulation, slicing, simplification, generalization, strategic omissions, summarizing, explanation and anticipation was taken as an indicator of specialized training. In contrast, the use of domain-specific terminology and booth behavior were taken as evidence of prior preparation and professionalism respectively. This information was collected by viewing and assessing ten minutes of each simultaneous interpretation twice, first individually, and the second time by comparing fragments of the source videos with the corresponding transcriptions of the simultaneous interpretations. The fragments of each speech were randomly selected and then analyzed with the help of a specialized interpreting trainer with an extensive background in interpreting research.

Transcription
To ensure compatibility of interpreting corpora with Corpus Linguistics software such as WordSmith Tools, the source and target speeches were transcribed following the EPIC transcription protocol, which prioritizes linguistic features over prosodic and other para- and extra-linguistic features. The language varieties were English (UK) for the source speeches, and Peninsular Spanish for the target speeches. The speeches were segmented into meaning units, which are pieces of information taken from a text that are rich enough to be processed, translated and understood as a single semantic unit. Segmentation was based on intonation and syntactic information. The end of each meaning unit was marked with a double bar (//). Truncated words, silent and non-silent pauses were marked with the signs of #, ( _), (…) and (ehm). Mispronounced words, false starts, anacoluthons, word fillers and repetitions were written as literally as possible. To avoid compatibility problems with specialized software, punctuation signs were omitted (Examples 1 and 2).
The transcription procedure was divided into three stages. First, preliminary drafts of the source and target speeches were prepared. Each draft was then revised twice by the researcher. A fourth and final revision was performed by a specialist with bilingual competence.
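The segmentation conventions described above can be illustrated with a minimal sketch. It is not part of the ARCHINT toolchain: the function names and the sample line are hypothetical, and the exact spelling of the pause markers is an assumption based on how they are printed in this article. Only the double bar (//) as the end-of-unit marker is taken directly from the text.

```python
# Minimal sketch: split a transcript into meaning units and count words.
# Assumptions (not confirmed by the corpus documentation): the pause markers
# appear exactly as printed in the article, and truncated words (marked with #)
# are kept and counted as words.

UNIT_DELIMITER = "//"                     # end-of-unit marker named in the text
PAUSE_MARKERS = ("(ehm)", "(…)", "( _)")  # assumed spellings of pause markers


def split_meaning_units(transcript: str) -> list[str]:
    """Return the non-empty meaning units of a transcript, in order."""
    units = (unit.strip() for unit in transcript.split(UNIT_DELIMITER))
    return [unit for unit in units if unit]


def count_words(unit: str) -> int:
    """Count whitespace-separated words in a unit, ignoring pause markers."""
    cleaned = unit
    for marker in PAUSE_MARKERS:
        cleaned = cleaned.replace(marker, " ")
    return len(cleaned.split())


# Hypothetical transcript line, invented for illustration only.
sample = "the site is (ehm) very narrow // so we pro# proposed a tower //"
units = split_meaning_units(sample)
word_counts = [count_words(unit) for unit in units]
```

Because punctuation is omitted from the transcripts, splitting on whitespace is enough for this illustration; a real analysis would rely on the tokenizer of the corpus software in use.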

Tagging
Tagging refers to the description of the source and target speeches through a system of tags, which are informative labels placed at the top of each speech within a database. Some of the tags in ARCHINT were adopted from EPIC (Sandrelli and Bendazzoli, 2009; Sandrelli et al., 2010). These include speech date, id, language, type, duration, speech length, text length, speed of delivery, words per minute, mode of delivery, speaker, gender, country, mother tongue and comments. Other tags were used in ARCHINT for the first time. These were: academic/working language, content structure, token, types, type/token ratio (TTR) and standardized type/token ratio (STTR) (Example 3).
Speech date refers to the date on which the speech was delivered. Id is an identification number that is assigned to each speech. Language describes whether the speech is in English or Spanish, and type whether the speech is produced by a speaker or an interpreter. Duration classifies the speech as long, medium or short; timing gives its length in minutes and seconds; and text length and length describe the transcript as a category and as a number of words respectively. Speed describes the rate of delivery of a speech, which is further quantified as the number of words per minute (w/min). Mode of delivery is connected with the way the speech has been delivered, which may vary from read to impromptu to a mix of read and impromptu. Speaker, gender and country highlight sociological information about the speakers, and mother tongue and academic/working language indicate whether the language of the speech is the speaker's mother tongue or an academic/working language. Content structure looks at the organization of ideas within a text. The Comments tag was used to further describe the speeches with additional information from the assessment rubric (Appendix 2) and the criteria used for the speech selection (see 3.1. Speech Selection).
Finally, the Type, Token, TTR and STTR values register the amount of repetition in a text. The TTR is obtained by dividing the total number of types (different words) by the total number of tokens (running words), and is expressed as a percentage. The higher the TTR value, the more different words a text has, and vice versa. Because one of the unique characteristics of specialized texts is the greater than usual repetition of terms, phrases, sentences and even full paragraphs (Faber, 2012: 8), a low TTR value is generally taken as an indicator of a specialized text. The STTR values calculate the TTR at regular intervals and are usually higher than the TTR values.

<speech date="07-11-09-m" id="006" lang="en" type="org-en" duration="long" timing="49' 23" text length="long" length="5988" speed="medium-fast" words per minute="121" delivery="impromptu" speaker="Hadid, Zaha" gender="F" country="Irak" mother tongue="no" working language="yes" content structure="introduction, outline of different projects" token="2597.00" types="1020.00" type/token ratio="39.44" standardized type/token ratio="31.17" comments="expert; international prestige and recognition; degree in Architecture from the Architectural Association in London; chair and guest professorships in UK and USA; marked foreign accent; unclear diction; monotonous intonation; informal style; speaker describes slides">
Example 3. Tagging (Zaha Hadid, Interpreter A)
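The ratios behind these tags can be made concrete with a short sketch. It is an illustration under stated assumptions, not the procedure used to build ARCHINT: the function names are hypothetical, the input is assumed to be an already tokenized list of words, and the 1,000-word interval for the STTR is an assumption borrowed from the usual default of corpus software, since the article does not state the interval used.

```python
# Minimal sketch of the type/token measures and delivery speed described above.
# Assumptions: tokens are already extracted and case-normalized; the STTR
# interval (chunk_size) is a free parameter, 1000 words being a common default.


def type_token_ratio(tokens: list[str]) -> float:
    """TTR as a percentage: number of types divided by number of tokens."""
    if not tokens:
        return 0.0
    return 100.0 * len(set(tokens)) / len(tokens)


def standardized_ttr(tokens: list[str], chunk_size: int = 1000) -> float:
    """Mean TTR over consecutive full chunks; a trailing partial chunk is ignored."""
    starts = range(0, len(tokens) - chunk_size + 1, chunk_size)
    chunks = [tokens[i:i + chunk_size] for i in starts]
    if not chunks:
        # Text shorter than one chunk: fall back to the plain TTR.
        return type_token_ratio(tokens)
    return sum(type_token_ratio(chunk) for chunk in chunks) / len(chunks)


def words_per_minute(word_count: int, minutes: int, seconds: int) -> float:
    """Delivery speed from a word count and a duration in minutes and seconds."""
    return word_count / (minutes + seconds / 60)


# Length and timing values taken from the tag in Example 3.
speed = words_per_minute(5988, 49, 23)
```

With the length (5,988 words) and timing (49' 23") given in Example 3, the last function yields roughly 121 words per minute, which agrees with the words per minute tag. Averaging over fixed-size chunks is what makes the STTR comparable across texts of different lengths, whereas the plain TTR falls as a text grows longer.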

Results
The corpus has a total of almost twenty recorded hours (19 h 19 min 32 s), almost 129,000 words, 69,540 tokens and 2,393 types. ARCHINT is also a parallel corpus, and it can be further subdivided into an English subcorpus and a Spanish subcorpus, each of which is described individually in the next subsections.

English subcorpus
Length and duration. The English subcorpus has almost 65,000 words, totaling over nine hours of recorded conference material at an average speed of 112 w/min, a type-token ratio value of 3.36% and a standardized type-token ratio of 29.46%. The average speech duration is fifty-two minutes (Table 1).
Speakers. Selected speeches were delivered by Sandra Hemingway, American architect and project associate at Eisenman Architects; Craig Dykers, principal at Snøhetta; Frank Barkow, from Barkow Leibinger; Roger Diener, Swiss architect and principal at Diener & Diener Architekten; German architect Thomas Herzog, from Herzog+Partner Architekten; Wilfried Wang, from Hoidn Wang Partners; Yung Ho Chang, Chinese architect and principal at FCJZ; Janina Masojada, from Design Workshop in South Africa; Shigeru Ban, from Shigeru Ban Architects with offices in Japan, Paris and New York; Jonathan Sergison, founding partner of Sergison Bates Architects in the UK; and Zaha Hadid, Iraqi architect with offices in London and principal at Zaha Hadid Architects. The gender distribution was four female and seven male speakers. The speakers had graduated from an English-speaking university such as Harvard, Yale, the London Architectural Association or an institution of similar standing and prestige, enjoyed a high level of prestige and international recognition, and used English as an academic or professional language on a regular basis (Appendix 3).
Speech structure. Speeches were structured in three parts, namely an introduction, a project review and a closing argument. Introductions presented biographical data and references to the theme of the conference. The project review was the most extensive and complex part of each speech. It included an in-depth discussion of an architectural project or series of projects, and it was further divided into different sections. The first section was usually an introduction to the spatial context of the building and the unique characteristics of the site. This was followed by a description of the design thinking process. The closing evaluation brought in final comments on the building or buildings reviewed and, at times, echoed ideas that had been introduced earlier. The structure of the speeches was similar to the genre known as Architectural Review (Caballero, 2006).
Mode of delivery. Five speeches were read from a written script, four were delivered in a mixture of read and impromptu, and two were delivered impromptu.

Spanish subcorpus
Length and duration. The Spanish subcorpus has almost 65,000 words, representing over nine hours of recorded conference material at an average speed of 111 w/min. The type-token ratio value is 3.61% and the standardized type-token ratio 32.29%. The average speech duration is fifty-two minutes (Table 2).
Interpreters. The simultaneous interpretations were delivered by four interpreters, two male and two female. Interpreters A and B (female) interpreted at the fifth edition, held in 2007, and interpreters C and D (male) at the third edition, held in 2003.
The four interpreters had a marked accent representative of Northwest Spain. Interpreters produced conventionally acceptable speeches, made effective use of grammar and register, and achieved a target text corresponding to the source speeches. Interpreter A showed the greatest control of production, which was reflected in output and content-related features such as clear diction, avoidance of monotonous intonation and adequate use of grammar. Clarity of output was deemed satisfactory overall except for interpreter D, whose accent was particularly marked and at times affected the intelligibility of the target speech.
Evidence of training and professionalism was collected. Interpreters built their own syntax as they went along, favoring chunked constructions to allow for possible changes of course. They seemed to be familiar with the use of strategies such as reformulation through adding synonyms or by expanding and explaining the content of the source speech. They were familiar with interpreting décalage, with inserting self-corrections, and with re-arranging the information to suit the norms of the target language. Likewise, they used domain-specific terms such as tejado ('roof'), retícula urbana ('urban grid'), pared ('wall'), voladizo ('cantilever'), tejido urbano ('urban fabric'), malla de acero ('steel mesh'), madera contrachapada ('plywood'), aislamiento ('insulation'), piso ('floor'), maqueta ('model'), alzado ('elevation'), boceto ('sketch'), planos de diseño ('design plans'), columna ('column'), techo ('ceiling') and so forth. In addition, they showed awareness of professional booth behavior, especially in the control of background noise and the use of the silence button. Interpreter A in particular seemed to be closely monitoring her production, as evidenced by the lack of background noise such as 'ahhgings' and 'uhmings'.

Past and current applications
ARCHINT was created as part of a PhD research project that explored the concept of accuracy in simultaneous interpreting by using a multidisciplinary framework and from a theoretical and an empirical perspective (Cabrera, 2015). A summary of this project is provided below.

Accuracy in interpreting
Accuracy has traditionally been acknowledged as a core feature of the ideal conference interpreter's performance, and for many years it has occupied a focal position in Interpreting Studies. Allusions to accuracy in institutional or specialized conferences pervade early works such as The Interpreter's Handbook (Herbert, 1952) and interpreting theories such as the Sense Theory (Seleskovitch, 1978). In quality-oriented research, accuracy is generally part of the definition of core quality criteria such as terminology or correct or complete transfer of meaning (Kurz, 2001: 401; Collados Aís et al., 2007), and whenever users have ranked quality criteria, correct or accurate terminology has usually been placed at least at a similar level as logical cohesion and complete rendition (Kurz, 1993, 2001). Despite the emphasis on terminological accuracy in conference interpreting, it has only recently become a focus in research. As a result, the concept remains rather ambiguous, and this ambiguity hinders any attempt to improve terminology management practices geared towards maximizing accuracy in output. In an attempt to help fill this research gap, a PhD research project was conducted in 2015. Among its objectives were to arrive at a definition of accuracy more in tune with recent cognitive terminology theories, to analyze a sample of interpreters' output in search of accuracy features or lack thereof, and to observe the effect of terminological inaccuracies on the conceptualization of the source message by the target users. Three different studies were developed, each based on the same material (ARCHINT Corpus) but with a different research methodology: a corpus-based terminological analysis, a user-oriented quality evaluation study, and a conceptualization, identification and post-evaluation study.
The corpus-based terminological analysis involved the selection and analysis of a sample of terms from the Spanish subcorpus, and it was performed in three stages: selection, analysis and comparison. The terms were chosen manually from a list of words extracted semi-automatically with WordList. This step was followed by the identification of a source referent for each term in the English subcorpus, and by the creation of generic categories that were used as a structuring tool for groups of lexical concepts that shared certain properties (e.g., function, attributes). The analysis phase consisted of a dictionary analysis and a contrastive corpus analysis. The dictionary analysis involved parsing definitions to extract conceptual information for each term selected from the Spanish subcorpus, and the corpus analysis involved comparing the frequency and collocations of these terms against a monolingual, ad hoc corpus consisting of parallel texts in Spanish.
The user-oriented quality evaluation study explored and compared the quality evaluations that three groups of subjects with varying degrees of specialization made of two fragments of a specialized simultaneous interpretation selected from the ARCHINT target subcorpus, one of them with terminological inaccuracies. The theoretical and methodological framework was provided by user-oriented Quality Expectations and Evaluation Survey Studies, a research paradigm that has attracted considerable attention over the last twenty years (Bühler, 1986; Collados Aís, 1998; Collados Aís et al., 2007).
The conceptualization, identification and post-evaluation study looked into the differences or similarities in the way that architects, architecture students and subjects unrelated to this domain detected terminological inaccuracies in an interpreted speech on Architecture (ibidem). A mixed methods approach was used, which combined qualitative data, hand drawing and the correction of a fragment of a transcript from the Spanish subcorpus. The theoretical framework was grounded both in Frame-based Terminology and in research performed under the expert-novice paradigm, which explores the effects of expertise and prior knowledge on the performance of expert and novice individuals on the same set of problems drawn from a given domain of specialization (Tanaka and Taylor, 1991; Tanaka et al., 2005).
ARCHINT is being used as an observational tool to identify factors that make accuracy difficult to achieve. Currently, the results obtained in the corpus-based terminological analysis are being further analyzed in an attempt to identify patterns leading to general causes of terminological inaccuracies and, to date, up to four patterns have been identified (Cabrera and Faber, in progress). It is expected that this research will eventually lead to the identification of loopholes in terminology preparation that need to be addressed in order to achieve an accurate output. Although the project is still at an early stage, the preliminary data look promising.
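The frequency side of such a contrastive analysis can be sketched in a few lines. This is only an illustration of the general technique, not the procedure followed in the study, which relied on dedicated corpus software; the function names and the toy token lists are hypothetical, and the sketch handles single-word terms only (multiword terms such as retícula urbana would require n-gram counting).

```python
# Minimal sketch: compare the normalized frequency of candidate terms in a
# study corpus against a reference corpus. All data below are invented.
from collections import Counter


def relative_frequencies(tokens: list[str], per: int = 1000) -> dict[str, float]:
    """Frequency of each word per `per` running words."""
    total = len(tokens)
    if total == 0:
        return {}
    return {word: per * count / total for word, count in Counter(tokens).items()}


def compare_terms(
    terms: list[str],
    study_tokens: list[str],
    reference_tokens: list[str],
) -> dict[str, tuple[float, float]]:
    """Map each term to its (study, reference) frequency per 1,000 words."""
    study = relative_frequencies(study_tokens)
    reference = relative_frequencies(reference_tokens)
    return {term: (study.get(term, 0.0), reference.get(term, 0.0)) for term in terms}


# Toy token lists standing in for the Spanish subcorpus and the ad hoc corpus.
study_tokens = ["voladizo", "de", "acero", "voladizo"]
reference_tokens = ["tejado", "de", "voladizo", "de"]
comparison = compare_terms(["voladizo", "tejado"], study_tokens, reference_tokens)
```

Normalizing to a rate per 1,000 words is what allows corpora of different sizes to be compared; a term that is markedly more or less frequent in the interpreted output than in the reference texts is a candidate for closer qualitative inspection.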

Future developments and research possibilities
ARCHINT is an expanding corpus. In the short term, there are plans to align the source and target speeches, to expand the size of the corpus by adding seven new texts from the fifth edition with a new language directionality (Spanish>English), and to make the transcripts available to the research community through the creation of an online platform or virtual database of restricted access.
It is expected that the expansion of the corpus by adding more speeches and a new language direction will open up further research opportunities. Like the EPIC corpus, ARCHINT could potentially be used to help discern language-specific and direction-specific features of the interpreted output, and to unveil unique aspects of interpreters' individual performance when interpreting specialized knowledge into their A or B language. Besides, the addition of a new language direction could lead to the study of the interactions between translational patterns and specific language directionality in domain-specific discourse. Likewise, the alignment of the source and target speeches may well lay the foundations for the study of the challenges, strategies and creative mechanisms involved in the simultaneous interpreting of a specialized speech in the field of Architecture, and broaden its research possibilities to accommodate other research questions.
In the future, there are plans to create an online platform or virtual database to host the corpus transcriptions. This will enable external researchers to access the material more easily and enhance its research potential. Just like any other corpus, ARCHINT can be used to explore descriptive aspects of interpreting such as the frequency of words, grammatical constructions, discourse patterns, co-occurrences, lexical density and type-token ratios. ARCHINT can also be used contrastively against other interpreting corpora to perform studies on lexical density and variety, or to identify factors involved in the success or failure of a specialized interpretation on the grounds of the setting (freelance market vs. institutional setting). It can also be used to test interpreters' processing capacity of domain-specific terminology, or to elaborate systematic descriptions of the processing operations that take place while interpreting in simultaneous mode. The appeal of ARCHINT may extend beyond interpreting scholars to the research community in general, especially to those who conduct research in specialized language, cognition and communication. As Setton (2011: 34) notes, unlike monolingual and translation corpora, interpreting corpora usually bring in the interlingual, oral and context dimensions. ARCHINT, just like any other interpreting corpus, is likely to be a rich observational resource for research on psycholinguistic and pragmatic language processes under different conditions.

Conclusions
In the last ten years, Corpus-based Interpreting Studies has been mirroring the rapprochement between Translation Studies and the fast-growing field of Corpus Linguistics, and Interpreting Studies has benefited from this proximity in several ways. For instance, among the outcomes of Corpus-based Interpreting Studies have been the development of interpreting corpora, the implementation of Corpus Linguistics methods of analysis, and the blossoming of corpus-based and corpus-oriented studies that are currently allowing for the testing of hypotheses and the validation of existing theories in a systematic way.
For a long time, one of the greatest obstacles hampering empirical research on this mode in particular has been the collection of sufficient and adequate material that is representative of an interpreting scenario. Until recently, available interpreting corpora were often not machine-readable, or too small to be representative of interpreting scenarios beyond the one portrayed in the sample analyzed. The development of interpreting corpora such as ARCHINT has made large collections of authentic, carefully selected interpreting material readily available, and has opened the door to the implementation of statistical analyses, which can help reduce the speculation and subjectivity surrounding these studies.
Likewise, the use of corpora as a research tool and of corpus-based techniques to explore interpreters' output and performance has brought Interpreting Studies one step closer to disciplines that have adopted corpus-based techniques and methodologies, such as Translation Studies, Applied Linguistics or Terminology. An example is the corpus-based terminological study on accuracy conducted on the ARCHINT Corpus, which is illustrative of a terminological approach to the study of simultaneous interpreting. Although the specific results of the study should only be taken as representative of the specialized setting portrayed at the conference, the study was successful in reducing at least part of the ambiguity surrounding the concept of accuracy and in arriving at a definition that was consistent with theories that adopt an encyclopedic approach to meaning, such as Frame-based Terminology. The analysis supported the stance that a term is not accurate per se, but becomes more or less accurate in situational and linguistic contexts, and with reference to a source term, an intentionality, an expected knowledge representation and a target audience that already has preconceived ideas of the content it will be receiving and of the verbal package that will wrap such content. The study led to questioning assumed, intuition-based, well-established abstract opinions on accuracy, most of which have traditionally favored the idea that accuracy is an intrinsic property of the form, style or grammar of a target-language utterance. Furthermore, it serves to illustrate how interdisciplinary projects combining Interpreting Studies and Terminology can be performed by adopting a corpus-based approach to the study of interpreting.
While the outcome of the interdisciplinary avenue prompted by Corpus-based Studies in general and by the greater availability of interpreting corpora is still unknown, overall the prospects seem particularly promising. Essentially, Corpus-based Interpreting Studies has laid the foundations for greater interdisciplinary cooperation, and is extending an invitation to interpreting researchers and researchers beyond the field to embark on the study of interpreted oral productions.

Table 1. English and Spanish subcorpus description