ComPara: A corpus linguistics in English of computation in architecture dataset

ComPara is a corpus linguistics dataset in English focused on computational architecture, or architecture where technology functions as a driver for its conceptualization, design, and materialization. Computational architecture is sometimes also referred to as digital, parametric, algorithmic or generative architecture, and, as has been shown, each of these terms has a different flavour [9]. Other linguistic corpora for architecture have been built from texts written over a relatively limited time span and focus on the language used in the profession in general [1,2]. The texts that make up ComPara were written between 2005 and 2019 and focus on computational architecture. The corpus is built from two sources: the journal Architectural Design [3] and the eVolo skyscraper competition [4]. The former is one of the journals that has focused most on the theoretical discourse surrounding computation in architecture [5], while the latter is one of the most prestigious competitions focusing on ‘technological advancements in architecture’ [4]. From Architectural Design, the corpus includes the titles of the journal issues, the titles of all articles, and the keywords associated with the Introduction article on the journal's web page for each issue, for the period between 2005 and 2019. From the eVolo Skyscraper competition, the titles of all winning projects and honorable mentions, as well as all abstracts describing the projects, were collected for the period between 2006 and 2019. This amounts to around 100,000 words. The purpose of building this dataset was to help gain a better understanding of the digitalization of architecture over a 15-year time span [8]. Further quantitative, qualitative or mixed method analysis can be carried out using the ComPara corpus by following specific topics or trends over time or by comparing the corpus to other sources.



Value of the Data
• This corpus offers insight into the language used to talk about architecture in one journal and one competition which have focused on technological advancements in the field. Architectural Design presents theoretical insights, and the eVolo skyscraper competition shows the language used to describe conceptual projects that have received awards or honourable mentions.
• Architectural theorists and historians can use the data to re-read the recent history of the part of architecture which focuses on technological advancements. Those who want to submit conceptual projects to the eVolo skyscraper competition can use this corpus to get a better understanding of the themes which have been successful through the years.
• The data can be used to gain a better understanding of how computation has been penetrating the field of architecture over a 15-year period.
• The dataset can be processed quantitatively (for example, for topic modelling by natural language processing), qualitatively, or through mixed method approaches.
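As one illustration of the quantitative processing mentioned above, the following minimal Python sketch counts word frequencies and follows the share of a single term across years. It is not part of the original workflow. The stop-word list, the function names, and the two sample strings are assumptions made up for illustration and are not taken from the corpus.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real analysis
# would use a fuller list.
STOPWORDS = {"the", "of", "and", "a", "in", "for", "to", "is", "on", "with"}


def tokenize(text):
    """Lowercase the text and return alphabetic tokens, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def top_words(text, n=500):
    """Return the n most frequent non-stop-word tokens with their counts."""
    counts = Counter(t for t in tokenize(text) if t not in STOPWORDS)
    return counts.most_common(n)


def relative_frequency(texts_by_year, term):
    """Map each year to the share of its non-stop-word tokens equal to term."""
    shares = {}
    for year, text in sorted(texts_by_year.items()):
        tokens = [t for t in tokenize(text) if t not in STOPWORDS]
        shares[year] = tokens.count(term.lower()) / len(tokens) if tokens else 0.0
    return shares


# Placeholder strings standing in for yearly text files (not real corpus data).
sample = {
    2005: "Digital design and the digital city",
    2019: "Robotic fabrication of a tower",
}
digital_share = relative_frequency(sample, "digital")
```

In practice, `texts_by_year` would be filled by reading the yearly files of the dataset. Note that the simple tokenizer splits hyphenated words and drops digits, which may or may not suit a given analysis.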

Experimental Design, Materials and Methods
The corpus was built with the help of the web scraping tool Octoparse [6] from the websites of the journal Architectural Design and eVolo. The data collection was done in several steps, detailed below.
For the journal Architectural Design, there were two steps: (a) the titles of the issues and the titles of the articles in each issue were collected for the period between 2005 and 2019.
(b) In the second step, the keywords associated with the Introduction article on the journal's web page for each issue were collected for the period 2005-2019. The URL for each issue was created automatically using a script made with the visual programming language Grasshopper [7]. This meant simply changing the year, volume number and issue number at the end of the URL string. These URLs were introduced into Octoparse as batches for every year between 2005 and 2019. For each yearly batch, an Octoparse job was created. The target data to be extracted from the batches was selected manually in the browser. This target data was the title of the issue and the titles of the articles in each issue. Each yearly batch was placed in a separate .txt file. The word clouds from the .txt files were created using the Cirrus function from Voyant Tools. The Cirrus function deletes any punctuation and connecting words, shows only the 500 most common words, and sizes them by frequency.
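The URL construction described above was done with a Grasshopper script. As a rough stand-in, the same idea can be sketched in Python. The URL template, the starting volume number, and the number of issues per year used here are placeholders assumed for illustration; they are not the journal's actual URL pattern or publication schedule.

```python
def build_issue_urls(template, first_year, last_year, first_volume, issues_per_year):
    """Return a dict mapping each year to its batch of issue URLs.

    The volume number is assumed to increase by one per year, and every
    year is assumed to have the same number of issues.
    """
    batches = {}
    for offset, year in enumerate(range(first_year, last_year + 1)):
        volume = first_volume + offset
        batches[year] = [
            template.format(year=year, volume=volume, issue=issue)
            for issue in range(1, issues_per_year + 1)
        ]
    return batches


# Hypothetical template and numbering, for illustration only.
TEMPLATE = "https://example.org/toc/{year}/{volume}/{issue}"
yearly_batches = build_issue_urls(
    TEMPLATE, first_year=2005, last_year=2019, first_volume=75, issues_per_year=6
)
```

Each value in `yearly_batches` corresponds to one yearly batch of URLs of the kind that was loaded into a scraping job.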
For the eVolo Skyscraper Competition, the data was collected in two steps: (1) first, the winning projects, and (2) second, the honourable mentions. URLs were created manually for each year's winners and each year's honourable mentions between 2006 and 2019. The URLs were used to create yearly batches in Octoparse, and then jobs were created for each batch. The target data to be extracted was selected manually in the browser and consisted of the titles of the projects, the authors, the country, and the abstract describing the project. Each yearly batch resulted in two .csv files, one with winning projects and one with honourable mentions. The word clouds of the project titles were created, again, using the Cirrus function from Voyant Tools. Then, two maps were created using a Grasshopper script which coloured the interior of each country's polyline on a world map according to how many times the country appeared in the list of (1) winning projects and (2) honourable mentions and winning projects in the .csv files described above. The dataset can be found online at [10].
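The per-country counts that drive the two maps can be sketched in Python as follows. This is only an illustration of the counting step, not the Grasshopper script used for the maps. The column name `country` and the sample rows are assumptions; the actual .csv headers may differ.

```python
import csv
import io
from collections import Counter


def count_countries(csv_text, column="country"):
    """Count how often each country appears in one .csv export.

    Rows with a missing or empty country field are skipped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter()
    for row in reader:
        country = (row.get(column) or "").strip()
        if country:
            counts[country] += 1
    return counts


# Placeholder rows (not real competition data).
winners_csv = "title,country\nProject A,Country X\nProject B,Country Y\n"
mentions_csv = "title,country\nProject C,Country X\nProject D,\n"

winners = count_countries(winners_csv)              # input for map (1)
combined = winners + count_countries(mentions_csv)  # input for map (2)
```

Adding the two `Counter` objects gives the combined tally of honourable mentions and winning projects, which can then be mapped to a colour scale.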

Ethics Statement
Hereby, I, Anca-Simona Horvath, consciously assure that for the manuscript ComPara: A Corpus Linguistics in English of Computation in Architecture Dataset, the following is fulfilled:
1) This material is the authors' own original work, which has not been previously published elsewhere.
2) The paper is not currently being considered for publication elsewhere.
3) The paper reflects the authors' own research and analysis in a truthful and complete manner.
4) The paper properly credits the meaningful contributions of co-authors and co-researchers.
5) The scraping of Architectural Design's repository is only done on data available to the general audience, who do not have to be registered customers to see titles of journal issues and articles and keywords associated with the Introduction article of each issue. Additionally, scraping is done in accordance with Wiley's Text and Data Mining Agreement.
6) The eVolo skyscraper competition repository is licensed under a Creative Commons License permitting non-commercial sharing with attribution.
The violation of the Ethical Statement rules may result in severe consequences. I agree with the above statements and declare that this submission follows the policies of Solid State Ionics as outlined in the Guide for Authors and in the Ethical Statement.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have or could be perceived to have influenced the work reported in this article.