Published November 27, 2023 | Version v1
Project deliverable Open

CLS INFRA D6.3 Standards beyond TEI / Extended Transformation Matrix / Alternative Formats

  • 1. Austrian Academy of Sciences, Austrian Centre for Digital Humanities and Cultural Heritage
  • 2. University of Potsdam

Description

This deliverable builds on and extends the findings of D6.1 "Inventory of existing data sources and formats", which surveyed the landscape of literary corpora, and of D8.1 "Tools for NLP", which catalogued the tools used in the context of CLS. Focusing on the wealth of formats used for encoding and processing text, it offers a comprehensive overview of common formats for encoding textual data beyond the "lingua franca", TEI, in both computational literary studies and computational linguistics, and highlights potential discrepancies in approach between these two areas of research. The overview reveals a very heterogeneous landscape with a plethora of formats devised for differing tasks, from the philological encoding of historical text material to the computational annotation and processing of text.

Considering interoperability an indispensable key to reusability, the deliverable explores the challenges of, and approaches to, converting between formats.

This compilation of information serves as input for the further development of the Transformation Matrix, introduced in D6.1, which is intended as a conceptual framework for consolidating existing format-conversion solutions in the Transformation Toolbox, to be delivered by the end of the project (D6.2). The Transformation Matrix is meant to capture information about the specific data structures (features) present in datasets as well as the data structures required or produced by tools. This requires a sufficiently expressive formalised description, which is proposed in the CLSCor data model.
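
The idea of matching features present in datasets against features required or produced by tools can be illustrated with a minimal sketch. It is not the Transformation Matrix or the CLSCor data model itself: the class names, the function names and all feature labels (such as "paragraphs" or "tokens") are hypothetical and chosen only for illustration, and features are reduced to plain strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """A corpus described only by the data structures (features) it contains."""
    name: str
    features: frozenset[str]


@dataclass(frozen=True)
class Tool:
    """A tool described by the features it requires and the features it produces."""
    name: str
    requires: frozenset[str]
    produces: frozenset[str] = frozenset()


def missing_features(dataset: Dataset, tool: Tool) -> frozenset[str]:
    """Return the features the tool requires but the dataset does not provide."""
    return tool.requires - dataset.features


def apply_tool(dataset: Dataset, tool: Tool) -> Dataset:
    """Return the dataset description extended by the features the tool produces."""
    missing = missing_features(dataset, tool)
    if missing:
        raise ValueError(
            f"{tool.name} cannot run on {dataset.name}: missing {sorted(missing)}"
        )
    return Dataset(dataset.name, dataset.features | tool.produces)


# Hypothetical example values, invented for illustration only.
corpus = Dataset("example-corpus", frozenset({"paragraphs", "speakers"}))
tokenizer = Tool(
    "example-tokenizer",
    requires=frozenset({"paragraphs"}),
    produces=frozenset({"tokens"}),
)
tagger = Tool(
    "example-tagger",
    requires=frozenset({"tokens"}),
    produces=frozenset({"pos"}),
)

# The tagger needs tokens, which the raw corpus description lacks.
print(sorted(missing_features(corpus, tagger)))

# After tokenisation the requirement is satisfied.
tokenized = apply_tool(corpus, tokenizer)
print(sorted(missing_features(tokenized, tagger)))
```

In this simplified reading, a conversion or processing step is needed whenever `missing_features` returns a non-empty set. A formalised description as called for above would need far richer feature definitions than flat string labels.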

Files

D6.3_Standards_beyond_TEI.pdf (897.2 kB)
md5:5cdd5f629f138d819e6f7aba06b54c89

Additional details

Funding

CLS INFRA – Computational Literary Studies Infrastructure 101004984
European Commission