The Linked TEI: Text Encoding in the Web

TEI Conference and Members Meeting 2013: October 2-5, Rome (Italy)

Abstracts of papers

The Linked Fragment: TEI and the encoding of text re-uses of lost authors

The goal of this paper is to present characteristics and requirements for encoding quotations and text re-uses of lost works (i.e., those pieces of information about lost authors that humanists classify as ‘fragments’). In particular the discussion will focus on the work currently done using components of Perseids (http://sites.tufts.edu/perseids/), a collaborative platform being developed by the Perseus Project that leverages and extends pre-existing open-source tools and services to support editing and annotating TEI XML source documents in Classics.

Working with text re-uses of fragmentary authors means annotating information pertaining to lost works that is embedded in surviving texts. These fragments of information derive from a great variety of text re-uses that range from verbatim quotations to vague allusions and translations. One of the main challenges when looking for traces of lost works is the reconstruction of the complex relationship between the text re-use and its embedding context. Pursuing this goal means dealing with three main tasks: 1) weighing the degree of interference introduced by the author who has reused and transformed the original context of the information; 2) measuring the distance between the source text and the derived text; 3) assessing the degree of text re-use and its effects on the final text.

The first step in rethinking the significance of quotations and text re-uses of lost works is to represent them inside their preserving context. This means, first of all, selecting the string of words that belongs to the portion of text classifiable as re-use and, secondly, encoding all those elements that signal the presence of the text re-use (i.e., named entities such as the onomastics of re-used authors, titles of re-used works and descriptions of their content, verba dicendi, syntax, etc.). The second step is to align and encode all information pertaining to other sources that reuse the same original text with different words or a different syntax (witnesses), or that deal with the same topic as the text re-use (parallel texts), and finally different editions and translations of both the source and the derived texts.

This paper addresses the following requirements for producing a dynamic representation of quotations and text re-uses of fragmentary authors, which involve different technologies including both inline and stand-off markup:

  • Identifiers: i.e. stable ways for identifying: fragmentary authors; different kinds of quotations and text re-uses; passages and works that preserve quotations and text re-uses; editions and translations of source texts; entities mentioned within the text re-uses; annotations on the text re-uses.

  • Links: between the fragment identifier and the instances of text re-use, the fragment identifier and the attributed author, the fragment identifier and an edition which collects it; between the quoted passage and the entities referenced in it; between the quoted passage and translations.

  • Annotations: the type of re-use; canonical citations of text re-uses; dates of the initial creation of the re-use, of the work which quotes it, author birth and death; editorial commentary on each text re-use; bibliography; morphosyntactic analysis of the quoted passage; text re-use analysis (across different re-uses of the same text); syntactic re-use analysis; translation alignments (between re-used passages and their translations); text reuse alignments (between different re-uses of the passage in the same language).

  • Collections (the goal is to organize text re-uses into the following types of collections): all text re-uses represented in a given edition which includes re-uses from one or many authors; all text re-uses attributed to a specific author; all text re-uses quoted by a specific author; all text re-uses referencing a specific topic; all text re-uses attributed to a specific time period, etc.

In the paper we discuss in particular how we are combining TEI (http://www.tei-c.org), the Open Annotation Collaboration (OAC) core data model (http://www.openannotation.org/spec/core/), and the CITE Architecture (http://www.homermultitext.org/hmt-doc/cite/index.html) to represent quotations and text re-uses via RDF triples. The subject and object resources of these triples can be resolved by Canonical Text and CITE Collection Services to supply the TEI XML and other source data in real time in order to produce new dynamic, data-driven representations of the aggregated information.

The CITE Architecture defines CTS URNs for creating semantically meaningful unique identifiers for texts, and passages within a text. It also defines an alternate identifier syntax, in the form of a CITE URN, for data objects which don’t meet the characteristics of citable text nodes, such as images, text re-uses of lost works, and annotations. As URNs, these identifiers are not web-resolvable on their own, but by combining them with a URI prefix and deploying CTS and CITE services to serve the identified resources at those addresses, we have resolvable, stable identifiers for our texts, data objects and annotations. In the paper we supply specific examples of URNs, and their corresponding URIs, for texts, citations, images and annotations.
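The pattern of combining a URN with a URI prefix to obtain a resolvable address can be sketched as follows. This is an illustrative sketch only: the service host is hypothetical, and only the Herodotus CTS URN follows a published Perseus identifier; the CTS `GetPassage` request is part of the CTS protocol.

```python
# Sketch: turning a CTS URN into a web-resolvable URI by prepending a
# (hypothetical) CTS service endpoint. The URN syntax itself is standard:
# urn:cts:<namespace>:<textgroup>.<work>.<edition>:<passage>
def cts_passage_uri(urn: str, base: str = "http://example.org/cts") -> str:
    """Combine a CTS URN with a service prefix to form a resolvable URI."""
    return f"{base}?request=GetPassage&urn={urn}"

# Herodotus, Histories 1.1 in a Perseus Greek edition:
herodotus_1_1 = "urn:cts:greekLit:tlg0016.tlg001.perseus-grc1:1.1"
print(cts_passage_uri(herodotus_1_1))
```

The same prefixing pattern applies to CITE URNs for non-text objects (images, re-uses, annotations), served by a CITE Collection Service instead of a CTS service.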

The CTS API for passage retrieval depends upon the availability of well-formed XML from which citable passages of texts can be retrieved by XPath. The TEI standard provides the markup syntax and vocabulary needed to produce XML which meets these requirements, and is a well-accepted standard for digitization of texts. Particularly applicable are the TEI elements for representing the hierarchy of citable nodes in a text. The Open Annotation Core data model provides us with a controlled vocabulary to identify the motivation for the annotations and enables us to express our annotation triples according to a defined and documented standard.
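A minimal sketch of this XPath-based retrieval, assuming a simplified two-level citation hierarchy (book/chapter encoded as nested `<div>` elements) and an invented sample passage:

```python
# Sketch: retrieving a citable passage from TEI XML by XPath, in the way a
# CTS service might map the passage reference "1.1" onto nested <div>s.
# The TEI snippet below is illustrative, not from an actual edition.
import xml.etree.ElementTree as ET

TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="edition">
      <div type="textpart" subtype="book" n="1">
        <div type="textpart" subtype="chapter" n="1">
          <p>Herodotus of Halicarnassus here presents his research...</p>
        </div>
      </div>
    </div>
  </body></text>
</TEI>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(TEI)
# The passage reference "1.1" becomes two @n predicates on the hierarchy:
passage = root.find(".//tei:div[@n='1']/tei:div[@n='1']/tei:p", NS)
print(passage.text)
```

In a real CTS implementation the XPath for each level of the citation hierarchy is declared in the text inventory rather than hard-coded.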

In the paper we present practical examples of annotations of text re-uses of lost works that have been realized using components of the Perseids platform. In Perseids we are combining and extending a variety of open source tools and frameworks that have been developed by members of the Digital Classics community in order to provide a collaborative environment for editing, annotating and publishing digital editions and annotations. The two most prominent components of this platform are the Son of SUDA Online tool developed by the Papyri.info (http://papyri.info) project and the CITE architecture, as previously mentioned. The outcome of this work is presented in a demonstration interface of Perseids, The Fragmentary Texts Demo (http://services.perseus.tufts.edu/berti_demo/). We also present the data driving the demo, which contains sets of OAC annotations (http://services.perseus.tufts.edu/berti_demo/berti_annotate.js) serialized according to the JSON-LD specification.
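The general shape of one such annotation, serialized as JSON-LD, can be sketched as follows. The identifiers and the `oa:` property selection here are illustrative assumptions, not the project's actual data; the context URI is the published Open Annotation JSON-LD context.

```python
# Sketch of a single Open Annotation expressed as JSON-LD: the body is a
# fragment identifier (a CITE-style URN, invented here) and the target is
# the passage of the surviving text that re-uses the lost work.
import json

annotation = {
    "@context": "http://www.w3.org/ns/oa-context-20130208.json",
    "@id": "http://example.org/annotations/1",      # hypothetical
    "@type": "oa:Annotation",
    "oa:motivatedBy": "oa:linking",
    "oa:hasBody": "urn:cite:example:frag.1.1",      # hypothetical CITE URN
    "oa:hasTarget": "urn:cts:greekLit:tlg0016.tlg001.perseus-grc1:2.99.1",
    "oa:annotatedBy": "http://example.org/editors/1",
}
print(json.dumps(annotation, indent=2))
```

Because body and target are URNs, both sides of the triple can be resolved on demand by the CTS and CITE services described above.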

The final goal is to publish the annotations and include all the information pertaining to fragmentary texts in the collection of Greek and Roman materials in the Perseus Digital Library. The purpose is to collect different kinds of annotations of text re-uses of fragmentary authors with a twofold perspective: 1) going beyond the limits of print culture collections where text re-uses are reproduced as decontextualized extracts from many different sources, and representing them inside their texts of transmission and therefore as contextualized annotations about lost works; 2) allowing the user to retrieve multiple search results using different criteria: collections of fragmentary authors and works, morphosyntactic data concerning text re-uses, information about the lexicon of re-used words, cross-genre re-uses, text re-use topics, etc.


  • Almas, Bridget and Beaulieu, Marie-Claire (2013): Developing a New Integrated Editing Platform for Source Documents in Classics. In: Literary and Linguistic Computing (Digital Humanities 2012 Proceedings) (forthcoming).
  • Berti, Monica (2013): Collecting Quotations by Topic: Degrees of Preservation and Transtextual Relations among Genres. In: Ancient Society 43.
  • Berti, Monica, Romanello, Matteo, Babeu, Alison and Crane, Gregory R. (2009): Collecting Fragmentary Authors in a Digital Library. In: Proceedings of the 2009 Joint International Conference on Digital Libraries (JCDL ’09). Austin, TX. New York, NY: ACM Digital Library, 259-262. http://dl.acm.org/citation.cfm?id=1555442
  • Büchler, Marco, Geßner, Annette, Berti, Monica, and Eckart, Thomas (2012): Measuring the Influence of a Work by Text Reuse. In: Dunn, Stuart and Mahony, Simon (Ed.): Digital Classicist Supplement. Bulletin of the Institute of Classical Studies. Wiley-Blackwell.
  • Crane, Gregory R. (2011): From Subjects to Citizens in a Global Republic of Letters. In: Grandin, Karl (Ed.): Going Digital. Evolutionary and Revolutionary Aspects of Digitization. Nobel Symposium 147. The Nobel Foundation, 251-254.
  • Romanello, Matteo, Berti, Monica, Boschetti, Federico, Babeu, Alison and Crane, Gregory R. (2009): Rethinking Critical Editions of Fragmentary Texts by Ontologies. In: ELPUB 2009: 13th International Conference on Electronic Publishing: Rethinking Electronic Publishing: Innovation in Communication Paradigms and Technologies. Milan, 155-174. http://hdl.handle.net/10427/70403
  • Smith, D. Neel and Blackwell, Chris (2012): Four URLs, Limitless Apps: Separation of Concerns in the Homer Multitext Architecture. In: A Virtual Birthday Gift Presented to Gregory Nagy on Turning Seventy by His Students, Colleagues, and Friends. The Center of Hellenic Studies of Harvard University. http://folio.furman.edu/projects/cite/four_urls.html

“Reports of My Death Are Greatly Exaggerated”: Findings from the TEI in Libraries Survey

Historically libraries, especially academic libraries, have contributed to the development of the TEI Guidelines, largely in response to mandates to provide access to and preserve electronic texts (Engle 1998; Friedland 1997; Giesecke, McNeil, and Minks 2000; Nellhaus 2001). At the turn of the 21st century, momentum for text encoding grew in libraries as a result of the maturation of pioneering digital library programs and XML-based web publishing tools and systems (Bradley 2004). Libraries were not only providing “access to original source material, contextualization, and commentaries, but they also provid[ed] a set of additional resources and service[s]” equally rooted in robust technical infrastructure and noble “ethical traditions” that have critically shaped humanities pedagogy and research (Besser 2004).

In 2002, Sukovic posited that libraries’ changing roles would and could positively impact publishing and academic research by leveraging both standards such as the TEI Guidelines and traditional library expertise, namely in cataloging units due to their specialized knowledge in authority control, subject analysis, and of course, bibliographic description. Not long after, in 2004, Google announced the scanning of books in major academic libraries to be included in Google Books (Google 2012), and in 2008 many of these libraries formed HathiTrust to provide access to facsimile page images created through mass digitization efforts (Wilkin 2011), calling into question the role for libraries in text encoding that Sukovic advocated. In 2011, with the formation of the HathiTrust Research Center and IMLS funding of TAPAS (TEI Archiving, Publishing, and Access Service, http://www.tapasproject.org/), we see that both large- and small-scale textual analysis are equally viable and worthy pursuits for digital research inquiry in which libraries are heavily vested (Jockers and Flanders 2013). More recently, we are witnessing a call for greater and more formal involvement of libraries in digital humanities endeavors and partnerships (Vandegrift 2012; Muñoz 2012) in which the resurgence of TEI in libraries is becoming apparent (Green 2013; Milewicz 2012; Tomasek 2011; Dalmau and Courtney 2011). How has advocating for such wide-ranging library objectives — from digital access and preservation to digital literacy and scholarship, from supporting non-expressive/non-consumptive research practices to research practices rooted in the markup itself — informed the evolution or devolution of text encoding projects in libraries?

Inspired by the papers, presentations and discussions that resulted from the theme of the 2009 Conference and Members’ Meeting of the TEI Consortium, “Text Encoding in the Era of Mass Digitization,” the launch of the AccessTEI program in 2010, and the release of the Best Practices for TEI in Libraries in 2011, we surveyed employees of libraries around the world between November 2012 and January 2013 to learn more about text encoding practices and gauge current attitudes about text encoding in libraries. As library services evolve to promote varied modes of scholarly communications and accompanying services, and digital library initiatives become more widespread and increasingly decentralized, how is text encoding situated in these new or expanding areas? Do we see trends in uptake or downsizing of text encoding initiatives in smaller or larger academic institutions? How does administrative support or lack thereof impact the level of interest and engagement in TEI-based projects across the library as a whole? What is the nature of library-led or -partnered electronic text projects, and is there an increase or decrease in local mass digitization or scholarly encoding initiatives? Our survey findings provide, if not answers to these questions, glimpses of the TEI landscape in libraries today.

The survey closed on January 31, 2013, with a total of 138 responses, and a completion rate of 65.2%. Since the survey was targeted specifically toward librarians and library staff, we turned away respondents for not meeting that criterion, with a final total of 90 responses. Most of the respondents are from North America (87%), and affiliated with an academic library (82%). Respondents from academic institutions come from institutions of various sizes, with a plurality (31%) falling in the middle range (10,000-25,000 student enrollment). Of those responding, 81.2% are actively engaged in text encoding projects. Preliminary data analysis shows that those not yet engaged in text encoding (or not sure whether their institution is engaged) are planning to embark on text encoding based on grant funding or new administrative support for text encoding projects. It seems that reports of the death of TEI in libraries are greatly exaggerated, though this is not to say that TEI in libraries is not struggling.

Our paper will unveil a fuller analysis of the data we have gathered, and when applicable, a comparative examination against the following raw data sources and publications for a more complete picture:

  • TEI-C membership profile of library institutions from 2005 to 2012
  • Evolution/devolution of electronic text centers within libraries from as early as 2000 to present
  • Findings from a study by Harriett Green (2012) on library support for the TEI
  • Findings from a study by Siemens et al. (2011) on membership and recruitment for the TEI Consortium

Emerging trends and issues will inform the future direction and agenda of the TEI’s Special Interest Group on Libraries.


  • Besser, Howard. 2004. “The Past, Present, and Future of Digital Libraries.” A Companion to Digital Humanities, edited by Susan Schreibman, Ray Siemens, and John Unsworth. Oxford: Blackwell. http://www.digitalhumanities.org/companion/.
  • Bradley, John. 2004. “Text Tools.” A Companion to Digital Humanities, edited by Susan Schreibman, Ray Siemens, and John Unsworth. Oxford: Blackwell. http://www.digitalhumanities.org/companion/.
  • Dalmau, Michelle and Angela Courtney. 2011. “The Victorian Women Writers Project Resurrected: A Case Study in Sustainability.” Paper presented at Digital Humanities 2011: Big Tent Humanities, Palo Alto, California, June 19–22.
  • Engle, Michael. 1998. “The social position of electronic text centers.” Library Hi Tech 16 (3/4): 15–20. http://dx.doi.org/10.1108/07378839810304522.
  • Friedland, LeeEllen. 1997. “Do Digital Libraries Need the TEI? A View from the Trenches.” Paper presented at TEI10: The Text Encoding Initiative Tenth Anniversary User Conference, Providence, Rhode Island, November 14–16. http://www.stg.brown.edu/conferences/tei10/tei10.papers/friedland.html.
  • Giesecke, Joan, Beth McNeil, and Gina L. B. Minks. 2000. “Electronic Text Centers: Creating Research Collections on a Limited Budget: The Nebraska Experience.” Journal of Library Administration 31 (2): 77–92. http://digitalcommons.unl.edu/libraryscience/63/.
  • Google. 2012. “Google Books History.” Last modified December 21. http://www.google.com/googlebooks/about/history.html.
  • Green, Harriett. 2012. “Library Support for the TEI: Tutorials, Teaching, and Tools.” Paper presented at TEI and the C(r l)o(w u)d: 2012 Annual Conference and Members’ Meeting of the TEI Consortium, College Station, Texas, November 8–10.
  • Green, Harriett. 2013. “TEI and Libraries: New Avenues for Digital Literacy?” dh+lib: Where Digital Humanities and Librarianship Meet. http://acrl.ala.org/dh/2013/01/22/tei-and-libraries-new-avenues-for-digital-literacy/.
  • Jockers, Matthew L. and Julia Flanders. 2013. “A Matter of Scale.” Keynote lecture presented at Boston-Area Days of DH 2013. http://digitalcommons.unl.edu/englishfacpubs/106/.
  • Milewicz, Liz. 2012. “Why TEI? Text > Data Thursday.” Duke University Libraries News, Events, and Exhibits. http://blogs.library.duke.edu/blog/2012/09/26/why-tei-text-data-thursday/.
  • Muñoz, Trevor. 2012. “Digital Humanities in the Libraries Isn’t a Service.” Notebook. http://trevormunoz.com/notebook/2012/08/19/doing-dh-in-the-library.html.
  • Nellhaus, Tobin. 2001. “XML, TEI, and Digital Libraries in the Humanities.” Libraries and the Academy 1(3): 257–77. http://muse.jhu.edu/journals/portal_libraries_and_the_academy/v001/1.3nellhaus.html.
  • Siemens, Ray, Hefeng (Eddie) Wen, Cara Leitch, Dot Porter, Liam Sherriff, Karin Armstrong, and Melanie Chernyk. 2011. “The Apex of Hipster XML GeekDOM.” Journal of the Text Encoding Initiative 1. http://jtei.revues.org/210.
  • Sukovic, Suzana. 2002. “Beyond the Scriptorium: The Role of the Library in Text Encoding.” D-Lib Magazine 8.1. http://www.dlib.org/dlib/january02/sukovic/01sukovic.html.
  • Tomasek, Kathryn. 2011. “Digital Humanities, Libraries, and Scholarly Communication.” Doing History Digitally. http://kathryntomasek.wordpress.com/2011/11/02/digital-humanities-libraries-and-scholarly-communication/.
  • Vandegrift, Micah. 2012. “What is Digital Humanities and What’s It Doing in the Library?” In the Library with the Lead Pipe. http://www.inthelibrarywiththeleadpipe.org/2012/dhandthelib/.
  • Wilkin, John. 2011. “HathiTrust’s Past, Present, and Future.” Remarks presented at the HathiTrust Constitutional Convention, Washington, D.C., October 8. http://www.hathitrust.org/blogs/perspectives-from-hathitrust/hathitrusts-past-present-and-future.

From entity description to semantic analysis: The case of Theodor Fontane’s notebooks

Within the last decades, the TEI has become a major instrument for philologists in the digital age, particularly since the recent incorporation of a set of mechanisms to facilitate the encoding of genetic editions. Digital editions use XML syntax to preserve the quantity and quality of old books and manuscripts, and to publish many more of them online, mostly under free licences. Scholars all over the world are now able to use huge data sets for further research. There are many digital editions available, but only a few frameworks to analyse them. Our presentation focusses on the use of web technologies (XML and related technologies as well as JavaScript) to enrich the forthcoming edition of Theodor Fontane’s notebooks with a data-driven visualisation of named entities, and to build applications upon such visualisations that are reusable for any other edition within the world of the TEI.

State of the art

The TEI Guidelines provide various mechanisms for tagging references to entities in texts, as well as solutions for encoding metadata supplied by editors about such entities. Such methods are frequently employed in digital editions. For example, on the website of the edition of William Godwin’s diaries¹ we are able to highlight the names within the text in different colors. Often these parts are rendered in HTML as <acronym> elements and are equipped with a <div> box containing further information that pops up when the user clicks on or hovers over them. This is a simple and easy-to-use way to deliver further information and some search options, but it does not per se facilitate a detailed analysis.

With the help of the <speaker> tag within TEI-encoded drama, a quantitative analysis of spoken words becomes possible. One example is provided by the Women Writers Project, which visualizes speakers in drama by gender.² It is also possible to get a quantitative overview of the co-appearance of two or more characters, as has been done for Victor Hugo’s Les Misérables with the help of the D3.js JavaScript library.³
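A minimal sketch of such a speaker-based word count, using an invented three-speech fragment rather than an actual encoded play:

```python
# Sketch: counting spoken words per <speaker> in TEI-encoded drama.
# The tiny document below is illustrative only.
import xml.etree.ElementTree as ET
from collections import Counter

PLAY = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <sp><speaker>Antigone</speaker><p>O sister mine, Ismene dear</p></sp>
  <sp><speaker>Ismene</speaker><p>No word of friends</p></sp>
  <sp><speaker>Antigone</speaker><p>What, hath not Creon destined our brothers</p></sp>
</body>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
counts: Counter = Counter()
for sp in ET.fromstring(PLAY).findall("tei:sp", NS):
    speaker = sp.find("tei:speaker", NS).text
    words = " ".join(p.text for p in sp.findall("tei:p", NS)).split()
    counts[speaker] += len(words)

print(counts.most_common())  # word totals per speaker, largest first
```

The same per-speaker totals, grouped by an attribute such as gender recorded in a cast list, underlie visualizations like the Women Writers Project example.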

Persons and places seem to be the most common types of tagged entities. These are usually normalized, i.e. spelling variations are merged and matched to an authoritative name, and some additional data not found in the encoded source text is provided – most commonly biographical dates for persons and geographic coordinates for places. Additional data might include excerpts from encyclopedias, or map visualisations of the location of places. In the case of most editions, the usage of entity encoding can be characterised as descriptive, rather than analytical: information is provided about entities, but the way in which they are referenced in source texts and how the entities relate to each other is recorded and used for navigational purposes only. This paper, employing the example of a TEI edition project of 19th century notebooks, discusses further potential uses of such TEI encoded semantic annotations.

Theodor Fontane’s notebooks

From 1859 until the late 1880s, the German poet Theodor Fontane (1819–1898) filled almost 10,000 pages in 67 notebooks, which have not yet been published in their entirety. They include diary entries, travel notes, theater criticism and drafts for novels and poems, resulting in a wide spectrum of text types and images.⁴ The complete edition of the notebooks both in print and online is being prepared at the Theodor Fontane-Arbeitsstelle, Department of German Philology at Göttingen University, in collaboration with the Göttingen State and University Library.⁵ In his notebooks, Fontane made extensive use of underlining, cancellations, corrections and additions, and consequently the crucial aspect of the philological edition project is to precisely transcribe, encode, annotate and visualize the appearance of Fontane’s handwriting, in order to help the reader to decipher and understand it. Another important task within this project, however, is to identify and encode references to entities in the notebooks. These include:

  • persons, organizations – linked to authority files such as GND⁶ or VIAF⁷, online historical encyclopedias
  • places – all of the above, plus linked to geographical databases such as GeoNames or the Getty Thesaurus of Geographic Names
  • dates – normalized to machine-readable standards, so that dates can be sorted and durations calculated
  • artworks, buildings – linked to their creators, locations, and provided with their dates of creation
  • literary works, musical works – linked to their authors and, where applicable, online versions
  • events (e.g. battles) – linked to places and provided with dates
  • characters in works of fiction – linked to the respective works.

Because of the density of occurrences and the variety of entity types, Fontane’s notebooks lend themselves to advanced methods of semantic analysis.
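In TEI terms, such entity references are typically encoded with `<rs>` elements whose `ref` attributes point into lists of described entities. A minimal sketch of resolving these references (the element usage follows common TEI practice; the sample content is invented):

```python
# Sketch: resolving <rs ref="#..."> references against a <listPerson>.
# Sample document and names are invented for illustration.
import xml.etree.ElementTree as ET

DOC = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader><fileDesc><sourceDesc>
  <listPerson>
   <person xml:id="p_bismarck"><persName>Otto von Bismarck</persName></person>
  </listPerson>
 </sourceDesc></fileDesc></teiHeader>
 <text><body>
  <p>Dinner with <rs type="person" ref="#p_bismarck">the Chancellor</rs>.</p>
 </body></text>
</TEI>"""

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
root = ET.fromstring(DOC)
# Build a lookup table from the person list...
persons = {p.get(XML_ID): p.find(f"{TEI_NS}persName").text
           for p in root.iter(f"{TEI_NS}person")}
# ...and resolve each in-text reference against it:
for rs in root.iter(f"{TEI_NS}rs"):
    print(rs.text, "->", persons[rs.get("ref").lstrip("#")])
```

The person entries are where links to authority files such as GND or VIAF attach, e.g. via `<idno>` children, so that each in-text reference resolves through the local list out to external data.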

Semantic analysis

These entity occurrences are encoded in a fairly common way, using <rs> elements which link to lists of elements in which the entities are described and linked to external authority records, and <date> elements in the case of chronological references. At a later project stage, we will explore the possibilities to derive other formats from this data which facilitate the extraction and processing of their semantic content, such as Geography Markup Language (GML)/Keyhole Markup Language (KML) for spatial data, or CIDOC-CRM for events. This paper will explore how our entity data, which is available in similar form in many other TEI encoded editions, can be put to use in ways that go beyond the traditional uses described above, and which enter the realm of semantic analysis. Examples include:

  • counting entities and calculating their relative frequency. We expect high counts and concentrations on pages containing short notations or lecture notes; thus, we hope to be able to distinguish these parts from literary manuscripts;
  • enriching personal data with birth and death dates from authority files and calculating differences in order to identify historical strata;
  • identifying co-occurrences of persons and other entities and constructing networks in order to calculate graph theoretical measures;
  • connecting places to routes, visualizing them on maps and calculating their distances using coordinates from external databases. Place entity references can occur in several different roles⁸: in this context, we must distinguish places visited by Fontane where he took notes, and distant places only mentioned by Fontane. It will be of interest to analyse the differences and similarities between these two geographic networks, particularly when a chronological dimension (i.e. the date of Fontane’s visit, or the date of a historic event referred to by Fontane which took place at a mentioned site) is added;
  • comparing Fontane’s statements about entities, such as dates, locations, and names, with what we know about them today.
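The co-occurrence networks mentioned above can be sketched as a simple page-level count, with edge weights feeding into graph-theoretical measures later (the page data here is invented):

```python
# Sketch: counting person co-occurrences per notebook page to build a
# weighted network. Pages and names are illustrative, not Fontane's data.
from collections import Counter
from itertools import combinations

# entities tagged on each page, e.g. as extracted from <rs> elements
pages = {
    "16v": ["Bismarck", "Wilhelm I", "Moltke"],
    "17r": ["Bismarck", "Moltke"],
    "17v": ["Fontane", "Menzel"],
}

edges: Counter = Counter()
for names in pages.values():
    # every unordered pair on the same page strengthens one edge
    for a, b in combinations(sorted(set(names)), 2):
        edges[(a, b)] += 1      # edge weight = number of shared pages

for (a, b), weight in edges.most_common():
    print(a, "--", b, weight)
```

The resulting weighted edge list maps directly onto the nodes/links JSON structure that D3.js force-directed layouts consume.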

These data aggregations will be provided to the user as interactive graphics using D3.js or, in the case of locations connected to a specified time or period, using the DARIAH GeoBrowser e4d⁹. To this end, we are developing XSLT transformation scenarios, invoked via XQuery within our eXist-db-based project portal, that deliver the required JSON (for D3.js) or KML (for e4d¹⁰) and transfer these data sets through appropriate interfaces.
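The KML side of such a pipeline boils down to emitting time-stamped placemarks per place entity. A rough sketch of the target output (the place, coordinates and date are invented; a real pipeline would produce this via XSLT rather than Python):

```python
# Sketch: emitting a minimal time-stamped KML placemark for a place entity,
# of the kind a geobrowser such as e4d can consume. Data is invented.
from xml.sax.saxutils import escape

def placemark(name: str, lon: float, lat: float, when: str) -> str:
    """Render one place occurrence as a KML Placemark string."""
    return (f"<Placemark><name>{escape(name)}</name>"
            f"<TimeStamp><when>{when}</when></TimeStamp>"
            f"<Point><coordinates>{lon},{lat}</coordinates></Point>"
            f"</Placemark>")

print(placemark("Neuruppin", 12.8, 52.92, "1859-07-01"))
```

Coordinates would come from the external gazetteers (GeoNames, Getty TGN) linked to the place entities, and the `<when>` value from the normalized date encoding.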


  • [1] James Cummings, “The William Godwin’s Diaries Project: Customising and transforming TEI P5 XML for project work”, in: Jahrbuch für Computerphilologie 10 (2008), http://computerphilologie.de/jg08/cummings.pdf (April 29, 2009), last visited on March 27, 2013
  • [2] Women Writers Project, “Women Writers Online”, http://www.wwp.brown.edu/wwo/lab/speakers.html, last visited on March 27, 2013
  • [3] Mike Bostock, “Force Directed Graph”, http://bl.ocks.org/mbostock/4062045, last visited on March 27, 2013; based on data provided by Donald Knuth, “The Stanford GraphBase: A Platform for Combinatorial Computing”, Reading 1993
  • [4] Gabriele Radecke, “Theodor Fontanes Notizbücher. Überlegungen zu einer überlieferungsadäquaten Edition”, in: Martin Schubert (Ed.), Materialität in der Editionswissenschaft, Berlin 2010 (= Beihefte zu editio; Bd. 32), pp. 95–106. – The Berlin State Library is the owner of the notebooks and an associated partner of the project.
  • [5] Project website http://www.unigoettingen.de/de/303691.html and http://www.textgrid.de/community/fontane/
  • [6] Gemeinsame Normdatei / Integrated Authority File of the German National Library, http://www.dnb.de/EN/Standardisierung/Normdaten/GND/gnd_node.html, last visited on March 27, 2013
  • [7] Virtual International Authority File, http://viaf.org/, last visited on March 27, 2013
  • [8] Humphrey Southall, “Defining and identifying the roles of geographic references within text: Examples from the Great Britain Historical GIS project”, in: Proceedings of the HLT-NAACL 2003 workshop on Analysis of geographic references – Volume 1, pp. 69-78, doi:10.3115/1119394.1119405
  • [9] europeana4D: exploring data in space and time, http://dev2.dariah.eu/e4d/, an example using the content from one single page can be found at http://goo.gl/TSNDf, last visited on March 27, 2013
  • [10] EuropeanaConnect: “KML Specifications”, http://tinyurl.com/e4d-kml, last visited June 27, 2013

Ontologies, data modelling, and TEI


In philosophy, Ontology denotes the study of being, with traces at least 2500 years back in history. In computer science, ontologies, uncapitalised and in the plural, have been a topic of study for some thirty years, initially connected to the artificial intelligence community. Computer science ontologies refer to shared conceptualisations expressed in formal languages (Gruber, 2009). They have not been of much importance in digital humanities before the last 10-15 years, but are now gaining momentum, connected to the development of the semantic web.

In the paper I will discuss ontologies in the context of the Text Encoding Initiative (TEI Consortium, 2012), based on the computer science tradition. However, even if computer science ontologies are different from philosophical Ontology, the two are not totally disconnected (Zúñiga, 2001) and some remarks will be made on links to philosophy as well. The focus will be on how meaning can be established in computer based modelling, in connection with the sources. Meaning can be based on the sources and the interpretation of them, but can also be established through the development of the ontologies themselves.

It is sometimes claimed that TEI expresses an inherent ontology, and in some sense it is true. TEI represents a shared conceptualisation of what exists in the domains relevant to text encoding. However, even if TEI can be expressed in formal models, it is questionable whether TEI can be seen as an ontology in the computer science sense. According to the classification in Guarino et al. (2009, 12–13), XML schemas are typically not expressive enough for the formality we need for ontologies. However, the level of language formality forms a continuum and it is difficult to draw a strict line where the criterion of formal starts. This continuum can be connected to different parts of the TEI. Some parts, such as the system of persons, places, and events, may be closer to an ontology than other, less formalised parts of the standard (Ore and Eide, 2009).

Two ways of modelling

There are no ontologies without models—an ontology, after all, represents a model of a world or of a certain corner of it. The discussion in the paper will focus on active engagement with models, that is, on how meaning is generated and anchored when ontologies and other models are developed and used. For the TEI specifically, creating the standard was of course dependent on ontological considerations in the philosophical sense. Further, using it may also include similar ontological studies of the source material.

I will distinguish between two different, although overlapping, ways of modelling. First, one may use already existing models for data integration. An example of this is the task of integrating data from several different libraries and archives in order to create a common data warehouse in which the detailed classifications from each of the databases are preserved. In the process, one will want to use a common ontology for the cultural heritage sector, for instance, FRBRoo (FRBR, 2012). In doing so, one must develop a thorough understanding of the sources, be they TEI encoded texts or in other forms, as well as of the target ontology—one will develop new knowledge.

The task is intellectually demanding, and the people engaged in it will learn new things about the sources at hand. Still, the formal specification of the corner of the world they are working towards is already defined in the standard. Only in a limited number of cases will they have to develop extensions to the model. Once the job is done, making inferences in the ontology-based data warehouse can help one understand the sources and what they document even better. Yet, for all the learning involved, the process is still mostly restricted to the use of what is already there.

The second way of working with models is to create an ontology or another formal model through studying a domain of interest. In this case, a group of people will analyse what exists in the domain and how one can establish classes which are related to each other. This may, for instance, be in order to understand works of fiction, as in the development of the OntoMedia ontology, [URL: http://www.contextus.net/ontomedia/model (checked 2013-03-30)] which is used to describe the semantic content of media expressions. It can also be based on long traditions of collection management in analog as well as digital form, as in the development of CIDOC-CRM (CIDOC, 2011) in the museum community. Although one will often use data from existing information systems, the main goals of such studies are not the mappings themselves, but rather to understand and learn from previous modelling exercises in the area of interest.

The historical and current development of TEI can be seen in this context. The domain of TEI has no clear borders, but the focus is on text in arts and cultural history. In order to develop a model of this specific corner of the world, one had to analyse what exists and how the classes of things are related to each other. This is a process in which domain specialists and people trained in the creation of data models must work together, as the history of TEI exemplifies.

When applying either of the two ways of modelling, knowledge is gained through the process as well as in the study and use of the end products; one can learn from modelling as well as from models, from the process of creating an ontology as well as from the use of already existing ones. It is a common experience that actively engaging with a model, whether in creating or in using it, gives a deeper understanding than just reading it. Reading the TEI guidelines is a good way of getting an overview of the standard, but it is hard to understand it at a deeper level without using it in practical work, and it is quite clear that among the best TEI experts are those who have taken part in creating the standard.

There is no clear line between the two ways of modelling, and they often use similar methods in practice. Both have products as their end goal, and new knowledge is created in the process. Some of this new knowledge is expressed in the end products. For example, working to understand better what is important for a concept such as “person” in the domain at hand will result in new knowledge. This knowledge will be shared by the parties involved and may be expressed in the end product. However, there is a stronger pressure towards expressing such new knowledge clearly when a data standard is created than when a mapping is created.


An ontology may or may not include contradictory facts, and may contain them at different levels. How this can be related to different interpretations of the source material will be discussed in the paper, and differences between TEI and ontologies such as CIDOC-CRM will be pointed out.

While an ontology is a model of the world, a specific mapping to an ontology will be based on sources. Ways of linking ontologies to their sources in order to ensure scholarly reproducibility will be presented in the light of co-reference and of links between text encoding and ontologies in general. As a case study, this will be done through a study of ways of linking TEI to CIDOC-CRM. While the two standards will continue to develop, and in some areas, such as persons, places, events, and possibly objects, they may grow closer, they will remain two separate standards, different in scope as well as in the ways in which they are formalised.
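As a minimal illustration of the kind of link in question (the URI and identifiers below are invented for this sketch and belong to neither standard), a person recorded in TEI can be pointed at a resource typed as a CIDOC-CRM E21 Person:

```xml
<!-- TEI side: a person in the encoded source (identifiers invented) -->
<person xml:id="pers-holberg">
  <persName ref="http://example.org/actor/holberg">Ludvig Holberg</persName>
</person>

<!-- Ontology side: the same individual as a CIDOC-CRM E21 Person (RDF/XML) -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:crm="http://www.cidoc-crm.org/cidoc-crm/">
  <crm:E21_Person rdf:about="http://example.org/actor/holberg">
    <rdfs:label>Ludvig Holberg</rdfs:label>
  </crm:E21_Person>
</rdf:RDF>
```

The shared URI is what makes the co-reference explicit and keeps the mapping reproducible back to its source.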

The paper will investigate various ways of interconnecting the two as part of modelling work, and develop a draft categorisation of the most common types. I look forward to receiving feedback from a qualified audience on the draft system in order to develop it further.


  • CIDOC (2011). Definition of the CIDOC Conceptual Reference Model. [Heraklion]: CIDOC. Produced by the ICOM/CIDOC Documentation Standards Group, continued by the CIDOC CRM Special Interest Group. Version 5.0.4, December 2011.
  • FRBR (2012). Object-oriented definition and mapping to FRBR(ER) (Version 1.0.2). [Heraklion]: International Working Group on FRBR and CIDOC CRM Harmonisation. “The FRBRoo Model”.
  • Gruber, T. (2009). Ontology. In L. Liu and M. T. Özsu (Eds.), Encyclopedia of Database Systems, pp. 1963–1965. [S.n.]: Springer US.
  • Guarino, N., D. Oberle, and S. Staab (2009). What Is an Ontology? In S. Staab and R. Studer (Eds.), Handbook on ontologies, pp. 1–17. Berlin: Springer. 2nd ed.
  • Ore, C.-E. S. and Ø. Eide (2009). TEI and cultural heritage ontologies: Exchange of information? Literary & Linguistic Computing 24(2), 161– 172.
  • TEI Consortium (2012). TEI P5: Guidelines for Electronic Text Encoding and Interchange. [2.1.0]. [June 17 2012]. [S.n.]: TEI Consortium.
  • Zúñiga, G. L. (2001). Ontology: its transformation from philosophy to information systems. In N. Guarino, B. Smith, and C. Welty (Eds.), FOIS ’01: Proceedings of the international conference on Formal Ontology in Information Systems – Volume 2001, pp. 187–197. Ogunquit, Maine, USA: ACM.

TEI and the description of the Sinai Palimpsests

The library of St. Catherine’s Monastery in the Sinai Desert is well known as the source of Codex Sinaiticus and the home of the palimpsest Syriac Sinaiticus, both of which date to the 4th century C.E. It also preserves a collection of 120 known palimpsests in Greek, Syriac, Georgian, Armenian, Arabic, and several other languages, few of which have been studied extensively. The team of technical experts, engineers, and scientists responsible for imaging the Archimedes Palimpsest, the Galen Syriac Palimpsest, the Waldseemüller 1507 World Map, and David Livingstone’s 1871 Field Diary is now producing enhanced images of the original undertext in the Monastery’s palimpsests. After a 2009 technical survey by the team, a five-year project to image and survey the palimpsests at the monastery began in 2011 as a collaboration between St. Catherine’s Monastery and the Early Manuscripts Electronic Library. This latest project builds on the team’s previous spectral imaging work, which pioneered the use of spectral imaging techniques in several modalities to collect manuscript data and produce processed images that enhance the visibility of the erased undertexts.

The project is also responsible for documenting the physical condition of each manuscript and palimpsest folio, and for identifying the texts inscribed in each undertext layer. To encode the very complex descriptions of the manuscripts and their undertext layers, the project will need to employ the TEI.

This paper will discuss the Sinai Palimpsest Project’s use of the TEI to describe the palimpsests, building on the methods developed in previous projects including the Archimedes Palimpsest, Livingstone Diary, and the Walters Art Museum’s series of NEH-funded manuscript preservation and access projects.

It will also provide a survey of methods employed and challenges encountered. Most importantly, it will elicit advice and suggestions for future TEI use, and identify areas where the TEI may need to be modified to aid in complex palimpsest descriptions.

The palimpsests at St. Catherine’s have varied and complex structures. Some folios have been reused more than once, so that the collection contains several double-palimpsests, and even some triple-palimpsests with multiple layers of scraped- or washed-off text. The orientation of undertext to overtext varies from manuscript to manuscript, and even within a single manuscript. Some leaves were created by stitching together portions of reused folios, so that some present-day leaves are literal palimpsest patchworks. These conditions present challenges not only for scholars reading the undertexts, but also for their presentation by computer applications.

The Sinai Palimpsests Project employs a complex model for describing palimpsest structure. Each manuscript has a number of palimpsest folios. Each folio may have one or more undertext layers. Participating scholars are assigned sets of undertext layers from a manuscript, grouped by language and script, based on each scholar’s area of expertise. Some manuscripts have undertext layers in several languages and scripts and thus have several undertext layer groupings. The scholar examines each folio undertext layer in the assigned grouping and links the undertext layer to an “undertext object”. An undertext object is a collection of folio undertext layers that have the same textual content and are written in the same hand. The definition of an undertext object is rather strict. For example, folio undertext layers written in the same hand, but belonging to two separate New Testament books, would be assigned to two undertext objects. By this method each manuscript is divided by language and script, and then digitally sorted into undertext layers that likely belonged together in the same ‘original’ manuscript.

In a second level of analysis, scholars will examine undertext objects to determine which ones originally belonged together and link them together. Linked undertext objects may be from the same present-day manuscript or, as will often be the case, from separate present-day manuscripts. An example of this is ongoing studies of the Syriac Galen Palimpsest, which appears to have leaves scattered about the globe. These leaves are just now being tracked down by scholars. If the assumption is correct – that many of these palimpsests were generated at Sinai – it is likely that leaves from a number of manuscripts were reused and spread across two or more later manuscripts.

The TEI structure used to describe the palimpsests must express this complexity. The resulting TEI encoding will identify each undertext work and, where possible, describe the reconstructed undertext manuscripts. Doing so will require reconstructing undertext folios from palimpsested pieces that often span more than one present-day folio. The TEI encoding will integrate the project’s manuscript images, and undertext folio descriptions should map to that image data as much as possible. One goal of the project is to provide TEI manuscript descriptions that will allow applications to display images of folios in their current overtext form and order or in the reconstructed undertext form and order. Using encoded information about a palimpsest, such a tool should be able to select an image of a folio, rotate it to display an undertext layer the right way up, and if need be join that image with one or more other images to present a reconstructed view of the original undertext folio. In the case of patchwork folios, the tool should be able to select the portion of an image corresponding to an undertext layer. The markup that supports these functions should provide the following.

  • A list of the overtext folios
  • For each folio a list of each undertext layer, including:
    • A description of the undertext content
    • The undertext orientation, relative to the overtext
    • The layout of the undertext (columns, number of lines)
    • The portion of the undertext folio preserved (bottom half, top half, lower left quarter, etc.)
  • For “patchwork” folios, a method for designating a region of a folio as an undertext layer and linking that undertext layer to a region of an image
  • A method for linking several undertext layers together as parts of a single undertext folio
  • A method for collecting several undertext folios together as part of a reconstructed “undertext manuscript”, which will have its own manuscript description
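One conceivable TEI shape for such a folio-level record, sketched here with invented identifiers and values rather than a settled project schema: each undertext layer becomes an msItem carrying the required descriptive notes, while a zone in the facsimile section delimits the layer's region on the corresponding image.

```xml
<!-- One undertext layer of overtext folio 23v (all values illustrative) -->
<msItem xml:id="utl-f23v-1">
  <locus>f. 23v</locus>
  <title>Undertext: Gospel lectionary, Syriac</title>
  <note type="orientation">rotated 90 degrees clockwise relative to the overtext</note>
  <note type="layout">two columns of 24 lines</note>
  <note type="extent">top half of the original undertext folio preserved</note>
</msItem>

<!-- Facsimile section: the image region containing this layer; the matching
     identifiers allow the layer description to be linked to the zone -->
<facsimile>
  <surface xml:id="surf-f23v">
    <graphic url="images/f23v_processed.tif"/>
    <zone xml:id="zone-utl-f23v-1" ulx="0" uly="0" lrx="2400" lry="1600"/>
  </surface>
</facsimile>
```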

The complexity of the problem raises the question of whether a single TEI file can adequately and fully describe a manuscript and its undertexts, or whether this information can even be encoded in the TEI alone. One approach would be to create separate TEI files for each present-day manuscript, and then one for each reconstructed undertext manuscript. This approach solves the problem of dealing with undertext manuscripts that span several modern ones, but it necessitates markup that spans files to express relationships between over- and undertext folios. An alternate method would create a single TEI file for each current manuscript, including over- and undertexts, by assigning each reconstructed manuscript to its own TEI msPart. If the use of the TEI proves unwieldy for some features, a custom standoff markup, linked to the TEI, may be used to encode complex overtext and undertext relationships. This paper will give examples of each method.
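The single-file alternative might be skeletonised as follows (the shelfmark and part labels are invented for this sketch): the present-day manuscript is described at the top level, and each reconstructed undertext manuscript occupies its own msPart with its own identifier and description.

```xml
<msDesc xml:id="sinai-ms-NN">
  <msIdentifier>
    <settlement>Sinai</settlement>
    <repository>St. Catherine's Monastery</repository>
    <idno>Syriac NN</idno> <!-- shelfmark invented for the sketch -->
  </msIdentifier>
  <!-- ... description of the present-day overtext manuscript ... -->
  <msPart>
    <msIdentifier>
      <idno>Undertext manuscript A</idno>
    </msIdentifier>
    <!-- ... description of one reconstructed undertext manuscript ... -->
  </msPart>
  <msPart>
    <msIdentifier>
      <idno>Undertext manuscript B</idno>
    </msIdentifier>
    <!-- ... further reconstructed undertext manuscripts ... -->
  </msPart>
</msDesc>
```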

The volume, complexity, and variety of the Sinai palimpsests provide a unique opportunity to explore the use of the TEI for palimpsest descriptions in support of global virtual scholarly studies. With well-researched and documented application, TEI can serve as a key tool in this and other scholarly studies of complex texts, supporting scientific, technical, and scholarly work in the digital humanities.


  • Bockrath, Diane E., Christopher Case, Elizabeth Fetters, and Heidi Herr. “Parchment to Pixel: The Walters Islamic Manuscript Digital Project.” Art Documentation, 29, no. 2 (2010): 14-20.
  • Emery, Doug, Alexander Lee, Michael B. Toth, “The Palimpsest Data Set”, The Archimedes Palimpsest, I. Catalogue and Commentary (Cambridge: Cambridge University Press), pp. 222-239.
  • Emery, D, F.G. France, and M.B. Toth, “Management of Spectral Imaging Archives for Scientific Preservation Studies”, Archiving 2009, Society for Imaging Science and Technology, May 4-7 (2009), 137-141
  • Emery, D., M. B. Toth, and W. Noel, ‘The convergence of information technology and data management for digital imaging in museums’, Museum Management and Curatorship, 24:4 (2009), 337-356.
  • Porter, Dot, “Facsimile: A Good Start,” TEI Member’s Meeting, King’s College London, November 2008.
  • TEI Consortium, eds. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 2.3.0. 17 Jan 2013. TEI Consortium. http://www.tei-c.org/Guidelines/P5/ (30 March 2013)

From TUSTEP to TEI in Baby Steps

The “Mannheimer Korpus Historischer Zeitschriften und Zeitungen” (MKHZ) aims at documenting German newspaper language in the 18th and 19th centuries. The corpus is available both as high-resolution JPEG files and as TUSTEP transcriptions that have been acquired in a double-keying procedure. The current version of the corpus comprises 21 magazines with 652 individual volumes and over 4.1 million word tokens on 4,678 pages.

In this paper we briefly describe the original TUSTEP markup available for MKHZ and introduce an iterative and staged pipeline for transforming TUSTEP markup to TEI. The pipeline is set up in three main stages: (1) Syntactic transformation of TUSTEP to well-formed TUSTEP XML, (2) transformation of TUSTEP XML to generic TEI, (3) refinement of generic TEI with magazine specific logical structure.

The corpus has been transcribed using TUSTEP conventions. TUSTEP [2] is a framework for compiling critical editions; it predates XML and parallels SGML, XML’s predecessor. The main unit of a TUSTEP transcription is the numbered line. For the MKHZ corpus, the TUSTEP markup represents layout structure (lines, columns, and pages), logical structure (paragraphs with alignment information, tables, figures, running headers, and footnotes), typographic information (font family, style, and size), and special symbols (mostly glyphs), numbers, etc.

The layout structure is fairly complex. In particular, advertising sections make heavy use of multiple, possibly nested columns, which do not necessarily span an entire page. In contrast, the marked-up logical structure is fairly simple. There is no explicit distinction between headings and ordinary paragraphs, though heuristic rules based on style information, such as text alignment or typography, can be used to differentiate between these elements. Moreover, individual articles and their sections are not marked up explicitly. Altogether, the TUSTEP markup of MKHZ focuses on layout structure and typographic annotation, which is translated to TEI in three main stages:

(1) In the first stage, the TUSTEP markup is transformed to well-formed XML that reflects the original markup as closely as possible, without losing any markup or content or introducing spurious markup. This comprises two main challenges: firstly, TUSTEP employs a significantly more diverse markup syntax than XML; secondly, it interleaves layout structure with logical structure and makes liberal use of tag omission.

To capture TUSTEP’s diverse syntax, we extract and iteratively refine markup patterns and specify their translation to XML markup. To resolve conflicts between layout structure and logical structure, we break up logical elements, such as paragraphs and tables, and insert continuation milestones to link the broken-up elements with each other.
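In the intermediate XML this might look as follows (element and attribute names are illustrative of the approach, not the project's actual vocabulary): a paragraph interrupted by a column break is split into two elements, and pointers record that the second part continues the first.

```xml
<!-- A paragraph interrupted by a column break is split in two;
     the pointers mark the continuation across the break. -->
<p xml:id="p12a" next="#p12b">Erster Teil des Absatzes, der in der</p>
<cb/>
<p xml:id="p12b" prev="#p12a">zweiten Spalte fortgesetzt wird.</p>
```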

Technically, this stage is implemented in Perl as a pipeline of custom event-based parsers: one for producing basic well-formed XML, and one for transforming tabulated tables into tables consisting of rows and cells. Where tag omissions or incorrect markup cannot be resolved automatically, the original TUSTEP markup is modified and the changes documented in the form of a diff list. From the resulting XML we generate and manually refine an XML Schema to validate the output and to guide the transformation in Stage 2.

(2) The ad-hoc XML vocabulary resulting from Stage 1 is rather complex, comprising about 50 elements. This complexity is deliberate, because it allows for a fine-grained check of markup balance based on XML’s well-formedness criterion. In Stage 2 this complexity is reduced by mapping the vocabulary to the TEI guidelines [6]. Typographic markup is transformed to highlight elements with style attributes, structural markup is unified to paragraphs with appropriate style and type attributes, and all other elements are mapped to appropriate TEI elements. Moreover, the continuation milestones introduced in Stage 1 are used to link separated logical elements by means of so-called virtual joins, following the guidelines in [6, Section 20.3].

Technically, this stage is implemented as a pipeline of XSLT scripts: one for mapping to TEI, followed by one for inserting virtual joins. The result of this stage is TEI-compliant markup which still represents the original markup without information loss, but differentiates largely by means of attributes rather than elements, resulting in a significantly less complex schema.
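Following the mechanism of [6, Section 20.3], a virtual join in the Stage 2 output might look like this (identifiers illustrative): the fragments of the split paragraph are marked as initial and final parts, and a join element reconstitutes the logical paragraph.

```xml
<p xml:id="p12a" part="I">Erster Teil des Absatzes, der in der</p>
<cb/>
<p xml:id="p12b" part="F">zweiten Spalte fortgesetzt wird.</p>
<!-- virtual element: the two fragments together form one paragraph -->
<join target="#p12a #p12b" result="p" scope="root"/>
```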

(3) The final stage aims at explicating hidden logical structure, in particular identifying independent articles within an issue and capturing metadata such as the issue date. This requires heuristic rules specific to each of the 21 magazines. The rules use local context information, such as (usually centered) headings and typographic patterns, to group sequences of paragraphs into articles. This final transformation is carried out by means of iteratively refined custom XSLT scripts and manual annotation.
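The grouping heuristics can be sketched in XSLT 2.0 along these lines (the heading test and attribute values are invented for illustration; each magazine would need its own variant):

```xml
<!-- Group a flat sequence of paragraphs into articles, starting a new
     article at every paragraph whose style marks it as a heading. -->
<xsl:for-each-group select="tei:p"
    group-starting-with="tei:p[@style = 'centered bold']">
  <div type="article" xmlns="http://www.tei-c.org/ns/1.0">
    <head><xsl:value-of select="current-group()[1]"/></head>
    <xsl:apply-templates select="current-group()[position() gt 1]"/>
  </div>
</xsl:for-each-group>
```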

In summary, the presented pipeline aims at managing the complexity of the transformation by dividing it into several stages, which can be individually refined and validated. Each stage simplifies and unifies the markup and the underlying model, making the subsequent stage more tractable. The modular structure of the pipeline also facilitates its adaptation to other TUSTEP sources. However, the mapping from the Stage 1 TUSTEP XML to TEI in particular will probably require adaptation to the specific TUSTEP vocabulary at hand.

The resulting TEI representation is used as a pivot model for generating a visualization in XHTML + CSS that closely reflects the original layout structure, for extracting metadata as a basis for archiving the corpus in the IDS Repository [7], and for generating a representation in the IDS Text Model [8] for import into the Corpus Search and Analysis System COSMAS II [9].


  • [1] Silke Scheible, Richard J. Whitt, Martin Durrell, and Paul Bennett: Annotating a historical corpus of German: A case study. In: Proceedings of the LREC 2010 Workshop on “Language Resources and Language Technology Standards”, Valletta, Malta, 18 May 2010, pp. 64-68.
  • [2] Universität Tübingen; Zentrum für Datenverarbeitung. TUSTEP 2012: Handbuch und Referenz (electronic Version, in German). Available at: http://www.tustep.uni-tuebingen.de/
  • [3] René Witte, Thomas Kappler, Ralf Krestel, and Peter C. Lockemann: Integrating Wiki Systems, Natural Language Processing, and Semantic Technologies for Cultural Heritage Data Management, In: Language Technology for Cultural Heritage, pp.213-230, Springer, 2011.
  • [4] Ruth Christmann: Books into Bytes: Jacob and Wilhelm Grimm’s Deutsches Wörterbuch on CD-ROM and on the Internet. http://germazope.uni-trier.de/Projekte/DWB/bibliographie/books.htm (accessed March 23, 2013)
  • [5] Wilhelm Ott, Tobias Ott, Oliver Gasperlin. TXSTEP – an integrated XML-based scripting language for scholarly text data processing. Digital Humanities 2012.
  • [6] TEI Consortium, eds.: TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 2.3.0. Last updated on 17th January 2013. TEI Consortium. http://www.tei-c.org/Guidelines/P5/ (accessed March 23, 2013).
  • [7] Peter M. Fischer, Andreas Witt. Developing Solutions for Long-Term Archiving of Spoken Language Data at the Institut für Deutsche Sprache. In: Proceedings of the LREC 2012 Workshop ‘Best Practices for Speech Corpora in Linguistic Research’, Istanbul, May 21, 2012 (pp. 47-50). European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2012/workshops/03.Speech%20Corpora%20Proceedings.pdf
  • [8] Harald Lüngen, C.M. Sperberg-McQueen: A TEI P5 Document Grammar for the IDS Text Model In: Journal of the Text Encoding Initiative (2012), H. 3. http://jtei.revues.org/508
  • [9] Franck Bodmer: COSMAS II. Recherchieren in den Korpora des IDS. In: Sprachreport 3/2005. S. 2-5 – Mannheim: 2005.

How TEI is Taught: a Survey of Digital Editing Pedagogy

One of the remarkable shifts in the field of humanities computing and the digital humanities has been its emergence in recent years as a topic of instruction across universities in Europe and North America. From a cluster of specialized research techniques, humanities computing is increasingly encountered in the classroom as a subject of scholarly discussion in its own right. In humanities education, the boundary between “content” and “skills” has long been blurry and contested, and the rapid increase in courses devoted to digital humanities is testing that boundary in new and exciting ways. TEI holds a significant place within this larger picture. In Lisa Spiro’s 2011 survey of 134 digital humanities syllabi, XML and TEI were, by an overwhelming margin, the most frequently taught technologies (Spiro). In workshops, seminars, general courses in the digital humanities, and specialized courses on digital editing with “angle bracket technologies,” students encounter TEI both as a set of skills to master and as a topic among other topics within a disciplinary field of knowledge. TEI is taught in diverse formats to diverse audiences. In this talk, we will present an overview of TEI pedagogical documents (course syllabi, workshop and seminar descriptions, and instructional materials) as well as the results of our ongoing survey of TEI instructors. Our purpose will be neither prescriptive nor predictive; that is, we will not outline a program for how TEI should be taught, nor provide directions for the future. Instead, our purpose is simply to provide a picture, with as much detail as possible, of the state of TEI in 2013 from the perspective of the classroom.

TEI on the Syllabus

In our preliminary survey of TEI instructors, 52% of respondents reported teaching TEI in college courses devoted in part or in whole to digital editing. Our presentation will focus on syllabi and course descriptions that include TEI in order to see how TEI is practiced and imagined across disciplines and departments. The syllabi come from English, history, digital humanities, information technology, and library and information science. XML and TEI often feature prominently in digital humanities courses, where they tend to be studied alongside media theory and computational analysis. DH instructors often lead one- or two-day sessions in the middle of the semester on TEI, which is then used as the format for class projects. In this context, TEI work is often described as a “practical” use of “tools” within the DH curriculum (for example, Galey). Through the construction of research protocols, attention to cultural histories, and “major epistemological, methodological, technological, and institutional challenges” (Presner), students are exposed to TEI as a tool with which to understand, know, and explore the products of culture. In addition to providing a framework for undergraduate research, XML is increasingly presented to graduate students as a part of their introduction to digital work, sometimes on the belief that it is less likely than other digital formats to become obsolete (Reid). In the field of Library Science, TEI is written into course descriptions and syllabi as having both practical and theoretical aspects worth considering, yet hands-on practice is, by and large, at the fore. Information science courses, such as “Advanced XML: Electronic Publishing Standards and Systems” (Walsh) and “Information Modeling in XML” (Binkley), tackle advanced technical skills like XSLT and linked data.
On the other hand, courses like “Seminar in Historical Editing in the Electronic Era,” taught in a history department, foreground the editorial questions and problems sparked by digital remediation (Katz; see also Rehbein and Fritze). Our discussion will provide an overview of our syllabus collection as a whole and analyze pertinent examples of general trends. Our emphasis will be on the most recent courses, and we expect our body of data will change significantly when Fall 2013 courses are announced.

The Workshop as a Genre of TEI Instruction

An important genre of TEI instruction continues to be the workshop or seminar, typically lasting from one to five working days. Workshop series hosted by Oxford and Brown have reached a wide community of students. Oxford’s Summer 2012 TEI workshop offerings ranged from introductory surveys, addressing basic markup, the TEI Guidelines, and approaches to publishing TEI texts, to more advanced workshops in which students learned how to transform their TEI XML texts into formats other than HTML. With the help of NEH funding, Brown offered a series of TEI workshops to 11 North American universities from January 2007 to June 2009. Project director Julia Flanders describes their goal as teaching “text encoding in a way that emphasizes its theoretical and methodological significance.” Elena Pierazzo explains that workshops taught at King’s College London are founded on the belief that “students want to do better or new research.” Teaching strategies include the incorporation of attendee-brought material, exercises relevant to that material, and the introduction of resources that will enable attendees to become self-sufficient after completion of the course. Workshops hosted by the Japanese Association for Digital Humanities and the University of Paderborn’s Edirom Summer School 2012 foregrounded the acquisition of markup skills and the “independent handling” of TEI guidelines. Across students’ variety of interests and motivations, the primary challenge for workshop-based instruction is, in James Cummings’s words, to “produce a consistent pedagogical basis while retaining their unique character and experiences.”

2013 Survey of TEI Instructors

Our discussion will also provide an overview of responses to our “Teaching TEI” survey, a preliminary version of which was distributed this spring, receiving more than 30 responses from TEI instructors in Europe, North America, and Japan. This survey will continue to be available over the summer and will be updated next fall.

In the survey we ask:

  • 1) In what country do you primarily teach?
  • 2) In what language do you primarily teach?
  • 3) What is your position within your institution?
  • 4) What is your home department or administrative unit?
  • 5) What year did you first teach digital editing, XML, or TEI?
  • 6) How frequently do you teach digital editing, XML, or TEI?
  • 7) In what format do you teach digital editing, XML, or TEI?
  • 8) When you teach digital editing, XML, or TEI, who is your primary audience?
  • 9) Were you financially compensated for your extracurricular teaching?
  • 10) Have you ever charged a fee to participate in a workshop?
  • 11) Do you create your own course materials? What textbook or other resources do you use?

We also invited respondents to list courses and workshops taught and to describe their experience in their own words, which has allowed us to gather significant testimony from instructors new to the field.

Like the TEI community, our respondents are diverse, whether by country, language, or discipline. Our talk will provide a detailed breakdown of responses. Perhaps the most intriguing line of distinction we have found so far is years of experience. For many, teaching TEI is a new addition to their scholarly work. When asked what year they began teaching, the single most frequently reported year is 2012. Of our respondents to date, the median experience is 6 years, with a fairly even split of about 30% each between those who have taught TEI for more than eight years and those who began only since 2011. These two groups are very different. Within our set, new teachers are far more likely to teach TEI as part of a college course curriculum and much less likely to teach workshops. Their target audiences are much less likely to include professors and university staff and are more likely to be limited to undergraduates and graduate students within their respective disciplines. New teachers among our respondents are much more likely to be faculty in a literature or history department and much less likely to be library or IT professionals.

These results are consistent with our general picture: TEI is increasingly being taught and understood as a component of the general humanities curriculum. This change marks TEI’s pedagogical success and its growth in size and scope. This also means, however, that the audience of TEI pedagogy is increasingly an undergraduate audience, and that research projects completed in TEI will often take shape in the classroom. Meeting the needs of this growing audience and its research demands is one of the most important challenges facing the TEI community today.


  • Binkley, P. (2012). LIS 598 – Information Modeling in XML – University of Alberta.
  • Cummings, J. (2012). Teaching the TEI Panel
  • Cummings, J., Baalen, R., and Berglund-Prytz, Y. (2012). An Introduction to XML and the Text Encoding Initiative.
  • Flanders, J. (2009). Final Report: Seminars in Humanities Text Encoding with TEI.
  • Galey, A. (2009, Winter). FIS 2331H: Introduction to Digital Humanities. University of Toronto.
  • Hawkins, K. (2012, November 17). Creating Digital Editions: An Introduction to the Text Encoding Initiative (TEI).
  • Katz, E. (2012, Fall). Historical Editing in the Digital Era. New York University.
  • Mahony, S., and Pierazzo, E. (2012). Teaching Skills or Teaching Methodology? In B. D. Hirsch (Ed.), Digital Humanities Pedagogy. Open Book Publishers.
  • Pierazzo, E. (2011, September 21). Digital Editing. Elena Pierazzo’s Blog.
  • Pierazzo, E., Burghart, M., and Cummings, J. (2012). Teaching the TEI: from training to academic curricula. TEI Conference, College Station, TX.
  • Presner, T. (2012, Winter). Introduction to the Digital Humanities.
  • Rehbein, M., and Fritze, C. (2012). Hands-On Teaching Digital Humanities: A Didactic Analysis of a Summer School Course on Digital Editing. In B. D. Hirsch (Ed.), Digital Humanities Pedagogy. Open Book Publishers.
  • Reid, A. (2012) Graduate Education and the Ethics of the Digital Humanities. In M. Gold, (Ed.), Debates in the Digital Humanities. University of Minnesota Press.

TEI metadata as source to Europeana Regia – practical example and future challenges

Europeana Regia (2010–2012) was a project co-funded by the European Commission in the context of the Europeana project. Focusing on the incorporation of digitised manuscripts from the Middle Ages to the Renaissance into Europeana, its aim was to make virtually accessible the manuscripts of the Carolingian period (Bibliotheca Carolina), the library at the Louvre in the time of Charles V and Charles VI (Library of Charles V and Family) and the library of the Aragonese Kings of Naples.

The source metadata at the participating institutions was available in multiple formats (e.g. MARC21, EAD and TEI) and at different levels of detail, while the Europeana format at the beginning of the project was ESE v3.2. Much more was needed than just producing valid records: in order to compile the digital facsimiles into unique virtual collections via Europeana, a specification of the ESE (Europeana Semantic Elements) metadata was agreed on for the Europeana Regia manuscripts. Considering that each medieval manuscript is a unique piece of work, and that the individual or institution responsible for encoding the metadata might have their own approach to the matter, it became obvious that the task needed not only standards but also some way of checking whether a given set of metadata fulfilled these standards, in order to assure high quality within the project.

In order to identify the standard, a complete crosswalk was compiled of the information necessary for display in Europeana, encoded in ESE v3.4, for each of the different input formats. While some partners already had scholarly metadata for their manuscripts, e.g. in TEI, others still had to choose a metadata format in which to encode their lists and free-text descriptions. Furthermore, one has to keep in mind that, especially in a high-level encoding format like TEI, there are often multiple ways to express the same relation or content (<head><title> vs. <summary>; <rs type="person"> vs. <persName>). So in the end, apart from long lists, the true crosswalk between all the encodings used in the project was represented by a single reference transformation to ESE, combining all the different input format modules with a single Europeana output module. It also served as a quality assurance tool before and after ingestion. For TEI this meant that the reference transformation needed a certain subset of TEI suitable as input for medieval manuscripts; in the end, ENRICH-compliant TEI was used, with a few additions. Institutions that already have substantial TEI metadata that is not ENRICH compliant can implement a path to Europeana that first creates a reduced export metadata set, which is then transformed to ESE.
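As a rough sketch of what such a crosswalk does conceptually, the following Python fragment (standard library only; the element choices and the flat `dc:` field names are simplifications for illustration, not the project's actual reference transformation, which was implemented in XSLT) maps two parallel TEI encodings of the same content onto one ESE-like record:

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def tei_to_ese(tei_xml: str) -> dict:
    """Map a (simplified) TEI manuscript record to a flat ESE-like dict.

    Illustrative only: it accepts either of two parallel TEI encodings
    of the title, mirroring the 'multiple ways to express the same
    content' problem mentioned above."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:msName", namespaces=TEI_NS) \
        or root.findtext(".//tei:head/tei:title", namespaces=TEI_NS)
    summary = root.findtext(".//tei:summary", namespaces=TEI_NS)
    return {"dc:title": title, "dc:description": summary}

record = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <msDesc>
    <msIdentifier><msName>Evangeliarium</msName></msIdentifier>
    <msContents><summary>Carolingian gospel book</summary></msContents>
  </msDesc>
</TEI>"""

print(tei_to_ese(record))
# {'dc:title': 'Evangeliarium', 'dc:description': 'Carolingian gospel book'}
```

The design point is that all input variants converge on one output module, so quality checking happens once, on the output side.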

The XSLT code of the reference transformation is clearly structured, well commented and can be extended to accommodate further input formats such as METS/MODS. The paper will show the key elements of this export metadata format and how it maps to the ESE fields and to the final display in Europeana. Other examples make it obvious that encoding should always take full advantage of the encoding format used, as tagged metadata is much more easily read by machines than textual formatting conventions in the content entries. The standards of the Europeana Regia project actually exceeded the requirements of the ESE format through the use of identifiers, which are not properly exploited in the ESE context but already point to the semantic future of Europeana with EDM.

But the Europeana Regia project was more than just an effort in digitisation, creation of metadata and ingestion into Europeana. On the Europeana Regia portal (www.europeanaregia.eu) the partners also provided translations of the original metadata into all the languages of the participating institutions. This multilingual metadata content, like the identifiers and authority data, is a treasure that still needs to be incorporated into Europeana.

This leads to the future of Europeana as a retrieval tool for the semantic representation of fully linked medieval manuscript metadata with EDM (Europeana Data Model). While EDM, especially for manuscripts, is still a work in progress (DM2E), a lot can be learned from, and already be done on the basis of, the Europeana Regia work. The author will show how the reference transformation was changed to produce valid EDM from TEI, MARCXML and EAD. For TEI, the advantages and caveats of trying to make full use of the semantic EDM/RDF possibilities will be presented, based on scholarly metadata for medieval manuscripts encoded in TEI. A description of the reference transformation to EDM is given, based on the XSLT code and metadata examples from the project. As the representation of manuscript metadata in a semantic ontology gains momentum, the author hopes to provide some suggestions for future TEI use in that field.

Documenting “processing expectations” (attentes applicatives)

It is often said in the TEI community that encoding should not concern itself with processing. This is self-evident: encoding must not depend on application constraints, but on the analysis of the encoded text and its components. Such a principle has certainly protected the TEI from several passing technological fads. It has also contributed to a considerable growth in the number of elements (more than 600 today). Nevertheless, we believe that this growth now hampers the deployment of the TEI, and that its price is paid in complexity of learning and of implementation. Documenting “processing expectations” for each element would, in our view, be useful for evaluating the definitions proposed by the TEI and would encourage the convergence of encoding practices.

Since the TEI is used to encode texts, the most common processing expectation is without doubt reading. It is essential to be able to distribute encoded texts in formats useful to the reader: on screen (HTML, ePub, etc.) or in print (LaTeX, ODT, etc.). Comfortable (re)reading is also decisive in the process of correcting encoded textual corpora. For this, we all use the transformations maintained by the consortium and the OxGarage conversion tool. But to reach the editorial quality expected in academia, we are almost always forced to customize them, even for textual components as common as notes (authorial, editorial, critical apparatus) or an index. The quality of the tools maintained by the consortium is not at all in question, but rather the TEI's self-proclaimed permissiveness: “For the Guidelines to have wide acceptability, it was important to ensure that: (…) multiple parallel encodings of the same feature should be possible.” Such a principle guarantees the adaptability of the TEI to almost any type of source and scholarly question, and surely explains its academic success. This permissiveness is an undeniable strength of the TEI. Yet, by allowing competing encoding solutions for the same need, it complicates processing, exchange and scholarly exploitation: how, for example, can one generate cross-indexes of named entities from TEI files with heterogeneous encodings? Even the most frequent textual components, such as typographic rendering, are affected (there are no standardized values for the @rend attribute). The freedom given here is illusory, for these most frequent components are also the best known and defined. A processing expectation as elementary as display thus pushes toward a more precise textual model.

Our proposal would be to add to each element in the Guidelines, alongside the definition and the examples, a “processing expectations” section. These expectations concern display as much as exchange (<teiHeader>) or scholarly exploitation (named entities, linguistics, etc.). One would specify, for example, that an <author> in a <titleStmt> designates the principal author of the text, unlike an <author> in the <sourceDesc>; or that a <persName> element in a <body> can feed an index of cited persons, by grouping keys (@key, @ref?). Equivalents in various import and export formats, such as word processors, Dublin Core, HTML5, ePub or LaTeX, could illustrate these processing expectations and usefully clarify TEI semantics by comparison with other formats. In developing various tools (odt2tei, teipub, lateix), we have been forced to make choices that are not purely technical but semantic, and for which we would have appreciated more explicit guidance. Such information exists, but it is scattered across the prose chapters, the mailing list, or expressed implicitly in the transformations maintained by the consortium. It would be valuable to concentrate it on the documentation pages we consult most often (those of the elements).
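One such processing expectation, an index of cited persons grouped by key, can be sketched in a few lines (Python, standard library only; the element and attribute names are from the TEI, but the grouping logic and the sample document are invented for illustration):

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI = "{http://www.tei-c.org/ns/1.0}"

def person_index(tei_xml: str) -> dict:
    """Group <persName> occurrences in <body> by their @key (or @ref)
    attribute, as a reader-facing index of cited persons might do."""
    root = ET.fromstring(tei_xml)
    index = defaultdict(list)
    body = root.find(f".//{TEI}body")
    for pers in body.iter(f"{TEI}persName"):
        key = pers.get("key") or pers.get("ref")
        if key:
            index[key].append("".join(pers.itertext()).strip())
    return dict(index)

doc = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <p><persName key="VH1">Victor Hugo</persName> et
     <persName key="VH1">Hugo</persName> ...
     <persName key="GS1">George Sand</persName></p>
</body></text></TEI>"""

print(person_index(doc))
# {'VH1': ['Victor Hugo', 'Hugo'], 'GS1': ['George Sand']}
```

The sketch only works because both occurrences of the same person carry the same key, which is exactly the kind of convention a documented processing expectation would make explicit.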

Without being prescriptive, these processing expectations would encourage the convergence of text-encoding practices and, because it is difficult to take over files whose encoding is heterogeneous and poorly documented, would improve their longevity. They would also be valuable guidance for software developers implementing the functionality desired by the community.

The Lifecycle of the DTA Base Format (DTABf)


This paper describes a strict subset of TEI P5, the DTA ‘base format’ (henceforth DTABf, [DTABf]), which provides tagging solutions rich enough to encode the non-controversial structural aspects of texts while allowing only minimal semantic interpretation. While Geyken et al. (2012) focused on a comparison of the DTABf with other commonly used XML/TEI schemas such as TEI Tite, TEI in Libraries or TEI-Analytics, this article places particular emphasis on the lifecycle of the DTABf.

The DTABf was created in order to provide homogeneous text annotation throughout the corpora of the Deutsches Textarchiv (German Text Archive; henceforth DTA, [DTA]). The goal of the DTA project is to create a large corpus of historical New High German texts (1600–1900) that is balanced with respect to date of origin, text type and thematic scope, and is thus intended to constitute the basis of a reference corpus for the development of the New High German language. As of June 2013, the DTA corpora contain 1363 works. The text basis is continuously extended, either with texts digitized by the DTA or with texts originating from other project contexts ([DTAE]).

The DTABf was created by applying encoding recommendations formulated by the DTA to the texts digitized during the first project phase (2007–2010, 653 texts). On the basis of the resulting annotations it underwent a thorough revision, in which the handling of structural phenomena was reconsidered and consistent solutions were determined. As a result of these efforts the DTABf now consists of three components:

  • an ODD file specifying constraints on TEI elements, attributes and values thus reducing the flexibility of the TEI P5 tag set while still providing a fully TEI P5 conformant format ([DTABf ODD]);
  • an RNG schema generated from that ODD ([DTABf RNG]);
  • a comprehensive documentation explaining the handling of common structuring necessities as well as of special cases, and illustrating each phenomenon with examples from the corpus ([DTABf]).

The DTABf currently (June 2013) comprises 77 <teiHeader> elements and 50 <text> elements, together with a limited selection of attributes and values (where feasible). The DTABf header elements specify bibliographic information about the physical and the electronic source, the text classification and legal status of the document, as well as information about the encoding. The DTABf text elements cover formal text structures (e.g. page breaks; lists; figures; physical layout information such as forme work, highlighting, etc.) as well as semantic text structures (headings; proper names; text types such as poem, chapter, letter, index, note, etc.). Furthermore, the DTABf allows for documented editorial interventions (e.g. correction of printing errors or editorial comments). Linguistic information is not encoded inline, since it is gained automatically and applied to the DTA texts via standoff markup (Jurish 2010). The search engine of the DTA supports linguistic queries and allows filtering by DTABf elements and attributes ([DDC]).

The DTABf’s Life Cycle

Despite the large and heterogeneous data basis upon which the DTABf has been built over the past years, new structural phenomena may appear with new texts, mainly because of individual printing habits in older historical works. In addition, the markup of texts encoded by external projects may differ from the DTABf either formally or semantically. The DTABf therefore continually comes under scrutiny, the challenges being, first, to decide whether adaptations of the format are unavoidable in order to meet new requirements, and second, to ensure that such adaptations do not lead to inconsistencies in the structural markup within the corpus, the latter being a necessary prerequisite for the interoperability of the corpus resources. In the next sections we illustrate these cases.

New Phenomena in the Scope of DTABf

New phenomena that are within the scope of the DTABf fall into two classes: either a tagging solution relying on DTABf elements, attributes and values can be found for the structural phenomenon at stake, or the markup can be transformed into DTABf markup.

When a new structural phenomenon is encountered, there is usually a semantically equivalent tagging solution already provided by the DTABf. The facsimile in example 1 represents a case where a (discontinuous) quotation is presented inline, whereas the bibliographic citation is given in the margin. This markup can be transformed into a DTABf solution in which the discontinuous quotation parts, the linear order of the text and the correct bibliographic references are all handled.

Example 1: Discontinuous Quotations
Figure 1. Example 1: Discontinuous Quotations

Texts from external projects can contain markup that is not part of the DTABf format. In many of these cases there is a straightforward transformation of the original tagging into an already existing DTABf solution. This case is illustrated in example 2, where the <unclear> element is replaced by the <gap> element, which is part of the DTABf.

Example 2: Tagging of text loss
Figure 2. Example 2: Tagging of text loss
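The kind of markup normalization described for example 2 can be sketched as follows (Python, standard library only; a deliberately minimal stand-in for the DTA's actual conversion routines, whose attribute handling and text treatment are specified in the DTABf documentation):

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

def unclear_to_gap(tei_xml: str) -> str:
    """Rewrite non-DTABf <unclear> markup as the DTABf <gap> element.

    Toy sketch: renames the element and empties it, since <gap/> marks
    an omission rather than carrying the illegible text."""
    root = ET.fromstring(tei_xml)
    for elem in root.iter(f"{TEI}unclear"):
        elem.tag = f"{TEI}gap"   # rename the element in place
        elem.text = None         # drop the contained text
        for child in list(elem):
            elem.remove(child)
    return ET.tostring(root, encoding="unicode")

src = '<p xmlns="http://www.tei-c.org/ns/1.0">lost <unclear>word</unclear> here</p>'
print(unclear_to_gap(src))
```

Running in-place rewrites like this before schema validation keeps the corpus-wide markup consistent without touching the surrounding text nodes.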

New Phenomena that Require Changes to the DTABf

Changes to the DTABf are carried out only if they are consistent with the existing tag set and do not introduce ambiguities into the format. Changes mainly concern attributes or values, and less frequently TEI elements or modules. Possible scenarios in which new requirements cannot be handled within the existing DTABf are the following:

  • New texts may contain structures which are new to the DTABf, e.g. due to a new text type or document type (e.g. manuscripts).
  • The structural depth upon which the external text has been annotated has no equivalent within the DTABf. Example 3 illustrates this case: a new attribute-value pair (@type="editorial") has been introduced into the DTABf to cope with editorial descriptions of an image.
    Example 3: editorial comments in notes
    Figure 3. Example 3: editorial comments in notes
  • Gaps in the documentation can lead to uncertainties about the markup-elements to be applied.
    Example 4: Encoding as list items or as paragraphs?
    Figure 4. Example 4: Encoding as list items or as paragraphs?
  • New TEI P5 releases may introduce changes to tei_all which may affect the DTABf.
    Example 5: @unit vs. @type within <biblScope> (TEI header) in release 2.3.0
    Figure 5. Example 5: @unit vs. @type within <biblScope> (TEI header) in release 2.3.0

Ensuring the Consistency of DTABf Encoded Texts

DTABf Encoding Levels

With the growth of the DTABf it becomes increasingly difficult and time-consuming to apply the whole range of possible DTABf annotations to each DTA corpus text individually. We have therefore introduced three levels of annotation, which allow for a quick check of the extent to which a text is interoperable with the other texts of the corpus. Each level corresponds to a set of elements; the element lists of the three levels are disjoint. Level 1 lists all the elements mandatory for DTABf conformity (e.g. <div>, <head>, <p>, <lg>, <figure>, <pb>), level 2 those that are recommended (e.g. <cit>, <opener>, <closer>, <lb>, <hi>), and level 3 the optional elements (e.g. <persName>, <placeName>, <foreign>). For example, if a document is DTABf level 3 conformant, all elements of levels 1 to 3 must have been applied exhaustively according to the DTA encoding documentation. If elements are only partially applied (e.g. partial application of <persName>), the document is not level 3 conformant and thus not interoperable at that level.
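A presence-based check of this level logic might look like the following sketch (Python, standard library only; the element lists are abbreviated from those named above, and since "exhaustive application" cannot be verified by element presence alone, this is a necessary condition only, not a full conformance test):

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Abbreviated element lists per DTABf annotation level; the real,
# disjoint lists are defined in the DTABf documentation.
LEVELS = {
    1: {"div", "head", "p", "lg", "figure", "pb"},
    2: {"cit", "opener", "closer", "lb", "hi"},
    3: {"persName", "placeName", "foreign"},
}
SCAFFOLD = {"TEI", "text", "body"}  # structural wrappers, ignored here

def used_elements(tei_xml: str) -> set:
    """Collect the local names of all elements used in the document."""
    root = ET.fromstring(tei_xml)
    return {e.tag.replace(TEI, "") for e in root.iter()}

def uses_only_levels(tei_xml: str, n: int) -> bool:
    """Necessary (not sufficient) condition for DTABf level-n
    conformance: the document uses no elements beyond levels 1..n."""
    allowed = SCAFFOLD.union(*(LEVELS[k] for k in range(1, n + 1)))
    return used_elements(tei_xml) <= allowed

doc = ('<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
       '<div><head>Title</head><p>Text with <hi>emphasis</hi></p></div>'
       '</body></text></TEI>')
print(uses_only_levels(doc, 1), uses_only_levels(doc, 2))
# <hi> is a level-2 element, so: False True
```

Because the level sets are disjoint, the check can simply union them upward, which mirrors how the levels nest in the DTABf scheme.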

Training and Tools for Users

The existence of a comprehensive documentation is a necessary prerequisite for the applicability of the DTABf by a larger user community. In addition, the DTA offers workshops and tutorials where users learn to apply the DTABf in a consistent way.

Furthermore, text edition according to the DTABf is supported by DTAoX, a framework for the Author mode of the oXygen XML Editor. DTAoX provides an ad hoc visualization of DTABf-tagged text passages, of the annotation levels they belong to, and of potential discrepancies from the DTABf.

Conclusion and Further Work

For the supervised growth of the DTABf we make extensive use of the wide range of customization mechanisms that ODD provides. We plan to include Schematron rules, which will enable us to formulate more expressive restrictions. For example, we would like to restrict the usage of elements that may occur both in the <teiHeader> and in the <text> area (e.g. <msDesc>) to one or the other, which is currently not possible with the ODD mechanism itself.
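A minimal Schematron sketch of the kind of rule envisaged (hypothetical; the actual DTABf rules have not been published) might restrict <msDesc> to the <teiHeader>:

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <!-- Hypothetical rule: allow <msDesc> only inside the <teiHeader>,
       not in the <text> area (or vice versa, per project policy). -->
  <sch:pattern>
    <sch:rule context="tei:msDesc">
      <sch:assert test="ancestor::tei:teiHeader">
        msDesc is restricted to the teiHeader in this customization.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Such rules can be embedded in the ODD via <constraintSpec>, so the restriction travels with the customization rather than living in a separate schema file.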

The DTABf currently serves as the best-practice format for the encoding of historical printed texts in the CLARIN-D project ([CLARIN-D User Guide]). For better visibility of the DTABf we plan to publish, at the CLARIN-EU level, a CMDI profile of the DTABf metadata in which the DTABf metadata elements and attributes are connected to the ISOcat registry, as well as conversion routines for the transformation of DTABf-conformant header metadata into CMDI. With these efforts we want to ensure the further long-term maintenance and lifecycle of the DTABf beyond the duration of the DTA project.


  1. Geyken, Alexander; Haaf, Susanne; and Wiegand, Frank (2012): The DTA ‘base format’: A TEI-Subset for the Compilation of Interoperable Corpora. In Proceedings of KONVENS 2012, Vienna, pp. 383–391. [online version]
  2. Jurish, Bryan (2010): More than Words: Using Token Context to Improve Canonicalization of Historical German. JLCL 25(1): 23–39. [online version]

For further references [http://www.deutschestextarchiv.de/doku/publikationen].

Promoting the linguistic diversity of TEI in the Maghreb and the Arab region


For many centuries the Maghreb region has experienced significant linguistic hybridization that has slowly shaped its cultural heritage. Besides Libyan, Latin and Ottoman contributions, significant amounts of other resources in various cultures and languages have accumulated in the Maghreb region, derived either from classical Arabic (i.e. regional dialects) or from various dialects of Berber (e.g. Kabyle). Several resources are even composed simultaneously in several common or restricted languages (literary Arabic, colloquial Arabic, French, English, Berber): newspapers, “city printing”, advertising media, popular literature, tales, manuals for learning languages, etc. These resources are often written in a hybrid script mixing classical and vernacular Arabic, or combining transliteration between Latin, Arabic and Tifinagh (the traditional Berber script). Unlike many traditional textual resources (conventional printed documents and medieval manuscripts), no vast corpora of texts in vernacular idioms and scripts exist today. Our hypothesis, however, is that growing awareness of the diversity of these textual resources will rapidly result in an exponential increase in the number of researchers interested in collecting and studying classical old texts and oral resources. The TEI standard encoding format provides in this respect a unique opportunity to make the most of these resources by ensuring their integration into the international cultural heritage and their use with maximum technical flexibility. The “HumanitéDigitMaghreb” project, which is the subject of this paper, intends to address several aspects of these research objectives and to initiate their appropriation.

Research hypothesis

The project targets both oral corpora and the rich textual resources written in the Maghreb region. It focuses particularly on the continuity, over more than twelve centuries, of a classical Arabic language that is still alive, and on the extreme hybridization of vernacular languages sustained by rich Libyan, Roman, Hebrew and Ottoman influences and by more recent French, Spanish and Italian linguistic interference. In short, the Maghreb is a place of extremely abundant, but largely unexploited, textual studies.

Our project allows comparative perspectives on how to adapt a TEI originally designed for classical and modern European languages (Latin, medieval languages, etc.) in order to work on corpora in literary Arabic and in mixed languages and scripts. For example, how can researchers from the Maghreb, who are invested in the study of French metre and fully understand TEI markup, grasp the subtlety of Arabic metre markup? How do they develop and, where possible, exemplify terminological equivalents for metrical description in English, French and Arabic? How can they determine whether there are genuinely specific ‘Arabic’ structural concepts, and then provide the appropriate tags for them? These questions may concern ‘manuscripts’, ‘critical apparatus’, ‘performance text’, etc. For ‘TEI speech’, however, we assume that a specific method is not really required, although much work remains to be done. In doing this, we are aware that research on similar adaptations is under way in other languages and cultures: Korean, Chinese, Japanese, etc. These adaptations and appropriations of the TEI are of great interest to us.

Core questions

As a starting point, we consider that the use of the TEI in the Maghreb and the Middle East is still sporadic and uncoordinated. Existing work is mainly concentrated on the study of manuscripts and rare books. This focus can be explained primarily by the existence of large collections of Oriental manuscripts in Western digital collections that have long been TEI encoded. It can also be explained by the urgency felt within Arab cultural institutions to accelerate the preservation of cultural heritage threatened by deterioration. Thus, we assume that the TEI has profited comparatively well from the experiences and projects of encoding Arabic manuscripts. However, this effort seemingly still needs a larger amount of feedback of other kinds, generated from other types of resources with other forms of complexity (mainly linguistic and structural). The question that drives us here is: how might the complexity of this cultural heritage (that of the Maghreb, as far as we are concerned) contribute to the TEI? How can its cultural and technological distinctiveness be defined in comparison with the current TEI P5, and what are the solutions?


In the “HumanitéDigitMaghreb” project, we focus particularly on methods of implementing the TEI to address the specific complex structures of multilingual corpora. We have achieved some results, but in the long term we concentrate especially on the practical and prospective issues of very large standardized and linguistically structured corpora that will allow all linguistic communities (here, the Maghreb) to constitute appropriate reference resources in order to interact correctly with translation technologies and e-semantics in the future. On this last point, it is essential that the community of Arab and Berber researchers mobilize without delay to provide these languages (both written and oral) with their digital modernity. Three steps are to be taken in this respect:

1. The first step, which is beyond the limits of our project “HumanitéDigitMaghreb”, inevitably involves a linguistic and sociocultural analysis of the Arabic context in order to clarify three points: first, how the TEI, in its current and future versions, would encode the Arab cultural heritage; second, how the Arabic context exceeds the limits of any single level of standard cataloguing (MARC, ISBD, AACR2, Dublin Core); and third, how it succeeds in standardizing the different approaches to the scholarly reading of its heritage.

In its constant evolution, and in its need to strengthen its internationalization, the TEI community would undoubtedly profit from these cultural and linguistic characteristics. This would also require that the community be well organized enough to provide adequate standardized encoding formats for a wide range of linguistically heterogeneous textual data. One can imagine here the encoding needs of electronic texts in Arabic dialects, interspersed with transliterated insertions or written in different scripts. These texts are potentially very complex. Besides connecting these materials to each other, as in parallel (often bilingual) data, there are further levels of complexity inherent in the use of multiple non-standard character sets and transcription systems (different from the International Phonetic Alphabet), and related to the need to transcribe speech in an overwhelmingly oral society, which poses interesting encoding problems.

2. The second step, which is within the scope of our proposal, is to produce TEI standard references in local languages and to introduce them to academic and professional communities. These standards help address issues of specific linguistic complexity, such as the hybridization of digital resources (local dialects) and the preservation of a millennia-old oral and artistic heritage. The issue of character sets is not without consequence for representing local dialects, in large part because many of their cultural aspects were not taken into account in the development of existing standards (the transcription of numbers and symbols, some forms of ligatures, diplomatic and former alphabets). There are, for example, many properties of the Arabic or Berber languages, such as tonal properties, regional synonymy and classical vocalization (notarial writing), that require special treatment. Current standards, in particular Unicode and, even more so, the ISO 8859 standards, do not take many of these aspects into account.

3. The third step, in which we are also engaged, is the creation of a community of practice specialized in the treatment of specific resources. We note here that most of these resources are potentially complex, and certain features probably require specific markup arrangements. This means that a dynamic environment is required to specify the encoding of these documents: one in which it is easy to encode simple structures, but in which more complex structures can also be encoded. It is therefore important to have specifications that can easily be extended when new and interesting features are identified.

We are interested in the TEI not only for its collegial dynamics, open to non-European linguistic diversity (Japan, China, Korea, etc.), but also for its eclectic range of research disciplines (literature, manuscripts, oral corpora, research in the arts, linguistics, etc.) and for its rigour in maintaining, enriching and documenting guidelines open to diversity while ensuring the interoperability of all the resources produced.


The results of our work are presented on a website that lists a collection of TEI-encoded samples of resources in areas such as music, Arabic poetry, Kabyle storytelling and oral corpora. To achieve this, we went through a fairly rapid first phase of appropriating the TEI guidelines. The second phase will be a wider dissemination of the TEI guidelines among a larger community of users, including graduate students and, above all, scholars in the Maghreb region not yet convinced of the TEI's added value: specialists in Arabic poetry, specialists in the Berber language, musicologists, storytelling specialists, and so on. The translation of TEI P5 into French and Arabic, the development of a sample corpus, and the construction of a multilingual TEI terminology or glossary in English/French/Arabic all seem very necessary.

We also intend to propose research activities within other communities acting at national and regional levels, in order to be in full synergy with the international dynamics of the TEI. We are already involved in an international project, the “Bibliothèque Numérique Franco-Berbère”, aimed at producing Franco-Berber digital resources with funding from the French-speaking international organization. In short, by joining the school of thought of the Digital Humanities and the TEI, we explicitly intend not only to give a tangible, digital reality to our work, but also to make it easily cumulative, upgradable and exchangeable worldwide. More specifically, we expect our work to be easily exchangeable among our three Maghreb partner languages (Arabic, French, Berber) besides English.

Apart from the emerging issue of managing and establishing a standardized and interoperable digital heritage, it is obvious that specialists in this literary heritage should thoroughly explore its methods of study and cataloguing. This article is therefore limited to questions of how scholars and professionals (libraries and research centres) appropriate digital humanities tools and services in the Oriental context. We will focus, among other issues, on comparative cultural problems, setting the study of ancient European manuscripts against the Arabic cultural context.


  • Abbès, R. (2000). “Encodage des corpus de textes arabes en conformité à la TEI, outils et démarche technique”. Rapport final du projet DIINAR-MBC.
  • Bauden, F., Cortese, D., and others (2002). Arabic Manuscripts. A Descriptive Catalogue of Manuscripts in the Library of The Institute of Ismaili Studies.
  • Burnard, L. (2012). “Encoder l’oral en TEI : démarches, avantages, défis”. Conférence à la Bibliothèque Nationale de France, Paris: Abigaël Pesses.
  • Guesdon, Marie-Geneviève (2008). “Bibliothèque nationale de France: Manuscripts catalogue ‘Archives et manuscrits’”. Paper presented at the Fourth Islamic Manuscript Conference, Cambridge.
  • Hall, G. (2011). Oxford, Cambridge Islamic manuscripts catalogue online. http://www.jisc.ac.uk/whatwedo/programmes/digitisation/islamdigi/islamoxbridge.aspx
  • Henshaw, C. (2010). “The Wellcome Arabic Manuscript Cataloguing Partnership”. News in brief, D-Lib Magazine, March/April. http://www.dlib.org/dlib/march10/03inbrief.html
  • Ide, N. (1996). “Representation schemes for language data: the Text Encoding Initiative and its potential impact for encoding African languages”. In CARI’96.
  • Ide, N. M., and Véronis, J. (1995). Text Encoding Initiative: Background and Contexts. Springer.
  • Jungen, C. (2012). “Quand le texte se fait matière”. Terrain, 59(2), 104–119.
  • Mohammed Ourabah, S., and Hassoun, M. (2012). “A TEI P5 Manuscript Description Adaptation for Cataloguing Digitized Arabic Manuscripts”. Journal of the Text Encoding Initiative.
  • Pierazzo, E. (2010). “On the Arabic ENRICH schema”. Wellcome Library Blog, 27 August. http://wellcomelibrary.blogspot.com/2010/08/guest-post-elena-pierazzo-on-arabic.html
  • Véronis, J. (2000). Parallel Text Processing: Alignment and Use of Translation Corpora. Springer.

XQuerying the medieval Dubrovnik

‘To anyone with the time and patience to study the voluminous Acta consiliorum [of Dubrovnik / Ragusa]’, wrote Fernand Braudel in 1949, ‘they afford an opportunity to observe the extraordinarily well-preserved spectacle of a medieval town in action.’ The archival series of decisions and deliberations made by the three administrative councils of Dubrovnik consists of hundreds of handwritten volumes, predominantly in Latin and still not published in their entirety, spanning the period from 1301 until 1808 (the year the Republic of Ragusa was abolished by Napoleon’s Marshal Auguste de Marmont) [1].

In collaboration with the Croatian Academy of Sciences and Arts, Institute of Historical Sciences – Dubrovnik, the current publisher of the series Monumenta historica Ragusina (MHR), we have undertaken a pilot project of converting Volume 6 of the MHR to TEI XML. The volume publishes the so-called Reformationes of the Dubrovnik councils from the years 1390-1392; it was edited by Nella Lonza and Zdravko Šundrica in 2005 [2]. In this text, the salient points of the Reformationes (meetings, names of persons and places, dates, values and measures, themes, textual annotations) are being marked up and the markup decisions carefully documented, with the twofold intention of, first, enabling XQuery searches of the Reformationes through the BaseX database [3], not just by us but by other users, and, second, preparing the documentation for further encoding of other MHR volumes (we see the production of an “MHR in XML” data set as a necessary, but necessarily extensive, task).
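As a taste of the kind of query this markup is meant to support, here is a minimal sketch in Python (standing in for an XQuery one would actually run in BaseX); the TEI fragment and its element choices are invented for illustration, not taken from the project's encoding guidelines.

```python
import xml.etree.ElementTree as ET

# A tiny, invented TEI-like fragment in the spirit of the Reformationes
# markup described above (element and attribute choices are hypothetical,
# not the project's actual encoding).
SAMPLE = """
<div type="meeting" xmlns="http://www.tei-c.org/ns/1.0">
  <head>Consilium Minus, <date when="1390-06-12">12 June 1390</date></head>
  <p>Prima pars est de faciendo gratiam <persName>Junio de Georgio</persName>
     pro <measure type="currency" quantity="50" unit="yperpera">L yperperis</measure>.</p>
</div>
"""

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

def persons_mentioned(xml_text):
    """Return all person names marked up in a fragment (an XPath query
    comparable to a simple XQuery one would run in BaseX)."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(".//tei:persName", TEI)]

print(persons_mentioned(SAMPLE))  # ['Junio de Georgio']
```

In BaseX itself the same question would be a one-line XQuery over the full database rather than a single fragment, but the shape of the query is the same.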

The small city of Dubrovnik and its relatively closed but well-documented society have already been the subject of a database-driven research project, carried out in 2000 by David Rheubottom (then at the University of Manchester), who used archival records to examine the relationship between kinship, marriage, and political change in Dubrovnik’s elite over a fifty-year period, from 1440 to 1490 [4]. But where Rheubottom, relying on a classical relational database, extracted records from the original text, abstracting data from words [5], we intend to use the advantages of XML to interpret not only the data but also its relationship with the words (enabling research of, e.g., the administrative formulaic language). Where Rheubottom built his database to explore one set of problems over a limited time series, we intend to make it possible for different researchers to pursue their different interests within a framework which could, eventually, embrace all recorded decisions from 500 years of Dubrovnik’s history. Last but not least, Rheubottom’s database remained unpublished and his interpretations were published as a printed book; today we have the possibility to publish (or open access to) not only the TEI XML annotated version of MHR 6, but also the documentation of our encoding principles, as well as the XQueries which we find useful or interesting. Publishing the XQueries makes our research repeatable and reproducible [6]; presenting them in a graded, logically organized way, from the simplest and easiest to the more complex and difficult, ensures their educational value.

The TEI XML encoding standard is sometimes criticized for its “there’s more than one way to do it” approach. We hope to show that what one person regards as a drawback, another can regard as an asset; we hope to demonstrate not only how we chose among available TEI elements and attributes to solve specific encoding challenges (e.g. encoding commodity prices, persons referred to also by their father’s name, the absence of explicit dates in datable documents, election results), but also to show the ongoing process of documenting the selected combinations and their “constellations”, both in free prose, more accessible to laypersons, and in the format of XML Schema Documentation of the TEI subset produced by the encoding [7].

XQuery is a powerful and expressive programming language, but it is certainly not something that ordinary computer users normally see; by and large, the XQuery layer remains hidden and only selected, prefabricated queries are displayed. Mastering XQuery to explore a database can seem a daunting task, and one best left to non-academic specialists. But let us not forget that the historians who plan to explore the records of medieval Dubrovnik in their existing form have already shown enough motivation to master a similarly daunting accessory task: learning medieval Latin (and, in some cases, medieval palaeography). Also, looking at a resource such as The Programming Historian collaborative textbook [8], one can see to what computing depths some historians are prepared to go to be able to pose interesting questions to their material. The ideal user of the MHR in XML is an algorithmically literate medieval scholar, one who does not treat computers as black boxes; perhaps the MHR in XML can itself produce, that is, educate, such digital humanists. Because, as Aristotle wrote, ‘Anything that we have to learn to do we learn by the actual doing of it’.


  • [1] Croatian State Archive in Dubrovnik, “Pregled fondova i zbirki, A.1.5. Dubrovačka Republika do 1808.” [“A list of archival series and collections, A.1.5 The Republic of Dubrovnik until 1808”].
  • [2] Lonza, Nella and Šundrica, Zdravko (eds). Odluke dubrovačkih vijeća 1390-1392 [Deliberations of the Councils of Dubrovnik 1390-1392]. Dubrovnik: HAZU, Zavod za povijesne znanosti u Dubrovniku, 2005.
  • [3] ‘BaseX. The XML Database’.
  • [4] Rheubottom, David. Age, Marriage, and Politics in Fifteenth-Century Ragusa. New York: Oxford University Press, 2000.
  • [5] Rheubottom, David. ‘Computers and the political structure of a fifteenth-century city-state (Ragusa)’, in History and Computing, edited by Peter Denley and Deian Hopkin. Manchester University Press, 1987, pp. 126-132.
  • [6] ‘BaseX Adventures’.
  • [7] ‘Reformationes consiliorum civitatis Ragusii: encoding guidelines’ [under construction].
  • [8] Crymble, Adam et al. ‘The Programming Historian 2’.

Analyzing TEI encoded texts with the TXM platform

TXM (http://sf.net/projects/txm) is an open-source software platform providing tools for qualitative and quantitative content analysis of text corpora. It implements the textometric (formerly lexicometric) methods developed in France since the 1980s, as well as widely used tools for corpus search and statistical text analysis (Heiden 2010).

TXM uses a TEI extension called “XML-TXM” as its native format for storing corpus source texts that have been tokenized and annotated with NLP tools (http://sourceforge.net/apps/mediawiki/txm/index.php?title=XML-TXM). The capacity to import and correctly analyze TEI encoded texts was one of the features requested in the original design of the platform.

However, the flexibility of the TEI framework (which is its strength) and the variety of encoding practices make it virtually impossible to work out a universal strategy for building a properly structured corpus (i.e. one compatible with the data model of the search and analysis engines) out of an arbitrary TEI encoded text or group of texts. It should nevertheless be possible to define a subset of TEI elements that would be correctly interpreted during the various stages of the corpus import process (for example, the TEI Lite tag set), to specify the minimum requirements for the document structure, and to suggest a mechanism for customization. This work is being progressively carried out by the TXM development team, but it can hardly be successful without input from the TEI community.

The goal of this paper is to present the way TXM currently deals with importing TEI encoded corpora and to discuss the ways to improve this process by interpreting TEI elements in terms of the TXM data model.

At present, TXM includes an “XML-TEI-BFM” import module developed for the texts of the Base de Français Médiéval (BFM) Old French corpus (http://txm.bfm-corpus.org), marked up according to the project-specific TEI customization and guidelines (Guillot et al. 2010). With some adaptation, this module works correctly for a number of other TEI encoding schemas used by several projects: Perseus (http://www.perseus.tufts.edu/hopper), TextGrid (http://www.textgrid.de/en), PUC/Cléo (http://www.unicaen.fr/recherche/mrsh/document_numerique/outils), Frantext (http://www.frantext.fr), BVH (http://www.bvh.univ-tours.fr), etc. However, the use of tags that are not included in the BFM customization, or non-compliance with certain particular constraints (such as the technique for tagging parts of words, or the use of strong punctuation within editorial markup elements), may result in lower quality of the TXM corpus (e.g. errors in word counts, errors in collocation analysis, or inconvenient display of texts for reading), or even in a failure of the import process due to the limits of the tokenizer used in this module.

A more generic “XML/w+CSV” module allows importing any XML documents (not necessarily TEI), with the possibility to pre-annotate all or selected words using a <w> tag with an arbitrary set of attributes. This module is more robust in terms of producing a searchable corpus, but it makes no use of the semantics of TEI markup. For instance, no distinction is made between the text and the header, and notes and variant encodings of the same text segment are all included in the text flow.

To improve the quality of the resulting corpus, it is necessary to “translate” the TEI markup into the various data categories relevant for the TXM data model. This model is relatively straightforward and relies to a large extent on that of the CWB CQP search engine (http://cwb.sourceforge.net). We have already presented the relevant data categories in some detail at the 2012 TEI Members Meeting (Heiden & Lavrentiev 2012) but this time we would like to adopt a more pragmatic approach related to the development of the TXM-TEI import modules.

A corpus is composed of a number of “text units” associated with a set of metadata used mainly to split the corpus in different ways and to perform contrastive analyses. A simple TEI file with one <text> element usually corresponds to a TXM text unit, and the useful metadata can be extracted from the <teiHeader> (or, alternatively, from a separate CSV table).
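As an illustration of this mapping, the following is a minimal Python sketch (not TXM's actual import code) that pulls text-unit metadata out of a teiHeader; the document and the fields chosen (title, date) are hypothetical examples of metadata one might project onto a text unit.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Minimal invented TEI document; real headers are of course richer.
DOC = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Example text</title></titleStmt>
      <sourceDesc><p><date when="1225">1225</date></p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>Some content.</p></body></text>
</TEI>
"""

def text_unit_metadata(xml_text):
    """Collect metadata for the TXM 'text unit' from the teiHeader."""
    root = ET.fromstring(xml_text)
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    date = root.find(".//tei:sourceDesc//tei:date", TEI_NS)
    return {
        "title": title.text if title is not None else None,
        "date": date.get("when") if date is not None else None,
    }

print(text_unit_metadata(DOC))  # {'title': 'Example text', 'date': '1225'}
```

The resulting key/value pairs play the same role as a row in the alternative CSV metadata table mentioned above.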

The second basic element of the TXM data model is the “lexical unit” (or the token), which may be a word or a punctuation mark carrying a number of properties (annotations) inherited from the source document (e.g. the language or a variant form) or generated during the import process (e.g. morphosyntactic description or a lemma suggested by an NLP tool). The properties of the lexical units can be easily searched and analyzed using the CQP search engine. TXM can import a corpus with pre-tagged lexical units but in most cases the tokenization is performed during the import process. In the latter case, it is necessary to pay special attention to the tags that may occur “inside” the tokens. These are typically line or page breaks, or some editorial markup (abbreviation marks, supplied letters, etc.). As far as the milestone-like empty elements are concerned, the TEI has recently adopted a general mechanism using the “break” attribute. As for the word-internal elements with textual content, it is recommended to pre-tag the words containing such elements using the <w> element before the import process.
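The effect of the "break" attribute on import-time tokenization can be sketched as follows; this is a deliberately naive Python tokenizer, not the one used by TXM, and the fragment is invented.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical line of a source where a word is split by a line break;
# per the TEI 'break' attribute, break="no" signals that the milestone
# does not separate tokens.
FRAG = '<p>conse<lb break="no"/>quence of the decision</p>'

def tokens(xml_text):
    """Naive import-time tokenizer: concatenate character data, treating
    milestone elements with break="no" as word-internal (no space added)
    and any other element boundary as a word break."""
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    for child in root:
        if child.get("break") != "no":
            parts.append(" ")          # element boundary separates words
        parts.append(child.text or "")
        parts.append(child.tail or "")
    return re.findall(r"\w+", "".join(parts))

print(tokens(FRAG))  # ['consequence', 'of', 'the', 'decision']
```

A tokenizer that ignored the attribute would instead produce the spurious tokens 'conse' and 'quence', which is exactly the kind of word-count error mentioned earlier.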

The third element of the TXM data model is the intermediate structure of the text which can include sentences, paragraphs, divisions or any other continuously or sporadically marked up text segments. They are represented as XML elements, so proper nesting is required. They can be annotated by properties that can be used in a way similar to the text unit metadata. Intermediate structures can be used to separate “text planes” (such as titles vs. text body, direct speech of various characters in a drama, etc.). Although TXM is not designed for managing various readings in critical editions or stages of text evolution, the mechanism of text planes can be used to analyze and compare different text states or variants.
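The idea of annotating lexical units with the intermediate structure ("text plane") they belong to can be sketched as follows; the fragment and the plane names are invented, and this Python code is only a toy stand-in for TXM's import machinery.

```python
import xml.etree.ElementTree as ET

# Invented fragment: a heading and body text; each word is tagged with
# the "plane" it belongs to, analogous to TXM intermediate structures.
FRAG = "<div><head>On prices</head><p>Grain cost rose.</p></div>"

PLANES = {"head": "title", "p": "body"}  # hypothetical plane assignment

def words_with_planes(xml_text):
    """Attach a text-plane property to every word, derived from the
    enclosing intermediate structure."""
    root = ET.fromstring(xml_text)
    out = []
    for el in root:
        plane = PLANES.get(el.tag, "other")
        for w in (el.text or "").split():
            out.append((w, plane))
    return out

print(words_with_planes(FRAG))
# [('On', 'title'), ('prices', 'title'), ('Grain', 'body'), ('cost', 'body'), ('rose.', 'body')]
```

Once such a property is attached to each token, restricting a query to one plane (e.g. body text only, or one speaker's lines) becomes a simple filter.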

In the simplest case, a text can be represented as a chain of lexical units. This point of view is by all means relevant for word counts, collocation search and analysis, etc. If the source document contains editorial notes or variant encodings of the same text segment (using <choice> or <app> mechanisms), it is necessary to treat them in one of the following ways:

- eliminate them completely from the search indexes;

- create a separate “text plane” for them and possibly relocate them to special text units or divisions;

- project variant readings as additional “properties” onto the lexical units of the main text chain.
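The third option can be sketched as follows: a hypothetical <choice> between an abbreviation and its expansion is collapsed onto a single lexical unit, with the expansion as the token form and the abbreviation kept as a token property (a toy Python illustration, not TXM code).

```python
import xml.etree.ElementTree as ET

# Invented example of the third option above: a <choice> between an
# abbreviated and an expanded form is projected onto one token.
FRAG = '<p>the <choice><abbr>cō</abbr><expan>con</expan></choice>tract</p>'

def project_choice(xml_text):
    """Flatten a <choice> into the main text chain, remembering the
    variant reading as a property of the resulting lexical unit."""
    root = ET.fromstring(xml_text)
    parts, abbr = [root.text or ""], None
    for child in root:
        if child.tag == "choice":
            abbr = child.find("abbr").text          # variant reading
            parts.append(child.find("expan").text)  # form kept in the chain
        parts.append(child.tail or "")
    token = "".join(parts).split()[-1]
    return {"form": token, "abbr_source": abbr}

print(project_choice(FRAG))  # {'form': 'contract', 'abbr_source': 'cō'}
```

The same projection idea applies to <app>/<rdg>, with each witness's reading becoming a further property of the token in the main chain.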

The last but not least aspect of the import process is building “editions” of corpus texts for convenient reading and for displaying extended contexts of the search hits. This is where the rich TEI markup and the know-how of producing finely styled outputs may be particularly valuable. The objective is to make it possible to use a set of custom stylesheets (like those developed by Sebastian Rahtz for the TEI Consortium) to render these editions, but this requires some further development to ensure compatibility with TXM’s features of highlighting search hits and displaying properties of the lexical units. An intermediate solution currently being experimented with allows customizing the rendering of selected elements via the CSS class pointing mechanism.

The TXM team is interested in feedback from any TEI project willing to analyze its data with the TXM platform, and is open to discussion on the improvement of the import modules and their documentation.


  • Guillot, C., Heiden, S., Lavrentiev, A., Bertrand, L. (2010). Manuel d’encodage XML-TEI des textes de la Base de Français Médiéval, Lyon, Équipe BFM <http://bfm.ens-lyon.fr/article.php3?id_article=158>.
  • Heiden, S. (2010). “The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme.” 24th Pacific Asia Conference on Language, Information and Computation. Éd. Kiyoshi Ishikawa Ryo Otoguro. Institute for Digital Enhancement of Cognitive Development, Waseda University, 2010. 389-398. <http://halshs.archives-ouvertes.fr/halshs-00549764>.
  • Heiden, S. & Lavrentiev, A. (2012). “Constructing Analytic Data Categories for Corpus Analysis from TEI encoded sources.” TEI Conference 2012. College Station, TX, 7-10 November 2012. <http://idhmc.tamu.edu/teiconference/program/papers>.

“Texte” versus “Document”. Sur le platonisme dans les humanités numériques et sur la maïeutique TEI des textes (“Text” versus “Document”. Platonism in DH and the maieutics of the text)

In my paper I would like to share the reflections that have forced themselves upon me as a philosopher who directs a site of twentieth-century philosophical archives and who practises the TEI in connection with those archives.

Thanks to its semantic dimension, the TEI occupies, in more than one respect, a privileged place in the DH landscape. TEI encoding is an example of human/machine cooperation that is not limited to technological utility (such as heritage preservation, the rationalization of processing and of access to very large corpora, or the simplification of “à la carte” publication). TEI encoding is also creative, and it opens the way to content that is new and was previously unsuspected: this is the case, for example, with the various visualizations of content, or with stylometric and scientometric analyses. And finally, the TEI also reveals certain truths about the nature of the objects of research in the humanities and social sciences.

A tension is palpable within the TEI. Broadly speaking, it is a tension between the linear encoding of a succession of linguistic units fixed on a “transparent” medium, on the one hand, and genetic encoding, which aims to make intelligible the original temporality of the content’s production, on the other.

The TEI was created for the former, which is why “text” figures programmatically in its very name. For years, however, research on encoding writers’ working manuscripts (Flaubert, Proust) has been under way, with varying fortunes. For this genetic TEI approach, the “document” takes on more and more importance. In my work as a researcher I am interested in both tendencies (for different reasons).

What is the ontic status of the text? What kind of object is a text, and in what way does it exist? The text corresponds above all to a perceptible “surface”. Even the high priest of postmodernity, Barthes (in the Encyclopaedia Universalis), concedes as much. Barthes attributes to the text above all a safeguarding function: “on the one hand, the stability and permanence of the inscription, intended to correct the fragility and imprecision of memory; and on the other, the legality of the letter, the irrefutable and, one believes, indelible trace of the meaning that the author of the work intentionally deposited in it”. This safeguarding function is fundamentally tied to the material medium and its properties. Can the “perceptible surface” of the text, in this safeguarding function, be limited to combinations of letters alone? P. Caton (we shall return to him later) shows that it certainly cannot.

The question of the legitimacy of the “text” in circulation with respect to the work of its creator is as old as indirect communication. In the case of writing, it focuses on the author’s intention and fears the malicious editor.

Since the invention of printing, the evolution has moved towards abstraction, towards the suppression of the contextual content tied to the materiality of the text. Printing imposed the reign of the established text (the scholarly print edition), de facto independent of its original material support, namely the author’s manuscript. With the popularization of the book and the imperative of lowering its price, we have progressively witnessed the “slimming down” of the document, the reduction to a strict minimum of the information carried by the original document. We are witnessing the triumph of the “pure” text in the minimalism of paperback editions, and even more so on tablets. One may also observe that, in the strict sense of the term, the creators who for centuries generally left us manuscripts, and then typescripts, today produce electronic files. In this sense one can say that, for the first time in the history of humanity, they produce “texts”.

But as our digital era advances, and given that semantic TEI encoding aims to represent sources advantageously on the Web, the question of legitimacy and fidelity to the sources arises anew, and more acutely. It is exacerbated by the fabulous increase in the quantity of online archives and by the catch-as-catch-can omnipresent on the Web. For us, this is one of the reasons to move towards the TEI-encoded document, which could become the guarantor of the legitimacy of sources in the humanities and social sciences. The TEI could become for source files what Das wohltemperierte Klavier was for the piano.

In this virtual universe, ontological questions take centre stage. As we have already said, TEI encoding reveals certain fundamental truths about the relations of researchers with their sources. Our practice of the TEI in connection with archives shows that the analysis of the encoding situation can be regarded as an analysis of the intentionalities at work in every possible reading.

The guiding idea of e-archives is to replace the physical consultation of archives with their consultation online. This has numerous advantages, which justify the high cost of the enterprise. Ideally, an online reader should be able to access all the information and all the content of the original archives, not only in a more convenient way but also enriched by the expertise of the site that publishes them. Where the source document is represented by an XML/TEI file, the virtual object consulted will inevitably be constructed by the electronic edition. The TEI can turn this inevitable transformation into an enrichment. But should we impose as a norm here that the virtual object thus consulted be a “text” in the traditional sense of the term?

Etymologically, the word “text” comes from the Latin “textilis”, and “textilis” in turn from “textus”, the past participle of texere, to weave. The word “text” thus derives from the activity of weaving, from action. But unlike tapestry, the author is not the only one who weaves: the reader weaves too. How can both have to do with the same “object”? This great question lies at the origin of the ontological and epistemological theories of the text and of the identity/permanence of its meaning.

A long hermeneutic tradition occupied itself with explaining the intentions of the creators of works, with making explicit the meaning of the work in its absolute identity (even postmodern deconstruction is a stage in this endeavour). The radical novelty of TEI encoding in this context consists in the rediscovery of the document itself and in the consideration, first and foremost, of the reader’s intentionality. The encoder, who is in immediate contact with the document (or at least with its facsimile), is the midwife of the text, or of the possible text(s), latent in the document. The encoder must interrogate the document, starting from the materiality of the source, to bring it to externalize its text. The encoder practises the maieutics of the work, and the work of encoding reveals the essentially psychophysical nature of the document and the basic importance of the perception of its material appearance.

The true specificity of the perception, in the attitude of reading, of those particular objects that documents are is taken into account very little in hermeneutic theories and in traditional theories of the text.

It was not until the beginning of the twentieth century that a theory of the actual processes of writing and reading saw the light of day: the theory of Actions and Products (APT) of Kazimierz Twardowski. In philosophy, language is traditionally subordinated to the expression of concepts. Only the Stoics sensed the potential of language to be an object sui generis. Language is considered a direct reflection of thought, and writing is considered the representation of language. The second half of the nineteenth century finally saw the birth of the theory of intentionality (Franz Brentano, 1838-1917), which, for the first time, and on the basis of a descriptive psychology of consciousness, built bridges between thought and language. His pupil Kazimierz Twardowski (1866-1938) would affirm that language does not only say something but also says it about something, and that even impossible expressions have an object (for example, “the round square”). The ontological foundations of modern semantics were thus laid. Faced with the logicist danger this theory carries (cf. his pupils Łukasiewicz and Leśniewski), Twardowski presented in 1911 the theory of actions and products, an interdisciplinary theory at the borders of grammar, psychology and logic.

According to this theory, the human being is an author/creator who produces objects through his actions. While thinking, a person may decide to fix a thought in writing: he then constructs, in a concrete language, sentences (propositions). From the ontic point of view, his thoughts, as concrete psychophysical processes, are not identical with their result fixed in concrete writing. Once the proposition has been set down on paper, he becomes the first reader of his own thoughts, and it is first of all as a reader that he corrects them. The product of his cognition is always a psychophysical product, except when, forgotten by all and latent, it waits to be read; in the meantime it is only potentially a piece of writing. The traces of ink on paper exist as long as they last: they are perishable but ontically autonomous. Writing, by contrast, is not ontologically autonomous: it needs to be revived in order to become what it is, namely a product of thought.

The particularity of perceiving writing on a material object (papyrus, parchment, paper, etc.) can best be grasped through its resemblance to, and dis-analogy with, a possible perception of the same material object as an object of plastic art, that is, in the aesthetic attitude. Let us imagine the Rosetta Stone as a decorative object covered with repetitive ornaments. The letters and words (whether recognized as such or not) would form part of the perception as elements of a holistic perception of the object. The material traces corresponding to the writing would there be considered, first of all, like the other material traces: as functional elements in the construction of the object of aesthetic experience. They would form part of the aspects through which the aesthetic object and its values present themselves, first of all sensorially, to the one who perceives it.

In perception in the attitude of reading, by contrast, the perceptual aspects bear above all on the possible meaning of the writing. Here too, however, the action begins with a material impulse. The text is a psychophysical object, and a theory of aspects, that is, of the sensory and perceptual items involved in the process of the construction of the text, would be useful.

Does this mean that, after the great age of the hermeneutics that started from the idea of the meaning of a text conceived through the intentions of its author, a Copernican revolution will take place thanks to the TEI, and the encoder will be recognized as having constitutive power over texts? To answer this question, let us retrace the principal ontological conceptions of the text. We shall distinguish three types of ontological concepts of the “text”: the Platonist conception (A), the positivist conception (B) and finally the semantic conception (C).

(A) In the SEP (Stanford Encyclopedia of Philosophy) we read that the term “platonist” means, in a contemporary sense, “that there exist such things as abstract objects – where an abstract object is an object that does not exist in space or time and which is therefore entirely non-physical and non-mental”.

This idea goes back to Plato and to his metaphor of the cave: we see only the shadows of true reality, which, like the sun in Plato’s metaphor, stands behind our backs. This theory has known more or less radical versions, and the contemporary version set out in the SEP is very moderate. Intuitively, it is easy to understand through mathematical idealities. Numbers do not exist only in concrete acts of counting. We recognize without difficulty the truth of the sentence “It is true that numbers exist”. Things become complicated if we ask the question: did mathematicians create the numbers? The Platonist position consists in saying no: numbers exist independently of human beings, a priori and beyond any concrete calculation; they were not created by human beings but were, at most, in a certain way discovered.

In the case of the text, as in that of the work of art, this ontological position is more nuanced, insofar as the human power to create is more readily acknowledged here. But once the work has been created, it joins the aprioristic realm of identical and enduring beings.

The Platonist conception of the text is omnipresent in DH. This can be seen very clearly in the example of the DH ontology proposed by Renear & Dubin.

Renear & Dubin partent dans leurs considérations ontologiques de la typologie FRBR (Functional Requirements for Bibliographic Records/ Spécifications Fonctionnelles de Notices Bibliographiques) de l’IFLA concernant les entités possibles à cataloguer par les bibliothécaires. Dans le premier groupe FRBR on distingue quatre unités: oeuvre (Q), expression (par ex. la traduction de Q par XY), manifestation (une édition de cette traduction chez un éditeur Z) et finalement un item (l’exemplaire que j’ai dans ma bibliothèque). Renear & Dubin démontrent à l’aide du concept de la “propriété rigide”, que trois de ces unités ne sont pas des “types” mais uniquement les “rôles” de la première9. Même si leur raisonnement est rigoureux et que leurs investigations contiennent énormément d’observations justes, on est obligé de constater que le cadre général de leur raisonnement, à savoir l’affirmation qu’uniquement l’”oeuvre” est un type ontologique, est l’expression d’un pur platonisme. Car, deux des quatre unités du premier groupe FRBR, à savoir l’oeuvre et l’expression sont parfaitement abstraites: aucune expérience immédiate psychophysique n’est ici possible. On peut montrer10, qu’elles sont des constructions conceptuelles postérieures à toute expérience effectivement possible. Elles sont des constructions conceptuelles utiles pour des besoins de classification (théories ou catalogues) mais elle ne sont pas des moments des expériences possibles. On peux les rencontrer en tant que concepts par le biais de leurs définitions ou par l’abstraction à partir d’une classe de leurs représentants (manifestations et items).

Le seul “type” du premier groupe reconnu par Renear & Dubin “comme type” – est en fait une abstraction!

(B) La conception positiviste/linguistique du texte part de la conception du texte en tant qu’unité linguistique. Elle a donc l’allure plus concrète car elle réfère à une connexion unitaire des sens linguistiques structurés. C’est en ce sens que le texte est présent. Dans l’intitulé même du projet TEI: “Representation of Texts in Digital Form” et plus précisément “Encoding Methods for Machine-readable Texts, Chiefly in the Humanities, Social Sciences and Linguistics“.

Dans les TEI Guidelines “text” est un élément du module: “textstructure” avec pour définition: “text contains a single text of any kind, whether unitary or composite, for example a poem or drama, a collection of essays, a novel, a dictionary, or a corpus sample11.

In concrete encoding practice, “text” comes after the “teiHeader” and contains the intelligible content of the document to be encoded. The “text” does not contain any “metamark” concerning the document itself (“contains or describes any kind of graphic or written signal within a document the function of which is to determine how it should be read rather than forming part of the actual content of the document”).
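Schematically, this canonical division can be pictured as follows (a minimal, purely illustrative TEI document; the titles and content are invented):

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Example document</title></titleStmt>
      <publicationStmt><p>Unpublished sketch.</p></publicationStmt>
      <sourceDesc><p>Transcribed from a hypothetical manuscript.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <!-- "text" holds the intelligible content; information about the
       physical document stays in the header -->
  <text>
    <body>
      <p>The intelligible content of the document goes here.</p>
    </body>
  </text>
</TEI>
```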

This concept of “text” with which the TEI operates is in fact borrowed both from linguistics and from philosophy. A text is a structured unit of linguistic meanings. Ideally it is a linear sequence of propositions (even bizarre or “incorrect” expressions and other artistic inventions can be understood in this propositional way) considered as a succession of graphic signs. Traditionally, and even in its postmodern understanding (the one that includes intertextualities, the disappearance of the text in its deconstruction, or the placing of author and interpreter on an equal footing), this concept refers only very little to the material properties of what is the true original support of the text, namely the document. In fact, one takes an interest in the document only if it is damaged and the fragments of writing are difficult to read, or if its material properties make it possible to date the writing or to establish the authenticity of the text and can enrich the “teiHeader”. The text here always transcends the token that is its support. Despite its positivist approach, the text ultimately possesses a unitary a priori existence in an elsewhere to which the intentionality of its creator has sent it.

This understanding therefore ultimately also presupposes a Platonist idea of the text, for one is not genuinely interested here in the sequences of graphic signs as fragments of the document, but in that of which they are the linguistic representation and whose existence is presupposed a priori: a linguistic expression of a theory, of a narrative, etc.

Paul Caton shows well the limits of this understanding of the text. Drawing on the analysis of documents, he shows the importance of the “internal” context of the document for understanding the text. Through “extreme cases” (logos, encrypted messages, writing on posters) he demonstrates the importance of the function that the sequence of linguistic signs fulfils in/on each document. He puts communication in the foreground: sequences of linguistic signs are not merely a written representation of language; above all, they communicate. And for this communication, the document sometimes provides indispensable information. Paul Caton concludes that the sharp distinction between text and context within the document is an artefact.

This understanding of the “text” becomes clearly problematic in the case of the encoding of authors’ manuscripts. Elena Pierazzo’s recent work on genetic criticism shows the importance of the document for the reconstruction of the content it is meant to transmit. Notwithstanding the difficulties that genetic criticism poses at the level of display, it reveals the importance of the document and the complexity of the identity of the text in/on the document.


(C) The semantic conception of the text. Our own encoding experience has shown that, in order to capture the content of a philosophical manuscript and to transmit it for study to a future researcher, it is not enough to follow a classification that obeys the linguistic structuring and the positioning of the graphic signs on a support. Often during encoding (and we shall end by showing some examples of this) one must respect the objective requirements arising from the thought in evolution. We therefore relied on the properties of the document as much as on our knowledge of Twardowski’s APT theory.

The document we encode contains a French version of the APT theory produced by Twardowski himself. This document is not a translation properly speaking: here Twardowski thinks his theory in French. He had previously formulated it in Polish and in German.

Starting from the characteristics of the document, we were able to establish that there were two texts corresponding to two writing campaigns. In our procedure the language guided us, but it did not decide in fine on the encoding12. This principle leads us towards the semantic conception of the text.
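In TEI terms, such writing campaigns can be declared in the header and referenced from the transcription. A minimal sketch, with invented identifiers and wording (the actual e-LV encoding may differ):

```xml
<!-- In the teiHeader: declaration of the two campaigns -->
<profileDesc>
  <creation>
    <listChange ordered="true">
      <change xml:id="campaign1">First writing campaign (base layer)</change>
      <change xml:id="campaign2">Second writing campaign (revisions)</change>
    </listChange>
  </creation>
</profileDesc>

<!-- In the transcription: each stretch of writing points to its campaign -->
<line change="#campaign1">a sentence from the first campaign</line>
<add change="#campaign2">a later correction</add>
```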

On the third, semantic, path to understanding the concept of text, the point of departure is always given by the encoding situation and by the reading intentionalities that the encoder detects in the document. This is very close to the effective reality of the reading situation and of the encounter (in the reading attitude) with the material object consulted. This process does not presuppose the prior existence of ONE text to be reconstituted, and it is therefore more open than the two preceding conceptions.

The question thus arises again: thanks to such genetic TEI encoding of manuscripts, is it no longer texts given a priori that determine the encoding, but the encoding that, as it progresses, yields the text? In other words: does this third path give encoders the power to constitute texts freely in their effective encoding activity?

It is true that the encoder is not working here at the re-constitution of a pre-existing text. His or her freedom is nevertheless limited by the determinations coming from the document and from the theoretical object that its creator fixed in the document. The text, or the texts, arrive in the course of the encoding: they do not exist beforehand, but neither are they constituted in an arbitrary way: “The semantic tradition consists of those who believed in the a priori but not in the constitutive powers of mind”13.


The TEI encoder is in a way a hyper-reader. His or her intentionality is that of any possible reader of a document, not that of its creator. In order to construct the text, the encoder’s task is to render as well as possible the contents communicated by the document and to let himself or herself be guided by its object. The text is a secondary product of encoding; it is an a priori that is not anterior to the work of encoding. Should we not then, in the order of the XML/TEI document tree, replace the “text” with the “document” and reintroduce the “text” further down the branching? This is the question.
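The genetic module of TEI P5 in fact already points in this direction: a transcription may be rooted in a sourceDoc element rather than in text, so that the document, not an a priori text, forms the first branch of the tree. A minimal, purely illustrative sketch (folio numbers and wording are invented):

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- metadata as usual -->
  </teiHeader>
  <!-- sourceDoc replaces text: the tree starts from the document -->
  <sourceDoc>
    <surface n="1r">
      <zone type="main">
        <line>first line of writing as it stands on the page</line>
        <line>second line, with a <del rend="overstrike">cancelled</del> word</line>
      </zone>
    </surface>
  </sourceDoc>
</TEI>
```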


  • Burnard, Lou. Text Encoding for Interchange: A New Consortium. 2000. [http://www.ariadne.ac.uk/issue24/tei].
  • Caton, Paul, “On the term ‘text’ in digital humanities”, Literary and Linguistic Computing, vol. 28, no. 2, 2013, p. 209-220.
  • Crasson, Aurèle and Jean-Daniel Fekete. Structuration des manuscrits: Du corpus à la région. Proceedings of CIFED 2004. La Rochelle (France), 2004: 162–168. [http://www.lri.fr/~fekete/ps/CrassonFeketeCifed04-final.pdf].
  • J.A. Coffa, The Semantic Tradition from Kant to Carnap, Cambridge University Press, 1991.
  • R. Ingarden, The Cognition of the Literary Work of Art, Illinois: Northwestern University Press, 1973.
  • W. Miskiewicz, “La critique du psychologisme et la métaphysique retrouvée – Sur les idées philosophiques du jeune Łukasiewicz”, Philosophia Scientiae 15/2, – La syllogistique de Łukasiewicz, 2011, p. 21-55.
  • W. Miskiewicz, “Les aspects – Interface entre l’homme et l’œuvre d’art”, Roman Ingarden: La phénoménologie à la croisée des arts, ed. P. Limido-Heulot, Presses Universitaires de Rennes, AEsthetica, Rennes, 2013.
  • W. Miskiewicz, “Archives philosophique multilingues à l’époque du numérique: Le projet Archives e-LV”. In: Patrice Bourdelais, Institut des sciences humaines et sociales CNRS, dir. la lettre de l’INSHS, tome 18. – La tribune d’ADONIS. – Paris: INSHS, 2012. – p. 18-20.
  • W. Miskiewicz, “Quand les technologies du Web contournent la barrière linguistique: Archives e-LV.”, Synergies Revues, vol. 1, n° 1. – Synergies Pologne n°spécial 2, 2011, p. 81-91. – ISSN: 1734-4387.
  • E. Pierazzo, ‘Digital genetic editions: the encoding of time in manuscript transcription’. Text Editing, Print and the Digital World, Digital Research in the Arts and Humanities. M. Deegan and K. Sutherland (eds.), Ashgate: Aldershot, 2008, pp. 169–186.
  • E. Pierazzo, P. A. Stokes. ‘Putting the text back into context: a codicological approach to manuscript transcription’. Kodikologie und Paläographie im Digitalen Zeitalter 2 – Codicology and Palaeography in the Digital Age 2. M. Rehbein, T. Schaßan, P. Sahle (eds.) Norderstedt: Books on Demand, 2011, pp. 397-424.
  • E. Pierazzo and M. Rehbein, Documents and Genetic Criticism TEI Style. TEI Consortium, 2010. [http://www.tei-c.org/SIG/Manuscripts/genetic.html].
  • F. Rastier, Arts et sciences du texte. Paris: Presses Universitaires de France, 2001.
  • A.H. Renear & D. Dubin, “Three of the four FRBR group 1 entity types are roles, not types”, in Grove, A. (ed), Proceedings of the 70th Annual Meeting of the American Society for Information Science and Technology (ASIST), Milwaukee, WI, 2007.
  • Twardowski, Kazimierz, “Actions and Products. Comments on the Border Area of Psychology, Grammar and Logic”, in J. Pelc, Semiotics in Poland 1894-1969, Dordrecht: Reidel, 1979, p. 13-27.
  • TEI: Text Encoding Initiative. TEI Consortium, 2010. [http://www.tei-c.org]. Manuscript Description: [http://www.tei-c.org/release/doc/tei-p5-doc/fr/html/MS.html].
  • Jean-Pierre Balpe, «ÉCRITURE», Encyclopædia Universalis [online], accessed 30 March 2013. URL: http://www.universalis.fr/encyclopedie/ecriture/
  • Roland Barthes, «THÉORIE DU TEXTE», Encyclopædia Universalis [online], accessed 30 March 2013. URL: http://www.universalis.fr/encyclopedie/theorie-du-texte/
  • Fonctions et Produits in the e-LV editions: online publication of the Polish, German and French TEI-encoded versions of Twardowski’s manuscripts. http://www.elv-akt.net/ressources/editions.php

Modelling frequency data: methodological considerations on the relationship between dictionaries and corpora

The research questions addressed in our paper stem from a bundle of linguistically focused projects which, among other activities, also create glossaries and dictionaries intended to be usable both for human readers and for particular NLP applications. The paper will comprise two parts: in the first section, the authors will give a concise overview of the projects and their goals. The second part will concentrate on encoding issues involved in the related dictionary production. Particular focus will be put on the modelling of an encoding scheme for statistical information on lexicographic data gleaned from digital corpora.

The projects mentioned are tightly interlinked; all are joint endeavours of the Austrian Academy of Sciences and the University of Vienna, and all conduct research in the field of variational Arabic linguistics. The first project, the Vienna Corpus of Arabic Varieties (VICAV), was started two years ago on a low-budget scheme and was intended as an attempt at setting up a comprehensive research environment for scholars pursuing comparative interests in the study of Arabic dialects. The evolving VICAV platform aims at pooling linguistic research data and various language resources such as language profiles, dictionaries, glossaries, corpora, bibliographies etc. The second project goes by the name of Linguistic Dynamics in the Greater Tunis Area: A Corpus-based Approach. This three-year project, financed by the Austrian Science Fund, aims at the creation of a corpus of spoken youth language and the compilation of a diachronic dictionary of Tunisian Arabic. The third project, which has grown out of a master’s thesis, deals with the lexicographic analysis of the Wikipedia in Egyptian vernacular Arabic. In all these projects, digital data production relies on the Guidelines of the TEI (P5), both for the corpora and for the dictionaries. The dictionaries compiled in the framework of these projects are to serve research as well as didactic purposes.

Using the TEI dictionary module to encode digitized print dictionaries has become a fairly common standard procedure in digital humanities. Our paper will not resume the TEI vs. LMF vs. LexML vs. Lift vs. … discussion (cf. Budin et al. 2012) and assumes that the TEI dictionary module is sufficiently well-developed to cope with all requirements needed for the purposes of our projects. The basic schema used has been tested in several projects for various languages so far and will furnish the foundation for the intended customisations.

Lexicostatistical data and methods are used in many fields of modern linguistics; lexicography is only one of them. Modern dictionary production relies on corpora, and statistics, beyond any doubt, play an important role in lexicographers’ decisions when selecting lemmas to be included in dictionaries, when selecting senses to be incorporated into dictionary entries, and so forth. However, lexicostatistical data is not only of interest to the lexicographer; it might also be useful to the users of lexicographic resources, in particular digital lexicographic resources. The question of how to make such information available takes us to the issue of how to encode it.

Reflecting on the dictionary-corpus interface and on the issue of how to bind corpus-based statistical data into the lexicographic workflow, two prototypical approaches are conceivable: either statistical information is statically embedded in the dictionary entries, or the dictionary provides links to services capable of delivering the required data. One group working on methodologies of the second type is the Federated Content Search (FCS) working group, an initiative of the CLARIN infrastructure which strives towards enhanced search capabilities in locally distributed data stores (Stehouwer et al. 2012). FCS is aimed at heterogeneous data; dictionaries are only one type of language resource to be taken into consideration. In view of increasingly dynamic digital environments, the second approach appears more appealing. In practice, however, the digital workbench will remain in need of methods to store frequencies obtained from corpus queries, as human intervention will not be superfluous any time soon: resolving polysemy and grouping instances into senses remain tasks that cannot be achieved automatically.

Which parts of a dictionary entry can be considered relevant? What is needed is a system to register quantifications of particular items represented in dictionary entries. The first things that come to mind are of course headwords, or lemmata. However, other constituents of dictionary entries might also be furnished with frequency data: inflected word forms, collocations, multi-word units and particular senses are relevant items in this respect.

The encoding system should not only provide elements to encode these, but should also allow encoders to indicate the source from which the data were gleaned and how the statistical information was created. Ideally, persistent identifiers should be used to identify not only the corpora but also the services involved in creating the statistical data.

We basically see three options for approaching the encoding problem as such: (a) to make use of TEI elements with very stretchable semantics such as <note>, <ab> or <seg> and to provide them with @type attributes; (b) to make use of TEI feature structures; or (c) to develop a new customisation. We will discuss why we have discarded the first option, present a provisional solution on the basis of feature structures, and discuss the pros and cons of this approach. As is well known, feature structures are a versatile, sufficiently well-explored tool for formalising all kinds of linguistic phenomena. One of the advantages of the <fs> element is that it can be placed inside most elements used to encode dictionaries.

<entry xml:id="mashcal_001">
  <form type="lemma">
    <orth xml:lang="ar-arz-x-cairo-vicavTrans">mašʕal</orth>
    <orth xml:lang="ar-arz-x-cairo-arabic">مشعل</orth>
    <fs type="corpFreq">
      <f name="corpus" fVal="#wikiMasri"/>
      <f name="frequency"><numeric value="6"/></f>
    </fs>
  </form>
  <gramGrp>
    <gram type="pos">noun</gram>
    <gram type="root" xml:lang="ar-arz-x-cairo-vicavTrans">…</gram>
  </gramGrp>
  <form type="inflected" ana="#n_pl">
    <orth xml:lang="ar-arz-x-cairo-vicavTrans">mašāʕil</orth>
    <orth xml:lang="ar-arz-x-cairo-arabic">مشاعل</orth>
    <fs type="corpFreq">
      <f name="corpus" fVal="#wikiMasri"/>
      <f name="frequency"><numeric value="2"/></f>
    </fs>
  </form>
</entry>

The paper will conclude with first considerations concerning a more encompassing ODD-based solution. We hope this work could lead to the introduction of a comprehensive set of descriptive objects (attributes and elements) to describe frequencies in context, encompassing: reference corpus, size of reference corpus, extracted corpus, size of extracted corpus and various associated scores (standard deviation, t-score, etc.).
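One conceivable shape for such a customisation, sketched here purely as an illustration (the element name and attributes are invented and are not part of any published proposal), would be an ODD specification along these lines:

```xml
<!-- hypothetical ODD fragment: a dedicated frequency element -->
<elementSpec ident="corpFreq" mode="add" ns="http://example.org/ns/vicav">
  <desc>Frequency of the parent item in a reference corpus.</desc>
  <attList>
    <attDef ident="corpus" usage="req">
      <desc>Pointer to the reference corpus (ideally a persistent identifier).</desc>
    </attDef>
    <attDef ident="corpusSize">
      <desc>Size of the reference corpus in tokens.</desc>
    </attDef>
    <attDef ident="score">
      <desc>An associated score, e.g. a t-score or standard deviation.</desc>
    </attDef>
  </attList>
</elementSpec>
```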

Selected references

  1. Banski, Piotr, and Beata Wójtowicz. 2009. FreeDict: an Open Source repository of TEI-encoded bilingual dictionaries. In TEI-MM, Ann Arbor. (http://www.tei-c.org/Vault/MembersMeetings/2009/files/Banski+Wojtowicz-TEIMM-presentation.pdf)
  2. Bel, Nuria, Nicoletta Calzolari, and Monica Monachini (eds). 1995. Common Specifications and notation for lexicon encoding and preliminary proposal for the tagsets. MULTEXT Deliverable D1.6.1B. Pisa.
  3. Budin, Gerhard, Stefan Majewski, and Karlheinz Mörth. 2012. Creating Lexical Resources in TEI P5. In jTEI 3.
  4. Hass, Ulrike (ed). 2005. Grundfragen der elektronischen Lexikographie: Elexiko, das Online-Informationssystem zum deutschen Wortschatz. Berlin; New York: W. de Gruyter.
  5. Romary, Laurent, Susanne Salmon-Alt, and Gil Francopoulo. 2004. Standards going concrete: from LMF to Morphalou. In Workshop on enhancing and using electronic dictionaries. Coling 2004, Geneva.
  6. Romary, Laurent, and Werner Wegstein. 2012. Consistent Modeling of Heterogeneous Lexical Structures. In jTEI 3.
  7. Sperberg-McQueen, C.M., Lou Burnard, and Syd Bauman (eds). 2010. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Oxford, Providence, Charlotteville, Nancy. (http://www.tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf)
  8. Stehouwer, Herman, Matej Durco, Eric Auer, and Daan Broeder. 2012. Federated Search: Towards a Common Search Infrastructure. In: Calzolari, Nicoletta; Choukri, Khalid; Declerck, Thierry; Mariani, Joseph (eds.), Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC 2012). Istanbul.
  9. Wegstein, Werner, Mirjam Blümm, Dietmar Seipel, and Christian Schneiker. 2009. Digitalisierung von Primärquellen für die TextGrid-Umgebung: Modellfall Campe-Wörterbuch. (http://www.textgrid.de/fileadmin/TextGrid/reports/TextGrid_R4_1.pdf)

A Saussurean approach to graphemes declaration in charDecl for manuscripts encoding

The current TEI approach to the issue of grapheme encoding consists in recommending the use of the Unicode standard. This is sufficient, on the practical side, when we encode printed documents based on post-Gutenberg writing systems, whose set of graphic signs (graphemes, diacritics, punctuation etc.) can be considered standard and implicitly assumed as known.

However, each historical textual document, such as a medieval manuscript or an ancient inscription, features a specific writing system, different from the standard that emerged after the invention of print.

This implies that the TEI ‘Unicode-compliance’ principle is not sufficient to define graphemes in pre-print writing systems. Let us assume that manuscript A has two distinct graphemes ‘u’ and ‘v’, while manuscript B has only one ‘u’ grapheme. If we identified both the ‘u’ of the first manuscript and the ‘u’ of the second manuscript with the same Unicode codepoint (U+0075), our encoding would imply that they are the same grapheme, while they are not. Each of them, instead, is defined contrastively by the net of relations in the context of its own writing system, and the net of contrastive relations of manuscript A is different from that of manuscript B, as the latter does not have a ‘u/v’ distinction. This is even more evident with other graphic signs such as punctuation, whose expression (shape) and content (value) have varied enormously through time.

This is why Tito Orlandi (2010) suggests formally declaring and defining, for each document edited (e.g. a manuscript), each graphic sign that the encoder decides to distinguish, identify and encode in his or her digital edition. The natural place for this description seems to be the charDecl element within the TEI header.

However, a specific technical issue arises, which I shall discuss in this paper: the TEI gaiji module only allows for a description of ‘non-standard characters’, i.e. graphemes and other signs not included in Unicode.

To my knowledge, there is currently no formal way in TEI to declare the specific set of ‘standard’ Unicode characters used in a digital edition and to define the specific value of the corresponding graphemes in the ancient document’s writing system.

This is due to the current TEI general approach to the encoding of ‘characters’. The TEI Guidelines currently suggest that encoders define as few ‘characters’ as possible, while I am suggesting that they should declare and define all encoded signs.

Possible solutions to this specific issue will be examined in this paper. I shall discuss possible changes to the TEI schema to allow for Unicode characters to be re-defined in the specific context of TEI transcriptions of ancient textual sources. Finally, I shall suggest how this might change the general approach towards the issue of graphemes encoding in the TEI Guidelines. I think that, at least in the case of the encoding of ancient documents, it should be recommended that all graphic signs identified, and not only ‘non-standard’ ones, be formally declared and defined.

To be more specific, the glyph element in the charDecl currently allows the encoder to freely define as many glyphs (i.e. allographs) as desired. It is not required, however, to give a complete list of the allographs of a manuscript. The g elements pointing to glyph definitions are meant to annotate individual tokens/instances of a given character (i.e. grapheme) in the body of the transcription, but it is not possible to annotate, i.e. to describe, that character/grapheme as a type in the charDecl if it is encoded by means of an existing Unicode codepoint (like the very common ‘u’, U+0075).

The Guidelines currently recommend, instead, defining characters/graphemes in the charDecl section of the TEI header by means of char elements only if they are not already present in the Unicode character set. The encoder cannot re-define or annotate the specific value of that character in a manuscript’s graphical system if that character exists in Unicode.

This is not only a matter of documentation, i.e. of the Guidelines’ current policy on character and glyph description. Let us imagine that an encoder decided to follow the approach suggested by Orlandi and to prepend to the transcription of a manuscript a complete and formal list of all graphemes and/or allographs identified in the manuscript, by means of char and/or glyph elements respectively. This would imply overriding even the most common Unicode characters, such as ‘a’, ‘b’ and ‘c’, thus overhauling the approach suggested by the Guidelines, but it would still be theoretically feasible on the basis of the current gaiji module. However, encoders who decided to define every character or glyph in the charDecl section would then be required to encode each single grapheme or allograph in the body of the transcription by means of a g element (or by means of an XML entity expanding to that element).

In the model that I am advocating, if the editor is providing a transcription of a pre-Gutenbergian primary source, the Guidelines should recommend formally listing and briefly describing in charDecl all characters and glyphs (i.e. graphemes and allographs) identified. The gaiji module should also provide a mechanism by which, for example:

  • The encoder can decide to encode the ‘u/v’ grapheme of a manuscript simply by means of Unicode character U+0075 (‘u’);
  • He or she must give a brief formal definition of the value that the grapheme encoded with Unicode codepoint U+0075 has in the encoded manuscript (e.g. as not distinct from ‘v’) by means of the char element in charDecl;
  • In the body of the transcription, they can simply transcribe that grapheme by means of Unicode character U+0075 (one keystroke).
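Under the model advocated here, the declaration might look as follows. Note that this sketch uses the TEI char element to (re)define a standard Unicode codepoint, which the current Guidelines do not provide for; it is therefore a hypothetical illustration, not valid current practice:

```xml
<encodingDesc>
  <charDecl>
    <!-- hypothetical redefinition of U+0075 for this manuscript's writing system -->
    <char xml:id="ms-u">
      <charName>LATIN SMALL LETTER U (value in manuscript B)</charName>
      <note>In this writing system the grapheme is not distinct from 'v';
        it covers both vocalic and consonantal values.</note>
      <mapping type="standard">u</mapping>
    </char>
  </charDecl>
</encodingDesc>
```

In the transcription itself, the encoder would then simply type ‘u’ (one keystroke), with the declaration above fixing its value for this document.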


  • Baroni A. (2009). La grafematica: teorie, problemi e applicazioni, Master’s thesis, Università di Padova. <http://unipd.academia.edu/AntonioBaroni/Papers/455456/La_grafematica_teorie_problemi_e_applicazioni>. [last retrieved 10.03.2013].
  • Mordenti R. (2001). Informatica e critica dei testi, Bulzoni.
  • Mordenti R. (2011). Paradosis. A proposito del testo informatico, Accademia Nazionale dei Lincei.
  • Monella P. (2012). In the Tower of Babel: modelling primary sources of multi-testimonial textual transmissions, a talk delivered at the London Digital Classicist Seminars 2012, Institute of Classical Studies, London, on 20.07.2012. <http://www.digitalclassicist.org/wip/wip2012.html>. [last retrieved 17.03.2013].
  • Orlandi T. (1999). Ripartiamo dai diasistemi, in I nuovi orizzonti della filologia. Ecdotica, critica testuale, editoria scientifica e mezzi informatici elettronici, Conv. Int. 27-29 maggio 1998, Accademia Nazionale dei Lincei, pp. 87-101.
  • Orlandi T. (2010). Informatica testuale. Teoria e prassi, Laterza.
  • Perri A. (2009). Al di là della tecnologia, la scrittura. Il caso Unicode. «Annali dell’Università degli Studi Suor Orsola Benincasa» 2, pp. 725-748.
  • Sampson G. (1990). Writing Systems: A Linguistic Introduction, Stanford University Press.
  • Wittern C. (2006). Writing Systems and Character Representation, in L. Burnard, K. O’Brien O’Keeffe, J. Unsworth, edd., Electronic Textual Editing, Modern Language Association of America.

Texts and Documents: new challenges for TEI interchange and the possibilities for participatory archives


The introduction in 2011 of additional “document-focused” (as opposed to “text-focused”) elements represents a significant additional commitment to modeling two distinct ontologies within the Text Encoding Initiative (TEI) Guidelines, and places increased strain on the notion of “interchange” between and among TEI data modeled according to these two approaches. This paper will describe challenges encountered by members of the development and editorial teams of the Shelley-Godwin Archive (S-GA) in attempting to produce TEI-encoded data reflecting both “document-focused” and “text-focused” approaches through automated conversion. S-GA started out, like most electronic literary archives, with the primary goal of providing users access to rare and widely dispersed primary materials, but increasingly the direction of the project will be to take advantage of the tremendous potential of its multi-layered architecture to re-conceptualize and design the whole as a work-site, or what some are calling an “animated archive,” whose ultimate goal is to make the S-GA material massively addressable in a form that encourages user curation and exploration. The ability to convert from “document-focused” to “text-focused” data, from work-site to publication, will partly determine how participatory the archive can be.

Background & Motivation

The Shelley-Godwin Archive is a project involving the Maryland Institute for Technology in the Humanities (MITH) and the Bodleian, British, Huntington, Houghton, and New York Public libraries that will contain the works and all known manuscripts of Mary Wollstonecraft, William Godwin, Percy Bysshe Shelley, and Mary Wollstonecraft Shelley. We wish to produce two distinct representations of the S-GA materials so as (1) to provide rigorous, semi-diplomatic transcriptions of the fragile manuscripts for those with an interest in the compositional practices of what has been called “England’s First Family of Literature” and (2) to make available clear “reading texts” for those who are primarily interested in the final state of each manuscript page.

The start of text encoding work on the S-GA coincided with the addition of new “document-focused” elements to the TEI in the release of P5 version 2.0.1. Given that the majority of materials in the collection consist of autograph manuscripts, the project team quickly adopted several of these new elements into its TEI customization. The “genetic editing” approach has served the project well—allowing the encoding scheme to target features of the documents that are of greatest interest to the scholarly editors and to rigorously describe often complicated sets of additions, deletions, and emendations that will support further scholarship on the composition process of important literary works. The work of automating the production of usable “reading texts” encoded in “text-focused” TEI markup from data that is modeled according to a “document-focused” approach has proven much more challenging.

Encoding Challenges

The conflict between representing multiple hierarchies of content objects and the affordances of XML is well known and the TEI Guidelines discuss several possible solutions. One of these solutions is to designate a primary hierarchy and to represent additional hierarchies with empty milestone elements that can be used by some processing software to “reconstruct” an alternate representation of the textual object. The approach taken by the S-GA team to produce both “document-focused” and “text-focused” TEI data is a version of the milestone-based approach. The document-focused, “genetic editing” elements form the principal hierarchy (consisting of “<surface>,” “<zone>,” “<line>,” etc.) and milestone elements are supplied to support automatic conversion to “text-focused” markup (which will contain elements such as “<div>,” “<p>,” “<lineGrp>,” etc.).
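Schematically, the milestone-based approach looks like this (the content, identifiers and attribute values are invented for illustration and do not reproduce the S-GA encoding): the genetic hierarchy is primary, and empty milestones mark where a text-focused paragraph would begin and end.

```xml
<surface xml:id="f1r">
  <zone type="main">
    <!-- signals that a text-focused <p> starts here -->
    <milestone unit="p" xml:id="p1-start" next="#p1-end"/>
    <line>It was on a dreary night of November</line>
    <line>that I beheld the accomplishment of my toils.</line>
    <!-- signals the end of the same <p> -->
    <milestone unit="p" xml:id="p1-end" prev="#p1-start"/>
  </zone>
</surface>
```

A converter can then walk the document, replace each paired milestone with the corresponding text-focused element, and discard the purely documentary markup.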

This solution places an increased burden on document encoders to maintain “correctness,” thus potentially lowering data consistency and quality. For instance, empty milestone elements representing the beginning and ending of textual features have no formal linkages as part of the document tree. Encoders must supply identifiers and pointers to indicate these linkages. Validating that these identifiers and pointers pair correctly must be accomplished by some mechanism other than the RelaxNG validation that verifies most other elements of the document structure. As noted above, managing multiple hierarchies through the use of milestones is not new. We do argue that the introduction of additional “document-focused” elements in the TEI increases the scope for projects to produce data that reflect two divergent ontologies and thus to encounter the difficulties involved in this “workaround.”
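Pairing constraints of this kind can be expressed outside RelaxNG, for instance as a Schematron rule embedded in the project’s ODD. A hypothetical sketch (the pointer convention is invented for illustration):

```xml
<sch:pattern xmlns:sch="http://purl.oclc.org/dsdl/schematron"
             xmlns:tei="http://www.tei-c.org/ns/1.0">
  <sch:rule context="tei:milestone[@next]">
    <!-- every begin-milestone must point at an element that actually exists -->
    <sch:assert test="//*[@xml:id = substring(current()/@next, 2)]">
      The @next pointer does not resolve to an element in this document.
    </sch:assert>
  </sch:rule>
</sch:pattern>
```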

More importantly, the use of the milestone strategy decreases the reusability of the data. For example, to support automated conversion from “document-focused” to “text-focused” data representations, the S-GA team needed to go beyond purpose-built milestone elements like “<delSpan>” and “<addSpan>” and, in effect, semantically overload the general-purpose “<milestone>” element. The value of an attribute on “<milestone>” indicates which “text-focused” element is intended to appear in a particular location. This solution is explained in the documentation, and the convention used would be (we think) evident after cursory examination. Nonetheless, we are forced to add markup to the “document-focused” data which makes it more specific to the S-GA project and less easily consumable by future users with different goals. This is even more troubling because the “document-focused” data is the true work-site where we hope to invite future collaborators to engage and extend the project.

Maintainability & Provenance Challenges

To avoid the conceptual and technical challenges involved in automating the transformation between “text-focused” and “document-focused” representations, the two sets of data could have each been created by hand and maintained separately. Indeed, this is the approach followed by the Digitale Faustedition project, where a distinction between what the project calls “documentary” and “textual” transcription was considered necessary not only as a reaction to encoding problems, but also as a practical application of theoretical distinctions between documentary record and editorial interpretation. The Faustedition project team, however, still encountered technical challenges when trying to correlate and align these two transcriptions automatically. Use of collation and natural language processing tools helped with this problem, but eventually more manual intervention was needed (Brüning et al. 2013).

The S-GA team felt that maintaining two data sets representing different aspects of the textual objects would have led to serious data consistency, provenance, and curation problems. As the example of the Faustedition project shows, separate representations must be kept in sync with project-specific workflows developed for this purpose. In the case of S-GA, documentary transcription is the main focus; the greatly increased cost and time involved in also maintaining a textual transcription would have reduced the size of the corpus that could be encoded and thus the amount of materials from the archive that could be made fully available under the current phase of the project.

Presentation Challenges

The display and presentation of “document-focused” encoding is another technical challenge introduced by the new TEI elements: producing a diplomatic transcription through a TEI-to-HTML transformation is not trivial, and it is often limited by HTML’s own capabilities. A canvas-based system, such as PDF or SVG, is better suited to presenting document-focused encoding.

S-GA is developing and using a viewer for SharedCanvas, a technology developed at Stanford University that allows editors (and potentially future users) to construct views out of linked data annotations. Such annotations, expressed in the Open Annotation format, relate images, text, and other resources to an abstract “canvas”. In S-GA, “document-focused” TEI elements are mapped as annotations to a SharedCanvas manifest and displayed. Further layers of annotations can be added dynamically, for example search-result highlights as well as user comments and annotations. The engagement of students and other scholars will be driven by the possibility of creating annotations in the Open Annotation format, so that any SharedCanvas viewer will be able to render them. It remains a matter for the future development of the project to determine whether some annotations, especially those pertaining to transcription and editorial statements, can be added dynamically to the source TEI.
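As a hedged sketch of what such an annotation might look like (the URIs, coordinates, and transcribed text here are invented for illustration; the vocabulary follows the SharedCanvas and Open Annotation specifications), a transcription annotation painting one TEI line onto a region of a canvas could be expressed as:

```json
{
  "@context": "http://www.shared-canvas.org/ns/context.json",
  "@type": "oa:Annotation",
  "motivation": "sc:painting",
  "resource": {
    "@type": "cnt:ContentAsText",
    "format": "application/tei+xml",
    "chars": "<line>to examine the cause of life</line>"
  },
  "on": "http://example.org/sga/canvas/notebook-a-p5#xywh=142,310,1210,74"
}
```

A viewer resolves the "on" target to a rectangle on the canvas and renders the TEI content there; search highlights and user comments are simply further annotations with different motivations layered onto the same canvas.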


The attempt to automatically generate “text-focused” markup from “document-focused” markup forced the project team to confront the intellectual challenges which the introduction of the genetic editing element set makes urgent. The larger stakes involved were made clear to the project team during our recent experiments with the distributed TEI encoding of the manuscripts of Frankenstein and Prometheus Unbound by graduate students at the University of Maryland and the University of Virginia. The attempt to bring additional encoders of various skill levels into the editing and encoding of the Shelley-Godwin materials revealed the importance of being able to convert from “document-focused” to “text-focused” data because this ability will partly determine how participatory the archive can be. The Digital Humanities is now undergoing what might be called a “participatory turn” that poses for the creators of digital literary archives such questions as (1) How can humanists best curate and explore our datasets? (2) How can we bring our research into the graduate and undergraduate classroom, including the process of text encoding? and (3) How can we fruitfully engage the public, “citizen humanists,” in the work of the humanities? The potential to address these larger questions will necessarily proceed from the way in which the TEI community grapples with the modeling challenges of supporting two distinct ontologies of textual objects.


The Shelley-Godwin Archive is a collaborative endeavor. In developing the ideas in this paper, we have benefited from discussions with Travis Brown, Jim Smith, David Brookshire, Jennifer Guiliano, and other members of the Shelley-Godwin Archive project team.


  • Bauman, S. “Interchange Vs. Interoperability.” Montréal, QC. Accessed April 7, 2013. doi:10.4242/BalisageVol7.Bauman01.
  • Brüning, G., et al. Multiple Encoding in Genetic Editions: The Case of “Faust” http://jtei.revues.org/697
  • Pierazzo, E. A rationale of digital documentary editions http://llc.oxfordjournals.org/content/26/4/463
  • Sanderson, R., et al. SharedCanvas: A Collaborative Model for Medieval Manuscript Layout Dissemination http://arxiv.org/pdf/1104.2925v1.pdf

Beyond nodes and branches: scripting with TXSTEP

Two years ago, at the 2011 TEI members meeting in Würzburg, we presented a first feasibility study and preliminary model of TXSTEP, an open-source, XML-based scripting language which will make the power of TUSTEP available through an up-to-date and self-explanatory user interface.

TUSTEP itself is known as a very powerful set of tools for the processing, analysis, and publication of texts that meets the requirements of scholarly research – and at the same time as having a very steep learning curve, an unfamiliar command-line-based user interface, and documentation which is available in German only.

TXSTEP breaks down these barriers to the usability of these tools. It makes them available to the growing e-humanities community, offering a powerful tool for tasks which cannot easily be performed by the scripting tools commonly used for this purpose. At the same time, it allows these tools to be integrated into existing XSL-based workflows.

Compared to the original TUSTEP command language, TXSTEP

  • offers an up-to-date and established syntax
  • allows you to draft scripts using the same XML editor as when writing XSLT or other XML-based scripts
  • lets you enjoy the typical benefits of working with an XML editor, like content completion, highlighting, showing annotations, and, of course, verifying your code,
  • offers – to a certain degree – a self-teaching environment by commenting on the scope of every step.

TXSTEP has in the meantime been subjected to a closer examination by Michael Sperberg-McQueen regarding its overall goal and design, the syntax and structure of the XML command language, including details of naming and style, operating system dependencies, and its positioning within the XML software ecosystem. His criticisms and proposals – and his very encouraging final remarks – have been very helpful for the further work on the system over the past two years. As a result, we can now present and offer for download a running system containing the modules described below in the current version 0.9.

In the February 2012 issue of the TEI Journal, Marjorie Burghart and Malte Rehbein reported on the results of a survey they had carried out and which “highlight the need for user-friendly, bespoke tools facilitating the processing, analysis, and publishing of TEI-encoded texts”.

With this paper, we want to show how TXSTEP, though not restricted to working with TEI- or XML-encoded texts, could meet many of these needs of text-based research.

The term “user-friendly”, used in the report, suggests that a typical user will be guided by an intuitive interface to ready-made solutions for a problem foreseen by the developer of the respective tool. “But”, to quote Martin Müller from Northwestern University, “that is not what happens in research”.

TXSTEP aims at being “user-friendly” above all to the “exploratory” user who is seriously engaged in research. The tools such a user needs are different: of course, they have to avoid the need for elementary programming. TXSTEP therefore offers program modules for the very basic or elementary operations of text data handling. These modules allow for further adaptation (e.g., defining the collating sequence required for sorting the words of non-English texts). It is possible to run each of these modules separately, but also to combine them with any other module of the system.

The TXSTEP modules include:

  • collation of different versions of a text, the results being stored (including TEI-based tagging) in a file for further automatic processing, in addition to being available for visual inspection;
  • text correction and enhancement not only by an interactive editor, but also in batch mode, e.g. by means of correction instructions prepared beforehand (by manual transcription, or by program, e.g. the collation module);
  • decomposing texts into elements (e.g. word forms) according to rules provided by the user, preparing them for sorting according to user-defined alphabetical rules and other sorting criteria;
  • building logical entities (e.g. bibliographic records) consisting of more than one element or line of text and preparing them for sorting;
  • sorting such elements or entities;
  • preparing indexes by generating entries from the sorted elements;
  • transforming textual data by selecting records or elements, by replacing strings or text parts, by rearranging, complementing or abbreviating text parts;
  • integrating additional information into a file by means of acronyms;
  • updating cross-references;
  • (by including respective native TUSTEP scripts:) professional typesetting, meeting ambitious layout demands as needed for critical editions.

As the output of any one of these modules may serve as input to any other module (including XSLT-stylesheets), the range of research problems for which this system may be helpful is quite wide.

A set of modules like these is not really appropriate for the occasional end user; its purpose is to make the professional user or the serious humanities scholar independent of outside programming, even for work not explicitly foreseen by the developers, and at the same time to give him complete control over every detail of the data-processing part of his project. It is the user himself who, instead of using a black box, defines in every detail the single steps to be performed.

It is obvious that the use of a modular system like this differs essentially from the use of tools that claim intuitive usability. It differs in two points:

  • First, it requires previous learning, and
  • Second, it requires analyzing a problem before starting to solve it.

It shares these features with other scripting languages.

While there is usually no way to escape the second point, TXSTEP offers a remedy for the first problem.

How “user-friendly” this can be for professional use in a research environment, we will demonstrate live by means of some elementary examples of text handling and text analysis which cannot easily be solved with existing XML tools.


  • Eberhard Karls Universität Tübingen, Zentrum für Datenverarbeitung: TUSTEP. Tübinger System von Textverarbeitungsprogrammen. Version 2013. Handbuch und Referenz. http://www.tustep.uni-tuebingen.de/pdf/handbuch.pdf
  • Tübinger System von Textverarbeitungs-Programmen TUSTEP. http://www.tustep.uni-tuebingen.de
  • TXSTEP – an integrated XML-based scripting language for scholarly text data processing. In: digital humanities 2012. Conference Abstracts.
  • Creating, enhancing and analyzing TEI files: the new, XML-based version of TUSTEP. In: Philology in the Digital Age. Annual TEI Conference, Würzburg 2011.
  • XSTEP – die XML-Version von TUSTEP. http://www.xstep.org

TEI in LMNL: Implications for modeling

What might TEI look like if it were not based in XML? This is not simply an aesthetic question (TEI using a different sort of tagging syntax) but a very practical one, inasmuch as XML comes with limitations and encumbrances along with its strengths. Primary among these (as has been recognized since the first applications of SGML to text encoding in the humanities) is the monolithic hierarchy imposed by the XML data model. Texts of interest to the humanistic scholar frequently have multiple concurrent hierarchies (in addition to the ‘logical’ structure of a text generally presented in XML, we have physical page structures; dialogic and narrative structures; the grammar of natural language; rhetorical and verse structures; etc. etc.), as well as ‘arbitrary overlap’ — constructs found in the text stream that form no hierarchy at all, such as ranges to be indexed or annotated, which can overlap freely both with other structures and with one another.

Of course, TEI proposes mechanisms for dealing with these (in an entire chapter of the Guidelines devoted to this topic), and since the introduction of XPath/XSLT 2.0 along with XQuery, we have more capable means for processing them. But the code we have to write is complex and difficult to develop and maintain. What if we didn’t have to work around these problems?

LMNL (the Layered Markup and Annotation Language) offers such a model, and a prototype LMNL processing pipeline — Luminescent, supporting native LMNL markup on an XML/XSLT platform — offers a way to explore these opportunities. TEI XML documents can be processed programmatically to create LMNL markup, with its representations of overlap (whether using milestones, segmentation, or standoff) converted into direct markup representations. Once in LMNL syntax, ranges and annotation structures can be used to refactor complex XML structures into simpler forms directly corresponding (i.e., without the overhead of pointers) to the textual phenomena they apply to. In particular, the LMNL model has two features that (separately and together) enable significant restructuring and resolution of modeling issues, exposing complexities as they are rather than hiding phenomena (which in themselves may be simple or complex) behind necessary complexities of syntax:

  • Because ranges can overlap freely, families of related ranges emerge, each family overlapping others, but no ranges within a single family overlapping other ranges in the same family. (And here we have multiple concurrent hierarchies, although in LMNL the hierarchical relation among ranges in a single family is only implicit.) For example, one set of ranges represents a clean logical hierarchy of books, chapters, sections and paragraphs, while another represents the pagination of a physical edition, while a third represents a narrative structure. LMNL processing can disentangle these from one another, rendering any of them as a primary (‘sacred’) hierarchy in an XML version.

    By the same token, it becomes possible to discern (through analysis of which ranges overlap others of the same or different types) where overlap is truly arbitrary: where, that is, the information indicated by a range (such as an annotated or indexed span) must be permitted to overlap others even of the same type. In other words, typologies of ranges and range types emerge that either relate them systematically to one another or deliberately permit them to be unrelated.

  • Since LMNL annotations can be structured and their contents marked up, annotations can take on more of the burden of data capture than is easily or gracefully done with XML attributes. It becomes possible once again, even at significant levels of complexity, to make a broad distinction between the text being marked up, and the apparatus attached to the text.
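A hedged sketch of the syntax (simplified from the published LMNL examples; the range names and content are illustrative) shows a page range overlapping a paragraph range, with a structured annotation carried on the page start-tag:

```
[page [n}73{n]}
[p}A paragraph that begins on this page
{page]
[page [n}74{n]}
and ends on the next, overlapping both.{p]
{page]
```

Here `[p}` opens a range and `{p]` closes it; because ranges need not nest, the paragraph freely crosses the page boundary, and the page number travels as an annotation (`[n}73{n]`) inside the start-tag rather than as a separate attribute-like structure.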

Demonstrations will be offered, showing both TEI data in LMNL, and the kinds of outputs (in plain text, HTML, SVG or XML including TEI) that can be generated from it.


This is only a partial (in fact quite incomplete) bibliography of work in this area.

  • David Barnard, Ron Hayter, Maria Karababa, George Logan and John McFadden. 1988. SGML-Based Markup for Literary Texts: Two Problems and Some Solutions. Computers and the Humanities, Vol. 22, No. 4 (1988), pp. 265-276.
  • David Barnard, Lou Burnard, Jean-Pierre Gaspart, Lynne A. Price, C. M. Sperberg-McQueen and Giovanni Battista Varile. 1995. Hierarchical Encoding of Text: Technical Problems and SGML Solutions. Computers and the Humanities, Vol. 29, No. 3, The Text Encoding Initiative: Background and Context (1995), pp. 211-231.
  • CATMA: Computer Aided Textual Markup and Analysis.
  • James H. Coombs, Allen H. Renear, and Steven J. DeRose. 1987. Markup Systems and The Future of Scholarly Text Processing. Communications of the ACM, 30:11 933-947 (1987).
  • Claus Huitfeldt. 1994. Multi-Dimensional Texts in a One-Dimensional Medium. Computers and the Humanities, Vol. 28, No. 4/5, Humanities Computing in Norway (1994/1995), pp. 235-241.
  • Paolo Marinelli, Fabio Vitali, and Stefano Zacchiroli. 2008. Towards the unification of formats for overlapping markup.
  • Wendell Piez. 2004. Half-steps toward LMNL. In Proceedings of Extreme Markup Languages 2004.
  • Wendell Piez. 2008. LMNL in Miniature: An introduction. Amsterdam Goddag Workshop, December 2008. Presentation slides.
  • Wendell Piez. 2010. Towards Hermeneutic Markup: an Architectural Outline. Presented at Digital Humanities 2010 (King’s College, London), July 2010. Abstract and slides.
  • Wendell Piez. 2011. TEI Overlap Demonstration.
  • Wendell Piez. 2012. Luminescent: parsing LMNL by XSLT upconversion. Presented at Balisage: The Markup Conference 2012 (Montréal, Canada), August 2012. In Proceedings of Balisage: The Markup Conference 2012. Balisage Series on Markup Technologies, vol. 8 (2012). doi:10.4242/BalisageVol8.Piez01.
  • Allen Renear, Elli Mylonas and David Durand. 1993. Refining our Notion of What Text Really Is: The Problem of Overlapping Hierarchies.
  • Desmond Schmidt. 2010. The inadequacy of embedded markup for cultural heritage texts. Literary and Linguistic Computing (2010) 25(3): 337-356. doi: 10.1093/llc/fqq007.
  • C. M. Sperberg-McQueen. 1991. Text in the Electronic Age: Textual Study and Text Encoding, with Examples from Medieval Texts. Literary and Linguistic Computing, Vol. 6, No 1, 1991.
  • C. M. Sperberg-McQueen. 2006. Rabbit/duck grammars: a validation method for overlapping structures. In Proceedings of Extreme Markup Languages 2006, Montreal, August 2006.
  • M. Stührenberg and D. Goecke. 2008. SGF – An integrated model for multiple annotations and its application in a linguistic domain. Presented at Balisage: The Markup Conference 2008 (Montréal, Canada), August 2008. In Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1 (2008). doi: 10.4242/BalisageVol1.Stuehrenberg01.
  • M. Stührenberg and D. Jettka. 2009. A toolkit for multi-dimensional markup – The development of SGF to XStandoff. Presented at Balisage: The Markup Conference 2009 (Montréal, Canada), August 2009. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). doi: 10.4242/BalisageVol3.Stuhrenberg01.
  • Jeni Tennison and Wendell Piez. 2002. The Layered Markup and Annotation Language (LMNL). Extreme Markup Languages 2002.
  • Jeni Tennison. 2007. Creole: Validating Overlapping Markup. Presented at XTech 2007.
  • Text Encoding Initiative (TEI). P5: Guidelines for Electronic Text Encoding and Interchange, chapter 20, Non-hierarchical Structures.
  • XStandoff.

TEI at Thirty Frames Per Second: Animating Textual Data from TEI Documents using XSLT and SVG

The growing abundance of TEI-encoded texts—including some rather large-scale collections such as those associated with the Brown University Women Writers Project, Perseus Digital Library, Wright American Fiction, and the University of Michigan’s Text Creation Partnership—in conjunction with an expanding palette of visualization tools, has made it possible to create graphic representations of large-scale phenomena. Visual representations, traditional examples of which include graphs, lists, concordances, tables, and charts, have often been used to bring focus to aspects that might otherwise be overlooked. That is, they are in part tools for noticing, assisting the user/reader in seeing what may be difficult or impossible to perceive in the textual flow when it is presented in the conventional manner. As Tanya Clement has recently observed, “Sometimes the view facilitated by digital tools generates the same data human beings . . . could generate by hand, but more quickly,” and sometimes “these vantage points are remarkably different . . . and provide us with a new perspective on texts.” And as Dana Solomon has written, “[d]ue in large part to its often powerful and aesthetically pleasing visual impact, relatively quick learning curve … and overall ‘cool,’ the practice of visualizing textual data has been widely adopted by the digital humanities.” When used for large textual corpora, visualizations can, among numerous other possibilities, represent change over time, group common characteristics among texts, or highlight differences among them, correlated by such factors as author, gender, period, or genre. At the University of Nebraska–Lincoln’s Center for Digital Research in the Humanities we have been experimenting with a new way of visualizing phenomena in TEI corpora and have created an experimental XSLT-based tool that queries TEI files and generates animated videos of the results. 
Using XPath and XQuery techniques, this tool makes it possible to ask specific or general questions of a corpus such as: “What is the most frequently-occurring 3-gram in each text in this writer’s oeuvre?” or “When did the poet begin to favor use of the word ‘debris’?” The data are then output as scalable vector graphic (SVG) files that are converted to raster images and rendered in video at 30 frames per second. Our present goal is to test this alpha version with the writings of Walt Whitman, or, more specifically, with a particular Whitman poem.
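The tool itself is XSLT-based, but the pipeline it implements can be sketched in a few lines of Python (a simplified stand-in for illustration, not the Archive's actual code): extract the text of a TEI file, find its most frequent 3-gram, and emit one SVG frame reporting the result.

```python
import re
from collections import Counter
from xml.etree import ElementTree as ET

def top_trigram(tei_xml: str):
    """Return the most frequent 3-gram in a TEI document, with its count."""
    root = ET.fromstring(tei_xml)
    # Flatten all text nodes, lowercase, and tokenize into words
    words = re.findall(r"[\w']+", " ".join(root.itertext()).lower())
    trigrams = zip(words, words[1:], words[2:])
    return Counter(trigrams).most_common(1)[0]

def svg_frame(label: str, count: int, width: int = 640, height: int = 360) -> str:
    """Emit one SVG frame; in the real pipeline such frames are
    rasterized and assembled into video at 30 frames per second."""
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f'<text x="20" y="40" font-size="24">{label}: {count}</text></svg>'
    )

# Toy TEI input (invented for the example)
doc = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<l>I hear America singing the varied carols I hear America singing</l>
</body></text></TEI>"""
trigram, count = top_trigram(doc)
frame = svg_frame(" ".join(trigram), count)
```

Running this over every text in a corpus, one frame per text, yields the kind of animated answer to questions like “what is the most frequent 3-gram in each text?” that the tool produces.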

The Whitman Archive has been producing TEI-encoded texts of Whitman’s work since 2000 and offers access to a huge variety of textual data both by and about Whitman. Among these is a poor-quality 40-second recording of someone, possibly Whitman himself, reading the first four lines of one of his lesser-known poems. Even though the Archive makes it clear that the voice may not even be Whitman’s, this sound recording of “Whitman” reading “America” has been surprisingly popular and compelling. It is one of the most frequently requested pages on the site and was recently the focus of an article in Slate. One reason for the recording’s popularity, surely, is its immediacy; it brings Whitman’s words to life, performing them in a way that they are not when users encounter the words as fixed characters on a page or screen. The sound recording also reminds us of the importance of the performative aspect of Whitman’s poetry specifically and of poetry generally. Early in his career, Whitman often recited from Shakespeare and other poets for the entertainment of ferry passengers and omnibus drivers, and his lecture notes from the 1880s demonstrate that he enjoyed performing a variety of poems—both his and others’.

The visualization tool that we have developed is, at this stage, utterly experimental; we make no claims about its superiority relative to other tools or even about its worth for literary analysis. Instead, we see its value as, first, an exploration of techniques for combining TEI and SVG data into ambitious vector-based animations and, second, as a demonstration of the potential for engaging the multi-sensory and multimodal aspects of texts. “Engagement” write Fernanda Viegas and Martin Wattenberg, “—grabbing and keeping the attention of a viewer—is the key to [data visualization's] broader success.” In representing the literary work as an absorbing performance, one that comprises both “data” and “art,” the tool we are developing is calculated to provoke responses in both informational and aesthetic registers. Performance and provocation are perhaps not the most efficient means of adducing, synthesizing, or rendering evidence, but they might well supplement other techniques in conveying some of the complex ways in which literary texts work.


  • Clement, T. “Text Analysis, Data Mining, and Visualizations in Literary Scholarship” in Literary Studies in the Digital Age: An Evolving Anthology (eds., Kenneth M. Price, Ray Siemens), 2013. Modern Language Association.
  • Solomon, D. “Building the Infrastructural Layer: Reading Data Visualization in the Digital Humanities.” MLA 2013 Conference Presentation. url: http://danaryansolomon.wordpress.com/2013/01/08/mla-2013-conference-presentation-from-sunday-162013/
  • Viegas, Fernanda, and Martin Wattenberg. “How to Make Data Look Sexy.” CNN Opinion, 19 April 2011. http://www.cnn.com/2011/OPINION/04/19/sexy.

Analysis of isotopy: a hermeneutic model

The presentation illustrates the analysis of isotopies in twentieth-century literature as a template for the deep interpretation of texts, one that augments traditional analytical procedures and proposes an evolution of practice.

The topic fits into the broad debate on literary criticism in the age of digital (re)producibility (Riva, 2011) and suggests a digitally informed rethinking of models and methods in textual hermeneutics (Ciotti, Crupi, 2012). The novelty consists in conducting narratological analysis while observing its macrostructural and microstructural results (styles, lexemes, isotopies), and in proposing a hermeneutic template that allows the semantic indexing of lexical families and isotopies, deducible through broad concepts: place, space, character, and identity.

With the proliferation of tools and technologies that multiply textual data and electronic editions in different formats (RTF, PDF, EPUB, OEB), the hermeneutic potential triggered by the computer when the text is divided into atoms of meaning decreases (Trevisan, 2008). Moreover, textual criticism often lacks the historical dimension to which the communicative act of the literary work testifies. The paper therefore proposes a solution to the problems of storing, distributing, and analyzing literary works in historical perspective, using TEI to encode semantic features of modern texts.

The analytic practice, promoted by the Crilet Laboratory at the Faculty of Arts of the Sapienza University of Rome, aims to expand the interpretive purpose of documents (Mordenti, 2007) through digital transcription and its redrafting in semantic markup. Using literary and hermeneutic rather than philological tags, it develops a pragmatic combination of history and semiotics, so that the digital document internally represents the interpretive model.

In fact, it is possible to partition a narrative corpus into many areas of meaning and to analyze it: vertically, studying the lexicon sorted from maximum frequency down to hapax legomena; semantically, studying the frequency and position of selected isotopies in the text (Greimas, 1970); and alphabetically, generating an alphabetical ordering to identify families of meanings. At this point, having built a system centered on the text, it is useful to begin critical reflection by adding XML markup linking to websites with historical references, within the model proposed by the TEI.
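The “vertical” and “alphabetical” readings just described amount to simple operations on the lexicon. A minimal Python sketch (illustrative only, not the laboratory's toolchain) makes the procedure concrete:

```python
import re
from collections import Counter

def frequency_profile(text: str):
    """Vertical reading: lexicon sorted from maximum frequency to hapax.
    Alphabetical reading: the same lexicon in alphabetical order."""
    words = re.findall(r"[\w']+", text.lower())
    counts = Counter(words)
    ranked = counts.most_common()                            # max frequency first
    hapax = sorted(w for w, n in counts.items() if n == 1)   # words occurring once
    alphabetical = sorted(counts)                            # families of meanings
    return ranked, hapax, alphabetical

# Toy corpus fragment (invented for the example)
ranked, hapax, alphabetical = frequency_profile("the sea and the sky and the sea")
```

The semantic reading of selected isotopies would build on the same token positions, tracking where the members of a lexical family occur across the text.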

Some examples of twentieth-century analyses are available as a result of my decades of work at the Sapienza University of Rome. Considering my work as the reading of a “lector in fabula”, educated and skilled, enables me to establish with the narrative material a close relationship that also involves the author as creator. These two figures are bound by the joint effort of giving a real and imaginary birth to the object of art. Thus, the markup should take care of an object that expresses meaning on two planes: reality and imagination. New technologies are helpful in separating the two levels because of their natively digital architecture. In this way, the option of humanities computing emerges as an epistemological choice rather than an instrumental change. A radical rethinking of the concept of text appears in a new light: not a deformity designed by artificial systems, but a strict vitality, given by the automation process (Mordenti, 2007).

The paper therefore aims to underline the potential of textual analysis using TEI markup, providing for electronic text processing and following Segre’s ideas (Ciotti, 2007; Orlandi, 2010; Fiorentino, 2011; Riva, 2011). A system built in this way would encourage the study of narrative in its historical, social, and cultural aspects; it can also be a valid tool for interpreting textual themes and motifs in relation to their historical context, especially in secondary schools and universities, where it is easy for digitally native students to use.

The research project therefore brings together skills of different types, drawn from various scientific fields and disciplines, and thus has a clearly interdisciplinary character.

Historical and literary competences are necessarily joined by skills in humanities computing, digital cultures (Ciotti, 2012), and textual theory, which give greater depth to the proposed analytical practice.


  • Burnard, Il manuale TEI Lite: introduzione alla codifica elettronica dei testi letterari, a cura di Fabio Ciotti, Milano, Sylvestre Bonnard, 2005
  • Ciotti, Il testo e l’automa. Saggi di teoria e critica computazionale dei testi letterari, Roma, Aracne, 2007
  • Ciotti, Crupi (a cura di), Dall’Informatica umanistica alle culture digitali. Atti del Convegno di studi in memoria di Giuseppe Gigliozzi (Roma, 27-28 ottobre 2011), Roma, Università La Sapienza, 2012
  • Fiormonte, Scrittura e filologia nell’era digitale, Milano, Bollati Boringhieri, 2003
  • Fiormonte, Numerico, Tomasi (a cura di), L’umanista digitale, Bologna, Il Mulino, 2010
  • Gigliozzi, Il testo e il computer. Manuale di informatica per gli studi letterari, Milano, Bruno Mondadori, 1997
  • Greimas, Del senso, Milano, Bompiani, 1970
  • Holister, Pensare per modelli, Milano, Adelphi, 1985
  • Landow, L’ipertesto. Tecnologie digitali e critica letteraria, trad. it. a cura di Paolo Ferri, Milano, Bruno Mondadori, 1998
  • Luperini, Il dialogo e il conflitto. Per un’ermeneutica materialistica, Bari, Laterza, 1999
  • Meyrowitz, Oltre il senso del luogo, Bologna, Baskerville, 1993
  • Mordenti, L’altra critica. La nuova critica della letteratura fra studi culturali, didattica e informatica, Roma, Meltemi, 2007
  • Orlandi, Informatica testuale. Teoria e prassi, Bari, Laterza, 2010
  • Pierazzo, La codifica dei testi, Roma, Carocci, 2005
  • Riva, Il futuro della letteratura. L’opera letteraria nell’epoca della sua (ri)producibilità digitale, Scriptaweb, 2011
  • Szondi, Introduzione all’ermeneutica letteraria (1975), trad. di Bianca Cetti Marinoni, introd. di Giorgio Cusatelli, Torino, Einaudi, 1992

TEI4LdoD: Textual Encoding and Social Editing in Web 2.0 Environments


Fernando Pessoa’s Book of Disquiet (Livro do Desassossego – LdoD) is an unfinished book project. Pessoa wrote more than five hundred texts meant for this work between 1913 and 1935, the year of his death. The first edition of this book was published only in 1982, and another three major versions have been published since then (1990, 1998, 2010). As it exists today, LdoD may be characterized as (1) a set of autograph (manuscript and typescript) fragments, (2) mostly unpublished at the time of Pessoa’s death, which have been (3) transcribed, selected, and organized into four different editions, implying (4) various interpretations of what constitutes this book. Editions show four major types of variation: variation in readings of particular passages, in selection of fragments, in their ordering, and also in heteronym attribution.


The goal of the LdoD Archive14 is twofold: on the one hand, we want to provide a “standard” archive where experts can study and compare LdoD’s authorial witnesses and their different editions; on the other hand, we want to design a virtual archive that allows both experts and non-experts to experiment with the production of different editions of LdoD, and also the writing of their own fragments based on LdoD’s original fragments.15 Therefore, this latter goal, which is built on top of the archival goal, extends a scholarly understanding of LdoD as both authorial project and editorial construct to a new perspective of LdoD as an individual and/or community editing and writing exploratory environment based on the authorial and editorial witnesses.


Given the above set of goals, the LdoD Archive has to accommodate both scholarly standards and requirements for digital archives, such as the use of TEI to encode literary texts, and the virtual-community and social-software features needed to support the social edition of LdoD by other experts and non-experts alike.16 This second aspect calls for a dynamic archive in which both types of end-users can edit their own versions of LdoD and write extensions of the original fragments, while the archive experts’ interpretations and analyses of LdoD are kept “unchanged” and clearly separated from the socialized editions and writings. In addition, it is necessary to define how the specifics of LdoD are represented in TEI: for instance, how we distinguish authorial witnesses (textual records) from the editions and their respective interpretations, as when one editor assigns a fragment to the heteronym Vicente Guedes while another assigns it to Bernardo Soares.


The solution we propose for these challenges is based on a TEI template that encodes all the authorial and editorial witnesses, and on a software architecture that combines the traditional query and search facilities of a digital humanities archive with Web 2.0 functionality.

Representation in TEI

We have encoded LdoD as a TEI Corpus containing a TEI header for each of the fragments. Besides the project information represented at the corpus level, we have described properties common to the whole LdoD, which include (1) the set of editions and (2) Pessoa’s heteronyms. For each fragment we have encoded in the fragment header both the original authorial sources and the four editorial sources as witnesses. This approach allows us to associate interpretation metadata with the context of each witness. Users will be able to compare digital facsimile representations of authorial witnesses (and topographic transcriptions of those witnesses) with editorial witnesses. The latter can also be compared against each other in order to highlight their interpretations of the source. However, there are still some open issues. The first is that the separation between editorial and authorial sources is by convention, and it is not clear how, in terms of interoperability, an external application can “understand” and process this distinction. The second is related to the dynamic evolution of the archive in terms of Web 2.0 requirements: how can the TEI code be changed as a result of end-users’ interactions with the archive? The traditional approach to TEI encoding is static, carried out through tools like oXygen, but our requirements call for supporting the evolution of LdoD as a continuously re-editable and rewritable book. This means enabling the addition of new virtual editions and heteronyms to the Corpus, and of new fragments that extend the original ones. Additionally, end-users can define their own interpretations of any of LdoD’s fragments, e.g. by using tags, which results in the generation of new editions of the book through the execution of a categorization algorithm. This open issue is partially addressed by the software architecture we propose in the next section.
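The witness structure described above can be sketched in TEI as follows; the element names (<listWit>, <witness>) are standard TEI, but the xml:id values and descriptions are hypothetical illustrations rather than the archive’s actual encoding:

```xml
<listWit>
  <!-- authorial sources: autograph (manuscript or typescript) fragments -->
  <witness xml:id="frg001-ms" n="authorial">Autograph manuscript of the fragment</witness>
  <!-- editorial sources: the four printed editions of LdoD -->
  <witness xml:id="frg001-ed1982" n="editorial">Reading of the fragment in the 1982 edition</witness>
  <witness xml:id="frg001-ed1990" n="editorial">Reading of the fragment in the 1990 edition</witness>
  <witness xml:id="frg001-ed1998" n="editorial">Reading of the fragment in the 1998 edition</witness>
  <witness xml:id="frg001-ed2010" n="editorial">Reading of the fragment in the 2010 edition</witness>
</listWit>
```

Interpretation metadata, such as heteronym attribution, can then be attached in the context of each witness, keeping the authorial/editorial distinction explicit even if only by convention.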

Architecture Proposal

Most digital scholarly archives are static, by which we mean that the construction of the archive is separated from its use: the former is done using TEI and XML editors, while the latter is supported by XSLT transformations. This architectural approach is not feasible if we want to provide Web 2.0 functionality in the archive. However, we do not want to discard the encoding work already done in TEI by the experts. The architecture therefore needs to support traditional TEI encoding by the experts while enabling dynamic end-user interaction with the platform.

The key point of the proposal is the use of an object domain model to represent the LdoD archive. With this approach we first transform the TEI-encoded LdoD into the object model, and then allow the visualisation and editing of this object model through a web user interface. Additionally, TEI files can be regenerated from the object model. This approach has several advantages: (1) the archive’s experts continue using editing tools like oXygen to do their work; (2) end-users (experts and non-experts) can create their virtual editions and fragment extensions through the web user interface; (3) the object model preserves a semantically consistent LdoD archive by checking the consistency of end-users’ operations; (4) interoperability can be supported by exporting the regenerated TEI files; (5) it is possible to regenerate TEI files according to different formats; for instance, different methods can be used to link the critical apparatus to the text.

Our proposal explores current approaches to editing in electronic environments and attempts to integrate them with TEI conceptual and processing models. The object representation of transcriptions is related to work on data structures for representing multi-version objects (Schmidt and Colomb, 2009). We emphasize the need for a clear separation between content and presentation in order to simplify and empower presentation tools, as argued by Schlitz and Bodine (2009). With regard to Web 2.0 for the digital humanities, we are indebted to the proposals on cooperative annotation by Tummarello et al. (2005) and to the advantages and vision of Web 2.0 and collaboration in Benel and Lejeune (2009), Fraistat and Jones (2009), and Siemens et al. (2010). On the other hand, owing to a change of paradigm, our architectural proposal does not require the complexity of TextGrid as described by Zielinski et al. (2009). More recent research raises the need for several views of the encoding (Brüning et al., 2013); in our approach, different views are also relevant for interoperability and for simplifying the implementation of user interfaces. The work of Wittern (2013) stresses the need to allow dynamic editing of texts and management of versions.

The specific correlation of static and dynamic goals in the LdoD Digital Archive means that our emphasis falls on open changes that feed back into the archive. The TEI encoding and software design implications of this project lead us to address both the conceptual aspects of TEI schemas for modelling texts and documents, and the processing problems posed by the user-oriented virtualization of Pessoa’s writing and bibliographic imagination.

During the conference we intend to make a more detailed presentation of the LdoD Archive and show a demo of the prototype being developed.


We would like to thank Timothy Thompson for his contributions to the TEI template for LdoD and Diego Giménez for the encoding of LdoD fragments.

This work was supported by national funds through FCT – Fundação para a Ciência e a Tecnologia, under projects PTDC/CLE-LLI/118713/2010 and PEst-OE/EEI/LA0021/2013.


  • Barney, Brett (2012). ‘Digital Editing with the TEI Yesterday, Today, and Tomorrow’, in Textual Cultures, 7.1: 29-41.
  • Benel, Aurelien and Lejeune, Christophe (2009). ‘Humanities 2.0: Documents, Interpretation and Intersubjectivity in the Digital Age’. International Journal on Web Based Communities, 5.4: 562-576. DOI:10.1504/ijwbc.2009.028090
  • Brüning, Gerrit, Katrin Henzel, and Dietmar Pravida (2013). ‘Multiple Encoding in Genetic Editions: The Case of “Faust”’, Journal of the Text Encoding Initiative, ‘Selected Papers from the 2011 TEI Conference’, Issue 4, March 2013. http://jtei.revues.org/697
  • Burnard, Lou and Syd Bauman, eds. (2012). TEI P5: Guidelines for Electronic Text Encoding and Interchange, Charlottesville, Virginia: TEI Consortium. Available at http://www.tei-c.org/Guidelines/P5/
  • Earhart, Amy E. (2012). ‘The Digital Edition and the Digital Humanities’, in Textual Cultures, 7.1: 18-28.
  • Fraistat, Neil and Jones, Steven (2009). ‘Editing Environments: The Architecture of Electronic Texts’. Literary and Linguistic Computing, 24.1: 9-18. DOI: 10.1093/llc/fqn032
  • Schlitz, Stephanie and Bodine, Garrick (2009). ‘The TEIViewer: Facilitating the Transition from XML to Web Display’. Literary and Linguistic Computing, 24.3: 339-346. DOI: 10.1093/llc/fqp022
  • Schmidt, Desmond and Colomb, Robert (2009). ‘A Data Structure for Representing Multi-version Texts Online’. International Journal of Human Computer Studies, 67.6: 497-514. DOI:10.1016/j.ijhcs.2009.02.001.
  • Siemens, Ray, Mike Elkink, Alastair McColl, Karin Armstrong, James Dixon, Angelsea Saby, Brett D. Hirsch and Cara Leitch, with Martin Holmes, Eric Haswell, Chris Gaudet, Paul Girn, Michael Joyce, Rachel Gold, and Gerry Watson, and members of the PKP, Iter, TAPoR, and INKE teams (2010). ‘Underpinnings of the Social Edition? A Narrative, 2004-9, for the Renaissance English Knowledgebase (REKn) and Professional Reading Environment (PReE) Projects’, in Online Humanities Scholarship: The Shape of Things to Come, edited by Jerome McGann, Andrew M Stauffer, Dana Wheeles, and Michael Pickard, Houston, TX: Rice University Press. 401-460
  • Tummarello, Giovanni, Morbidoni, Christian, and Pierazzo, Elena (2005). ‘Toward Textual Encoding Based on RDF’. Proceedings of the 9th ICCC International Conference on Electronic Publishing. http://elpub.scix.net/data/works/att/206elpub2005.content.pdf
  • Vanhoutte, Edward (2006). ‘Prose Fiction and Modern Manuscripts: Limitations and Possibilities of Text-Encoding for Electronic Editions’, in Electronic Textual Editing, edited by Lou Burnard, Katherine O’Brien O’Keeffe, and John Unsworth, New York: Modern Language Association of America. 161-180.
  • Wittern, Christian (2013). ‘Beyond TEI: Returning the Text to the Reader’, Journal of the Text Encoding Initiative, ‘Selected Papers from the 2011 TEI Conference’, Issue 4, March 2013. http://jtei.revues.org/691
  • Zielinski, Andrea, Wolfgang Pempe, Peter Gietz, Martin Haase, Stefan Funk, and Christian Simon (2009). ‘TEI Documents in the Grid’. Literary and Linguistic Computing, 24.3: 267-279. DOI: 10.1093/llc/fqp016

TEI <msDesc> and the Italian Tradition of Manuscript Cataloguing

The Central Institute of Cataloguing (ICCU – Istituto Centrale per il Catalogo Unico e per le informazioni bibliografiche) of the Italian Ministry of Heritage and Culture uses the Text Encoding Initiative standard for exchanging the manuscript descriptions produced with Manus OnLine (http://manus.iccu.sbn.it/). Manus OnLine is the Italian national manuscript cataloguing project and, at the same time, the name of a widely used cataloguing application employed by more than 420 librarians and researchers. The catalogue contains around 130,000 records, created with a web application built on a MySQL relational database. The software is entirely open source.

The current web application allows the sharing of the authority file (a rich index of names connected with the manuscripts) and includes tools that make inserting and editing manuscript descriptions more agile. In the software’s four years of life, between 2009 and 2013, cooperative work has proved very useful; above all, procedures were streamlined for publishing the manuscript descriptions in the OPAC, which has in turn become an important, continuously updated tool: a real and proper catalogue in progress. But, in spite of the validity and importance of this cooperative catalogue, some individual libraries, and projects operating simultaneously in different institutions of conservation, need to treat their data outside the central database. These operators have asked for the export of their manuscript descriptions because, in most cases, they want to handle them independently in digital libraries. ICCU has therefore chosen to create an automatic tool that produces valid TEI documents. This choice respects the need to return the processed data to the libraries that produced it and that continue to hold rights over it.

In December 2012, a new module was added to the software, which allows the export of all the descriptions of a project, a library, or a specific collection, or even the description of a single manuscript. The new module was designed by Giliola Barbero and Gian Paolo Bagnato in collaboration with the ICCU Area of activity for the bibliography, cataloguing and inventory of manuscripts.

The choice of the TEI schema was made after careful consideration of bibliographic standards based primarily on the International Standard Bibliographic Description (ISBD), that is to say MARC and UNIMARC, given that many colleagues had initially expressed a preference for a common standard shared by cataloguers of both manuscripts and printed publications. However, the assessment of MARC and UNIMARC led to negative results. Although they are used by some libraries for structuring their manuscript descriptions, they do not in fact cover the information typical of manuscript description and, above all, its macrostructure. Manuscript cataloguing has traditionally proceeded by first creating a description of the physical aspects of the manuscript, followed by the description of a variable number of texts. In the case of composite manuscripts, the cataloguer first describes the physical aspects shared by the entire manuscript, then the physical aspects of the parts composing it and, finally, a variable number of texts.
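This traditional macrostructure maps directly onto the TEI <msDesc> element and its <msPart> children; a minimal sketch for a composite manuscript might look as follows (the shelfmark and contents are invented for illustration):

```xml
<msDesc>
  <msIdentifier>
    <repository>Biblioteca N</repository>
    <idno>MS 123</idno>
  </msIdentifier>
  <!-- physical aspects shared by the entire manuscript -->
  <physDesc><p>...</p></physDesc>
  <msPart>
    <msIdentifier><idno>MS 123, I</idno></msIdentifier>
    <!-- physical aspects of this part -->
    <physDesc><p>...</p></physDesc>
    <msContents>
      <!-- a variable number of texts -->
      <msItem><title>First text</title></msItem>
      <msItem><title>Second text</title></msItem>
    </msContents>
  </msPart>
  <msPart>
    <msIdentifier><idno>MS 123, II</idno></msIdentifier>
    <!-- ... -->
  </msPart>
</msDesc>
```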

This paper will first demonstrate the relevance of the TEI schema to this traditional macrostructure, showing how it coincides with the most significant models in the history of manuscript cataloguing. The points of contact between the elements used in the TEI msDesc, UNIMARC, and Dublin Core will then be highlighted, and we will attempt to provide a mapping of the key information shared by all three standards.

Secondly, this paper will discuss some critical aspects of the standards and how these have been temporarily resolved. These critical points mainly concern the following elements and kinds of information, which do not always prove suitable for structuring:

  • supportDesc
  • extent
  • measure
  • technical terms in the binding description
  • technical terms in the music notation description
  • information on manuscript letters

The ICCU evaluated the solutions adopted by the European e-codices and Manuscriptorium projects to describe the support, the number of folios, and the size of the manuscripts (solutions that differ from each other) and chose to adopt the practice most in line with the needs of Manus OnLine. It has, however, avoided creating further diversification, and it currently believes that a common choice would be useful. As regards the binding description and the music notation, while exploiting the TEI element term, the ICCU believes that further reflection is necessary. It is also absolutely necessary to examine and discuss the encoding of the physical description and content of manuscript letters in strict accordance with the components of the element msDesc.
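For the first three items in the list above, one possible encoding (among the divergent practices the projects mentioned implement differently) nests <measure> and <dimensions> inside <extent> within <supportDesc>; the values below are invented for illustration:

```xml
<supportDesc material="chart">
  <support><p><material>Paper</material></p></support>
  <extent>
    <measure unit="leaf" quantity="130">ff. 130</measure>
    <dimensions type="leaf" unit="mm">
      <height>290</height>
      <width>210</width>
    </dimensions>
  </extent>
</supportDesc>
```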


  • G. Barbero, S. Smaldone, Il linguaggio SGML/XML e la descrizione di manoscritti, «Bollettino AIB», 40/2 (giugno 2000), 159-179.
  • Reference Manual for the MASTER Document Type Definition. Discussion Draft, ed. by Lou Burnard for the MASTER Work Group, revised 06.Jan. 2011: http://www.tei-c.org/About/Archive_new/Master/Reference/oldindex.html
  • T. Stinson, Codicological Descriptions in the Digital Age, in Kodikologie und Paläographie im digitalen Zeitalter / Codicology and Palaeography in the Digital Age, hrsg. von / ed. by M. Rehbein, P. Sahle, T. Schaßan, Norderstedt, BoD, 2009, 35-51.
  • Zdeněk Uhlíř, Adolf Knoll, Manuscriptorium Digital Library and Enrich Project: Means for Dealing with Digital Codicology and Palaeography, in Kodikologie und Paläographie, 67-78.
  • P5: Guidelines for Electronic Text Encoding and Interchange: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/
  • e-codices – Virtual Manuscript Library of Switzerland: http://www.e-codices.unifr.ch/en
  • Manuscriptorium: http://www.manuscriptorium.com/
  • Manus OnLine: http://manus.iccu.sbn.it/

A stand-off critical apparatus for the libretto of Der Freischütz

Digital editions of opera librettos have been prepared using TEI on several occasions; notable examples are Opera Liber18 (Pierazzo 2005) and OPERA19 (Münzmay et al. 2011). Opera Liber publishes critical editions of librettos with the aim of promoting them as literary texts worthy of scholarly attention, in contrast to the common perception of librettos as ancillary material to operatic works. OPERA, on the other hand, develops from the premise that libretto and score are edited according to two independent traditions, and takes first steps towards an integrated edition of libretto and music sources.

The BMBF-funded project Freischütz Digital (FreiDi)20 takes a broad approach to the matter, with work packages dedicated to digitizing different kinds of sources of Carl Maria von Weber’s opera Der Freischütz. The project will include encoded text of both libretto sources (in TEI) and score sources (in MEI)21, as well as recorded audio performances. The modelling challenges include minimizing redundancy throughout the encoding, coordinating the corpus, and modelling variance and editorial intervention across the material. This paper discusses the approach taken to model the critical apparatus for the libretto, which uses stand-off techniques to encode variance across the corpus and aims to be able to refer to both textual and musical sources.


There are several sources for the Der Freischütz libretto, and most are easily accessible. They show that the work changed significantly over a long period of time, from the first ideas of the librettist Friedrich Kind (1817) to the premiere of Weber’s opera on 18 June 1821. Moreover, they reveal that Weber himself was crucially involved in the writing process. Proof of this can be found in the surviving manuscript and printed sources: the manuscript of the librettist Friedrich Kind, Weber’s manuscript copy, the surviving copies of the textbook in Berlin, Vienna (KA-tx15), and Hamburg, as well as the first print of the songs, the latter missing the dialogue passages. Weber’s autograph score (A-pt), several score copies, and the printed piano reduction constitute a corpus of revealing comparative sources for the libretto sources. Moreover, the multiple printed editions that Kind published from late 1821 / early 1822 to 1843 – all of which were meant as reading editions – show even more text versions and variants. Weber first sent manuscript copies of the libretto to a few theatres, but later sent the first complete print edition, which significantly influenced the performance tradition and reception of the work.

Common critical editorial practice in music balances historical overview with performance practice and produces “performable” texts, which often are a highly hypothetical construct based on an amalgamation of sources. In this context, the benefit of a digital edition is to transparently depict textual evolution and facilitate the mutually informed investigation and presentation of music and text sources.


FreiDi includes a TEI encoding for each of the libretto sources listed above. The encoding focuses on the dramatic and lyrical structure of the texts, while preserving original spelling, deleted and added material, etc.

These independent transcriptions are coordinated through a collation-like file (a “core” file) that encodes textual variance with <rdg> elements containing pointers to markup in the encodings of the sources. In general, this approach is similar to the collations generated after an alignment step in modern collation software such as Juxta and CollateX22; however, it is designed to operate at more than one level of tokenization, so that statements about variation can be attached to any element in the TEI-encoded sources. Similarly to the ‘double-end-point-attached’ method, the “core” file makes it possible to address variants that would cause overlapping issues if encoded with the ‘parallel segmentation’ method;23 yet it differs in keeping <app> statements independent from each other and from the text24. This approach is motivated by the fact that not every difference between sources will be marked as a variant, such as different uses of the Eszett or differences due to document structure such as patches and paste-overs. Using the core file to identify only what are considered “meaningful” variants allows the transcriptions to keep a higher level of detail without creating issues for collation.

The transcriptions focus substantially on the encoding of the dramatic structure; in fact, the data model will not use the new genetic encoding module, since it would impose a significant paradigm shift from a text-focused to a document-focused encoding. The editors can nonetheless still be detailed in their transcriptions, partly because variation statements are kept separate.

To illustrate this model briefly, let us consider the following verses from sources KA-tx15 and A-pt and the corresponding core file entry.

Source KA-tx15.xml:

<l xml:id="KA-tx15_l1">Sie erquicke,</l>
<l xml:id="KA-tx15_l2">Und bestricke</l>
<l xml:id="KA-tx15_l3">Und beglücke,</l>

Source A-pt.xml:

<l xml:id="A-pt_l1">Sie erquicke,</l>
<l xml:id="A-pt_l2">und beglükke</l>
<l xml:id="A-pt_l3">und bestrikke.</l>

Core file:

<app>
  <rdg wit="#KA-tx15">
    <ptr target="KA-tx15.xml#KA-tx15_l2"/>
    <ptr target="KA-tx15.xml#KA-tx15_l3"/>
  </rdg>
  <rdg wit="#A-pt">
    <ptr target="A-pt.xml#A-pt_l2"/>
    <ptr target="A-pt.xml#A-pt_l3"/>
  </rdg>
</app>

In this example, the core file records the inversion of verses, and the <app> statement is limited to a verse-level domain. The core is made of independent <app> statements, so differences in capitalization, punctuation and spelling that are not included at this point are encoded as separate statements instead. To record these, the granularity of the encoding needs to be greater, as shown in the following example.

Source KA-tx15.xml:

<l xml:id="KA-tx15_l1">Sie erquicke,</l>
<l xml:id="KA-tx15_l2">Und <w xml:id="KA-tx15_w1">bestricke</w></l>
<l xml:id="KA-tx15_l3">Und beglücke,</l>

Source A-pt.xml:

<l xml:id="A-pt_l1">Sie erquicke,</l>
<l xml:id="A-pt_l2">und beglükke</l>
<l xml:id="A-pt_l3">und <w xml:id="A-pt_w1">bestrikke.</w></l>

Core file:

<app>
  <rdg wit="#KA-tx15">
    <ptr target="KA-tx15.xml#KA-tx15_w1"/>
  </rdg>
  <rdg wit="#A-pt">
    <ptr target="A-pt.xml#A-pt_w1"/>
  </rdg>
</app>


Pointing to TEI sources from the “core” file introduces the managerial complexity typical of stand-off markup; for example, pointers need to be validated and verified. These issues can be overcome by efficient project management and good authoring tools. The model, however, requires that the TEI-encoded sources include semantically weak elements such as <seg>, <w>, <c> and <pc>, whose only role is to allow the core file to refer to the text at the right point. Managing these elements is considerably more laborious than managing id references. It would be more efficient to be able to point to (or annotate)25 portions of text without needing further XML elements. The TEI XPointer schemes may be useful in this case:26

<rdg wit="#KA-tx15">
  <ptr target="string-range(xpath1(*[@xml:id='KA-tx15_l2']), 4, 9)"/>
</rdg>
<rdg wit="#A-pt">
  <ptr target="string-range(xpath1(*[@xml:id='A-pt_l3']), 4, 9)"/>
</rdg>

However, implementations of XPointer are currently uneven and limited to XInclude, so using this approach in FreiDi would require implementing the schemes. Moreover, the current definition of string-range() operates within a “fragment”, that is, a well-formed XML context, which makes it difficult to select ranges that include an opening or closing tag. Hugh Cayless (2012) has recently suggested that TEI XPointer ought to be more sophisticated and has proposed an extension of the schemes.

Finally, the model has also been designed to classify <app> statements according to a specific taxonomy; this is possible because the statements are kept separate, so that they can address sections of text at different, possibly overlapping, levels. Categorizing variants has been one of the topics of discussion within the Manuscript Special Interest Group, which has been working on a revision of the critical apparatus module.27 The discussion around categorization has focused on what variants address, such as omissions, punctuation, transpositions, etc. The FreiDi project team is considering differentiating between variants addressing spelling, punctuation, and transposition, and variance caused by setting the text to music. The latter category in particular has not yet been explored in the field of digital editing.
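Under such a taxonomy, the verse-inversion entry shown earlier could carry a classifying attribute on <app>; the type values here are illustrative and not the project’s finalized categories:

```xml
<app type="transposition">
  <rdg wit="#KA-tx15">
    <ptr target="KA-tx15.xml#KA-tx15_l2"/>
    <ptr target="KA-tx15.xml#KA-tx15_l3"/>
  </rdg>
  <rdg wit="#A-pt">
    <ptr target="A-pt.xml#A-pt_l2"/>
    <ptr target="A-pt.xml#A-pt_l3"/>
  </rdg>
</app>
<app type="orthography">
  <!-- an independent statement for the spelling variant bestricke / bestrikke -->
  <rdg wit="#KA-tx15"><ptr target="KA-tx15.xml#KA-tx15_w1"/></rdg>
  <rdg wit="#A-pt"><ptr target="A-pt.xml#A-pt_w1"/></rdg>
</app>
```

Because each statement is independent, the same stretch of text can be covered by several categorized entries at different levels of tokenization.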


FreiDi is an ambitious project that handles numerous sources, both musical and literary. This is in line with modern approaches to opera editing, which acknowledge that limiting the investigation to only the score or only the libretto is not desirable (Strohm 2005). As a general approach, editorial statements are encoded separately from the sources, with the aim of keeping the source encodings independent and reducing redundancy. Concerning the libretto, a primarily literary form, a stand-off TEI “core” file is designed to handle the critical apparatus and similar cross-source editorial statements. This organization makes it possible to classify the statements according to a taxonomy, a feature that has been on the wish-list of the Manuscript SIG for a while. The core file relies on being able to point to specific portions of the TEI sources, and the techniques that implement this are still being perfected by the community. This project aims to contribute to research in these TEI-related aspects, as well as to the debate around digital editions of operas.


  • Cayless, H., 2012. TEI XPointer Requirements and Notes (Draft). Available at https://docs.google.com/document/d/1JsMA-gOGrevyY-crzHGiC7eZ8XdV5H_wFTlUGzrf20w
  • Münzmay, A. et al., 2011. Editing Opera: Challenges of an Integrated Digital Presentation of Music and Text based on “Edirom” and TEI. TEI Members Meeting 2011, Universität Würzburg, 10-16 October.
  • Pierazzo, E., 2005. An Encoding Model for Librettos: the Opera Liber DTD. ACH/ALLC 17th Joint International Conference, University of Victoria, British Columbia, 15-18 June.
  • Schmidt, D. and Colomb, R., 2009. A data structure for representing multi-version texts online. International Journal of Human-Computer Studies, 67.6, 497-514.
  • Strohm, R., 2005. Partitur und Libretto. Zur Edition von Operntexten. Opernedition. Bericht über das Symposion zum 60. Geburtstag von Sieghart Döhring, ed. Helga Lühning and Reinhard Wiesend. 37-56.

Mannheim Corpus of Historical Magazines and Newspapers, PID: http://hdl.handle.net/10932/00-01B8-AE41-41A4-DC01-5
A small portion of the corpus has been transcribed independently (together with other resources) in the GerManC project [1]. In that project, however, the original transcription did not use TUSTEP and was enriched directly and manually with TEI markup.
This first stage is readily comparable to the approach described in [3]. However, unlike [3], we aim at a lossless transformation of TUSTEP to XML.
[4] describes an iteratively refined custom transformation from TUSTEP to TEI-SGML by means of TUSTEP’s scripting language TUSCRIPT, conceptually very similar to the transformation pipeline presented in this paper. Aiming to use standard (and familiar) technology, we have chosen to split the transformation to TEI into a pipeline of Perl feeding XSLT. We have also investigated the use of TXSTEP [5], which aims at providing TUSCRIPT and TUSCRIPT modules in an XML syntax. However, the available modules did not cover the needs of the transformation at hand.
We analysed the concept of the aesthetic situation starting from Roman Ingarden. The encoding situation possesses an ontic structure comparable in every respect, but it is a heuristic situation.
R. Ingarden, The Cognition of the Literary Work of Art, Illinois: Northwestern University Press, 1973.
“Aspects” does not mean “perspectives”, “sides” or “fragments”. They are rather the sensory paths involved in the construction of the object.
http://plato.stanford.edu/entries/platonism/. Two of the four FRBR types (“work” and “expression”) have no psychophysical reality.
Thus for P. Caton (op. cit.) the text is “a matter of contingent social/linguistic circumstances”, and “countable texts” are not a type but a role.
For example, with the help of Twardowski’s APT (Actions and Products) theory.
This observation has a more general value for DH ontologies.
J.A. Coffa, The Semantic Tradition from Kant to Carnap, Cambridge University Press, 1991, p. 1.
“No Problem Has a Solution: A Digital Archive of the Book of Disquiet”, research project of the Centre for Portuguese Literature at the University of Coimbra, funded by FCT (Foundation for Science and Technology). Principal investigator: Manuel Portela. Reference: PTDC/CLE-LLI/118713/2010. Co-funded by FEDER (European Regional Development Fund), through Axis 1 of the Operational Competitiveness Program (POFC) of the National Strategic Framework (QREN). COMPETE: FCOMP-01-0124-FEDER-019715.
A second goal of the project is to investigate the relation between writing processes and material and conceptual notions of the book. The rationale for allowing non-experts to experiment with re-editing and rewriting this work originates in this second goal, and in the wish to explore the collaborative dimension of the web as a reading and writing space in the context of a digital archive, in ways that enhance its pedagogical, ludic, and expressive uses.
The LdoD Archive will consider two groups of end-users and will provide tools and resources that enable engagement at different levels of complexity, from beginner to expert. Groups of beta users for the virtual editing and virtual writing features have already been canvassed in secondary schools and universities. These communities will allow us to better assess particular needs and define interface structure and access to contents accordingly.
Paper written by Francesca Trasselli, as coordinator of the ICCU’s Area of activity for the bibliography, cataloguing and inventory of manuscripts, in collaboration with Giliola Barbero and Gian Paolo Bagnato, who respectively researched and ultimately implemented the export procedure.
cf. “Opera Liber – Archivio Digitale Online Libretti d’Opera:” available at:
cf. “OPERA – Spektrum des europäischen Musiktheaters in Einzeleditionen” available at: http://www.opera.adwmainz.de/index.php?id=818.
cf. “Freischütz Digital. Paradigmatische Umsetzung eines genuin digitalen Editionskonzepts.” available at: http://freischuetz-digital.de.
cf. “MEI. The Music Encoding Initiative” available at: http://www.music-encoding.org/.
See for example the page about “Textual Variance” on the TEI Wiki: http://wiki.tei-c.org/index.php/Textual_Variance#Aligner.
As such, this model also differs from Schmidt and Colomb 2009, although it shares the approach of not mixing encoded sources and editorial statements to avoid overlapping hierarchies.
Thinking of apparatus entries as annotations means that other standards specific to annotation may be used in this scenario, for example the Open Annotation Collaboration model: http://www.openannotation.org/.
See Chapter 16 of the TEI Guidelines: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SATS. xpath1() applies an XPath expression to select an element, while string-range() identifies a textual range within the selected element, for example starting from position 4 and including the following 9 positions.
A report of a recent workgroup meeting is available on the TEI Wiki: http://wiki.tei-c.org/index.php/Critical_Apparatus_Workgroup.