Abstracts of posters

Library of components for the Computational Philological Domain dealing with TEI markup guidelines: CoPhiLib

The aim of this poster is to illustrate the Collaborative Philology Library (CoPhiLib), a library of components devoted to editing, visualizing and processing TEI-annotated documents in the subdomain of philological studies. The overall architecture is based on the well-known Model-View-Controller (MVC) pattern, which separates the representation of data from the rendering and management (business logic) of the content, for the sake of flexibility and reusability. CoPhiLib maps the annotated document onto an aggregation of objects, visualized on the web as a collection of widgets rendered on the client through rich standard web technologies such as HTML5, CSS3, jQuery and Ajax, and controlled by special components devoted to monitoring the behavior of, and the interactions among, the other components.

The specifications, expressed in the Unified Modeling Language (UML), are language independent and stay at the top level of abstraction, serving as formal guidelines for the actual implementations, for example in Java or in any other language that follows the object-oriented paradigm, such as Python. Currently, only a very small subset of TEI tags is taken into account in our specifications, because our approach is a trade-off between top-down and bottom-up design. The approach is top-down because we analyze the high-level behaviors of the interacting objects and the use cases, with their related scenarios, covering the functionalities that agents are expected to use. But it is also bottom-up, because we develop applications for specific projects, such as Greek into Arabic or the Saussure Project, and we refactor the original design of those projects whenever higher levels of abstraction, valid for multiple scenarios, can be identified and new interfaces must be taken into account in order to update and extend the basic functionalities.

The APIs and the actual libraries are developed according to the specifications. The current implementation of CoPhiLib is based on the Java platform, and the overall system has been developed following the JavaServer Faces framework (JSF 2), a powerful Java enterprise programming model. Documents are stored in an XML-oriented database, eXist-db, but thanks to the pluggable structure different cross-platform solutions can easily be adopted by implementing a data access object (DAO pattern). Our application is designed as a collaborative multilayer application. It handles the presentation logic with Java web technologies and best practices such as Facelets templates, to minimize code and maximize reuse, together with a complete, rich Ajax composite component tag library, in order to offer a friendly and efficient graphical web user interface (RichFaces and ICEfaces are the most popular such libraries, but we preferred PrimeFaces as the fastest-rising one). In the field of digital scholarship, users mainly ask for web applications that allow easy access to resources (both textual and visual) and that make it possible to work in a collaborative environment by comparing resources, creating relations among resources, adding notes, comments or critical apparatus, and sharing them.
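The pluggable storage layer described above can be illustrated with a minimal sketch of the DAO pattern. The sketch is in Python rather than Java for brevity, and the names `DocumentDao`, `InMemoryDocumentDao` and `count_documents` are hypothetical, not part of CoPhiLib; an eXist-db back end would simply be another class implementing the same interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class DocumentDao(ABC):
    """Storage-agnostic access to TEI documents (hypothetical interface)."""

    @abstractmethod
    def load(self, doc_id: str) -> str:
        """Return the XML source of the document with the given id."""

    @abstractmethod
    def save(self, doc_id: str, xml: str) -> None:
        """Store or overwrite the XML source of a document."""

    @abstractmethod
    def list_ids(self) -> List[str]:
        """Return the ids of all stored documents."""


class InMemoryDocumentDao(DocumentDao):
    """Stand-in back end; a database-backed DAO would expose the same methods."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def load(self, doc_id: str) -> str:
        return self._docs[doc_id]

    def save(self, doc_id: str, xml: str) -> None:
        self._docs[doc_id] = xml

    def list_ids(self) -> List[str]:
        return sorted(self._docs)


def count_documents(dao: DocumentDao) -> int:
    """Application code depends only on the interface, not on the back end."""
    return len(dao.list_ids())
```

Because the application only sees `DocumentDao`, swapping the storage technology does not require changes to the calling code.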

The schema is read (or dynamically generated and then read) from the collection of TEI-compliant documents stored for the specific projects. The actual schema is expected to be a small subset of the TEI schemas (as discussed above), and it is used by the applications developed with CoPhiLib to instruct the factories on how to instantiate the objects that implement the interfaces or extend the abstract classes.
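As a rough illustration of a schema-driven factory, the following sketch instantiates objects only for the element names that a project's schema subset allows. The class names, the `allowed_tags` set and the use of un-namespaced element names are simplifying assumptions made for this example; they do not reproduce the CoPhiLib API.

```python
import xml.etree.ElementTree as ET


class TextComponent:
    """Base class for model objects built from TEI elements (hypothetical)."""

    def __init__(self, element):
        self.tag = element.tag
        self.text = "".join(element.itertext()).strip()


class Paragraph(TextComponent):
    pass


class Note(TextComponent):
    pass


class ComponentFactory:
    """Creates model objects only for tags present in the schema subset."""

    def __init__(self, allowed_tags, registry):
        # Keep only the classes whose tag is permitted by the project schema.
        self._registry = {
            tag: cls for tag, cls in registry.items() if tag in allowed_tags
        }

    def create(self, element):
        cls = self._registry.get(element.tag)
        return cls(element) if cls is not None else None

    def build(self, xml_text):
        objects = []
        for element in ET.fromstring(xml_text).iter():
            obj = self.create(element)
            if obj is not None:
                objects.append(obj)
        return objects
```

With `allowed_tags={"p"}`, a document containing both `p` and `note` elements yields `Paragraph` objects only; widening the schema subset changes the behavior without touching the application code.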

This structure provides the necessary flexibility to adapt the same application, at run time, to different uses, according to the nature of the chunks of information contained in the documents that must be rendered. For example, the abstract model is able to manage different multimedia resources in parallel for scholarly editions, as in the E.R.C. Greek into Arabic project; it is able to deal with facsimile manuscript images alongside the related transcription, as in the P.R.I.N. Saussure Edition project; and, in the future, it could provide a sheet music viewer with the related MIDI or WAVE performance. Different instances of the Model are obtained by serializing the TEI document through a marshalling and unmarshalling process, which yields a synchronized and uniform state of the stored data.
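The marshalling and unmarshalling round trip can be sketched as follows. This is only an illustration under simplifying assumptions: the `Line` class is hypothetical, namespaces are omitted, and only the TEI elements `lg` and `l` with an `n` attribute are handled.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """In-memory model of a verse line (hypothetical class)."""
    n: str
    text: str


def unmarshal(xml_text: str) -> List[Line]:
    """Build model objects from the XML source."""
    root = ET.fromstring(xml_text)
    return [Line(n=el.get("n", ""), text=el.text or "") for el in root.iter("l")]


def marshal(lines: List[Line]) -> str:
    """Serialize model objects back into XML."""
    root = ET.Element("lg")
    for line in lines:
        el = ET.SubElement(root, "l", {"n": line.n})
        el.text = line.text
    return ET.tostring(root, encoding="unicode")
```

If marshalling the unmarshalled objects reproduces the stored XML, the model and the stored data are in the same state.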

CoPhiLib handles textual phenomena by separating the structure of the text (codicological aspects) from its analyses (philological, linguistic, metric, stylistic, etc.). A stand-off markup approach is used to manage the data arising from the automatic text analysis.
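A minimal sketch of the stand-off idea: the base text is never modified, and each analysis layer refers to it by character offsets. The `Annotation` class, the layer names and the sample values are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation pointing into the base text by offsets."""
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    layer: str   # e.g. "linguistic", "metric"
    value: str


def covered_text(text: str, ann: Annotation) -> str:
    """Return the stretch of base text an annotation refers to."""
    return text[ann.start:ann.end]


def annotations_at(annotations: List[Annotation], offset: int,
                   layer: Optional[str] = None) -> List[Annotation]:
    """Return all annotations covering an offset, optionally for one layer."""
    return [
        a for a in annotations
        if a.start <= offset < a.end and (layer is None or a.layer == layer)
    ]


BASE_TEXT = "arma virumque cano"
ANNOTATIONS = [
    Annotation(0, 4, "linguistic", "noun"),
    Annotation(5, 13, "linguistic", "noun+enclitic"),
    Annotation(0, 18, "metric", "sample metrical unit"),
]
```

Layers can overlap freely, which inline markup cannot express without workarounds, and a new automatic analysis only adds annotations without touching the text.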

Bibliography

  • Bozzi, Andrea: G2A: a Web application to study, annotate and scholarly edit ancient texts and their aligned translations. Studia graeco-arabica. Pacini Editore, Pisa 2013.
  • Burbeck, Steve: Applications Programming in Smalltalk-80(TM): How to use Model-View-Controller (MVC). 1992. URL: http://st-www.cs.illinois.edu/users/smarch/st-docs/mvc.html
  • Del Grosso, Angelo Mario; Boschetti, Federico: Collaborative Multimedia Platform for Computational Philology, CoPhi Architecture. IARIA Conference: proceedings of MMEDIA 2013, The Fifth International Conference on Advances in Multimedia. Venice 2013.
  • Fowler, Martin: Analysis Patterns: Reusable Object Models. Addison-Wesley, Menlo Park, Calif.; Harlow 1996.
  • Gamma, Erich; Helm, Richard; Johnson, Ralph; Vlissides, John: Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, Boston, MA, USA 1995.
  • Hohpe, Gregor; Woolf, Bobby: Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Addison-Wesley, Boston, MA, USA 2004.
  • Burnard, Lou; Bauman, Syd: TEI P5: Guidelines for Electronic Text Encoding and Interchange. Oxford 2008. URL: http://www.tei-c.org/Guidelines/P5

TEI as an archival format

The adoption of the TEI as a common storage format for digital resources in the Humanities has many consequences for those wishing to interchange, integrate, or process such resources. The TEI community is highly diverse, but there is a general feeling that all of its members share an understanding of the best way to use the TEI Guidelines, and that those Guidelines express a common understanding of how text formats should be documented and defined. There is also (usually) a general willingness to make resources encoded according to the TEI Guidelines available in that format, as well as in whatever other publishing or distribution format has been adopted by the project. The question arises as to whether such TEI-encoded resources are also suitable for long-term preservation purposes: more specifically, if a project wishes to ensure long-term preservation of its resources, should it archive them in a TEI format? And if so, what other components (schema files, stylesheets, etc.) should accompany the primary resource files when submitting them for long-term preservation in a digital archive? TEI-encoded resources typically contain mostly XML-encoded text, possibly with links to files expressed using other commonly encountered web formats for graphics or audio; is there any advantage to be gained in treating them any differently from any other such XML-encoded resource?

This is not an entirely theoretical question: as more and more digitization projects seek to go beyond simply archiving digital page images, the quantity of richly encoded TEI XML resources representing primary print or manuscript sources continues to increase. In France alone, we may cite projects such as ATILF, OpenEditions, BVH, BFM, Obvil and many more, for all of which the TEI format is likely to be seen as the basic storage format, enabling the project to provide a usefully organised structural representation of the texts, either to complement the digital page images, or even to replace them for such purposes as the production of open online editions. When such resources are deposited in a digital archive, how should the archivist ensure that they are valid TEI and will continue to be usable? One possibility might be to require that such resources are first converted to some other commonly recognised display format such as PDF or XHTML; and indeed for projects where the TEI form is considered only as a means to the end of displaying the texts, this may well be adequate. But since TEI to HTML and TEI to PDF are lossy transformations, in which the added value constituted by TEI structural annotation is systematically removed, this seems to us in general a less than desirable solution. We would like to be able to preserve our digital resources without loss of information, so as to facilitate future use of that information by means of technologies not yet in existence. Such data-independence was, after all, one of the promises XML (and before it SGML) offered.

The data archivist needs to be able to test the coherence and correctness of the resources entering the archive, and also to monitor their continued usability. For an XML-based format, this is a relatively simple exercise. An XML file must be expressed using one of a small number of standard character encodings, and must use a tagging system the syntactic rules of which can be written on the back of a not particularly large envelope. The algorithm by which an XML document can be shown to be syntactically correct (“well-formed”) is expressible within the same scope, and producing a piece of software able to determine that correctness is consequently equally trivial. The XML Recommendation adds a layer of “syntactic validation” to this, according to which the use of XML tags within a set of documents can be strictly controlled by means of an additional document known as a schema, defining for example the names of all permitted XML elements and attributes, together with contextual rules about their valid deployment. Syntactic validation of an XML resource against its schema is also a comparatively simple and automatic procedure, requiring only access to the schema and an appropriate piece of software. (Given the dominant position enjoyed by XML as a data format, the current wide availability of reliable open-source validators for it seems unlikely to change, even in the long term.)
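To illustrate how little is needed for the first of these checks, here is a minimal well-formedness test using only the Python standard library. The function name `is_well_formed` is our own; validation against a schema is deliberately not shown, since it requires a separate validator (for RELAX NG, a third-party library such as lxml would be one option).

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

For example, `is_well_formed("<TEI><text/></TEI>")` succeeds, whereas a document with an unclosed `text` element is rejected. Well-formedness says nothing yet about whether the elements used are those the TEI permits.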

However, the notion of “TEI Conformance” as it is defined in the current Guidelines goes considerably beyond the simple notion of syntactic validity. An archivist concerned to ensure the coherence and correctness of a new resource at this further level needs several additional tools and procedures, and a goal of our project is to determine to what extent the goal of ensuring such conformance is quixotic or impractical. In particular, we will investigate the usefulness of the TEI’s ODD documentation format as a means of extending the scope of what is possible in this respect when using a conventional XML schema language such as RELAX NG or ISO Schematron.

Our initial recommended approach for ingest of a conformant TEI resource might include:

  • syntactic validation of each document against the most appropriate TEI schema; for documents containing textual data this would naturally include TEI All, but also any project-supplied XML schema, and also (for any ODD document supplied) the standard TEI ODD schema;
  • creation of a TEI schema from the supplied ODD and validation of the documents against that in order to validate any project-specific constraints such as attribute values;
  • comparison of the ODD supplied with an ODD generated automatically from the document set;
  • definition and usage of a set of stylesheets to convert the resource into a “lowest common denominator” TEI format.
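The third step above, comparing what a project declares with what its documents actually use, can be approximated by the following sketch. It does not read or generate a real ODD: the `declared` mapping stands in for the element and attribute names that would be extracted from the supplied ODD, and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def local_name(name: str) -> str:
    """Strip a '{namespace}' prefix as produced by ElementTree."""
    return name.rsplit("}", 1)[-1]


def observed_usage(documents):
    """Map each element name found in the documents to its attribute names."""
    usage = defaultdict(set)
    for xml_text in documents:
        for element in ET.fromstring(xml_text).iter():
            usage[local_name(element.tag)].update(
                local_name(attr) for attr in element.attrib
            )
    return dict(usage)


def compare(declared, observed):
    """Return (elements used but not declared, elements declared but unused)."""
    undeclared = sorted(set(observed) - set(declared))
    unused = sorted(set(declared) - set(observed))
    return undeclared, unused
```

Elements that are used but not declared point to a validation problem; elements that are declared but never used suggest that the supplied ODD is looser than the document set requires.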

Such an approach suggests that the “submission information package” for a TEI resource will contain a number of ancillary documents or references to documents, notably to a TEI P5-conformant ODD from which a tailored set of syntactic and semantic validators can be generated using standard transformations. We hope to report on this and on the results of our initial experiments with some major French-language resources at the Conference.

The Open Bibliography Project

Humanities scholars often create bibliographies in the course of their work. These can take on many forms: annotated bibliographies, descriptive bibliography and catalogues, author and subject bibliographies, or learning objects for scholars researching people and concepts in their field. The aggregate nature of these publications means that printed bibliographies are often outdated soon after publication and calls for a shift away from print to a more dynamic, web-based bibliography that allows updating and revising as new information becomes available.

While many bibliographical works are still published as print monographs, web-based bibliographies are nothing new; however, current web-based bibliography publishing models present a number of challenges to those wanting to share their research openly on the web. The creators of scholarly web bibliographies must design, create, and host relational databases, forms, queries, and a web interface, as well as deal with the hosting, access and maintenance issues associated with publishing a searchable, accessible database to the web. Most humanities scholars and librarians have neither the technological skills nor access to the infrastructure necessary to host such a site, and libraries and institutions are not always able to accommodate these “boutique” project requests.

Additionally, these bibliographies are often multi-layered documents, rich with bibliographic information, metadata about the items described, and added value in the form of annotations, contextual information, and links to other relevant information and resources. This bibliographic and contextual information, which in many cases cannot be found anywhere else on the Web, would be extremely valuable to other researchers if made available in a data markup format that is open to harvesting and repurposing. Scholars working on publishing their own bibliographies would also benefit from an automated approach to harvesting and aggregating bibliographic information into their own bibliographies and publishing that information using open standards.

The Open Bibliography Project [1] represents a novel approach for publishing bibliographies to the Web using TEI in a format that enables linking, sharing, and repurposing of the content of TEI-encoded scholarly bibliographies. To that end, the project has two goals: a) to develop tools allowing scholars to easily construct, mark up, and publish bibliographies in more meaningful ways while exposing their structured data to other Web applications; and b) to build a vocabulary for marking up and transforming structured bibliographic data within these documents, using existing vocabularies such as TEI and schema.org to the extent possible, and creating new terms where necessary. Ultimately we would like to provide a tool for scholars to construct bibliographies, assigning structure to citations and annotations using a Web form (XForms or similar technology), and providing a mapping that allows linking to occur in the background.

The Project is built around a custom TEI module for describing multiple types of bibliographies, including annotated bibliographies and descriptive bibliography, with XSL and CSS stylesheets for transforming the TEI-encoded documents into searchable, structured web (or print) editions and possibly into interactive maps and data visualizations. Using a custom TEI module with pre-defined stylesheets means a lightweight, low-barrier publishing solution for researchers that requires only minimal knowledge of XML and basic web hosting, such as a web folder on a university server or Google Sites.

The Project recognizes the need for sharing unique bibliographic and contextual information found in bibliographies with the wider web of scholarly data in the humanities, social sciences, and other disciplines. Using linked open data standards, such as microdata, the Project hopes to further enhance the value of scholarly bibliographies by linking them to the linked open data web. Because they are highly structured documents, TEI bibliographies easily lend themselves to linked open data markup; in addition, their annotations provide context about the items they describe that may not exist elsewhere on the Web. Initiatives such as schema.org [2] provide tools for document markup compatible with the linked data cloud, and projects such as VIVO [3] provide examples of how faculty profiles and CVs, published as structured bibliographic data, may be published electronically.

Defining a proof of concept for the idea is the first stage of this project. Using the Three Percent translation database, published by the University of Rochester [4], as our seed data, we intend to demonstrate how TEI-encoded bibliographic metadata may be published as linked data in a variety of markup formats and included in the linked data ecosystem. We plan to develop a simple vocabulary for marking up individual citations in the database with schema.org attributes, to which we may map the Three Percent database elements. We will share our XSLT stylesheets under an open license on the Web, so that interested scholars and researchers may contribute to their continued development.
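The kind of mapping envisaged here can be sketched as follows. This is not the project's vocabulary or stylesheet: it is a Python illustration (the project itself plans to use XSLT) that maps a simplified, un-namespaced TEI `bibl` element to a schema.org `Book` description, emitted as a JSON-LD style dictionary rather than microdata for brevity. The choice of `Book`, the field mapping and the function name are assumptions.

```python
import xml.etree.ElementTree as ET


def bibl_to_schema_org(bibl_xml: str) -> dict:
    """Map a simplified TEI <bibl> to a schema.org Book description."""
    bibl = ET.fromstring(bibl_xml)
    record = {"@context": "https://schema.org", "@type": "Book"}

    title = bibl.findtext("title")
    if title:
        record["name"] = title.strip()

    authors = [a.text.strip() for a in bibl.findall("author") if a.text]
    if authors:
        record["author"] = [{"@type": "Person", "name": n} for n in authors]

    date = bibl.find("date")
    if date is not None:
        # Prefer the normalized TEI @when value over the displayed text.
        record["datePublished"] = date.get("when") or (date.text or "").strip()

    return record
```

Serializing the returned dictionary with `json.dumps` yields a block that can be embedded in a web page; a microdata rendering would carry the same properties as HTML attributes.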

Bibliography

  • [1] http://dawnchildress.com/obp
  • [2] http://schema.org
  • [3] http://vivo.library.cornell.edu
  • [4] http://www.rochester.edu/College/translation/threepercent

An easy tool for editing manuscripts with TEI

The Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) is home to multiple long-term research projects which encompass various fields of study. The research group TELOTA (The Electronic Life of the Academy) supports the digital humanities aspects of these projects, including developing software solutions for the daily work of their researchers.

Experience shows that the readiness to use TEI encoding for the digital transcription and annotation of manuscripts greatly relies on the user-friendliness of the entry interface. From the perspective of a researcher, working directly in XML is a backwards step in comparison to programs like MS Word. A new software solution must therefore at least offer the same amount of editorial comfort as such programs. Ideally, it would also encompass the complete life-cycle of an edition: from the first phases of transcription to the final publication.

Last year TELOTA developed such a software solution for the recently begun scholarly edition project Schleiermacher in Berlin 1808-1834. The solution consists of various software components that allow the researchers to construct and edit transcriptions of Schleiermacher's manuscripts in XML following the TEI guidelines. It includes the possibility to create apparatuses of different kinds, as well as to create, without much additional effort, both a print and a web publication.

The new digital Schleiermacher edition is based on XML schemata, written according to the guidelines of the TEI. A TEI schema was created for each manuscript type: letters, lectures, and a daily calendar. The three schemata, however, all share a core group of elements. All text phenomena as well as editorial annotations are represented through TEI elements and attributes. The schemata were formed from the sections of the TEI guidelines which suited the project's needs. The addition of project-unique elements or attributes was unnecessary.

The central software component of the new digital work environment is Oxygen XML Author. The researcher does not edit the XML code directly, but instead works in a user-friendly Author mode, which is styled with Cascading Style Sheets (CSS). The researcher is able to choose among several perspectives within the Author view, and can thus select the appropriate perspective for the current task with a mouse click. Additionally, a toolbar is provided with which the researcher can enter markup at the push of a button. In this way text phenomena such as deletions or additions, or editorial commentary, are easily inserted. Person and place names can also be recorded with their appropriate TEI markup, and they can simultaneously be linked to the corresponding index. This is done by selecting the name from a convenient drop-down list. The entire manuscript text can thus be quickly and simply marked up with TEI-conformant XML.

Besides creating a digital work environment in Oxygen XML Author, a website based on eXist, XQuery, and XSLT was also built for the project. Through the website the researchers can easily page through or search the current data inventory. For instance, letters can be filtered by correspondent and/or year. The user can also follow a correspondence series for a selected person, or find all texts in which a person is mentioned. The website is presently available only to the project staff, but it offers a prototype for the future, publicly accessible website.

With the help of ConTeXt a further publication type, a print edition, is automatically generated as a PDF from the TEI XML document. The layout and format are based on the previously printed volumes of the critical edition of Friedrich Schleiermacher's works. Each TEI element is given a specific formatting command through a configuration file. In this way the different apparatuses appear as footnotes that refer to the main text with the help of line numbers and lemmata. The print edition can also provide the suitable index for each transcription and resolves any cross-references between manuscripts.
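The principle of assigning one formatting command to each TEI element through a configuration table can be sketched as follows. This is not TELOTA's configuration file: the rule table is invented for illustration, and `\TEIdel` and `\TEIadd` are hypothetical placeholder macros that a typesetting environment would have to define.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: TEI element name -> output template.
FORMAT_RULES = {
    "p": "{content}\n\n",
    "del": "\\TEIdel{{{content}}}",
    "add": "\\TEIadd{{{content}}}",
    "note": "\\footnote{{{content}}}",
}


def render(element, rules):
    """Recursively convert an element to typesetting source using the rules."""
    parts = [element.text or ""]
    for child in element:
        parts.append(render(child, rules))
        parts.append(child.tail or "")
    content = "".join(parts)
    # Elements without a rule are passed through as plain content.
    template = rules.get(element.tag, "{content}")
    return template.format(content=content)
```

Changing the appearance of deletions in print then means editing one entry of the table, while the TEI source stays untouched.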

This work environment has been in use for a year by the research staff of the Schleiermacher edition for their daily work. When asked their opinion, the researchers offered predominantly positive feedback. The only criticism was that the text became difficult to read when it included a great deal of markup. TELOTA met this concern by adding more Cascading Style Sheets, thus allowing for different views of the text that show only specific groups of elements. The researchers were, however, in absolute agreement that the new work environment greatly eased their editorial work and saved them significant time. The possibility of directly checking the results of their work in a web presentation or as a printed edition was seen as very positive. Such features let the user experience the advantages of encoding with TEI at a click. The staff also expressed their relief that it was unnecessary to work directly in XML, and that they could instead mark up their texts through a graphical and easy-to-use interface.

After the success of the pilot version, the work environment will be implemented this year for further academy projects. The TEI XML schemata and main functions that make up the basis of the work environment can be customized to the different manuscript types and their needs. Furthermore, this solution has already been adapted by other institutions, such as the Academy of Sciences and Literature in Mainz.

Bibliography

  • Dumont, Stefan; Fechner, Martin: Digitale Arbeitsumgebung für das Editionsvorhaben »Schleiermacher in Berlin 1808—1834« In: digiversity — Webmagazin für Informationstechnologie in den Geisteswissenschaften. URL: http://digiversity.net/2012/digitale-arbeitsumgebung-fur-das-editionsvorhaben-schleiermacher-in-berlin-1808-1834
  • Burnard, Lou; Bauman, Syd (Eds.): TEI P5: Guidelines for Electronic Text Encoding and Interchange. Charlottesville, Virginia, USA 2007. URL: http://www.tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf
  • User Manual Oxygen XML Author 14. URL: http://www.oxygenxml.com/doc/ug-editor/
  • eXist Main Documentation. URL: http://www.exist-db.org/exist/documentation.xml
  • ConTeXt Documentation. URL: http://wiki.contextgarden.net/Main_Page

eCodicology – Algorithms for the Automatic Tagging of Medieval Manuscripts

General description

‘eCodicology’ uses the library stock of roughly 500 medieval manuscripts which were collected in the Benedictine Abbey of St. Matthias in Trier (Germany). The manuscripts were digitized and enriched with bibliographic metadata within the scope of the project ‘Virtuelles Skriptorium St. Matthias / Virtual Scriptorium Saint Matthias’ (http://stmatthias.uni-trier.de/). In that project, funded by the German Research Foundation (DFG), digital copies were created in the city library of Trier; long-term preservation is undertaken at the University of Trier. The purpose of the BMBF-funded project ‘eCodicology’ is the development, testing and optimization of new algorithms for the identification of macro- and microstructural layout elements on these manuscript pages in order to enrich their metadata in XML format according to TEI standards.

The database of the ‘St. Matthias’ project holds basic information on the physical properties of manuscripts, as they have been described in the older manuscript catalogues. Essential components of each manuscript description are details of the layout features of the manuscript. These details are in part fragmentary and incomplete and can therefore be refined and completed by means of automatic tagging. The more precisely and elaborately those details are described, the better comparisons and analyses can be performed. Within the scope of ‘eCodicology’, the first step is the creation of an inventory of features defining those elements that can be recognized reliably with the aid of the algorithms for feature extraction. On this basis, it is expected that new scientific findings about corpora of writers, writing schools, references between manuscripts and provenances will become possible. The large number of image scans will be analyzed on an empirical basis with the aim that the ‘subjective’ view of the codicologist can, as it were, be ‘objectified’.

As can be seen from the figure below, the data that has been produced in the project ‘Virtuelles Skriptorium St. Matthias’ is the starting point of the work in ‘eCodicology’. The image scans are hosted on distributed servers and are synchronized regularly. Based on this initial data the previous catalogues can be automatically enriched and refined by use of feature extraction processes.

Aims of the project partners and technical procedure

The ‘eCodicology’ project is managed by three project partners working on different tasks (see the figure below). The digitized images are processed at the Karlsruhe Institute of Technology (KIT) using a library consisting of image processing and feature extraction algorithms which are defined in close collaboration between humanities scholars and computer scientists. The metadata schema for the processing and the models for the XML files, in which the results will be saved, are developed in Trier as well as in Darmstadt on the basis of TEI P5. The scientific evaluation will finally take place in Darmstadt. Additionally, statistical analysis of the manuscript groups will be performed. It shall be possible to conduct, adapt or extend the scientific evaluation at any other university.

A software framework will automate the procedure of complex data analysis workflows and is designed generically so that a great amount of image data can be processed with any desired algorithm for feature extraction (basic components: ImageJ and MOA/Weka). Since it will be adaptable for a wider range of documents, the framework will be integrated as a service into the DARIAH infrastructure (http://de.dariah.eu/). The algorithm library is implemented specifically for the automatic analysis of medieval manuscripts. New algorithms can be created by the users at any time and they can be integrated into the library through the web portal. The configuration of the processes, the selection of the algorithms for feature extraction from the algorithm library and their parameterization are controlled via the web portal.
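The idea of an extensible algorithm library whose algorithms are selected and parameterized by a configuration can be sketched as follows. The real framework builds on ImageJ and MOA/Weka and is controlled through a web portal; the registry, the two toy algorithms and the configuration format below are hypothetical and only illustrate the principle on a greyscale image given as a list of rows.

```python
from typing import Any, Callable, Dict, List

# Registry of available feature extraction algorithms (hypothetical).
ALGORITHMS: Dict[str, Callable[..., Any]] = {}


def register(name: str):
    """Decorator that adds an algorithm to the library under a name."""
    def decorator(func):
        ALGORITHMS[name] = func
        return func
    return decorator


@register("mean_grey")
def mean_grey(image: List[List[int]]) -> float:
    """Average grey value of the page."""
    flat = [v for row in image for v in row]
    return sum(flat) / len(flat)


@register("dark_fraction")
def dark_fraction(image: List[List[int]], threshold: int = 128) -> float:
    """Share of pixels darker than the threshold."""
    flat = [v for row in image for v in row]
    return sum(v < threshold for v in flat) / len(flat)


def run_workflow(image, config):
    """Run the configured algorithms with their parameters on one image."""
    return {
        step["algorithm"]: ALGORITHMS[step["algorithm"]](
            image, **step.get("params", {})
        )
        for step in config
    }
```

A new algorithm becomes available to every workflow as soon as it is registered, and a workflow is nothing more than a list of algorithm names with parameters.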

Processing and metadata schema

The processing of a codex page normally entails the following steps:

  1. Preparation and normalization of the page: this comprises basic image processing steps such as the alignment of the page, white balance, and histogram operations for the normalization of contrasts.
  2. Object segmentation: the segmentation separates image objects (e.g. writing and illustrations) from the background. The complexity of this process can vary, and it is one of the most elaborate operations in digital image processing.
  3. Feature extraction: features describing the whole page and the segmented objects can be measured using the algorithm library.
  4. Storage: the extracted features are stored within the metadata of the codex image.
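The four steps above can be sketched end to end on a toy example. This is a deliberately simplified stand-in, not the project's algorithms: the page is a small greyscale matrix, normalization is a linear contrast stretch, segmentation is a fixed threshold, and the only features extracted are the number of ink pixels and their bounding box. All function names are hypothetical.

```python
def normalize(image):
    """Step 1: stretch grey values to the full 0..255 range."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0 for _ in row] for row in image]
    return [[round((v - lo) * 255 / (hi - lo)) for v in row] for row in image]


def segment(image, threshold=128):
    """Step 2: mark dark pixels (ink) as True, background as False."""
    return [[v < threshold for v in row] for row in image]


def extract_features(mask):
    """Step 3: measure the segmented object (pixel count, bounding box)."""
    coords = [(x, y) for y, row in enumerate(mask)
              for x, on in enumerate(row) if on]
    if not coords:
        return {"ink_pixels": 0, "bbox": None}
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return {"ink_pixels": len(coords),
            "bbox": (min(xs), min(ys), max(xs), max(ys))}


def process_page(image, metadata):
    """Step 4: store the extracted features alongside the page metadata."""
    features = extract_features(segment(normalize(image)))
    return {**metadata, "features": features}
```

In a real workflow each function would be replaced by a far more elaborate algorithm, but the data flow from image to enriched metadata record stays the same.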

The metadata schema used in the DFG-project ‘Virtuelles Skriptorium St. Matthias’ corresponds to the METS format, as it is used for the DFG Viewer (http://dfg-viewer.de/en/regarding-the-project/). Instead of MODS a TEI header is used, which is more specifically adapted to the demands of a manuscript description. A refining of the metadata is intended especially for the measurements of the following basic layout features: page dimensions (height and width), text or writing space, space with pictorial or graphical elements, space with marginal notes and glosses. Additionally, the absolute number of lines, headings, graphical or coloured initial letters or rubricated words and sentences will be annotated. It is also intended to find a way to ‘tag’ the position of graphical elements or text blocks on each page. From these data certain relations or proportions can be deduced. These relations may tell us for example something about special patterns or layout types.

At the moment, the refining of data concentrates on the elements ‘objectDesc’ with ‘supportDesc’ and ‘layoutDesc’ as well as ‘decoDesc’. A focus is laid especially on the following fields (the respective TEI tag is set in brackets):

  1. Layout information (‘layout’): conceivable attributes are @ruledLines, @writtenLines and @columns. Also, the type area or page design can be measured exactly.
  2. Dimensions (‘dimensions’, ‘extent’, ‘height’, ‘width’): attributes allow also a TEI-compliant description of minimum, maximum and average information (@atLeast, @atMost, @min, @max).
  3. Information on perceivable visual units like initials, marginal decoration and embedded images (‘decoNote’, @initial, @miniature, @border), rubrications, additional notes and possibly also multiple foliations.

A first draft of the metadata schema can give a short glimpse on some adaptations concerning the physical description of manuscripts:

<physDesc><objectDesc form="codex"><supportDesc>
. . .
<extent>
  <measure unit="leaves" quantity="100"></measure>
  <locusGrp xml:id="locusGrp001"><locus from="1" to="100"></locus></locusGrp>
  <measureGrp type="leaves" corresp="#locusGrp001">
    <height quantity="250" unit="mm">250mm</height>
    <width quantity="150" unit="mm">150mm</width>
  </measureGrp>
  <measureGrp type="binding">
    <height quantity="275" unit="mm">275mm</height>
    <width quantity="175" unit="mm">175mm</width>
    <measure type="spineHeight">4°</measure>
  </measureGrp>
</extent>
. . .
</supportDesc><layoutDesc><layout columns="2" writtenLines="24">
  <locusGrp>
    <locus from="1" to="100" xml:id="locusGrp002"></locus>
  </locusGrp>
  <dimensions type="written" corresp="#locusGrp002">
    <height quantity="200" unit="mm" min="199" max="201" confidence="0.8">200mm</height>
    <width quantity="100" unit="mm" min="98" max="101" confidence="0.75">100mm</width>
  </dimensions>
</layout><layout ruledLines="32">
  <locusGrp>
    <locus from="1r" to="202v" xml:id="locusGrp003"></locus>
  </locusGrp>
</layout></layoutDesc></objectDesc>
. . .
<decoDesc>
  <decoNote type="initial"></decoNote>
  <decoNote type="miniature">
    <locusGrp>
      <locus>8v</locus></locusGrp>
    <dimensions>
      <height quantity="50" unit="mm" min="49" max="51" confidence="0.8">50mm</height>
      <width quantity="50" unit="mm" min="49" max="51" confidence="0.8">50mm</width>
    </dimensions>
  </decoNote>
  <decoNote type="border"></decoNote>
</decoDesc>
</physDesc>
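
As an illustration of how such a description could be evaluated by software, the following Python sketch reads the written-area dimensions from a cut-down fragment modelled on the draft. It is not part of the project's tooling: the fragment, the function name read_dimensions and the omission of the TEI namespace are assumptions made to keep the example short.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down fragment modelled on the draft schema;
# a real TEI file would declare the TEI namespace, which is omitted here.
SAMPLE = """
<layout columns="2" writtenLines="24">
  <dimensions type="written">
    <height quantity="200" unit="mm" min="199" max="201">200mm</height>
    <width quantity="100" unit="mm" min="98" max="101">100mm</width>
  </dimensions>
</layout>
"""

def read_dimensions(layout_xml):
    """Return a dict keyed by 'height' and 'width' with numeric quantity, min and max."""
    root = ET.fromstring(layout_xml)
    result = {}
    for dims in root.iter("dimensions"):
        for child in dims:
            if child.tag in ("height", "width"):
                quantity = child.get("quantity")
                result[child.tag] = {
                    "unit": child.get("unit"),
                    "quantity": float(quantity),
                    # Fall back to the quantity when no range is recorded.
                    "min": float(child.get("min", quantity)),
                    "max": float(child.get("max", quantity)),
                }
    return result

written = read_dimensions(SAMPLE)
print(written["height"])
```

Keeping @min and @max next to @quantity means that later quantitative evaluations can work with the measurement range and not only with a single value.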

The sustainability of the approach and the validity of the inventory of layout features still have to be demonstrated through an exemplary interpretation of the empirical data. Compiling detailed ‘microscopic’ information and metrics for every single manuscript page subsequently allows the codices from the abbey of St. Matthias to be evaluated with quantitative methods: tendencies over time relating to particular genres or languages can be described in great detail, image-text proportions (text space vs. image space) can be determined exactly, and relationships to epochs, genres, contents and functions can be established.
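
The image-text proportion mentioned above can be illustrated with the sample values of the draft schema (leaf 250 × 150 mm, written area 200 × 100 mm, one miniature of 50 × 50 mm). The sketch below is only a worked example under simple assumptions: all areas are treated as rectangles, and the function and key names are hypothetical.

```python
def area_mm2(height_mm, width_mm):
    """Area of a rectangular region in square millimetres."""
    return height_mm * width_mm

def proportions(page, written, images):
    """Compare text space and image space for one page.

    page and written are (height, width) tuples in mm;
    images is a list of (height, width) tuples in mm.
    """
    page_area = area_mm2(*page)
    text_area = area_mm2(*written)
    image_area = sum(area_mm2(h, w) for h, w in images)
    return {
        "text_share": text_area / page_area,
        "image_share": image_area / page_area,
        "image_to_text": image_area / text_area,
    }

# Sample values taken from the draft schema, used purely as an example.
p = proportions(page=(250, 150), written=(200, 100), images=[(50, 50)])
print(p["image_to_text"])  # prints 0.125
```

Whether a miniature counts as part of the written area or is subtracted from it is a modelling decision that this sketch leaves open.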

Bibliography

  • Embach, Michael; Moulin, Claudine (Ed.): Die Bibliothek der Abtei St. Matthias in Trier – von der mittelalterlichen Schreibstube zum virtuellen Skriptorium, Trier 2013.
  • Tonne, Danah; Rybicki, Jedrzej; Funk, Stefan E.; Gietz, Peter: Access to the DARIAH Bit Preservation Service for Humanities Research Data, in: P. Kilpatrick; P. Milligan; R. Stotzka (Ed.), Proceedings of the 21st International Euromicro Conference on Parallel, Distributed, and Network-Based Processing, Los Alamitos 2013, pp. 9-15.
  • Tonne, Danah; Stotzka, Rainer; Jejkal, Thomas; Hartmann, Volker; Pasic, Halil; Rapp, Andrea; Vanscheidt, Philipp; Neumair, Bernhard; Streit, Achim; García, Ariel; Kurzawe, Daniel; Kálmán, Tibor; Rybicki, Jedrzej; Sanchez Bribian, Beatriz: A Federated Data Zone for the Arts and Humanities, in: R. Stotzka; M. Schiffers; Y. Cotronis (Ed.), Proceedings of the 20th International Euromicro Conference on Parallel, Distributed, and Network-Based Processing, Los Alamitos 2012, pp. 198-205.
  • Vanscheidt, Philipp; Rapp, Andrea; Tonne, Danah: Storage Infrastructure of the Virtual Scriptorium St. Matthias, in: J. C. Meister (Ed.), Digital Humanities 2012, Hamburg 2012, pp. 529-532.

ReMetCa: a TEI based digital repertory on Medieval Spanish poetry

The aim of this talk is to present a Digital Humanities-TEI project devoted to creating a computer-based metrical repertory of Medieval Castilian poetry (ReMetCa, www.uned.es/remetca). It will gather poetic testimonies from the very beginnings of Spanish lyric poetry at the end of the 12th century up to the rich and varied poetic manifestations of the Cancioneros of the 15th and 16th centuries. Although metrical studies of Spanish Medieval poetry have developed quickly in recent years, researchers have not yet created a digital tool that makes complex analyses of this corpus possible, as has already been done in other Romance lyrical traditions, such as the Galician-Portuguese, Catalan, Italian or Provençal, where the first digital repertories arose. ReMetCa is conceived as an essential tool to complete this digital poetic puzzle: it will enable users to run powerful searches across many fields at the same time. It will be very useful for metrical, poetic and comparative studies, and will serve as a benchmark to be linked to other international digital repertories.

This project is based on the integration of traditional metrical and poetic knowledge (rhythm and rhyme patterns) with Digital Humanities technology: the TEI-XML markup language and its integration into a Relational Database Management System, which opens the possibility of running simultaneous searches and queries through a simple, user-friendly search interface.

Starting point: poetic repertories in European lyrics

Three significant periods can be distinguished in the creation of medieval and renaissance poetic repertoires. The first coincides with Positivism (end of the 19th century), with the works of Gaston Raynaud (1884), Gotthold Naetebus (1891), and Pillet and Carstens (1933), among others. The second starts after the Second World War with Frank’s classic work on Provençal troubadour poetry (1953-57) and continues for a long time with the publication of printed metrical repertoires: for Old French lyrics, Mölk and Wolfzettel; for Italian, Solimena, Antonelli, Solimena again, Zenari, Pagnotta, and Gorni; for Hispanic philology, Tavani, Parramon i Blasco, and Gómez Bravo; and for German, Touber and the Repertorium der Sangsprüche und Meisterlieder.

Technological advances have made possible a third generation of repertoires, created and searchable with a computer, in which research time is considerably reduced. The first digital poetical repertoire was the RPHA (Répertoire de la Poésie hongroise ancienne jusqu’à 1600), published by Iván Horváth and his group in 1991. Galician researchers created the Base de datos da Lírica profana galego-portuguesa (MedDB); Italian researchers digitized the BEdT (Bibliografia Elettronica dei Trovatori); later came the Nouveau Naetebus, the Oxford Cantigas de Santa María Database, the Analecta Hymnica Digitalia, etc.

All these repertoires are very valuable, as they widen the possibilities for comparative research. The Spanish panorama, however, looks weak in this area, as there is no poetic repertoire that gathers the metrical patterns of Medieval Castilian poetry (except for the book by Ana María Gómez Bravo (1999), which is restricted to Cancionero poetry).

Researchers are, however, increasingly aware of the importance of metrical studies for analyzing and understanding Spanish Medieval poetry, as the bibliographic compilations of José María Micó (2009) and Vicenç Beltrán (2007) have recently shown. In addition, metrical studies have flourished thanks to the creation of specialized journals, such as Rhythmica. Revista española de métrica comparada (ISSN 1696-5744), published by the Universidad de Sevilla, created in 2003 and directed by Domínguez Caparrós and Esteban Torre; Stilistica e metrica italiana (2001), directed by Pier Vincenzo Mengaldo; and the digital journal Ars Metrica (www.arsmetrica.eu, ISSN 2220-8402), whose scientific committee is composed of researchers from different countries.

Research projects have been another important focus of recent metrical studies; their results are being published as articles in books and journals and as PhD theses. Several meetings and seminars on metrical and poetic problems have also been organized. In this context, it is worth mentioning the project of Prof. José Domínguez Caparrós on metrics in the 20th century, and the one led by Prof. Fernando Gómez Redondo, devoted to writing a diachronic history of medieval Castilian metrics using traditional definitions of vernacular metrics.

As far as the integration of philology and computer technology is concerned, there have been significant advances in Spain in recent years. Projects worth mentioning include Beta-Philobiblon (http://bancroft.berkeley.edu/philobiblon/beta_es.html), the digital editions of Lemir (http://parnaseo.uv.es/lemir.htm), the digital bulletin of the AHLM (www.ahlm.es), and the upgrades and improvements made by the Biblioteca Virtual Cervantes (http://www.cervantesvirtual.com/). These tools, however, lack metrical analysis of the texts and do not usually offer any metrical information about them; this is the aspect that we want to improve with our tool ReMetCa.

Specific goals of this project and tool:

With the creation of ReMetCa our main goals are:

  • To create a database that integrates the whole known Castilian poetic corpus from its origins up to 1511 (over 10,000 texts).
  • To systematize traditional metrical analysis by creating tags suitable for all the poems of the corpus.
  • To provide access to metrical schemes together with the texts, as well as data sheets gathering the main philological aspects that characterize the poems.
  • To develop a TEI-based description and make it available to the whole research community through a web application based on a Relational Database Management System and PHP.
  • To follow the standards for web content interoperability through metadata exchange, which will allow the future integration of our project into a mega-repertoire, the Megarep project, on which Levente Seláf (González-Blanco and Seláf 2013), a Hungarian researcher at ELTE University, is already working.
  • To contribute to the discussion and improvement of TEI, specifically the TEI-Verse module.
  • To promote research in Digital Humanities within the area of Philology, Metrics, and Literary Theory in Spain.

Technical issues

This poster will focus on the set of elements of the TEI-Verse module. Every element will be represented in UML as an entity with its attributes and relationships. The result of this representation will be a complete conceptual model, which will serve as the starting point for the logical model, built with an Entity-Relationship (ER) diagram.
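
To make the idea of representing elements as entities more concrete, the following Python sketch models two elements of the verse module, <lg> and <l>, as classes with attributes and a one-to-many relationship. It is only an illustration and not the project's UML model: the class names, the field rhyme_label and the method rhyme_scheme are hypothetical, and the stanza contains no real text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Line:
    """Entity for a TEI <l> element."""
    n: int                              # line number within the group
    rhyme_label: Optional[str] = None   # hypothetical per-line rhyme letter
    met: Optional[str] = None           # metrical pattern, as in TEI @met

@dataclass
class LineGroup:
    """Entity for a TEI <lg> element; holds a one-to-many relationship to Line."""
    type: str
    lines: List[Line] = field(default_factory=list)

    def rhyme_scheme(self) -> str:
        """Concatenate the rhyme letters of the lines, '?' where unknown."""
        return "".join(line.rhyme_label or "?" for line in self.lines)

# A four-line monorhyme stanza with placeholder content only.
stanza = LineGroup(type="stanza",
                   lines=[Line(n=i, rhyme_label="a") for i in range(1, 5)])
print(stanza.rhyme_scheme())  # prints "aaaa"
```

In an ER diagram the same relationship would appear as a foreign key from the line table to the line-group table.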

The next step is the creation of the physical model, which gives us the opportunity to discuss the appropriateness of a Relational Database Management System compared with the apparently easier option of a native XML database. We will consider pragmatic aspects, such as the familiarity of most web application programmers with RDBMS and the possibility of combining relational instances with XML documents.

The choice of a concrete RDBMS presents two possibilities: MySQL with XPath, or Oracle with its XMLType columns and its incorporation, in recent versions (10g Release 2), of the XQuery query language. Both models, conceptual and logical, will be fully implemented in both RDBMS.

A series of SQL queries will be run on this working installation, centered especially on data extraction with XPath, in order to verify the actual behavior of each proposal. For this simulation we will use our actual project records and reproduce the retrieval of research data that a specialist researcher in the field might have specified as a requirement of this application.
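
The combination of relational storage and XPath extraction can be sketched in a few lines of Python. The example is a stand-in, not the planned installation: SQLite replaces MySQL or Oracle, the XPath step runs in Python with the limited XPath subset of xml.etree.ElementTree instead of inside the database, and the table layout, the record and the function name are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical record: the stanza carries no real text, only line numbers.
SAMPLE_TEI = (
    '<div type="poem">'
    '<lg type="stanza" rhyme="aaaa">'
    '<l n="1"/><l n="2"/><l n="3"/><l n="4"/>'
    '</lg>'
    '</div>'
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE poem (id INTEGER PRIMARY KEY, title TEXT, tei TEXT)")
conn.execute("INSERT INTO poem (title, tei) VALUES (?, ?)",
             ("Hypothetical stanza", SAMPLE_TEI))

def poems_with_rhyme(connection, scheme):
    """Return (title, number of lines) for every stanza with the given rhyme scheme."""
    hits = []
    rows = connection.execute("SELECT title, tei FROM poem ORDER BY id")
    for title, tei in rows:
        root = ET.fromstring(tei)
        # Attribute predicate on the @rhyme attribute of <lg>.
        for lg in root.findall(f".//lg[@rhyme='{scheme}']"):
            hits.append((title, len(lg.findall("l"))))
    return hits

print(poems_with_rhyme(conn, "aaaa"))  # prints [('Hypothetical stanza', 4)]
```

In an RDBMS with built-in XML support, the XPath expression would move into the SQL statement itself, which is exactly the behavior the planned comparison is meant to test.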

Finally, we will propose a web application with data-entry forms built with a PHP framework, such as CodeIgniter.

We would like to present these solutions in this poster to be able to discuss them with the TEI community and with members of other projects working with TEI-verse.

REFERENCE WORKS

Repertoires and digital databases

  • Répertoire de la poésie hongroise ancienne, (Iván Horváth et alii) http://magyar-irodalom.elte.hu/repertorium/, http://tesuji.eu/rpha/search/rpha5
  • MedDB — Base de datos da Lírica profana galego-portuguesa, (Mercedes Brea et alii) http://www.cirp.es/bdo/med/meddb.html
  • BEdT — Bibliografia Elettronica dei Trovatori, (Stefano Asperti, Fabio Zinelli et alii) www.bedt.it
  • Dutch Song Database (Louis Grijp et alii): http://www.liederenbank.nl/index.php?lan=en
  • The Oxford Cantigas de Santa Maria Database (Stephen Parkinson) http://csm.mml.ox.ac.uk/
  • Le Nouveau Naetebus — Répertoire des poèmes strophiques non-lyriques en langue française d’avant 1400 (Levente Seláf) nouveaunaetebus.elte.hu
  • Analecta Hymnica Medii Aevi Digitalia, (Erwin Rauner), http://webserver.erwin-rauner.de/crophius/Analecta_conspectus.htm

Metrical repertoires published on paper

  • Antonelli, R., Repertorio metrico della scuola poetica siciliana, Palermo, Centro di Studi Filologici e Linguistici Siciliani, 1984.
  • Betti, Maria Pia, Repertorio Metrico delle Cantigas de Santa Maria di Alfonso X di Castiglia, Pisa, Pacini, 2005.
  • Brunner, Horst, Burghart Wachinger et Eva Klesatschke, Repertorium der Sangsprüche und Meisterlieder des 12. bis 18. Jahrhunderts, Tübingen, Niemeyer, 1986-2007.
  • Frank, Istvan, Répertoire métrique de la poésie des troubadours, Paris, H. Champion, 1966 [Bibliothèque de l’École des hautes études. Sciences historiques et philologiques 302, 308].
  • Gorni, Guglielmo, Repertorio metrico della canzone italiana dalle origini al Cinquecento (REMCI), Florencia, Franco Cesati, 2008.
  • Gómez Bravo, Ana María, Repertorio métrico de la poesía cancioneril del siglo XV, Universidad de Alcalá de Henares, 1999.
  • Mölk, Ulrich y Wolfzettel, Friedrich, Répertoire métrique de la poésie lyrique française des origines à 1350, München, W. Fink Verlag, 1972.
  • Naetebus, Gotthold, Die Nicht-Lyrischen Strophenformen Des Altfranzösischen. Ein Verzeichnis Zusammengestellt Und Erläutert, Leipzig, S. Hirzel, 1891.
  • Pagnotta, Linda, Repertorio metrico della ballata italiana, Milano; Napoli, Ricciardi, 1995.
  • Parramon i Blasco, Jordi, Repertori mètric de la poesia catalana medieval, Barcelone, Curial, Abadia de Montserrat, 1992 (Textos i estudis de cultura catalana, 27).
  • Solimena, Adriana, Repertorio metrico dei poeti siculo-toscani, Centro di studi filologici e linguistici siciliani in Palermo, 2000.
  • Solimena, Adriana, Repertorio metrico dello Stil novo, Roma, Presso la Società, 1980.
  • Tavani, Giuseppe, Repertorio metrico della lingua galego-portoghese, Roma, Edizioni dell’Ateneo, 1967.

Bibliography on Spanish metrical studies

  • Baehr, Rudolf, Manual de versificación española, Madrid, Gredos, 1970.
  • Balbín, Rafael de, Sistema de rítmica castellana, Madrid, Gredos, 1968.
  • Beltrán, Vicenç, Bibliografía sobre poesía medieval y cancioneros, publicada en la Biblioteca Virtual “Joan Lluis Vives” http://www.lluisvives.com/, 2007.
  • Bonnín Valls, Ignacio, La versificación española. Manual crítico y práctico de métrica, Barcelona, Ediciones Octaedro, 1996.
  • Domínguez Caparrós, José, Diccionario de métrica española, Madrid, Paraninfo, 1985.
  • _____, Métrica y poética. Bases para la fundamentación de la métrica en la moderna teoría literaria, Madrid, U.N.E.D., 1988a.
  • _____, Contribución a la bibliografía de los últimos treinta años sobre métrica española, Madrid, C.S.I.C., 1988b.
  • _____, Métrica española, Madrid, Síntesis, 1993.
  • _____, Métrica comparada: española, catalana y vasca. Guía didáctica. Madrid, U.N.E.D., 1994.
  • _____, Estudios de métrica, Madrid, U.N.E.D., 1999.
  • _____, Análisis métrico y comentario estilístico de textos literarios. Madrid, Universidad Nacional de Educación a Distancia, 2002.
  • Duffell, Martin, Syllable and Accent: Studies on Medieval Hispanic Metrics, Londres, Queen Mary and Westfield College, 2007.
  • García Calvo, Agustín, Tratado de Rítmica y Prosodia y de Métrica y Versificación, Zamora, Lucina, 2006.
  • Gómez Redondo, Fernando, Artes Poéticas Medievales, Madrid, Laberinto, 2001.
  • González-Blanco García, Elena, La cuaderna vía española en su marco panrománico, Madrid, FUE, 2010.
  • _____ y Seláf, Levente, “Megarep: A comprehensive research tool in medieval and renaissance poetic and metrical repertoires”, Humanitats a la xarxa: món medieval / Humanities on the web: the medieval world, eds. L. Soriano – M. Coderch – H. Rovira – G. Sabaté – X. Espluga. Oxford, Bern, Berlin, Bruxelles, Frankfurt am Main, New York, Wien: Peter Lang, 2013.
  • Herrero, José Luis, Métrica española. Teoría y práctica, Madrid, Ediciones del Orto, 1995.
  • Mario, Luis, Ciencia y arte del verso castellano, Miami, Universal, 1991.
  • Micó, José María, Bibliografía para una historia de las formas poéticas en España, ed. Digital, Biblioteca Virtual Miguel de Cervantes, Alicante, 2009 [www.cervantesvirtual.com]
  • Navarro Tomás, Tomás, Métrica española. Reseña histórica y descriptiva, Syracuse, Syracuse University Press, 1956.
  • _____, Arte del verso, México, Compañía General de Ediciones, 1959.
  • _____, Repertorio de Estrofas Españolas, New York, Las Americas Publishing Company, 1968.
  • Paraíso, Isabel, La métrica española en su contexto románico, Madrid, Arco Libros, 2000.
  • Quilis, Antonio, Métrica española, Madrid, Ediciones Alcalá, 1969.
  • Seláf, Levente, Chanter plus haut. La chanson religieuse en langues vernaculaires. Essai de contextualisation, Champion, 2009.
  • Torre, Esteban, El ritmo del verso: estudios sobre el cómputo silábico y distribución acentual a la luz de la métrica comparada, Murcia, Universidad de Murcia, Servicio de Publicaciones, 1999.
  • _____, Métrica española comparada, Sevilla, Servicio de Publicaciones de la Universidad de Sevilla, 2000.
  • Utrera Torremocha, Historia y teoría del verso libre, Ed. Padilla libros, 2001.
  • Valero Merino, Elena, Moíno Sánchez, Pablo y Jauralde Pou, Pablo, Manual de métrica española, Madrid, Castalia, 2005.

TEI-conform XML Annotation of a Digital Dictionary of Surnames in Germany

In this paper we focus on XML markup for the Digital Dictionary of Surnames in Germany (Digitales Familiennamenwörterbuch Deutschlands, DFD). The dictionary aims to explain the etymology and meaning of the surnames occurring in Germany. We discuss the possibilities and constraints that arise when the TEI module “Dictionaries” is used to edit a specialized dictionary such as the DFD. This topic includes situating the new project within the landscape of electronic dictionaries.

Our evaluation of the appropriateness of the proposed guidelines is intended as a contribution to the efforts of the TEI: the consortium regards its specifications as a dynamic, ongoing development. The efforts concerning lexical resources, which started with the digitization of printed dictionaries, are documented and discussed in various publications (e.g. Ide/Véronis/Warwick-Armstrong/Calzolari 1992; Ide/Le Maitre/Véronis 1994; Ide/Kilgarriff/Romary 2000). The module “Dictionaries” contains widely accepted proposals for digitizing printed dictionaries, but born-digital projects are becoming increasingly common (Budin/Majewski/Mörth 2012). For a more fine-grained encoding of these resources, certain proposals for customizing the module “Dictionaries” can be found (e.g. Budin/Majewski/Mörth 2012). This paper focuses on the usefulness of the guidelines for a dynamic and specialized online dictionary without customized TEI extensions. Yet our investigation points out possible extensions which may increase the acceptance and application of the TEI in other, similar projects.

First, we introduce the Digital Dictionary of Surnames in Germany (2012-2036), a new and ongoing collaboration between the Academy of Science and Literature in Mainz and Technische Universität Darmstadt. Work on the DFD started in 2012. The project is based on data from the German telecommunications company Deutsche Telekom AG and on preliminary studies of the German Surname Atlas (Deutscher Familiennamenatlas, DFA). It is planned to integrate the dictionary into an online onomastics portal named namenforschung.net, which can be seen as a gateway to various projects and information related to the field of name studies.

The intention of the DFD is to record the entire inventory of surnames occurring in Germany, including foreign ones. The entries therefore consist of several features, for instance frequency, meaning and etymology, historical examples, variants and the distribution of the surnames. The short introduction includes a brief classification of the DFD within a typology of dictionaries (Kühn 1989; Hausmann 1989). Then, we focus on data annotation for the DFD according to the TEI Guidelines, which constitute a de facto standard for the encoding of electronic texts (Jannidis 2009). Following the proposals means providing possibilities for data exchange and further exploration (Ide/Sperberg-McQueen 1995). Both aspects are particularly important considering the long duration of the project. The encoding scheme of the DFD is mainly based on the TEI module “Dictionaries”. Furthermore, components of the modules “Core” and “Names, Dates, People, and Places” are used. The main reason for considering the latter module is the close connection of surnames to geographical features, for example settlements or rivers. TEI extensions for customizing existing tags and annotation hierarchies according to specific needs are set aside to provide a higher level of data interchangeability, for instance with other TEI and XML-based onomastic projects such as the Digitales Ortsnamenbuch Online (DONBO), a digital dictionary of place names (Buchner/Winner 2011).

To evaluate the appropriateness of the TEI Guidelines for our project, we compare them with the needs of annotating the microstructures of the DFD entries. The intention of the TEI is to offer exact as well as flexible annotation schemes (Ide/Sperberg-McQueen 1995). Therefore, relevant criteria for the evaluation are the completeness of the tagset and the flexibility in arranging elements and attributes. Furthermore, the analysis discusses the comprehensibility of possible annotations in terms of descriptive and direct denotations.

In general, the TEI Guidelines – the tagset and the arrangement of its elements – can be used to represent the structure of the entries as well as the features of the DFD adequately. The applicability is, however, influenced by several aspects we want to discuss in greater detail.

First, we discuss the completeness of the tagset. It would be useful to have elements within the module “Dictionaries” to encode the frequency and the geographical distribution. The frequency of a surname is interesting for dictionary users, especially for the bearers of the name. Beyond the DFD, options to encode frequencies seem to be important for other lexical resources, such as explicit frequency dictionaries or the frequency information in learner’s dictionaries. Elements to annotate the geographical distribution are needed because the distribution in and outside of Germany serves as a means to support or verify the given sense-related information (Schmuck/Dräger 2008; Nübling/Kunze 2006). These tags seem to be of further interest for parallel developments of national surname dictionaries, for example in Austria (FamOs), as well as for other types of dictionaries, for instance variety dictionaries.

In our encoding scheme, the missing tags are replaced by more indirect combinations of tags and attributes, for example <usg type="token"> to encode the frequency or <usg type="german_distribution"> to annotate the distribution.
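
How such an entry might be generated can be sketched with Python's standard library. This is an illustration under assumptions, not the DFD's tooling: the function name, the position of <usg> directly inside <entry>, the <form type="lemma"> wrapper and the values passed in are hypothetical placeholders, not DFD data.

```python
import xml.etree.ElementTree as ET

def build_entry(surname, tokens, distribution):
    """Build a minimal dictionary entry that encodes frequency and distribution via <usg>."""
    entry = ET.Element("entry")
    form = ET.SubElement(entry, "form", {"type": "lemma"})
    ET.SubElement(form, "orth").text = surname
    # Frequency and distribution use the indirect <usg type="..."> encoding.
    ET.SubElement(entry, "usg", {"type": "token"}).text = str(tokens)
    ET.SubElement(entry, "usg", {"type": "german_distribution"}).text = distribution
    return entry

# Placeholder values only; they are not taken from the dictionary.
entry = build_entry("Bäcker", 100, "placeholder description")
print(ET.tostring(entry, encoding="unicode"))
```

Because both pieces of information share the element name <usg>, any consumer has to inspect @type to tell them apart, which is the indirectness discussed above.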

Furthermore, it would be helpful to have more possibilities to specify a sense. In the presentation of surnames in the DFD, a sense is linked with a category, which can be understood as a type of motivation for the given name. An example is the category occupation, which belongs to the surname Bäcker (‘baker’). For our purposes it is a disadvantage that the attribute @type is not allowed on the element <sense>. We use the less concise attribute @value as an alternative.

A further example of missing options for explicit markup relates to the sense part. In the DFD, senses are ordered according to their certainty. We use the attribute @expand with the values “primary”, “uncommon”, “uncertain” and “obsolete” to differentiate them. However, the definition provided by the TEI Guidelines entails giving an expanded form of information (TEI Consortium P5 2012). The slightly different usage in the DFD annotation scheme is based on the lack of suitable alternatives and on the denotative meaning of the expression to expand. Furthermore, it would be helpful to have elements within the module “Names, Dates, People, and Places” which encode not only settlements, place names and geographical names in general but also more precise features such as hydronyms or agronyms. Currently, these features are tagged as follows in our articles: <geogName type="hydronym"/>. Another aspect is the ambiguous use of one element in several contexts. An example is the tag <surname>, which can be used to encode the surname in general as well as to annotate the last name of a particular author of a cited publication.

The appropriateness of the module “Dictionaries” for encoding the DFD is diminished by restrictions concerning the arrangement of elements. The element <bibl> for annotating bibliographic references is not allowed on the entry or sense level. Within the project Wörterbuchnetz, the restriction in terms of the sense-element is overridden by embedding the element <bibl> within the element <title> or <cit> (Hildenbrandt 2011). The encoding scheme of the DFD uses the element <cit> as TEI-conform parent-element. For example:

<cit>
<bibl>
<author><surname>Gottschald</surname></author>
<date when="2006"/>
<biblScope type="pp">5</biblScope>
</bibl>
</cit>
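
The workaround can be expressed as a small transformation: every <bibl> that sits directly inside a sense is moved into a new <cit> parent. The Python sketch below illustrates this under assumptions; the function name and the sample sense (including its placeholder definition) are hypothetical and do not come from the DFD.

```python
import xml.etree.ElementTree as ET

def wrap_bibl_in_cit(parent):
    """Replace each direct <bibl> child of parent with <cit><bibl>...</bibl></cit>."""
    for index, child in enumerate(list(parent)):
        if child.tag == "bibl":
            cit = ET.Element("cit")
            parent.remove(child)
            cit.append(child)
            # Re-insert at the same position so sibling order is preserved.
            parent.insert(index, cit)
    return parent

# Hypothetical sense with a bibliographic reference at sense level.
sense = ET.fromstring(
    '<sense><def>placeholder definition</def>'
    '<bibl><author><surname>Gottschald</surname></author>'
    '<date when="2006"/><biblScope type="pp">5</biblScope></bibl>'
    '</sense>'
)
wrap_bibl_in_cit(sense)
print([child.tag for child in sense])  # prints ['def', 'cit']
```

A project that chose <title> as the parent instead would need a different transformation, which illustrates why diverging workarounds complicate interchange.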

The risk of these flexible solutions is that similar projects might handle similar situations by choosing different TEI-conform markup strategies or customizations through TEI extensions, which limits the possibilities for interchange.

As a result, we find that some aspects are not considered adequately enough within the TEI modules “Dictionaries” and “Names, Dates, People, and Places” to realize the intended function of a new dictionary of surnames in Germany. An extension of the tagset might include elements for the frequency and the distribution. A further proposal concerns the element <bibl>, which should be allowed in more contexts. The aim of the TEI Guidelines, which is to provide an expressive and explicit tagset, is not completely fulfilled with respect to the DFD: the indirect denotations and the extensive use of attributes adversely affect readability for human lexicographers working on the XML. These are among the reasons for developing a working environment that uses the author view of the XML editor Oxygen instead of the source view.

Our explanations might give impetus for slight extensions of the TEI to develop a more comprehensive, comprehensible and flexible annotation scheme for general dictionaries as well as a more adequate annotation scheme for specialized dictionaries. An appropriate and profound encoding can be seen as the basis for an abundance of application scenarios of the DFD.

Bibliography

  • Austrian Academy of Sciences (ed.) (n.d.) Familiennamen Österreichs (FamOs). http://hw.oeaw.ac.at/famos (accessed June 30, 2013).
  • Buchner, S./Winner, M. (2011). Digitales Ortsnamenbuch (DONBO). Neue Perspektiven der Namenforschung. In Ziegler, A./Windberger-Heidenkummer, E. (eds.): Methoden der Namenforschung. Methodologie, Methodik und Praxis. Berlin: Akademie Verlag, pp. 183-198.
  • Budin, G./Majewski, S./Mörth, K. (2012). Creating Lexical Resources in TEI P5. A Schema for Multi-purpose Digital Dictionaries. In Journal of the Text Encoding Initiative. 3. November 2012, Online since 15 October 2012. URL: http://jtei.revues.org/522; DOI: 10.4000/jtei.522. (accessed June 30, 2013).
  • Hausmann, F. J. (1989). Wörterbuchtypologie. In Hausmann, F. J./Reichmann, O./Wiegand, H. E./Zgusta, L. (eds.): Wörterbücher: Ein internationales Handbuch zur Lexikographie. Berlin/New York: de Gruyter, pp. 968-980.
  • Hildenbrandt, V. (2011). TEI-basierte Modellierung von Retrodigitalisaten (am Beispiel des Trierer Wörterbuchnetzes). In Klosa, A./Müller-Spitzer, C. (eds.): Datenmodellierung für Internetwörterbücher. 1. Arbeitsbericht des wissenschaftlichen Netzwerks “Internetlexikografie”. Mannheim: Institut für Deutsche Sprache, pp. 21-35.
  • Ide, N./Kilgarriff, A./Romary, L. (2000). A Formal Model of Dictionary Structure and Content. In Proceedings of Euralex 2000. Stuttgart, 113-126.
  • Ide, N./Le Maitre, J./Véronis, J. (1994). Outline of a Model for Lexical Databases. In Zampolli, A./Calzolari, N./Palmer, M. (eds.): Current Issues in Computational Linguistics. Pisa: Giardini Editori, pp. 283-320.
  • Ide, N./Sperberg-McQueen, M. (1995). The TEI. History, Goals, and Future. In Computers and the Humanities 29, 5-15.
  • Ide, N./Véronis, J./Warwick-Armstrong, S./Calzolari, N. (1992). Principles for encoding machine readable dictionaries. In Tommola, H./Varantola, K./Salmi-Tolonen, T./Schopp, Y. (eds.): EURALEX ’92. Proceedings I-II. Papers submitted to the 5th EURALEX International Congress on Lexicography in Tampere, Finland. Tampere: Tampereen Yliopisto, pp. 239-246.
  • Jannidis, F. (2009). TEI in a Crystal Ball. In Literary and Linguistic Computing. 24(3), 253-265.
  • Kühn, P. (1989). Typologie der Wörterbücher nach Benutzungsmöglichkeiten. In Hausmann, F. J./Reichmann, O./Wiegand, H. E./Zgusta, L. (eds.): Wörterbücher: Ein internationales Handbuch zur Lexikographie. Berlin/New York: de Gruyter, pp. 111-127.
  • Nübling, D./Kunze, K. (2006). New Perspectives on Müller, Meyer, Schmidt: Computer-based Surname Geography and the German Surname Atlas Project. In Studia Anthroponymica Scandinavica. Tidskrift för nordisk personnamnsforskning 24, 53-85.
  • Schmuck, M./Dräger, K. (2008). The German Surname Atlas Project. Computer-Based Surname Geography. In Proceedings of the 23rd International Congress of Onomastic Sciences. Toronto, 319-336.
  • TEI Consortium (eds.). Guidelines for Electronic Text Encoding and Interchange. 17th January 2013. http://www.tei-c.org/P5/ (accessed June 30, 2013).
  • Trier Center for Digital Humanities (ed.) (n.d.) Wörterbuchnetz. http://woerterbuchnetz.de/ (accessed June 30, 2013).

From Paper Browser to Digital Scientific Edition of Ancient Written Sources

To this day, digital epigraphy has developed along two paths. The first reproduces the way information is structured in a corpus published on paper, with additional browsing, search, and data extraction options. For instance, databases like Inscriptions of Aphrodisias On Line (http://insaph.kcl.ac.uk/iaph2007/index.html), Clauss-Slaby (http://www.manfredclauss.de/fr/), EDH (http://edh-www.adw.uni-heidelberg.de/home), or Phi7 (http://epigraphy.packhum.org/inscriptions/), although efficiently searchable, are structured simply according to the traditional elements of a paper publication, that is to say lemma, diplomatic transcription, critical edition, typographical code, translation, apparatus criticus, historical commentary. The steps involved in consulting them largely reproduce those of consulting a paper edition in a library, except that the search is quicker and more powerful: one click of the mouse opens a related map or dictionary entry. This is why we are tempted to call such information systems “paper browsers”. But digital scientific editions have more to offer, and some projects have already explored another path, attempting to go beyond the possibilities of a paper publication. The best example of this trend is the well-known website Res Gestae Divi Augusti Fotogrammetria (http://resgestae.units.it/index.jsp), which allows users to browse highly reliable interactive photogrammetric photographs and squeezes of the huge inscription that would be practically impossible to print on paper. Another instance is the ChiSel System (http://chisel.hypotheses.org/tag/presentation?lang=es_ES), which generates 3D representations of written objects.

Such achievements lead the way to a new kind of information system, one no longer based on the digitization of epigraphic knowledge as it is published on paper. Instead, a new conceptual model is required: a model that would return to what an inscription really is, and would thus be able to fully exploit the abilities of the digital environment to express its multidimensional aspects. Ideally, it should be collectively defined.

Following that idea, in this poster we would like to focus more specifically on the textual aspects of the digital representation of inscriptions, expressed in “paper browsers” via the EpiDoc subset of TEI (rethinking diplomatic and critical organisation in levels 5 and 6 of Lamé & Valchera 2012). Using an approach that is at once methodological and experimental, as encouraged by McCARTHY 2005 and, before him, BORILLO 1984, we would like to demonstrate that EpiDoc TEI, although it has developed along with the “paper browsers” experience, offers more possibilities and can fully meet the needs of a digital edition as briefly defined above, provided it takes into account the real epigraphic object in all its dimensions (writing, context…).

Through three case studies, we hope to demonstrate its current capacities and how it could best be used. First, we will try to construct the standoff position of a bilingual inscription from Samos (Demotic and Greek texts); then that of the partly preserved dedication on a statue base, also from Samos; and finally that of two Roman inscriptions, CIL, 11, 6664 and CIL, 11, 1421, which are particularly interesting for their entangled abbreviations, run-together words, mistakes and ligatures. We hope these analyses will help determine how TEI could be optimally used.

We hope that this poster will create an opportunity for a dynamic and fruitful discussion with the TEI community.

Bibliography

Digital humanities bibliography

  • BORILLO, M. 1984 Informatique pour les sciences de l’homme Bruxelles Mardaga
  • CIOTTI, F. 2005 La codifica del testo XML e la Text Encoding Initiative Il manuale TEI Lite: Introduzione alla codifica elettronica dei testi letterari Milano Sylvestre Bonnard 9-42
  • GENET, J.-P. 1994 Source, métasource, texte, histoire Storia & multimedia: Atti Del Settimo Congresso Internazionale Association for History & Computing Bologna Grafis 3–17
  • FUSI, D. 2007 Edizione epigrafica digitale di testi greci e latini: dal testo marcato alla banca dati Digital Philology and Medieval Texts Ospedaletto (Pisa) Pacini 121–163
  • FUSI, D. 2011 Informatica per le scienze umane Vol. 1 – Elementi Roma Nuova Cultura 1
  • FUSI, D. 2011 Informatica per le scienze umane Vol. 2 – Modelli Roma Nuova Cultura 2
  • GOLD, M. 2012 Debates in the digital humanities Minneapolis University of Minnesota Press
  • GREENGRASS, M., & LORNA, H. 2008 The virtual representation of the past Farnham Ashgate
  • LAMÉ, M., VALCHERA, V., & BOSCHETTI, F. 2012 Epigrafia digitale : paradigmi di rappresentazione per il trattamento digitale delle epigrafi Epigraphica 386–392
  • LUNEFELD, P., BURDICK, A., DRUCKER, J., PRESNER, T., & SCHNAPP, J. 2012 Digital_Humanities Boston MIT Press
  • McCARTHY, W. 2005 Humanities Computing New York Palgrave Macmillan
  • NEROZZI-BELLMAN, P. 1997 Internet e le muse: la rivoluzione digitale nella cultura umanistica Milano Associazione Culturale Mimesis
  • ORLANDI, T. 1985 Problemi di codifica e trattamento informatico in campo filologico Lessicografia, Filologia e Critica Firenze Leo S. Olschki 42 69-81
  • PERILLI, & FIORMONTE, D. 2011 La macchina del tempo. Studi di informatica umanistica in onore di Tito Orlandi Firenze Le Lettere
  • PIERAZZO, E. 2005 La codifica dei testi Roma Carocci
  • RONCAGLIA, G. 1997 Alcune riflessioni su edizioni critiche, edizioni elettroniche, edizioni in rete Internet e le muse: la rivoluzione digitale nella cultura umanistica Milano Associazione Culturale Mimesis 251–276
  • ROUECHÉ, C. 2009 Digitizing Inscribed Texts Text Editing, Print and the Digital World Farnham Ashgate 159–168
  • SMITH, N. 2012 Les-humanités-dont-on-ne-doit-pas-prononcer-le-nom Translated by M. Lamé Read / Write Book 2 P. Mounier Open Edition press 87–88 http://vitruviandesign.blogspot.it/2012/01/humanities-that-must-not-be-named.html
  • SOLER, F. From 2012 Carnet de recherche Chisel http://chisel.hypotheses.org (Carnet de recherche sur la plateforme Hypotheses.org)
  • SUSINI, G. 1982 Epigrafia romana Roma Jouvence
  • TORRES, J.C., SOLER, F. 2012 An Information System to Analize Cultural Heritage Information Paper accepted Euromed Conference 2012

Digital humanities bibliography

  • BELLET, M.-É. & al. 2003 De la restitution en archéologie. Actes du Colloque de Béziers organisé par le Centre des Monuments nationaux Paris Éditions du Patrimoine http://editions.monuments-nationaux.fr/fr/le-catalogue/bdd/livre/662
  • ÉTIENNE, R. 1970 Le siècle d’Auguste Paris Armand Colin
  • GHINATTI, F. 1999 Alfabeti greci. Torino: Paravia scriptorium
  • JACQUES, F. 1990 Les Cités de l’Occident romain Paris Belles Lettres
  • KRUMMREY, H., & PANCIERA, S. 1980 Criteri di edizione e segni diacritici Tituli 2: Miscellanea Roma Edizioni di storia e letteratura 2 205–215
  • PANCIERA, S. 2012 What Is an Inscription? Problems of Definition and Identity of an Historical Source Translated by J. BODEL Zeitschrift für Papyrologie und Epigraphik 183 1–10
  • ROW, G. 2002 Princes and Political Culture Ann Arbor University of Michigan Press

Sources Edition

  • ARIAS, P.E., CRISTANI, E., GABA, E. 1977 Camposanto monumentale di Pisa Pisa Pacini
  • HALLOF, KI. 2000 n° 348 Inscriptiones Graecae XII 6, I
  • HALLOF, KI. 2003 n° 589 Inscriptiones Graecae XII 6, II
  • LUPI, C. 1979 I decreti della colonia pisana ridotti a miglior lezione Pisa F. Mariotti e CC.
  • MAROTTA D’AGATA, R. 1980 Decreta Pisana (CIL, XI, 1420-21) ed. critica, trad. e commento Pisa Ed. Marlin
  • SEGENNI, S. 2011 I Decreta Pisana : autonomia cittadina e ideologia imperiale nella colonia Opsequens Iulia Pisana Bari Edipuglia

A Challenge to Dissemination of TEI among a Language and Area: A Case Study in Japan

This presentation describes a challenge to the dissemination of TEI in Japan, a country where most people have spoken and written a single language for more than a millennium. At present, there are very few examples of attempts to adopt TEI for Japanese cultural resources. However, Japan has a rich textual tradition, and many textual resources are preserved going back as far as the 8th century. A vastly greater amount of material remains from the 17th century, due to the spread of woodblock printing technologies. Humanities researchers have addressed the digitization of humanities resources since the 1950s.

In the early stages, Japanese linguists began digitizing language resources in order to statistically analyze Japanese and Western materials, and in 1957 they established a society named the “Mathematical Linguistic Society of Japan” (Keiryo Kokugo Gakkai, 計量国語学会), through which they published a journal. In addition, several progressive researchers working at the National Institute for Japanese Language and Linguistics, the National Institute for Japanese Literature, and several universities commenced the digitization of their Japanese materials using large-scale computer systems. The National Museum of Ethnology also played an important role in this endeavor.

Following these early attempts, several communities were established at the end of the 1980s, spurred by the proliferation of the IBM PC. One was formed as the Special Interest Group of Computers and the Humanities, that is, SIG-CH, under the auspices of the Information Processing Society of Japan, the largest computer science society in Japan. The others were the Japan Society of Information and Knowledge and the Japan Art Documentation Society. After that, many academic communities were established based on the new possibilities opened up by the Internet. It is especially noteworthy that societies for digital scholarship in archaeology, English corpora, and Asian literature were formed in the 1990s. Moreover, several academic communities have been formed in the 21st century as well, including JADH (Japanese Association for Digital Humanities), which has become a constituent organization of ADHO.

Under these circumstances, over a thousand presentations regarding digitization of the humanities have been made since the 1950s. Around 800 presentations were given at the quarterly workshops of SIG-CH from 1989 to 2012, covering various types of digital scholarship in the humanities such as textual analysis, text databases, image databases, and so on (Figure 1), and targeting various fields in the humanities (Figure 2).

Figure 1. Types of digital scholarship in the presentations of SIG-CH

Figure 2. Top 11 target fields of the presentations

However, the TEI has not fared that well up to now in Japanese academic communities, probably for several reasons, including issues of character encoding and language barriers. In fact, differences in character encoding prevented the sharing of a broad range of digital content and applications beyond TEI. Many of the applications that were developed for Western languages could not be used in Japanese computing environments before the promulgation of Unicode. This means that it was difficult for Japanese humanities researchers to realize the significance and potential of TEI at that time. Moreover, it was also difficult to participate in the discussion of TEI because of the difference of language and the great distance from the center of TEI. Therefore, in spite of the efforts of a few researchers, Japanese researchers had rarely participated in the activities of TEI until recently. Instead, they had addressed their textual resources using their own individual approaches.

Recently, the pervasive implementation of Unicode and the spread of the Internet have widened the possibilities of TEI even in Japan. In 2006, a TEI meeting hosted by Christian Wittern at Kyoto University gathered various researchers, awakening scholars to the potential of TEI. After that, a series of DH workshops, including TEI tutorials in 2009 at Tokyo and Osaka, began to be held by a DH community that later led to the formation of the Japanese Association for Digital Humanities. In this new period, even in Japan, humanities researchers could experience the potential and possibilities of TEI through hands-on use of several powerful UTF-8-based tools developed by TEI communities, such as oXygen, Versioning Machine, Roma, and so on. These efforts were strongly supported by TEI specialists such as Espen Ore, John Lavagnino, Susan Schreibman, and Elena Pierazzo. Several DH courses in Japanese universities have recently included tutorials on TEI.

In addition, a project to translate the TEI Guidelines into Japanese has been initiated by several young researchers led by Kazushi Ohya. Thus, an environment for TEI has gradually been forming in Japan. Indeed, several DH projects are trying to use TEI for their digital resources. Their results will be shown in the near future.

During the discussion of adopting TEI for Japanese textual materials, several problems have been recognized. For example, Japanese texts often contain intralinear text, called “ruby,” that indicates phonetic representation; ruby has already been adopted in HTML5 and ePub 3.0. It is not simply a phonetic standard: its system can depend on the idiosyncratic phonetic representations of a certain author, editor, or publisher, so it represents a phonetic rendering in specific situations. Type attributes can probably be applied in this case, but a guideline should be prepared for such usage. Otherwise, a module may need to be created specifically for handling Japanese materials. This kind of effort could be useful for the dissemination of TEI in other countries and areas. Moreover, as already discussed in several places, such as DH2012, some linguists would prefer to avoid using not only TEI but also general tags (even in Japan) so that they can mine texts freely. We should discuss this matter carefully and constructively.

Finally, stand-off markup seems to be suitable for most Japanese resources, but it has not yet been applied meticulously. This issue should be addressed as soon as possible.

Bibliography

  • Kiyonori Nagasaki, How Digital Technologies Have Been Used: Through the History of the SIG Computers and the Humanities, “IPSJ technical report”, 2013-CH-98(7), pp. 1-6. (in Japanese)
  • Kiyonori Nagasaki, “A Simple Guide to TEI and oXygen”, [ http://www.dhii.jp/nagasaki/blog/node/12 ] (in Japanese)

Dramawebben, linking the performing arts and the scholarly communities

Background

Dramawebben (The Swedish Drama Web) has served as a free digital resource since 2006. A largely unexplored body of empirical material, Swedish drama free from copyright, has been made accessible through a website [1]. The website has been used by scholars and students, theatre practitioners and the general public.

In most cases the first printing of the play was the version first encountered by a theatrical audience. First printings are also the most difficult to access and thus the most exclusive editions and therefore the most important to make accessible. Plays are generally published in two formats: a facsimile and a text version (made from optical character recognition of the facsimile images). They are also accompanied by descriptive catalogue entries, where a reading of each individual play, and of its reception in the press at its first performance, are summarised in informative meta texts. This publication principle is an important preparatory foundation for scholarship, where the facsimile functions as a complement to the encoded text version. Being able to switch between facsimile and encoded text versions is sometimes important, e.g. when the text is in Gothic type or the text only exists in the form of a handwritten manuscript.

Each stage of the work of collecting, processing and publishing the material has been designed to lay a preparatory foundation for scholarship that will support the development of Dramawebben into an exemplary national infrastructure for digital research in the humanities. In an ongoing project (2012-2014), Dramawebben is further developing the website, laying the foundation for moving the e-Drama infrastructure into long-term operation. The project includes establishing a corpus of plays with baseline TEI-drama annotation, developing exploration tools, and engaging a vibrant community. A key component is to educate students in TEI encoding and let them act as ambassadors, spreading the word to target disciplines within the humanities, such as linguistics, literary and theatre history, studies in children’s culture, practical and theoretical research in children’s theatre, and arts tertiary institutions.

Collaboration and sustainability

Since its start in 2006, Dramawebben has initiated collaboration with a number of other infrastructures for the mutual benefit of all parties involved. Such cooperation also ensures long-term sustainability and that the research material will remain available as a national resource.

In cooperation with Språkbanken (The Swedish Language Bank), tools for linguistic annotations, search functions and display formats for linguistic investigations will be available for Dramawebben [2][3]. Språkbanken, on the other hand, can include drama, an otherwise missing text type, in their language corpora. Språkbanken is also involved in Litteraturbanken (The Swedish Literature Bank), another digitisation infrastructure. Litteraturbanken has provided Dramawebben with advice and technical support on standards for digitisation, publication and process support, and Litteraturbanken uses facsimiles made by Dramawebben.

Making the material accessible has always been a high priority for Dramawebben. Therefore, Dramawebben is included in the search engines Libris and K-samsök of Kungliga biblioteket (The National Library of Sweden) and Riksantikvarieämbetet (The Swedish Central Board of National Antiquities), respectively. There are links from the catalogue entries to the library databases, which refer users to the original material in each respective archive, while, for example, Libris links to Dramawebben’s entries in order to refer users to meta data and full text publications. It is planned that Dramawebben will be included in Libris as its first digital archive.

Dramawebben has also been conducting a very fruitful collaboration in the field of digitisation with the National Library of Sweden, and the archives of the Royal Opera, Royal Dramatic Theatre, and Statens musikverk (Music Development and Heritage Sweden). Through supplementary grants from the Bank of Sweden Tercentenary Foundation, 15,000 pages of printed drama from the national library’s collections have been photographed by the library’s own digitisation department. In the ongoing project, handwritten material in the theatrical archives is being digitised, to the benefit of all parties concerned.

TEI-drama encoding

In order to facilitate more advanced exploration within and across dramas, we are in the process of TEI-encoding a subset of the plays. By adhering to TEI text encoding principles, we make a commitment to sustainability, and we also benefit from being part of a larger community. Preparation for text encoding started in the spring of 2012. All plays on Dramawebben printed 1880-1900 were selected. The selection comprises 89 plays in all genres (children’s plays, drama and comedy), by female as well as male dramatists.

Baseline encoding

Common to all plays is a baseline encoding taken from the drama module of TEI, and minimal support for facsimile encoding, connecting the TEI-encoded text to the facsimile. The baseline encoding covers the basic structure of the drama text. On top of that, it is possible to add semantic annotation, which goes beyond the text itself, referring to the action below, behind or beyond the actual words.

Semantic encoding

To tempt scholars in the humanities with at least one theme for semantic encoding, we have started with one: textile handicraft, which was a recurrent feature of the plays by female playwrights of the 1880s. The needleworking woman was a strong and yet ambivalent sign of the period. August Strindberg let one of his heroines deny the crochet she was constantly working on: ‘It is nothing, it is just my needlework’[4]. To his female colleague Alfhild Agrell, the handicraft had a subversive power. One of her heroines silently embroidered her way to financial independence and freedom from an unbearable marriage[5]. Strindberg’s heroine denies her needlework but still performs it in full limelight. Needlework is a potent stage action, or a playable sign.

So how did we go about encoding this manifold sign? We soon realized it was not always fully designated in the stage directions. Although the props and starting point of the action were given (‘She picks her knitting’), the point where the action ceases might not always be mentioned. The question of when she puts down the knitting can be related to why she stops. Encoding handicraft thus opens up the exploratory reading of the drama text that is the basis for every stage action. And it is in this very interpretative process that scholars will meet theatre practitioners.

The textile handicraft is not only embodied in the actual stage action. It will also be present in the lines where the speakers elaborate their knowledge and attitude about it. The props as well as the handicraft are also frequently used as metaphors for life, death and fate as well as for daily matters.

Dissemination

Our task is not only to do the text encoding, but also to implement and spread TEI as a new research tool in the Swedish communities of humanities and of artistic production, by employing students as ambassadors. We apply an adapted version of the bottom-up process practiced by the Women Writers Project at Brown[6], meaning that digital humanities must come from the grass roots: the students. The TEI encoding is therefore performed by five students in literature and theatre science, who simultaneously function as ambassadors.

They have assimilated TEI and the basic encoding quickly. During the first five months, amounting to approximately 100 hours of work, they have also substantially increased their speed and accuracy. This has been a process of finding their own way of balancing transcribing, encoding and proofreading. The students have been encouraged not only to perform the basic encoding but also to find their own themes for semantic encoding.

Three workshops will be held during 2013-2014, where the students and invited scholars will present their explorations into the potentials and adventures of digital humanities, given their respective use cases. Main target groups are scholars, theatre practitioners and librarians, who are not familiar with the possibilities of TEI-encoding.

Acknowledgements

The authors gratefully acknowledge financial support from the Swedish Research Council (VR Dnr: 2011-6202).

Bibliography

  • Dramawebben <http://www.dramawebben.se>
  • Korp, Språkbanken, University of Gothenburg, <http://spraakbanken.gu.se/korp/>.
  • Lars Borin, Markus Forsberg, Leif-Jöran Olsson, and Jonatan Uppström. 2012. The open lexical infrastructure of Språkbanken. Proceedings of LREC 2012, Istanbul: ELRA. 3598-3602 <http://spraakbanken.gu.se/karp/>.
  • August Strindberg, To Damascus, 1898.
  • Alfhild Agrell, Saved, 1883.
  • Women Writers Project, Brown University <http://www.wwp.brown.edu/>.

The Karnak Cachette Texts On-Line: the Encoding of Transliterated Hieroglyphic Inscriptions

Between 1903 and 1907, G. Legrain discovered around 800 stone statues, stelae and other objects in a large pit (the so-called “Cachette”) inside the temple of Amun at Karnak, in which they were piously buried by the Egyptian priests, probably during the 1st century B.C. They include a number of royal effigies of all periods but most of the statues primarily belong to the priests who officiated at Karnak from the New Kingdom to the end of the Ptolemaic Period.

The Karnak Cachette Database is an on-line inventory of the Cachette and a tool to search this rich corpus. The first version was launched in 2009; it provides, insofar as possible, a general description of each object (with dimensions, materials, dating), a label, the date of discovery, different inventory numbers, and a bibliography. Version 2 was put online in 2012: it includes extensive access to the photographic documentation (more than 8,000 photographs are now available). The database has been regularly updated since then.

Building on this well-defined corpus, the project now aims to develop tools to encode, search and electronically publish the hieroglyphic texts inscribed on these objects, which provide anthroponymical, toponymical and prosopographical data and are therefore of historical and documentary significance. The encoding is developed according to the recommendations of the Text Encoding Initiative in combination with relevant “best practices” in the field of Digital Humanities applied to Epigraphy (Elliott et alii 2007; Cayless et alii 2009). In this sense, even though the project takes into account many of the EpiDoc schema rules, it is only partially compliant with this TEI customization because of the specificities of both the project and Ancient Egyptian Epigraphy (compare with Lamé 2008), and also because it needs to remain compatible with other Egyptological projects dealing with textual corpora (Winand, Polis, Rosmorduc in press).

Xefee, a tool to encode transliterated hieroglyphic inscriptions

It is well known that XML is far from being a human-friendly way to encode texts. Several XML editors are already available; some of them are highly customizable and can be used by very specific projects, provided that the users are properly trained and some implementation time and effort is spent. However, due to the specific features of the texts from the Karnak Cachette (for instance in terms of prosopography) and the general philosophy of the project (to edit and analyse texts that require full Egyptological proficiency), it was decided to create a dedicated XML editor that would make it easier to input the text, mark it up, and generate the XML/TEI files.

Xefee (XML Editor for Egyptian Epigraphy) is a desktop Java application developed with NetBeans. It mainly consists of a graphical user interface (GUI) which provides all the necessary tools for managing and encoding the ancient Egyptian texts as well as the descriptive data pertaining to the Karnak Cachette project. These tools range from an import module that directly converts to XML the hieroglyphic text transcriptions written according to Egyptological standards, to more complex components intended to manage genealogical data.

The tab dealing with text encoding offers the user a panel of buttons, combo-boxes and other controls that facilitate marking up the texts with tags pertaining to epigraphy (<lb/>, <cb/>, <gap/>, <sic/>, <supplied/> elements), onomastics (<persName/> element and <rs/> elements with specific @type such as “deity”, “deityEpithet”, “toponym”) and prosopography (<rs/> elements with specific @type such as “person”, “title”, “filiationMark”). To add a tag, the user simply has to select the text to be marked up in the top view pane and press the appropriate button on the right-hand half of the tab. Since the XML markup can be quite dense, mainly because the texts the project is dealing with often consist of compact sequences of personal names and titles, a preview pane at the bottom of the tab renders the encoded strings with different kinds of surrounding or highlighting patterns.
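The select-and-tag step described above can be illustrated with a minimal Python sketch. It is not Xefee's code (Xefee is a Java application whose internals are not described here); the function name `wrap_selection`, the character-offset interface and the sample transliteration string are all assumptions made for illustration. It only shows the general idea of wrapping a selected span in an element such as <rs type="deity">.

```python
import xml.etree.ElementTree as ET


def wrap_selection(text, start, end, tag, attrs=None):
    """Wrap the characters text[start:end] in an XML element.

    Hypothetical helper: the selection is given as character offsets,
    standing in for a selection made in an editor pane.
    """
    if not (0 <= start < end <= len(text)):
        raise ValueError("selection must be a non-empty range inside the text")
    element = ET.Element(tag, attrs or {})
    element.text = text[start:end]
    # ET.tostring escapes special characters (&, <, >) inside the selection.
    markup = ET.tostring(element, encoding="unicode")
    # Text outside the selection is returned unchanged.
    return text[:start] + markup + text[end:]


# Illustrative placeholder string, not taken from the corpus.
line = "Hm-nTr n Jmn"
start = line.index("Jmn")
tagged = wrap_selection(line, start, start + 3, "rs", {"type": "deity"})
```

A real editor would also have to handle selections that overlap existing tags, which this sketch deliberately ignores.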

The Ancient Egyptian way of presenting genealogical filiations also required building dedicated tools to handle this very important aspect of the text contents. One tab of the GUI is dedicated to the creation of persons’ identities, whilst another manages the family links and generates the <relationGrp/> element.
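As a rough illustration of what generating a <relationGrp/> element from family links might involve, here is a small Python sketch. The input format (pairs of parent and child identifiers), the `type="family"` value and the relation name "parent" are assumptions chosen for the example; the abstract does not specify how Xefee models these links.

```python
import xml.etree.ElementTree as ET


def build_relation_grp(links):
    """Build a <relationGrp> from (parent_id, child_id) pairs.

    Hypothetical data model: each pair says that the first person
    is a parent of the second.
    """
    grp = ET.Element("relationGrp", {"type": "family"})
    for parent_id, child_id in links:
        ET.SubElement(
            grp,
            "relation",
            {
                "name": "parent",            # assumed relation label
                "active": f"#{parent_id}",   # pointer to the parent
                "passive": f"#{child_id}",   # pointer to the child
            },
        )
    return ET.tostring(grp, encoding="unicode")


# Two invented identifiers, for illustration only.
xml_fragment = build_relation_grp([("person1", "person2")])
```

The identifiers are written as local pointers (`#person1`), on the assumption that the persons are declared elsewhere in the same file.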

The current stage of the Karnak Cachette Project relies on the object and museum data described in version 1 of the related database and on the photographic material added in version 2. In order to fully use this already existing material, as well as to store the new data created throughout the encoding of the texts, Xefee relies on a MySQL database in which these different kinds of data are merged. Organised around a main “document” table, the data is spread over eighteen tables, among which four are dedicated to data from version 1, and one to the encoded texts.

In order to make full use of this material from an XML perspective, a sixth and last tab of the GUI is dedicated to the creation of the XML/TEI files. By pressing the upper-left button, the user asks Xefee to retrieve all the needed pieces of information from the MySQL database and to place them between the appropriate XML tags. This generates all the sections of an XML file, from the header with the publication and bibliographic statements to the div elements dealing with the encoded texts. The newly created XML file will then be loaded into a native XML database (eXist) in order to constitute the electronic corpus itself.
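The database-to-TEI step can be sketched as follows. This is a simplified Python illustration, not the project's implementation: it uses an in-memory SQLite database as a stand-in for MySQL, and the table names, column names and sample row are hypothetical. It assumes the encoded text is stored as a well-formed XML fragment.

```python
import sqlite3
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def demo_connection():
    """Create a tiny in-memory database with an assumed two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE document (id INTEGER PRIMARY KEY, label TEXT)")
    conn.execute(
        "CREATE TABLE encoded_text (id INTEGER PRIMARY KEY, document_id INTEGER, content TEXT)"
    )
    conn.execute("INSERT INTO document VALUES (1, 'Statue (hypothetical label)')")
    conn.execute(
        "INSERT INTO encoded_text VALUES (1, 1, '<lb/>Hm-nTr n <rs type=\"deity\">Jmn</rs>')"
    )
    return conn


def build_tei(conn, doc_id):
    """Assemble a minimal TEI file for one document from the database rows."""
    row = conn.execute("SELECT label FROM document WHERE id = ?", (doc_id,)).fetchone()
    if row is None:
        raise KeyError(doc_id)
    tei = ET.Element("TEI", {"xmlns": TEI_NS})
    file_desc = ET.SubElement(ET.SubElement(tei, "teiHeader"), "fileDesc")
    ET.SubElement(ET.SubElement(file_desc, "titleStmt"), "title").text = row[0]
    ET.SubElement(ET.SubElement(file_desc, "publicationStmt"), "p").text = "Placeholder"
    ET.SubElement(ET.SubElement(file_desc, "sourceDesc"), "p").text = "Placeholder"
    body = ET.SubElement(ET.SubElement(tei, "text"), "body")
    query = "SELECT content FROM encoded_text WHERE document_id = ? ORDER BY id"
    for (content,) in conn.execute(query, (doc_id,)):
        div = ET.SubElement(body, "div", {"type": "edition"})
        # The stored markup is parsed, so it is embedded as XML, not escaped text.
        div.append(ET.fromstring(f"<ab>{content}</ab>"))
    return ET.tostring(tei, encoding="unicode")


tei_xml = build_tei(demo_connection(), 1)
```

The resulting string could then be written to a file and stored in an XML database; that step is left out here.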

Bibliography

  • CACHETTE DE KARNAK: L. Coulon, E. Jambon, Base de données Cachette de Karnak /Karnak Cachette Database launched in November 2009; version 2 updated in January 2012. Karnak Cachette Database (http://www.ifao.egnet.net/bases/cachette).
  • Cayless et alii 2009: H. Cayless, Charlotte Roueché, T. Elliott, G. Bodard, “Epigraphy in 2017”, in Digital Humanities Quarterly 3.1 (2009). Available online.
  • Elliott et alii 2007: T. Elliott, L. Anderson, Z. Au, G. Bodard, J. Bodel, H. Cayless, Ch. Crowther, J. Flanders, I. Marchesi, E. Mylonas and Ch. Roueché, EpiDoc: Guidelines for Structured Markup of Epigraphic Texts in TEI, release 5, 2007. Available online.
  • Lamé 2008: M. Lamé, “Pour une codification historique des inscriptions”, Rivista Storica dell’Antichità 38, 2008 (2009), p. 213-225. Available online.
  • Winand, Polis, Rosmorduc in press: J. Winand, St. Polis, S. Rosmorduc, “Ramses. An Annotated Corpus of Late Egyptian”, in P. Kousoulis (ed.), Proceedings of the Xth International Association of Egyptologists Congress (Rhodes, May 2008), Leuven, Peeters, in press. Available online.

Edition Visualisation Technology: a simple tool to visualize TEI-based digital editions

The TEI schemas and guidelines have made it possible for many scholars and researchers to encode texts of all kinds for (almost) all kinds of purposes: from simple publishing of documents in PDF form to sophisticated language analysis by means of computational linguistics tools. It is almost paradoxical, however, that this excellent standard is matched by an astounding diversity of publishing tools, which is particularly true when it comes to digital editions, in particular editions including images of manuscripts. This is in part due to the fact that, while there’s still an ongoing discussion about what exactly constitutes a digital edition, available publications have significantly raised users’ expectations: even a simple digital facsimile of a manuscript is usually accompanied by tools such as a magnifying lens or a zoom in/out tool, and if there is a diplomatic transcription (and/or a critical edition) we expect to have some form of image-text linking, hot-spots, a powerful search engine, and so on. The problem is that all of this comes at a cost, and the different needs of scholars, coupled with the constant search for an effective price/result ratio and the locally available technical skills, have led to a remarkable fragmentation: publishing solutions range from simple HTML pages produced using the TEI style sheets (or the TEI Boilerplate software) to very complex frameworks based on CMS and SQL search engines.

The optimal solution to the long-standing visualization problem would be a simple, drop-in tool that would allow one to create a digital edition by running one or more style sheets on the TEI document(s). The TEI Boilerplate software takes exactly this approach: you apply an XSLT style sheet to your already marked-up file(s), and you’re presented with a web-ready document. Unfortunately, this project doesn’t cover the case of an image-based digital edition presented above, which is why I had to look elsewhere for my own research: the Digital Vercelli Book project aims at producing an online edition of this important manuscript, and has been examining several software tools for this purpose. In the end, we decided to build a piece of software, named EVT (for Edition Visualization Technology), that would serve the project’s needs and possibly more: what started as an experiment has grown well beyond that, to the point of being almost usable as a general TEI publishing tool. EVT is based on the ideal workflow hinted at above: you encode your edition, you drop the marked-up files in the software directory, and voilà: after applying an XSLT style sheet, your edition is ready to be browsed. In more detail, the EVT builder’s transformation system divides an XML file holding the transcription of a manuscript into smaller portions, each corresponding to an individual page of the manuscript, and for each of these portions of text it creates as many output files as requested by the file settings. By using XSLT modes to distinguish between the rules, it is possible to achieve different transformations of a TEI element and to call further XSLT style sheets in order to manage the transformations. This makes it possible to extract different texts for different edition levels (diplomatic, diplomatic-interpretative, critical) on the basis of the same XML file, and to insert them in the HTML site structure, which is available as a separate XSLT module.
If the TEI elements that are processed are placed in an HTML element with the class edition_level-TEI_element’s_name (e.g. for the element <abbr> in the transformation to the diplomatic edition: dipl-abbr), it is possible to keep the semantic information contained in the markup and, if necessary, associate the element with that class in the CSS rules so as to specify the visualization and highlighting of the item. The edition level outputs and other aspects of the process can be configured by editing the evt_builder-conf.xsl file.
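The two ideas just described, splitting a transcription into per-page portions and naming HTML classes after the edition level and the TEI element, can be sketched in a few lines of Python. EVT itself does this with XSLT; the sketch below swaps in simple string processing with regular expressions purely for illustration, and the function names and sample text are assumptions. It also assumes that pages are delimited by <pb n="..."/> milestones.

```python
import re

# Matches an empty <pb/> milestone and captures the value of its n attribute.
PB = re.compile(r'<pb\b[^>]*\bn="([^"]*)"[^>]*/>')


def split_pages(body):
    """Split a transcription into {page number: fragment} at <pb n="..."/>."""
    parts = PB.split(body)
    # parts = [text before first <pb/>, n1, text1, n2, text2, ...]
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts), 2)}


def css_class(edition_level, element_name):
    """Class name following the edition_level-element_name convention."""
    return f"{edition_level}-{element_name}"


def render_abbr(fragment, edition_level):
    """Turn <abbr>…</abbr> into an HTML span carrying the semantic class."""
    cls = css_class(edition_level, "abbr")
    return re.sub(r"<abbr>(.*?)</abbr>", rf'<span class="{cls}">\1</span>', fragment)


# Invented sample transcription, for illustration only.
sample = '<pb n="1r"/>In <abbr>dni</abbr> <pb n="1v"/>nomine'
pages = split_pages(sample)
html_1r = render_abbr(pages["1r"], "dipl")
```

Splitting a string at milestones can cut through elements that span a page break; a real implementation has to deal with that, which is one reason a proper XML transformation is preferable to this simplified approach.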

At the present moment EVT can be used to create image-based editions with two possible edition levels: diplomatic and diplomatic-interpretative; this means that a transcription encoded using elements of the TEI transcr module (see chapter 11, Representation of Primary Sources, in the Guidelines) should be compatible with EVT, or made compatible with minor changes; on the image side, several features such as a magnifying lens, a general zoom, image-text linking and more are already available. For the future we aim at taking the Critical Apparatus module into consideration, which would imply creating a separate XSLT style sheet to complement the two existing ones, and at making it easier to configure the whole system, possibly by means of a GUI tool. Search functionality will be entrusted to a native XML database such as eXist.

EVT is built on open and standard web technologies, such as HTML, CSS and JavaScript, to ensure that it works in all the most recent web browsers, and for as long as possible on the World Wide Web itself. Specific features, such as the magnifying lens, are entrusted to jQuery plugins, again chosen from among the best-supported open source ones to reduce the risk of future incompatibilities; the general architecture of the software is in any case modular, so that any component which causes trouble or turns out not to be completely up to the task can be replaced easily. The project is nearing an alpha release (v. 0.2.0) on Sourceforge, and already offers all the tools listed above, with the exception of a search engine (expected to be implemented in v. 0.3.0).

Bibliography

Editions and digital facsimiles
  • Biblioteca Apostolica Vaticana. http://www.vaticanlibrary.va/home.php?pag=mss_digitalizzati (accessed on March 2013).
  • Codex Sinaiticus. http://www.codex-sinaiticus.net/en/manuscript.aspx (accessed on March 2013).
  • e-codices. http://www.e-codices.unifr.ch/ (accessed on March 2013).
  • e-sequence. http://www.e-sequence.eu/de/digital-edition (accessed on March 2013).
  • Foys, Martin K. 2003. The Bayeux Tapestry: Digital edition [CD-ROM]. Leicester: SDE.
  • Kiernan, Kevin S. 2011. Electronic Beowulf [CD-ROM]. Third edition. London: British Library.
  • Malory Project. http://www.maloryproject.com/image_viewer.php?gallery_id=7&image_id=11&pos=1 (accessed on March 2013).
  • Muir, Bernard James. 2004a. The Exeter anthology of Old English poetry: An edition of Exeter Dean and Chapter MS 3501 [CD-ROM]. Revised second edition. Exeter: Exeter University Press.
  • Online Froissart. http://www.hrionline.ac.uk/onlinefroissart/ (accessed on March 2013).
  • Samuel Beckett Digital Manuscript Project. http://www.beckettarchive.org/demo/ (accessed on March 2013).
  • Stolz, Michael. 2003. Die St. Galler Epenhandschrift: Parzival, Nibelungenlied und Klage, Karl, Willehalm. Faksimile des Codex 857 der Stiftsbibliothek St. Gallen und zugehöriger Fragmente. CD-ROM mit einem Begleitheft. Hg. von der Stiftsbibliothek St. Gallen und dem Basler Parzival-Projekt (Codices Electronici Sangallenses 1).
  • The Dead Sea Scrolls. http://www.deadseascrolls.org.il/ (accessed on March 2013).
  • Vercelli Book Digitale. http://vbd.humnet.unipi.it/ (accessed on March 2013).
Software tools
  • DFG Viewer. http://dfg-viewer.de/en/regarding-the-project/ (accessed on March 2013).
  • DM Tools. http://dm.drew.edu/dmproject/ (accessed on March 2013).
  • Scalable Architecture for Digital Editions. http://www.bbaw.de/telota/projekte/digitale-editionen/sade/ (accessed on March 2013).
  • TEI Boilerplate. http://teiboilerplate.org/ (accessed on March 2013).
  • TEICHI. http://www.teichi.org/ (accessed on March 2013).
  • The TEIViewer project. http://teiviewer.org/ (accessed on March 2013).
Essays and reference
  • Burnard, L., K.O.B. O’Keeffe, and J. Unsworth. 2006. Electronic textual editing. New York: Modern Language Association of America.
  • Buzzetti, Dino. 2009. “Digital Editions and Text Processing”. In Text Editing, Print, and the Digital World. Ed. Marilyn Deegan and Kathryn Sutherland, 45–62. Digital Research in the Arts and Humanities. Aldershot: Ashgate. http://137.204.176.111/dbuzzetti/pubblicazioni/kcl.pdf.
  • Foys, Martin K., and Shannon Bradshaw. 2011. “Developing Digital Mappaemundi: An Agile Mode for Annotating Medieval Maps”. Digital Medievalist n. 7. http://www.digitalmedievalist.org/journal/7/foys/ (accessed on March 2013).
  • Landow, George P. 1997. Hypertext 2.0: The convergence of contemporary critical theory and technology. Baltimore: Johns Hopkins University Press.
  • O’Donnell, Daniel Paul. 2005a. Cædmon’s Hymn: A multimedia study, archive and edition. Society for early English and Norse electronic texts A.7. Cambridge and Rochester: D.S. Brewer in association with SEENET and the Medieval Academy.
  • O’Donnell, Daniel Paul. 2005b. “O Captain! My Captain! Using technology to guide readers through an electronic edition.” Heroic Age 8. http://www.mun.ca/mst/heroicage/issues/8/em.html (accessed on March 2013).
  • O’Donnell, Daniel Paul. 2007. “Disciplinary impact and technological obsolescence in digital medieval studies”. In A companion to digital literary studies. Ed. Susan Schreibman and Ray Siemens. Oxford: Blackwell. 65–81. http://www.digitalhumanities.org/companion/view?docId=blackwell/9781405148641/9781405148641.xml&chunk.id=ss1-4-2 (accessed on March 2013).
  • Price, Kenneth M. 2008. “Electronic Scholarly Editions”. In A Companion to Digital Literary Studies. Ed. Susan Schreibman and Ray Siemens. Oxford: Blackwell.
  • Robinson, Peter. 2004. “Where We Are with Electronic Scholarly Editions, and Where We Want to Be”. http://computerphilologie.uni-muenchen.de/jg03/robinson.html (accessed on March 2013).
  • Rosselli Del Turco, Roberto. 2006. ‘La digitalizzazione di testi letterari di area germanica: problemi e proposte’. Atti del Seminario internazionale ‘Digital philology and medieval texts’ (Arezzo, 19 – 21 Gennaio 2006), Firenze: Sismel.
  • Rosselli Del Turco, Roberto. 2011. ‘After the editing is done: designing a Graphic User Interface for Digital Editions.’ Digital Medievalist Journal vol. 7. http://www.digitalmedievalist.org/journal/7/rosselliDelTurco/ (accessed on March 2013).
  • TEI Consortium, eds. Guidelines for Electronic Text Encoding and Interchange. V. P5 (31 January 2013). http://www.tei-c.org/P5/.

Use of TEI in the Wolfenbuettel Digital Library (WDB)

This poster will present the use of TEI in the Wolfenbuettel Digital Library (WDB), housed by the Herzog August Bibliothek (HAB), and present the ODDs applied, the ways of creation, processing models, workflows, and the appearance of TEI data in various contexts.

The WDB, which began as a publication platform for digitised cultural heritage materials (as images), is about to be transformed into a general publication platform for complex digital objects such as digital editions, combining images, full texts of digitised (and OCRed) prints, and additional data on those digitised materials such as descriptions and structural metadata.

TEI plays an important role in this context as it is created and used in the WDB in various ways:

  • as born digital format, e.g. during manuscript description and for digital editions;
  • as automatically generated data during digitisation and OCR;
  • as result of transformations from various sources, including conversions from PDF, InDesign, Word; the resulting data is used as publication format to “populate” the WDB;
  • as storage format, boiled down to a standard encoding (“base format”);
  • as export format, especially towards repositories such as Europeana.
Data creation

The HAB is partner in various projects that produce TEI data in different ways:

  • There are in-house manuscript cataloguing projects that encode the descriptions directly in TEI, using the ODD and all materials provided by the previous MASTER and Europeana Regia projects (cf. http://diglib.hab.de/rules/documentation/). Other in-house projects prepare digital editions, again directly in TEI, for use and publication. The library has created a working group to set encoding standards for both kinds of materials, and it supports or oversees the creation (and publication) of that data.
  • With the WDB becoming more and more visible to others the library faces a rising number of requests to house externally prepared digital editions. Some base standards have to be set to match the needs of those externally prepared editions and their requests to be published within the WDB.
  • The conversion of formerly printed text into digital, structured full text is increasingly common. Within the library, both the modern works published by the library itself and historical prints, mainly from the 17th century and processed by OCR, are subject to this conversion. The resulting texts need a common basic encoding. Setting the level of this base encoding is the task of the AEDit project (cf. http://diglib.hab.de/?link=029).
Transformation

Data comes to the HAB in all forms: as Word, InDesign or PDF files, LaTeX-encoded text, XML in various flavours, and often also as TEI files. The problem with TEI files is that the encoding comes in various flavours and at different depths. All of these input formats have to be converted into a harmonised TEI format.

Publication

All TEI data are used for publishing. In the scope of the WDB, XML data are exposed both as the result of XSLT transformations into HTML and as source data that can be downloaded. For manuscript descriptions, the HAB runs a manuscript database that is implemented using eXist. Additionally, eXist serves as the search engine for the WDB.

The poster will also address interchange issues such as the use of TEI in combination with METS, and the mapping to and export towards ESE/EDM.

Discovery, and Dissemination

The creation of digital editions and the digital modelling of various data formats, semantic searches and the visualisation of data are issues that touch on basic problems common to the diverse disciplines within digital humanities. The HAB currently runs the project “Digital Humanities”, which will analyse cataloguing and indexing projects that rely on metadata and explore how current standards and ontologies can be used for modelling central entities such as persons, corporate bodies and places. If necessary, such standards will be customized, developed further and applied as test cases within some current HAB projects. The focus will rest on normalizing data in order to allow an exchange between the various projects of the partners involved and to enable integration into existing or future search engines.

All data available via the HAB’s OAI interface (http://dbs.hab.de/oai/wdb/) is released under a CC-BY-SA license (cf. http://diglib.hab.de/copyright.html). All data produced in-house is exposed in the WDB under the same license. Data produced by others and only published via the WDB may be subject to other rights declarations.

The poster will focus on two issues: the transformation of TEI into a base format that can easily be used within the WDB, and the question of exactly which TEI can be used in this way.

Bibliography

  • Stäcker, Thomas: Creating the Knowledge Site – elektronische Editionen als Aufgabe einer Forschungsbibliothek. In: Digitale Edition und Forschungsbibliothek. Ed. Christiane Fritze et al. Wiesbaden 2011, pp. 107–126 (Bibliothek und Wissenschaft, 44)

The Bibliotheca legum project

Medieval law is a research field of interest to historians, medievalists and legal scholars alike. For the distant past in particular, it is often quite difficult to determine what the applicable law actually was. The “Bibliotheca legum regni Francorum manuscripta” project (“Bl”) aspires to do so with a focus on the legal knowledge that was prevalent in Francia. All “leges” (secular law texts) that were copied during the Carolingian period are incorporated.

The website provides an introductory text to each “lex” including reading recommendations, as well as short descriptions of all codices containing these texts. Information on repository, origin and history of the manuscript, contents as well as bibliographical references etc. are given. At the moment there are 273 short descriptions available.

The aim of the Bl is to capture the current state of research, and also the research history, as completely as possible. Therefore a lot of effort was put into gathering this information. All prior studies concerning single manuscripts as well as several editions of the law texts were surveyed. For each manuscript, the datings and the assumptions about its origin made by the different describers are recorded. Thus the features of the various print editions are transposed into the electronic version.

Originally the information was gathered in an MS Word table, since it was prepared for internal use only. This had certain consequences for the procedure. The idea of making the data publicly available in digital form emerged in summer 2012, so the Bl is in its initial year of development. It is a work in progress and has not been officially launched yet. Although not all functionalities and information are available yet, it is accessible on the web. This was a deliberate decision to enable the public to follow the genesis and development of the project.

The Bl relies heavily on existing resources. With regard to the needs of academic research, it gathers all available digital images of the respective manuscript witnesses (e.g. from “Europeana”, “Gallica”) as well as catalogue information (e.g. “Manuscripta Mediaevalia”). The Bl can therefore be seen as a meta-catalogue and gateway to further resources. With kind permission of the “Monumenta Germaniae Historica” (MGH) it was possible to also integrate the complete text of the “Bibliotheca capitularium regum Francorum manuscripta. Überlieferung und Traditionszusammenhang der fränkischen Herrschererlasse” by Hubert Mordek (Munich 1995), which is the most comprehensive work on codices from the respective period. It is downloadable as a PDF not only in its totality of more than 1000 pages, but also as compilations of the pages on each single manuscript that Mordek described.

The encoding is carried out according to the TEI P5 standard. People and places are tagged and enhanced according to authority files such as VIAF or TGN to enable identification. WordPress is used as a CMS for data management and to provide basic functions. While this platform is very common on the World Wide Web, it is not widely adopted for Digital Humanities projects working with XML data. The XSLT processing of the XML files within WordPress as well as certain other features (multilingualism, viewers etc.) are realized by plugins. The Bl is published under a Creative Commons licence. XML source files are provided for all manuscripts and are freely available for download.

Features
  • The Bl is a multi-language site with interface and general information in German and English.
  • Manuscript descriptions can be reached via multiple browsing accesses (shelfmark, leges contained, date of origin, place of origin).
  • Full text and faceted search are included.
  • All resources within the Bl as well as the external ones are connected via inter-/hyperlinking.
  • Information is given on different levels to make this platform a useful tool for scholars, students and the interested public audience.
  • A comprehensive bibliography on the subject and indices on people, places as well as repositories facilitate further orientation and provide contextualization.
  • Each manuscript description is available as XML download.
  • Some prior studies and editions are integrated within a viewer and are also available as PDF downloads.
  • A blog (German / English) informs about the current state of development and related topics.

The presentation might be of interest to all those working in projects that evolve under similar conditions, namely

  • relatively small workforce (3),
  • no funding,
  • lack of previous experience in setting up a DH project from scratch, and
  • absence of a technical partner or “real” programming / web developing skills.

The poster will present the lessons learnt during the initial year of development. Emphasis will be on the use of TEI and TEI connected tools, the difficulties encountered and the compromises made. Also a comprehensive evaluation of WordPress as a CMS within the TEI/XML context is included.

Staff

Prof. Dr. Karl Ubl, Chair of Medieval History, Cologne University (Project Lead)
Dominik Trump (Data aggregation and text encoding)
Daniela Schulz (Technical Lead)

References

  • Hubert Mordek, Bibliotheca capitularium regum Francorum manuscripta. Überlieferung und Traditionszusammenhang der fränkischen Herrschererlasse (MGH Hilfsmittel 15), München 1995.
  • http://www.leges.uni-koeln.de
  • http://www.tei-c.org
  • http://www.europeana.eu/
  • http://gallica.bnf.fr/
  • http://www.manuscripta-mediaevalia.de
  • http://www.mgh.de/
  • http://www.wordpress.com
  • http://viaf.org/
  • http://www.getty.edu/research/tools/vocabularies/tgn/
  • http://www.dfg.de/download/pdf/dfg_im_profil/reden_stellungnahmen/download/handschriften.pdf

Digital edition, indexation of an estate, collaborations and data exchange – August Boeckh online

The August Boeckh project is one of the major research initiatives of the junior research group “Berlin intellectuals 1800–1830”, led by Dr. Anne Baillot at Humboldt University Berlin. The project focuses on the manuscripts of August Boeckh (1785–1867), who was one of the most important German classical philologists and a central figure in nineteenth-century Berlin. The Boeckh project can be seen as an example of collaboration between institutions, and of developing strategies to link the (meta-)data from libraries and archives with research results. We cooperate with archives and libraries such as the State Library Berlin and Humboldt University Library. Thus, the project is designed to be broad in scope, with many points of connection, and suitable for data exchange.

The key aspects considered for edition and interpretation are (a) the indexing of Boeckh’s literary estate for the August Boeckh Online Platform; (b) the edition of selected letters and reports from this estate as part of the digital edition “Letters and texts. Intellectual Berlin around 1800”; (c) the edition of Boeckh’s manuscript for his lecture “Encyklopädie und Methodologie der philologischen Wissenschaften”, a major work in the history of the classics; and (d) a virtual reconstruction of Boeckh’s personal library consisting of approximately 12,000 books. All these sub-projects help to reconstruct Boeckh’s horizon of knowledge and to gain insight into his scholarly work and understanding. With this poster, I want to concentrate on the first two aspects.

The August Boeckh Online Platform presents Boeckh’s extensive literary estate in a systematic overview which has been an acknowledged desideratum.10 The first step was the detailed indexing of each individual manuscript and letter by Boeckh in several Berlin archives and libraries, with a short summary of the content. In addition to these approximately 1500 entries in XML/TEI P5 format, up to 900 entries related to Boeckh from the Kalliope manuscript database in XML were imported.11 At this step, we encountered a problem of disparities in the level of indexing because the Kalliope entries are often based on boxes instead of single documents (one box of e.g. 50 letters versus one letter). Our aim is to complete this information to have rich metadata on every single document in Boeckh’s literary estate. Then, the data will be submitted to Kalliope, so that Kalliope benefits from our research results. At a later stage, the same process will involve data exchange with the Humboldt University Library for reconstructing Boeckh’s library.

The project is also overseeing the publication of selected letters and reports from the estate concerning Boeckh’s activities at the Berlin university, especially his philological seminar.12 These previously unpublished documents shed light on the development of the university and research policy in nineteenth-century Prussia, and are part of the digital edition “Letters and texts. Intellectual Berlin around 1800”. The edition centres around the main research question of intellectual networks in the Prussian capital city Berlin at the beginning of the nineteenth century, and publishes letters and work manuscripts by a selection of several authors.13 The connection with the Boeckh Platform is ensured by a specific XML/TEI P5 schema that is documented in our encoding guidelines.14 The indices play a central role and contain information from our several projects and constantly interlink them. As with the Boeckh Online Platform, our goal is to exchange data and link with other projects and institutions. Thus, the data architecture of the digital edition needs to be detailed as well as open.

The text of the manuscripts is presented in a diplomatic transcription and in an edited version,15 both generated from the same TEI P5 file. The encoding of letters posed some problems, as there is no letter-specific TEI module yet.16 In these cases, we consulted the SIG Correspondence and other digital editions, such as the Carl Maria von Weber – Collected Works.17 Viewing our transcription, the user can compare it with a facsimile of the manuscript as well as with the XML file containing the metadata and the encoding. The XML files are published under a CC-BY licence so that they can be re-used and enriched for further research. In the edition and the Boeckh project, authority files are used whenever possible (on the concepts of authority files and their use in scholarly editions, see Peter Stadler, “Normdateien in der Edition”, in: editio 26 (2012), pp. 174–183 [DOI: 10.1515/editio-2012-0013]), including the GND for the identification of persons (Integrated Authority File,18 via entries of the GND number in our index of persons), in collaboration with the Person Data Repository at the Berlin-Brandenburg Academy of Sciences;19 the use of ISO codes; persistent URLs (the collaboration with libraries is especially important in this regard because they are probably the only ones who can provide these URLs); individual IDs for each XML/TEI document, etc. In order to answer our main research questions of how intellectual networks were established, how transfer of knowledge took place and books were read or produced, and to reconstruct – and visualize – the dynamics of group relationships, people, places, works (e.g. books, articles), and groups/organisations are marked up. Via these indices the user can search the edition’s other corpora for people, works etc. that are also cited in the edited Boeckh manuscripts. When used in connection with the Boeckh Online Platform, the researchable context becomes even more comprehensive.
On both front ends, search results are shown for the edition as well as the platform and, thus, the manifold connections between the several corpora in the edition (i. e. the manuscripts) and the Platform (i. e. metadata on these and other manuscripts) are made manifest.

In this poster, I want to present the August Boeckh Online Platform and its connection to the digital edition “Letters and texts. Intellectual Berlin around 1800” in the many aspects offered by the manuscripts. I will demonstrate the workflow of the cooperations with the libraries and the wide range of documents that can be linked to the edition with the help of these connections. Furthermore, I will develop one example (the philological seminar) to show how research can benefit from such an approach.

Bibliography

  • Baillot, Anne, “August Boeckh – Nachlassprojekt” [http://tei.ibi.hu-berlin.de/boeckh].
  • Baillot, Anne; Seifert, Sabine, “The Project ‘Berlin Intellectuals 1800–1830’ between Research and Teaching”, in: Journal of the Text Encoding Initiative [Online] Issue 4 (March 2013) [http://jtei.revues.org/707 ; DOI : 10.4000/jtei.707].
  • Seifert, Sabine (ed.), “August Boeckh” [http://tei.ibi.hu-berlin.de/berliner-intellektuelle/author.pl?ref=p0178], in: Anne Baillot (ed.), “Letters and texts. Intellectual Berlin around 1800”, Humboldt University Berlin (Berlin, 2013) [http://tei.ibi.hu-berlin.de/berliner-intellektuelle/?language=en].
  • Seifert, Sabine, “August Boeckh und die Gründung des Berliner philologischen Seminars. Wissenschaftlerausbildung und Beziehungen zum Ministerium”, in: Hackel, Christiane; Seifert, Sabine (eds.), August Boeckh. Philologie, Hermeneutik und Wissenschaftspolitik (Berlin, 2013), pp. 159–178.
  • Sperberg-McQueen, C. M., “How to teach your edition how to swim”, in: LLC 24,1 (2009), pp. 27–39 [DOI: 10.1093/llc/fqn034].
  • Stadler, Peter, “Normdateien in der Edition”, in: editio 26 (2012), pp. 174–183 [DOI: 10.1515/editio-2012-0013].
  • Vanhoutte, Edward; Branden, Ron Van den, “Describing, transcribing, encoding, and editing modern correspondence material: a textbase approach”, in: LLC 24,1 (2009), pp. 77–98 [DOI: 10.1093/llc/fqn035].

‘Spectators’: Digital Edition as a tool for Literary Studies

The proposed poster presents the digital edition of about 30 Romance-language moral weeklies (‘Spectators’) as an example of how the TEI can be used for a project which has to deal with complex and overlapping text structures, a large corpus of texts and a data creation environment involving staff with no special training in XML and TEI.

‘Spectators’ are a journalistic genre which had its origins at the beginning of the 18th century in England and spread all over Europe. The genre became an important feature of the Enlightenment and distributed ethical values and moral concepts to a broad, urban readership. The objective of this digital edition (http://gams.uni-graz.at/mws) is both to edit the prominent Romance-language weeklies and to analyse the texts on narratological and thematic levels. Currently 1300 Spanish, French and Italian texts are provided. The collection is continuously being expanded. The project has been realized as a co-operation between the department for Romance Studies and the Center for Information Modeling at the University of Graz.

A characteristic feature of the text genre of ‘Moral Weeklies’ or ‘Spectators’ is the interruption of the text flow and the overlay of narrative structures that result from changes of actors and of real and fictional dimensions. One goal was a faithful reproduction of the individual issues with regard to text-logical units such as headings, paragraphs or quotes. A further requirement was to enrich the material with research results from the analysis of display planes and narrative forms (such as dialogue, letter or dream narrative) inside the texts. These considerations were formalized in a data model, based on TEI P5 XML, that meets the requirements for making the linguistic and narrative structures explicit.

A particular editorial challenge of this digital edition is posed by the overlapping structures that result from the text-logical units and narrative forms. The TEI provides different strategies to solve this problem. In this project we decided to use boundary marking with empty elements to mark the start and end points of levels of interpretation and narrative forms.
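Boundary marking with empty elements can be illustrated with a minimal Python sketch. The `<marker>` element and its `kind` and `span` attributes are hypothetical names chosen for this example and are not the project's actual encoding; the point is only that a narrative form crossing a paragraph boundary can be recovered by reading the document in order and collecting the text between a start marker and its matching end marker.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a letter that begins in one paragraph and ends in the next,
# so it cannot be wrapped in a single element without breaking the <p> hierarchy.
SAMPLE = """<body>
<p>Dear reader, <marker kind="start" span="letter1"/>Sir, I write to you</p>
<p>about a dream.<marker kind="end" span="letter1"/> So much for the letter.</p>
</body>"""


def collect_spans(xml_text: str) -> dict:
    """Return the text enclosed by each start/end marker pair, keyed by span id."""
    root = ET.fromstring(xml_text)
    open_spans = set()   # ids of spans that have started but not yet ended
    collected = {}       # span id -> accumulated text

    def add_text(text):
        # Append text to every span that is currently open.
        if text:
            for span_id in open_spans:
                collected[span_id] = collected.get(span_id, "") + text

    def walk(elem):
        if elem.tag == "marker":
            span_id = elem.get("span")
            if elem.get("kind") == "start":
                open_spans.add(span_id)
                collected.setdefault(span_id, "")
            else:
                open_spans.discard(span_id)
        add_text(elem.text)
        for child in elem:
            walk(child)
            add_text(child.tail)  # text following the child, in document order

    walk(root)
    # Normalise whitespace introduced by line breaks between paragraphs.
    return {span_id: " ".join(text.split()) for span_id, text in collected.items()}


if __name__ == "__main__":
    print(collect_spans(SAMPLE))
```

Because the markers are empty, the paragraph hierarchy stays well-formed while the narrative span is still recoverable by processing.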

For the implementation of the digital edition of the ‘Spectators’, an exemplary work flow was developed. In addition to the assessment of the material and the survey of the project objectives, this work flow includes a data acquisition scenario which supports the digital compilation and semantic analysis of the research data by the scholars: Based on the data model, a document template for a standard text processing program is created, which includes macros to transform the input into a TEI document.

A web-based Java client allows for the upload of documents into the repository, the Geisteswissenschaftliches Asset Management System (GAMS), which meets the requirements of the OAIS reference model. Based on the open source project FEDORA, this object-oriented digital archive allows the individual design and flexible adaptation of content types (‘content models’) tailored to the type and scope of the source material and to specific research interests. A ‘content model’ describes the structural components of a digital object, essentially consisting of a persistent identifier, metadata, content and disseminators.

More than 1300 texts from some 30 French, Italian and Spanish ‘Spectators’ have already been published in their original language using the methods outlined above. The data is available under a Creative Commons license. Moreover, the objects are integrated into the European search portal Europeana (http://www.europeana.eu). The user interface to the collection is multilingual.

Bibliography

  • Ertler, Klaus-Dieter (2012): “Moralische Wochenschriften”, in: Leibniz-Institut für Europäische Geschichte (IEG): Europäische Geschichte Online (EGO). Mainz 2012. http://www.ieg-ego.eu/ertlerk-2012-de
  • Hofmeister, Wernfried/Stigler, Hubert (2010): “Edition als Interface. Möglichkeiten der Semantisierung und Kontextualisierung von domänenspezifischem Fachwissen in einem Digitalen Archiv am Beispiel der XML-basierten ‘Augenfassung’ zur Hugo von Montfort-Edition”, in: Nutt-Kofoth, Rüdiger / Plachta, Bodo / Woesler, Winfried: editio. Internationales Jahrbuch für Editionswissenschaft. Berlin, New York: Walter de Gruyter, 79–95.
  • Lagoze, Carl/Payette, Sandy/ Shin, Edwin/Wilper, Chris (2006): “Fedora. An Architecture for Complex Objects and their Relationships”. http://arxiv.org/ftp/cs/papers/0501/0501012.pdf
  • Vasold, Gunter (2013): “Progressive Editionen als multidimensionale Informationsräume”, in: Ambrosio, Antonella / Barret, Sébastien / Vogeler, Georg: Digitale Diplomatik 2013. Tools for the Digital Diplomatist, Köln: Böhlau, in print.

Laundry Lists and Boarding Records: challenges in encoding “women’s work”

Introduction

In “Encoding Financial Records for Historical Research”, presented at this conference last year in Texas and slated for publication in an upcoming issue of the Journal of the Text Encoding Initiative, we noted a shortcoming of current TEI encoding methods for representing services, as opposed to commodities, when they are transferred or traded: ‘In many cases one of the ‘items’ being transferred is a service, not a commodity. Our current system, being based on the TEI <measure> element, seems a clumsy way to handle this. For example, measure unit="hours" quantity="2" commodity="babysitting" may be reasonable, but when the service being provided is recorded either by things on which it is performed or the people for whom it is provided, rather than the amount of the service that is provided, it becomes difficult to express formally using the current system.’ The ‘transactionography’ approach described in that paper relies on the TEI <measure> element to record the what of a transfer. (The when is recorded using the TEI att.datable attributes.) Many historical financial records (HFRs), however, include or are even primarily about the exchange of money for services (e.g., laundering, room and board, or domestic service). Since these services were more usually performed by women and often recorded by women, the study of these types of HFRs is of particular interest to practitioners of women’s history.

Sample Problems

The quintessential example of this problem occurs when trying to encode a ‘laundry list’. Such lists include a set of items of clothing and prices. But the price is not for purchasing the associated item of clothing, but for laundering it (which is often not explicitly stated).

While one might claim that the work of laundering is implied by the genre ‘laundry list,’ such generic information must be recorded somehow in order to be machine-readable. If we use the <list> element, the @type and @subtype attributes could be used to express that the costs listed are for laundering, not purchasing, but there is no agreed upon vocabulary with which to express this, and it may not generalize well to other services.

Many examples of such laundry lists are extant, and they can potentially provide information not only about period clothing and the habits of wearers, but also about the comparative value of laundering services in different regions and periods, and perhaps (with sufficient contextual information) about the relative cost (and therefore value) of the work of laundering in such various contexts as an individual laundress subcontracting with the keeper of a boarding house, an institutional laundry as a department of a hospital or hotel, or an industrial laundry serving individual or institutional clients.

In the case we will show in the poster, an individual laundress subcontracted with a boardinghouse keeper to perform the service of laundering clothing and household linens for people who also rented rooms and purchased meals at the boarding house. The laundry lists make up one set of documents that record exchanges of services for cash. They are supplemented by small notebooks in which the boardinghouse keeper tracked charges for food and such other necessities as candles and soap, as well as weekly payments for room and board. The boarder also kept receipts as her own record of the payments.

One Possible Solution

In our ‘transactionography’ we have heretofore used the TEI <measure> element, with its @quantity, @unit, and @commodity attributes, to represent that which is transferred from one person or account to another in a transaction. But in the laundry list case, the work performed by the laundress is not a “commodity” but a “service,” the service for which the boarder paid the boardinghouse keeper in this transaction. However, using the <measure> element with existing attributes leads to markup that fails to distinguish the purchase of a garment from paying for the service of laundering it. One possible solution is to add a new attribute, @service. Thus, for instance, a line from a laundry list might be marked up as follows:

<hfr:transaction>
  <hfr:transfer fra="people.xml#fearn" til="people.xml#EW">
    <measure quantity="2" unit="count" commodity="skirt"
             service="laundering">2 wool skirts</measure>
  </hfr:transfer>
  <hfr:transfer fra="people.xml#EW" til="people.xml#fearn">
    <measure quantity="6" unit="pence" commodity="currency">6</measure>
  </hfr:transfer>
</hfr:transaction>

This solution seems to have broad application. E.g.:

  • Framing: measure quantity="15" unit="count" commodity="8×10 color glossies" service="framing"
  • Shoe shining: measure quantity="2" unit="count" commodity="shoe" service="shining"
  • XSLT programming: measure quantity="18" unit="hours" service="programming"

We will not be surprised, however, if there are cases it does not handle well.
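The distinction that the proposed @service attribute is meant to enable can be illustrated with a minimal Python sketch. It is not part of the transactionography tooling: the namespace URI bound to the hfr: prefix is a placeholder (the abstract does not give one), the function name and output keys are invented for illustration, and <measure> is left without a namespace simply to mirror the sample above.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI: the abstract does not state one for "hfr:".
HFR_NS = "http://example.org/ns/hfr"

SAMPLE = f"""
<hfr:transaction xmlns:hfr="{HFR_NS}">
  <hfr:transfer fra="people.xml#fearn" til="people.xml#EW">
    <measure quantity="2" unit="count" commodity="skirt"
             service="laundering">2 wool skirts</measure>
  </hfr:transfer>
  <hfr:transfer fra="people.xml#EW" til="people.xml#fearn">
    <measure quantity="6" unit="pence" commodity="currency">6</measure>
  </hfr:transfer>
</hfr:transaction>
"""


def describe_transfers(xml_text):
    """Return one dict per <measure>, marking it as a service or a commodity."""
    root = ET.fromstring(xml_text.strip())
    results = []
    for transfer in root.iter(f"{{{HFR_NS}}}transfer"):
        for measure in transfer.iter("measure"):
            service = measure.get("service")
            results.append({
                "from": transfer.get("fra"),
                "to": transfer.get("til"),
                # A transfer counts as a service whenever @service is present.
                "kind": "service" if service else "commodity",
                "what": service or measure.get("commodity"),
                # For a service, @commodity names the thing it is performed on.
                "performed_on": measure.get("commodity") if service else None,
                "quantity": float(measure.get("quantity")),
                "unit": measure.get("unit"),
            })
    return results


if __name__ == "__main__":
    for row in describe_transfers(SAMPLE):
        print(row)
```

With the extra attribute, a query for laundering charges no longer has to guess from the genre of the document whether a skirt was bought or washed.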

A Broader Problem?

The issues presented by the laundry list example may be representative of a larger problem, that of indirect reference. Indirect reference was described in 2008 by the Women Writers Project (WWP). This phenomenon occurs when an author refers to one entity by naming another. In the WWP’s case a person is referred to via the name of another person, character, or figure. E.g., the headline of a 2007-05-31 article in the Toronto Star, ‘Terminator gunning to save lives’, refers to then-governor of California Arnold Schwarzenegger indirectly through a reference to a character he played in a well-known film. The WWP solution addresses this specific use case, <persName>:

… to represent the special nature of metaphorical or figurative references. … For this …, the WWP has created a custom attribute for <persName>, @wwp:metaRef. For practical purposes, @wwp:metaRef functions exactly like @ref; where @ref points to the unique @xml:id for the actual reference, however, @wwp:metaRef points to the @xml:id of the person being indirectly or figuratively referenced. For example —

Source text:

Come all ye tender Nymphs and sighing Swains,
Hear how our Thyrsis, Daphnis death complains

Encoded text:

<l>Come all ye tender Nymphs and sighing Swains,</l>
<l>Hear how our <persName ref="personography.xml#thyrsis.auc"
    wwp:metaRef="personography.xml#jfroud.jke">Thyrsis</persName>,
  <persName ref="personography.xml#daphnis.tvc"
    wwp:metaRef="personography.xml#tcreech.zxz">Daphnis</persName> death complains</l>

It occurs to us that these cases may not be very different. In the laundry list example, the work of laundering a skirt is referred to by reference to the skirt itself. In the Toronto Star example Arnold Schwarzenegger is referred to by reference to the character he played. Each is a case of indirect reference. It is interesting to contemplate a generic TEI mechanism for indirect reference that would handle both cases.

Conclusion

In this poster presentation we hope to frame the problem of encoding services within historical financial records, present at least one possible solution, and solicit input from the attendees of the TEI conference about the utility of our proposed solution, and about other possible encoding methodologies to solve this shortcoming. One goal is to come up with a methodology that might apply to other cases of what might be called indirect reference.

TEI/XML Editing for Everyone’s Needs

Project

The DFG-funded project Deutsches Textarchiv (DTA) started in 2007 and is located at the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW). Its goal is to digitize a large cross-section of German texts published between 1600 and 1900. The DTA presents almost exclusively the first editions of the respective works. Currently, the DTA core corpus consists of more than 840 texts, which were transcribed mostly by non-native speakers using the double keying method. In addition, the DTA hosts more than 520 further texts which were imported using the extension module DTAE. In total, the corpus consists of more than 380,000 text pages (June 2013).

The DTA provides linguistic applications for its corpus, i. e. serialization of tokens, lemmatization, POS tagging, lemma based search, and phonetic search based on rewrite rules for historic spelling. Each text in the DTA is encoded using the DTA base format (DTABf), a strict subset of TEI P5. The markup describes text structures (headlines, paragraphs, speakers, poem lines, index items etc.), as well as the physical layout of the text.

Quality assurance for all texts within the DTA corpora takes place within the quality assurance platform DTAQ. In DTAQ, texts may be proofread page by page in comparison to their source images. This way errors can be detected which may have occurred during the transcription and annotation process.

Problem Statement

DTAQ has been running since March 2011, and many tools have been developed which allow for various kinds of annotations to the digitized texts. DTAQ users should be enabled not only to review texts but also to correct erroneous transcriptions and annotations, or add new annotations. Within DTAQ, each text is presented page by page alongside its source images in various formats: XML, HTML, plain text etc. To produce this kind of view, the original TEI P5 documents are split into several single page documents. This process is reversible, so modified single page documents can be reinserted losslessly into the original TEI document. Based on this page-oriented view, DTAQ provides several ways to change documents on the transcription or annotation level.
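The idea of a reversible page split can be illustrated with a deliberately simplified Python sketch. It is not the DTAQ implementation: it works on the serialized text only, cuts at each <pb> milestone, and does not produce well-formed single page documents as DTAQ does. The function names and the sample string are hypothetical.

```python
import re

# Matches the start of a <pb> milestone such as <pb/> or <pb n="2"/>.
PB_START = re.compile(r"<pb\b")


def split_pages(tei_text):
    """Cut the serialized document before every <pb> milestone.

    Chunk 0 holds everything before the first page break; each later
    chunk starts with its own <pb> element. No character is dropped.
    """
    starts = [m.start() for m in PB_START.finditer(tei_text)]
    bounds = [0] + starts + [len(tei_text)]
    return [tei_text[a:b] for a, b in zip(bounds, bounds[1:])]


def join_pages(chunks):
    """Inverse of split_pages: concatenating the chunks restores the text."""
    return "".join(chunks)


if __name__ == "__main__":
    doc = ('<text><body><p>front</p><pb n="1"/><p>errorneous</p>'
           '<pb n="2"/><p>end</p></body></text>')
    pages = split_pages(doc)
    # Correct a transcription error on the first page only, then reinsert it.
    pages[1] = pages[1].replace("errorneous", "erroneous")
    print(join_pages(pages))
```

Because the chunks partition the original string, an unmodified round trip returns the document byte for byte, which is the property the page-oriented view depends on.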

We differentiate between several kinds of changes and user experience levels:

  • Changes to the text base (i. e. the plain transcribed text without any kind of markup).
  • Annotation of single tokens or groups of tokens, e. g. named entity annotation, annotation of printing errors etc.
  • Editing of attribute values in existing XML elements, e. g. the values of @ref in <persName> elements to provide links to authority files.
  • Editing of basic XML structures, e. g. adding quotation markup in citations (<cit>/<quote>/<bibl>).
  • Editing of more complex XML structures, e. g. restructuring of paragraphs or even chapters.

For some of these kinds of changes users may not even have to bother with XML markup; other changes require a deeper look into the complete XML document, e. g. if they occur across page breaks, or could produce overlapping hierarchies.

Even though comprehensive documentation is available for the DTABf, less experienced users (especially those with little if any previous knowledge of the XML standard) would have to spend significant amounts of time learning how to properly apply changes to the TEI documents on the level of transcription or annotation.

In addition, each change must be tracked within a revision history system in order to see (and moderate) which user changed which texts within the DTA repository.

Various Editing and Annotation Possibilities

To make easy changes easy and hard things possible, we provide several ways for users to deal with the digitized texts:

Instant WYSIWYG Editor

Simple changes, like fixing transcription errors, may be carried out directly within the rendered HTML version of a document page, using the @contenteditable="true" attribute (cf. W3C, HTML5), which is available in all modern browsers. This technique allows for real WYSIWYG (what you see is what you get), because it makes the generated HTML editable within the rendered view. The modified text node is sent back to the repository, where it replaces the text node of the original TEI document. Users cannot produce invalid markup and don’t even have to bother with angle brackets (cf. Cayless).

Screenshot Instant Editor

Simple Annotation Editor

To annotate simple phrases like named entities, no further knowledge of XML is needed. Just like in the correctors’ view, where users can proofread pages, mark erroneous passages with their mouse, and report the errors via a ticketing system, named entities can be marked and labeled as <persName> or <placeName>, and additional data like references to an authority file can be provided using the @ref attribute.

XML Editor for Single Pages

For mid-size changes on single pages, we provide an online XML editor. This tool is based on the Ajax.org Cloud9 Editor (ace). The editor window displays the syntax-highlighted XML for the corresponding text page. In addition, we provide several tools to support quick and efficient tagging (e. g. select a wrongly printed word like “errorneous”, press the “printing error” button, and an XML template like <choice><sic>errorneous</sic><corr></corr></choice> is inserted into the editor). The editor also provides validation against the DTABf schema (via AJAX requests).
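The “printing error” button described above amounts to wrapping the current selection in a fixed template. A minimal sketch of that step in Python follows. The DTAQ editor itself runs in the browser on top of ace, so this is only an illustration of the logic, and the function name and offsets are hypothetical.

```python
def wrap_printing_error(source, start, end):
    """Wrap source[start:end] in a <choice><sic>...</sic><corr/></choice> template.

    The selection is assumed to be plain text taken from the XML source
    shown in the editor; the empty <corr> is left for the user to fill in.
    """
    if not 0 <= start <= end <= len(source):
        raise ValueError("selection is outside the document")
    selected = source[start:end]
    template = f"<choice><sic>{selected}</sic><corr></corr></choice>"
    return source[:start] + template + source[end:]


if __name__ == "__main__":
    page = "<p>an errorneous word</p>"
    begin = page.index("errorneous")
    print(wrap_printing_error(page, begin, begin + len("errorneous")))
```

The result would still have to pass validation against the DTABf schema, as with any other change made in the editor.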

Screenshot XML Editor

DTA oXygen Framework

For larger changes, or even for starting new transcriptions from scratch, the DTA developed DTAoX, a framework for the widely used oXygen XML editor, which supports text editing in conformity with the DTABf within oXygen’s author mode. A fine-grained color scheme provides visualisations of different tagging levels (as well as of discrepancies with regard to the DTABf) to help produce DTABf-compatible TEI files. To apply changes to DTA documents, users have to download the whole TEI documents from the DTA repository, mark them as “locked” (to avoid conflicts with other changes), perform their intended changes, and upload the modified documents back into the repository.

Screenshot DTA oXygen Framework

Tracking Changes with git

Each time a change is submitted to the repository, the resulting document is validated against the DTA base format schema and rejected if the validation fails. Otherwise, the document gets an updated set of metadata (esp. with regard to timestamps and editor’s responsibilities) and is committed to a git repository. We chose git because, in contrast to other source control systems, it can deal adequately with huge XML files. Using a version control system is of course crucial, since every change needs to be reproducible and, if necessary, reversible.
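The validate-then-commit workflow can be sketched as follows. This is a minimal illustration under stated assumptions, not the DTA implementation: the default check only tests well-formedness with Python's standard library, which is far weaker than validation against the DTABf schema (a real setup would plug in a RELAX NG validator), the metadata update step is omitted, and all function names and the commit message format are hypothetical. Only the standard `git add` and `git commit` commands are used.

```python
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path


def is_well_formed(xml_text):
    """Stand-in validator: checks XML well-formedness only, not the DTABf schema."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def git_commit(repo_dir, rel_path, message):
    """Stage a single file and commit it with the standard git CLI."""
    subprocess.run(["git", "add", "--", rel_path], cwd=repo_dir, check=True)
    subprocess.run(["git", "commit", "-m", message, "--", rel_path],
                   cwd=repo_dir, check=True)


def submit_change(repo_dir, rel_path, xml_text, editor,
                  validate=is_well_formed, commit=git_commit):
    """Reject invalid documents; otherwise write the file and commit it.

    Returns True if the change was committed, False if it was rejected.
    Nothing is written to disk when validation fails.
    """
    if not validate(xml_text):
        return False
    Path(repo_dir, rel_path).write_text(xml_text, encoding="utf-8")
    commit(repo_dir, rel_path, f"Edit {rel_path} by {editor}")
    return True
```

Passing the validator and the commit step in as parameters keeps the sketch testable without a git installation and leaves room for a schema-aware validator.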

Availability

The DTABf documentation with a lot of illustrated examples from the DTA core corpus is freely available at the DTA website. RNG and ODD files are provided, as well as template files for starting a new transcription project.

DTAoX, the DTA oXygen framework, is freely available for download under the LGPL license.

In its third project phase (application is currently under appraisal by the DFG), the DTA project will provide the DTAQ quality assurance framework to a wider audience and make it open source under the LGPL license.

Poster Presentation and Live Demonstration

The poster will provide a detailed insight into the various text editing modes the DTA provides. Visitors will be able to try out the respective tools by themselves at the live presentation desk.

References

  1. Ajax.org Cloud9 Editor: http://ace.ajax.org.
  2. Cayless, Hugh: I Will Never NOT EVER Type an Angle Bracket (or IWNNETAAB for short). In: Scriptio Continua, 2011-01-06.
  3. Deutsches Textarchiv: Basis for a Reference Corpus for the New High German Language.
  4. DTA base format (DTABf).
  5. DTA oXygen framework (DTAoX).
  6. oXygen XML editor.
  7. git (distributed version control system).
  8. W3C: HTML5. A vocabulary and associated APIs for HTML and XHTML. W3C Working Draft 24 June 2010 (see also the latest editor draft of the HTML 5 specification).

Notes
  1. http://www.math-ling.org/e-index.html
  2. http://www.jinmoncom.jp/
  3. http://www.jsik.jp/?index-e
  4. http://www.jads.org/eng/index.html
  5. http://www.jadh.org/
  6. http://coe21.zinbun.kyoto-u.ac.jp/tei-day/
  7. Most of the information on the workshops is available on the JADH Web site (http://www.jadh.org). This series yielded a Web page, “A Simple Guide to TEI and oXygen (in Japanese)”, which is referred to in various related workshops in Japan (http://www.dhii.jp/nagasaki/blog/node/12).
  8. http://www.w3.org/TR/html5/text-level-semantics.html#the-ruby-element
  9. http://www.idpf.org/epub/30/spec/epub30-contentdocs.html
  10. Baillot, Anne, “August Boeckh – Nachlassprojekt” [http://tei.ibi.hu-berlin.de/boeckh/]. The platform will be publicly available in August 2013. Login: berlin, password: heidelberg.
  11. http://kalliope-portal.de/
  12. Seifert, Sabine (ed.), “August Boeckh”, in: Anne Baillot (ed.), “Letters and texts. Intellectual Berlin around 1800”, Humboldt University Berlin [in preparation, 2013] [http://tei.ibi.hu-berlin.de/berliner-intellektuelle/author.pl?ref=p0178]. On Boeckh’s founding of and directing the philological seminar, see Sabine Seifert, “August Boeckh und die Gründung des Berliner philologischen Seminars. Wissenschaftlerausbildung und Beziehungen zum Ministerium”, in: Christiane Hackel, Sabine Seifert (eds.), August Boeckh. Philologie, Hermeneutik und Wissenschaftspolitik (Berlin, 2013), pp. 159–178.
  13. For an introduction to this digital edition as well as its use in teaching, see Anne Baillot and Sabine Seifert, “The Project ‘Berlin Intellectuals 1800–1830’ between Research and Teaching”, in: Journal of the Text Encoding Initiative [Online] Issue 4 (March 2013) [http://jtei.revues.org/707; DOI: 10.4000/jtei.707].
  14. http://tei.ibi.hu-berlin.de/berliner-intellektuelle/encoding-guidelines.pdf
  15. On how technical possibilities and the mutability of digital presentation influence editing as well as editorial theory, see C. M. Sperberg-McQueen, “How to teach your edition how to swim”, in: LLC 24,1 (2009), pp. 27–39, esp. pp. 31–33 [DOI: 10.1093/llc/fqn034].
  16. On the treatment of correspondence in scholarly editing in general and on the problems of encoding correspondence in TEI, see Edward Vanhoutte, Ron Van den Branden, “Describing, transcribing, encoding, and editing modern correspondence material: a textbase approach”, in: LLC 24,1 (2009), pp. 77–98, esp. pp. 82–90 [DOI: 10.1093/llc/fqn035].
  17. http://www.weber-gesamtausgabe.de
  18. http://www.dnb.de/EN/Standardisierung/GND/gnd.html
  19. http://pdr.bbaw.de/english