06 August 2021

Social science and humanities research infrastructures provide a variety of resources and services that researchers can use and benefit from. However, these resources and tools are not always intuitive to use, and this is where librarians can apply their knowledge and skills to advise and guide researchers in choosing among the available options based on their research question.

At the LIBER 2021 Online Conference, organised in June, CLARIN and SSHOC offered a webinar that provided participating librarians with general guidance on how to successfully steer researchers through the process of producing or using highly encoded historical textual data.


The starting point

Francesca Frontini (CLARIN ERIC) presented the basic scenario that motivated the webinar: a researcher is passionate about a research question dealing with theatrical characters, but has very limited knowledge of digital sources and methods (e.g. TEI).

In order to help and guide the researcher in this case, the librarian can use the CLARIN Virtual Language Observatory (VLO) to:

  • find appropriate data,
  • get access to the source material, 
  • process the text with the Language Resource Switchboard (which offers tools for analysing the text), and
  • visually explore the text.

Getting to know TEI

Maria Eskevich (CLARIN ERIC) introduced the Text Encoding Initiative (TEI) and showed where librarians can find tools, resources, services, and various teaching materials. The default TEI structure consists of a header and a body containing the textual components. It is recommended that the header contain, at a minimum, bibliographical information (e.g. author, distributor, publisher) and that the body contain annotations such as names, dates, people and places.
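To make this structure concrete, the sketch below builds a hypothetical minimal TEI document (not one shown in the webinar) and inspects it with Python's standard-library XML tooling: bibliographic metadata lives in the header, while the annotated text lives in the body. The title, author and speech content are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical minimal TEI document: a header with bibliographic
# metadata and a body containing one annotated speech.
TEI_SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Le Malade imaginaire</title>
        <author>Molière</author>
      </titleStmt>
      <publicationStmt><publisher>Example Publisher</publisher></publicationStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <sp who="#argan">
        <speaker>ARGAN</speaker>
        <p>Trois et deux font cinq.</p>
      </sp>
    </body>
  </text>
</TEI>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(TEI_SAMPLE)

# Bibliographic information comes from the header ...
author = root.find(".//tei:titleStmt/tei:author", NS).text
# ... while the annotated speeches come from the body.
speakers = [s.text for s in root.findall(".//tei:speaker", NS)]

print(author)    # Molière
print(speakers)  # ['ARGAN']
```

The same queries work on any schema-valid TEI file, which is one reason consistent encoding matters so much downstream.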

Alternatively, there are already SSHOC workflows that explain in detail how to create a TEI-based corpus. In addition, a researcher can use corpora and collections in different archives (e.g. DraCor, the Drama Corpora Project, and OBVIL's corpus Molière) that are already annotated and available in TEI format.



The hands-on session

The second part of the workshop was dedicated to a hands-on session, in which the participants were able to test the methods, resources and tools on offer. The use case – Intertextuality phenomena in European drama history – was an interesting research problem because it required analysing the literary language of individual dramas at their respective historical stage of the language, as well as a comparative literary analysis. The main challenges in performing such a study are the sheer volume of material, which cannot be processed manually, multilingualism, and the absence of annotation for parts of the collections.

The sample data, based on a corpus of theatrical play texts from the 17th and 18th centuries, is available in three languages (English, French and Spanish). One of the issues encountered is the inconsistency of available formats: documents may be available as TEI-XML but not follow any valid schema, may be encoded in proprietary formats, or may exist only as plain text files. To fix these issues and normalise the corpus formats, XSLT (Extensible Stylesheet Language Transformations) and Python scripts are used to clean up different parts of the corpus.
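As a flavour of what such clean-up scripts do, here is a small hypothetical Python sketch (not from the workshop materials) for the plain-text case. It assumes a common print convention in which each speech begins with an upper-case speaker name followed by a period, e.g. "ARGAN. Trois et deux font cinq." Real corpora need corpus-specific rules, and the regular expression below is only an illustration:

```python
import re

# Assumed plain-text convention: "SPEAKER. Speech text..." on each line.
SPEECH_LINE = re.compile(r"^([A-ZÉÈ' -]+)\.\s+(.*)$")

def parse_plain_text(raw: str):
    """Split raw play text into (speaker, speech) pairs,
    skipping lines that do not look like speeches."""
    pairs = []
    for line in raw.splitlines():
        m = SPEECH_LINE.match(line.strip())
        if m:
            pairs.append((m.group(1).strip(), m.group(2)))
    return pairs

raw = "ARGAN. Trois et deux font cinq.\nTOINETTE. Monsieur...\n"
pairs = parse_plain_text(raw)
print(pairs)
# [('ARGAN', 'Trois et deux font cinq.'), ('TOINETTE', 'Monsieur...')]
```

Once every source has been reduced to such structured pairs (or, better, to valid TEI), the three language sub-corpora can be processed with the same downstream tools.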

In the hands-on session, the participants learned how to extract the spoken text of two literary characters (the master and the servant) from the sample data into a single plain-text format. The extraction steps included:

  • The first step:  finding annotated data via an aggregator (e.g. VLO, SSH Open Marketplace) and then downloading it from the original source of the data collection.
  • The second step: finding workflows with scripts, processing examples and compatible tools (via, for example, the SSH Open Marketplace or the CLARIN LRS). The tedious work is done mostly by the scripts, so the user only needs to issue a few commands to complete the process. This workflow has 16 steps in total, and the accompanying documentation includes all the details needed to run each element of the workflow successfully.
  • The last step: data processing. This can be done offline (by following the instructions provided in a workflow) or online (e.g. via the Language Resource Switchboard).
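The core of the processing step can be sketched in a few lines of Python. The fragment below is a hypothetical illustration, not the workshop's actual 16-step workflow: it uses the standard TEI drama elements (<sp> with a who attribute, <speaker>, <p>) to collect the speeches of two characters, here given the invented ids #master and #servant:

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical TEI fragment with speeches attributed via sp/@who.
PLAY = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <sp who="#master"><speaker>MASTER</speaker><p>Fetch my coat.</p></sp>
  <sp who="#servant"><speaker>SERVANT</speaker><p>At once, sir.</p></sp>
  <sp who="#guest"><speaker>GUEST</speaker><p>Good evening.</p></sp>
</body></text></TEI>"""

def extract_speech(xml_text, character_ids):
    """Collect the spoken text of the given characters,
    keyed by the id used in each sp element's who attribute."""
    root = ET.fromstring(xml_text)
    lines = {cid: [] for cid in character_ids}
    for sp in root.findall(".//tei:sp", NS):
        who = sp.get("who")
        if who in lines:
            for p in sp.findall("tei:p", NS):
                lines[who].append("".join(p.itertext()))
    return lines

result = extract_speech(PLAY, ["#master", "#servant"])
print(result["#master"])   # ['Fetch my coat.']
print(result["#servant"])  # ['At once, sir.']
```

The output can then be written out as plain text per character, which is exactly the shape needed for the comparative analysis described above.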

Want to know more?

Many additional useful insights were shared throughout the webinar, so we welcome you to watch the recording and view the presentation slides.

Also, follow other SSHOC training events and join us this autumn for a workshop on Data Protection and the GDPR or a webinar on Data Citation.