Using corpora for implementing validation: SSHOC masterclass on workflows that combine quantity and quality

Date: 07 December 2019

The mantra by Grimmer and Stewart to “validate, validate, validate” is well known in the social sciences, but how can a researcher strike an ideal balance between rigor and efficiency?

THE NEED FOR VALIDATION

At the CLARIN Annual Conference 2019 in Leipzig, SSHOC partners organised a masterclass for political and social scientists with an interest in using large text collections in their research. This event contributed to two major SSHOC objectives: developing relevant and applicable tools for specific user communities and empowering those communities to actively use such tools. The masterclass addressed the challenges that political and social scientists encounter when they need to validate findings obtained through quantitative analysis of text corpora.

The masterclass was offered by Prof. Dr. Andreas Blätte, head of the PolMine project and developer of the polmineR R-package, and Christoph Leonhardt (both University of Duisburg-Essen). They presented common research strategies, talked about why implementing validation remains a technological frontier, mapped out various validation requirements and offered suggestions on how to satisfy the need for validation.

QUANLIFICATION – HOW TO VALIDATE FINDINGS OF CORPUS ANALYSIS

Andreas Blätte elaborated on the need to integrate quantitative and qualitative approaches to corpus analysis, and suggested that the combination of the two be described by a new term: quanlification. Although validation by quanlification is needed to achieve valid and sound research results, Blätte noted that such validation is inhibited by technical restrictions. A set of scenarios and workflows implemented with his polmineR R-package was therefore presented as a potential way forward. The workflows covered counting, co-occurrence analysis, sentiment analysis, text classification, and Latent Dirichlet Allocation (LDA) topic modelling.

Given the various disciplinary backgrounds of the attendees – ranging from computer science to the humanities – these workflows were introduced with a focus on validating their output rather than on producing code. However, participants were given ample opportunity to experiment with the polmineR R-package in order to gain experience in implementing validation strategies.

VALIDATION – A NEVER-ENDING STORY

In the course of the day, the participants discussed the possibilities and limits of validation intensively. A shared understanding emerged that integrating quantitative and qualitative approaches to corpus analysis is central to these endeavours. Validating algorithmically derived findings against the original text is necessary to gain fuller insight into both the data and what a method actually measures, and thereby to ensure intersubjective and valid research.

So, when counting words, their contexts have to be taken into account. When calculating co-occurrences, the output should be filtered according to the actual semantic meaning of the co-occurring words. Sentiment analyses should take into account the complexity and ambiguity of human speech and therefore be evaluated carefully. And machine learning approaches need to be checked by looking back at the initial data.
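To illustrate the first of these points, here is a minimal sketch in plain Python of how a frequency count can be paired with a keyword-in-context (KWIC) view, so that each counted hit can be read in its surroundings. It is not the polmineR API and not a workflow from the masterclass: the function names `tokenize` and `kwic`, the `window` parameter, and the three-sentence toy corpus are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def kwic(tokens, query, window=3):
    """Return (left context, hit, right context) for each occurrence of query."""
    lines = []
    for i, token in enumerate(tokens):
        if token == query:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


# Toy corpus, invented for illustration only (hypothetical data).
corpus = (
    "The minister praised the integration of refugees. "
    "Numerical integration is a topic in applied mathematics. "
    "The debate on integration policy continued."
)

tokens = tokenize(corpus)
counts = Counter(tokens)                      # quantitative step: frequencies
hits = kwic(tokens, "integration", window=3)  # qualitative step: read the hits

print("frequency:", counts["integration"])
for left, hit, right in hits:
    print(f"{left:>30} | {hit} | {right}")
```

In this toy example the count alone suggests three comparable mentions of "integration", while the concordance lines show that one of them belongs to a mathematical context. Inspecting the hits in this way is the kind of check that a bare frequency cannot provide.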

A TOOL THAT MAKES VALIDATION ENJOYABLE

The polmineR R-package is a tool with the philosophy of quanlification at its core. It offers both qualitative and quantitative approaches to corpus analysis and always allows the full text to be reconstructed. The discussion at the end of the session offered a great opportunity to elaborate on the package’s design by presenting workflows that live up to these standards.

DO YOU WANT TO SEE FOR YOURSELF?

In an upcoming webinar planned for spring 2020, Andreas Blätte will present the potential of polmineR for quanlification. If you are struggling to implement validation for results drawn from large text collections, or simply want to try out a new tool, sign up for the SSHOC newsletter and be the first to know about the webinar and other SSHOC activities.

LINKS TO EVENT MATERIALS

Presentation slides

Photo: Andreas Blätte talking about text analysis
