06 August 2021

The world of work is changing rapidly. These changes present new opportunities to revisit the value of collective bargaining agreements, both as tools for protecting workers’ rights and as historical documents for understanding developments in industrial relations. Collective bargaining agreements (CBAs) are the results of negotiations between employers and unions to regulate the terms and conditions of employment. The WageIndicator Foundation provides the WageIndicator Collective Agreements Database, which includes 1600 collective agreements in 28 languages from more than 50 countries. This database, which was enriched within the SSHOC project, gives researchers the ability to read and compare the original texts of agreements by topic, at national and international levels. But as the amount of data available on collective bargaining grows, so too does the challenge of navigating and understanding it.


Using text mining to discover the secrets of CBAs

At the end of May, SSHOC, CLARIN and DARIAH supported the Helsinki Digital Humanities Hackathon 2021 (DHH21), a 10-day summer school organised by the University of Helsinki and Aalto University. The hackathon brought together researchers and students from computer science, data science, humanities and social sciences, who formed interdisciplinary teams and explored ways to solve concrete research questions. One of the groups, the CBAQuest team, led by Daniela Ceccon and Stefano Ceccon during the SSHOC workshop at DHH21, explored the feasibility of assessing the ‘worker-friendliness’ of collective bargaining agreements in order to find new ways of understanding agreements and to contribute to improving global labour market transparency. To this end, the team produced a prototype of a digital tool that offers visualised information about CBAs to anyone interested in the documents governing the lives of workers.

The worker friendliness of collective labour agreements was rated by considering the following measurements:

  • Equality is evaluated through the presence of clauses addressing 4 indicators that fall under the gender equality trigger: gender equality, discrimination, sexual harassment and grievance procedure;
  • Overtime and annual leave are evaluated by checking whether overtime is regulated, whether a travel allowance is provided, and whether the number of days of annual leave after 1 year of work is above the international standard of 15 working days;
  • Text accessibility is evaluated through 3 indicators (concreteness, readability and lexical density), which indicate how easy it is for workers to understand the contract.
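
As a rough illustration of how the presence of equality clauses could be checked, the following minimal sketch flags each of the four equality indicators by keyword matching. The keyword lists and the function names are hypothetical; the article does not describe the search terms or the matching method the team actually used.

```python
import re

# Hypothetical keyword lists, one per equality indicator.
# The team's real search terms are not given in the article.
EQUALITY_KEYWORDS = {
    "gender equality": ["gender equality", "equal pay"],
    "discrimination": ["discrimination"],
    "sexual harassment": ["sexual harassment"],
    "grievance procedure": ["grievance procedure", "complaints procedure"],
}


def equality_indicators(text):
    """Return a dict mapping each indicator to True if any keyword occurs."""
    lowered = text.lower()
    return {
        name: any(
            re.search(r"\b" + re.escape(keyword) + r"\b", lowered)
            for keyword in keywords
        )
        for name, keywords in EQUALITY_KEYWORDS.items()
    }


def equality_score(text):
    """Count how many of the four indicators are addressed (0 to 4)."""
    return sum(equality_indicators(text).values())


sample = (
    "The employer prohibits sexual harassment and any discrimination. "
    "A grievance procedure is available to all employees."
)
print(equality_indicators(sample))
print(equality_score(sample))
```

A presence check of this kind only records whether a topic is mentioned; it says nothing about how strong the clause is.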


The figure below shows the relationship between the total number of CBAs, the number of CBAs with equality-related indicators, and the number of CBAs that mention both equality-related triggers and procedures. The dataset analysed contains 1247 CBAs. Of these, 584 were found to be related to the indicators, meaning that they contain the equality-related triggers. Among those 584 CBAs, 101 were also found to mention procedure-related terms.
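
To put these counts in proportion, the shares can be worked out directly from the three figures reported above. The variable and function names are illustrative only.

```python
# Counts reported in the article.
total_cbas = 1247       # all CBAs analysed
with_triggers = 584     # CBAs containing equality-related triggers
with_procedures = 101   # of those, CBAs that also mention procedure terms


def share(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(share(with_triggers, total_cbas))       # 46.8
print(share(with_procedures, with_triggers))  # 17.3
print(share(with_procedures, total_cbas))     # 8.1
```

In other words, just under half of the CBAs contain equality-related triggers, and roughly one in six of those also mentions procedures.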



The next figure is a stacked bar chart which illustrates the contribution of each equality-measuring indicator to the overall score. Among other things, it can be observed that, overall, sexual harassment is more widely addressed in CBAs than the other three indicators: grievance procedure, gender equality and discrimination. The figure also shows the difference in overall scores between the 20 countries with the highest total score, which indicates that their CBAs include the most clauses addressing issues related to the 4 equality-measuring indicators. For example, it can be observed that Romanian CBAs are more equality-oriented than Slovak CBAs.



The map below shows the number of CBAs in the current dataset for each country. The map clearly indicates that many countries have only a small number of CBAs, which may make the current scores somewhat unrepresentative for countries with less data. This problem, however, could be addressed as the database continues to expand.




The formula to calculate the score for the overtime and annual leave measurement was applied to all available CBAs in the dataset. The bar chart below shows the average worker-friendliness scores of CBAs by country from the perspective of annual leave and overtime, for the 20 countries with the highest score.
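
The article does not spell out this formula. As a minimal sketch, assuming the three checks described earlier (overtime regulation, travel allowance, annual leave above 15 working days) are treated as equally weighted binary indicators, a per-agreement score could look like this. The class name, field names and the equal weighting are assumptions, not the team's confirmed method.

```python
from dataclasses import dataclass
from typing import Optional

# International standard cited in the article: 15 working days of annual leave.
ANNUAL_LEAVE_STANDARD_DAYS = 15


@dataclass
class LeaveOvertimeClauses:
    """Hypothetical record of what was extracted from one CBA."""
    has_overtime_regulation: bool
    has_travel_allowance: bool
    annual_leave_days: Optional[int]  # None if the CBA does not state it


def leave_overtime_score(clauses):
    """Score between 0 and 1; equal weighting is an assumption."""
    leave_above_standard = (
        clauses.annual_leave_days is not None
        and clauses.annual_leave_days > ANNUAL_LEAVE_STANDARD_DAYS
    )
    met = (
        int(clauses.has_overtime_regulation)
        + int(clauses.has_travel_allowance)
        + int(leave_above_standard)
    )
    return met / 3


example = LeaveOvertimeClauses(
    has_overtime_regulation=True,
    has_travel_allowance=False,
    annual_leave_days=20,
)
print(leave_overtime_score(example))
```

Under this sketch, a CBA that is silent on all three points scores 0, regardless of what national law already guarantees.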



It is clear that the current results do not give a perfect picture. As shown in the graph, some countries such as the UK and Belgium get lower average scores even though they are often perceived as countries with sophisticated systems protecting workers. There are several possible reasons for this observation.

Firstly, the absence of clauses relevant to overtime or annual leave does not always imply low worker friendliness. Some countries may already have clear laws and regulations covering these aspects, so CBAs have no need for sections dedicated to them. This is why the assumption that absence means worker unfriendliness is not always correct. It is important to realise, however, that the score indicates the worker friendliness of the CBAs themselves, without considering their context.

Secondly, the availability of CBAs in the dataset varies across countries. It can be difficult to collect CBAs from countries such as the UK for reasons such as privacy concerns, and a small and biased sample may also affect the accuracy of the final score.

Thirdly, there are also limitations regarding binary indicators. In the current scoring system, the indicators mostly concern the existence of certain clauses, but the existence of a clause says nothing about its quality and does not necessarily mean that the clause is enough to protect workers’ rights.



The text accessibility indicator is based on three different measures, namely concreteness, readability and lexical density, which were used to observe how easy it is for workers to understand CBAs. Concreteness refers to the proportion of abstract versus concrete words used in a text. Texts with relatively more concrete words are more accessible than texts with relatively more abstract words. While concreteness measures how often a text refers to concrete objects, the second measure, readability, estimates the years of education one would need to understand a piece of text. This is usually evaluated through the number of long words and long sentences. Finally, lexical density looks at language from a semantic perspective and is a measure of the number of different words that are used. The hypothesis is that for a text to be considered readable, it should require fewer years of education and should have low lexical density. The figure below shows the text accessibility scores for various languages, and it can be seen that Slovak CBAs are considerably more reader-friendly than Indonesian CBAs.
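
Two of these measures can be sketched with the standard library alone. In the sketch below, lexical density follows the article's description (distinct words relative to all words, a ratio often called the type-token ratio), and readability is approximated with the LIX index, which combines average sentence length with the share of words longer than six letters. LIX is used here only as a stand-in: the article does not name the readability formula the team used, and LIX does not directly return years of education. Concreteness is left out because it needs an external word-rating lexicon. All function names are illustrative.

```python
import re


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[^\W\d_]+", text.lower())


def split_sentences(text):
    """Naive sentence split on '.', '!' and '?'."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def lexical_density(text):
    """Distinct words divided by total words (as described in the article)."""
    words = tokenize(text)
    return len(set(words)) / len(words) if words else 0.0


def lix(text):
    """LIX readability index: words per sentence plus % of long words."""
    words = tokenize(text)
    sentences = split_sentences(text)
    if not words or not sentences:
        return 0.0
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / len(sentences) + 100 * long_words / len(words)


sample_text = "Workers get paid leave. Workers get overtime pay."
print(lexical_density(sample_text))  # 0.75
print(lix(sample_text))              # 41.5
```

Lower values on both measures would point to a more accessible text under the hypothesis stated above. Comparisons across languages should be read with care, since word length and sentence structure differ between languages.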




In summary, the project identified a number of ways that the ‘worker-friendliness’ of agreements might be measured, making use of text mining methods to analyse and score agreements on various indicators. By using and visualising these scores, the team has been able to find new methods of evaluating agreements at a glance, in ways that might facilitate understanding of these agreements for labour market researchers and workers in general. However, a number of challenges and limitations have been identified that invite further research into the secrets of collective bargaining agreements.


Written by Alfredo Aníbal Collado Ayub, Dario Del Fante, Fabiënne Reedijk, Katarzyna Jachymek, Kristina Kalinauskaite, Nadine Chambers, Osama Khalid, Rosanna Yingying Hu, Trudy Mensah, Ulysses Ko, Vasiliki Kokkala, Yiwen Xing, Zhiyuan Zhou.