SSHOC (Task 5.2) is developing a data repository service for SSH institutions. The new service is built upon the Dataverse software and will be adjusted to the needs of the European research infrastructures. Dataverse is an open source, community-driven software platform that allows integrations with other data services such as DataCite or ROpenScience. Its modular, API-based design allows for distributed file storage and supports the building of further microservices on top.
We would like to discuss with potentially interested Service Providers what their ideas are about such a service.
The webinar starts with a presentation of the current functionality, followed by a presentation of new features to be developed. After these presentations we will collect input from the audience. The discussion will focus on essential requirements for such a service, preferences, organisation, and necessary training.
The webinar will be chaired by Marion Wittenberg, service manager of DataverseNL at Data Archiving and Networked Services (DANS), together with her colleagues Laura Huis in ‘t Veld, functional manager, and Vyacheslav Tykhonov, Information Scientist.
The event is intended for staff of interested Service Providers from CESSDA.
Participation is, in principle, by invitation only. Please contact marion.wittenberg[at]dans.knaw.nl if you would like to join.
The outcomes from the webinar discussion are presented below in Q&A format.
Q: Is there support for material with disclosure concerns, or is it only useful for anonymous data?
A: At the moment, Dataverse is not meant for storing privacy-sensitive data; it will be available for anonymized and pseudonymized data. However, there is a Global Dataverse working group that is looking into possible solutions for storing sensitive data.
Q: How secure would a Dataverse setup on EOSC cloud be?
A: The service that will be developed is not meant for storing privacy-sensitive data; it will be available for anonymized and pseudonymized data. For those data it will be a secure service.
Q: Data Producers will presumably have concerns about their data being hosted in the cloud, internationally, by a third party (for example, for a CESSDA Service Provider). Will this be explored as part of the SSHOC Project?
A: In the next release of the Dataverse software (version 4.20) it will be possible to configure Dataverse to store files in more than one place at the same time (multiple file, s3, and/or swift stores). This could be a possible way to address such concerns.
Q: Can you explain the user restrictions options more, also which planned developments there may be for levels of permission?
A: There are a lot of roles and associated permissions within the current software of Dataverse for users to publish and share data. See: http://guides.dataverse.org/en/latest/user/dataset-management.html#id40. When a dataset is published, the metadata is always open access. It is possible to restrict access to files for end users and add a request access button. At the moment it is not in the planning of the SSHOC project to change anything concerning user restrictions.
Q: A question about PDF Previewer: Which impact do file restrictions have?
A: All data viewers are only available for unrestricted files. This applies to the PDF viewer, the Spreadsheet viewer, and others. There is no need for concerns about accessing or viewing restricted data.
Q: What are the main functionalities of Dataverse and what are the main differences to other repository tools? Why should we implement Dataverse instead of other open source solutions such as NADA?
A: You can find a list of main dataverse functionalities here: https://dataverse.org/software-features
There are many open source repositories to choose from, but Dataverse scores high on being FAIR compliant. See also this article: https://www.biorxiv.org/content/10.1101/418376v2.full
You can also check this comparative list of software for data sharing:
Q: Can we access usage statistics?
A: It’s possible to collect user statistics. It is also possible to measure statistics at sub-dataverse level. A Python application called Miniverse is available for this: https://github.com/IQSS/miniverse
Q: As Dataverse administrators, do you often get emails from researchers about e.g. access to this or that dataset, that should have been directed to the authors directly and you must redirect them?
A: Dataverse has a whole set of roles and permissions in which you can define who should get emails and access requests. Access requests will be sent to users with the admin or curator role for the relevant dataset. You can also set a contact address for each dataset.
Q: How do you manage requests for embargos?
A: There is no embargo function in Dataverse. However, you can upload your data without publishing it. When the embargo period is over, you will have to publish your dataset manually in order to make it findable for others.
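As a rough illustration of the manual publishing step, the sketch below builds (but does not send) a publish request once an embargo date has passed. It assumes the native API publish endpoint and the `X-Dataverse-key` header as described in the Dataverse API guide; the function name, server URL, token and dates are hypothetical placeholders, and the embargo date is tracked outside Dataverse.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_publish_request(server_url: str, persistent_id: str, api_token: str,
                          embargo_end: date,
                          today: Optional[date] = None) -> Optional[Request]:
    """Return a publish request once the embargo has ended, otherwise None."""
    today = today or date.today()
    if today < embargo_end:
        # Still under embargo: the dataset stays an unpublished draft.
        return None
    query = urlencode({"persistentId": persistent_id, "type": "major"})
    url = (f"{server_url.rstrip('/')}"
           f"/api/datasets/:persistentId/actions/:publish?{query}")
    # The request is only constructed here; pass it to
    # urllib.request.urlopen() to actually send it.
    return Request(url, method="POST",
                   headers={"X-Dataverse-key": api_token})
```

A scheduled job could call this function daily with the agreed embargo date and send the request when it is no longer `None`; whether that fits a given workflow would need to be checked locally.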
Q: What about long-term preservation? Are files stored in long-term preservation formats?
A: Dataverse doesn’t change the file format at the moment. It is the responsibility of the repository or data manager to make sure that the appropriate sustainable file formats are being uploaded in the system.
In addition, the way Dataverse handles datasets is very sustainable. A published dataset normally cannot be removed, because there is a persistent identifier attached to it. In exceptional cases you can remove the dataset, but a tombstone will be set in place to inform where the data has been moved to.
Q: What version of DDI does Dataverse support?
A: Dataverse metadata is compliant with DDI Lite and DDI 2.5 Codebook.
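To see what the DDI output looks like for a published dataset, the metadata export endpoint of the native API can be used. The sketch below only builds the URL; it assumes the `ddi` exporter name from the Dataverse API guide, and the function name, server URL and DOI are hypothetical placeholders.

```python
from urllib.parse import urlencode


def ddi_export_url(server_url: str, persistent_id: str) -> str:
    """Build the URL that returns a published dataset's metadata as DDI."""
    query = urlencode({"exporter": "ddi", "persistentId": persistent_id})
    return f"{server_url.rstrip('/')}/api/datasets/export?{query}"


# Example with placeholder values; open the resulting URL in a browser
# or fetch it with urllib.request.urlopen().
url = ddi_export_url("https://demo.example.org", "doi:10.5072/FK2/ABC")
```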
Q: Does Dataverse allow for variable-level documentation (including the possibility for visualizing variable frequencies)?
A: At the moment, as far as we know, there is no plan to develop variable-level documentation support as metadata fields in the UI. However, Scholars Portal developed an external application called DDI Explorer, see https://dataverse.scholarsportal.info/dataverse.xhtml
Q: Is it envisaged to develop a possibility of performing online data analysis?
A: There is an external application called DDI Explorer, developed by Scholars Portal, which allows some basic data analysis. It is open source and can be developed further if there is a need for it in the CESSDA community.
Demo of DDI Explorer. See https://dataverse.scholarsportal.info/dataverse.xhtml
Q: Could dataverse be extended to support variable search for quantitative data?
A: Variables are not part of the metadata in Dataverse. However, if you enable full text search for your dataverse installation, you can search inside the variables stored in the data.
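As an illustration, a search of this kind could also be issued through the Search API. The sketch below builds a query URL limited to files; it assumes the `q`, `type` and `per_page` parameters of the Search API as documented in the Dataverse API guide, and the function name, server URL and search term are hypothetical. Whether variable content is matched depends on how indexing is configured for the installation.

```python
from urllib.parse import urlencode


def variable_search_url(server_url: str, term: str, per_page: int = 10) -> str:
    """Build a Search API URL that looks for a term in file-level results."""
    params = [
        ("q", term),          # the search term, e.g. a variable name or label
        ("type", "file"),     # restrict results to files rather than datasets
        ("per_page", str(per_page)),
    ]
    return f"{server_url.rstrip('/')}/api/search?{urlencode(params)}"


# Example with placeholder values.
url = variable_search_url("https://demo.example.org", "income level")
```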
Q: Is it possible to customize Dataverse to a specific layout (e.g., specific filters to look for datasets)?
A: It is possible to configure at dataverse level which search facets you would like to show (based on the available metadata fields).
There is also functionality available in the latest versions to customize the front page of Dataverse. You can also add a custom footer for your (sub)dataverses. See also: http://guides.dataverse.org/en/latest/user/dataverse-management.html#id11
Q: How does Dataverse integrate the user interface into an existing/ new website for an archive?
A: If you have your own Dataverse installation, you can make your own homepage for your Dataverse instance. You can link from the archive website to the Dataverse website (see also the section about Branding your Installation in the Dataverse User Guide: http://guides.dataverse.org/en/latest/installation/config.html#id90).
It is also possible to use a widget: copy and paste a few lines of code from Dataverse into your web page and it will appear on your website. This is also possible if you make use of the central CESSDA Dataverse.
Q: What does the Dataverse metadata schema look like?
A: The original Dataverse metadata schema is managed by Harvard IQSS and is available as a Google Spreadsheet document: https://docs.google.com/spreadsheets/d/13HP-jI_cwLDHBetn9UKTREPJ_F4iHdAvhjmlvmYdSSw/edit#gid=0
Q: You are currently working on adding support for controlled vocabularies. Does your solution leave the option open to let users enter free text instead of a term from a vocabulary?
A: Yes, we do not change this functionality, you can put free text in the fields if you want to.
Q: How do you add custom metadata fields, by configuration or by programming?
A: You can configure the metadata fields at dataverse level. You can choose from different standard metadata blocks (see this list: http://guides.dataverse.org/en/latest/user/appendix.html). If your fields are not in the standard blocks, you will have to create a custom metadata block by creating a .tsv file and loading it into your Dataverse installation. The latter requires some technical knowledge.
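To give an idea of what such a .tsv file contains, the sketch below generates a minimal custom metadata block with Python. The column layout follows the `#metadataBlock` and `#datasetField` sections of the standard block files as we understand them; the exact columns can differ between Dataverse versions, so compare with a block file shipped with your version. The block name, field definition and default flag values are hypothetical.

```python
import csv
import io

BLOCK_HEADER = ["#metadataBlock", "name", "dataverseAlias", "displayName"]
FIELD_HEADER = [
    "#datasetField", "name", "title", "description", "watermark", "fieldType",
    "displayOrder", "displayFormat", "advancedSearchField",
    "allowControlledVocabulary", "allowmultiples", "facetable",
    "displayoncreate", "required", "parent", "metadatablock_id",
]


def build_metadata_block_tsv(block_name, display_name, fields):
    """Return the TSV text for a metadata block with simple, flat fields."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(BLOCK_HEADER)
    # Data rows start with an empty first column, below the '#...' header cell.
    writer.writerow(["", block_name, "", display_name])
    writer.writerow(FIELD_HEADER)
    for order, field in enumerate(fields):
        writer.writerow([
            "", field["name"], field["title"], field.get("description", ""),
            "",                               # watermark (placeholder text)
            field.get("fieldType", "text"),
            order,                            # displayOrder
            "",                               # displayFormat
            "FALSE", "FALSE", "FALSE", "FALSE",  # search/vocabulary/multiple/facet
            "TRUE",                           # displayoncreate
            "FALSE",                          # required
            "",                               # parent (none: top-level field)
            block_name,
        ])
    return out.getvalue()


tsv_text = build_metadata_block_tsv(
    "cessdaDemo", "CESSDA Demo Block",
    [{"name": "studyLanguage", "title": "Study Language",
      "description": "Language used to document the study."}],
)
```

The resulting text can be saved to a file and loaded by an administrator through the admin API of the installation; the search index schema usually has to be updated afterwards so that the new fields become searchable.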
Q: Is Dataverse compliant with CESSDA CMM after the controlled vocabulary support has been developed?
A: Dataverse metadata is compliant with DDI 2.5. We are currently working on creating a custom metadata block that can be used by Service Providers to be compliant with CMM. We will also provide instructions on how to install this block.
Q: How is language handled in Dataverse? CESSDA SPs use a variety of languages when documenting studies.
A: The SSHOC Dataverse project is developing a service called Weblate to handle all languages properly and with the help of the community. A beta version of the Weblate service is available here: http://weblate.dataverse.org
Q: Where are translations of the Dataverse User Interface stored?
A: All translations for Dataverse are stored within this GitHub repository: https://github.com/GlobalDataverseCommunityConsortium/dataverse-language-packs
Q: Are the translations only available for the user interface, or also for the metadata fields?
A: The metadata fields can also be translated, but the values typed into these fields cannot. They will remain in the language that the depositor has used.
Q: Will Dataverse be installed on EOSC?
A: At the moment we have installed Dataverse on the CESSDA Google cloud. We use this installation as a development instance (staging server). We are investigating whether such a central installation as a production server will be the best solution for CESSDA Service Providers (SPs), or whether SPs prefer to have their own installation running in their own environment. For this second option, it will be possible to download the Dataverse version adapted to the CESSDA requirements from the SSHOC marketplace of tools and install it yourself, as an "archive in a box" solution.
Q: What are the pros and cons of having your own installation vs. a CESSDA-hosted one?
A: In the central CESSDA installation you are not responsible for the technical infrastructure. That will be covered centrally. Every institute will have its own dataverse, but the central installation will impose some functionality that needs to be used by everyone. We will investigate whether we can allow some variation to meet the needs of different institutions (for example, different PIDs).
For your own installation you need technical staff to maintain it, but you have more flexibility to adjust it to your own preferences.
Q: A hosted solution assumes that a help desk and Service Level Agreement (SLA) be put into place; will this be part of the planning through 2022?
A: We will first need to investigate the wishes of the Service Providers, and then we will have to decide what kind of policies are needed. But an SLA and a help desk will be among the requirements of a central service. This central service itself won’t be part of the SSHOC project, only the preparations for such a service.
Q: What is the timeline for production of self-archiving? When can it be used by researchers to deposit data?
A: In the Dataverse CESSDA project we have developed a Docker module that allows Dataverse to be built and installed automatically in case an institution decides to run a self-archiving solution (“archive in a box”). However, it is not a production-ready setup, as production-ready settings such as a PID provider, federated authentication and storage still need to be added.
Currently we are developing a production-ready infrastructure in the SSHOC Dataverse project that will make it possible to get Dataverse up and running on any Kubernetes-based cloud provider. This will be available at the end of 2021.
Q: What is the time-lag before features are back in the main branch of Dataverse?
A: At the moment there is quite a long delay between the time when new functionality is contributed to Dataverse and the time it becomes available in the master branch. The reason is that Dataverse is used as a data repository in more than 50 different institutions, so the Dataverse team wants to be in control and to be completely sure that new features will not break basic functionality. The testing process is complicated and long.
Q: We would ideally like to see a support model that provides hosting services appropriate to anonymous microdata files, configurable so that DOI assignments can be made using local DOI accounts.
A: It should not be a problem to publish microdata files; Dataverse is well suited to that. However, there is no support for multiple DOI prefixes in one Dataverse instance.
Q: We would require appropriate access management mechanisms to enable a remote administrator at ISSDA to approve access to specific resources, for specific periods of time.
A: Dataverse has a role and permissions mechanism that can be tested to verify if it’s suitable for your needs.