In this section, I discuss four critical aspects embedded in this aim: 1) sensitive data, 2) replication, 3) archive, and 4) social and digital media data. ("Large scale" is also part of the aim but was already reviewed in the previous section.)
Support for Sensitive Data:
The repositories mentioned in the previous section, and other domain-specific repositories in biomedical areas, have worked or are working towards solutions to support sensitive data. ICPSR provides data enclaves to store, access, and work on sensitive data. Harvard Dataverse and Odum Dataverse are working on supporting tiered access to sensitive data by integrating with the DataTags system (Sweeney, Crosas, and Bar-Sinai 2015), described below.
DataTags: The DataTags system defines six levels of access and security requirements. Each dataset (or data file) in a repository is assigned one of six datatag levels: blue, green, yellow, orange, red, or crimson. The repository guarantees that the access and security requirements corresponding to each level are applied appropriately. This approach facilitates sharing sensitive datasets by standardizing the restrictions placed on the data, from least to most restrictive. The blue datatag applies to open or public data: a user can access the data without registering or agreeing to a DUA, and the data can be transferred and stored without encryption. The green datatag applies to a dataset with no substantial restrictions or sensitive information, but users must register to access it. A green datatag would be applied, for example, to data that have been de-identified but still carry a risk of re-identification; by capturing information about the data user, the data provider keeps a record of who accessed the data in case they are re-identified in the future. The yellow datatag applies to restricted data, for which users must be granted permission to access the data; a DUA applies, but it can be accepted through a simple click-through. For data with an orange datatag, by contrast, the user needs a signed DUA to access the data; usually, the institution or organization representing the user and the organization representing the data provider must reach a mutual agreement before signing. The red datatag carries the same requirements as orange, but access to the data additionally requires two-factor authentication. Most HIPAA and FERPA data would be assigned a red datatag. The crimson datatag, the most restrictive level, is not relevant for this white paper; in most cases, data at this level must be accessed outside any network.
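The six levels above amount to a lookup from tag to handling requirements. As an illustration only (the DataTags specification, not this sketch, defines the authoritative requirements), the levels can be modeled as a small table:

```python
from dataclasses import dataclass
from enum import Enum

class DataTag(Enum):
    """The six DataTags levels, ordered from least to most restrictive."""
    BLUE = 1     # open/public data
    GREEN = 2    # registration required
    YELLOW = 3   # click-through DUA
    ORANGE = 4   # signed DUA
    RED = 5      # signed DUA plus two-factor authentication
    CRIMSON = 6  # maximum restriction; possibly no network access at all

@dataclass(frozen=True)
class AccessPolicy:
    registration_required: bool
    dua: str          # "none", "click-through", or "signed"
    two_factor: bool

# Paraphrase of the level descriptions above as a lookup table.
POLICIES = {
    DataTag.BLUE:    AccessPolicy(False, "none", False),
    DataTag.GREEN:   AccessPolicy(True, "none", False),
    DataTag.YELLOW:  AccessPolicy(True, "click-through", False),
    DataTag.ORANGE:  AccessPolicy(True, "signed", False),
    DataTag.RED:     AccessPolicy(True, "signed", True),
    DataTag.CRIMSON: AccessPolicy(True, "signed", True),
}

def requirements(tag: DataTag) -> AccessPolicy:
    """Requirements a repository must enforce before releasing data."""
    return POLICIES[tag]
```

A repository front end could consult such a table at download time to decide whether to prompt for registration, a DUA, or a second authentication factor.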
ISSOD could consider applying the DataTags system, or a similar scheme, to classify the datasets it plans to share into a set of well-defined levels that guarantee the requirements corresponding to each level. This standardization facilitates establishing data use agreements with the original data provider organization and reviewing with the IRB the restrictions needed for sharing or accessing the data.
Differential Privacy or other Privacy-Preserving tools: An attractive extension to support sensitive data is adding tools that allow the analysis of sensitive data without access to the raw data. The Harvard Privacy Tools project has been working on a differential privacy tool, PSI (Gaboardi et al. 2016). The current plan is to integrate PSI with the Dataverse software, but it could also be integrated with other data platforms. In conjunction with the DataTags system, the differential privacy tool would allow some preliminary analysis of a yellow, orange, or red dataset without going through the long and tedious (and sometimes unavailable) process of DUA approval to be granted access to the raw data. It would also allow constructing an open, differentially private metadata set for a sensitive dataset, including differentially private summary statistics.
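PSI implements carefully calibrated releases; as a hedged illustration of the underlying idea only (not PSI's API), the sketch below computes a differentially private mean with the Laplace mechanism:

```python
import math
import random

def dp_mean(values, lower, upper, epsilon):
    """Release a differentially private mean via the Laplace mechanism.

    Values are clamped to [lower, upper], so changing one record shifts
    the mean by at most (upper - lower) / n; Laplace noise with scale
    sensitivity / epsilon then masks any individual's contribution.
    """
    n = len(values)
    clamped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clamped) / n
    scale = (upper - lower) / (n * epsilon)
    # Inverse-CDF sampling of Laplace(0, scale)
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_mean + noise
```

This is how differentially private summary statistics for a sensitive dataset's open metadata could be produced: the raw values never leave the enclave, only the noisy statistic does.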
Trusted Remote Storage Agents:
For some data sources, the organization responsible for the data might not agree to move the data to the repository's storage. A solution can be offered by the concept of trusted remote storage agents. To support this, the organization and the repository would sign a DUA agreeing that the data are kept in remote storage owned and maintained by the organization, while the repository holds the metadata describing the dataset and a persistent unique URL to access the data. The repository and the owner of the remote storage would jointly manage access to the data (via passing a token once credentials are proven, or by other means).
Governance, Access Review, DUAs:
Besides the necessary technology and features to support sharing sensitive social and digital media data described above, the challenge in most cases will be deciding on the DUA between the data source and the research institution that wants to use the data. A way to facilitate this step is to standardize the DUA to the extent possible and to start negotiations from a template.
Learning from other research communities with experience defining agreements for sensitive data could be valuable. In particular, the biomedical community has established well-defined practices for sharing sensitive data over several years, with projects such as TOPMed and dbGaP for genomic data (https://www.nhlbiwgs.org/). In these projects, the data can be accessed only for very specific, approved research purposes.
For ISSOD, there are also many open questions when planning to use non-public social media data for research: do social media users need to provide informed consent to allow the use of their data for research? Should a dataset be modified if a social media user deletes a published post, following the recent European General Data Protection Regulation (GDPR)? Is it possible to guarantee that the data will be removed and synced correctly?
Replication
Replication of an empirical study (or reproducibility, as it is also often referred to) can be supported at various levels of complexity. At a minimum, any data and code used in the study must be made available to enable computational reproducibility of the original results. Unfortunately, in most cases, the data and code shared by the authors do not include sufficient information to reproduce the published work. This problem is often experienced by the Odum Institute at the University of North Carolina, which serves as a third-party peer reviewer verifying that the results in a submitted manuscript can be reproduced using the data and code provided by the authors (see more on this initiative at the Odum Institute site: http://cure.web.unc.edu/odum/). On average, the Odum team needs to contact the authors three times to gather more information before the code runs properly and reproduces the results. It is therefore hard to reproduce results automatically if the data and code are not well formatted, reviewed, and documented.
Computational reproducibility could be improved either by dedicating resources to curate the data (and code) once they are in the repository, or by adding tools and policies that increase the chances that all the necessary information is provided. The IQSS and Odum teams are working towards integrating the Dataverse platform with replication tools, such as Code Ocean or Jupyter Notebooks, which will allow running the code on the data in an online platform without setting up a local environment. This collaboration aims to empower not only reviewers but the entire research community to reuse a dataset and reproduce previous work more easily. Once the results are verified, a 'reproduced' certification or badge should be assigned to the dataset.
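The verification step itself can be partially automated. As a hypothetical sketch (not the Odum or Dataverse tooling), a verifier could re-run the deposited analysis script and compare the regenerated output against a checksum recorded at deposit time before awarding the badge:

```python
import hashlib
import subprocess
import sys
from pathlib import Path

def checksum(path: Path) -> str:
    """SHA-256 fingerprint of a results file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_reproduction(script: Path, output: Path, expected_sha256: str) -> bool:
    """Re-run the authors' analysis script, then compare the regenerated
    output against the checksum recorded when the study was deposited."""
    subprocess.run([sys.executable, str(script)], check=True)
    return output.exists() and checksum(output) == expected_sha256
```

In practice, results rarely hash identically across environments (floating point, library versions), which is precisely why containerized platforms such as Code Ocean pin the environment along with the code.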
Sensitive data present an additional challenge for reproducibility. If reviewers are assigned to reproduce the results in a manuscript, they need the necessary permissions to access the data and code to verify the results. One option could be to grant reviewers access only to run the code, without allowing them to use the data for any other purpose; the DUA should include appropriate language to enable access for peer review. Another option, when possible, could be to integrate the replication tools with differential privacy tools, that is, to provide a differentially private version of the code.
Social and Digital Media Data
To support social media data, it would be useful to integrate the repository with network data visualizations and geospatial visualizations. This integration would be feasible through the repository API, provided that the data and metadata can be accessed in a format compatible with the visualization tools. As an example of such an integration, the Harvard Dataverse integrates with the WorldMap platform (http://worldmap.harvard.edu/) to visualize geospatial datasets. For qualitative data (text, images, videos), the Qualitative Data Repository (https://qdr.syr.edu/) hosted at Syracuse University is implementing an open annotation tool to annotate individual data posts and export the annotations for analysis.
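As a sketch of how a visualization tool could consume the repository API, the snippet below builds and fetches the metadata URL for a dataset using the Dataverse native API's persistent-identifier endpoint (the base URL and DOI in the test are placeholders, not real datasets):

```python
import json
import urllib.parse
import urllib.request

def dataset_metadata_url(base_url: str, doi: str) -> str:
    """Build the Dataverse native-API URL for a dataset's metadata,
    which a network or geospatial visualization tool could parse."""
    query = urllib.parse.urlencode({"persistentId": doi})
    return f"{base_url}/api/datasets/:persistentId?{query}"

def fetch_metadata(base_url: str, doi: str) -> dict:
    """Retrieve the dataset's JSON metadata (requires network access)."""
    with urllib.request.urlopen(dataset_metadata_url(base_url, doi)) as resp:
        return json.load(resp)
```

A tool such as WorldMap would then map fields from the returned JSON (file listings, geospatial metadata blocks) onto its own ingest format.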