Introduction

While working for the Census Bureau, I created a unit whose mission was to increase the awareness and acceptance of administrative data for research purposes. I sought data including as much identity, temporal and spatial detail as possible to enable linkages across files and over time. I successfully negotiated data sharing arrangements because of the agency’s statutory authority. The Census Act prohibits data use for non-statistical purposes (e.g., surveillance, enforcement, or marketing) and from releasing information that could reveal the identity of any business or individual in the data. I pursued arrangements with data owners across all levels of government and industry, vastly increasing the number and variety of data sourceshttps://www2.census.gov/about/linkage/data-file-inventory.pdf

to improve Census’ economic and population statistics.

While running this unit, I constantly considered the risks of acquiring and using universe-level datasets. I often asked myself, “What is the worst thing that could happen?” On any given day, there were many, many answers to that question. My high frequency worst things worries were:

Data breaches
Information degradation
Insider threats
Volatility in sources
Data misuse
IT constraints (computing and security)
Errors in published numbers
Negotiations falling through
Suboptimal linkages
Parties revoking agreement terms

I was constantly seeking best practices to prevent the worst things from happening and developing plans to deal with actual disasters. As the signatory on most of the data sharing agreements, I was accountable for adhering to the terms and conditions within (and I especially wanted to avoid fines and imprisonment from violating IRS agreements). I encourage applying my “what could go wrong” thinking to a large-scale data sharing initiative. Social media, search engine, rideshare and job search platform, patient and customer encounter, and cell phone data contain large volumes of time- and context-specific information, include legitimate and illegitimate entries, corrections and deletions, and are available with hazy degrees of permissible uses and consent. The data generation process and owner incentives for sharing are very different than with government data.

I explore four of the worst things that I think could happen involving large scale social data sharing. I define and describe potential consequences for each, suggest possible methods for avoidance, and then consider the potential functions of an institution to facilitate and moderate responsible data sharing.

The worst things that could happen

Data are discarded

If owners throw out the data, sharing cannot take place. Data may be discarded by owners who see no value in retention, or only see expense and liability. Owners may also be deleting data by overwriting it, often a result of legacy systems designed when storage was expensive. Data may be discarded due to legal constraints. Examples include federal laws that limit months of retention, such as 24 months for the National Directory of New Hire (NDNH) data or 48 months of National Change of Address (NCOA) data. Can a trusted institution retain data for secondary use, to enable historical or longitudinal study when firms have no incentive to keep it? Would such an arrangement be durable over time? How can the additional privacy risks (Altman et al. 2018) be addressed? Should laws be changed to authorize longer retention windows for government data like NDNH and NCOA? When a record is deleted, can any information about its original existence be retained to enable record count checks in later versions?

How do retraction and revision factor into data sharing arrangements? When juvenile records are expunged, tweets are deleted tweets, or account holders request erasure, how are data stores handled? What version control and metadata can reflect evolving databases? What is the effect on replication? Are there liability issues that make sharing some data risky for owners?

When an owner reviews output and suppresses or demands changes, data are effectively discarded. This censorship affects the validity of results. This occurs in arrangements private companies but also with state and federal governments but is seldom discussed. How can this be addressed, through pre-registration, by making the terms of review and release public?

Public outcry

When data subjects, advocacy groups, lawmakers, or the media are surprised by secondary data uses or data sharing, they can be alarmed and angry. The public outrage may be legitimate, due to duplicitous or clueless data owner actions, or it may be due to a lack of transparency on the justifications for use and security protocols. In any case, public outcry often halts or ends data sharing. How can you avoid this worst case? Data owners and analysts need to be more transparent with the public about how and why the data will be used. This is hard. In government, I did federal register notices, privacy impact assessments, websites, and seminars. They seldom reached the audiences that matter. Practical tools such as Finch’s engagement matrix (ADRF Network Working Group Participants 2018) can be tested and deployed. Clear but sensitive language needs to accompany all output explaining that important research was made possible by clients or users like you with minimal intrusion and risk. It needs to be as commonplace and recognizable as PBS’ “Made possible by viewers like you” and the American Humane Society’s “No animals were harmed” certification.

What can be done?

There are ever growing mounds of consumer, user and usage data, reflecting encounters, transactions, and statuses. Fortunately, there are many responsible owners and providers, as well as experts to help curate and preserve data. Can another institution accelerate more responsible data sharing? Others have called for cross-sector intermediaries, including Groves and Neufeld (2017) and Bernholz (2016). Some international initiativesSuch as the UK Consumer Data Research Centre (https://www.cdrc.ac.uk/) or Canadian Social Media Data Stewardship (http://socialmediadata.org).

may also be relevant. If a new institution took action, how much of the following would it do?

Identification.

An institution could seek useful sources, monitor emerging sources, and sponsor data collection. It could actively build the catalog. Or connect those with data to those who want it. How technically involved in the data should an institution be? Should it help users understand universes, data treatment, and limitations?

Negotiation.

An institution could set norms for working with firms, exploring the value proposition between parties. Would it negotiate on behalf of individual projects or a set of uses?

Agreements.

An institution could manage agreements after negotiation, handling their time limits, maintenance, and modification. Would an institution help enforce negotiated terms of use? Would an institution consider information ownership and licensing or subscription terms? How involved with payments would an institution want to be on behalf of a data owner?

Data Transfer and Access.

An institution could manage a data environment with current period and/or historical data and provide controlled access. Does it enable a federated system? Or is it an aggregator building a repository? Would an institution act as a trusted third party and obtain identified data to conduct joins? Should it anonymize data for researchers? Would it provision through remote access or host researchers? Would they handle screening and monitoring of researchers (could they outsource through institutional Google sign-in or use Experian to validate identity)? Would an institution assist with output review?

Gather tools and models.

An institution could engage with those who explore privacy, ethics, and security controls. Would an institution act as an IRB? Would an institution work on messaging and monitor perception? Would it advance transparency, helping subjects see how their data are being used? Would it explore data trust or commons models, would multiple approaches be needed depending on data type and source or expected users?

Conclusion

These observations are drawn from my federal experience where I hoped for the best and planned for the worst. Will companies engage? Ongoing conversations with companies are needed, building on existing work by the GovLab and Future of Privacy Forum (Harris 2017), to understand their motives and incentives. Cross-disciplinary and cross-sector efforts are needed to set norms on data sharing and archiving and to invest in technical and governance models. An institution supporting large-scale social data sharing must be durable enough to withstand public scrutiny and corporate leadership changes. With a clear vision, well-defined scope, and persistence, it will make significant progress.

References

ADRF Network Working Group Participants. 2018. “Communicating About Data Privacy and Security.” Administrative Data Research Facilities Network Working Paper 1. https://repository.upenn.edu/admindata_reports/1?utm_source=repository.upenn.edu%2Fadmindata_reports%2F1&utm_medium=PDF&utm_campaign=PDFCoverPages.

Altman, Micah, Alexandra Wood, David R O’Brien, and Urs Gasser. 2018. “Practical Approaches to Big Data Privacy over Time.” International Data Privacy Law 8 (1): 29–51. doi:10.1093/idpl/ipx027.

Bernholz, Lucy. 2016. “Trusted Data Intermediaries.” Stanford PACS. https://pacscenter.stanford.edu/publication/trusted-data-intermediaries/.

Grabus, Sam, and Jane Greenberg. 2017. “Toward a Metadata Framework for Sharing Sensitive and Closed Data: An Analysis of Data Sharing Agreement Attributes.” In Proceedings of the 11th Annual Conference on Metadata and Semantic Research, edited by Emmanouel Garoufallou, Sirje Virkus, Rania Siatri, and Damiana Koutsomiha, 755:300–311. Cham: Springer International Publishing. doi:10.1007/978-3-319-70863-8_29.

Groves, Robert M., and Adam Neufeld. 2017. “Accelerating the Sharing of Data Across Sectors to Advance the Common Good.” https://mccourt.georgetown.edu/massive-data-institute/dataforsocialgood.

Harris, Leslie. 2017. “Understanding Corporate Data Sharing Decisions: Practices, Challenges, and Opportunities for Sharing Corporate Data with Researchers.” Future of Privacy Forum. https://fpf.org/wp-content/uploads/2017/11/FPF_Data_Sharing_Report_FINAL.pdf.

The Worst Things

Amy O’Hara

2019

Introduction

The worst things that could happen

Data are discarded

Public outcry

What can be done?

Identification.

Negotiation.

Agreements.

Data Transfer and Access.

Gather tools and models.

Conclusion

References

Introduction

The worst things that could happen

No data sharing

Bad data sharing

Data are discarded

Public outcry

What can be done?

Identification.

Negotiation.

Agreements.

Data Transfer and Access.

Gather tools and models.

Conclusion

References