
Research Ethics Issues Raised in Collecting and Maintaining Large Scale, Sensitive Online Data

Michelle N. Meyer

Center for Translational Bioethics and Health Care Policy and Steele Institute for Health Innovation, Geisinger


Respect for persons

Although commonly forgotten, in the U.S. tradition of research ethics that developed in the late 1970s, the default requirement of study-specific, voluntary informed consent emerged from—and is merely one way of honoring—the principle of respect for persons. Both the Belmont Report and the federal Common Rule it animated recognize that such consent is not always feasible or ethically (or legally) required for all research (Meyer 2015). Individual consent will also be a poor fit in cases where data are inherently relational, as at least some data of interest to ISSOD will be (e.g., data pertaining to Facebook “friending” behavior or Twitter follower behavior). This is similarly true of genetic/omic research, since genetic/omic information is inherently relational, pertaining to families and broader kin groups. Although this problem is well recognized, no satisfactory solution for consent that honors this relationality has been identified. With some exceptions for identical twins, individuals invited to participate in genetic/omic research are merely advised to discuss their choice with their blood relatives.

Nevertheless, this section provides a brief overview of the kinds of consent that are possible, the kinds of data and research that may not legally or ethically require any form of consent, and the spectrum of other mechanisms beyond consent that are available for respecting data subjects as persons. (This white paper addresses research regulation under the U.S. federal Common Rule only; application of the GDPR, other non-U.S. laws, and U.S. state laws is beyond the scope of the paper.)

Other ways of respecting data subjects

When consent is infeasible or otherwise inappropriate, other ways of respecting data subjects as persons should be pursued. For instance, researchers can often provide notice of the research, either before or after the fact, and either study-specific or broad. Notice can also include return of aggregate results of research to data subjects. Some empirical studies of public perceptions of research have found that many (though not all) people would find notice of the purpose, nature, and/or results of research to be nearly as satisfying as consent for minimal risk research. For instance, in the aforementioned survey of MTurk Twitter users, although most thought that researchers should ask permission before collecting or using public tweets, “a number of respondents specifically framed their desire to both know about the research and to see it when it’s finished as an issue of respect. Informed consent can be seen as both informing and consenting, and for many respondents, the former would be sufficient” (Fiesler and Proferes 2018). A national survey of public perspectives on pragmatic randomized trials found that although most respondents preferred traditional, written informed consent (despite the majority recognizing that the trials depicted did not impose additional risk on participants), a substantial minority—nearly 40% in one arm—not only tolerated but preferred general notification (Nayak et al. 2015).

ISSOD could require all publications based on ISSOD data to be made publicly available on the ISSOD site, accompanied by accessible lay summaries. A more ambitious version would involve a technology-enabled platform that tells data subjects exactly which publications their data contributed to. 23andMe has such a feature: when logged into the platform, the user sees a message such as, “Michelle, you contributed to 17 published discoveries!” followed by a list of those publications with links to the journal (unfortunately, often not open access).

ISSOD could also host videos of lightning talks that explain to the public the major discoveries enabled by their data. Finally, in addition to returning aggregate research results, it can be valuable to return individual research results. The National Information Study that ISSOD envisions might, for instance, show each survey respondent how their answers compare to others. The Facebook app Genes for Good offers something like this: participants are asked to complete numerous phenotype surveys, but when asked, e.g., how frequently they smoke, they are rewarded with a simple data visualization showing how their answer compares to those of others (e.g., the respondent is in the top quintile for smoking activity).
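As an illustrative sketch only (not ISSOD's actual design, and with a hypothetical function name and fabricated data), returning such an individual result could be as simple as computing where a respondent's answer falls in the distribution of other respondents' answers:

```python
# Hypothetical sketch of returning an individual research result: showing a
# survey respondent which quintile their answer falls into relative to other
# respondents. All names and data are illustrative, not ISSOD's design.

def quintile_rank(value, others):
    """Return the quintile (1 = lowest 20%, 5 = highest 20%) that `value`
    falls into relative to the distribution in `others`."""
    below = sum(1 for v in others if v < value)
    percentile = below / len(others)
    return min(int(percentile * 5) + 1, 5)

# A respondent who reports smoking 18 cigarettes/day, compared against
# fabricated answers from other respondents:
answers = [0, 0, 1, 2, 3, 5, 8, 10, 12, 20]
print(quintile_rank(18, answers))  # prints 5: top quintile in this sample
```

Note that returning only a coarse quintile, rather than an exact percentile, also limits what the feedback itself discloses about other respondents' answers.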

Another way of respecting data subjects and their contribution to research is by ensuring that ISSOD enables reproducible science. Reproducible science is an ethical (not merely scientific) issue because whatever risks data subjects bear are for naught if the resulting science is not reproducible. Social Science One, for instance, will require researchers who access their platform to adhere to the “replication standard”: all funded research must archive replication data files, including code, methods and metadata (King and Persily 2018). In addition, ISSOD could, similar to NIH and clinicaltrials.gov, require preregistration of hypotheses (if any) and data analysis plans and require public posting of results, including null results, within a specified time period (subject to extension) on an ISSOD site (whether or not the results are also reported in a peer-reviewed journal).

Other ethical issues raised

Group harm, vulnerable populations, and participant engagement

Certain populations may both perceive themselves to be at increased risk from having their data aggregated and shared and may in fact be at increased risk of harm. For instance, one survey study found that “[l]esbian, gay and bisexual (LGB) respondents were more likely to express concern over their Twitter posts being used in government (odds increase of 2.12) and commercial settings (odds increase of 1.92), compared to heterosexual respondents,” perhaps owing to historical online abuse of LGBT persons and/or antagonistic relationships between these communities and some governments (Williams, Burnap, and Sloan 2017). If ISSOD expects to enable research on vulnerable populations, it should consider engaging those communities early in the process of developing the platform to see whether risks can be mitigated or community representatives can be included in ISSOD oversight structures. More generally, it is a good idea to engage the public (whether vulnerable or not) in development and oversight of the research platform.

Access to data

Data repositories typically calibrate data access according to the sensitivity and/or identifiability of the data, providing different access mechanisms for different kinds of data. At one end of the spectrum, the most sensitive data might only be accessible through virtual or actual data enclaves. At the other end of the spectrum, metadata and innocuous, individual-level raw data might be made publicly accessible. In between are myriad options (see Table 1 in M. N. Meyer (2018a), and Appendix) that involve trade-offs between data security, on the one hand, and transparency and fair access, on the other. For instance, it is common to restrict data access to “qualified researchers,” usually permanent faculty (i.e., those with “PI privileges”) affiliated with a research institution (e.g., Social Science One). The rationale for this restriction is that the institution can be required to be a party to the data use agreement, thereby lending that agreement some teeth. But this practice excludes citizen scientists, independent academics, and journalists, who may have legitimate interests in the data and who are likely to be more representative, politically and in other ways, of the general population whose data comprise the repository. A data repository that purports to serve the public good but is accessible only by academics may be viewed with skepticism by citizens on the political right, who may associate academia with biased, ideologically-driven research.

Data use agreements should prohibit attempts by researchers to re-identify or contact data subjects without the explicit permission of ISSOD (which can review proposals for, e.g., re-identification research).

Commercialization of research

It is essentially unheard of for data sources to be compensated for the research use of their data or to share in any profits derived from research, whether those data derive from human tissue or online behavior. Nevertheless, some people feel differently about the use of their data for commercial as opposed to non-profit purposes, and a persistent minority of data subjects believe they should be financially compensated for use of what many regard as their “property” (Fiesler and Proferes 2018). For instance, one of several objections by some to the case of Henrietta Lacks (in which physician-investigators collected and used leftover, quasi-pseudonymized clinical tissue for research without consent, as was then and remains today legal) is that, although those physician-investigators did not profit from the research, others downstream from the initial research did—and handsomely. Some ethicists have argued cogently that sources of passively collected research materials that the source would have produced anyway and which therefore entail no extra effort by or inconvenience to the data subject (e.g., leftover clinical tissue or, in the present context, digital data already produced for the user’s own purposes) are not morally owed compensation (Truog, Kesselheim, and Joffe 2012). Still, in response to this minority public sentiment, the revised Common Rule newly requires research consent to include, where appropriate, a “statement that the subject’s biospecimens (even if identifiers are removed) may be used for commercial profit and whether the subject will or will not share in this commercial profit” (45 C.F.R. § 46.116(c)(7) (2017)). Note that because the Common Rule does not apply to research with non-identifiable biospecimens collected for a purpose other than the instant research project, this new requirement technically only applies to research in which researchers intervene or interact with tissue sources to newly collect identifiable or non-identifiable biospecimens.

The regulations do not, however, provide that tissue sources must, or even ought to, share in profits.

Although compensating each data subject for their contribution is unreasonable and infeasible, it does make sense to constrain researchers from commercializing ISSOD-enabled research in ways that would threaten public access to the benefits of that research. This is especially important if, as seems likely, the public will be passively contributing data to ISSOD and bearing some (minimal) risk for doing so. Social Science One, for instance, precludes researchers who are funded by the platform from patenting their results, but it does not preclude other for-profit uses of the data (e.g., writing a profitable trade book about the research results) (Social Science One 2018).

Partnering with, versus scraping, platforms

In general, partnering with platforms provides several potential advantages. First, platforms can incorporate notice of (if not consent to) data sharing with ISSOD into their user-facing materials. Second, platforms may be able to help transmit aggregate or individual results as a gesture of respect. Third, the standard rule in biobank research is that samples from participants who withdraw are destroyed and no longer used in analyses going forward, but that their data is not removed from completed analyses. By analogy, partnering with a platform could entail a mirrored research database that refreshes in real time, enabling deleted tweets to be automatically removed from the research platform. (The evolving content of the database would have to be accounted for somehow to enable reproducibility of the analyses.) Conversely, scraping is often frowned upon by both platforms and users, especially if the platform is quasi-closed and community- or purpose-specific, where researchers may be viewed as interlopers. This can undermine trust, setting back the goal of large scale data sharing. The primary risk of partnering with platforms is influence by the platform on the research; it would be important to insulate the research from such influence.
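The mirrored-database idea can be sketched as follows. This is a hypothetical illustration (not any platform's actual API), with a per-version snapshot log retained so that analyses run against an earlier state of the database remain reproducible:

```python
# Hypothetical sketch of a mirrored research database that honors deletions:
# records whose source posts no longer exist on the platform are removed,
# and a snapshot of record IDs is logged per version so completed analyses
# can state exactly which version of the data they used.

def refresh_mirror(research_db, platform_ids, snapshot_log, version):
    """Drop records deleted at the source; log this version's record IDs."""
    deleted = set(research_db) - set(platform_ids)
    for record_id in deleted:
        del research_db[record_id]
    snapshot_log[version] = sorted(research_db)
    return deleted

research_db = {"t1": "...", "t2": "...", "t3": "..."}
snapshot_log = {}
removed = refresh_mirror(research_db, ["t1", "t3"], snapshot_log, version=2)
print(removed)          # {'t2'}: the record deleted at the source
print(snapshot_log[2])  # ['t1', 't3']
```

Logging snapshot versions is one way to reconcile honoring deletions with reproducibility: a paper would report the version it analyzed rather than the live, continually changing database.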


There is currently no consensus on the ethical issues raised in this white paper, and consensus is unlikely to be achieved in the near future, if ever (King and Persily 2018; Vitak, Shilton, and Ashktorab 2016; Zimmer 2010). Moreover, as always, different research studies will raise different concerns, significantly challenging the feasibility of one-size-fits-all ethics rules. For both reasons, it is wise to focus on process.

What should that process look like? Most university IRBs are unfamiliar with both data sharing and research with online data, leading to both Type I and Type II errors in reviewing proposals such as this. Moreover, the vast majority of secondary research with data housed with ISSOD will either be non-human subjects research or exempt from IRB review, and most IRBs will not conduct a substantive review that falls outside of their jurisdiction. In addition, many aspects of the Common Rule, which IRBs are trained to apply, are a poor fit for ISSOD. The act of data collection and maintenance itself does not meet the regulations’ definition of “research” (M. N. Meyer 2018b) and, as discussed above, most secondary analyses of these data will be exempt from the Common Rule or fall outside of it completely. The Common Rule is further hampered by a weak definition of “identifiable” and a focus on risks to individual data subjects only, as opposed to third parties, groups, or society.

Still, some sort of prospective group ethics review is desirable, at least for some categories of data collection (e.g., prior to scraping a platform) and some categories of secondary research with collected data (e.g., research with highly sensitive and/or highly identifiable data, research investigating a highly sensitive or controversial question, or research that targets vulnerable populations). Social Science One requires university IRB approval or (more likely) determination of non-human subjects or exempt status; prospective peer review that includes a review of the proposal’s scientific merit and potential benefits, but also of the “ethical track record” of the PI and the potential costs to data subjects and others; and, assuming the proposal passes peer review, separate ethics review by ethicists appointed by Social Science One with specific expertise in online research ethics.

Two caveats about this process are worth noting. First, King and Persily (2018) incorrectly state that federal research regulations require IRBs to make exempt determinations; in fact, the Common Rule is silent on this question. Although OHRP has historically recommended that this determination not be left to investigators (https://www.hhs.gov/ohrp/regulations-and-policy/guidance/faq/exempt-research-determination/index.html), during the Common Rule revision process, HHS/OHRP proposed a “decision tool” that would allow investigators (or others, as the institution prefers) to make their own exempt determinations, with impunity, so long as the inputs they enter are accurate. That tool did not make it into the Final Rule, largely because the agencies ran out of time to develop it, but HHS has indicated that it intends to introduce it in the future. Perhaps anticipating this, some IRBs have developed online decision tools, of a sort, through which they permit their investigators to make their own exempt determinations. The practice of investigators making their own exemption determinations is controversial, and until the dust settles, it would be best if IRBs continue to make these determinations for ISSOD projects. Second, it is unclear how peer reviewers are meant to investigate the “ethical track record” of the PI, nor is it clear that peer reviewers are sufficiently knowledgeable about the risks of this work to be helpful (much as IRBs are often insufficiently knowledgeable about the scientific merits of proposed research).

Finally, Social Science One collaborates with an NSF-funded team of information scientists, PERVADE, to provide continuous ethics feedback about the platform’s decisions. Something like this body might be engaged to review at least some subset of data collection and data analysis activities. Members of that committee should include not only those who specialize in online communities and digital data but also those who are broadly trained in moral reasoning. ISSOD should also consider including laypersons on the committee (ideally, more than the single community member most IRBs retain). For more on oversight of research when IRB review and the Common Rule are poor fits, see M. N. Meyer (2018a, 224–27).

Public trust

Whether behavior is morally right or wrong is not dictated by public opinion. However, data sharing initiatives like ISSOD will not be successful if they do not earn the public’s trust. This is likely to be one of the biggest obstacles to success. In the U.K., for instance, the National Health Service (NHS) attempted to invoke a social license to justify the extraction of data from medical records, partly for research, unless patients opted out. Public and even professional opposition was strong enough that the program, care.data, was shelved. Some scholars have argued that the program failed to secure all necessary aspects of a social license for research, which require that data subjects perceive participation to be voluntary and governed by values of reciprocity, non-exploitation, and service of the public good (Carter, Laurie, and Dixon-Woods 2015).

There is some existing research investigating perceptions of research use of digital data, and it suggests the challenges ahead. In the aforementioned survey of Twitter users, 65% believed that researchers should not be able to use even public tweets without explicit user permission. When asked whether they themselves would agree to allow a university researcher to use their tweet, 53% said yes, 14% said no, and 33% said it would depend on contextual factors. When asked if they would opt out of having their tweets used in all academic research, 29% said yes and another 25% again said it would depend. When asked which factors would influence their comfort with “a tweet” of theirs being used in research, respondents (n = 268) were most likely to indicate being somewhat or very uncomfortable when: the tweet was from their protected account (75%), no consent was sought (67%), it was a public tweet they had later deleted (64%), the tweet was quoted in published research and attributed to their Twitter handle versus quoted anonymously (56% vs. 27%; the authors note that respondents may not have realized how easy it is to reidentify a user whose tweet is quoted verbatim, even if the username is omitted), researchers also analyzed their public profile information (e.g., username and location) versus researchers not having such information (55% vs. 20%), they were informed after the fact (50%), their tweet was one of only a few dozen being analyzed versus one of millions (47% vs. 21%), and a human read their tweet to analyze it versus a computer program doing so (37% vs. 17%). When asked about their overall comfort level with tweets being used in research, only 21% to 27% of respondents said they were somewhat or very uncomfortable. But when asked about their comfort if “your entire Twitter history was used,” that number shifted to 49%. When asked if they would want to know that a tweet of theirs was used in a university study, 80% of respondents said yes.

Notably, this survey did not elicit the strength of respondent preferences for consent if such a requirement would hamper or preclude research. A study of patient preferences regarding consent to medical record review found that although most respondents preferred an in-person consent session with their physician, only 13.8% would prefer such research not to occur if written or verbal consent would make the research too difficult to conduct (Kraft et al. 2016).

Several barriers to public acceptance of large scale, nonconsensual data collection are likely, including an ineffable sense of “creepiness” (especially, for instance, in the case of data collected from platforms that present themselves as private, such as WhatsApp and Facebook Messenger), fear of research, and lack of appreciation of how little is known about important social phenomena (and, hence, the importance of research with big data). Additional research on lay perceptions and preferences is needed for an initiative such as ISSOD to be successful, including: a) research that progresses beyond opinion poll-like surveys and takes an incentive compatible and/or experimental approach to measuring preferences and otherwise elicits preferences in ways that require respondents to acknowledge the trade-offs of data privacy, b) research that investigates perceptions and preferences of users on platforms other than Twitter (e.g., Facebook, Reddit, 4/8chan, Instagram, WhatsApp), and c) research that goes beyond baseline perceptions and preferences and investigates how to communicate initiatives like ISSOD to the public in ways that engage, rather than alienate, the public.

Appendix: Data Access Provisions of Major Data Repositories



  • Nationally representative sample of 5,000 American children & adults/year
  • Oversamples people over 60 years, Hispanics & African Americans

Data collected

  • Survey (demographic, socioeconomic, dietary, health-related questions)
  • Exam (medical, dental, physiological measurements, lab tests, genomic)
  • Smoking, alcohol consumption, sexual practices, drug use, physical fitness & activity, weight, dietary intake, reproductive health (e.g., use of oral contraceptives & breastfeeding practices)
  • SNP data (after 2003)

Data tiers

  1. Open/public/unrestricted
  2. Some small anonymized genomic datasets not linkable to any other datasets are available on request w/o IRB review (because no human subjects involved w/non-identifiable data) through Data Use Agreement/release form
  3. Restricted: Data that could compromise confidentiality of survey respondents or institutions or is “sensitive by nature”
    • All geographic data below national level
    • Exact interview & exam dates
    • Most genomic data

Processes for accessing restricted data

  1. NHANES Data Support Agreements: initiated by NHANES w/identified experts under signed agreement to assist in data collection or processing
  2. NHANES QA/QC Collaborator datasets: Inter-agency QA/QC dataset agreement w/current NHANES collaborators 3 mos prior to public release
  3. NHANES Special Use Data Agreements: Under special circumstances NCHS enters into agreement w/Collaborators, CDC employees, or any researcher to provide limited non-public special dataset; request reviewed by Director & Confidentiality Officer
  4. NCHS RDC applications: Requests by “any researcher” to match NHANES data to external data sources; to analyze lower level geography or indirect identifiers; for access to non-public release data which are the basis of published analyses, e.g., published analyses based on one year of data:
  • Application (example) submitted to Research Data Center (RDC), judged by: Well-defined research question addressing public health concern (consistent w/consent scope), explanation of why restricted variables are necessary, technical feasibility, disclosure risk (based on variables requested, remote vs. on-site access, analytic plan including stats methods)
  • Review Committee (including Analyst, Data System Rep(s) & Confidentiality Officer) may approve, disapprove, or (often) R&R
  • Avg review: 6-8 weeks
  • Approval doesn’t guarantee all output generated by analysis will be released; output is reviewed for disclosure risk & will be suppressed if necessary
  • Completion of online Confidentiality Orientation & 100% score on quiz
  • Signed Confidentiality Agreement (e.g., use data only for approved purpose; no attempt to re-ID or discover suppressed cells; no attempt to introduce any additional data through statistical programming or otherwise; don’t use data in way that poses additional risk to respondents; if you inadvertently deduce small cells (<5) or individual-level information, don’t share that information with anyone or in any publication and immediately notify RDC)
  • Signed Designated Agent affidavit
  • Review of Disclosure Manual
  • Submission of fee
  • Manuscript must be submitted to RDC Analyst prior to submitting for publication
  • Appears that NCHS Ethics Review Board (ERB) reviews all proposals to analyze restricted (i.e., identifiable) data after RDC approves (see here); biospecimens consent refers to such ERB review but survey consent doesn’t, even though some survey data are restricted/identifiable. No local IRB review appears to be required.
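The small-cell suppression rule referenced in the NHANES confidentiality provisions above (counts below 5 are withheld before output release) can be illustrated with a minimal sketch; this is a toy example of the rule, not NCHS's actual disclosure-review software, and the table labels are invented:

```python
# Toy illustration of small-cell suppression: any released count below the
# threshold (here <5, per the NHANES confidentiality rules described above)
# is withheld, since small cells can identify individuals.

SUPPRESSION_THRESHOLD = 5

def suppress_small_cells(table):
    """Replace any count below the threshold with None (suppressed)."""
    return {
        cell: (count if count >= SUPPRESSION_THRESHOLD else None)
        for cell, count in table.items()
    }

counts = {"county A / outcome X": 212, "county B / outcome X": 3}
print(suppress_small_cells(counts))
# county B's count of 3 is suppressed (None); county A's 212 is released
```

Real disclosure review is broader than this (e.g., it must also catch cells recoverable by subtraction across marginal totals), which is why RDC analysts review all output rather than relying on a single threshold rule.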

National Longitudinal Survey of Youth (NLSY)


  • Nationally representative birth cohorts
  • Incidentally includes prisoners; their data is not specially protected or restricted but questions about illegal activity are skipped in interviews if prisoner cannot enter answers directly into laptop (i.e., if prisoner would have to answer out loud)

Data collected

  • Education, training, cognitive tests; age, gender, geographic residence & neighborhood composition; household composition; race, ethnicity & immigration; computer & Internet access
  • Sensitive questions: income & assets, religion, relationships w/parents & family, sexual experiences, abortion, drug & alcohol use, criminal activities, homelessness, runaway episodes
  • Geocode data: county, metropolitan statistical area, ZIP Code, census tract and block, & latitude & longitude of residence

Data tiers

  1. Open/public/unrestricted
  2. Restricted data
  • Geographic data: state, county, metropolitan statistical areas of residence, country or state/county of birth, state of residence, state, country, or region of world in which respondent’s parents & grandparents were born
  • Name & locations of colleges & universities attended
  • Dates of birth, marriage, divorce, death, school attendance, etc. are month-year only in public files; specific dates restricted
  • Income & assets variables are public but topcoded (see p. 157 here)

Processes for accessing restricted data

  • Project-specific application (new use of data requires new application):
  • Clear statement for general audience of project (max 4 paragraphs) & explanation why geographic variables are necessary
  • Project must further mission of BLS & the NLS program “to conduct sound, legitimate research in the social sciences” (see p. 157 here)
  • Researcher must work/study at U.S. institution (needn’t be U.S. citizen)
  • Non-negotiable Letter of Agreement pledging to adhere to BLS confidentiality policy, signed by BLS office & official at requestor’s institution w/authority to enter into legal agreements on institution’s behalf
  • Each individual authorized to access data under Letter of Agreement signs non-negotiable BLS Agent Agreement designating him as unpaid agent of BLS & requiring him to take certain security & confidentiality measures
  • Geocode agreements last 1 year for students, 3 years for most faculty; term for adjuncts, visiting faculty, postdocs, etc. depends on how long they’ll stay at institution; extensions granted on case-by-case basis; if researcher leaves institution before end of agreement, his BLS agent agreement terminated & if no one else is authorized at institution, Letter of Agreement also terminated
  • Average 6-8 weeks after application submitted until legal docs signed at BLS
  • After letters executed, can order appropriate Geocode CD
  • Research outputs subject to review by BLS to ensure compliance w/confidentiality requirements
  • Facilities where Geocode data used subject to BLS inspection to ensure compliance w/Letter of Agreement
  • May not link geocode data w/individually identifiable records from any other dataset
  • Penalty for misuse
  • Process to access original cohorts geocode data slightly different

Framingham Heart Study (FHS)


  • Three generations comprising 15K clinically & genetically well-characterized participants
  • Since 1994, two groups from minority populations added

Data collected

  • Medical records, specimens (DNA, urine, blood & blood products), physical examination, blood tests, electrocardiogram, genetic data

Data tiers (there does not appear to be any public data)

  1. Much FHS data is available through other repositories according to their access policies: genotype/phenotype data via dbGaP; phenotype data via BioLINCC
  2. Authorization required: studies of existing data, collection of new data (from participants or existing samples), images or medical records

Processes for accessing restricted data

  • Applicant first requests & receives an account
  • Application: background & rationale, specific aims, methods, data requested
  • Application routed to appropriate committee(s):
    • Executive Committee (requests for participant contact)
    • Lab Committee (requests for Framingham bio-specimens for non-genetic research)
    • DNA Committee (requests for genomic data not included in dbGaP; requests for Framingham DNA or other bio-specimens for genetic research)
    • Research Committee (requests for clinical data not available in BioLINCC)
  • Application review criteria:
    • Does proposal complement Framingham's research scope?
    • Is collaboration w/a Framingham investigator planned?
    • Does proposal require unique characteristics of FHS cohort(s)?
    • Does proposal put minimal demand on FHS resources?
    • Does proposal show proof of resources for conducting project?
    • Investigators strongly encouraged to use Omni data/biospecimens
    • Local (i.e., investigator) IRB approval required of all approved data &/or material distributions. (“Although Framingham data is de-identified, FHS is a study of a single community & hence one's identity can be more easily ascertained, even if traditional identifiers are removed.”)
  • If proposal involves new participant contact or additional specimen collection, Observational Studies Monitoring Board (external to FHS) also reviews proposal.
  • Proposals eligible for expedited review (w/in 2 weeks) if request only existing data or new phenotypic data w/o participant contact
  • Among application questions: Will this project generate new individual level data on Framingham participants? For example, sets of analyzable data from individual level measurements, images or lab specimens.
  • Fee from $3-$10,000.
  • Framingham Data & Materials Distribution Agreement (DMDA) required of all approved data &/or material distributions:
    • No attempted re-ID
    • Data & materials not used for any purpose contrary to consent; must consult w/Study Investigators re: consent terms & conditions
    • No use beyond approved research project
    • No further sharing
    • Advance notification of publication, etc.

Wisconsin Longitudinal Study (WLS)


  • NIA-sponsored longitudinal study of a random sample of 10,317 Wisconsin high school graduates from the class of 1957
  • Broadly representative of white, non-Hispanic Americans w/at least high school education; minorities not well represented

Data collected

  • Genotypic data from Illumina HumanOmniExpress array w/ quality metrics (supplied by U of Washington Genetic Analysis Center)
  • Genotype imputation data w/genotypes imputed to 1000 Genomes Project phase 3 reference panel
  • Life course, intergenerational transfers & relationships, family functioning, physical & mental health & well-being, morbidity & mortality from late adolescence through 2011, social background, youthful aspirations, schooling, military service, labor market experiences, family characteristics & events, social participation, psychological characteristics & retirement, attractiveness rating, relative BMI

Data tiers & processes for access

  1. Most WLS data are publicly downloadable & free (Level One), w/some variables removed (geography, birth & death months, data re: friends & relationships w/other participants, names of colleges) or top- or downcoded to prevent ID of outliers (many monetary values, height, weight, BMI). Downloaders need only provide their name, a valid email address, geographic location, & academic area of specialty.
  2. Subject to approval (Level Two): a smaller subset of data defined as either sensitive or as having a marginally higher risk of identifiability. Access granted after researchers email WLS staff w/a statement on why the Level One data are not sufficient to answer their research question, a copy of their CV, & proof of human subjects training, & sign a confidentiality agreement.
  3. Most sensitive data (Level Three), incl. DNA data, audio recordings, & some geographic codes: require a research plan, a fully-executed Data Use Agreement (DUA), & proof of IRB (or equivalent) approval from the researcher’s home institution; genetic data also require approval by the WLS Genetic Advisory Board. Some analyses conducted on WLS secure server; once the data are licensed through the DUA, WLS provides users w/a copy of the data to analyze at their home institution.
  4. Accessible only in physical coldroom: SS earnings & benefits
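The top- or downcoding mentioned in the Level One tier is a standard statistical disclosure-control step: extreme values are clamped to a percentile bound so that outliers (e.g., a very high income in a small community) cannot single out a participant. A minimal sketch, assuming 1st/99th-percentile cutoffs for illustration (not the actual WLS thresholds):

```python
# Illustrative top-/bottom-coding sketch; the percentile thresholds are
# assumptions for demonstration, not the actual WLS cutoffs.

def percentile(sorted_vals, pct):
    """Nearest-rank percentile of a pre-sorted list."""
    idx = round(pct / 100 * (len(sorted_vals) - 1))
    return sorted_vals[max(0, min(len(sorted_vals) - 1, idx))]

def top_and_bottom_code(values, lower_pct=1, upper_pct=99):
    """Clamp values outside the percentile bounds to the bounds themselves."""
    s = sorted(values)
    lo, hi = percentile(s, lower_pct), percentile(s, upper_pct)
    return [min(max(v, lo), hi) for v in values]
```

After coding, the bulk of the distribution is unchanged, but the single outlier is no longer visible as such in the released file.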

Health and Retirement Study (HRS)


  • Longitudinal panel study of representative sample of 20K Americans supported by NIA & SSA

Data collected

  • In-depth interviews, health data, genetic data

Data tiers & processes for access

  1. Most survey data is unrestricted & publicly available w/
    1. Registration (name, valid email, phone, organization, state; type of organization, including none or other; primary role in org—faculty, students, staff, other; highest degree; whether you’re working alone, w/collaborator w/name or under supervision w/name; primary research area, whether you’ve published using HRS data)
    2. Agreement (via registration) to conditions of use (e.g., no attempted ID, no data transfer to 3rd parties)
  2. Sensitive health data (e.g., biomarkers, prescription drug data, diabetes study, cognition & behavior phenotypic data, telomere data, memory): available from public portal w/
    1. Sensitive DUA (no re-ID, store & use data in secure environment, cite HRS & provide copies of publications to HRS, guidelines for frequency & magnitude tabulations, publish only aggregate stats)
    2. Verification of identity & institutional affiliation (unclear if independent/citizen sci ok)
  3. Restricted data (SSA admin data, VA health data, national death index cross-year cause of death, CMS cross-ref file, geographic info, pension estimation program & database, industry & occupation data, cancer site, Part D Plan info, interview date, date of death, cross-wave race & ethnicity, college tuition imputations) available in 2 ways w/different data security plans & confidentiality agreements (flow chart; text narrative):
    1. MiCDA Enclave Virtual Desktop Infrastructure (VDI): submit application (for each participating institution) including:
  • Letter from Dept. Chair (for students)
  • Proof of local IRB review (exempt, expedited, or full)
  • 1-3 p. Research proposal (what restricted variables you need & why, study team details, project goals)
  • Data order form
  • MiCDA VDI Data Security Plan
  • MiCDA Data Enclave Acceptable Use Policy
  • ISR Pledge to Safeguard Respondent Privacy
  • Confidentiality Agreement
  • [Certain data merges can only be performed by visiting Enclave in person w/additional signed Confidentiality Agreement Restricting Disclosure & Use of Data from the MiCDA Enclave]
    2. Traditional licensing agreement (required for SSA, CMS linkages):
  • Proof of IRB approval (expedited or full—apparently NOT exempt): once your application is complete & your data security plan is acceptable, you submit the proposal to your IRB &/or your institutional Contracting Authority for review; completed reviews are submitted to HRS
  • 1-3 p. Research proposal (what restricted variables you need & why, study team details, project goals)
  • Data order form
  • Data security plan (see also this checklist)
  • CV(s)
  • Institutional Federalwide Assurance (IRB registered w/OHRP)
  • Proof of current federal funding (primary penalty for breach is notification via NIA to your funding agency)
  • Confidentiality Agreement w/institutional countersignature
  4. Access to SSA administrative data & CMS research data is “more involved”; contact HRS for details (seems to involve encrypted physical media to qualified researchers?)
  5. Genetic data: genetic data products (candidate gene & SNP files, genotype data, exome data) from 20K genotyped respondents: available after applying first for dbGaP access to controlled data & then submitting to HRS a Genetic Data Access Use Agreement & Genetic Data Order Form
  6. Additional restrictions on merging restricted data files w/each other (e.g., restricted SSA admin records may not be merged w/geographic data)

A completed application goes for final review by the full Data Confidentiality Committee (DCC). If approved, the HRS PI signs the Confidentiality Agreement & the data are made available. The restricted data agreement must be renewed annually; any change in circumstances requires a modification to the agreement. Periodic inspection of the site by HRS is possible. On termination, licensing-agreement users must return or destroy the data (if the latter, they must provide counter-signed certification of destruction).
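The Sensitive DUA’s guidelines on frequency & magnitude tabulations reflect a common disclosure-control rule: before publication, suppress any cell of a frequency table whose count falls below a minimum, since very small cells can identify individuals. A hedged sketch of that rule (the threshold of 5 is an assumed example, not HRS’s actual guideline):

```python
# Suppress small cells in a frequency table before publication.
# The min_cell threshold of 5 is an illustrative assumption.
from collections import Counter

def safe_frequency_table(categories, min_cell=5):
    """Count occurrences per category, masking cells below min_cell."""
    counts = Counter(categories)
    return {k: (c if c >= min_cell else "<suppressed>")
            for k, c in counts.items()}
```

A researcher publishing only tables filtered this way satisfies the “aggregate stats only” condition without ever releasing rows that describe one or two respondents.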

Notes about IRB review

  • HRS considers public data to pose a sufficiently low risk of re-identification that it is exempt from IRB review
  • This memorandum from the HRS PI to IRBs makes clear that restricted data poses a sufficiently high risk of re-identification that exemption is inappropriate, that the risk for the IRB to consider is that of re-identification, & that IRB review should therefore center on the “Restricted Data Protection Plan, and those aspects of the Research Plan that deal with issues of respondent anonymity and data security, if any. By the time they reach you, HRS will have approved these Plans. But we ask for your review because you will be better able to judge the extent to which, in your institution's physical and computing environment, whether the Plans are adequate to ensure participant anonymity and limitation of access to the restricted data to the persons specified in the agreement.” [NB: This partially contradicts what is said about the VDI pathway, which is that your IRB may indeed find your research exempt.]
  • Certification of IRB review form: Certify that 1) your IRB has an active FWA and 2) “Our Institutional Review Board/Human Subjects Review Committee has reviewed, according to its standards and procedures for live human subjects, and approved, the Restricted Data Protection Plan (and those portions of the Research Plan that deal with respondent anonymity and data security, if any), approved by the Health and Retirement Study, of the Restricted Data Investigator above; and has approved those plans.”

Possible penalties for breach of Agreement for Use of Restricted Data

  • Denial of future access to HRS data
  • Notification to your institution’s scientific integrity office & request for sanctions
  • Notification to your current funding agency w/recommendation that all current funds be terminated & all future funds be denied
  • Other remedies available at law

General Social Survey (GSS)

Data collected

  • Demographic, behavioral, & attitudinal questions, plus topics of special interest (civil liberties, crime & violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, stress & traumatic events)

Data tiers (individually identifying info, e.g., name, address, never provided)

  1. Public use files (include no geocoded data)
  2. GSS geographic identification code files (state, primary sampling unit, county, & Census tract) available to researchers under special contract w/NORC

“Sensitive Data”: any data that might compromise the anonymity or privacy of respondents. Specifically, any data file that, for either individuals or families, includes:

  1. Identification numbers or demographic information (such as month & year of birth, age, ethnicity, occupation, industry, gender, etc.);
  2. Geographic identification of areas smaller than Census Division, including, but not limited to state, county, minor civil division, primary sampling unit (PSU), segment, city, place, zip code, tract, block numbering area, enumeration district, block group, or block;
  3. Any variables or fields derived from the data mentioned in items 1–2 above, including data linked to a GSS dataset using the data mentioned in items 1 & 2 above as linking or matching variables.

Process for access

  • Research Plan: describe which datasets & variables you want; must be project-specific
  • CV for each researcher
  • Sensitive Data Protection Plan (data security plan)
  • Human Subjects Review from your Institution, using Sensitive Data Protection Plan as part of the application for approval; may result in approval or waiver
  • Contract for Use of Sensitive Data:
    • If the Investigator isn’t full-time permanent faculty at the institution, a co-I who is a full-time, PhD-level faculty member is required
    • Data must be returned or destroyed
    • Investigator(s) & institution must assume liability up to $100K for any violations by any person at the institution
    • Signed by representative who can enter into contracts on behalf of institution
    • If investigator leaves institution, agreement terminates
  • Process can take several months
  • Fee: $750

Criteria for review

  • GSS takes its promise of anonymity to its respondents very seriously; this is the basis for the contract process
  • GSS aims to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting


Database of Genotypes and Phenotypes (dbGaP)

Participants & data collected

  • Genotype & phenotype data collected via many studies under a wide range of consent terms

Data tiers

  1. Open: available to anyone w/no restrictions
  2. Controlled: allows download of individual-level genotype & phenotype data that have been de-identified (i.e., no personal identifiers, such as name)

Process for access

  • PI (must be registered by your institution as a PI in your eRA account) & institutional Signing Official (both w/NIH eRA Commons accounts) co-sign request for data access
  • Statement summarizing proposed research use for the requested data
  • List of collaborating investigators at same institution (collaborators at other institutions must submit own requests)
  • Submission of request constitutes agreement to Data Use Certification (e.g., use limited to proposed project, no re-ID or re-contact, no further distribution of data)
  • Agree to Code of Conduct
  • Adhere to data security measures

Criteria for access

  • Data access & use must be consistent w/participants’ intent as reflected in consents; datasets are placed in different “consent groups,” e.g.:
    • General research use: use limited only by model Data Use Certification
    • Health/medical/biomedical research only: no ancestry, no possibly stigmatizing research, no non-health research
    • Non-profit use only
  • Some datasets require local IRB approval; others (e.g., often, general research use) don’t
  • Access to all controlled datasets requires approval of an NIH Data Access Committee (DAC), which looks to ensure proposed plan matches limitations (if any) of consent group; no additional ethics review is involved


Bruckman, Amy. 2016. “Do Researchers Have to Abide by Terms of Service (TOS)?” The Next Bison: Social Computing and Culture. https://nextbison.wordpress.com/2016/02/26/tos/.

Budin-Ljøsne, Isabelle, Harriet J. A. Teare, Jane Kaye, Stephan Beck, Heidi Beate Bentzen, Luciana Caenazzo, Clive Collett, et al. 2017. “Dynamic Consent: A Potential Solution to Some of the Challenges of Modern Biomedical Research.” BMC Medical Ethics 18 (1): 4. doi:10.1186/s12910-016-0162-9.

Carter, Pam, Graeme T Laurie, and Mary Dixon-Woods. 2015. “The Social Licence for Research: Why Care.Data Ran into Trouble.” Journal of Medical Ethics 41 (5): 404–9. doi:10.1136/medethics-2014-102374.

Caulfield, Timothy, and Jane Kaye. 2009. “Broad Consent in Biobanking: Reflections on Seemingly Insurmountable Dilemmas.” Medical Law International 10 (2): 85–100. doi:10.1177/096853320901000201.

Fiesler, Casey, and Nicholas Proferes. 2018. “‘Participant’ Perceptions of Twitter Research Ethics.” Social Media + Society 4 (1). doi:10.1177/2056305118763366.

Hartzog, Woodrow N., and Frederic D. Stutzman. 2013. “The Case for Online Obscurity.” California Law Review 101 (1): 1–49.

Kaye, Jane, Edgar A. Whitley, David Lund, Michael Morrison, Harriet Teare, and Karen Melham. 2015. “Dynamic Consent: A Patient Interface for Twenty-First Century Research Networks.” European Journal of Human Genetics 23 (2): 141–46. doi:10.1038/ejhg.2014.71.

King, Gary, and Nathaniel Persily. 2018. “A New Model for Industry-Academic Partnerships.” Working Paper. http://j.mp/2q1IQpH.

Kraft, Stephanie Alessi, Mildred K. Cho, Melissa Constantine, Sandra Soo-Jin Lee, Maureen Kelley, Diane Korngiebel, Cyan James, et al. 2016. “A Comparison of Institutional Review Board Professionals’ and Patients’ Views on Consent for Research on Medical Practices.” Clinical Trials 13 (5): 555–65. doi:10.1177/1740774516648907.

Meyer, Michelle N. 2015. “Two Cheers for Corporate Experimentation: The A/B Illusion and the Virtues of Data-Driven Innovation.” Colorado Technology Law Journal 13 (2): 273–331.

———. 2018a. “Ethical Considerations When Companies Study—and Fail to Study—Their Customers.” In The Cambridge Handbook of Consumer Privacy, edited by Evan Selinger, Jules Polonetsky, and Omer Tene, 207–31. Cambridge, UK: Cambridge University Press.

———. 2018b. “Practical Tips for Ethical Data Sharing.” Advances in Methods and Practices in Psychological Science 1 (1): 131–44. doi:10.1177/2515245917747656.

Mittelstadt, Brent, Justus Benzler, Lukas Engelmann, Barbara Prainsack, and Effy Vayena. 2018. “Is There a Duty to Participate in Digital Epidemiology?” Life Sciences, Society and Policy 14 (1): 9. doi:10.1186/s40504-018-0074-1.

Nayak, Rahul K., David Wendler, Franklin G. Miller, and Scott Y. H. Kim. 2015. “Pragmatic Randomized Trials Without Standard Informed Consent?: A National Survey.” Annals of Internal Medicine 163 (5): 356–64. doi:10.7326/M15-0817.

Nissenbaum, Helen. 2004. “Privacy as Contextual Integrity.” Washington Law Review 79 (1): 119–58.

Office for Human Research Protections. 2008. “Coded Private Information or Specimens Used in Research, Guidance.” https://www.hhs.gov/ohrp/regulations-and-policy/guidance/research-involving-coded-private-information/index.html.

Ploug, Thomas, and Søren Holm. 2016. “Meta Consent - A Flexible Solution to the Problem of Secondary Use of Health Data: Meta Consent.” Bioethics 30 (9): 721–32. doi:10.1111/bioe.12286.

Secretary’s Advisory Committee on Human Research Protections. 2013. “Attachment B: Considerations and Recommendations Concerning Internet Research and Human Subjects Research Regulations, with Revisions.” https://www.hhs.gov/ohrp/sachrp-committee/recommendations/2013-may-20-letter-attachment-b/index.html#backfn2.

Social Science One. 2018.

Truog, Robert D., Aaron S. Kesselheim, and Steven Joffe. 2012. “Paying Patients for Their Tissue: The Legacy of Henrietta Lacks.” Science 337 (6090): 37–38. doi:10.1126/science.1216888.

Vitak, Jessica, Katie Shilton, and Zahra Ashktorab. 2016. “Beyond the Belmont Principles: Ethical Challenges, Practices, and Beliefs in the Online Data Research Community.” In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, 939–51. New York: ACM Press. doi:10.1145/2818048.2820078.

Williams, Matthew L, Pete Burnap, and Luke Sloan. 2017. “Towards an Ethical Framework for Publishing Twitter Data in Social Research: Taking into Account Users’ Views, Online Context and Algorithmic Estimation.” Sociology 51 (6): 1149–68. doi:10.1177/0038038517708140.

Zimmer, Michael. 2010. “Is It Ethical to Harvest Public Twitter Accounts Without Consent?” https://www.michaelzimmer.org/2010/02/12/is-it-ethical-to-harvest-public-twitter-accounts-without-consent/.