Research Ethics Issues Raised in Collecting and Maintaining Large Scale, Sensitive Online Data

Michelle N. Meyer

Center for Translational Bioethics and Health Care Policy and Steele Institute for Health Innovation, Geisinger

2019

Other ethical issues raised

Group harm, vulnerable populations, and participant engagement

Certain populations may both perceive themselves to be at increased risk from having their data aggregated and shared and in fact be at increased risk of harm. For instance, one survey study found that “[l]esbian, gay and bisexual (LGB) respondents were more likely to express concern over their Twitter posts being used in government (odds increase of 2.12) and commercial settings (odds increase of 1.92), compared to heterosexual respondents,” perhaps owing to historical online abuse of LGBT persons and/or antagonistic relationships between these communities and some governments (Williams, Burnap, and Sloan 2017). If ISSOD expects to enable research on vulnerable populations, it should consider engaging those communities early in the process of developing the platform to see whether risks can be mitigated or community representatives can be included in ISSOD oversight structures. More generally, it is a good idea to engage the public (whether vulnerable or not) in development and oversight of the research platform.

Access to data

Data repositories typically calibrate data access according to the sensitivity and/or identifiability of the data, providing different access mechanisms for different kinds of data. At one end of the spectrum, the most sensitive data might only be accessible through virtual or actual data enclaves. At the other end of the spectrum, metadata and innocuous, individual-level raw data might be made publicly accessible. In between are myriad options (see Table 1 in M. N. Meyer (2018a), and Appendix) that involve trade-offs between data security, on the one hand, and transparency and fair access, on the other. For instance, it is common to restrict data access to “qualified researchers,” usually permanent faculty (i.e., those with “PI privileges”) affiliated with a research institution (e.g., Social Science One). The purpose of this restriction is to allow the institution to be required to be a party to the data use agreement, thereby lending that agreement some teeth. But this practice excludes citizen scientists, independent academics, and journalists, who may have legitimate interests in the data and who are likely to be more representative, politically and in other ways, of the general population whose data comprises the repository. A data repository that purports to serve the public good but is accessible only by academics may be viewed with skepticism by citizens on the political right, who may associate academia with biased, ideologically driven research.

Data use agreements should prohibit attempts by researchers to re-identify or contact data subjects without the explicit permission of ISSOD (which can review proposals for, e.g., re-identification research).

Commercialization of research

It is essentially unheard of for data sources to be compensated for the research use of their data or to share in any profits derived from research, whether those data derive from human tissue or online behavior. Nevertheless, some people feel differently about the use of their data for commercial as opposed to non-profit purposes, and a persistent minority of data subjects believe they should be financially compensated for use of what many regard as their “property” (Fiesler and Proferes 2018). For instance, one of several objections by some to the case of Henrietta Lacks (in which physician-investigators collected and used leftover, quasi-pseudonymized clinical tissue for research without consent, as was then and remains today legal) is that, although those physician-investigators did not profit from the research, others downstream from the initial research did, and handsomely. Some ethicists have argued cogently that sources of passively collected research materials that the source would have produced anyway and which therefore entail no extra effort by or inconvenience to the data subject (e.g., leftover clinical tissue or, in the present context, digital data already produced for the user’s own purposes) are not morally owed compensation (Truog, Kesselheim, and Joffe 2012). Still, in response to this minority public sentiment, the revised Common Rule newly requires research consent to include, where appropriate, a “statement that the subject’s biospecimens (even if identifiers are removed) may be used for commercial profit and whether the subject will or will not share in this commercial profit” (45 C.F.R. § 46.116(c)(7) (2017)). The regulations do not, however, provide that tissue sources must, or even ought to, share in profits.

[NB: Because the Common Rule does not apply to research with non-identifiable biospecimens collected for a purpose other than the instant research project, these new requirements technically only apply to research in which researchers intervene or interact with tissue sources to newly collect identifiable or non-identifiable biospecimens.]

Although compensating each data subject for their contribution is unreasonable and infeasible, it does make sense to constrain researchers from commercializing ISSOD-enabled research in ways that would threaten public access to the benefits of that research. This is especially important if, as seems likely, the public will be passively contributing data to ISSOD and bearing some (minimal) risk for doing so. Social Science One, for instance, precludes researchers who are funded by the platform from patenting their results, but it does not preclude other for-profit uses of the data (e.g., writing a profitable trade book about the research results) (Social Science One 2018).

Partnering with, versus scraping, platforms

In general, partnering with platforms provides several potential advantages. First, platforms can incorporate notice of (if not consent to) data sharing with ISSOD into their user-facing materials. Second, platforms may be able to help transmit aggregate or individual results to users as a gesture of respect. Third, the standard rule in biobank research is that samples from participants who withdraw are destroyed and no longer used in analyses going forward, but that their data is not removed from completed analyses. By analogy, partnering with a platform could entail a mirrored research database that refreshes in real time, enabling deleted tweets to be automatically removed from the research platform. (The evolving content of the database would have to be accounted for somehow to enable reproducibility of the analyses.) Conversely, scraping is often frowned upon by both platforms and users, especially if the platform is quasi-closed and community- or purpose-specific, where researchers may be viewed as interlopers. This can undermine trust, setting back the goal of large scale data sharing. The primary risk of partnering with platforms is influence by the platform over the research; the research would need to be insulated from such influence.
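The mirrored-database idea can be made concrete with a small sketch. It is illustrative only: the class and method names are hypothetical, and it assumes that retaining the bare ID of a deleted post (but not its content) is acceptable under the applicable agreement, which would itself need ethical review. Each analysis records a manifest (a count and a hash of the post IDs it saw), so that a later reader can tell whether the database has changed since the analysis was run.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class MirrorStore:
    """Hypothetical research mirror of a platform's posts (not a real API)."""
    posts: dict = field(default_factory=dict)     # post_id -> text
    tombstones: set = field(default_factory=set)  # IDs deleted upstream

    def upsert(self, post_id: str, text: str) -> None:
        # Ignore re-delivery of a post that the user has already deleted.
        if post_id not in self.tombstones:
            self.posts[post_id] = text

    def delete(self, post_id: str) -> None:
        # Drop the content; keep only the ID so the deletion is remembered.
        self.posts.pop(post_id, None)
        self.tombstones.add(post_id)

    def manifest(self) -> dict:
        # Summarize which posts an analysis saw without storing their content.
        ids = sorted(self.posts)
        digest = hashlib.sha256("\n".join(ids).encode("utf-8")).hexdigest()
        return {"n_posts": len(ids), "ids_sha256": digest}

    def still_available(self, ids_used: list) -> list:
        # Posts from an earlier analysis that may still be analyzed today.
        return [i for i in ids_used if i in self.posts]


store = MirrorStore()
store.upsert("t1", "first post")
store.upsert("t2", "second post")
before = store.manifest()   # manifest recorded with a completed analysis
store.delete("t2")          # the user deletes the post on the platform
after = store.manifest()    # manifest seen by any analysis going forward
```

Consistent with the biobank analogy, the completed analysis is not rerun, but the differing manifests document that a replication attempted later would be working from a different set of posts.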

Governance

There is not currently consensus on the ethical issues raised in this white paper, and that is unlikely to be achieved in the near future, if ever (King and Persily 2018; Vitak, Shilton, and Ashktorab 2016; Zimmer 2010). Moreover, as always, different research studies will raise different concerns, significantly challenging the feasibility of one-size-fits-all ethics rules. For both reasons, it is wise to focus on process.

What should that process look like? Most university IRBs are unfamiliar with both data sharing and research with online data, leading to both Type I and Type II errors in reviewing proposals such as this. Moreover, the vast majority of secondary research with data housed with ISSOD will either be non-human subjects research or exempt from IRB review. Most IRBs will not conduct a substantive review of research that falls outside of their jurisdiction. Finally, many aspects of the Common Rule, which IRBs are trained to apply, are a poor fit for ISSOD. The act of data collection and maintenance itself does not meet the regulations’ definition of “research” (M. N. Meyer 2018b) and, as discussed above, most secondary analyses of these data will be exempt from the Common Rule or fall outside of it completely. The Common Rule is further hampered by a weak definition of “identifiable” and a focus on risks to individual data subjects only, as opposed to third parties, groups, or society.

Still, some sort of prospective group ethics review is desirable, at least for some categories of data collection (e.g., prior to scraping a platform) and some categories of secondary research with collected data (e.g., research with highly sensitive and/or highly identifiable data, research investigating a highly sensitive or controversial question, or research that targets vulnerable populations). Social Science One requires university IRB approval or (more likely) determination of non-human subjects or exempt status; prospective peer review that includes a review of the proposal’s scientific merit and potential benefits, but also of the “ethical track record” of the PI and the potential costs to data subjects and others; and, assuming the proposal passes peer review, separate ethics review by ethicists appointed by Social Science One with specific expertise in online research ethics. Finally, Social Science One collaborates with an NSF-funded team of information scientists, PERVADE, to provide continuous ethics feedback about the platform’s decisions. Something like this body might be engaged to review at least some subset of data collection and data analysis activities. Members of that committee should include not only those who specialize in online communities and digital data but also those who are broadly trained in moral reasoning. ISSOD should also consider including laypersons on the committee (ideally, more than the single community member most IRBs retain). For more on oversight of research when IRB review and the Common Rule are poor fits, see M. N. Meyer (2018a, 224–27).

[NB, on exempt determinations: King and Persily (2018) incorrectly state that federal research regulations require IRBs to make exempt determinations. In fact, the Common Rule is silent on this question. Although OHRP has historically recommended that this not be left to investigators (https://www.hhs.gov/ohrp/regulations-and-policy/guidance/faq/exempt-research-determination/index.html), during the Common Rule revision process HHS/OHRP proposed a “decision tool” that would allow investigators (or others, as the institution prefers) to make their own exempt determinations, with impunity, so long as the inputs they enter are accurate. That did not make it into the Final Rule, largely because the agencies ran out of time to develop the tool, but HHS has indicated that it intends to introduce this in the future. Perhaps anticipating this, some IRBs have developed online decision tools, of a sort, through which they permit their investigators to make their own exempt determinations. The proposed practice of investigators making their own exemption determinations is controversial, and until the dust settles, it would be best if IRBs continue to make these determinations for ISSOD projects.]

[NB, on peer review: It is unclear how peer reviewers are meant to investigate the “ethical track record” of the PI. Nor is it clear that peer reviewers are sufficiently knowledgeable about the risks of this work to be helpful (much as IRBs are often insufficiently knowledgeable about the scientific merits of proposed research).]

Public trust

Whether behavior is morally right or wrong is not dictated by public opinion. However, data sharing initiatives like ISSOD will not be successful if they do not earn the public’s trust. This is likely to be one of the biggest obstacles to success. In the U.K., for instance, the National Health Service (NHS) attempted to invoke a social license to justify the extraction of data from medical records, partly for research, unless patients opted out. Public and even professional opposition was strong enough that the program, care.data, was shelved. Some scholars have argued that the program failed to secure all necessary aspects of a social license for research, which require that data subjects perceive participation to be voluntary and governed by values of reciprocity, non-exploitation, and service of the public good (Carter, Laurie, and Dixon-Woods 2015).

There is some existing research investigating perceptions of research use of digital data, and it suggests the challenges ahead. In the aforementioned survey of Twitter users, 65% believed that researchers should not be able to use even public tweets without explicit user permission. When asked whether they themselves would agree to allowing a university researcher to use their tweet, 53% said yes, 14% said no, and 33% said it would depend on contextual factors. When asked if they would opt out of having their tweets used in all academic research, 29% said yes and another 25% again said it would depend. When asked which factors would influence their comfort with “a tweet” of theirs being used in research, respondents (n = 268) were most likely to indicate being somewhat or very uncomfortable when: the tweet was from their protected account (75%), no consent was sought (67%), it was a public tweet they had later deleted (64%), the tweet was quoted in published research and attributed to their Twitter handle versus quoted anonymously (56% vs. 27%), researchers also analyzed their public profile information (e.g., username and location) versus researchers not having such information (55% vs. 20%), they were informed after the fact (50%), their tweet was one of only a few dozen being analyzed versus one of millions (47% vs. 21%), and a human read their tweet to analyze it versus a computer program doing so (37% vs. 17%). When asked about their overall comfort level with tweets being used in research, only 21% to 27% of respondents said they were somewhat or very uncomfortable. But when asked about their comfort if “your entire Twitter history was used,” that number shifted to 49%. When asked if they would want to know that a tweet of theirs was used in a university study, 80% of respondents said yes.

[NB, on tweets quoted anonymously: The authors note that respondents may not have realized how easy it is to reidentify a user whose tweet is quoted verbatim, even if the username is omitted.]

Notably, this survey did not elicit the strength of respondent preferences for consent if such a requirement would hamper or preclude research. A study of patient preferences regarding consent to medical record review found that although most respondents preferred an in-person consent session with their physician, only 13.8% would prefer such research not to occur if written or verbal consent would make the research too difficult to conduct (Kraft et al. 2016).

Several barriers to public acceptance of large scale, nonconsensual data collection are likely, including an ineffable sense of “creepiness” (especially, for instance, in the case of data collected from platforms that present themselves as private, such as WhatsApp and Facebook Messenger), fear of research, and lack of appreciation of how little is known about important social phenomena (and, hence, the importance of research with big data). Additional research on lay perceptions and preferences is needed for an initiative such as ISSOD to be successful, including: a) research that progresses beyond opinion poll-like surveys and takes an incentive compatible and/or experimental approach to measuring preferences and otherwise elicits preferences in ways that require respondents to acknowledge the trade-offs of data privacy, b) research that investigates perceptions and preferences of users on platforms other than Twitter (e.g., Facebook, Reddit, 4/8chan, Instagram, WhatsApp), and c) research that goes beyond baseline perceptions and preferences and investigates how to communicate initiatives like ISSOD to the public in ways that engage, rather than alienate, the public.

Appendix: Data Access Provisions of Major Data Repositories

NHANES

Participants

Nationally representative sample of 5,000 American children & adults/year
Oversamples people over 60 years, Hispanics & African Americans

Data collected

Survey (demographic, socioeconomic, dietary, health-related questions)
Exam (medical, dental, physiological measurements, lab tests, genomic)
Smoking, alcohol consumption, sexual practices, drug use, physical fitness & activity, weight, dietary intake, reproductive health (e.g., use of oral contraceptives & breastfeeding practices)
SNP data (after 2003)

Data tiers

Open/public/unrestricted
Some small anonymized genomic datasets not linkable to any other datasets are available on request w/o IRB review (because no human subjects involved w/non-identifiable data) through Data Use Agreement/release form
Restricted: Data that could compromise confidentiality of survey respondents or institutions or is “sensitive by nature”
- All geographic data below national level
- Exact interview & exam dates
- Most genomic data

Processes for accessing restricted data

NHANES Data Support Agreements: initiated by NHANES w/identified experts under signed agreement to assist in data collection or processing
NHANES QA/QC Collaborator datasets: Inter-agency QA/QC dataset agreement w/current NHANES collaborators 3 mos prior to public release
NHANES Special Use Data Agreements: Under special circumstances NCHS enters into agreement w/Collaborators, CDC employees, or any researcher to provide limited non-public special dataset; request reviewed by Director & Confidentiality Officer
NCHS RDC applications: Requests by “any researcher” to match NHANES data to external data sources; to analyze lower level geography or indirect identifiers; for access to non-public release data which are the basis of published analyses, e.g., published analyses based on one year of data:

Application (example) submitted to Research Data Center (RDC), judged by: Well-defined research question addressing public health concern (consistent w/consent scope), explanation of why restricted variables are necessary, technical feasibility, disclosure risk (based on variables requested, remote vs. on-site access, analytic plan including stats methods)
Review Committee (including Analyst, Data System Rep(s) & Confidentiality Officer) may approve, disapprove, or (often) R&R
Avg review: 6-8 weeks
Approval doesn’t guarantee all output generated by analysis will be released; output is reviewed for disclosure risk & will be suppressed if necessary
Completion of online Confidentiality Orientation & 100% score on quiz
Signed Confidentiality Agreement (e.g., use data only for approved purpose; no attempt to re-ID or discover suppressed cells; no attempt to introduce any additional data through statistical programming or otherwise; don’t use data in way that poses additional risk to respondents; if you inadvertently deduce small cells (<5) or individual-level information, don’t share that information with anyone or in any publication and immediately notify RDC)
Signed Designated Agent affidavit
Review of Disclosure Manual
Submission of fee
Manuscript must be submitted to RDC Analyst prior to submitting for publication
Appears that NCHS Ethics Review Board (ERB) reviews all proposals to analyze restricted (i.e., identifiable) data after RDC approves (see here); biospecimens consent refers to such ERB review but survey consent doesn’t, even though some survey data are restricted/identifiable. No local IRB review appears to be required.
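The output review and small-cell rule described in this entry can be illustrated with a minimal sketch. It assumes a simple table of counts and uses the "<5" threshold mentioned in the Confidentiality Agreement summary; the function name and the example table are hypothetical, and this is not a description of the RDC's actual disclosure review procedure.

```python
SMALL_CELL_THRESHOLD = 5  # counts below this are treated as disclosive


def suppress_small_cells(table, threshold=SMALL_CELL_THRESHOLD):
    """Return a copy of a {row: {column: count}} table with small cells masked.

    Masked cells are replaced by None so they cannot be mistaken for zeros.
    """
    return {
        row: {col: (n if n >= threshold else None) for col, n in cols.items()}
        for row, cols in table.items()
    }


# Hypothetical output from an analysis of restricted geographic data.
counts = {
    "county_A": {"yes": 12, "no": 3},
    "county_B": {"yes": 7, "no": 40},
}
masked = suppress_small_cells(counts)
# masked["county_A"]["no"] is None; every other cell is unchanged
```

Masking individual cells is only a first step: if row or column totals are also released, a suppressed value can sometimes be recovered by subtraction, which is one reason repositories review output by hand rather than relying on a single rule.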

National Longitudinal Survey of Youth (NLSY)

Participants

Nationally representative birth cohorts
Incidentally includes prisoners; their data is not specially protected or restricted but questions about illegal activity are skipped in interviews if prisoner cannot enter answers directly into laptop (i.e., if prisoner would have to answer out loud)

Data collected

Education, training, cognitive tests; age, gender, geographic residence & neighborhood composition; household composition; race, ethnicity & immigration; computer & Internet access
Sensitive questions: income & assets, religion, relationships w/parents & family, sexual experiences, abortion, drug & alcohol use, criminal activities, homelessness, runaway episodes
Geocode data: county, metropolitan statistical area, ZIP Code, census tract and block, & latitude & longitude of residence

Data tiers

Open/public/unrestricted
Restricted data

Geographic data: state, county, metropolitan statistical areas of residence, country or state/county of birth, state of residence, state, country, or region of world in which respondent’s parents & grandparents were born
Name & locations of colleges & universities attended
Dates of birth, marriage, divorce, death, school attendance, etc. are month-year only in public files; specific dates restricted
Income & assets variables are public but topcoded (see p. 157 here)

Processes for accessing restricted data

Project-specific application (new use of data requires new application):
Clear statement for general audience of project (max 4 paragraphs) & explanation why geographic variables are necessary
Project must further mission of BLS & the NLS program “to conduct sound, legitimate research in the social sciences” (see p. 157 here)
Researcher must work/study at U.S. institution (needn’t be U.S. citizen)
Non-negotiable Letter of Agreement pledging to adhere to BLS confidentiality policy, signed by BLS office & official at requestor’s institution w/authority to enter into legal agreements on institution’s behalf
Each individual authorized to access data under Letter of Agreement signs non-negotiable BLS Agent Agreement designating him as unpaid agent of BLS & requiring him to take certain security & confidentiality measures
Geocode agreements last 1 year for students, 3 years for most faculty; term for adjuncts, visiting faculty, postdocs, etc. depends on how long they’ll stay at institution; extensions granted on case-by-case basis; if researcher leaves institution before end of agreement, his BLS agent agreement terminated & if no one else is authorized at institution, Letter of Agreement also terminated
Average 6-8 weeks after application submitted until legal docs signed at BLS
After letters executed, can order appropriate Geocode CD
Research outputs subject to review by BLS to ensure compliance w/confidentiality requirements
Facilities where Geocode data used subject to BLS inspection to ensure compliance w/Letter of Agreement
May not link geocode data w/individually identifiable records from any other dataset
Penalty for misuse
Process to access original cohorts geocode data slightly different

Framingham Heart Study (FHS)

Participants

Three generations comprising 15K clinically & genetically well-characterized participants
Since 1994, two groups from minority populations added

Data collected

Medical records, specimens (DNA, urine, blood & blood products), physical examination, blood tests, electrocardiogram, genetic data

Data tiers (doesn’t seem to be any public data)

Much FHS data is available through other repositories according to their access policies: geno/pheno data via dbGaP; pheno data via BioLINCC
Authorization required: studies of existing data, collection of new data (from participants or existing samples), images or medical records

Processes for accessing restricted data

Applicant first requests & receives an account
Application: background & rationale, specific aims, methods, data requested
Application routed to appropriate committee(s):
- Executive Committee (requests for participant contact)
- Lab Committee (requests for Framingham bio-specimens for non-genetic research)
- DNA Committee (requests for genomic data not included in dbGaP; requests for Framingham DNA or other bio-specimens for genetic research)
- Research Committee (requests for clinical data not available in BioLINCC)
Application review criteria:
- Does proposal complement Framingham's research scope?
- Is collaboration w/a Framingham investigator planned?
- Does proposal require unique characteristics of FHS cohort(s)?
- Does proposal put minimal demand on FHS resources?
- Does proposal show proof of resources for conducting project?
- Investigators strongly encouraged to use Omni data/biospecimens
- Local (i.e., investigator) IRB approval required of all approved data &/or material distributions. (“Although Framingham data is de-identified, FHS is a study of a single community & hence one's identify can be more easily ascertained, even if traditional identifiers are removed.”)
If proposal involves new participant contact or additional specimen collection, Observational Studies Monitoring Board (external to FHS) also reviews proposal.
Proposals eligible for expedited review (w/in 2 weeks) if request only existing data or new phenotypic data w/o participant contact
Among application questions: Will this project generate new individual level data on Framingham participants? For example, sets of analyzable data from individual level measurements, images or lab specimens.
Fee from $3-$10,000.
Framingham Data & Materials Distribution Agreement (DMDA) required of all approved data &/or material distributions:
- No attempted re-ID
- Data & materials not used for any purpose contrary to consent; must consult w/Study Investigators re: consent terms & conditions
- No use beyond approved research project
- No further sharing
- Advance notification of publication, etc.

Wisconsin Longitudinal Study (WLS)

Participants

NIA-sponsored longitudinal study of a random sample of 10,317 1957 WI H.S. graduates
Broadly representative of white, non-Hispanic Americans w/at least high school education; minorities not well represented

Data collected

Genotypic data from Illumina HumanOmniExpress array w/ quality metrics (supplied by U of Washington Genetic Analysis Center)
Genotype imputation data w/genotypes imputed to 1000 Genomes Project phase 3 reference panel
Life course, intergenerational transfers & relationships, family functioning, physical & mental health & well-being, morbidity & mortality from late adolescence through 2011, social background, youthful aspirations, schooling, military service, labor market experiences, family characteristics & events, social participation, psychological characteristics & retirement, attractiveness rating, relative BMI

Data tiers & processes for access

Most WLS data are publicly downloadable & free w/some variables removed (geography, birth & death months, data re: friends & relationship w/other participants, names of colleges) or top- or downcoded to prevent ID of outliers (many monetary values, height, weight, BMI). They only require downloaders of the public data (Level One) to provide them with their name, a valid email address, geographic location, and academic area of specialty.
Subject to approval: email WLS staff, explain why level-1 data insufficient, sign confidentiality agreement. “There is a smaller subset of our data that is defined as either sensitive or has a marginally higher risk of identifiability (Level Two). Access to that data is granted after the researchers provide us with a statement on why the public data is not sufficient to answer their research question, a copy of their CV, and proof of human subjects training.”
Accessible thru secure server: extremely sensitive data (e.g., genetic data, audio recordings) available w/research plan, local IRB approval, signed DUA, analyses conducted on WLS secure server, genetic data also approved by WLS Genetic Advisory Board. “Finally, our most sensitive data, including the DNA data and some geographic codes, is a Level Three request. These data require a fully-executed Data Use Agreement (DUA), and proof of IRB (or equivalent) approval from the researcher’s home institution. Once the data are licensed through the DUA we provide users with a copy of the data to analyze at their home institution.”
Accessible only in physical coldroom: SS earnings & benefits
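The top- and downcoding mentioned in this entry can be sketched in a few lines. This is a generic illustration of the technique, not WLS's actual procedure: the function name, the thresholds, and the income values are hypothetical. Values beyond a chosen bound are replaced by the bound itself, so outliers remain in the file but can no longer be singled out by their exact value.

```python
def top_and_bottom_code(values, lower, upper):
    """Clamp each value to the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical annual incomes, including one extreme outlier.
incomes = [18_000, 52_000, 75_000, 1_250_000]
coded = top_and_bottom_code(incomes, lower=20_000, upper=150_000)
# coded == [20000, 52000, 75000, 150000]
```

The trade-off is analytic: coded values bias means and variances, which is part of why repositories offer the uncoded variables under more restrictive access tiers.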

Health and Retirement Study (HRS)

Participants

Longitudinal panel study of representative sample of 20K Americans supported by NIA & SSA

Data collected

In-depth interviews, health data, genetic data

Data tiers & processes for access

Most survey data is unrestricted & publicly available w/
1. Registration (name, valid email, phone, organization, state; type of organization, including none or other; primary role in org—faculty, students, staff, other; highest degree; whether you’re working alone, w/collaborator w/name or under supervision w/name; primary research area, whether you’ve published using HRS data)
2. Agreement (via registration) to conditions of use (e.g., no attempted ID, no data transfer to 3rd parties)
Sensitive health data (e.g., biomarkers, prescription drug data, diabetes study, cognition & behavior phenotypic data, telomere data, memory): available from public portal w/
1. Sensitive DUA (no re-ID, store & use data in secure environment, cite HRS & provide copies of publications to HRS, guidelines for frequency & magnitude tabulations, publish only aggregate stats)
2. Verification of identity & institutional affiliation (unclear if independent/citizen sci ok)
Restricted data (SSA admin data, VA health data, national death index cross-year cause of death, CMS cross-ref file, geographic info, pension estimation program & database, industry & occupation data, cancer site, Part D Plan info, interview date, date of death, cross-wave race & ethnicity, college tuition imputations) available in 2 ways w/different data security plans & confidentiality agreements (flow chart; text narrative):
1. MiCDA Enclave Virtual Desktop Infrastructure (VDI): submit application (for each participating institution) including:

Letter from Dept. Chair (for students)
Proof of local IRB review (exempt, expedited, or full)
1-3 p. Research proposal (what restricted variables you need & why, study team details, project goals)
Data order form
MiCDA VDI Data Security Plan
MiCDA Data Enclave Acceptable Use Policy
ISR Pledge to Safeguard Respondent Privacy
Confidentiality Agreement
[Certain data merges can only be performed by visiting Enclave in person w/additional signed Confidentiality Agreement Restricting Disclosure & Use of Data from the MiCDA Enclave]
2. Traditional licensing agreement (required for SSA, CMS linkages):
Proof of IRB approval (expedited or full—apparently NOT exempt): once your application complete & data security plan is acceptable, you submit proposal to your IRB &/or your institutional Contracting Authority for review; completed reviews are submitted to HRS
1-3 p. Research proposal (what restricted variables you need & why, study team details, project goals)
Data order form
Data security plan (see also this checklist)
CV(s)
Institutional Federalwide Assurance (IRB registered w/OHRP)
Proof of current federal funding (primary penalty for breach is notification via NIA to your funding agency)
Confidentiality Agreement w/institutional countersignature

Access to SSA administrative data & CMS research data is “more involved”; contact HRS for details (seems to involve encrypted physical media to qualified researchers?)
Genetic data: genetic data products (candidate gene & SNP files, genotype data, exome data) from 20K genotyped respondents: available after applying first for dbGaP access to controlled data & then submitting to HRS a Genetic Data Access Use Agreement & Genetic Data Order Form
Additional restrictions on merging restricted data files w/each other (e.g., restricted SSA admin records may not be merged w/geographic data)

Completed application goes for final review by full Data Confidentiality Committee (DCC). If approved, HRS PI signs the Confidentiality Agreement & data is made available. Restricted data agreement must be renewed annually. Any change in circumstances requires modification to agreement. Periodic inspection of site by HRS possible. On termination, licensing agreement users must return or destroy data (if latter, must provide counter-signed certification of destruction).

Notes about IRB review

HRS considers public data to pose a sufficiently low risk of re-identification that it is exempt from IRB review
This memorandum from the HRS PI to IRBs makes clear that restricted data poses a sufficiently high risk of re-identification that exemption is inappropriate, that the risk for the IRB to consider is that of re-identification, & that IRB review should therefore center on the “Restricted Data Protection Plan, and those aspects of the Research Plan that deal with issues of respondent anonymity and data security, if any. By the time they reach you, HRS will have approved these Plans. But we ask for your review because you will be better able to judge the extent to which, in your institution's physical and computing environment, whether the Plans are adequate to ensure participant anonymity and limitation of access to the restricted data to the persons specified in the agreement.” [NB: This partially contradicts what is said about the VDI pathway, which is that your IRB may indeed find your research exempt.]
Certification of IRB review form: Certify that 1) your IRB has an active FWA and 2) “Our Institutional Review Board/Human Subjects Review Committee has reviewed, according to its standards and procedures for live human subjects, and approved, the Restricted Data Protection Plan (and those portions of the Research Plan that deal with respondent anonymity and data security, if any), approved by the Health and Retirement Study, of the Restricted Data Investigator above; and has approved those plans.”

Possible penalties for breach of Agreement for Use of Restricted Data

Denial of future access to HRS data
Notification to your institution’s scientific integrity office & request for sanctions
Notification to your current funding agency w/recommendation that all current funds be terminated & all future funds be denied
Other remedies available at law

dbGaP

Participants & data collected

Genotype & phenotype data collected via many studies under a wide range of consent terms

Data tiers

Open: available to anyone w/no restrictions
Controlled: allows download of individual-level genotype & phenotype data that have been de-identified (i.e., no personal identifiers, such as name)

Process for access

PI (must be registered by your institution as a PI in your eRA account) & institutional Signing Official (both w/NIH eRA Commons accounts) co-sign request for data access
Statement summarizing proposed research use for the requested data
List of collaborating investigators at same institution (collaborators at other institutions must submit own requests)
Submission of request constitutes agreement to Data Use Certification (e.g., use limited to proposed project, no re-ID or re-contact, no further distribution of data)
Agree to Code of Conduct
Adhere to data security measures

Criteria for access

Data access & use must be consistent w/participants’ intent as reflected in consents; datasets are placed in different “consent groups,” e.g.:
- General research use: use limited only by model Data Use Certification
- Health/medical/biomedical research only: no ancestry, no possibly stigmatizing research, no non-health research
- Non-profit use only
Some datasets require local IRB approval; others (e.g., often, general research use) don’t
Access to all controlled datasets requires approval of an NIH Data Access Committee (DAC), which looks to ensure proposed plan matches limitations (if any) of consent group; no additional ethics review is involved
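
The consent-group matching described above is essentially a rule check of a proposed use against the limitations attached to a dataset. The sketch below illustrates that logic only. The consent-group labels, the `Proposal` fields, and the function name are all hypothetical, and the rules encoded are just the three example groups and the local-IRB condition listed in this section. It is not dbGaP's or any Data Access Committee's actual procedure.

```python
from dataclasses import dataclass

# Hypothetical labels for the three example consent groups listed above.
GENERAL = "general_research_use"
HEALTH_ONLY = "health_medical_biomedical_only"
NONPROFIT_ONLY = "non_profit_use_only"


@dataclass(frozen=True)
class Proposal:
    """Hypothetical summary of a proposed research use."""
    health_related: bool
    studies_ancestry: bool
    possibly_stigmatizing: bool
    for_profit: bool
    has_local_irb_approval: bool


def consent_group_problems(proposal, consent_group, requires_local_irb):
    """List the ways a proposal conflicts with a dataset's consent group."""
    problems = []
    if consent_group == HEALTH_ONLY:
        if not proposal.health_related:
            problems.append("non-health research not permitted")
        if proposal.studies_ancestry:
            problems.append("ancestry research not permitted")
        if proposal.possibly_stigmatizing:
            problems.append("possibly stigmatizing research not permitted")
    elif consent_group == NONPROFIT_ONLY:
        if proposal.for_profit:
            problems.append("for-profit use not permitted")
    elif consent_group != GENERAL:
        raise ValueError(f"unknown consent group: {consent_group}")
    # Some datasets additionally require local IRB approval.
    if requires_local_irb and not proposal.has_local_irb_approval:
        problems.append("local IRB approval missing")
    return problems


if __name__ == "__main__":
    proposal = Proposal(
        health_related=True,
        studies_ancestry=True,
        possibly_stigmatizing=False,
        for_profit=False,
        has_local_irb_approval=False,
    )
    print(consent_group_problems(proposal, HEALTH_ONLY, requires_local_irb=True))
    print(consent_group_problems(proposal, GENERAL, requires_local_irb=False))
```

Returning a list of conflicts rather than a single yes/no reflects that a proposal may fail on more than one limitation at once; an empty list means only that none of the encoded limitations was triggered.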

References

Ballantyne, Angela, and G. Owen Schaefer. 2018. “Consent and the Ethical Duty to Participate in Health Data Research.” Journal of Medical Ethics 44 (6): 392–96. doi:10.1136/medethics-2017-104550.

Bruckman, Amy. 2016. “Do Researchers Have to Abide by Terms of Service (TOS)?” The Next Bison: Social Computing and Culture. https://nextbison.wordpress.com/2016/02/26/tos/.

Budin-Ljøsne, Isabelle, Harriet J. A. Teare, Jane Kaye, Stephan Beck, Heidi Beate Bentzen, Luciana Caenazzo, Clive Collett, et al. 2017. “Dynamic Consent: A Potential Solution to Some of the Challenges of Modern Biomedical Research.” BMC Medical Ethics 18 (1): 4. doi:10.1186/s12910-016-0162-9.

Carter, Pam, Graeme T. Laurie, and Mary Dixon-Woods. 2015. “The Social Licence for Research: Why Care.Data Ran into Trouble.” Journal of Medical Ethics 41 (5): 404–9. doi:10.1136/medethics-2014-102374.

Caulfield, Timothy, and Jane Kaye. 2009. “Broad Consent in Biobanking: Reflections on Seemingly Insurmountable Dilemmas.” Medical Law International 10 (2): 85–100. doi:10.1177/096853320901000201.

Fiesler, Casey, and Nicholas Proferes. 2018. “‘Participant’ Perceptions of Twitter Research Ethics.” Social Media + Society 4 (1). doi:10.1177/2056305118763366.

Hartzog, Woodrow N., and Frederic D. Stutzman. 2013. “The Case for Online Obscurity.” California Law Review 101 (1): 1–49.

Kaye, Jane, Edgar A. Whitley, David Lund, Michael Morrison, Harriet Teare, and Karen Melham. 2015. “Dynamic Consent: A Patient Interface for Twenty-First Century Research Networks.” European Journal of Human Genetics 23 (2): 141–46. doi:10.1038/ejhg.2014.71.

King, Gary, and Nathaniel Persily. 2018. “A New Model for Industry-Academic Partnerships.” Working Paper. http://j.mp/2q1IQpH.

Kraft, Stephanie Alessi, Mildred K. Cho, Melissa Constantine, Sandra Soo-Jin Lee, Maureen Kelley, Diane Korngiebel, Cyan James, et al. 2016. “A Comparison of Institutional Review Board Professionals’ and Patients’ Views on Consent for Research on Medical Practices.” Clinical Trials 13 (5): 555–65. doi:10.1177/1740774516648907.

Meyer, Michelle N. 2015. “Two Cheers for Corporate Experimentation: The A/B Illusion and the Virtues of Data-Driven Innovation.” Colorado Technology Law Journal 13 (2): 273–331.

———. 2018a. “Ethical Considerations When Companies Study—and Fail to Study—Their Customers.” In The Cambridge Handbook of Consumer Privacy, edited by Evan Selinger, Jules Polonetsky, and Omer Tene, 207–31. Cambridge, UK: Cambridge University Press.

———. 2018b. “Practical Tips for Ethical Data Sharing.” Advances in Methods and Practices in Psychological Science 1 (1): 131–44. doi:10.1177/2515245917747656.

Mittelstadt, Brent, Justus Benzler, Lukas Engelmann, Barbara Prainsack, and Effy Vayena. 2018. “Is There a Duty to Participate in Digital Epidemiology?” Life Sciences, Society and Policy 14 (1): 9. doi:10.1186/s40504-018-0074-1.

Nayak, Rahul K., David Wendler, Franklin G. Miller, and Scott Y. H. Kim. 2015. “Pragmatic Randomized Trials Without Standard Informed Consent?: A National Survey.” Annals of Internal Medicine 163 (5): 356–64. doi:10.7326/M15-0817.

Nissenbaum, Helen. 2004. “Privacy as Contextual Integrity.” Washington Law Review 79 (1): 119–58.

Office for Human Research Protections. 2008. “Coded Private Information or Specimens Used in Research, Guidance.” https://www.hhs.gov/ohrp/regulations-and-policy/guidance/research-involving-coded-private-information/index.html.

Ploug, Thomas, and Søren Holm. 2016. “Meta Consent: A Flexible Solution to the Problem of Secondary Use of Health Data.” Bioethics 30 (9): 721–32. doi:10.1111/bioe.12286.

Secretary’s Advisory Committee on Human Research Protections. 2013. “Attachment B: Considerations and Recommendations Concerning Internet Research and Human Subjects Research Regulations, with Revisions.” https://www.hhs.gov/ohrp/sachrp-committee/recommendations/2013-may-20-letter-attachment-b/index.html#backfn2.

Social Science One. 2018.

Truog, Robert D., Aaron S. Kesselheim, and Steven Joffe. 2012. “Paying Patients for Their Tissue: The Legacy of Henrietta Lacks.” Science 337 (6090): 37–38. doi:10.1126/science.1216888.

Vitak, Jessica, Katie Shilton, and Zahra Ashktorab. 2016. “Beyond the Belmont Principles: Ethical Challenges, Practices, and Beliefs in the Online Data Research Community.” In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, 939–51. New York: ACM Press. doi:10.1145/2818048.2820078.

Williams, Matthew L., Pete Burnap, and Luke Sloan. 2017. “Towards an Ethical Framework for Publishing Twitter Data in Social Research: Taking into Account Users’ Views, Online Context and Algorithmic Estimation.” Sociology 51 (6): 1149–68. doi:10.1177/0038038517708140.

Zimmer, Michael. 2010. “Is It Ethical to Harvest Public Twitter Accounts Without Consent?” https://www.michaelzimmer.org/2010/02/12/is-it-ethical-to-harvest-public-twitter-accounts-without-consent/.
