Kalev Leetaru, Forbes, 20 July 2017
Not a week goes by that my inbox isn’t filled with a small barrage of announcements from publishers, universities, funding agencies and NGOs unveiling the latest open access or open data initiative. It is fantastic to see this newfound enthusiasm for making the final output of the world’s research available for open reading, reuse and replication. Yet the focus to date has been almost exclusively on the final outcomes of research, while the initial ethical reviews of just what research should be conducted in the first place remain cloaked in secrecy. Does this mean we should add “open ethics” to our push for “open access” and “open data?”
When it comes to “big data” research, champions tout that the peer review process and institutional ethics review process (Institutional Review Board, or “IRB,” in academic parlance) ensure a proper balance between innovation and ethics. Yet as I discovered last year, even in the aftermath of the most high profile ethical discussions, such as the Facebook emotions study, little has changed: both the university (Cornell) and the publisher (National Academies) in that case stated that no changes, or only minor ones, were made in its aftermath. Indeed, it is very likely that the paper would be published again were it submitted today.
We live in a world in which much of our “big data” research happens in secrecy up until the very moment it is published or leaked, often with primarily technical teams using data and methods in unexpected ways that frequently push the boundaries of privacy and ethics. Many come from fields with little to no history of ethical prereview of research and where the focus is on high profile results untempered by concerns of what those innovations might mean for society at large or the privacy of those affected by the research. Journals and funders in those fields, including federal funding agencies, apply a largely hands off approach, letting researchers police themselves or exempting entire fields from ethical review due to privacy and research ethics not being a historical focal point of the field, or concerns that such a focus on research ethics would unfairly penalize researchers from countries that “refuse to extend their [ethical] purview to cover social and behavioral science.”
A psychologist who runs a set of willing test subjects, all of whom have signed research releases, through a set of protocols to assess their various psychological traits and defects would likely never be permitted to publish those subjects’ names, traits and photographs in the literature for the world to see. Yet a computer scientist who harvests millions of unwitting people’s Facebook profiles (including those of young children) and runs them through a set of emotion mining algorithms to estimate highly sensitive emotional states like depression could publish the results as a public dataset, complete with each person’s real name, rough location and profile picture, all without ever interacting with an ethics reviewer, and see the data shared millions of times across the world. Only if the dataset garners the attention of the media or high profile ethicists is there likely to be any public discussion, and even if the dataset is eventually taken down, it will live on in myriad backup copies and mirror sites and make its way into countless more datasets and publications.
This is the world of research ethics we live in today: a relentless focus on outcomes and what can be done, with those decisions made in secrecy, rather than an open frank dialog about what we as a society agree should be done. It may be that our society makes a conscious decision to eliminate ethics and privacy as considerations for research, but those decisions should happen in public and made part of the public record, rather than in secrecy and known only to the few.
This past December, the Gates Foundation sent me an email celebrating the full launch of its open access policy, “requiring that all published foundation-funded research is free and immediately accessible, with full reuse rights granted to underlying data – no exceptions,” and followed up a month later with the reminder that “all published research funded by the foundation must be made available on full open access terms, free and immediately accessible, including underlying data.” The Foundation noted that “This is an important milestone for the scientific community, and a reflection of the growing commitment to free exchange of information among funders, researchers and, increasingly, publishers.”
Indeed, it is laudable that the Gates Foundation launched such a commitment to open access, especially its requirement that datasets be released for open reuse. However, when asked if the Foundation had a similar requirement that the research ethics reviews of its funded projects be released publicly (with any sensitive information redacted) or whether it was considering such a requirement, the Foundation replied that its open access efforts were “not intended to address” such issues and that “as a general matter, our grantees are responsible for the conduct of their projects. They are expected to manage their projects in accordance with all applicable legal and ethical standards, including obtaining consents and approvals that are applicable to the project.”
In short, the Gates Foundation’s focus on openness stops at the output of the research it funds, while the ethical considerations and justifications for that research being conducted in the first place, and any concerns raised during that process, remain the responsibility of researchers themselves, with the Foundation having no interest in making that process more transparent or open. Did a given project undergo any ethics review at all? Was it formally exempted from review due to its use of public data, without any further review of its methods or questions? Did the project undergo review and receive unanimous support with no concerns? Or did it undergo more than a year of debate, with considerable dissent about whether it should proceed amid serious concerns of potential harm, finally overruled by more senior members of the review panel? Or did the researcher go to their university’s lawyers to receive a special exemption from further review? In the course of my explorations of the state of research ethics today, I've found examples of all these situations. Yet the public and fellow researchers will never know how a given study came to be – they will just see a publication in a journal and a press release touting how amazing the research is. The debate over whether that research should have happened at all is nearly entirely hidden, inaccessible even to fellow scholars.
Reaching out to several major universities involved in “big data” research yielded a largely uniform sentiment that the ethical considerations of research are not something that should be publicly accessible. I asked Columbia University whether it required faculty and staff to submit research that uses large online datasets like social media or web content for research ethics review before undertaking it, especially work focused on compiling detailed profiles of personal information about users from their social media activity, and whether it permitted deliberate violations of terms of service by its researchers and/or the knowing use of data resulting from criminal data theft, including the use of stolen medical records. A spokesperson pointed me to their primary IRB website, but when pressed to answer these specific questions, the University said it was formally declining to answer any of them. When an institution as prestigious as Columbia, with a nearly $100 million investment in “big data” research, decides to formally decline comment rather than discuss how it ensures ethical conduct in its research, it speaks volumes about the state of research ethics in “big data” today.
Stanford University provided a relatively detailed response to the same questions, pointing to its human subject research policies and stating that “publicly available” data which “contains personally identifiable information” would require IRB review, though public data without such information may not. With respect to “research on big data from an external source, such as social media,” a spokesperson offered that “it is most likely that the university would first require an agreement with the source providing the data. The process of executing an agreement would include a privacy review to determine whether the university could accept ownership of the data, which would have to meet all privacy laws and requirements before the data could be transferred to the university for study. The university would then ‘own’ the data for the purposes of its research, and the university would not accept data for purposes of research if it could not verify the source.”
Despite this apparent attention to ethical detail, when asked if Stanford had a process for allowing members of the public and other researchers to review the approved ethics proposal of a given project to learn how the researchers justified the ethics of their project, she stated “With respect to public access — we do not provide added access. In many cases it would be premature to release protocols, since the goal of research would be to determine if a study does work. We also do sponsored research, and the sponsor would generally require confidentiality. Ethical review can be requested by the IRB, by the reviewers, by the panel if ever a question were raised about ethics.”
Particularly troubling to ethical transparency is the statement “we also do sponsored research, and the sponsor would generally require confidentiality.” Like most large research institutions, Stanford conducts a significant amount of externally funded research, but it is interesting to note Stanford’s stance that the ethical justification for such research necessarily should remain out of public view, just as it would if the sponsoring organizations performed the research in-house. In short, whether a company like Facebook has its own researchers conduct a study or whether it hires Stanford to do the study for it, in both cases the ethical justification for the study remains secret. The transparency of academia seems to stop just short of making ethics transparent.
Indiana University was not much more helpful. When I asked the PI of the nearly one million dollar NSF-funded “Truthy” project for a copy of his approved IRB proposal and how he addressed some of the ethical considerations of his social media analysis platform, he stated that his project had IRB approval, but refused to provide a copy of the approval or any detail from it, instead referring me to the University’s Vice President for Research, who in turn referred me to the Associate Vice President for University Communications, who ultimately never responded to my questions. A copy of what appears to be the NSF proposal for the project, posted to one of the faculty members’ websites, does not mention the word “ethics” at all and mentions privacy just once, and only in the context of privacy issues making progress more difficult: “difficulties related to privacy concerns in collecting data and the massive size of relevant data sets have hindered faster progress.” Despite uploading the proposal itself, the researchers do not appear to have uploaded their IRB proposal.
Given that the Truthy project was funded through taxpayer funds awarded by the National Science Foundation (NSF), I asked the NSF if it would provide a copy of the IRB approval for this project. A spokesperson responded that NSF would provide such documents only after a formal legal FOIA request and that it reserved the right to charge monetary fees for such access.
Thus, as a real world example: when one requests to review the ethical considerations of how privacy and personal information found online are handled by a federally-funded project at a public university, the faculty member overseeing the project refuses to provide access, the university does not respond to multiple requests and the federal funding agency that supported the project with taxpayer funds says the only way it would allow the materials to be reviewed is through a formal legal FOIA request, for which it reserves the right to charge a monetary fee.
In short, researchers appear to be extremely loath to share any information about the ethical considerations of their work, viewing ethics as more of an obstacle to overcome than a moral bar that must be met.
After multiple failed attempts to obtain comment from Harvard University for my first article on data ethics last June, the university was far more responsive the second time around. A spokesperson clarified that for certain kinds of research, in addition to IRB approval, a separate “Provostial Review” is also required. The university initially noted that for particularly sensitive research, such as the use of stolen data, “it is highly unlikely that we would approve such a use, especially if the data was obtained illegally or any publication could result in identification of individuals.” However, when asked to comment on a recent publication which had done exactly that, the spokesperson confirmed that the study had been approved by both the Harvard IRB and Provostial Review processes and clarified that “In use of data that is posted illegally (e.g. stolen) would be highly scrutinized and certainly the source of the data, how/where it was obtained from would be major factor as to whether or not it is approved. Ultimately the decision would depend on how the proposed research study justifies the need for the study, the method by which they obtain the data and clearly prove that the use of data/information will do no harm to individuals. We generally consider information to be in the public domain if generally accessible or available to the public (e.g. via media, public websites, newspapers, etc.).”
However, as with the other universities, when asked what the process was for a member of the public or concerned academic researcher to request a copy of the ethical justification of a Harvard study, the spokesperson responded that she could not recall such a request, but that “if such a request was received we certainly would respond accordingly.” She then caveated that IRB proposals “includ[e] details on the hypothesis the investigator is trying to prove, experimental procedures/methodology, processes used for protection of private information (e.g. how they are de-identifying the information prior to publication); and the data security measures. Disclosure of any of these information would be detrimental either to the researcher (e.g. their unpublished methodology will be public, potential loss of IP) or the subjects (e.g. publicly available de-identification or data security measure would increase the potential for unauthorized access or identification). This is one of the reasons why documents release under FOIA are heavily redacted.” She also noted that “IRB approval includes multiple communications between IRB and the researcher asking for clarifications or details” and that releasing such approval information would require considerable effort to mask the identities of the IRB reviewers.
Thus, similar to Stanford, Harvard cited the need to keep ethical reviews of its research secret in order to protect researchers from having their methods subjected to external visibility. While this makes sense from the standpoint that researchers would be afraid of others “scooping” them or stealing their ideas before they can publish on them, it is less clear why at least portions of the reviews cannot be released after the research has been published, especially those sections addressing what ethical and privacy concerns the IRB believed were raised by the work and the justifications the researchers themselves used to argue for conducting the work anyway. This tremendous hesitation to release information on research ethics creates a landscape where ethical decisions about what kinds of questions or actions are ethically acceptable to perform on unwitting members of society without their knowledge are made entirely in secret.
Even more troubling is that the published literature typically emphasizes successful research, while the nature of academic research means that many questions asked of a given dataset will likely fail to yield a publishable result. This means that on any given day, academic researchers are asking myriad questions of data that we will simply never hear about because they didn’t end up yielding something the researcher thought could get published or which no journals accepted for publication.
More to the point, research that is so ethically troubling that journals refuse to publish it on the grounds that it is unethical may never see print at all, meaning the broader research community and general public will never know it occurred.
Yet, perhaps the most common way in which "big data" research finds itself exempted from ethical review is the "public data" exemption in which many universities either do not require IRB review or utilize a fast track review that exempts the research at the first stage if it makes use of datasets which are generally available to the public. In such cases, IRBs appear to commonly exempt the rest of the research, including the methods and privacy implications, if the work exclusively relies on preexisting data which can be found online.
Most recently, I came across a press release for a large Mellon Foundation grant to the University of Waterloo to fund large-scale research on web archives; at least two of the co-PIs have a history of research using large web archive collections. In email correspondence, the researchers emphasized the amount of web archives data they had amassed for their research and the considerable interest they had received from organizations holding large amounts of web archives data in using the new platform they were developing.
When asked whether the software tools they were developing to help analyze web archives incorporated any kind of privacy or other ethical considerations, such as disallowing certain kinds of privacy-sensitive queries or using population characteristics of the dataset to warn about or disable queries that might generate privacy concerns (the same kinds of privacy techniques employed by companies in the commercial sector), the researchers replied that their tools were simply traditional analytic tools, focused on maximizing the kinds of questions that can be asked rather than stepping back to ask whether there are certain kinds of questions that should not be asked. (Indeed, this was also true of the NSF-funded Twitter analysis platform at Indiana University.)
In this regard it is striking to see the juxtaposition between academia’s focus on outcomes-at-all-costs and the focus within many companies I’ve spoken to on privacy and ethical safeguards that utilize heuristics and statistical models to block certain kinds of queries from being performed where they may pose particular ethical questions, even if the answers to those questions would have substantial economic benefits.
Despite asking the researchers multiple times if their Mellon-funded project or any of their other web archives research had undergone a formal institutional ethics review such as an IRB review, the PI repeatedly declined to answer and instead would state only that their work was in “full compliance with all ethical guidelines and policies of the University of Waterloo.” A university spokesperson subsequently clarified that “whether or not something requires ethics review is project dependent” and that “In general, I do know that information that is legally accessible to the public, appropriately protected by law, and where there is no reasonable expectation of privacy does not require ethics review.”
Given the university spokesperson’s clarification that web archives research likely does not actually require ethics review, coupled with the researcher’s steadfast refusal to confirm whether his work had ever undergone ethical review, it is unclear whether there has actually ever been independent ethical review. In particular, past work by the researchers has involved detailed analysis of a historical archive of the GeoCities website, including profiling specific individuals like community leaders. The work has also involved examining the visual content of the site. Given that GeoCities sites could and did include considerable personally identifiable information, including full names, photographs of individuals and many other details, it is unclear to what degree such research has been subject to external peer ethical review and the specific arguments used to justify it or any accommodations made to ensure that such work did not pose undue privacy concerns. The researchers declined to answer whether their GeoCities work had been reviewed by an IRB or other ethical body, pointing only to their original statement that their work comported with university policy.
The researchers' Mellon grant will also help expand a series of “datathons” called Archives Unleashed, in which interested researchers are brought together for a day to create tools and conduct analyses of large web archives. Their most recent event was hosted at The British Library last month and included two UK-related datasets: the “UK Government Web Archive – 2010 UK General Election Collection” and “UK Government Web Archive – Public Inquiries, Inquests, Royal Commissions, Reviews and Investigations.”
When I asked The British Library how it handled ethical review of research conducted with its web archives content, a spokesperson pointed me to its research policy and in particular their Code of Good Research Practice guidelines. Those guidelines provide that “The Library should assume primary responsibility for ensuring that ethical practice is maintained in the research projects and collaborations for which it is the Lead Research Organisation. Collaborative projects led by other Research Organisations may be governed primarily by that organisation’s research ethics policies and procedures. However, British Library staff must ensure that these processes cover all ethical aspects of the project." The policy also requires that "In many such cases, the project will be led by the partner organisation and managed primarily through that organisation’s research governance and research ethics processes. The Library does not wish to duplicate efforts. In some research collaborations, the relevant research ethics concerns may be addressed by the lead research organisation and/or other bodies or sponsors involved in the project. However, in such cases it is ESSENTIAL that the British Library staff involved in the project continue to consult this Code of Practice. In particular, for projects governed by the policies and procedures of external partners, British Library staff must ensure that these processes cover ALL ethical aspects of the project" (emphasis from original document). In addition, under the general checklist of projects subject to such ethical review are "Does the research involve the use or creation of data relating to directly identifiable human subjects?" and "Does the research require particular attention to be paid to any aspects of intellectual property or copyright?"
When asked if these ethical review guidelines applied to the two UK datasets made available in the Archives Unleashed datathon, which the Library had cosponsored, the spokesperson said the Library was unable to comment. When I asked the Waterloo researchers, who had also organized the event, whether attendees had been required to submit their proposed projects beforehand for ethical review prior to the event and in particular whether any projects involving the two UK datasets had been submitted to The British Library for review, the researchers initially responded that no British Library datasets were made available during the datathon. When pointed to additional information that appeared to contradict this and suggest that at least one of the UK datasets available during the datathon was under the purview of The British Library, the researchers declined to comment further and instead referred me back to The British Library and National Archives. They also did not comment on whether projects at the datathons using other datasets were subjected to any form of ethical review or limitations on what questions could be asked of them.
When asked again, The British Library said it was unable to comment, meaning it is entirely unclear at this point whether Library datasets were available at the datathon, whether projects involving those datasets would have been subject to Library ethical review and if so, how the datathon would have managed those requirements, or whether any ethical review of any kind was applied to the projects performed at the event or the others held as part of the Archives Unleashed series.
Given the Mellon Foundation’s role as a premier funder of such digital humanities and social sciences research, I asked it for comment on its policies regarding ethical review of the projects it funds, its own policies on ethics and privacy in funded research and whether it requires projects like Archives Unleashed to adopt and enforce ethical review for participants. Despite repeated requests, neither its Director of Communications nor her staff responded.
While consistent with the reactions of other funding agencies, Mellon’s failure to offer leadership on the ethics and privacy of the work it funds squanders the uniquely powerful voice it holds as a major funder of such data mining projects.
Indeed, across the institutions and researchers I’ve spoken with, just one single researcher, in a business college at a major university, offered his IRB proposal, and he did so in his very first response to my email. His paper had become heavily cited, and he was eager to discuss the considerable ethical deliberation that had gone into its design and shaped its final form; it was clear that he and his IRB had not only put inordinate thought into ethical and privacy considerations, but were proud of that focus. Strangely, his response stands alone. Every other researcher I’ve interacted with has been evasive, hostile or simply unresponsive to repeated requests for more information about the ethical considerations of their work, and not one has provided a copy of their IRB proposal. No university I’ve spoken with to date has an open policy regarding its ethical reviews, funding agencies seem to have little interest in the topic and major journals either leave it to researchers to police themselves or have explicit policies against requiring ethics reviews due to the historical standards of their field.
Putting this all together: the emphasis on open access and open data is a tremendous step forward in making the world’s scholarly output more accessible to its citizenry and fellow researchers and in enabling a transformative new era of reuse and replication. Yet this appetite for transparency and openness ends abruptly when it comes to discussing the ethics and privacy considerations of that research. Universities and individual researchers are left to decide for themselves what they believe is ethical, or whether ethics and privacy considerations should have any place in their work at all; universities simply don’t want to talk about ethics, or steadfastly maintain that ethical reviews must be conducted in secrecy and kept secret; and funding agencies and publishers look the other way or simply have no interest in the topic. The final outputs of research, publications and datasets, are the scholarly barometers of success, determining tenure, promotions, fame and prestige for their creators, while universities and funders reap the rewards of promoting the work they support. A page filled with publications and innovative new datasets for download is the sign of a highly successful research laboratory, while a page filled with approved IRB proposals and detailed justifications for how each publication or dataset protects privacy and comports with generally accepted ethical standards is far less likely to attract fame and fortune and far more likely to invite legal, ethical or other troubles and negative press as other researchers and the general public take issue with the institution’s views on ethics. In short, the academic ecosystem, from researchers to universities to publishers to funders, would rather look the other way and focus on what can be done rather than what should be done.
If, as a society, we come together and decide that ethics and privacy are outdated concepts with no place in modern big data research, then that is a democratic decision reached together, rather than a ruling made in secret by a handful of faculty who may have little understanding of the research they are reviewing or of the ethical and privacy impacts of the proposed analysis – a ruling the university, the researcher conducting the work and the funding agency then fight to keep secret.
In the end, the academic community must decide whether “openness” and “transparency” apply only to the final outputs of our scholarly institutions, with individual researchers, many from fields without histories of ethical prereview, exclusively empowered to decide what constitutes ethical and moral conduct and just how much privacy should be permitted in our digital society, or whether we should add “open ethics” to our focus on open access and open data and open universities up to public discourse on just what the future of “big data” research should look like.