Evaluating Facial Recognition Technology
A Protocol for Performance Assessment in New Domains
Introduction
Facial recognition technology (FRT), namely the set of computer vision techniques to identify individuals from images, has proliferated throughout society. Individuals use FRT to unlock smartphones, computer appliances, and cars. Retailers use FRT to monitor stores for shoplifters and perform more targeted advertising. Banks use FRT as an identification mechanism at ATMs. Airports and airlines use FRT to identify travelers.
FRT has been used in a range of contexts, including high-stakes situations where the output of the software can lead to substantial effects on a person’s life: being detained overnight at an airport or being falsely accused of a crime, as was the case for Robert Williams and Michael Oliver. A 2016 study reports that one out of two Americans is involved in a “perpetual line-up” (i.e., an ongoing virtual police lineup), since local and federal law enforcement regularly perform facial recognition-based searches on their databases to aid in ongoing investigations. Beyond the effects of current use of FRT, widening the deployment of FRT to continuous surveillance of the public has the potential to change our use of public spaces, our expectations of privacy, our sense of dignity, and the right to assemble.
The widespread use of FRT in high-stakes contexts has led to a loud call to regulate the technology, not only from civil society organizations, but also from the creators and vendors of FRT themselves. IBM, for instance, has discontinued its sale of “general purpose facial recognition software,” stating that “now is the time to begin a national dialogue on whether and how facial recognition technology should be employed by domestic law enforcement agencies,” and offering to work with Congress to this end. Amazon initiated a one-year moratorium on police use of its facial recognition technology, calling for “governments [to] put in place stronger regulations to govern the ethical use of facial recognition technology.” Microsoft, too, announced that it will not sell FRT software to police departments “until we have a national law in place, grounded in human rights.”
Numerous pieces of state and federal legislation in the U.S. echo this call. Many propose a moratorium on government use of FRT until comprehensive guidelines can be set. One U.S. Senate bill proposes to bar federal agencies and federally funded programs from using FRT. The state of Massachusetts has proposed restricting state usage of FRT, and the City of San Francisco enacted legislation to prohibit municipal departments from using FRT.
All of us support these calls for rigorous reflection about the use of FRT. One common thread throughout nearly all proposed and passed pieces of legislation is a need to understand the accuracy of facial recognition systems within the exact context of their intended use. The federal Facial Recognition Technology Warrant Act, for example, calls for “independent tests of the performance of the system in typical operational conditions” in order to receive a warrant to use facial recognition for a given task within the government; the Ethical Use of Facial Recognition Act calls for a moratorium on government use of FRT until regulatory guidelines can be established to prevent “inaccurate results”; the State of Washington requires FRT vendors to enable “legitimate, independent and reasonable tests” for “accuracy and unfair performance differences across distinct subpopulations”; and the state of Massachusetts proposes “standards for minimum accuracy rates” as a condition for FRT use in the state. The push for accuracy testing is not unique to the United States. The European Union Agency for Fundamental Rights has similarly emphasized the need to make accuracy assessments for different population groups, and the European Commission emphasizes the need to demonstrate the robustness and accuracy of AI systems.
Understanding true in-domain accuracy (that is, the accuracy of an FRT deployment in a specific context) is crucial for all stakeholders to have a grounded understanding of the capabilities of the technology. FRT vendors require objective, standardized accuracy tests to meaningfully compete based on technological improvements. FRT users require in-domain accuracy to acquire FRT platforms that are of highest value in the posited application. Civil society groups, academics, and the public would benefit from a common understanding of the capabilities and limitations of the technology in order to properly assess risks and benefits. Therefore, we undertook a concerted effort to examine this specific question of the technology, in hopes of better understanding the operational dynamics in the field.
Although it may seem simple at first glance, understanding the performance of facial recognition for a given real-world task (e.g., identifying individuals from stills of closed-circuit television video capture) is not in fact an easy undertaking. Many FRT vendors advertise stunning performance of their software. To be sure, we have witnessed dramatic advances in computer vision over the past decade, but these claims of accuracy are not necessarily indicative of how the technology will work in the field. The context in which accuracy is measured is often vastly different from the context in which FRT is applied. For instance, FRT vendors may train their models with well-lit, clear images and with proper software usage by machine learning professionals, but during deployment, clients such as law enforcement may use FRT based on live video from police body cameras, later evaluated by officers with no technical training. The accuracy of FRT in one domain does not translate to its uses in other domains, and changing context can significantly impact performance, as is common knowledge in the computer science literature.
One central concern of such cross-domain performance, which has given rise to profound criticisms of FRT, is that models may exhibit sharply different performance across demographic groups. Models trained disproportionately on light-skinned individuals, for instance, may perform poorly on dark-skinned individuals. A leading report found that false positive rates varied by factors of 10 to 100 across demographic groups, with such errors being “highest in West and East African and East Asian people, and lowest in Eastern European individuals.” In this White Paper, we characterize this gulf between the contexts in which facial recognition technology is created and deployed as stemming from two sources: domain shifts caused by data differences across domains and institutional shifts in how humans incorporate FRT output in decisions. We outline concrete, actionable methods to assess deployment-domain accuracy of FRT.
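As an illustration only, the kind of demographic disparity described above can be made concrete with a minimal sketch. It assumes a hypothetical evaluation log in which each comparison is recorded as a (group, is_same_person, predicted_match) triple; the function names, group labels "A" and "B", and counts are invented for this example and are not drawn from any report or vendor system. The sketch computes a false positive rate per group over impostor comparisons (pairs of different people) and the ratio between the highest and lowest group rates.

```python
from collections import defaultdict


def false_positive_rates(records):
    """Return the false positive rate per group.

    `records` is an iterable of (group, is_same_person, predicted_match)
    triples. This record layout is a hypothetical assumption for the sketch.
    Only impostor comparisons (is_same_person is False) count toward the rate.
    """
    impostor_trials = defaultdict(int)
    false_positives = defaultdict(int)
    for group, is_same_person, predicted_match in records:
        if is_same_person:
            continue  # genuine pairs cannot produce a false positive
        impostor_trials[group] += 1
        if predicted_match:
            false_positives[group] += 1
    return {g: false_positives[g] / n for g, n in impostor_trials.items()}


def disparity_factor(rates):
    """Ratio of the highest to the lowest group false positive rate."""
    highest = max(rates.values())
    lowest = min(rates.values())
    if highest == 0:
        return 1.0  # no false positives in any group
    if lowest == 0:
        return float("inf")  # at least one group has no false positives
    return highest / lowest


# Invented example data: two groups with 1,000 impostor comparisons each.
records = (
    [("A", False, False)] * 999 + [("A", False, True)] * 1
    + [("B", False, False)] * 980 + [("B", False, True)] * 20
    + [("A", True, True)] * 10  # genuine pairs, ignored by the rate
)

rates = false_positive_rates(records)
print(rates)
print(disparity_factor(rates))
```

A disparity factor computed this way is only as meaningful as the evaluation data behind it: the same system can yield very different per-group rates when the comparisons are drawn from the deployment domain rather than from the development domain.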
In our view, the ability to evaluate the accuracy of FRT is critical to the normative debates surrounding FRT. First, if a system simply does not perform as billed, and if accuracy differs dramatically across demographic groups, poor performance may disqualify an FRT system from use and obviate the need for other normative considerations. Second, performance interacts directly with normative questions. For example, lower accuracy heightens concerns about the cost of misidentification. Higher accuracy, on the other hand, amplifies concerns over surveillance, privacy, and freedom of expression. The central role of accuracy in these debates likely explains why so much proposed legislation has called for rigorous assessments of performance and is why we have tailored this White Paper to the subject.
Of course, many other considerations factor into the adoption of FRT. Concerns over privacy, consent, transparency, and biased usage all significantly complicate the use of FRT systems, independent of accuracy. While such concerns are critical to a meaningful discussion about FRT, they fall outside the direct scope of this White Paper. The scope here remains intentionally narrow, as consensus on how to assess the operational limits of the technology can be crafted more readily than consensus on wide-ranging normative commitments about the technology. For a broader normative assessment, each individual use case must necessarily be judged by the potential harms and benefits along all of these dimensions, and we point readers to broader discussions in the references cited throughout this White Paper.
A WHITE PAPER FOR STANFORD’S INSTITUTE FOR HUMAN-CENTERED ARTIFICIAL INTELLIGENCE
Daniel E. Ho
Emily Black
Maneesh Agrawala
Li Fei-Fei