A House with No Foundation
Forensic science needs to build a base of rigorous research to establish its reliability.
Many of the forensic techniques used in courtroom proceedings, such as hair analysis, fingerprinting, the polygraph, and ballistics, rest on a foundation of very weak science, and virtually no rigorous research to strengthen this foundation is being done. Instead, we have a growing body of unreliable research funded by law enforcement agencies with a strong interest in promoting the validity of these techniques. This forensic “science” differs significantly from what most of us consider science to be.
Science is a group activity whose individual outputs are the product of human hopes and expectations operating within a social system that has evolved to emphasize the testing of ideas and aspirations against an assumed substrate of objective external empirical fact. In the normal practice of science, it is hoped that professional acculturation reduces the threats those hopes and expectations pose to a functional minimum. To this degree, science is based on trust, albeit a trust that is defensible as reasonably warranted in most contexts. Nothing undermines the conditions supporting this normal trust like partisanship. This is not to say that partisanship does not exist in some form in most or all of the practice of science by humans, even if it is limited to overvaluing the importance of one’s own research agenda in the grand scheme of things.
The demands of the culture of science—ranging from the mental discipline and methodological requirements that constitute an important part of the scientific method to the various processes by which scientific work is reviewed, critiqued, and replicated (or not)—tend to keep motivation-driven threats to validity within acceptable bounds in individuals; and the broad group nature of science ensures that, through the bias cancellation that results from multiple evaluation, something like progress can emerge in the long run. However, in contexts where partisanship is elevated and work is insulated from the normal systems of the science culture for checking and canceling bias, the reasons to trust on which science depends are undermined. Nowhere is this more likely to be a serious problem than in a litigation-driven research setting, because virtually no human activity short of armed conflict or dogmatic religious controversy is more partisan than litigation. In litigation-driven situations, few participating experts can resist the urge to help their side win, even at the expense of the usual norms of scientific practice. Consider something as simple as communication between researchers who are on different sides of litigation. Although there is no formal legal reason for it, many such researchers cease communicating about their differences except through and in consultation with counsel. What could be more unnatural for normal researchers? And what purpose does such behavior serve other than to ensure that scientific differences are not resolved but exacerbated?
These concerns apply not only to research undertaken for use in a particular case, but also to research undertaken for use in unspecified cases to come, as long as the litigation interest of the sponsoring party is sufficiently clear. This is what differentiates litigation-driven research from much other interest-driven research. For instance, in research directed toward Food and Drug Administration (FDA) approval of drugs, drug companies are interested not only in positive findings but also in the discovery of dangers that might require costly compensation in the future or make their drug less competitive in the marketplace. In other words, built-in incentives exist to find correct answers. In addition, the research must be conducted according to protocols set by the FDA, and it will be reviewed by a community of regulators who are technically competent and, at least in theory, appropriately skeptical. By contrast, in much litigation-driven research there is a single unambiguous desired result, and the findings will be presented to a reviewing community (judges and juries) that typically is not scientifically literate. These circumstances more closely resemble the conditions surrounding industry-sponsored research on food supplements and tobacco, two areas of notoriously problematic claims.
Our attention is focused on an area that does not appear to figure prominently in most examinations of the problems of litigation-driven research: law enforcement-sponsored research relevant to the reliability of expert evidence in criminal cases, evidence that is virtually always proffered on behalf of the government’s case. Of primary concern is research directly focused on the error rates of various currently accepted forensic identification processes, which have never been subjected to formal validity testing.
Illusion of infallibility
Many forces combine to raise special concerns in such areas. From the perspective of prosecution and law enforcement, any such research can result only in a net loss, because in these areas a public perception of near-infallibility has been carefully fostered. Research revealing almost any error rate under common real-world conditions undermines that aura. In addition, data that reveal deficiencies in individual practitioners threaten those practitioners’ continued usefulness as effective witnesses. The combined effects of these two kinds of findings can potentially result in increased numbers of acquittals in cases where other evidence of a defendant’s guilt is weak. Valid or not, however, such testimony is extremely useful to a prosecutor who is personally convinced of the guilt of the defendant (which, given the partisan nature of the litigation process, is virtually every prosecutor) and is willing to use whatever the law allows in an effort to convince the jury of the same thing. Consequently, research results that call into question the validity of such expertise, or that define its error rates, are threatening: they undermine a powerful tool for obtaining convictions and also threaten the status and livelihood of the law enforcement team members who practice the putative expertise.
It is not surprising, therefore, to discover that until recently such research was rare, especially in regard to forensic science claims that predated the application of the Frye test (requiring that the bases of novel scientific evidence be generally accepted in some relevant scientific community before it can be admitted into evidence). Such evidence had never been considered “novel” and therefore had never been confronted with any validity inquiry in any court. Even in regard to expert evidence that had been reviewed as novel, the review often consisted of little more than making sure that there was at least some loosely defined “scientific” community that would vouch for the accuracy of the claimed process.
The winds of change began to blow with the Supreme Court’s Daubert decision, of course, although it was several years before the first significant Daubert challenge to prosecution-proffered expertise was heard, and there is still reason to believe that substantial resistance exists among the judiciary to applying Daubert and its important descendant Kumho Tire to prosecution-proffered expertise as rigorously as they have been applied to the expert proffers of civil plaintiffs. Nevertheless, there have been some successful challenges, most notably in regard to handwriting identification expertise; and the potential for challenges in other areas has made law enforcement, particularly the Federal Bureau of Investigation (FBI), seek research that could be used to resist such challenges.
After a century of oversold and under-researched claims, there is suddenly interest in doing research. However, certain aspects of that research give reason to believe that it must be received with caution. Various strategies appear to have been adopted to ensure that positive results will be exaggerated and negative results will be glossed over, if not withheld. These include: (1) placing some propositions beyond the reach of empirical research; (2) using research designs that cannot generate clear data on individual practitioner competence; (3) manufacturing favorable test results; (4) refusing to share data with researchers wishing to reanalyze or further analyze them; (5) encouraging overstated interpretations of data in published research reports; (6) making access to case data in FBI files contingent on accepting a member of the FBI as a coauthor; and (7) burying unfavorable results in reports where they are least likely to be noticed, coupled with an unexplained disclaimer that the data cannot be used to infer the false positive error rate that they plainly reveal.
The clearest example of the first strategy is the claim of fingerprint examiners that their technique has a “methodological error rate” of zero and that any errors that occur are therefore lapses on the part of individual examiners. Because the technique can never be performed except through the subjective judgment of human fingerprint examiners, it is impossible to test the claimed division of responsibility for error empirically. The claim is thereby rendered unfalsifiable.
To see the second strategy at work, one need only examine the FBI-sponsored studies of the performance of handwriting identification examiners. These studies, led by engineer Moshe Kam, were supposed to compare the performance of ordinary persons with that of document examiners by giving both groups the task of comparing samples of handwriting. Instead of designing a test that would do this directly by, for example, giving all test takers a common set of problems of varied difficulty, Kam et al. adopted a roundabout design that randomly generated sorting tasks out of a large stockpile of known handwriting. Consequently, each individual test taken by each individual participant, expert or nonexpert, differed from every other test. In some, hard tasks may have predominated; in others, trivial ones. This meant that, given a large enough number of such tests administered to both the expert and the lay group, one might infer that the aggregate difficulty of the tests taken by each group was likely to be similar, but evaluation of the performance of any individual or subset of individuals was undermined. This unusual design is inferior to more straightforward ones for most research purposes, but it is superior in one respect: it makes it impossible to identify individual scores and thus to expose unreliable examiners. Any such people therefore remained useful prosecution expert witnesses. This contrasts with research led by Australians Bryan Found and Doug Rogers, which posed similar research questions but was designed in a way that allowed them to discover that a considerable range of skill exists among professional examiners, some of whom were consistently less accurate than others.
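To make the design problem concrete, here is a toy simulation (a minimal sketch with invented accuracy and difficulty parameters, not a reconstruction of Kam et al.’s actual materials or results). When each test taker receives a randomly generated test whose difficulty varies, group averages remain roughly comparable, but any individual’s score mixes skill with the luck of the draw.

```python
import random
import statistics

random.seed(0)

# Toy model: every test is assembled at random from a stockpile of easy and
# hard items, so each test taker faces a different mix of difficulties.
# All parameters below are illustrative assumptions.
N_TAKERS = 200        # test takers per group
ITEMS_PER_TEST = 20
P_EASY = 0.95         # assumed accuracy on easy items (both groups)
P_HARD_EXPERT = 0.80  # assumed expert accuracy on hard items
P_HARD_LAY = 0.60     # assumed lay accuracy on hard items

def random_test():
    """A randomly generated test; the share of hard items differs per taker."""
    hard_fraction = random.uniform(0.0, 1.0)
    return [random.random() < hard_fraction for _ in range(ITEMS_PER_TEST)]

def take_test(p_hard):
    """Score one taker on one randomly generated test."""
    items = random_test()
    correct = sum(random.random() < (p_hard if hard else P_EASY) for hard in items)
    return correct / ITEMS_PER_TEST

expert_scores = [take_test(P_HARD_EXPERT) for _ in range(N_TAKERS)]
lay_scores = [take_test(P_HARD_LAY) for _ in range(N_TAKERS)]

# Group-level comparison is still meaningful: over many random tests the
# aggregate difficulty faced by each group is similar.
print("mean expert score:", round(statistics.mean(expert_scores), 3))
print("mean lay score:   ", round(statistics.mean(lay_scores), 3))

# But an individual's score confounds skill with the difficulty of the test
# that individual happened to draw.
print("expert score range:", round(min(expert_scores), 2), "to", round(max(expert_scores), 2))
print("lay score range:   ", round(min(lay_scores), 2), "to", round(max(lay_scores), 2))
```

Because every simulated examiner within a group has identical skill, the wide spread of individual scores reflects nothing but test-to-test variation in difficulty and chance, which is precisely why such a design cannot expose weak examiners.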
The third strategy reflects the notion that, left to their own devices, those who have a point to make with data, rather than a genuine question to ask of them, are tempted to design studies that produce seemingly favorable results that are in fact often meaningless and misleading. In one study of fingerprint identification, conducted during the pretrial phase of the first case in generations to raise serious questions about the fundamental claims of fingerprint identification experts, the FBI sent sample prints to crime laboratories around the country. The hope was to show that all labs reached the same conclusion on the same set of prints. When half a dozen labs did not reach the “correct” decisions, those labs, and only those labs, were sent annotated blow-ups of the prints and were asked to reconsider their original opinions. Those labs got the message and changed their minds. This supposedly proved that fingerprint examiners are unanimous in their judgments. A second study was designed to prove the assumption that fingerprints are unique. This study compared 50,000 fingerprint images to each other and then calculated the probability that two prints selected at random would appear indistinguishably alike. In a comment on the study written for a statistical journal, David H. Kaye explains the errors in the study’s design and analysis, which led to a considerable overstatement of the conclusions that its data can support. Kaye attributes the problems in the research to its being “an unpublished statistical study prepared specifically for litigation.” He concludes by suggesting that the study provides “a lesson about probabilities generated for use in litigation: If such a probability seems too good to be true, it probably is.”
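As a back-of-the-envelope illustration of the limits of such a design (our own simplified bound under idealized assumptions, not the study’s actual analysis or Kaye’s critique), even a hypothetical finding that no two of the 50,000 images were indistinguishable could, at best, support only a modest bound on the pairwise match probability, far short of establishing uniqueness.

```python
from math import comb

# Illustrative arithmetic only: how many distinct pairs can 50,000 images yield,
# and what does observing zero indistinguishable pairs allow one to conclude?
n_images = 50_000
n_pairs = comb(n_images, 2)  # distinct pairs compared
print(f"pairs compared: {n_pairs:,}")  # 1,249,975,000

# If the pairs were independent trials (they are not: every image appears in
# 49,999 of them, so even this bound is generous), zero observed matches would
# be consistent at roughly the 95% level with a true pairwise match probability
# as high as about 3/N (the "rule of three" heuristic).
upper_bound = 3 / n_pairs
print(f"approx. 95% upper bound on pairwise match probability: {upper_bound:.1e}")
```

The point is not the particular number but the order of magnitude: a finite set of comparisons can, at most, bound a probability in proportion to the number of comparisons made; it cannot deliver the astronomically small figures, or the categorical uniqueness claims, sometimes put before courts.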
The Kam handwriting studies also reflect the reluctance of many forensics researchers to share data, an obvious departure from standard scientific practice. Kam et al. have generated four data sets under government grants: three from the FBI and one from the Department of the Army. Repeated requests for the raw data from those studies for purposes of further analysis have been denied, even though the youngest of the data sets is now more than three years old and hence well beyond the usual two-year presumptive period of exclusive use. Moreover, even this time-bound model of exclusive use has drawn serious criticism when the data are relevant to public policy, especially when they were generated through government grants. Had the research been sponsored by almost any other federal agency, data sharing would have been required.
As for encouraging overstatement to produce sound bites useful for litigation, Kam’s first (nonpilot) study offers an example: It claims that it, by itself, “laid to rest . . . the debate over whether professional document examiners possess a skill that is absent in the general population.” (It didn’t.) Or consider Sargur Srihari and colleagues’ claim that their computer examination of a database of about 1,500 handwriting exemplars established the validity of document examiners’ claim that each and every person’s handwriting is unique. If one tracks Srihari et al.’s reports of the research from early drafts to final publication, the claims for uniqueness grow stronger, not more tempered, which is not the typical progression of drafts in scientific publishing. Finally, Srihari’s claims became the subject of a substantial publicity campaign on behalf of this first study to “prove” the uniqueness claim on which handwriting identification expertise had stood for a century. All this despite the simple fact that, in the study itself, not all writings were found to be distinguishable from one another.
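The gap between what such a database can examine and what the uniqueness claim asserts is easy to quantify. The sketch below uses an assumed writing population of 300 million purely for scale (that figure is ours, not the study’s); even exhaustive pairwise distinguishability within roughly 1,500 exemplars covers a vanishingly small fraction of the pairs the claim is about.

```python
from math import comb

# Illustrative arithmetic only: pairs examinable in a ~1,500-exemplar database
# versus pairs implied by a claim that every person's handwriting is unique.
exemplars = 1_500
assumed_writing_population = 300_000_000  # hypothetical figure, chosen for scale

pairs_in_study = comb(exemplars, 2)
pairs_implied_by_claim = comb(assumed_writing_population, 2)

print(f"pairs examinable in the study:      {pairs_in_study:,}")        # 1,124,250
print(f"pairs implied by the claim:         {pairs_implied_by_claim:.1e}")
print(f"fraction of such pairs examined:    {pairs_in_study / pairs_implied_by_claim:.1e}")
```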
The FBI has apparently had a policy requiring coauthorship with an FBI employee as a condition of access to data derived from its files (at least for researchers not considered committed friends of the FBI). This policy has been in place at least since the early 1990s, when William Thompson of the University of California, Irvine, and a coauthor were denied access to DNA case data unless they accepted such a condition. The practice undermines the normal process of multiple studies driven by multiple research interests and perspectives.
An example of the likely effects of such a “friends-only” regime may be seen in a recent study by Max Houck and Bruce Budowle in the Journal of Forensic Sciences. (Houck is a former examiner for the FBI laboratory who recently joined the faculty of West Virginia University; Budowle is still with the FBI.) The study analyzed 170 hair comparisons done at the FBI laboratory between 1996 and 2000. In each case, a questioned hair sample from a real case had been compared microscopically to a hair sample from a known human source to determine whether they were sufficiently similar that they might have come from the same person. Subsequently, the same samples were subjected to mitochondrial DNA (mtDNA) comparison. The authors stated that the purpose of the study was to “use mtDNA results to assess the performance of microscopic analysis.” Perhaps the most central question in such a study is how often a questioned hair actually comes from the known source when the human examiner declares that the two are “associated”; that is, consistent in their characteristics. Of the 80 hairs in the set that had been declared associated, nine (11 percent) were found by mtDNA analysis to be nonmatches. However, this result was buried in a single paragraph in the middle of the paper, followed by this statement: “These nine mtDNA exclusions should not be construed as a false positive rate for the microscopic method or a false exclusion rate for mtDNA typing: it (sic) displays the limits of the comparison of the hairs examined in this sample only and not for any hairs examined by any particular examiner in any one case.” In making this statement, the authors treat the results of subjective human evaluation and the results of mtDNA analysis as epistemically equivalent on the question of common origin. In other words, on this view all techniques are equal, and no study should have any bearing on our evaluation of future cases in court.
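For the record, the arithmetic behind the figures reported above is simply this (numbers as described in the text, not recomputed from the underlying case files):

```python
# Figures as reported in the study described above.
total_comparisons = 170      # hair comparisons examined
declared_associated = 80     # hairs declared "associated" by microscopic examination
mtdna_exclusions = 9         # of those, excluded by mtDNA analysis

exclusion_rate = mtdna_exclusions / declared_associated
print(f"declared associations:          {declared_associated} of {total_comparisons}")
print(f"mtDNA exclusions among them:    {mtdna_exclusions}")
print(f"share of associations excluded: {exclusion_rate:.1%}")  # about 11 percent
```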
What next?
Thus, we have seen favorable findings declared to “end all debate” on a question, whereas unfavorable findings are declared to have “no implications” beyond the pages of the study reporting them. Both of these extremes are seen infrequently in contexts other than research done with one eye on upcoming litigation.
We make no claim that the above examples are the result of any systematic review of the literature. They are merely instances we encountered as we labored down in our little corner of the forensic science mine, where we have for years examined reliability issues in regard to various forensic identification claims. However, enough canaries have died in our corner of the mine to suggest that such law enforcement-sponsored research should be approached with caution.
What does that suggest for the future? First, the circumstances in the criminal justice system that tend to distort such research deserve attention as part of any larger inquiry into the problems of litigation-driven research. Second, any efforts that bring more independent researchers, working under more independent conditions, into the forensic knowledge-testing process should be encouraged. As for the judicial consumers of such research, it is unlikely that, in an adversarial system, anything official can or will be done about the phenomenon, especially when the research enters the legal process during pretrial hearings, where the usual rules of evidence are themselves inapplicable. Thus, until fundamental changes occur in the environment that produces litigation-directed forensic science research, courts would be well advised to regard its findings with a large grain of salt.