Examining the Fisher’s Exact Test and Its Place in EEO Litigation

Introduction

Even before the 1978 publication of the Uniform Guidelines on Employee Selection Procedures, adverse impact analyses (alternatively known as disparate impact analyses) have been conducted by employers to evaluate passing rate differences between subgroups on various practices, procedures, and tests. Methods for conducting such analyses have typically included impact ratio tests that comparatively evaluate the success rates between two groups (e.g., the 80% Rule), statistical significance tests, and practical significance tests (Bobko & Roth, 2004). While these methods have remained consistent, the actual tools (i.e., statistical procedures) have evolved, with some exceptions.

While the medical and statistical fields have recently gravitated toward more powerful statistical techniques for analyzing 2 X 2 tables, and have come to recognize serious limitations of the conventional Fisher Exact Test ("FET" hereafter), the HR and personnel psychology fields have been slower to adapt. Specifically, the FET has been contested in the statistical literature since 1945 (Mehrotra et al., 2003), and most practitioners in the statistical field now reserve its use for situations where its strict conditional assumptions can be met and its conservative nature is taken into consideration when evaluating its results (Upton, 1992; Lydersen, Fagerland, & Laake, 2009).

To understand the limitations of the FET, we must first have an understanding of the different models for 2 X 2 contingency tables. Because statistical significance tests involve a comparison of the observed result to what might have occurred due to chance, each test requires those chance results to be operationally defined. In the context of 2 X 2 tables, three distinct models have been developed based on differing operational definitions. The choice among these models has been a matter of debate among statisticians for decades, and central to the debate are the conditional assumptions, which pertain to whether the marginal totals of the table are assumed to be fixed a priori or whether they can be assumed to be drawn from a larger population (Camilli, 1990). Collins and Morris (2008) describe the three models in which 2 X 2 tables can be evaluated, which are summarized briefly below.

Model 1: Independence Trial. All marginal totals are assumed to be fixed in advance (i.e., proportion of each group and selection totals are fixed). The data are not viewed as a random sample from a larger population.

Model 2: Comparative Trial. Either the row or column totals are fixed in advance. For example, the applicants are viewed as random samples from two distinct populations (e.g., men and women). The proportion from each population is fixed (i.e., the marginal proportion on one variable is assumed to be constant across replications). The second marginal proportion (e.g., the marginal proportion of applicants who pass the selection test) is estimated from the sample data.

Model 3: Double Dichotomy. In this model, neither the row nor the column marginal totals are assumed to be fixed. Applicants are viewed as a random sample from a population that is characterized by two dichotomous characteristics. No purposive sampling or assignment to groups is used, and the proportion in each group, as well as the success rate can vary across samples.

These three models can be summarized as having "fixed," "mixed," and "free" marginal assumptions. As will be discussed in greater detail later, the current statistical and medical research literature holds that the various 2 X 2 tests available fit these three models with more or less precision.
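To make the three sampling schemes concrete, the following Python sketch draws one table under each model when group membership and selection are unrelated. It is illustrative only: the function name `simulate_table`, the cell order, and all sample sizes and rates are hypothetical choices, not values taken from the studies cited above.

```python
import random


def simulate_table(model, rng, n_total=50, n_focal=20, n_selected=15,
                   p_focal=0.4, p_pass=0.3):
    """Draw one 2 X 2 table under the null hypothesis of no group difference.

    Returns (focal pass, focal fail, reference pass, reference fail).
    All default sizes and rates are hypothetical.
    """
    n_ref = n_total - n_focal
    if model == "fixed":
        # Model 1: both margins are fixed; only the assignment of the
        # n_selected slots among a fixed pool of applicants is random.
        pool = ["focal"] * n_focal + ["ref"] * n_ref
        selected = rng.sample(pool, n_selected)
        a = selected.count("focal")
        c = n_selected - a
        return a, n_focal - a, c, n_ref - c
    if model == "mixed":
        # Model 2: group sizes are fixed; the number passing in each group
        # is a binomial draw, so the pass margin varies across samples.
        a = sum(rng.random() < p_pass for _ in range(n_focal))
        c = sum(rng.random() < p_pass for _ in range(n_ref))
        return a, n_focal - a, c, n_ref - c
    if model == "free":
        # Model 3: only the total is fixed; each applicant's group and
        # outcome are both random, so neither margin is fixed.
        cells = [0, 0, 0, 0]
        for _ in range(n_total):
            is_ref = rng.random() >= p_focal
            failed = rng.random() >= p_pass
            cells[2 * is_ref + failed] += 1
        return tuple(cells)
    raise ValueError("model must be 'fixed', 'mixed', or 'free'")


rng = random.Random(2010)  # arbitrary seed so the draws are repeatable
for name in ("fixed", "mixed", "free"):
    print(name, simulate_table(name, rng))
```

Repeating the draws shows which totals stay constant: both margins under the fixed model, only the group sizes under the mixed model, and only the overall sample size under the free model.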

Limitations of the Fisher’s Exact Test

Shortly after Ronald Fisher framed his exact test (Fisher, 1935), some statisticians began challenging its use across different 2 X 2 scenarios (e.g., Barnard, 1945) as well as its conservative nature (see Yates, 1984). While these early challenges were theoretical in nature, more recent criticisms have been based on the results of modern data simulation analyses that provide a more in-depth look at the statistical behavior of various 2 X 2 tests (Sekhon, 2005; Collins & Morris, 2008; Crans & Shuster, 2008; Lin & Yang, 2009; Lydersen et al., 2009). These recent studies have revealed two major limitations of the FET: its strict conditional assumptions are rarely met in actual practice, and the test is conservative.

The first limitation deals with the conditional assumptions required for correctly applying the FET. The statistical field has reached a consensus that the FET can only be accurately applied in the first model, the Independence Trial Model. Because this model does not represent typical personnel selection data, "there is reason to question the appropriateness of the FET for adverse impact analysis" (Collins & Morris, 2008). The appropriateness of treating the margins as fixed has been at the heart of much of the debate that has surrounded the FET for over 50 years.

Some statisticians contend that the Independence Trial Model requires that "both of the margins in a 2 X 2 table are fixed by construction – i.e., both the treatment and outcome margins are fixed a priori" (Sekhon, 2005; see also Romualdi et al., 2001; Hirji et al., 1991; D’Agostino et al., 1988; and Ludbrook, 2008). In other words, for the conditional assumptions of the Independence Trial Model to be met, the investigator needs to identify the marginal totals of both the rows and columns prior to conducting the experiment that will produce the numbers within each. It is common in experimental research to specify in advance the relative numbers in each treatment condition; however, it would be unusual to specify the frequency of both the predictor and outcome before collecting any data (Gimpel, 2007). While recommended by some, this condition seems to be only rarely met in practice.

Collins and Morris (2008) argued that the data available for adverse impact analysis is rarely consistent with the fixed marginal assumptions. For example, in an analysis of applicants vs. hires, the number of applicants in minority and majority groups is unlikely to be consistent across samples. And while it may be tempting to view promotion or layoff decisions as involving a fixed pool of candidates and a fixed number of persons selected, once the set of individuals is fixed, it becomes unclear what comprises the sample space on which probabilities are defined. Similarly, the set of candidates considered for a promotion decision will have been previously selected using some screening procedure that may have considered some of the same factors that are used to make the promotion decision. Therefore, the prior selection process, which determined the number of minority applicants, will not be independent of the success ratio of the promotion decision, the parameter of interest.

An additional challenge with meeting the conditional assumptions of promotional settings is that employers may first try to fill promotional opportunities with in-house employees from a variety of lower positions (which will have different potential weight and availability percentages for each group), and then turn to outside resources if the slot cannot be filled internally. Situations like these blur the "fine line" between "fixed," "mixed," and "free" marginal assumptions. When applying the three models to typical adverse impact analyses, it becomes clear that the conditional assumptions of the FET will only rarely be met.

The debate over the use of conditional versus unconditional tests has been going on for decades, and is not likely to be resolved any time in the near future. Our goal here is more modest – to evaluate the use of alternative significance tests as a decision-making aid in evaluating adverse impact. In this context, the primary concern is the error rates for the decision rule. Specifically, we are concerned with the likelihood of false positives (Type I errors) and false negatives (Type II errors). This leads to the second, and more important, criticism of the FET – that the test is overly conservative.

The statistical field at large holds that the FET is too conservative (see Reference Authorities Regarding the Limitations of the Fisher Exact Test for a partial listing of citations that hold this position). In this context, conservative refers to the fact that the desired significance level, for example 0.05, cannot be attained exactly due to the discrete distribution of the data, and lesser values must be used. Discreteness occurs because, for small sample sizes, the number of possible outcomes considered by the FET is small (Agresti, 2007). As a result, the p-value can take on only a limited number of possible values, and often none of the possible outcomes will have p-values close to but less than the nominal significance level. Therefore, the obtained probability of a Type I error will be less than the nominal alpha level, often considerably lower.
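The effect of discreteness can be checked numerically. The minimal sketch below enumerates every table that is possible for a given set of margins and adds up the null probability of those the FET would reject at .05. It rests on several assumptions that are not drawn from the sources cited here: fixed margins (Model 1), a two-sided FET that sums the probabilities of all tables no more likely than the observed one, and hypothetical margins of 10 focal-group applicants, 15 reference-group applicants, and 8 selections. The function names are likewise illustrative.

```python
from math import comb


def hypergeom_pmf(a, n1, n2, k):
    """Probability that a of the k selections fall in the focal group
    (size n1) when the reference group has size n2 and margins are fixed."""
    return comb(n1, a) * comb(n2, k - a) / comb(n1 + n2, k)


def fet_two_sided_p(a_obs, n1, n2, k):
    """Two-sided FET p-value: total probability of all tables that are
    no more likely than the observed table."""
    lo, hi = max(0, k - n2), min(n1, k)
    p_obs = hypergeom_pmf(a_obs, n1, n2, k)
    probs = (hypergeom_pmf(a, n1, n2, k) for a in range(lo, hi + 1))
    # Small relative tolerance so tables tied with the observed one count.
    return sum(p for p in probs if p <= p_obs * (1 + 1e-7))


def actual_size(n1, n2, k, alpha=0.05):
    """Actual Type I error rate: null probability of landing on any table
    whose FET p-value is at or below the nominal alpha."""
    lo, hi = max(0, k - n2), min(n1, k)
    return sum(hypergeom_pmf(a, n1, n2, k)
               for a in range(lo, hi + 1)
               if fet_two_sided_p(a, n1, n2, k) <= alpha)


# Hypothetical margins chosen only for illustration.
size = actual_size(n1=10, n2=15, k=8)
print(f"nominal alpha = 0.05, actual size = {size:.4f}")
```

Because the rejection region can only contain whole tables, the printed size can never exceed the nominal level, and with small margins it can fall well short of it.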

It is important to note that the problem is not with the p-values, which are accurate given the conditional assumptions, but rather with the use of a decision rule where the p-value is compared to α = .05. Upton (1992) argued that the conservativeness of the FET is due to the common practice of fixing the nominal significance level at 0.05. For example, if one were to instead set α = .055, the results with 2 women hired would also be significant and the Type I error rate (.054) would be quite close to the nominal level. Thus, the problem of conservatism can be avoided by directly interpreting p-values, rather than reporting results as significant or non-significant based on a fixed alpha level. However, in Title VII situations, fixed significance levels are the required standard, so the detrimental consequences of discreteness remain.

This limitation results in the FET having "less power than conditional mid-P tests and unconditional tests" while these other tests "generally have higher power yet still preserve test size" (Lydersen et al., 2009). For this limitation alone, several statisticians have recommended that the "traditional FET should practically never be used" (Lydersen et al., 2009) because of the "actual significance level (or size) being much less than the nominal level" (Lin & Yang, 2009). Agresti (2007) recommends using the mid-P adjustment even in situations where the fixed marginal assumptions can be met "because the actual error rate [of the FET] is smaller than the intended one" (p. 48).

Admissibility of the FET in Title VII Litigation

In the U.S. Supreme Court case, Daubert v. Merrell Dow Pharmaceuticals (1993), seven members of the Court agreed that expert evidence offered in federal litigation needs to make use of "scientific methodology" to prove or disprove the hypothesis. One requirement the Court established with this standard is that the investigative tools need to have a known or potential error rate and need to be "reliably applied to the facts at hand." For decades now, the courts have treated the .05 threshold as the fixed standard for identifying and deliberating adverse impact. Choosing a test that can accurately meet this .05 standard, rather than one that claims the standard yet in practice applies a stricter one (as the FET does), is key in choosing an effective legal strategy.

The uncorrected FET has been used (by default) for years in Title VII litigation. As far as we are aware, however, the FET has not yet been specifically challenged (compared to alternatives) under the criticisms that have been leveled in more recent years. This is likely because, for such a challenge to occur, the rare situation would need to emerge where a litigated adverse impact case is significant using one test and not significant using the other, so that the court would have to choose between the tests. Given the background described above, we do not believe the FET would survive a Daubert challenge. However, if a situation emerged where the opposing experts in an EEO case agreed upon the 2 X 2 sampling circumstances in the case, one of the 2 X 2 models could be mutually adopted. Even if the situation was as close to a conditional circumstance as possible, deciding whether to correct for discreteness might still be an issue of contention (see Agresti, 2007, p. 49).

More than 20 articles in statistical research journals, and the majority of categorical statistics texts published over the last 10 years, endorse the FET only for conditional use (a circumstance rarely met in adverse impact settings), thoroughly document its conservative nature, and recommend or endorse other techniques such as Lancaster’s Mid-P ("LMP" hereafter). Employers would therefore be much safer in litigation settings using the LMP. It is likely for these reasons that the application of the LMP has been more recently discussed in the EEO litigation and compliance literature (DCI Consulting, 2010; Ruggieri, Pedreschi, & Turini, 2010), software programs (Biddle Consulting Group, 2010), and EEO court cases (Strong v. Blue Cross, 2010; Delgado-O’Neil v. City of Minneapolis, 2010).

Beyond the legal implications and challenges that may come from analysis systems that use the FET, HR professionals as "liability analysts" are likely to want to use more balanced methods that better fit all three 2 X 2 situations and do not produce such conservative results. The LMP provides one alternative that fits all three 2 X 2 analysis conditions.

Lancaster’s Mid-P (LMP) as the Solution

For the reasons discussed above, we advocate using the Lancaster mid-P correction to the FET, which effectively corrects the FET to more accurately reflect the probability values of the adverse impact case analyzed in any of the three 2 X 2 models. This is because in the clearly conditional fixed model, the LMP provides a correction for discreteness that adjusts the FET to a less conservative alpha level (Agresti, 2007). In mixed and free marginal settings, the functional mechanics of the LMP result in computed values for various settings that accurately emulate the results of unconditional exact tests.
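The arithmetic of the correction is small enough to show directly. The sketch below computes a one-sided (lower-tail) FET p-value and the corresponding mid-P value, which counts only half the probability of the observed table. The one-sided form, the function names, and the example counts (1 of 8 selections going to a focal group of 10, with a reference group of 15) are assumptions made for illustration; two-sided mid-P definitions vary across the literature and are not covered here.

```python
from math import comb


def hypergeom_pmf(a, n1, n2, k):
    """Probability that a of the k selections fall in the focal group
    (size n1) when the reference group has size n2 and margins are fixed."""
    return comb(n1, a) * comb(n2, k - a) / comb(n1 + n2, k)


def one_sided_p_values(a_obs, n1, n2, k):
    """Return (FET p, mid-P) for the lower tail, i.e. the probability of
    seeing a_obs or fewer focal-group selections under the null."""
    lo = max(0, k - n2)  # smallest focal count the margins allow
    point = hypergeom_pmf(a_obs, n1, n2, k)
    below = sum(hypergeom_pmf(a, n1, n2, k) for a in range(lo, a_obs))
    fet_p = below + point        # observed table counted in full
    mid_p = below + 0.5 * point  # observed table counted at half weight
    return fet_p, mid_p


# Hypothetical counts chosen only for illustration.
fet_p, mid_p = one_sided_p_values(a_obs=1, n1=10, n2=15, k=8)
print(f"FET p = {fet_p:.4f}, mid-P = {mid_p:.4f}")
```

The mid-P value is always smaller than the FET p-value by exactly half the probability of the observed table, which is why the adjustment matters most in small samples, where that single probability is large.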

The versatile nature of the LMP is a key characteristic for practitioners and employers. One can only imagine the difficulties of having to go through a decision tree to choose which of the three models is most appropriate for each and every adverse impact analysis, only to then have to defend exactly which margin was fixed, mixed, or free in litigation or enforcement settings. In addition, practitioners would be faced with choosing among the 22 tests that are available for analyzing 2 X 2 tables, each with its own strengths and limitations, and then with deciding whether any corrections will be made for discreteness. Our research has shown that the LMP is highly balanced and has been well supported in the literature for analyzing 2 X 2 tables in a variety of adverse impact situations.

Beyond what is mentioned above, Hirji (2006) provided several additional reasons why LMP is the preferred correction for the FET: (1) statisticians who hold very divergent views on statistical inference have either recommended or given justification for the mid-P method, (2) the power of the mid-P tests is generally close to the shape of the ideal power function, (3) in a wide variety of designs and models, the mid-P rectifies the extreme conservativeness of the traditional exact conditional method without compromising the Type I error in a serious manner, and (4) empirical studies show that the performance of the mid-P method resembles that of the exact unconditional methods and the conditional randomized methods (Hirji, 2006, pp. 218-219). Hirji concludes by stating: "The mid-P method is thus a widely-accepted, conceptually sound, practical and among the better of the tools of data analysis. Especially for sparse and not that large a sample size discrete data, we thereby echo the words of Cohen and Yang (1994) that it is among the 'sensible tools for the applied statistician.'"

Conclusions

The choice among procedures for testing statistical significance in 2 X 2 tables has been an issue of continuing research and debate for decades. Our review of the literature identified no fewer than 22 tests to choose among, each with its own particular assumptions, strengths, and weaknesses (Upton, 1982). The availability of alternative significance tests suggests that employers who find themselves as defendants in Title VII settings will be called on to defend not only the results of their adverse impact analysis, but also how those statistics were calculated.

References

Agresti, A. (2007). An introduction to categorical data analysis (2nd ed.). Wiley.

Bobko, P. & Roth, P. L. (2004). Personnel selection with top-score-referenced banding: On the inappropriateness of current procedures. International Journal of Selection and Assessment, 12 (4), 291-298.

Camilli, G. & Hopkins, K. D. (1979). Testing for association in 2 X 2 contingency tables with very small sample sizes. Psychological Bulletin, 86, 1011-1014.

Collins, M. W. & Morris, S. B. (2008). Testing for adverse impact when sample size is small. Journal of Applied Psychology, 93, 463-471.

Crans, G. G. & Shuster, J. J. (2008). How conservative is Fisher’s exact test? A quantitative evaluation of the two-sample comparative binomial trial. Statistics in Medicine, 27 (8), 3598-3611.

Hirji, K. F., Tan, S. & Elashoff, R. M. (1991). A quasi-exact test for comparing two binomial proportions. Statistics in Medicine, 10, 1137-1153.

Lin, C. Y. & Yang, M. C. (2009). Improved p-value tests for comparing two independent binomial proportions. Communications in Statistics - Simulation and Computation, 38 (1), 78-91.

Lydersen, S., Fagerland, M. W. & Laake, P. (2009). Recommended tests for association in 2 X 2 tables. Statistics in Medicine, 28, 1159–1175.

Mehrotra, D. V., Chan, I. S. F. & Berger, R. L. (2003). A cautionary note on exact unconditional inference for a difference between two independent binomial proportions. Biometrics, 59, 441–450.

Plackett, R. L. (1984). Discussion of Yates’ ‘Tests of significance for 2 X 2 contingency tables.’ Journal of the Royal Statistical Society, Series A, 147, 426-463.

Upton, G. (1992). Fisher’s exact test. Journal of the Royal Statistical Society, Series A, 155, 395–402.