In their article “A Feminist Review of Behavioral Economic Research on Gender Differences,” published in the April 2019 issue of *Feminist Economics*, Esther-Mirjam Sent and Irene van Staveren state that their work was inspired by my own work. In a series of articles (here, here, and here) and a book, I had performed meta-analyses of behavioral economics work concerning gender and risk-taking. Sent and van Staveren are to be commended for taking on the ambitious project of extending the focus to include investigations into not only risk, but also overconfidence, altruism, and trust. They also come to conclusions that are, based on my own investigation, broadly correct: “[F]ew studies report statistically significant as well as sizeable differences,” “large intra-gender differences (differences among men and differences among women) exist,” and “[m]any studies have not sufficiently taken account of various social, cultural, and ideological drivers.” I feel obligated, however, to point out that there are a number of methodological problems in their article. While the article is certainly notable, considerable caution should be exercised about taking its methods as a model for future work.

In order to understand how the behavioral economics literature on gender has become so misleading, and sort out what the evidence actually says, it is vitally important to understand two things. The first is the difference between substantive and statistical significance. The second is the important role played by publication bias. Sent and van Staveren’s article, unfortunately, does not deal adequately with either. As a result, its findings are difficult to interpret.

### Substantive vs. Statistical Significance

While behavioral economists regularly report measures of statistical significance (such as t-statistics and p-values) in relation to “gender differences,” they have not, at least yet, commonly calculated measures that express the *substantive size* of the difference. I introduced the effect size measure called “Cohen’s d” into this economics literature as a remedy. Cohen’s d expresses the difference between male and female mean scores on some variable in standard deviation units, yielding a measure of the substantive size of the difference that can be compared across studies. For example, male and female average heights are about 2.6 standard deviations apart (that is, d=2.6), indicating that the distributions overlap only towards their tails. In contrast, estimates of behavioral differences very rarely exceed d=1.0 in absolute value and are most often considerably smaller, indicating male and female distributions that may lie almost on top of each other. A broad meta-analysis of the psychological literature on behaviors ranging from math performance to aggressive behavior to speaking styles (here), for example, found that 78% of the gender differences were associated with Cohen’s d values of less than 0.35.
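For readers who want to see the measure concretely, here is a minimal sketch of the standard Cohen’s d computation (difference in means divided by the pooled standard deviation); the function name and the toy data are my own illustration, not drawn from any of the studies discussed:

```python
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d: difference in sample means, in pooled-standard-deviation units."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    # Pooled SD weights each group's variance by its degrees of freedom.
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (mean_a - mean_b) / pooled_sd

# Toy example: means one pooled standard deviation apart gives d = -1.0.
d = cohens_d([1, 2, 3], [2, 3, 4])
```

Because the result is expressed in standard deviation units, it can be compared across studies that measure behavior on entirely different scales.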

Sent and van Staveren also use this measure, but portray its relation to statistical significance erroneously. They state that “Cohen’s d …does not fully equate to statistical significance. An article might report that the results indicate an insignificant difference between men and women, yet this does not imply that d = 0 (it rather implies that d will be in the neighborhood of 0).” But Cohen’s d doesn’t just “not fully equate” with statistical significance: It represents an entirely different concept. And one cannot make inferences about the size of Cohen’s d from whether it is statistically significant or not.

I suspect that this confusion may arise because the formulas for estimating Cohen’s d and the related t-statistic (for a null hypothesis of equality of means) both have the difference between the men’s and women’s means as the numerator. So it would seem that if the difference between means is big, both d and t should be big, and if the difference is small, both d and t should be small. But this reasoning ignores what goes on in the denominators. The sample size is essentially irrelevant for determining Cohen’s d. (The male and female sample sizes in the Cohen’s d denominator are simply weights. One can easily check that they cancel out when the sample sizes are the same for the two groups.) This (non-)relation of Cohen’s d to the sample size contrasts markedly with the case of the t-statistic, which systematically gets larger (in absolute value) as the sample size gets larger and our estimates therefore become more reliable and precise.

So when the sample is small and therefore quite noisy, t will tend to be small. But the small size of the sample has no implications for the size of d. In Table 6.2 of my book (or here), for example, I report some point estimates of d that have confidence intervals that include zero (that is, that are statistically insignificant) yet take on values as large (in absolute value) as -0.68 and +0.89. Considering that the most precise (that is, large sample) estimates I could find for gender differences in risk-taking were in the neighborhood of d=0.13 (book, p. 63), such values are quite far from “the neighborhood of 0”! In compiling Table 5.1 in my book (or here) I did not have the data to compute confidence intervals, so I only reported numerical d values *that corresponded to statistically significant differences in means*, and listed non-statistically-significant ones as “NSS.” To do otherwise would have meant reporting quite unreliable numbers.
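The contrast between the two denominators can be made concrete with a small numerical sketch. The data below are hypothetical, and a Welch-style t-statistic is used for simplicity: replicating the same two samples tenfold leaves d essentially unchanged while the t-statistic grows roughly with the square root of the sample size.

```python
import statistics

def d_and_t(a, b):
    """Return (Cohen's d, Welch-style t) for two samples."""
    n_a, n_b = len(a), len(b)
    diff = statistics.fmean(a) - statistics.fmean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    d = diff / pooled_sd                            # denominator unaffected by n
    t = diff / (var_a / n_a + var_b / n_b) ** 0.5   # denominator shrinks as n grows
    return d, t

group_a = list(range(10))                  # hypothetical scores
group_b = [x + 1 for x in range(10)]       # same spread, mean shifted by 1
d_small, t_small = d_and_t(group_a, group_b)            # n = 10 per group
d_large, t_large = d_and_t(group_a * 10, group_b * 10)  # same data, n = 100
# d barely moves with the tenfold sample, but |t| roughly triples.
```

This is exactly why a small, noisy sample can deliver a large point estimate of d that is nonetheless statistically insignificant, and a very large sample can deliver a statistically significant but substantively tiny d.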

When the sample size is large (so we have more information to go on), the converse can occur, and one may encounter a highly statistically significant but substantively tiny value of d. In one of the most precise studies I reviewed (book, p. 94), a sample of tens of thousands of individuals yielded a statistically significant estimate of only d=0.08. Such a case, in fact, illustrates my major critique of the behavioral economics “gender differences” literature: The titles and conclusions of articles often trumpet the finding of a statistically significant “gender difference,” when in fact the degree of substantive difference between the distributions is usually small and the findings have virtually no implications for predicting the behaviors of individual men and women.

The confusion about statistical and substantive significance also carries over to these authors’ discussion of the Index of Similarity (IS). Introduced by me into this literature based on similar measures applied in other contexts, IS represents the proportion by which (discrete) distributions overlap. (Note that while a larger d means more *difference*, as IS rises from 0 towards 1 it indicates less difference and more *similarity*.) Citing an earlier article and discussing issues of sample size, Sent and van Staveren claim that 0.75 has been shown to be the appropriate “cutoff value” for determining *substantive* significance using IS. That is, they seem to want to consider two distributions to be *substantively similar* only if they overlap by at least 75%. The article they cite, however, shows nothing of the kind. The article is about sampling distributions and *statistical* significance, and the figure of 0.75 seems, from the best I can tell, to be drawn arbitrarily from a particular Monte Carlo simulation.
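As an illustration of the idea of overlap (a sketch of one common overlap computation for discrete distributions, not necessarily the exact variant used in my published tables), the shared proportion can be computed by summing, over categories, the smaller of the two groups’ shares:

```python
from collections import Counter

def index_of_similarity(a, b):
    """Overlap of two discrete distributions: the sum, across categories,
    of the smaller of the two groups' proportions in that category."""
    p, q = Counter(a), Counter(b)
    n_a, n_b = len(a), len(b)
    return sum(min(p[k] / n_a, q[k] / n_b) for k in set(p) | set(q))

# Identical distributions overlap fully (IS = 1.0); disjoint ones not at all (IS = 0.0).
partial = index_of_similarity([1, 1, 2, 2], [2, 2, 3, 3])  # overlap of 0.5
```

An IS of 1 thus means the two distributions are indistinguishable in these categories, while an IS of 0 means they share no categories at all.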

In truth, the judgment of *substantive* significance has nothing to do with mathematical rules, but is rather a matter of thoughtful interpretation. When people have an essentialist, “Mars versus Venus” extreme prior belief about gender “differences,” they in effect picture Cohen’s d as being up around, say, 3.0 or more, and the Index of Similarity as equaling zero (no overlap). Noticing that an overlap exists *at all* may be a substantively important point! For example, by how much do the male and female distributions of the variable “number of lifetime pregnancies” overlap? Most people’s first impulse, since this question refers to basic biological functions, is to say “no overlap.” But this ignores the fact that some women are never pregnant. According to recent data for the United States, about 18% of women in their early forties have not borne a child. Even if the IS for pregnancies were as small a number as 5% or 10%, it would still be a good reminder that not all women share the same behaviors, nor do women’s and men’s behaviors always differ.

### Publication Bias

The second important point, in understanding how the behavioral economics literature on gender has become so distorted, is to recognize that the observable literature on a topic is not necessarily representative of the *actual body of evidence* as contained in the data that was originally gathered. The statistical and qualitative evidence I compiled shows quite clearly that stereotype-consistent “findings” about gender differences are far more likely to be published, and to be highlighted in a publication, than results that might challenge stereotyped beliefs. The counter-stereotypical finding of “no statistically significant difference” between men’s and women’s average responses is particularly dreaded by researchers, given current professional and editorial practices. It is widely taken to indicate a failure in the analysis and understood to make the results uninteresting and unpublishable. Yet in the context of studying “gender differences,” finding *no evidence to support the hypothesis of gender difference* is (when the sample size is respectably large) a *real finding*! That is, it provides interesting evidence of *similarity*.

A further question about representativeness arises when one also considers the practices of data gathering. Behavioral economists tend to disproportionately study Western, industrialized, and often young and university-related populations. Hence, the existing body of evidence already reflects a selectivity in regard to exactly *which* men and women, out of humanity as a whole, we are talking about, even before findings are further narrowed by the publishability hurdle.

Yet Sent and van Staveren’s focus (according to their abstract, Introduction, Conclusion, and procedures) is on presenting “a critical review of the behavioral economics *literature*” (emphasis added). While they mention publication bias and cultural context in a few sentences here and there, rather little attention is paid to the fact that they are missing whole classes of studies (the ones consigned to the file drawer) and whole populations of people (those not studied). And it is not clear that even “the literature” is reviewed in a thorough or unbiased manner. Sent and van Staveren describe their selection process for articles in only three sentences, explaining that they “tried to do our best to capture key publications.” Yet a focus on “key publications” is likely, in current professional culture, to mean focusing on those studies that have gotten the most attention, which in turn tends to mean exactly those that have found the most extreme evidence of “difference.” The authors then further limit their selection to only articles that include all the statistics needed to compute Cohen’s d. (In my work, my research assistant and I wrote to authors to get the necessary statistics when they were missing. This time-consuming task was a big reason we looked only at the issue of risk.)

If Sent and van Staveren’s purpose were only to review and critique some “key studies” from a feminist perspective, this issue of representativeness would not be a big problem. But they attempt to come up with broad summaries of “the literature.” And, from their summaries, they attempt to draw conclusions about the behavior of actual men and women humanity-wide (e.g., “…leading us to believe that men and women tend to exhibit altruism at similar levels”). The lack of attention to representativeness puts those sorts of claims on very shaky ground.

### Interpreting their Findings

The earlier-mentioned confusion between statistical and substantive significance, combined with the lack of apparent attention to how “not statistically significant” can be a legitimate *finding*, make Sent and van Staveren’s tables hard to interpret. Are their estimated Cohen’s d-values and IS-values reported in numerical form only when derived from somewhat reliable sample sizes, or do they include possibly wild and statistically insignificant numbers? The authors don’t explain, and so we don’t know.

While the categories of “B,” “C,” and “D” in their tables might be expected to shed some light on how the reported measures of substantive significance relate to statistical significance, they unfortunately do not. The categories of “B” and “C” purport to refer to results that are all in a single direction, while Category “D” is described as the catchall for “mixed or insignificant results.” Yet it is not clear what these categories mean. Even the casual reader can see that results listed in the “B” and “C” sections are sometimes, in fact, mixed—that is, include findings with both negative and positive signs. With whole articles classified according to “B,” “C,” or “D,” we are given no information about what is going on *within* the range of studies within an article. If an article contains a mix of statistically significant and insignificant results, is that article a “B” or “C” based on its statistically significant results? Or a “D” based on its statistically insignificant ones? And what about articles (often centrally concerned with something other than gender—or they wouldn’t have gotten published) that reported *only* statistically insignificant results? Did the authors look at them? We aren’t told.

As these authors’ sample of risk articles overlaps somewhat with those I reviewed, I have also been able to do some spot checking of the tables. Unfortunately, I came up with quite a number of discrepancies. If you find you cannot understand the categorization scheme used in the tables, or replicate some of their numbers using information from the original articles, you are not alone, and the problem may not be with you.

### Conclusion

I would have hoped to see the problems I have discussed here remedied before publication, and a stronger article published as a result! It is not letting out any big secret to disclose that I was one of the anonymous reviewers of this article when it was in draft form. I raised these issues (and others) in my reviews, and I understand that other reviewers and the Associate Editor in charge of the paper raised some of them as well. Yet, after a certain point in the process, I was no longer asked to review revisions. I am puzzled and disappointed by the decision of the journal to let the article reach publication in its current form.

To repeat, I laud the authors of this article for their ambition, and, despite my critiques, believe that their conclusions are correct in their broadest and most easy-to-digest sense. Yet I hope this short piece of mine will prevent future researchers from adopting some of the more problematic interpretations and methodologies put forth in this article as models for their own work. We in feminist economics have long said that the creation of knowledge is a social endeavor, strengthened by broad communities of discourse and critique.