Statistically Controlling for Confounding Constructs is Harder than You Think—Jacob Westfall and Tal Yarkoni

Last week, I posted “Adding a Variable Measured with Error to a Regression Only Partially Controls for that Variable.” Today, to reinforce that message, I’ll discuss the PLOS ONE paper “Statistically Controlling for Confounding Constructs is Harder than You Think” (ungated), by Jacob Westfall and Tal Yarkoni. All the quotations in this post come from that article. (Both Paige Harden and Tal Yarkoni himself pointed me to this article.)

A key bit of background is that social scientists are often interested not just in prediction, but in understanding. Jacob and Tal write:

To most social scientists, observed variables are essentially just stand-ins for theoretical constructs of interest. The former are only useful to the extent that they accurately measure the latter. Accordingly, it may seem natural to assume that any statistical inferences one can draw at the observed variable level automatically generalize to the latent construct level as well. The present results demonstrate that, for a very common class of incremental validity arguments, such a strategy runs a high risk of failure.

What is “incremental validity”? Jacob and Tal explain:

When a predictor variable in a multiple regression has a coefficient that differs significantly from zero, researchers typically conclude that the variable makes a “unique” contribution to the outcome.

“Latent variables” are the underlying concepts or “constructs” that social scientists are really interested in. This passage distinguishes latent variables from the “proxies” actually in a data set:

And because measured variables are typically viewed as proxies for latent constructs of substantive interest … it is natural to generalize the operational conclusion to the latent variable level; that is, to conclude that the latent construct measured by a given predictor variable itself has incremental validity in predicting the outcome, over and above other latent constructs that were examined.

However, this is wrong, for the reason stated in the title of my post: “Adding a Variable Measured with Error to a Regression Only Partially Controls for that Variable.” Here, it is crucial to realize that any difference between the variable actually available in a data set and the underlying concept it is meant to proxy for counts as “measurement error.”
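To make the point concrete, here is a minimal simulation of my own (it is not from Jacob and Tal's paper, and the particular numbers are just for illustration), written in Python with numpy and statsmodels. The predictor x has no direct effect on the outcome y once the confounding construct z is accounted for, but the data set contains only a noisy proxy w for z. Controlling for w instead of z leaves a clearly nonzero coefficient on x:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000
reliability = 0.8  # share of the proxy's variance that reflects the true construct

# The latent confound z drives the outcome; the predictor x is correlated with z
# but has NO direct effect on y; the data set contains only a noisy proxy w for z.
z = rng.normal(size=n)
x = 0.7 * z + rng.normal(scale=np.sqrt(1 - 0.7**2), size=n)
y = 0.5 * z + rng.normal(size=n)
w = np.sqrt(reliability) * z + rng.normal(scale=np.sqrt(1 - reliability), size=n)

fit_construct = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
fit_proxy = sm.OLS(y, sm.add_constant(np.column_stack([x, w]))).fit()

print(f"coef on x, controlling for the construct z: {fit_construct.params[1]:.3f}")  # about 0.00
print(f"coef on x, controlling for the proxy w:     {fit_proxy.params[1]:.3f}")      # about 0.12
```

With the construct itself in the regression, the coefficient on x is essentially zero; with the proxy, the coefficient is biased away from zero, and with ten thousand observations it will look highly "significant."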

How bad is the problem?

The scope of the problem is considerable: literally hundreds of thousands of studies spanning numerous fields of science have historically relied on measurement-level incremental validity arguments to support strong conclusions about the relationships between theoretical constructs. The present findings inform and contribute to this literature—and to the general practice of “controlling for” potential confounds using multiple regression—in a number of ways.

Unless a measurement error model is used, or a concept is measured exactly, the words “controlling for” and “adjusting for” are red flags for problems:

… commonly … incremental validity claims are implicit—as when researchers claim that they have statistically “controlled” or “adjusted” for putative confounds—a practice that is exceedingly common in fields ranging from epidemiology to econometrics to behavioral neuroscience (a Google Scholar search for “after controlling for” and “after adjusting for” produces over 300,000 hits in each case). The sheer ubiquity of such appeals might well give one the impression that such claims are unobjectionable, and if anything, represent a foundational tool for drawing meaningful scientific inferences.

Unfortunately, incremental validity claims can be deeply problematic. As we demonstrate below, even small amounts of error in measured predictor variables can result in extremely poorly calibrated Type 1 error probabilities.

… many, and perhaps most, incremental validity claims put forward in the social sciences to date have not been adequately supported by empirical evidence, and run a high risk of spuriousness.

The bigger the sample size, the more confidently researchers will assert things that are wrong:

We demonstrate that the likelihood of spurious inference is surprisingly high under real-world conditions, and often varies in counterintuitive ways across the parameter space. For example, we show that, because measurement error interacts in an insidious way with sample size, the probability of incorrectly rejecting the null and concluding that a particular construct contributes incrementally to an outcome quickly approaches 100% as the size of a study grows.
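The same toy simulation from above (again, my own sketch rather than the authors' code) shows this sample-size effect: the bias that comes from controlling for the noisy proxy does not shrink as n grows, but the standard errors do, so the rate of spuriously "significant" unique contributions of x climbs toward 100%.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
reliability = 0.8

def spurious_rejection_rate(n, n_sims=500, alpha=0.05):
    """Share of simulations in which x is (wrongly) declared a 'unique' predictor of y."""
    rejections = 0
    for _ in range(n_sims):
        z = rng.normal(size=n)
        x = 0.7 * z + rng.normal(scale=np.sqrt(1 - 0.7**2), size=n)
        y = 0.5 * z + rng.normal(size=n)  # y depends on z only, not on x
        w = np.sqrt(reliability) * z + rng.normal(scale=np.sqrt(1 - reliability), size=n)
        p_value_x = sm.OLS(y, sm.add_constant(np.column_stack([x, w]))).fit().pvalues[1]
        rejections += p_value_x < alpha
    return rejections / n_sims

for n in (50, 200, 1000, 5000):
    print(f"n = {n:5d}: spurious rejection rate ~ {spurious_rejection_rate(n):.2f}")
```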

The fundamental problem is that the imperfection of the variables actually in data sets as proxies for the concepts of interest doesn’t just make it harder to know what is going on; it biases results. If researchers treat proxies as if they were the real thing, there is trouble:

In all of these cases—and thousands of others—the claims in question may seem unobjectionable at face value. After all, in any given analysis, there is a simple fact of the matter as to whether or not the unique contribution of one or more variables in a regression is statistically significant when controlling for other variables; what room is there for inferential error? Trouble arises, however, when researchers behave as if statistical conclusions obtained at the level of observed measures can be automatically generalized to the level of latent constructs [9,21]—a near-ubiquitous move, given that most scientists are not interested in prediction purely for prediction’s sake, and typically choose their measures precisely so as to stand in for latent constructs of interest.

Jacob and Tal have a useful section in their paper on statistical approaches that can deal with measurement error under assumptions that, while perhaps not always holding, are a whole lot better than assuming a concept is measured precisely by the proxy in the data for that concept. (A bare-bones sketch of one simple correction appears below.) They also make the point that, after correctly accounting for measurement error—including any differences between what is in the data and the underlying concept of interest—often there is not enough statistical power in the data to say much of anything. That is life. Researchers should be engaging in collaborations to get large data sets that—properly analyzed with measurement error models—can really tell us what is going on in the world, rather than using data sets they can put together on their own that are too small to reliably tell what is going on. (See “Let's Set Half a Percent as the Standard for Statistical Significance.” Note also that preregistration is one way to make results at a less strict level of statistical significance worth taking seriously.) On that, I like the image at the top of Chris Chambers’s Twitter feed:

[Image: “what's best for science” cartoon from the top of Chris Chambers’s Twitter feed]
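To give a flavor of the remedies Jacob and Tal discuss: their paper emphasizes structural equation models that treat the constructs as latent variables, but even a bare-bones errors-in-variables adjustment conveys the idea. The sketch below is my own, not the authors' code, and it leans on the strong assumption that the proxy's reliability is known; it simply subtracts the estimated error variance of the proxy from the predictor covariance matrix before solving for the regression coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reliability = 100_000, 0.8

# Same data-generating process as in the earlier sketches.
z = rng.normal(size=n)
x = 0.7 * z + rng.normal(scale=np.sqrt(1 - 0.7**2), size=n)
y = 0.5 * z + rng.normal(size=n)
w = np.sqrt(reliability) * z + rng.normal(scale=np.sqrt(1 - reliability), size=n)

predictors = np.column_stack([x, w])
S_xx = np.cov(predictors, rowvar=False)                    # naive covariance matrix of the predictors
s_xy = np.array([np.cov(x, y)[0, 1], np.cov(w, y)[0, 1]])  # covariances of the predictors with y

naive = np.linalg.solve(S_xx, s_xy)                        # ordinary regression coefficients

S_corrected = S_xx.copy()
S_corrected[1, 1] -= (1 - reliability) * S_xx[1, 1]        # strip the error variance out of the proxy w
corrected = np.linalg.solve(S_corrected, s_xy)

print(f"naive coefficient on x:     {naive[0]:.3f}")      # biased away from zero
print(f"corrected coefficient on x: {corrected[0]:.3f}")  # close to the true value of zero
```

With the error variance removed, the coefficient on x falls back to roughly zero, which is what an analysis at the level of the constructs would have shown all along. In practice the reliability has to be estimated too, which is exactly the kind of job latent variable models such as structural equation models are built for.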

Dan Benjamin, my coauthor on many papers, and a strong advocate for rigorous statistical practices that can really help us figure out how the world works, suggested the following two articles as also relevant in this context: