Machine Learning Hasn't Solved Psychology's Replication Crisis
New research attempts to automate predictions of which psychology findings will replicate. Here's why it doesn't work.
This is a repost of a Psychology Today article I wrote, with an expanded discussion at the bottom to clarify my views. It's another piece where I got pushback from editors that I didn't completely agree with, so I wanted to express my thinking more fully here.
Replicating studies is key to gaining confidence in them. We don't just want psychological effects that happened once in a lab; we want effects that are broadly true and can help us improve our lives in the real world. But conducting replication studies is difficult, time-consuming, and often fraught with academic in-fighting. What if we could use machine learning to automate this process and generate replication scores for thousands of studies at a time?
New research by Wu Youyou, Yang Yang, and Brian Uzzi in the Proceedings of the National Academy of Sciences tries to do this. They use machine learning to predict how well psychology research will replicate across several subfields (e.g., clinical psychology, developmental psychology, social psychology). This is an ambitious paper, and it offers some insight into replication in psychology. However, issues with the machine learning approach should make us cautious when interpreting the results.
What did they do?
The researchers collected a sample of 388 psychology studies that had previously been replicated and used them to train their machine learning model. These were existing replication efforts conducted for other reasons, such as the Reproducibility Project: Psychology (RPP) and the Life Outcomes of Personality Replication (LOOPR) project. The text of each paper was analyzed using a well-known algorithm. Roughly, the algorithm counts how often every word in a paper is used and then converts those counts into a series of 200 numbers based on common word associations in social science research. These 200-number summaries of the manuscript text are then used to train a machine learning model to predict whether a study replicated or not.
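To make that pipeline concrete, here is a minimal sketch of the general idea as I read it: collapse each paper into a 200-number summary, then fit a classifier on papers whose replication outcomes are already known. Everything in it (the random placeholder vectors, the logistic classifier, the toy texts and labels) is my own assumption for illustration, not the authors' code or corpus.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
word_vectors = {}  # word -> 200-dim vector; stand-in for embeddings trained on social science text

def paper_to_vector(text, dim=200):
    """Collapse a paper's words into one 200-number summary by averaging word vectors."""
    words = text.lower().split()
    if not words:
        return np.zeros(dim)
    vecs = [word_vectors.setdefault(w, rng.normal(size=dim)) for w in words]
    return np.mean(vecs, axis=0)

# Train on papers whose replication outcome is already known (e.g., from RPP or LOOPR)...
train_texts = ["participants were primed with money cues before a judgment task",
               "a preregistered survey of personality traits and life outcomes"]
train_labels = [0, 1]  # 0 = failed to replicate, 1 = replicated
X = np.array([paper_to_vector(t) for t in train_texts])
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)

# ...then score papers that have never been replicated.
new_paper = "we manipulated anchoring and measured numerical judgments"
print(clf.predict_proba(paper_to_vector(new_paper).reshape(1, -1))[0, 1])
```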
Then the researchers used the model trained on existing replications to predict whether other papers would replicate (if someone were to try to replicate them in the future). They made these predictions for a much larger set of more than 14,000 papers, covering almost every paper published in six top subfield journals over two decades. Finally, they analyzed these predictions to try to better understand those subfields.
Potential Problems With the Research
Careful readers of this paper might notice some potential issues right away.
1. How accurate were these predictions?
The accuracy was decent but not great: 68%. So when the authors analyze predictions for 14,000 new papers, we should expect roughly one in three of those predictions to be wrong.
Further, we can do a quick check by comparing the model's predicted replication rate for a field against the rate observed in completed replications. Sometimes it lines up: for social psychology, the replication rate in completed research is 38%, and the predicted rate is 37%. But sometimes it's far off: for personality psychology, the replication rate in completed research is 77%, but the predicted rate is 55%. This should give us pause when drawing conclusions from this model.
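For what it's worth, this kind of field-level check is just a comparison of two averages; the helper function below is a hypothetical illustration of the arithmetic, not something from the paper.

```python
def field_calibration(observed_outcomes, predicted_probabilities):
    """Compare a subfield's observed replication rate with the model's average prediction."""
    observed_rate = sum(observed_outcomes) / len(observed_outcomes)
    predicted_rate = sum(predicted_probabilities) / len(predicted_probabilities)
    return observed_rate, predicted_rate

# Toy numbers: 2 of 5 completed replications succeeded (0.40 observed),
# while the model's average prediction for those studies is 0.37.
print(field_calibration([1, 0, 0, 1, 0], [0.45, 0.30, 0.35, 0.50, 0.25]))
```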
2. Is it really reasonable to expect previous replication studies to predict new ones?
Answering this question means determining whether the previous replication studies do a good job of representing any and all possible future replications (at least from these six journals). There are a couple of reasons to think they do not.
First, the previous replication studies don't include any studies from clinical psychology or developmental psychology. That's a problem, because this paper wants to make predictions about the top papers in both of those fields. Since the model wasn't trained on any papers from them, the 68% accuracy is likely to drop further when it encounters these new, different kinds of papers. (The authors try to address this by arguing that the words used in those papers are similar to the words used in areas where we do have replications, but it's not entirely convincing.)
Second, even in the areas where several replications exist, they don't represent all kinds of studies equally well. For example, more replications have been done of social psychology experiments that can be run quickly on a computer than of studies that involve recording interactions and coding or rating behavior. So predictions for the latter types of studies may also be less accurate.
3. Is a model based on lexical associations the best way to evaluate studies that have markers such as p-values?
The use of word vectors (200 numbers derived from the authors' word choices) means that this machine learning approach relies on word associations alone. Other factors, beyond which words were used, are clearly important. For example, we know that studies with p-values that just barely cross the threshold for publication tend to be less reliable than studies with p-values that clear it by a wide margin. If this information were used and accuracy rose by 5-10 percentage points, I'd be much more confident in any conclusions drawn from the predictions.
What Can We Learn?
Youyou and colleagues conclude that their "model enables us to conduct the first replication census of nearly all of the papers published in psychology’s top six subfield journals over a 20-year period." While they do generate and analyze predictions from this large set of manuscripts, the concerns over accuracy and applying the algorithm to new types of data (e.g., new subfields, new types of research) make me skeptical of being able to draw reliable conclusions from the algorithm's output.
That said, the authors make several arguments where their algorithm's output matches the existing literature, and it is that match that makes these findings convincing to me:
There isn't just one replication rate for psychology; replication rates should really be considered by area (e.g., personality psychology does better than social psychology).
Lead authors who publish more and in better journals tend to have work that replicates more, but working at a prestigious university doesn't predict better replication rates.
Studies that get more media attention tend to replicate less—possibly because the media is drawn to flashy, counterintuitive stories that are also less likely to stand the test of time.
Finally, the authors found that experimental research (where psychologists actively manipulate conditions) tends to replicate less than non-experimental research (where psychologists observe behavior and report what is related to what). This is somewhat surprising, but it seems to me like it might be explained by the sample used to train the model: personality psychology, which tends to be more methodologically rigorous and observational, replicates more; social psychology, which tends to be more methodologically lax and experimental, replicates less. Machine learning models pick up on patterns in the data they are trained on. Just as training a crime prediction model on racially biased data will reproduce those biases, training a replication prediction model on data skewed toward observational research will reproduce that skew. There may be unique advantages to observational research in psychology over experiments, but I'm not yet convinced.
Overall, this manuscript represents an interesting contribution to a growing literature on using machine learning to evaluate the research literature. A lot of computational work went into developing both the text-based features and the predictions for the more than 14,000 new studies. While the algorithm isn't yet accurate enough to support strong conclusions, there is potential that in a few years, automated overviews of the field based on this kind of model will be precise enough for us to make confident statements about psychology as a whole.
The Added Context
Could this approach work? One of my major concerns with this paper is that it mixes several potentially incompatible goals. As a result, I don't think it does a good enough job at any of them.
Goal 1: Provide a Good Estimate of Replication Rates Across Psychology
It would be really useful to have a good estimate of the replication rates across many top journals in psychology. This would provide evidence for whether the field needs reform, which could be used to drive key policy decisions at the field level. To build a machine learning model that could do this effectively, we would need to take into account all of the information we have about what makes a study a good candidate to replicate or not. That includes standard information that statisticians have been discussing for decades:
How many people were used to study the question? (larger samples give more reliable results)
What was the p-value of the original effect? (smaller original p-values are more likely to replicate)
What was the effect size of the original effect? (larger effects, when seen in big enough samples, tend to replicate better)
These are simple pieces of information that prior machine learning work on automatically scoring replication has used successfully. Not using them here means that key information that would have improved model accuracy was left on the table.
I suspect it was left on the table because this information is more difficult to gather. Extracting it usually takes a combination of automated tools and hand-coding, and neither method is perfect. But taking the time to do this work carefully and correctly, drawing on every tool the field has developed, would still be a substantial time savings compared to actually running thousands of replication studies.
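As a rough sketch of what folding these markers into the model might look like (the feature transformations, the scaling, and the classifier below are my assumptions, not a pipeline from this paper or the prior work), you could simply concatenate the statistical markers onto the 200-number text summary before training:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def study_features(sample_size, p_value, effect_size, text_vector):
    """Stack interpretable statistical markers next to the 200-dim text embedding."""
    stats = np.array([np.log(sample_size),   # larger samples give more reliable results
                      -np.log10(p_value),    # distance from the p = .05 publication threshold
                      effect_size])          # larger effects (in big samples) replicate better
    return np.concatenate([stats, text_vector])

# Toy training rows: (N, p, effect size, placeholder text vector) plus the replication outcome.
rng = np.random.default_rng(1)
rows = np.array([study_features(40, 0.049, 0.2, rng.normal(size=200)),
                 study_features(800, 0.0001, 0.5, rng.normal(size=200))])
labels = [0, 1]  # 0 = failed to replicate, 1 = replicated

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(rows, labels)
```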
Goal 2: Creating a Quick and Dirty Estimate of Replication
The authors don't say this is their goal, but that's very much how I read the paper. When I attended the Metascience 2019 conference, I saw models that reached accuracy in the upper 80s to 90 percent range, and they used the kinds of features I described above. The authors of this new paper instead decided to use only features derived from the text of the article. That is, they used text mining techniques to see whether replication could be predicted with a (relatively) easy workflow that doesn't require anyone to read, skim, or hand-code anything from the papers.
This is also a valuable goal. Getting even a (very) rough sense of replication without having to skim a paper would be a useful tool. But that's not how the paper is sold. It's sold as providing reliable evidence about thousands of papers. It can't do that, but it does have a place on the effort/reward tradeoff. It's the kind of thing a journal editor could use to triage submissions (desk reject papers with a very low replication likelihood), or an applied researcher could use to prioritize a reading list (only keep up with the literature that has a high likelihood of replicating). But it's not a foolproof tool, and it can't give reliable estimates of an entire field's or journal's replication rate.
Goal 3: Collating and Analyzing a Large Set of Existing Replications
What I appreciated most about this paper was the secondary data analysis. That is, I didn't particularly care for the machine learning work, but by combining data from many replication projects and reporting the trends in that combined dataset, the authors gave me a good snapshot of what we've discovered about replication in psychology over the last decade or so.
Presenting that snapshot wasn't really the authors' goal; they assembled data on existing replications as a means of building a training dataset for their machine learning model. But that training dataset is what I found most interesting, because each of the projects it draws on involves the kind of careful, time-intensive labor that doesn't scale to an automated tool. Combining the results of that labor was, for me, more interesting and valuable than reading the unreliable output of a first pass at a quick-and-dirty machine learning tool.
The Missing Piece: More Training Data
I wouldn't mind seeing another attempt at this project. However, I'd like to see it done once more training data are available, so that the model can be trained on a wider variety of study types. In particular, I'd like to see a training dataset that includes replication studies from developmental psychology (how does baby behavior replicate?), clinical psychology, hand-coded behavioral research, and more complex research designs (like longitudinal studies that follow people for months or years). Of course, the only way to get this kind of dataset is to invest resources in conducting these more difficult, time-intensive replication projects. While rolling out a new tech tool is exciting, as we have seen recently with large language models in AI, rolling it out poorly is often worse than taking the time to get it right.