Here’s a study. Can you guess the result?
Cartoons are rated on a scale from 0 to 9 with 0 meaning “not at all funny” and 9 representing “very funny.”
The catch is that participants indicate their rating while holding their pen in their mouths. They need to record their ratings using their teeth to hold the pen, thus creating a forced smile, as in the picture on the left. Or they may need to record their ratings using their lips to hold the pen, thus creating a forced frown, as in the picture on the right.
The question is: which participants indicate that they find the cartoons funnier? Does forcing a smile mean you will rate the cartoons as funnier? If so, how much of a difference do you think it would make on the 0 to 9 comedy scale?
I’d like to hear a few opinions first before revealing the research results. I’ll add [UPDATED] to the post title when it’s ready. Please no spoilers from those who have read about this particular study before. If you’ve read about related ones though, you’re invited to speculate about this one.
Thank you to everyone who played. A special shout-out is due to dragonfrog, who engaged in a bit of self-experimentation.
The full answer ends up being complicated.
Strack, Martin, and Stepper (1988) found a rating difference of 0.82 units on the 0 to 9 comedy scale. People forced to “smile” by holding a pen with their teeth found the cartoons funnier than those forced to frown. That’s nearly a full point just due to something that you wouldn’t think would make a difference. This is bizarre and puzzling.
Accordingly, Strack, Martin, and Stepper’s 1988 paper became a widely cited article (1433 times according to Google Scholar). It also made an appearance in the hit bestseller Thinking, Fast and Slow by Daniel Kahneman.
But 1988 was a while ago. Recently, the results of 17 independent, pre-registered direct replications of this study were released. And they aren’t nearly as amazing.
Our meta-analysis revealed a rating difference of 0.03 units with a 95% confidence interval ranging from -0.11 to 0.16.
Three hundredths of a point on the 0 to 9 scale. That’s about 0.3%. You couldn’t get closer to it making absolutely no difference if you tried.
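For anyone who wants to check the arithmetic, here is a minimal sketch using only the figures quoted above (0.82, 0.03, and the interval from -0.11 to 0.16). The standard error is backed out of the interval under the assumption that it is a normal-approximation 95% interval; that is my assumption, not something stated in the replication report, and the variable and function names are made up for illustration.

```python
# Figures quoted in the post; everything derived from them is illustrative.
SCALE_MAX = 9.0           # top of the 0 to 9 rating scale
original_diff = 0.82      # Strack, Martin, and Stepper (1988)
replication_diff = 0.03   # meta-analytic estimate from the replications
ci_low, ci_high = -0.11, 0.16


def share_of_scale(diff, scale_max=SCALE_MAX):
    """Express a rating difference as a percentage of the scale maximum."""
    return 100.0 * diff / scale_max


# ASSUMPTION: the interval is estimate +/- 1.96 standard errors.
# The quoted bounds are rounded, so this is only approximate.
implied_se = (ci_high - ci_low) / (2 * 1.96)

# Does the original estimate fall inside the replication interval?
original_inside_ci = ci_low <= original_diff <= ci_high

if __name__ == "__main__":
    print(f"replication: {share_of_scale(replication_diff):.2f}% of the scale")
    print(f"original:    {share_of_scale(original_diff):.2f}% of the scale")
    print(f"implied standard error: {implied_se:.3f}")
    print(f"original inside replication CI: {original_inside_ci}")
```

The point of the comparison is simply that the original 0.82 sits far outside the interval the replications produced, while zero sits comfortably inside it.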
This is, frankly, what I would expect. A cartoon is funny or not, and it doesn’t matter what you are doing with your mouth any more than raising your eyebrows during a horror movie will make it more scary. But this was nevertheless accepted for 28 years, and I’m not sure even this replication will soundly trounce the original study. There are people who have built their careers on extensions to the original study. Will they give up so easily on an idea simply because it is demonstrably false?
Strack himself, of course, isn’t willing to give up yet. He wrote a response to the replicators. Here is his strongest criticism:
the authors have pointed out that the original study is “commonly discussed in introductory psychology courses and textbooks”. Thus, a majority of psychology students was assumed to be familiar with the pen study and its findings. Given this state of affairs, it is difficult to understand why participants were overwhelmingly recruited from the psychology subject pools. The prevalent knowledge about the rationale of the pen study may be reflected in the remarkably high overall exclusion rate of 24 percent. Given that there was no funneled debriefing but only a brief open question about the purpose of the study, to be answered in writing, the actual knowledge prevalence may even be underestimated.
That participants’ knowledge of the effect may have influenced the results is reflected in the fact that those 14 (out of 17) studies that used psychology pools gained an effect size of d = – 0.03 with a large variance (SD = 0.14), while the three studies using other pools (Holmes, Lynott, and Wagenmakers) gained an effect size of d = 0.16 with a small variance (SD = 0.06). Tested across the means of these studies, this difference is significant, t(15) = 2.35, p = .033; and the effect for the non-psychology studies significantly deviates from zero, t(2) = 5.09, p = .037, in the direction of the original result.
This criticism has some similarity to what some of our commenters said when speculating on what the results might be. Here was Morat20:
Ah, the problem of research in psychology. People screw with your tests by meta-gaming it, happily hiding the very information you’re after and thinking they’re being helpful.
This behavior, however, seems the opposite of what Strack is now suggesting. Strack seems to claim that psychology students, rather than trying to be helpful, are somehow inoculated against the effect. This seems…unlikely. (Note that hypothesis-aware dragonfrog in the comments reported results that the cartoons were funnier when smiling.)
Additionally, Strack implies that the psychology students have seen the study results, when the replicators made it clear that the students were tested before they covered the material in class.
Furthermore, even among the set of non-psychology students, the effect size of d = 0.16 is stunningly small. At best, it seems we are talking about an effect so weak that merely enrolling in a psychology course immunizes you to it, and even if you haven’t done that, the effect is still not close to significant.
The other defenses provided are, frankly, borderline cringe-worthy. One was that perhaps Far Side comics aren’t funny anymore. But the replication itself found that they fit nicely in the mid-range of the 0-9 scale as assessed by the pen-mouthing subjects. Some people didn’t get them, but that wouldn’t explain why their oral contortions would therefore be nullified on those cartoons they did understand.
A third critique:
the RRR labs deviated from the original study by directing a camera on the participants. Based on results from research on objective self-awareness, a camera induces a subjective self-focus which may interfere with internal experiences and suppress emotional responses.
It’s entirely possible that this is the case, but it feels an awful lot like the straw-grasping that always occurs after a failed replication.
The final critique:
Finally, there seems to exist a statistical anomaly. In a meta-analysis, when plotting the effect sizes on the x-axis against the sample sizes on the y-axes across studies, one should usually find no systematic correlation between these two parameters. As pointed out by Shanks and his colleagues (e.g., Shanks et al., 2015), a set of unbiased studies would produce a pyramid in the resulting funnel plot such that relatively high-powered studies show a narrower variance around the effect than relatively low-powered studies. In contrast, an asymmetry in this plot is seen to indicate a publication bias and/or p-hacking (Shanks et al., 2015).
He then presents a plot of the effect sizes against the study sizes and fits a line to it and tacks on:
Without insinuating the possibility of a reverse p-hacking, the current anomaly needs to be further explored.
So that we’re all on board the same bus, this is how academics say “she’s got blood coming out of her wherever.” He isn’t outright accusing them of shenanigans, but he totally is.
The problem is, however, that his plot doesn’t show what he desperately wants it to show. First, he calls these studies “high-powered” and “low-powered”. In fact, they are all about the same size. The smallest study had 87 subjects; the largest 163. I’d certainly rather have the bigger number, but we aren’t talking about a large difference when it comes to the power of the studies. This is the reason we don’t see the funnel he is looking for.
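To put rough numbers on “about the same size,” here is a back-of-the-envelope sketch comparing the smallest (87) and largest (163) studies. It rests on several assumptions of mine: each study splits its subjects into two equal groups, the standard error of d is approximated as 2/sqrt(N), power comes from a two-sided normal approximation at the 0.05 level, and the true effect is taken to be d = 0.16 (borrowed from the non-psychology figure quoted earlier, purely for illustration). The function names are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def approx_se_of_d(n_total):
    """Approximate standard error of Cohen's d.

    ASSUMPTION: two equal groups of n_total / 2 and a small effect, so
    SE(d) ~ sqrt(1/n1 + 1/n2) = 2 / sqrt(n_total).
    """
    return 2.0 / math.sqrt(n_total)


def approx_power(d, n_total, z_crit=1.959964):
    """Two-sided power at alpha = 0.05 under a normal approximation."""
    ncp = d / approx_se_of_d(n_total)  # expected z statistic
    return normal_cdf(ncp - z_crit) + normal_cdf(-ncp - z_crit)


if __name__ == "__main__":
    assumed_d = 0.16  # illustrative only, not an estimate of the true effect
    for n in (87, 163):  # smallest and largest study sizes from the post
        print(
            f"N = {n:3d}: SE(d) ~ {approx_se_of_d(n):.3f}, "
            f"power ~ {approx_power(assumed_d, n):.2f}"
        )
```

Under these assumptions the standard errors differ by well under a factor of two and both studies have low power for an effect that small, so there is not much vertical range for a funnel shape to appear in.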
All in all, this effect seems like it should be consigned to the dust bin. Readers of Thinking, Fast and Slow ought to take note.