Building Facades. Credit: Zacharie Gaudrillot-Roy

Why open data is critical during review: An example

While reviewing a paper a while ago, a single word in the methods section caught my eye: “randomized”. Based on that single word, I strongly suspected that the methods were not accurately reported and that the conclusions were entirely unsupported. But I also knew that I’d never be able to prove it without the data. Here, I’ll walk through what happened in this one case as an example of why empirical papers without open data (or an explicit reason why the data can’t be shared) hamper review and shouldn’t be trusted.

Starting the review

I got a review request for a paper that had been through a couple rounds of review already, but the previous reviewers were unavailable for the revision. I don’t mind these requests, as the previous rounds of review should have caught glaring problems or clarity issues. As I always do, I skimmed the abstract and gave the paper a quick glance to make sure I was qualified. Then I accepted.

While some people review a paper by reading it in order, and others start with the figures, I jump straight to the methods. The abstract and title made the paper’s goals clear, so I had a good gist of the core question. Since I tend to prioritize construct validity in my reviews, I wanted to know whether the experiment and analysis actually provide sufficient evidence to answer the question posed.

The methods

The experiment showed 4 different items, and the task was to select one based on the instruction’s criteria. Over 200 subjects were run. There’s no need to go into more specifics. It was a single-trial 4-alternative-forced-choice (4AFC) experiment that also had an attention check. The items were shown in a vertical column, and the paper noted that the order of the items was “randomized” and that an equal number of subjects were presented with each ordering. The goal was to figure out which item was more likely to be selected.

Did you spot what caught my attention?

It takes a bit of effort to balance out ordering issues. The subjects need to be evenly distributed among the different orderings, and because there was an attention check, any dropped subject needs to be replaced by someone who is shown the same ordering. That’s counterbalancing, not randomization: a genuinely randomized ordering would almost never produce exactly equal counts. If I had built that infrastructure and put in the work to run it for 200+ subjects, I’d be pretty reluctant to describe it as “randomized”.
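To give a concrete sense of what that infrastructure involves, here’s a minimal sketch in R. This is my own illustration with made-up numbers, not the authors’ code; the point is just that every subject slot has to be tied to a specific ordering, and replacements have to inherit that slot.

    # Minimal sketch of counterbalanced assignment (illustrative only).
    # With 4 items there are factorial(4) = 24 possible orderings.
    n_orderings <- factorial(4)      # 24
    subjects_per_ordering <- 10      # e.g., 240 subjects total

    # Pre-assign every subject slot to one ordering so the design stays balanced.
    slots <- data.frame(
      slot     = 1:(n_orderings * subjects_per_ordering),
      ordering = rep(1:n_orderings, each = subjects_per_ordering)
    )

    # If a subject fails the attention check, the replacement subject takes over
    # the same slot, and therefore sees the same ordering.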

Nevertheless, the methods insisted that an equal number of subjects were presented with each ordering. I was skeptical, but what choice did I have except to trust what the paper says?

Reviews are not the time for trust

If I don’t believe the paper’s claim, is my review just supposed to say “Nuh Uh!”?

I suppose I could go through the arduous process of explaining a way to verify the ordering in a revision. But if I don’t trust the paper’s writeup, why would I trust a writeup of an additional analysis? I’d also need to explain to the editor that if there is an ordering issue, it’s not just something that can be mentioned in a limitations section. An ordering effect can completely invalidate the conclusions.

Some might insist that what I should do here is spend the time to read and review all the other aspects of the paper and just bring up possible ordering effects in the review. But if the ordering issue is as serious as I suspected, any comments or recommendations about other details would be an ineffective use of time. It’d be like critiquing the door hinges of a collapsed building.

Don’t just request data. Make your review conditional on getting it.

A few years ago, I adopted a simple, albeit controversial policy: I require access to data, replication materials, and analysis materials (or an explanation why each can’t be shared) before writing the review. And I signed a pledge to that effect, called the Peer Reviewers’ Openness Initiative (PRO Initiative).

I, as a reviewer, and any future reader should be able to verify details like the stimulus ordering ourselves. Unfortunately, the empirical data was not shared, so I emailed the editor to request it.

Even if I’m not suspicious of a particular issue, I generally run multiple short checks on the data as part of my review. I wrote about this process previously on this blog in The 10-minute data check. When I request materials, I never specify why I want them or what I’ll check, because it’s not up to the authors to select what I can and can’t review. Moreover, although I don’t have time to check everything, future readers should be able to check anything. Every facet of scientific evidence must be scrutinizable.
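For a flavor of what those short checks look like, here’s the kind of thing I mean, sketched in R. The file name and column names are hypothetical placeholders, since every dataset is organized differently:

    # Quick sanity checks on a shared dataset (hypothetical file/column names).
    dat <- read.csv("experiment_data.csv")

    nrow(dat)                          # does the N match the paper?
    sum(duplicated(dat$subject_id))    # any duplicate subjects?
    colSums(is.na(dat))                # unexplained missing values?
    table(dat$condition)               # are conditions distributed as described?

Each line takes seconds to run, and together they catch a surprising share of reporting problems.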

Back and forth with the editor

The editor was reluctant at first, and I expected what typically happens when I review for computer science venues: that I’d be kicked off the review. However, this time, over the course of half a dozen emails, I convinced the editor by pointing out that the journal’s publisher allows an editor to require authors to provide the data during review. They said they’d talk to the editor in chief and then possibly send the request to the authors.

Then I heard nothing for a month. When the review was due, I started getting weekly automatic reminders. But as I hadn’t received the data yet, I waited for the editor.

The editor eventually emailed me to ask when I could submit my review. I reminded them about our previous conversation about the data. They graciously thanked me for the reminder, but added that since they hadn’t heard back from the authors, we should proceed with the review anyway because (paraphrased) it wouldn’t be fair to require the authors of only this paper to provide data.

I suggested sending a reminder to the authors and setting a deadline of two weeks before making a decision. The editor agreed.

Back and forth with the author

Several weeks after first requesting the data, the editor sent a couple links from the authors. Now, I can finally start the review, right? Right?!?!

No.

One of the links didn’t work, and the other had materials unrelated to replication. So I emailed the editor, who confirmed it didn’t work. They then emailed the authors again.

Later, we got a new link with the data! But a third of the data was missing with no explanation. So I emailed the editor again, and they once again emailed the authors.

Then, the next week, the editor finally forwarded the complete dataset from the authors. It was undocumented and not posted on a persistent repository, but it was good enough.

This whole back-and-forth is the very common and very unnecessary time-vampire that comes with demanding research transparency. Editors waste time instead of just sending the message to the authors, and authors with PhDs in computer science suddenly don’t know how to send a file. Ask anyone involved in replication or reanalysis, and they’re all exhausted by the near universality of people evading this basic responsibility.

The review finally starts

After 2 months, I could finally look at the data, and I had to get back up to speed because I’d completely forgotten the details after all that time. I wanted to check the distribution of orderings!

It only took me a few minutes to realize something was very wrong. Contrary to what the paper stated, the number of subjects who saw each ordering was not equal. Not even close.
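The check itself is trivial. In R, it amounts to something like the following (again with hypothetical file and column names):

    # Count how many subjects saw each ordering (hypothetical names).
    dat <- read.csv("experiment_data.csv")
    ordering_counts <- table(dat$ordering)

    ordering_counts          # should be equal if the paper's claim holds
    range(ordering_counts)   # a wide range means the design was not balanced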

But how would this imbalance impact the results? While it’d be nice to model how the different positions within each ordering impacted which item was selected, I started with a simpler question: what would the results look like if people always selected the first position? I simply aggregated the data by how frequently each item appeared in the first position, made a default ggplot bar graph, and compared it to the results figure in the paper. The two were stunningly similar!
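In R, that baseline looks roughly like this. The column names are hypothetical, and in practice you may first need to derive which item was shown in the first position from the stored ordering:

    # "Everyone picks the first position" baseline (hypothetical column names).
    library(dplyr)
    library(ggplot2)

    dat <- read.csv("experiment_data.csv")

    # How often did each item appear in the first (top) position?
    first_pos_counts <- dat %>%
      count(first_item)

    # Default bar graph to eyeball against the paper's results figure.
    ggplot(first_pos_counts, aes(x = first_item, y = n)) +
      geom_col()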

The results could be explained by an imbalance in the ordering despite the paper implying that no such imbalance existed. The paper’s statement wasn’t true, and the results were entirely invalid.

Writing it up

At this point, I double-checked my analysis and the paper’s methods, analyses, and results. Then I wrote up the review.

The other reviewers were very positive and raised no major issues. However, the editor agreed with my assessment and rejected the submission.

What can be learned from this example?

How many published research papers out there have mistaken conclusions due to ordering effects? Or statistical errors? Or bugs in the experiment code? Or incorrectly described behavioral instructions?

If the goal of peer review is really about scrutinizing the submission, then it should not be up to the authors to choose which aspects of the research are available for reviewers or readers to scrutinize. We should not have to trust that everything written in a paper is true, and we should be able to verify any reasonable aspect ourselves. Like a building on a movie set, it may look great from the angle they’ve chosen to show you, but if anyone is going to buy it, they should be able to see whether it’s just a facade.

Another issue is the ridiculous time and effort spent getting the replication and reproducibility materials. Let’s look at the breakdown in this case:

  • Considering the review request: 5 minutes
  • Skimming the methods and requesting the data: 10 minutes
  • Emailing back and forth with the editor: multiple hours spread over 2 months
  • Analyzing the data and writing the review: ~1 hour

Editors and authors must provide replication and reproducibility materials to reviewers from the start. Don’t make them beg. Make it policy not to send papers out for review if the authors don’t provide the information necessary to thoroughly evaluate them.

Any journal and any researcher that wants to be considered credible should be expected to share the following upon submission:

  1. All materials needed to empirically replicate the experiment on a public persistent repository (or an explanation for why each component can’t be shared).
  2. All raw data, documentation, and materials needed to computationally reproduce the results (or an explanation for why each component can’t be shared).

The open science movement has been going on long enough. At this point, venues that have not adopted a mandatory transparency policy (with specific exceptions for privacy issues) are behind the curve. An equivalent to the COS TOP Level 2 policy template should be universal.
