Photo by Florian Pérennès

Minimum Expectations for Open Data in Research

Open data allows people to independently check a paper’s analysis or perform an altogether new analysis. It also lets future work run meta-analyses and ask questions that the original paper did not ask. Therefore, it’s important to make experiment data public, complete, and accessible so that it is useful to others.

But many missteps can happen that reduce the value of open data. These tips should help ensure that your data is indeed open, useful, and accessible.

Provide at least the minimum information

Experiment data should include an entry for each trial, usually as a row in the table. It shouldn’t be aggregated by condition or by subject. Here is a minimum set of columns that would be appropriate for most experiments:

  • Subject ID – Remember to keep it anonymous. No names, mTurk IDs, or IP addresses.
  • Trial number – Even though trials are usually recorded in order, it’s best to make this explicit.
  • One column for each independent variable – It should be possible to reconstruct the trial.
  • A column for each raw measurement – This means raw responses (e.g. whether the subject pressed left or right), not their accuracy or some other processed information.

Other suggested columns that may be useful depending on the specific experiment:

  • Subject information like gender, education, visual deficiencies, etc. – This is most important for between-subject designs. Remember to be careful about anonymity.
  • Environment or equipment information like monitor resolution, browser, operating system, etc.
  • Date and time – When was the experiment run? When did each trial start?
  • Processed or aggregated information from other columns – It’s often useful to include processed information in the data. Just make sure that it augments rather than replaces raw information.
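
As a rough illustration of the minimum columns, here is a small Python sketch that swaps identifying IDs for anonymous subject IDs and writes one row per trial. The worker IDs, column names, and values are all made up for the example; your experiment will have its own variables.

```python
import csv
import io

# Hypothetical raw records (invented for illustration). "worker" stands in
# for an identifying ID, such as an mTurk ID, that must not be published.
raw_trials = [
    {"worker": "A1XYZ", "trial_number": 1, "stimulus_side": "left",
     "contrast": 0.2, "response_key": "left", "rt_ms": 512},
    {"worker": "A1XYZ", "trial_number": 2, "stimulus_side": "right",
     "contrast": 0.8, "response_key": "left", "rt_ms": 430},
    {"worker": "B2QRS", "trial_number": 1, "stimulus_side": "right",
     "contrast": 0.2, "response_key": "right", "rt_ms": 601},
]


def anonymize(trials, id_field="worker"):
    """Replace the identifying field with sequential anonymous subject IDs."""
    mapping = {}
    rows = []
    for trial in trials:
        # The first time an ID is seen, it gets the next label (S001, S002, ...).
        subject_id = mapping.setdefault(trial[id_field], f"S{len(mapping) + 1:03d}")
        rest = {k: v for k, v in trial.items() if k != id_field}
        rows.append({"subject_id": subject_id, **rest})
    return rows


def to_csv(rows):
    """Serialize one row per trial, with a header, as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = anonymize(raw_trials)
csv_text = to_csv(rows)
print(csv_text)
```

Note that the response column holds the key that was pressed, not whether it was correct. Accuracy can always be added as an extra column later.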

Use a reliable repository

Your website, your institution’s website, and any for-profit company (like GitHub) are not reliable. URLs change; FTP mistakes happen; companies shut down. Here is PLOS ONE’s list of reliable repositories.

Keep it tidy

If your data format is complicated, provide a copy of the data that is formatted in a way that’s easy to process and analyze. The simplified format should still include all of the data, and you should also provide the raw original data for transparency. In other words, please simplify your arrays nested inside of JSON objects nested inside of CSV cells.
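
To make that concrete, here is a minimal sketch of flattening JSON stored inside a table cell into one row per trial. The `session` column, its JSON layout, and the output column names are assumptions invented for this example, not a format any particular tool produces.

```python
import json

# Hypothetical messy export: one row per subject, with every trial packed
# into a JSON string inside a single cell.
messy_rows = [
    {"subject_id": "S001",
     "session": '{"trials": [{"n": 1, "key": "left"}, {"n": 2, "key": "right"}]}'},
    {"subject_id": "S002",
     "session": '{"trials": [{"n": 1, "key": "right"}]}'},
]


def tidy(rows):
    """Expand the nested trials so that each output row is exactly one trial."""
    out = []
    for row in rows:
        session = json.loads(row["session"])
        for trial in session["trials"]:
            out.append({
                "subject_id": row["subject_id"],
                "trial_number": trial["n"],
                "response_key": trial["key"],
            })
    return out


tidy_rows = tidy(messy_rows)
for r in tidy_rows:
    print(r)
```

The tidy copy carries the same information as the nested one. It goes alongside the original file, not in place of it.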

Use an accessible format

Use free and open formats. Stick to CSV when possible. JSON is ok if necessary. If a project really needs some other format, make sure there are clear instructions for reading it. No Microsoft Excel, unless you plan on buying a copy of the software for everyone.

Common mistakes

  1. Aggregating the data – Some people post a single data point per subject or per condition. Many assumptions are made when aggregating, so it’s critical to provide raw unbiased data without locking people into a particular approach for aggregation.
  2. Skipping the response variable – While it’s useful to know whether a response is correct, recording the actual response is more important in case there are concerns about how “correctness” was calculated.
  3. Skipping the data dictionary – I know you think your column names make perfect sense. Well, they don’t. Make a text file with a very brief description of every column in your data.
  4. Not putting the data URL in the paper – How is anyone supposed to know how to get the data unless you put it in the paper? Don’t make anyone email you! I recommend putting it in the abstract.
  5. Failing to check text entries for identifying information – You never know what information people will type into a textbox. One strategy is to drop that column from the open data and make it available on request.
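
One cheap guard against skipping the data dictionary is to check that every column has a description before you upload. The sketch below assumes the dictionary is a simple mapping from column name to description; the column names and descriptions are hypothetical.

```python
def missing_descriptions(columns, dictionary):
    """Return the columns that have no (or only a blank) description."""
    return [c for c in columns if not dictionary.get(c, "").strip()]


# Hypothetical columns and data dictionary, for illustration only.
columns = ["subject_id", "trial_number", "contrast", "response_key", "rt_ms"]
data_dictionary = {
    "subject_id": "Anonymous subject identifier",
    "trial_number": "Order of the trial within the session, starting at 1",
    "contrast": "Stimulus contrast, from 0 to 1",
    "response_key": "",  # forgotten description
}

undocumented = missing_descriptions(columns, data_dictionary)
print(undocumented)
```

Here the check flags `response_key` (blank description) and `rt_ms` (missing entirely).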

Your results are not too big

Open data repositories can handle your data. People manage to share huge results, from astrophysical data to fMRI volumes that vary over time for dozens of subjects. A CSV or JSON file that’s under 5GB would easily fit on an open science repository like OSF or figshare. Larger datasets can be broken up into multiple files or hosted on repositories like Data Dryad.
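
If you do need to break a large table into multiple files, a sketch like the following splits the rows into parts and repeats the header in each part so that every file can be read on its own. The function name and the rows-per-file limit are arbitrary choices for illustration; a real script would write each part to disk and would likely split by file size.

```python
import csv
import io


def split_rows(header, rows, max_rows):
    """Yield CSV text for consecutive chunks of rows, each with its own header."""
    for start in range(0, len(rows), max_rows):
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows[start:start + max_rows])
        yield buf.getvalue()


# Tiny made-up table: three trials split into parts of at most two rows.
header = ["subject_id", "trial_number"]
rows = [["S001", 1], ["S001", 2], ["S002", 1]]
parts = list(split_rows(header, rows, max_rows=2))
for i, part in enumerate(parts, start=1):
    print(f"part {i}:\n{part}")
```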

Most experiments fit on a couple of floppy disks and could be downloaded over a dial-up modem, so it was silly to not have open data in 1998, let alone 2018. We have a multitude of free, fast, reliable research repositories, so there’s no excuse anymore.


EDIT: Clarified that a cleaned up version of the data should be provided in addition to (not instead of) the raw original data.
