It should be possible for someone to load and analyze your data without ever speaking to you and without flying into a frustrated rage.
Open
It needs to actually be shared publicly. None of this “available upon request” nonsense. There’s no excuse for hiding the data that supports an article’s claims. If the data is not shared, you’re inviting people to assume that you fabricated the results.
What about privacy? Collecting sensitive data doesn’t in any way diminish the likelihood of a calculation error or the incentive to falsify results. For identifiable or sensitive data, use a protected-access repository.
Where to post it?
Persistent – It needs to be on a repository that has a long-term plan for storage. That means OSF, Zenodo, or one of the services on re3data.
Immutable – GitHub is not immutable and therefore not reliable: you can rewrite history and change dates in a GitHub repository. GitHub is sketchy.
Not your crappy website – Your dinky little Wix page is not a persistent repository. It doesn’t matter if it’s hosted by the university.
Big data – If you have more than a GB of data, you can use a service like Dryad. I suggest you also post all of the smaller files and a sample subset of the data to a service like OSF for easy access.
What to include?
1. The raw data – All of it. Not aggregated. Not a subset. All of it. Yes, I actually mean all of it. Why don’t you understand this? All. Of. The. Data. “Oh, but certainly I shouldn’t include…” What part of “all of it” is not getting through?
2. A data dictionary – Let’s be very clear about this: your variable names suck. Oh, you think your column names are clear? No, they’re not. A data dictionary solves this problem with a simple text file that has one line for each field or column (example below):
- The field or column name (e.g., “color” or “size”)
- The range or possible values (e.g., “red”, “blue”, “green” or “1, 2, 3, … 10”)
- A description of what that variable is. One sentence is plenty.
A text file is fine. It doesn’t need to be complicated.
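For instance, a data dictionary covering those two example fields could be as small as this (the descriptions are invented for illustration):

```
color: one of "red", "blue", "green". The hue of the stimulus patch.
size: integer from 1 to 10. The on-screen size of the stimulus.
```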
3. (recommended) A simpler version – While aggregated data or a subset is not a replacement for raw data, it can be a very helpful addition when the data is over 100 MB or in a complex format. The code that was used to make the simpler version must be included.
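A minimal sketch of such a script in Python, assuming hypothetical file and column names; ship it alongside both versions of the data so the relationship between the two files is unambiguous:

```python
# make_summary.py -- the script that produced the simpler version.
# A sketch: "raw_data.csv" and its columns are hypothetical.
import pandas as pd

raw = pd.read_csv("raw_data.csv")

# One row per subject: trial count and mean reaction time.
summary = (
    raw.groupby("subject_id")
       .agg(n_trials=("trial_index", "count"),
            mean_rt=("reaction_time", "mean"))
       .reset_index()
)
summary.to_csv("summary_by_subject.csv", index=False)
```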
What file format?
Standard – Stick to CSV or TSV if possible. As a general rule, you should be able to open it in Excel. On the data collection side, consider storing the data in a standard format from the very beginning.
Other standards – Excel, JSON, and XML are all acceptable too. But try to have at least some sample code that can load them, especially for nested JSON or XML.
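For nested JSON, even a few lines of loading code help. A sketch in Python, assuming a hypothetical experiment.json holding a list of subjects that each contain a list of trials:

```python
# load_trials.py -- sample loader to ship alongside a nested JSON file.
# A sketch: "experiment.json" and its structure are hypothetical.
import json

import pandas as pd

with open("experiment.json") as f:
    data = json.load(f)

# Flatten to one row per trial, copying subject_id onto each trial row.
trials = pd.json_normalize(
    data["subjects"],       # a list of subject records...
    record_path="trials",   # ...each holding a list of trial records
    meta=["subject_id"],
)
print(trials.head())
```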
Special equipment or field-specific standards – If you’re using neuroimaging, eye tracking, or some other type of data collection that’s specific to one field, make sure to use a file format that others can load without needing to rely on any code from you. There should be libraries already available for that data format.
Proprietary – Avoid custom formats. But sometimes, circumstances require it. In that case, you should have (1) clear documentation about the format and (2) clearly documented code to load the data.
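A documented loader might look like this sketch; the binary layout and file name here are entirely hypothetical:

```python
# read_acme.py -- documented loader for a (hypothetical) proprietary format.
# Assumed layout: a 4-byte little-endian unsigned int giving the sample
# count, followed by that many little-endian 64-bit floats.
import struct

def read_acme(path):
    with open(path, "rb") as f:
        (n_samples,) = struct.unpack("<I", f.read(4))
        samples = struct.unpack(f"<{n_samples}d", f.read(8 * n_samples))
    return list(samples)

print(read_acme("session01.acme")[:5])
```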
Don’t nest data formats – Seriously, stop sticking JSON or XML inside of a CSV. Stop it!
Something really stupid that apparently needs to be said – Don’t save a proprietary format as a file with a .CSV extension. If it’s a CSV, it should open in Excel.
The data format
Don’t abbrev. – Unless the abbreviation is very common in your field, write out the full_column_name instead of fucln. Modern IDEs have autocomplete, so you’re not saving anyone any effort by chopping off half of the letters and making everything unreadable.
Keep it tidy – Put each trial or observation in its own row. Don’t worry that it’ll result in a lot of repeated data. Example columns for a behavioral experiment: subject_id, subject_age, trial_index, stimulus_color, stimulus_size, response, reaction_time.
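A few rows in that layout might look like this (the IDs and values are invented for illustration):

```
subject_id,subject_age,trial_index,stimulus_color,stimulus_size,response,reaction_time
s01,24,1,red,3,red,0.412
s01,24,2,blue,5,green,0.587
s02,31,1,green,2,green,0.365
```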
Text, not numbers – If you have a factor variable like “color” with a limited set of values (e.g., “red”, “green”, “blue”), write the value itself, not a number that needs a lookup table to decode. A numerical representation doesn’t save much space, and you’re just asking for a coding error.
Response, not accuracy – In behavioral experiments, record the actual response instead of just recording whether or not it was accurate. Otherwise, it’s not raw data, and you’re asking for a coding error.
Every stimulus parameter – It should be possible to exactly recreate the stimuli from the data and code. If it adds 100 extra columns, so what? Storage and bandwidth are cheap. There could be some extreme cases like random dot motion with thousands of parameters per trial; in those cases, store a predictable random number seed for each trial so the stimulus can be regenerated exactly, as in the sketch below.
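A sketch of the seed approach in Python (the column name, dot count, and units are hypothetical):

```python
# Recreate a random-dot stimulus from a per-trial seed instead of storing
# thousands of dot positions.
import numpy as np

def dot_positions(dot_seed, n_dots=200):
    """Regenerate the exact dot field shown on a trial from its stored seed."""
    rng = np.random.default_rng(dot_seed)
    return rng.uniform(-1.0, 1.0, size=(n_dots, 2))  # x, y in normalized units

# The data file then needs only one extra column, e.g. dot_seed = 184623.
positions = dot_positions(184623)
```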
When I was mostly finished with this, I remembered I wrote a very similar post 4 years ago. I still see these same mistakes routinely.
Thanks. Great advice. Data are plural.
If we were speaking Latin, you’d be right. If you referred to a “datum point” instead of a “data point”, you’d be consistent. Here is a quote from the Guardian style guide:
There’s a (longer and more detailed) version of much of this advice here: https://osf.io/nz5ws/, from a paper last year.
As part of the paper, there’s also a resource wiki attempting to explain how to incorporate the advice into a workflow (e.g., using the codebook package in R): https://osf.io/ht2e5/