
Confusion about open science repositories

I recently gave a talk on Open Practices in Visualization Research at the workshop on Methodological Approaches for Visualization (BELIV). Unfortunately, with only a 10-minute talk, I had to leave out many important details, which has resulted in some confusion.

A few people have brought up concerns that repositories for open data and materials do not have long-term viability. “What happens if the site shuts down in 5 years?” As an alternative, people have proposed storing data and materials in a paywalled IEEE repository. While it’s good to hear that open access is being discussed, the discussion can only be fruitful if it is well informed. So I’ll highlight some critical information about the Open Science Framework (OSF).

1. 50-year preservation fund

The Center for Open Science (COS) has a fund devoted specifically to preserving and maintaining the repository in case the organization ever shuts down. This fund would make a read-only form of the repository accessible for 50+ years. Here is a quote from the sustainability supplement in the COS’s strategic plan (page 24):

In the event of COS’s closing, the preservation fund guarantees long-term hosting and preservation of all the data and content stored on the OSF (50+ years based on present costs and use)

2. An open license with no paywall

Authors who post content to OSF can choose from a variety of open licenses. Any future work that builds upon the content, incorporates it into a meta-analysis, or scrutinizes it can freely access and link to the material. Openness facilitates research without needing to rely on an expensive subscription to the publisher. Furthermore, an open license means that future work will not require the original author to give permission or even reply to emails.

On the other hand, some people want the content to be stored in IEEE’s digital library. That is exactly the opposite of open science. It would be behind a paywall (that’s not open). Also, IEEE would own the copyright of the data and material. Either IEEE or an obnoxious original author in fear of scrutiny could use licensing grounds to obstruct any attempt to publish work that reuses the content (that’s not science).

3. No risk of lock-in

The openness of OSF allows people to copy their content elsewhere in the future. So there is little risk of being “stuck” with OSF if you don’t like it. If someone creates a better site, they could even mirror OSF’s content, so future open science systems could start with all of the information already on OSF.

4. Updates and edits to content

Like in version control, most open science repositories allow for updating content such that previous versions are always accessible. That approach allows for further updates such as added documentation or fixing typos without erasing the peer-reviewed version. In contrast, making a change to the IEEE digital library is a nightmare.

5. Templates for policies and submission forms

There have been some attempts by individuals and organizations such as ACM to “reinvent the wheel” by creating their own policies for open practice requirements and badges. These attempts often fail to consider flexibility and transparency in reporting.

Alternatively, the Transparency and Openness Promotion (TOP) guidelines have pre-written templates for modular policies with various levels of strictness (from simply reporting whether it is available to mandatory submission) and for various artifacts (materials, collected data, analysis code, etc.). A table (artifact × strictness) summarizing the different policies is available on the last page here.

  1. The full set of modular open policy templates with example implementations by various journals is available here.
  2. An author disclosure form for making submissions that request one of the open science badges is available here.

 

One final note: I’m not especially attached to OSF. There are alternatives such as Zenodo and figshare, but OSF has the most full-featured set of services and the most well-thought-out policies.

Mysterious Origins of Hypotheses in Visualization and CHI

For years, I’ve noticed a strange practice in Visualization and CHI. When describing a study, many papers list a series of predictions and number them as H1, H2, H3… For example:

  • H1: Red graphs are better than blue graphs
  • H2: Participants will read vertical bar graphs more quickly than horizontal bar graphs

I have never seen this practice in any other field, and I was curious as to the origin.

Half Hypotheses

Although these statements are referred to as ‘hypotheses’, they’re not… at least, not completely. They are predictions. The distinction is subtle but important. Here’s the scientific definition of hypothesis according to The National Academy of Sciences:

A tentative explanation for an observation, phenomenon, or scientific problem that can be tested by further investigation…

The key word here is explanation. A hypothesis is not simply a guess about the result of an experiment. It is a proposed explanation that can predict the outcome of an experiment. A hypothesis has two components: (1) an explanation and (2) a prediction. A prediction simply isn’t useful on its own. If I flip a coin and correctly guess “heads”, it doesn’t tell me anything other than that I made a lucky guess. A hypothesis would be: the coin is unevenly weighted, so it is far more likely to land heads-up. It has an explanation (uneven weighting) that allows for a prediction (frequently landing heads-up).
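To put rough numbers on the coin example, here is a minimal Python sketch. The specific figures (18 heads in 20 flips, a 90% chance of heads for the weighted coin) and the function name are illustrative assumptions, not data from any study.

```python
from math import comb


def prob_at_least(heads: int, flips: int, p_heads: float) -> float:
    """Probability of seeing at least `heads` heads in `flips` independent flips."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )


# A single correct guess on a fair coin: a lucky guess happens half the time,
# so it says nothing about why the coin landed that way.
single_guess = prob_at_least(1, 1, 0.5)

# Hypothetical outcome: 18 heads in 20 flips.
# "Fair coin" explanation versus an assumed "weighted coin" explanation (p = 0.9).
fair = prob_at_least(18, 20, 0.5)
weighted = prob_at_least(18, 20, 0.9)

print(f"single lucky guess: {single_guess:.2f}")
print(f"18+ heads of 20, fair coin: {fair:.6f}")
print(f"18+ heads of 20, weighted coin: {weighted:.3f}")
```

Under these assumed numbers, a fair coin would produce 18 or more heads only about 0.02% of the time, while the weighted coin would do so roughly two times in three. The explanation is what makes the prediction informative: each explanation implies a different outcome, so the outcome can count for or against it.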

The Origin of H1, H2, H3…

Besides the unusual use of the term “hypothesis”, where does the numbering style come from? It appears in many IEEE InfoVis and ACM CHI papers going back to at least 1996 (maybe earlier?). However, I’ve never seen it in psychology or social science journals. The best candidate I can think of for the origin of this numbering is a misunderstanding of null hypothesis testing, which can be best explained with an example. Here is a null hypothesis with two alternative hypotheses:

  • H0: Objects do not affect each other’s motion (null hypothesis)
  • H1: Objects attract each other, so a ball should fall towards the Earth
  • H2: Objects repel each other, so a ball should fly away from the Earth

Notice that the hypotheses are mutually exclusive, meaning only one can be true. In contrast, Vis/CHI-style hypotheses are each independent, so any, all, or none of them can be true. I’m not sure how one came to be transformed into the other, but it’s my best guess for the origin.

Unclear

On top of my concerns about diction and utility, referring to statements by number hurts clarity. Repeatedly scrolling back and forth trying to remember “which one was H3 again?” makes reading frustrating and unnecessarily effortful. It’s bad practice to label variables in code as var1 and var2. Why should it be better to refer to written concepts numerically? Let’s put an end to these numbered half-hypotheses in Vis and CHI.

Do you agree with this perspective and proposed origin? Can you find an example of this H numbering from before 1996? Or in another field?