Guide to user performance evaluation at InfoVis 2013

When reading a paper (vis or otherwise), I tend to read the title and abstract and then jump straight to the methods and results. Besides the claim of utility for a technique or application, I want to understand how the paper supports its claim of improving users’ understanding of the data. So I put together this guide to the papers that ran experiments comparatively measuring user performance.

1. Common Angle Plots as Perception-True Visualizations of Categorical Associations – Heike Hofmann, Marie Vendettuoli – PDF
Tuesday 12:10 pm

2. What Makes a Visualization Memorable? – Michelle A. Borkin, Azalea A. Vo, Zoya Bylinskii, Phillip Isola, Shashank Sunkavalli, Aude Oliva, Hanspeter Pfister – PDF
Tuesday 2:00 pm

3. Perception of Average Value in Multiclass Scatterplots – Michael Gleicher, Michael Correll, Christine Nothelfer, Steven Franconeri – PDF
Tuesday 2:20 pm

4. Interactive Visualizations on Large and Small Displays: The Interrelation of Display Size, Information Space, and Scale – Mikkel R. Jakobsen, Kasper Hornbaek – PDF
Tuesday 3:00 pm

5. A Deeper Understanding of Sequence in Narrative Visualization – Jessica Hullman, Steven Drucker, Nathalie Henry Riche, Bongshin Lee, Danyel Fisher, Eytan Adar – PDF
Wednesday 8:30 am

6. Visualizing Request-Flow Comparison to Aid Performance Diagnosis in Distributed Systems – Raja R. Sambasivan, Ilari Shafer, Michelle L. Mazurek, Gregory R. Ganger – PDF
Wednesday 10:50 am

7. Evaluation of Filesystem Provenance Visualization Tools – Michelle A. Borkin, Chelsea S. Yeh, Madelaine Boyd, Peter Macko, Krzysztof Z. Gajos, Margo Seltzer, Hanspeter Pfister – PDF
Wednesday 11:10 am

8. DiffAni: Visualizing Dynamic Graphs with a Hybrid of Difference Maps and Animation – Sébastien Rufiange, Michael J. McGuffin – PDF
Thursday 2:00 pm

9. Edge Compression Techniques for Visualization of Dense Directed Graphs – Tim Dwyer, Nathalie Henry Riche, Kim Marriott, Christopher Mears – PDF
Thursday 3:20 pm

Less than a quarter

Only 9 out of 38 InfoVis papers (24%) this year comparatively measured user performance. While that number has improved and doesn’t need to be 100%, less than a quarter just seems low.

Possible reasons why more papers don’t evaluate user performance

Limited understanding of experiment design and statistical analysis. How many people doing vis research are familiar with different experiment designs like method of adjustment or forced-choice? How many have run a t-test or a regression?
Evaluation takes time. A paper that doesn’t evaluate user performance can easily scoop a similar paper with a thorough evaluation.
Evaluation takes space. Can a novel technique and an evaluation be effectively presented within 10 pages? Making better use of supplemental material may solve this problem.
Risk of a null result. It’s hard – if possible at all – to truly “fail” in a technique or application submission. But experiments may reveal no statistically significant benefit.
The belief that the benefit of a vis is obvious. We generally have poor awareness of our own attentional limitations, so it’s not always clear what about a visualization doesn’t work. Beyond our poor ability to assess ourselves, it’s also important to know for which tasks a novel visualization is better than traditional methods (e.g. Excel and SQL queries) and for which the traditional methods are better.
A poisoned well. If a technique or application has already been published without evaluation, reviewers would scoff at an evaluation that merely confirms what was already assumed. So an evaluation of past work would only be publishable if it contradicts the unevaluated assumptions. It’s risky to put the time into a study if positive results may not be publishable.

I’m curious to hear other people’s thoughts on the issue. Why don’t more papers have user performance evaluations? Should they?

P.S. Check out this paper looking at evaluation in SciVis.

10 thoughts on “Guide to user performance evaluation at InfoVis 2013”

Danyel Fisher October 15, 2013 at 6:13 am

I’d argue that “Evaluation Takes Time” is not a “fear of being scooped” question, it’s a financial question. Any sort of serious evaluation costs several weeks of work–at reasonable tech payscales (not grad students), that’s something like $10,000. Is it worth the marginal value?

I’d love to hear your thoughts on an evaluation of (say) Google N-Gram. (I’m in the keynote, so it comes to mind.) What should I compare N-Gram to? A really big SQL table of (word, value, date)? An Excel table of (date, value) that’s pre-processed for any given word? Do I create a new UI for N-Gram that allows me to type in a word and get a long table of numbers to scan over?
1. Steve Haroz Post author October 15, 2013 at 6:44 am
  
  Danyel,
  
  Evaluation definitely has a monetary cost. It’s probably not worth the marginal value for an internally created and used vis. But publishing a paper promoting the utility of an approach to a broad audience is different. Doesn’t the audience deserve some proof showing that it helps some task? Otherwise, a paper promotes the use of a technique or the applicability of a technique to a type of problem, but it’s not clear whether others would benefit from it.
  
  N-Gram is a line chart. The novelty is in the data collected rather than the vis technique. But the vis technique is heavily researched and studied (Cleveland and McGill, 45-degree banking, etc.). I doubt N-Gram would pass the bar for novelty by reviewers.
pera October 22, 2013 at 6:10 pm

Hi, thanks for the post. You can find the “Common Angle Plots as Perception-True Visualizations of Categorical Associations” paper (pdf and sources) here:
https://github.com/mariev/common_angles-paper/tree/master/revision
Steve Haroz Post author October 22, 2013 at 7:22 pm

Thanks pera. The link has been added.
Petra Isenberg October 29, 2013 at 7:07 am

There are many more ways to evaluate a visualization (technique/tool/system) than user performance, and doing a user performance evaluation sometimes just does not make sense for the type of contribution a paper makes. Just as an example: we presented hybrid-image visualization this year as a technique that allows two visualizations to be blended for distance-dependent viewing. We did not do a user performance evaluation, but this does not mean that we did not evaluate our technique. What we did was employ a qualitative image inspection technique (QRI) [1] and discuss the perception theory that backs up that our technique actually works. For the type of contribution that our paper made, this was the right match of evaluation – and no reviewer asked for a user performance study. A user performance evaluation would have to ask very different types of questions that no longer match the focus of the paper (which was the presentation of the technique). Interesting things to study in the future are the effects on cognition of using hybrid-image visualization in collaboration, or a specific in-situ eval of hybrid-image vis in a work context (but even then I’d probably opt for a qualitative study and not user performance).

[1] Paper summarizing different evaluation types in visualization that includes QRI: http://hal.inria.fr/hal-00846775/PDF/Isenberg_2013_SRP.pdf
1. Steve Haroz Post author October 29, 2013 at 4:52 pm
  
  Hi Petra,
  
  Thanks for the comment.
  
  “There are many more ways to evaluate a visualization (technique/tool/system) than user performance”
  
  I agree! Like I said, I don’t think UP evaluation has to be 100%. And one example of when it may not be necessary is when the premise of the visualization is based on an already evaluated perceptual theory (though we have to watch for overextending theories). The hybrid-image paper is a great example of a technique that explains its roots in perceptual theory. While – as the paper states – there’s more work to be done in terms of designer guidelines, the premise is built on the experiments run by Oliva & Schyns (1994-2000), Navon (1977), and Campbell & Robson (1968). We know why it should work. Many proposed techniques lack that foundation entirely.
  
  As for the question of qualitative inspection, I’m skeptical of this approach for a few reasons:
  1) We are often blind to our own attentional limitations (Change blindness blindness).
  2) We don’t always choose the tool that most optimizes our performance (Frederick Taylor’s “optimal shoveling” study).
  3) We can perceive improvements in visual clarity even though the actual display is blurred (Motion sharpening).
  4) Performance takes a huge hit when users are exploring rather than looking for something they know is there (oddball vs. targeted search). The reader or presentation/demo viewer in these QRIs is rarely naïve and is frequently primed. “Figure X clearly shows that….” “Oh yeah, I DO see that.” They know what they’re looking for, and that’s what they find.
  
  Now, many QRIs that seek to answer the question of whether something is visible at all likely don’t suffer from the above problems. But there are so many ways that a QRI can be misleading. Should readers always have to dig through literature to determine whether a paper’s presented results are misleading?
  
  My overall concern about QRIs is: I don’t trust the brain to self-assess.
Alexander Lex October 29, 2013 at 10:01 am

I do agree with Petra. I think performance evaluation is really only feasible and sensible for a small subset of techniques published: it needs to be a simple technique with a clear alternative. How would you evaluate the performance of a complex tool using a combination of novel and established visualization techniques? There is no way to isolate the various aspects that can influence performance. What if there is no adequate visualization technique to compare to? In the case of our LineUp paper [1], for example, we ran a study to get some qualitative feedback from users. Should we have run a comparative study to Excel, for example? I think not. Should we have created an alternative visualization technique that we think inferior, just for the purpose of evaluating the superiority of another approach? Which alternative should we have implemented from the whole design space? And LineUp is actually a pretty technical paper, where a performance evaluation would be much more feasible (but not necessarily meaningful) compared to a complex system as used in a design study paper.

In contrast, in our Context-Preserving Visual Links paper [2], we ran a performance study that evaluates the effectiveness of highlighting with color vs highlighting by connecting elements with edges (among others). This is a clearly controllable and atomic aspect of the visualization and thus it makes a lot of sense to compare user performance, quite similar to your 2012 InfoVis paper.

I argue that for most of the papers you list above, the evaluation was the most important or a very important part. We learn that radial layouts are better for certain tasks than other graph layouts, for example. These kinds of papers are very valuable, but they are not the only ones that are valuable.

I do think that none of the reasons you mention should influence the decision whether to do a performance evaluation – most of them are unjustifiable excuses, with the exception of “The belief that the benefit of a vis is obvious”. Here I agree with van Wijk, as he discussed in his VIS capstone: strive for tools that are obviously better. I wouldn’t trust a flimsy 12-person evaluation that shows me that complex-tool-1 is better than complex-tool-2, especially if the benefit is not obvious.

[1] http://lineup.caleydo.org
[2] http://people.seas.harvard.edu/~alex/papers/2011_infovis_context-preserving-links.pdf
1. Steve Haroz Post author October 30, 2013 at 6:41 am
  
  Hi Alexander,
  
  Sorry about the delayed response. Your comment was flagged as spam.
  
  As I stated in the original post and the reply to Petra, we don’t need 100% of papers to evaluate user performance. However, we should be doing more to evaluate how techniques impact performance for a wide variety of tasks. It’s even possible that combining techniques (that were already evaluated for many tasks) would have some sort of negative interaction. As Tamara Munzner said in the evaluation panel, we could stop making new visualizations and still have ten or more years of evaluation work to do.
  
  I don’t agree that Excel is unfair for comparison. There are many tasks for which visualization does not help. Demonstrating that having a visualization at all actually improves a user’s understanding of the information is useful. Excel is not a straw man.
  
  I do agree that an entire alternative visualization doesn’t need to be created for comparison. But selectively simplifying features or altering visual mappings would be very informative in determining what aspect of a vis helps a particular task. You did exactly that in the contextual links paper. You implemented highlighting and straight lines for comparison. For a more complex multifaceted application, selectively removing or simplifying even one component (especially the more novel techniques or combinations thereof) allows the reader to know how much that component impacts a particular task and whether it’s worth the effort to implement.
  
  Perhaps we just have a difference in philosophy. If a solid evaluation yields results that counter my intuition, I’d first question the study. But if the study is solid and replicated, I’d accept that I have a flawed intuition.