Understanding Dataset Difficulty with V-Usable Information



Kawin Ethayarajh (Stanford University)

Kawin Ethayarajh is a PhD student and Facebook Fellow at Stanford NLP, where he is advised by Dan Jurafsky. He was previously a BMO National Scholar and John H. Moss Scholar at the University of Toronto, where he was advised by Graeme Hirst and David Duvenaud. His research focuses on the end-to-end evaluation of NLP systems, from representation-centric evaluation (“How Contextual are Contextualized Word Representations?”, ACL 2019) to data-centric evaluation (“Understanding Dataset Difficulty with V-Usable Information”, ICML 2022) to human-centric evaluation (“Dynaboard”, NeurIPS 2021).



Short Abstract: Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. However, this comparison provides little understanding of how difficult each instance in a given distribution is, or what attributes make the dataset difficult for a given model. To address these questions, we frame dataset difficulty, with respect to a model family V, as the lack of V-usable information (Xu et al., 2019): the less usable information a dataset provides, the more difficult it is for models in V. Our framework brings many types of comparison under the same umbrella: we can compare not only different model families, but also different datasets, different slices of the same dataset, different instances within a distribution, and different input attributes. We apply our framework to discover annotation artefacts in widely used NLP benchmarks, such as SNLI and CoLA.
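
For orientation, the following is a minimal sketch of the framework's central quantities, following the definitions in Xu et al. and the ICML 2022 paper; the notation (the predictive family V, models f, g, g' in V, and the null input ∅) is spelled out here only as a rough guide and is not taken verbatim from the talk:

\[ I_V(X \to Y) = H_V(Y \mid \varnothing) - H_V(Y \mid X), \qquad H_V(Y \mid X) = \inf_{f \in V} \mathbb{E}\big[-\log_2 f[x](y)\big] \]

\[ \mathrm{PVI}(x \to y) = -\log_2 g[\varnothing](y) + \log_2 g'[x](y) \]

Intuitively, H_V(Y | ∅) measures how well a model in V can predict the label without seeing the input, H_V(Y | X) measures how well it can do so given the input, and their difference I_V(X → Y) is the V-usable information; the smaller this difference, the more difficult the dataset is for the family V. The pointwise quantity PVI, computed with models g and g' fit without and with the inputs respectively, is what enables the instance-level and attribute-level comparisons mentioned above.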