Humanly Certifying Superhuman Classifiers



Qiongkai Xu (University of Melbourne)

Qiongkai Xu is a research fellow in Security in NLP at the University of Melbourne (UoM), where he works primarily at the intersection of Privacy & Security, Machine Learning, and Natural Language Processing. His research goal is to audit machine learning models from the perspectives of i) new performance evaluation paradigms and ii) privacy and security. Prior to joining UoM, Qiongkai completed his Ph.D. at the Australian National University (ANU). He has several years of industry research experience at labs such as Huawei Noah's Ark Lab, Data61 CSIRO, and IBM China Research Lab.



Short Abstract: Estimating the performance of a machine learning system is a longstanding challenge in artificial intelligence research. Today, this challenge is especially relevant given the emergence of systems that show increasing evidence of outperforming human beings. In some cases, this "superhuman" performance is readily demonstrated, for example by defeating top-tier human players in traditional two-player games. It is much harder, however, to evaluate classification models that potentially surpass human performance. Indeed, human annotations are often treated as the ground truth, which implicitly assumes the superiority of humans over any model trained on human annotations. In reality, human annotators are subjective and make mistakes. Evaluating performance with respect to a genuine oracle would be more objective and reliable, even though querying the oracle is expensive or, in some cases, impossible. In this paper, we first raise the challenge of evaluating the performance of both humans and models with respect to an unobserved oracle. We develop a theory for estimating accuracy with respect to the oracle using only imperfect human annotations for reference. Our analysis provides a simple recipe for detecting and certifying superhuman performance in this setting, which we believe will help clarify the current state of research on classification. We validate the convergence of the bounds and the assumptions of our theory on carefully designed toy experiments with known oracles. Moreover, we demonstrate the utility of our theory by meta-analyzing large-scale natural language processing tasks, for which no oracle exists, and show that under our mild assumptions a number of models from recent years have already achieved superhuman performance with high probability. This suggests that our new oracle-based performance evaluation metrics are overdue as an alternative to the widely used accuracy metrics that are naively based on imperfect human annotations.
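
To make the evaluation setting concrete, below is a minimal toy sketch (not taken from the paper; the noise rates and variable names are hypothetical) that simulates a known binary oracle, an imperfect human annotator, and a model that is more accurate than the human with respect to the oracle. Measured naively against human labels, the model can appear worse than the human, while a simple union-bound argument, P(model = oracle) >= P(model = human) + P(human = oracle) - 1, still gives a valid (if loose) lower bound on its oracle accuracy.

import numpy as np

rng = np.random.default_rng(0)

n = 100_000                      # number of toy examples
oracle = rng.integers(0, 2, n)   # simulated "oracle" (true) binary labels

def noisy_copy(labels, error_rate, rng):
    """Flip each binary label independently with the given error rate."""
    flips = rng.random(len(labels)) < error_rate
    return np.where(flips, 1 - labels, labels)

human = noisy_copy(oracle, error_rate=0.15, rng=rng)   # imperfect human annotations
model = noisy_copy(oracle, error_rate=0.05, rng=rng)   # a (secretly) superhuman model

acc_vs_human    = np.mean(model == human)    # what naive, annotation-based evaluation reports
acc_vs_oracle   = np.mean(model == oracle)   # what we actually care about
human_vs_oracle = np.mean(human == oracle)   # human accuracy w.r.t. the oracle

# Union bound: P(model != oracle) <= P(model != human) + P(human != oracle),
# hence P(model == oracle) >= P(model == human) + P(human == oracle) - 1.
lower_bound = acc_vs_human + human_vs_oracle - 1

print(f"model accuracy vs. human labels : {acc_vs_human:.3f}")
print(f"model accuracy vs. oracle       : {acc_vs_oracle:.3f}")
print(f"human accuracy vs. oracle       : {human_vs_oracle:.3f}")
print(f"union-bound lower bound         : {lower_bound:.3f}")

In this toy setting the oracle is known, so all three quantities can be computed directly for validation; the theory described in the abstract concerns bounding the oracle-relative quantities when the oracle is unobserved and only imperfect human annotations are available.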