The Matthew Effect when Learning from Weakly Supervised Data



Yang Liu (UCSC)

Yang Liu is currently an Assistant Professor of Computer Science and Engineering at UC Santa Cruz (2018 - present). He was previously a postdoctoral fellow at Harvard University (2016 - 2018) hosted by Yiling Chen. He obtained his Ph.D. degree from the Department of EECS, University of Michigan, Ann Arbor in 2015. He is interested in crowdsourcing and algorithmic fairness in machine learning. He is a recipient of the NSF CAREER Award and the NSF Fairness in AI award (lead PI). He has been selected to participate in several high-profile projects, including DARPA SCORE and IARPA HFC. His research has been covered by media including WIRED and WSJ. His work on using machine learning to forecast future security incidents has been successfully commercialized and acquired by FICO. His recent works have won four best paper awards at relevant workshops.



Short Abstract:

Our data is often weakly supervised: training labels are primarily solicited from human annotators and therefore encode human-level mistakes; in semi-supervised learning, the artificially generated pseudo labels are inherently imperfect; in reinforcement learning, the collected rewards can be misleading due to faulty sensors. The list goes on. Despite the successes of existing solutions for addressing weak supervision, we recently identified strong evidence that these solutions treat different sub-populations, defined by, for example, demographic groups, disparately.

This talk will first introduce this unique challenge in weakly supervised learning, and then I'll present a detailed case study of a broad family of popular semi-supervised learning (SSL) algorithms. I am going to show that the sub-population that has a higher baseline accuracy without SSL (the ``rich" sub-population) tends to benefit more from SSL, while the sub-population that suffers from a lower baseline accuracy (the ``poor" sub-population) might even observe a performance drop after adding the SSL module. I will provide both theoretical and empirical support for this observation. I will then discuss how this disparate impact can be mitigated, and I hope that our results will alert practitioners to this potential pitfall of SSL and encourage a multifaceted evaluation of future weakly supervised learning algorithms.
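To make the evaluation concrete, the following is a minimal sketch of the kind of self-training (pseudo-labeling) pipeline the talk examines, with accuracy measured per sub-population. It is illustrative only: the synthetic data, the scikit-learn classifier, the 0.9 confidence threshold, and the group definition are all my assumptions, not details from the talk or the papers.

```python
# Illustrative self-training (pseudo-labeling) sketch with per-group evaluation.
# Assumptions (not from the talk): synthetic Gaussian data, logistic regression,
# a 0.9 confidence threshold, and a binary group attribute.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic binary task with a sub-population attribute `g`."""
    y = rng.integers(0, 2, n)
    g = rng.integers(0, 2, n)  # hypothetical group membership (0 = "poor", 1 = "rich")
    # Group 1 gets better-separated features, so its baseline accuracy is higher.
    sep = np.where(g == 1, 2.0, 0.8)
    X = rng.normal(loc=(y * sep)[:, None], scale=1.0, size=(n, 2))
    return X, y, g

X_lab, y_lab, _ = make_data(60)          # small labeled set
X_unlab, _, _ = make_data(600)           # unlabeled pool
X_test, y_test, g_test = make_data(2000)

def group_acc(model):
    """Accuracy on the test set, separately for each sub-population."""
    pred = model.predict(X_test)
    return {g: float((pred[g_test == g] == y_test[g_test == g]).mean()) for g in (0, 1)}

# 1) Baseline: supervised training on the labeled data only.
base = LogisticRegression().fit(X_lab, y_lab)
base_accs = group_acc(base)

# 2) SSL step: pseudo-label confident unlabeled points, then retrain.
probs = base.predict_proba(X_unlab)
confident = probs.max(axis=1) > 0.9      # keep only high-confidence pseudo labels
X_aug = np.vstack([X_lab, X_unlab[confident]])
y_aug = np.concatenate([y_lab, probs[confident].argmax(axis=1)])
ssl = LogisticRegression().fit(X_aug, y_aug)
ssl_accs = group_acc(ssl)

print("baseline per-group accuracy:", base_accs)
print("SSL per-group accuracy:    ", ssl_accs)
```

Reporting only the aggregate accuracy would hide exactly the effect the talk highlights; comparing `base_accs` and `ssl_accs` group by group is the multifaceted evaluation the abstract argues for.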

The talk is based on the following two works:
[1] Understanding Instance-Level Label Noise: Disparate Impacts and Treatments, Yang Liu, International Conference on Machine Learning (ICML), 2021.
[2] The Rich Get Richer: Disparate Impact of Semi-Supervised Learning, Zhaowei Zhu, Tianyi Luo, and Yang Liu, International Conference on Learning Representations (ICLR), 2022.