Why Label When You Can Search? Alternatives to Active Learning for Applying Human Resources to Build Classification Models Under Extreme Class Imbalance

  • Josh Attenberg
  • Foster Provost

This paper analyses alternative techniques for deploying low-cost human resources to acquire training data for classifier induction in domains exhibiting extreme class imbalance, where traditional labeling strategies, such as active learning, can be ineffective. Consider the problem of building classifiers to help brands control the content adjacent to their on-line advertisements. Although frequent enough to worry advertisers, objectionable categories are rare in the distribution of impressions encountered by most on-line advertisers, so rare that traditional sampling techniques do not find enough positive examples to train effective models. An alternative way to deploy human resources for training-data acquisition is to have them “guide” the learning by searching explicitly for training examples of each class. We show that under extreme skew, even basic techniques for guided learning completely dominate smart (active) strategies for applying human resources to select cases for labeling. Therefore, it is critical to consider the relative cost of search versus labeling, and we demonstrate the trade-offs for different relative costs. We show that in cost/skew settings where the choice between search and active labeling is equivocal, a hybrid strategy can combine the benefits of both.