Learning with Imbalanced Data Sets 101

  • Foster Provost
The majority of learning systems previously designed and tested on toy problems or carefully crafted benchmark data sets usually assumes that the training sets are well balanced. Unfortunately, this balanced assumption is often violated in real world settings.  Indeed, there exist many domains for which some classes are represented by a large number of examples while the others are represented by only a few.
Although the imbalanced data set problem is starting to attract researchers’ attention, attempts at tackling it have remained isolated.  It is our belief that much progress could be achieved from a concerted effort and a greater amount of interactions between researchers interested in this issue.  The purpose of this workshop was to provide a forum to foster such interactions and identify future research directions.