Selective Data Acquisition for Machine Learning

  • Josh Attenberg
  • Prem Melville
  • Foster Provost
  • Maytal Saar-Tsechansky
In many applications, one must invest effort or money to acquire the data and other information required for machine learning and data mining.  Careful selection of the information to acquire can substantially improve generalization performance per unit cost.  The costly information scenario that has received the most research attention (see Chapter 10) has come to be called “active learning,” and focuses on choosing the instances for which target values (labels) will be acquired for training.

However, machine learning applications offer a variety of different sorts of information that may need to be acquired.  This chapter focuses on settings and techniques for selectively acquiring information beyond just single training labels (the values of the target variable) for selected instances in order to improve a model’s predictive accuracy.  The different kinds of acquired information include feature values, feature labels, entire examples, values at prediction time, repeated acquisition for the same data item, and more. For example, Figure 5.1 contrasts the acquisition of training labels, feature values, and both.  We will discuss all these sorts of information in detail.

Broadening our view beyond simple active learning not only expands the set of applications to which we can apply selective acquisition strategies, but it also highlights additional important problem dimensions and characteristics, and reveals fertile areas of research that to date have received relatively little attention.
In what follows, we start by presenting two general notions that are employed to help direct the acquisition of various sorts of information.  The first is to prefer acquiring information about which the current state of modeling is uncertain.  The second is to acquire the information that is estimated to be the most valuable.
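To make the first notion concrete, the following is a minimal sketch of entropy-based uncertainty selection: given a model’s predicted class probabilities for a pool of unlabeled instances, it selects the instances whose predictions are closest to uniform. The function name and interface are our own illustration, not notation from this chapter.

```python
import numpy as np

def uncertainty_sample(probs, k=1):
    """Return the indices of the k instances whose predicted class
    distributions have the highest Shannon entropy, i.e., those the
    current model is least certain about.

    probs: array of shape (n_instances, n_classes), rows sum to 1.
    """
    probs = np.asarray(probs, dtype=float)
    # Shannon entropy per instance; the small constant guards log(0).
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    # Sort by descending entropy and keep the top k.
    return np.argsort(-entropy)[:k]

# Example: the model is sure about instances 0 and 2, torn on instance 1.
pool_probs = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
chosen = uncertainty_sample(pool_probs, k=1)
```

The second notion, estimated value, typically replaces the entropy score with an estimate of how much each candidate acquisition would improve the model (or reduce cost-weighted error), which is more expensive to compute but directly targets the quantity we care about.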
After expanding upon these two overarching notions, we discuss a variety of different settings where information acquisition can improve modeling.  The purpose of examining these different acquisition settings in some detail is to highlight their different challenges, solutions, and research issues.  As just one brief example, in contrast to active learning, active acquisition of feature values may have access to additional information, namely the instances’ labels, which enables different sorts of selection strategies.
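As a small illustration of how known labels can change the selection strategy, the sketch below prioritizes feature-value acquisition for training instances that the current model misclassifies. This is a deliberately simple heuristic of our own construction (not a method defined in this chapter); the names are hypothetical.

```python
import numpy as np

def acquisition_order(preds, labels, has_missing):
    """Order candidate instances for feature-value acquisition.

    Unlike label-acquisition active learning, the training labels are
    known here, so we can compare them to the model's predictions:
    candidates with missing feature values that the model currently
    misclassifies are placed first.

    preds, labels: arrays of predicted and true class labels.
    has_missing:   boolean array, True where feature values are missing.
    """
    preds = np.asarray(preds)
    labels = np.asarray(labels)
    # Only instances with missing feature values are candidates.
    candidates = np.flatnonzero(np.asarray(has_missing))
    wrong = preds[candidates] != labels[candidates]
    # Stable sort puts misclassified candidates first, preserving
    # the original order within each group.
    return candidates[np.argsort(~wrong, kind="stable")]
```

A value-based variant would instead estimate, for each candidate (instance, feature) pair, the expected improvement in model performance from filling in that value, again something only possible because the labels are available at training time.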
More specifically, we examine the acquisition of feature values, feature labels, and prediction-time values.  We also examine the specific, common setting where information is not perfect, and one may want to acquire additional information specifically to deal with information quality.  For example, one may want to acquire the same data item more than once.  In addition, we emphasize that it can be fruitful to expand our view of the sorts of acquisition actions we have at our disposal.  Providing specific variable values is only one sort of information “purchase” we might make.  For example, in certain cases we may be able to acquire entire examples of a rare class, or distinguishing words for document classification.  These alternative acquisitions may give a better return on investment for modeling.
Finally, and importantly, most research to date has considered each sort of information acquisition independently.  However, why should we believe that only one sort of information is missing or noisy?  Modelers may find themselves in situations where they need to acquire various pieces of information, and somehow must prioritize the different sorts of acquisition.  This has been addressed in a few research papers for pairs of types of information, for example for target labels and feature labels (“active dual supervision”).  We argue that a challenge problem for machine learning and data mining research should be to work toward a unified framework within which arbitrary information acquisitions can be prioritized, to build the best possible models on a limited budget.