The most commonly cited reason for attempting to scale inductive methods up to massive data sets is based on the prevailing view of data mining as classifier learning. When learning classifiers, increasing the size of the training set typically increases the accuracy of the learned models (Catlett 1991). In many cases, the degradation in accuracy when learning from smaller samples stems from over-fitting, due to the need to allow the program to learn small disjuncts (Holte 1989) or due to the existence of a large number of features describing the data. Large feature sets increase the size of the space of models; they increase the likelihood that, by chance, a learning program will find a model that fits the training data well yet generalizes poorly, and thereby increase the number of training examples required (Haussler 1988).
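Both effects can be seen in a minimal simulation. The sketch below is purely illustrative (the synthetic data, the decision-stump learner, and all names are assumptions, not anything from the cited work): with many features and only a handful of training examples, a learner is likely to latch onto a noise feature that happens to fit the sample, whereas a larger training set lets it identify the genuinely predictive feature.

```python
import random

def make_data(n, n_features, rng):
    """Synthetic data (illustrative assumption): feature 0 determines the
    label with 10% label noise; the remaining features are pure noise."""
    X, y = [], []
    for _ in range(n):
        x = [rng.uniform(-1, 1) for _ in range(n_features)]
        label = 1 if x[0] > 0 else 0
        if rng.random() < 0.1:  # flip 10% of labels
            label = 1 - label
        X.append(x)
        y.append(label)
    return X, y

def train_stump(X, y):
    """Pick the (feature, direction) whose sign best matches the training
    labels -- a deliberately simple learner whose model space grows with
    the number of features."""
    best, best_acc = None, -1.0
    for j in range(len(X[0])):
        for direction in (1, -1):
            correct = sum(1 for x, label in zip(X, y)
                          if (1 if direction * x[j] > 0 else 0) == label)
            acc = correct / len(X)
            if acc > best_acc:
                best_acc, best = acc, (j, direction)
    return best

def accuracy(stump, X, y):
    j, direction = stump
    correct = sum(1 for x, label in zip(X, y)
                  if (1 if direction * x[j] > 0 else 0) == label)
    return correct / len(X)

rng = random.Random(0)
X_test, y_test = make_data(2000, 50, rng)          # held-out evaluation set
X_small, y_small = make_data(8, 50, rng)           # tiny sample, 50 features
X_large, y_large = make_data(1000, 50, rng)        # larger sample, same features

acc_small = accuracy(train_stump(X_small, y_small), X_test, y_test)
acc_large = accuracy(train_stump(X_large, y_large), X_test, y_test)
print(f"   8 training examples: test accuracy {acc_small:.2f}")
print(f"1000 training examples: test accuracy {acc_large:.2f}")
```

With 50 candidate features and only 8 examples, some noise feature will often separate the sample perfectly, so the chosen stump may generalize no better than chance; with 1000 examples, the informative feature reliably wins, illustrating why larger example sets are needed as the model space grows.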
© Copyright 2023 Foster Provost. All Rights Reserved.