Problem Definition, Data Cleaning and Evaluation: A Classifier Learning Case Study

  • Andrea Danyluk
  • Foster Provost

Problem definition, data cleaning, and evaluation constitute much of the process of building useful, real-world classifiers with inductive algorithms.  This paper is a case study of this process based on a long-term project addressing the automatic dispatch of technicians to fix faults in the local loop of a telephone network.  The bottom line of the project is that simple learning techniques can be effective.  However, constructing a convincing argument to that effect is far from simple.  In particular, we had to consult multiple sources to obtain class labels, use domain knowledge to clean up data, compare with existing methods, and evaluate with data from multiple locations.  Finally, it was necessary to use decision-analytic techniques to evaluate the cost-effectiveness of the learned classifiers, because evaluation based on classification accuracy is misleading without an analysis of cost-effectiveness.  Our view is that application studies should be helpful in guiding future research.  Therefore, we conclude by outlining useful directions suggested by our experience on this long-term project.
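
The point about accuracy versus cost-effectiveness can be illustrated with a small sketch.  It is not taken from the project: the two confusion matrices, the cost matrix, the class ordering, and the function names are all hypothetical, chosen only to show how a classifier with higher accuracy can still incur a higher expected cost when errors are not equally expensive.

```python
# Illustrative only: every number below is made up, not data from the case study.
# Rows are the true class, columns the predicted class.
# Class 0 = "dispatch needed", class 1 = "no dispatch needed" (assumed labels).

def accuracy(confusion):
    """Fraction of cases on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def expected_cost(confusion, cost):
    """Average cost per case: each cell count weighted by its cost."""
    total = sum(sum(row) for row in confusion)
    weighted = sum(
        confusion[i][j] * cost[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )
    return weighted / total


# Hypothetical costs: missing a needed dispatch (true 0, predicted 1) is
# assumed to cost ten times as much as an unnecessary dispatch.
COST = [
    [0, 10],
    [1, 0],
]

# Hypothetical classifier A: few errors overall, but many missed dispatches.
CLF_A = [
    [80, 20],
    [5, 895],
]

# Hypothetical classifier B: more errors overall, but few missed dispatches.
CLF_B = [
    [95, 5],
    [40, 860],
]

if __name__ == "__main__":
    for name, conf in (("A", CLF_A), ("B", CLF_B)):
        print(name, accuracy(conf), expected_cost(conf, COST))
```

With these made-up numbers, classifier A is the more accurate of the two (0.975 versus 0.955) yet has the higher average cost per case (0.205 versus 0.090), so a ranking by accuracy alone would pick the less cost-effective classifier.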