Having access to massive amounts of data does not necessarily imply that induction algorithms must use them all. Samples often provide the same accuracy with far less computational cost. However, the correct sample size rarely is obvious. We analyze methods for progressive sampling-using progressively larger samples as long as model accuracy improves. We explore several notions of efficient progressive sampling. We analyze efficiency relative to induction with all instances; we show that a simple, geometric sampling schedule is asymptotically optimal, and we describe how best to take into account prior expectations of accuracy convergence. We then describe the issues involved in instantiating an efficient progressive sampler, including how to detect convergence. Finally, we provide empirical results comparing a variety of progressive sampling methods. We conclude that progressive sampling can be remarkably efficient.
Efficient Progressive Sampling
- David Jensen
- Tim Oates
- Foster Provost
- Venue: Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-99)
- 1999
- Type: Selected Conference Paper
- Acceptance rates: 9%