Distributed Data Mining: Scaling Up and Beyond

  • Foster Provost

In this chapter I first discuss Distributed Data Mining (DDM) for scaling up: I ask what scaling up means, question whether it is necessary, and present a brief survey of what has been done to date. I then provide motivation beyond scaling up, arguing that DDM is a more natural way to view data mining generally. DDM eliminates many difficulties encountered when coalescing already-distributed data for monolithic data mining, such as those associated with heterogeneity of data and with privacy restrictions. Viewing data mining as inherently distributed brings into focus important open research issues that are currently obscured because the process of producing monolithic data sets receives no explicit treatment. I close with a discussion of the necessity of DDM for an efficient process of knowledge discovery.