Tuesday, 30 July 2013

The Era of 3V Data Mining

Data mining has passed through a number of historical stages. The first stage was the founding of data mining as an independent field, marked by the series of Knowledge Discovery in Databases (KDD) workshops that started in 1989 and later grew into the premier data mining conference. This stage corresponds to the first “V” of the data mining era we live in today: “volume”. Although the data volumes of more than two decades ago were small compared with current ones, they were large enough at the time to challenge traditional statistics and machine learning techniques, as well as state-of-the-art hardware. The second “V”, “velocity”, is linked to the second stage in the historical development of data mining. This stage concerns data stream mining, which is defined as the process of performing approximate data mining on high-speed input data records [3]. We can trace the early developments in this area to the late 1990s [1], with maturity reached after about a decade of continuous research and thousands of published papers. The last “V” stands for “variety”. Variety is a more recent development in data mining, made possible by the maturing of storage and retrieval techniques for semi-structured and unstructured data. That maturity has in turn been an important response to the increasing reliance on social media websites as an important source of information.

The combination of the three Vs has been referred to as Big Data analytics. This combination is the third wave of developments in data mining [2]. Successful deployment of Big Data analytics will change the scale at which data are analysed, and the digital humanities will be provided with tools that will mark their rise and success.

References
[1] Alon, N., Matias, Y., & Szegedy, M. (1996, July). The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing (pp. 20-29). ACM.
[2] Cuzzocrea, A., & Gaber, M. M. (2013). Data science and distributed intelligence: recent developments and future insights. In Intelligent Distributed Computing VI (pp. 139-147). Springer Berlin Heidelberg.
[3] Gaber, M. M., Zaslavsky, A., & Krishnaswamy, S. (2005). Mining data streams: a review. ACM SIGMOD Record, 34(2), 18-26.