Efficient kNN Classification With Different Numbers of Nearest Neighbors Shichao Zhang, Xuelong Li, Ming Zong et al.|IEEE Transactions on Neural Networks and Learning Systems|2017 The k nearest neighbor (kNN) method is a popular classification method in data mining and statistics because of its simple implementation and strong classification performance. However, it is impractical for traditional kNN methods to assign a fixed k value (even one set by experts) to all test samples. Previous solutions assign different k values to different test samples by cross validation, but this is usually time-consuming. This paper proposes a kTree method that learns different optimal k values for different test/new samples by adding a training stage to kNN classification. Specifically, in the training stage, the kTree method first learns optimal k values for all training samples with a new sparse reconstruction model, and then constructs a decision tree (namely, kTree) from the training samples and the learned optimal k values. In the test stage, the kTree quickly outputs the optimal k value for each test sample, and kNN classification is then conducted using the learned optimal k value and all training samples. As a result, the proposed kTree method has a similar running cost but higher classification accuracy than traditional kNN methods, which assign a fixed k value to all test samples. Moreover, the proposed kTree method has a lower running cost but achieves similar classification accuracy compared with recent kNN methods that assign different k values to different test samples. This paper further proposes an improved version of the kTree method (namely, the k*Tree method) that speeds up the test stage by additionally storing information about the training samples in the leaf nodes of the kTree, such as the training samples located in the leaf nodes, their kNNs, and the nearest neighbors of these kNNs. 
We call the resulting decision tree the k*Tree; it makes it possible to conduct kNN classification using the subset of training samples stored in the leaf nodes rather than all training samples, as in recent kNN methods. This reduces the running cost of the test stage. Finally, experimental results on 20 real data sets showed that the proposed methods (i.e., kTree and k*Tree) are much more efficient than the compared methods on classification tasks.
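The two-stage idea in this abstract (learn a k for every training sample, then look up a k for each test sample before running ordinary kNN) can be sketched roughly as follows. This is a minimal illustration, not the authors' method: the sparse reconstruction model is replaced by a simple leave-one-out search, and the decision tree (kTree) is replaced by copying the k of the nearest training sample. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def learn_k_per_sample(train_X, train_y, candidates=(1, 3, 5)):
    """Assumption: stand-in for the paper's sparse reconstruction step.

    For each training point, pick the smallest candidate k that classifies
    it correctly when it is held out; fall back to the largest candidate.
    """
    learned = []
    for i, (x, y) in enumerate(zip(train_X, train_y)):
        rest_X = train_X[:i] + train_X[i + 1:]
        rest_y = train_y[:i] + train_y[i + 1:]
        chosen = candidates[-1]
        for k in candidates:
            if knn_predict(rest_X, rest_y, x, k) == y:
                chosen = k
                break
        learned.append(chosen)
    return learned


def predict_with_learned_k(train_X, train_y, train_k, x):
    """Assumption: stand-in for the kTree lookup.

    Reuse the k learned for the nearest training point, then run ordinary
    kNN over all training samples with that k.
    """
    nearest = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    k = train_k[nearest]
    return knn_predict(train_X, train_y, x, k), k


# Toy data: two well-separated clusters.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
train_k = learn_k_per_sample(train_X, train_y)

label_near_a, k_near_a = predict_with_learned_k(train_X, train_y, train_k, (0.5, 0.5))
label_near_b, k_near_b = predict_with_learned_k(train_X, train_y, train_k, (5.5, 5.5))
```

The point of the sketch is the division of labor: the expensive search for k happens once over the training set, so each test sample only pays for a cheap lookup plus a single kNN query.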
Learning <i>k</i> for kNN Classification Shichao Zhang, Xuelong Li, Ming Zong et al.|ACM Transactions on Intelligent Systems and Technology|2017 The k nearest neighbor (kNN) method has been widely used in data mining and machine learning applications due to its simple implementation and distinguished performance. However, assigning the same k value to all test data, as previous kNN methods do, has been shown to make these methods impractical in real applications. This article proposes to learn a correlation matrix that reconstructs test data points from training data in order to assign different k values to different test data points, referred to as Correlation Matrix kNN (CM-kNN for short) classification. Specifically, a least-squares loss function is employed to minimize the error of reconstructing each test data point from all training data points. Then, a graph Laplacian regularizer is used to preserve the local structure of the data in the reconstruction process. Moreover, an ℓ1-norm regularizer and an ℓ2,1-norm regularizer are applied, respectively, to learn different k values for different test data and to yield sparsity that removes redundant/noisy features from the reconstruction process. Besides classification tasks, the kNN methods (including the proposed CM-kNN method) are further applied to regression and missing data imputation. We conducted sets of experiments to illustrate the efficiency, and experimental results showed that the proposed method was more accurate and efficient than existing kNN methods in data-mining applications such as classification, regression, and missing data imputation.
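To make the reconstruction idea in this abstract concrete, the sketch below reconstructs one test point from the training points with a least-squares loss plus an ℓ1 penalty, solved by plain coordinate descent. It is a simplified illustration under stated assumptions: the graph Laplacian and ℓ2,1-norm regularizers are omitted, each test point is solved independently, and k is assumed to be read off as the number of nonzero weights. The names `sparse_reconstruct`, `learned_k`, `lam`, and the toy data are hypothetical.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam (the proximal step for an l1 penalty)."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def sparse_reconstruct(train_X, x, lam=0.1, sweeps=100):
    """Minimize 0.5 * ||x - sum_j w[j] * train_X[j]||^2 + lam * sum_j |w[j]|
    by cyclic coordinate descent. Returns one weight per training point."""
    d = len(x)
    w = [0.0] * len(train_X)
    residual = list(x)  # always equals x - sum_j w[j] * train_X[j]
    for _ in range(sweeps):
        for j, t in enumerate(train_X):
            norm_sq = sum(v * v for v in t)
            if norm_sq == 0.0:
                continue
            # Correlation of t with the residual that excludes t's own term.
            rho = sum(t[i] * (residual[i] + w[j] * t[i]) for i in range(d))
            new_w = soft_threshold(rho, lam) / norm_sq
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(d):
                    residual[i] -= delta * t[i]
                w[j] = new_w
    return w


def learned_k(weights, tol=1e-9):
    """Assumption: k is the number of training points with a nonzero weight."""
    return sum(1 for v in weights if abs(v) > tol)


# Toy training points and two test points that need different k values.
train_X = [(1.0, 0.0), (0.0, 1.0), (0.3, 0.3)]
w_first = sparse_reconstruct(train_X, (2.0, 0.0))
w_second = sparse_reconstruct(train_X, (2.0, 3.0))
k_first = learned_k(w_first)
k_second = learned_k(w_second)
```

In this toy setting the first test point is explained by a single training point and the second by two, so the ℓ1 penalty yields a different k for each without any cross validation. The training points with nonzero weights would then serve as the neighbors for voting.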
Data preparation for data mining Shichao Zhang, Chengqi Zhang, Qiang Yang|Applied Artificial Intelligence|2003 Data preparation is a fundamental stage of data analysis. While a lot of low-quality information is available in various data sources and on the Web, many organizations or companies are interested in how to transform the data into cleaned forms which can be used for high-profit purposes. This goal generates an urgent need for data analysis aimed at cleaning the raw data. In this paper, we first show the importance of data preparation in data analysis, then introduce some research achievements in the area of data preparation. Finally, we suggest some future directions of research and development.