Nittaya Kerdprasop, Fonthip Koongaew, Zagon Budsabong, Phaichayon Kongchai, Kittisak Kerdprasop
The ability to correctly predict rarely occurring cases is important to the success of applying data mining methods to many real-life applications. In the context of data mining, rare cases are labeled data instances that occur infrequently in the database. Discovering infrequent patterns is of interest in specific domains such as genetic mutant identification, credit card fraud detection, and network intruder prevention. However, most learning algorithms are biased toward the majority cases, so the minority cases are treated as noise and ignored during the model induction steps. As a result, the learning algorithm generates a model that cannot classify or predict a minority case. We therefore study a replication technique based on the over-sampling method to solve this problem. A straightforward application of over-sampling, however, may lead to over-fitting, in which the generated model is too specific to the manipulated data. We thus apply a cluster-based technique to selectively filter the training dataset. The experimental results on the primary tumor, arrhythmia, and communities-and-crime datasets show significant improvement in the predictive accuracy, specificity, and sensitivity of the induced models. The results on the multiple features correlation dataset, however, show non-significant improvement; this case requires further investigation.
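To make the replication idea and the reported metrics concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the simplest form of over-sampling, in which minority-class instances are duplicated at random until every class is as large as the largest one, and it computes sensitivity and specificity for one class treated as positive. The function names `oversample_by_replication` and `sensitivity_specificity` are hypothetical, and the cluster-based filtering step is omitted because the abstract does not specify it.

```python
import random
from collections import Counter


def oversample_by_replication(rows, labels, seed=0):
    """Duplicate minority-class rows until each class matches the largest class.

    Illustrative assumption: plain random replication, no cluster-based filtering.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Pool of original instances belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def sensitivity_specificity(y_true, y_pred, positive):
    """Return (sensitivity, specificity), treating `positive` as the rare class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec


if __name__ == "__main__":
    rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
    labels = ["common", "common", "common", "common", "rare"]
    new_rows, new_labels = oversample_by_replication(rows, labels)
    print(Counter(new_labels))
    print(sensitivity_specificity(
        ["rare", "common", "common", "rare"],
        ["rare", "common", "rare", "common"],
        positive="rare",
    ))
```

Because the replicated rows are exact copies, a model trained on the balanced data can memorize them, which is the over-fitting risk the abstract raises as the motivation for filtering the training set.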
IJCSI International Journal of Computer Science Issues, Vol. 10, Issue 2, No 3, March 2013 ISSN (Print): 1694-0814 | ISSN (Online): 1694-0784 www.IJCSI.org