Article

Keywords:
text mining; knowledge base management; multi-level categorization; hierarchical text categorization
Summary:
Text categorization is the task of assigning a text document to an appropriate category from a predefined set of categories. We present a new approach to text categorization by means of a Fuzzy Relational Thesaurus (FRT). An FRT is a multilevel category system that stores and maintains an adaptive local dictionary for each category. The goal of our approach is twofold: to develop a reliable text categorization method for a given subject domain, and to expand the initial FRT with automatically added terms, thereby obtaining an incrementally defined knowledge base of the domain. We implemented the categorization algorithm and compared it with several other hierarchical classifiers. Experimental results show that our algorithm outperforms its rivals on all document corpora investigated.
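To make the idea of a multilevel category system with per-category local dictionaries more concrete, the following Python sketch may help. It is only an illustration under stated assumptions and not the algorithm of the article: the names `Category`, `categorize` and `expand`, the averaging score, the greedy top-down descent, and the default weight of 0.1 for newly added terms are all hypothetical choices made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Category:
    """One node of a category hierarchy (hypothetical structure)."""
    name: str
    # Local dictionary: term -> relevance weight in [0, 1] (assumed representation).
    dictionary: dict[str, float] = field(default_factory=dict)
    children: list[Category] = field(default_factory=list)

    def score(self, terms: list[str]) -> float:
        """Average dictionary weight of the document terms; unknown terms count as 0."""
        if not terms:
            return 0.0
        return sum(self.dictionary.get(t, 0.0) for t in terms) / len(terms)


def categorize(root: Category, terms: list[str]) -> list[str]:
    """Greedy top-down descent: at each level follow the best-scoring child.

    Stops at a leaf, or when no child matches any term of the document.
    """
    path: list[str] = []
    node = root
    while node.children:
        best = max(node.children, key=lambda c: c.score(terms))
        if best.score(terms) == 0.0:
            break
        path.append(best.name)
        node = best
    return path


def expand(root: Category, path: list[str], terms: list[str], weight: float = 0.1) -> None:
    """Add unseen document terms to every local dictionary along the path.

    Existing weights are left untouched; the initial weight is an assumption.
    """
    node = root
    for name in path:
        node = next(c for c in node.children if c.name == name)
        for t in terms:
            node.dictionary.setdefault(t, weight)


# Toy hierarchy with invented terms and weights.
football = Category("football", {"goal": 0.9, "ball": 0.5})
tennis = Category("tennis", {"racket": 0.9, "ball": 0.5})
sports = Category("sports", {"ball": 0.9, "team": 0.8}, [football, tennis])
science = Category("science", {"atom": 0.9})
root = Category("root", children=[sports, science])

doc = ["ball", "goal", "referee"]
path = categorize(root, doc)
expand(root, path, doc)
```

In this toy run the document descends through `sports` to `football`, and the previously unknown term `referee` is then added to the local dictionaries on that path. This loosely mirrors the incremental expansion described in the summary, without any claim about how the article computes or updates its weights.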