Improving Thai educational Web page classification using inverse class frequency

Bibliographic Details
Published in: IEEE International Symposium on Communications and Information Technology, 2005 (ISCIT 2005), Vol. 2, pp. 817-820
Main Authors: Lertnattee, V., Theeramunkong, T.
Format: Conference Proceeding
Language: English
Published: IEEE 2005
Description
Summary: Automatic text classification for a Web collection is a challenging task, especially when the language is not English, as with Thai. However, most Thai educational Web pages include English terms because of their technical content. Many technical terms and typing errors, in both Thai and English, are found on university Web sites. Most previous work on text categorization has used term frequency and inverse document frequency to represent the importance of terms. In this paper, we use inverse class frequency instead of inverse document frequency in centroid-based text categorization because it works well on a collection with a large number of unique terms. The experimental results show that inverse class frequency is useful, especially when it is applied to both prototype and query vectors.
ISBN: 9780780395381, 0780395387
DOI: 10.1109/ISCIT.2005.1566992
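
The summary describes replacing inverse document frequency with inverse class frequency (ICF) in a centroid-based classifier and applying it to both prototype and query vectors. The record does not give the paper's formulas, so the following is only a minimal sketch under stated assumptions: ICF is taken to be log(|C| / cf(t)), where cf(t) is the number of classes containing term t; prototypes are normalized sums of unit-length document vectors; and similarity is cosine. All function names and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def icf_weights(docs_by_class):
    """Assumed form: icf(t) = log(|C| / cf(t)), cf(t) = number of classes containing t."""
    cf = Counter()
    for docs in docs_by_class.values():
        terms_in_class = set()
        for doc in docs:
            terms_in_class.update(doc)
        cf.update(terms_in_class)  # each class counts a term at most once
    n_classes = len(docs_by_class)
    return {t: math.log(n_classes / c) for t, c in cf.items()}


def weigh(tokens, icf):
    """Term frequency times ICF; unseen or zero-weight terms are dropped."""
    tf = Counter(tokens)
    vec = {t: n * icf.get(t, 0.0) for t, n in tf.items()}
    return {t: w for t, w in vec.items() if w > 0.0}


def normalize(vec):
    """Scale a sparse vector to unit length (empty if its norm is zero)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_centroids(docs_by_class, icf):
    """Prototype vector per class: normalized sum of unit-length TF-ICF document vectors."""
    centroids = {}
    for label, docs in docs_by_class.items():
        total = defaultdict(float)
        for doc in docs:
            for t, w in normalize(weigh(doc, icf)).items():
                total[t] += w
        centroids[label] = normalize(total)
    return centroids


def classify(tokens, centroids, icf):
    """Weight the query with ICF as well, then pick the class with the highest cosine."""
    query = normalize(weigh(tokens, icf))

    def score(label):
        centroid = centroids[label]
        return sum(w * centroid.get(t, 0.0) for t, w in query.items())

    return max(centroids, key=score)


# Toy, made-up collection mixing Thai and English tokens (pre-tokenized).
docs_by_class = {
    "computer_science": [
        ["algorithm", "ขั้นตอนวิธี", "course"],
        ["compiler", "algorithm", "course"],
    ],
    "pharmacy": [
        ["drug", "ยา", "course"],
        ["dosage", "drug", "course"],
    ],
}
icf = icf_weights(docs_by_class)
centroids = build_centroids(docs_by_class, icf)
predicted = classify(["algorithm", "course"], centroids, icf)
```

In this toy collection the term "course" occurs in every class, so its assumed ICF weight is log(2/2) = 0 and it contributes nothing to either the prototypes or the query. Because ICF depends on the number of classes rather than the number of documents, a rare misspelling confined to one class is weighted the same as any other class-specific term instead of being boosted by its low document count.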