Beyond Words: Unraveling Text Complexity with Novel Dataset and A Classifier Application
Published in: | 2023 26th International Conference on Computer and Information Technology (ICCIT) pp. 1 - 6 |
---|---|
Main Authors: | , , , , , |
Format: | Conference Proceeding |
Language: | English |
Published: | IEEE 13-12-2023 |
Summary: | Text classification is a fundamental task in Natural Language Processing (NLP). This research presents a novel human-annotated dataset of 22,331 English sentences categorized into four classes (simple, complex, compound, complex-compound), together with a sentence classifier tool that analyzes and classifies sentences in English text, with particular relevance to literature writing. The study evaluates performance using three feature representation methods: Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and word embedding features. Four machine learning and two deep learning classifier models are evaluated. BoW combined with a Support Vector Classifier (SVC) or Logistic Regression (LR) achieved high accuracy in distinguishing sentence complexity. Word embedding features, used with the LSTM and RNN models, offer a deeper semantic representation. LSTM achieves the highest accuracy, 98.03%, with balanced precision and recall and an average F1-score of 97%. RNN, slightly less accurate at 97.75%, still captures sentence-structure dependencies well. The study offers valuable insights for practical applications and contributes to the broader understanding of sentence structures and semantics. |
---|---|
DOI: | 10.1109/ICCIT60459.2023.10441159 |
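
The summary names Bag-of-Words and TF-IDF as two of the feature representations fed to the classifiers. As a rough illustration only, the sketch below computes both for three toy sentences using the plain textbook weighting tf × log(N/df). The sentences, the `tokenize`, `bag_of_words`, and `tf_idf` helpers, and the weighting variant are all assumptions made for this example; the record does not state the paper's tokenization, TF-IDF variant, or model settings, and library implementations often use smoothed variants instead.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'"

def tokenize(sentence):
    # Hypothetical tokenizer: lowercase, split on whitespace, strip punctuation.
    words = (w.strip(PUNCT).lower() for w in sentence.split())
    return [w for w in words if w]

def bag_of_words(sentences):
    # One count vector per sentence over a shared, sorted vocabulary.
    tokenized = [tokenize(s) for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

def tf_idf(sentences):
    # Textbook TF-IDF: (count / sentence length) * log(N / document frequency).
    vocab, counts = bag_of_words(sentences)
    n = len(sentences)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n / d) for d in df]
    vectors = []
    for row in counts:
        total = sum(row) or 1
        vectors.append([(c / total) * w for c, w in zip(row, idf)])
    return vocab, vectors

# Toy sentences (not taken from the paper's dataset).
sentences = [
    "The rain stopped.",
    "The rain stopped, and the sun came out.",
    "We stayed inside because the rain continued.",
]

vocab, bow = bag_of_words(sentences)
_, tfidf = tf_idf(sentences)

for name in ("the", "and", "because"):
    j = vocab.index(name)
    print(name, [row[j] for row in bow], [round(row[j], 3) for row in tfidf])
```

Words that occur in every sentence, such as "the", get a TF-IDF weight of zero under this variant, while connectives such as "and" or "because" that appear in only one sentence keep a positive weight. That is one intuition for why even simple lexical features can carry signal about sentence structure.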