Beyond Words: Unraveling Text Complexity with Novel Dataset and A Classifier Application

Bibliographic Details
Published in:	2023 26th International Conference on Computer and Information Technology (ICCIT), pp. 1-6
Main Authors:	Islam, Mohammad Shariful; Rony, Mohammad Abu Tareq; Saha, Pritom; Ahammad, Mejbah; Nazmul Alam, Shah Md; Saifur Rahman, Md
Format:	Conference Proceeding
Language:	English
Published:	IEEE 13-12-2023
Subjects:	Classifier; Complexity theory; Natural language processing; NLP; Semantics; Sentence; Static VAr compensators; Text categorization; Vectors; Web app; Writing

Description
Summary:	Text classification is a fundamental task in Natural Language Processing (NLP). This research presents a novel human-annotated dataset of 22,331 English sentences categorized into four classes (simple, complex, compound, complex-compound), together with a sentence classifier tool that analyzes and classifies sentences in English text, with particular relevance to literature writing. The study evaluates classification performance using three feature representation methods: Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and word embedding features. Four machine learning and two deep learning classifier models are evaluated. BoW combined with the Support Vector Classifier (SVC) and Logistic Regression (LR) achieved high accuracy in distinguishing sentence complexity. The models built on word embedding features, specifically LSTM and RNN, offer a more profound semantic representation. LSTM achieves the highest accuracy, 98.03%, with balanced precision and recall and an average F1-score of 97%. RNN, slightly less accurate at 97.75%, nevertheless captures sentence structure dependencies well. The study offers valuable insights for practical applications and contributes to the broader understanding of sentence structures and semantics.
DOI:	10.1109/ICCIT60459.2023.10441159
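
Illustration: the summary names Bag-of-Words and TF-IDF as two of the feature representations. The following is a minimal, standard-library sketch of how such vectors can be computed. It is not the authors' implementation: the toy sentences, the whitespace tokenizer, the function names, and the unsmoothed weighting tf * log(N / df) are all assumptions made for illustration, and libraries such as scikit-learn use a smoothed variant by default.

```python
import math
from collections import Counter

# Hypothetical toy sentences; they are NOT taken from the paper's dataset.
sentences = [
    "the cat sat",
    "the dog barked because the cat ran",
    "the cat sat and the dog barked",
]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_vocab(docs):
    """Map each distinct token to a column index, in sorted token order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def bow_vector(doc, vocab):
    """Bag-of-Words: raw count of each vocabulary token in the sentence."""
    counts = Counter(tokenize(doc))
    return [counts.get(tok, 0) for tok in vocab]

def tfidf_vectors(docs, vocab):
    """TF-IDF with the assumed weighting tf * log(N / df), no smoothing."""
    n_docs = len(docs)
    # Document frequency: number of sentences containing each token.
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    return [
        [count * idf[tok] for count, tok in zip(bow_vector(doc, vocab), vocab)]
        for doc in docs
    ]

vocab = build_vocab(sentences)
bow = [bow_vector(s, vocab) for s in sentences]
tfidf = tfidf_vectors(sentences, vocab)

for sentence, counts in zip(sentences, bow):
    print(sentence, "->", counts)
```

In this sketch a token that appears in every sentence (here "the") receives a TF-IDF weight of zero, while a token found in only one sentence, such as the subordinating conjunction "because", keeps a positive weight. Either vector type could then be passed to a classifier such as SVC or LR.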