Multidisciplinary classification for Indonesian scientific articles abstract using pre-trained BERT model
Published in: | International journal of advances in intelligent informatics Vol. 9; no. 2; pp. 331 - 346 |
---|---|
Format: | Journal Article |
Language: | English |
Published: | Universitas Ahmad Dahlan, 01-07-2023 |
Summary: | Scientific articles increasingly contain multidisciplinary content, which makes it difficult for researchers to find relevant information. Some submissions are also irrelevant to the journal's discipline. Categorizing articles and assessing their relevance can aid both researchers and journals. Existing research still focuses on single-category prediction. This research therefore takes a new approach: multidisciplinary classification of Indonesian scientific article abstracts using a pre-trained BERT model, showing the relevance between each category in an abstract. The dataset consisted of 9,000 abstracts across 9 disciplinary categories, and text preprocessing was applied to it. The classification model was built by combining the pre-trained BERT model with an artificial neural network. Hyperparameter tuning was performed to find the best combination of batch size, learning rate, number of epochs, and data ratio. The best combination was a learning rate of 1e-5, a batch size of 32, 3 epochs, and a data ratio of 9:1, giving a validation accuracy of 90.8%. The model's confusion matrix results were compared with those produced by experts; in this comparison, the highest accuracy obtained by the model was 99.56%. A software prototype used the most accurate model to classify new data, displaying the top two prediction probabilities and the dominant category. This research produces a model that can be used to solve problems related to Indonesian text classification. |
---|---|
ISSN: | 2442-6571 |
DOI: | 10.26555/ijain.v9i2.1051 |
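
The summary notes that the prototype displays the top two prediction probabilities and the dominant category. A minimal sketch of that last step, assuming the classifier emits one raw score (logit) per category and that a softmax turns those scores into probabilities, might look like the following. The category labels, the function names, and the example logits are all hypothetical; the record does not name the nine categories or describe the prototype's code.

```python
import math

# Hypothetical placeholder labels: the summary states there are 9
# disciplinary categories but does not name them.
CATEGORIES = [f"category_{i}" for i in range(1, 10)]


def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_two(logits, labels=CATEGORIES):
    """Return the two most probable (label, probability) pairs, highest first."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per category")
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:2]


# Made-up logits for a single abstract, for illustration only.
example_logits = [0.2, 2.9, 0.1, 2.1, -0.5, 0.0, 0.3, -1.0, 0.4]
predictions = top_two(example_logits)
dominant_label = predictions[0][0]  # the first entry is the dominant category
print(dominant_label, predictions)
```

Reporting the second-ranked category alongside the dominant one is what lets an abstract be read as multidisciplinary: when the two probabilities are close, the abstract plausibly belongs to both disciplines.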