A vision transformer-based automated human identification using ear biometrics

Bibliographic Details
Published in:Journal of Information Security and Applications, Vol. 78, p. 103599
Main Authors: Mehta, Ravishankar, Shukla, Sindhuja, Pradhan, Jitesh, Singh, Koushlendra Kumar, Kumar, Abhinav
Format: Journal Article
Language:English
Published: Elsevier Ltd 01-11-2023
Description
Summary:In recent years, Vision Transformers (ViTs) have gained significant attention in computer vision for their impressive performance on a variety of tasks, including image recognition, as well as on language tasks such as machine translation, question answering, text classification, and image captioning. ViTs perform well on several benchmark image datasets, such as ImageNet, with fewer parameters and less computation than CNN-based models, and their self-attention mechanism takes over the feature-extraction role of the convolutional neural network (CNN). The proposed model provides a vision transformer-based framework for 2D ear recognition in which self-attention is applied jointly with a CNN. Adjustments and fine-tuning have been carried out according to the specific characteristics of the ear dataset and the desired performance requirements. In deep learning, CNNs have become the de facto choice largely because their inductive biases let them learn spatially local representations; learning global representations through the self-attention mechanism of ViTs further improves recognition accuracy. This is achieved by applying the transformer directly to a sequence of image patches for image classification. The proposed work uses various image patch sizes during model training; experimental analysis shows that a patch size of 16 × 16 yields the highest accuracy of 99.36%. The proposed model has been validated on the Kaggle and IITD-II datasets, and its efficiency relative to existing models is also reported.
ISSN:2214-2126
DOI:10.1016/j.jisa.2023.103599
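
The summary above describes a hybrid design in which 16 × 16 image patches are fed to a transformer encoder alongside convolutional feature extraction. The sketch below is a minimal, hypothetical PyTorch illustration of that patch-embedding and self-attention pipeline; the CNN stem, embedding dimension, depth, number of heads, and class count are illustrative assumptions, not the paper's reported architecture.

    # Hypothetical sketch of a hybrid CNN + Vision Transformer classifier for 2D ear images.
    # All hyperparameters below are illustrative; the record does not specify the paper's exact design.
    import torch
    import torch.nn as nn

    class HybridEarViT(nn.Module):
        def __init__(self, image_size=224, patch_size=16, embed_dim=256,
                     depth=6, num_heads=8, num_classes=100):
            super().__init__()
            # Small convolutional stem to capture spatially local features first.
            self.cnn_stem = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=3, padding=1),
                nn.BatchNorm2d(64),
                nn.ReLU(inplace=True),
            )
            # Patch embedding: split the feature map into non-overlapping 16x16 patches
            # and project each patch to a token of size embed_dim.
            self.patch_embed = nn.Conv2d(64, embed_dim,
                                         kernel_size=patch_size, stride=patch_size)
            num_patches = (image_size // patch_size) ** 2
            self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
            # Standard transformer encoder supplies the global self-attention.
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=embed_dim, nhead=num_heads,
                dim_feedforward=4 * embed_dim, batch_first=True)
            self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
            self.head = nn.Linear(embed_dim, num_classes)

        def forward(self, x):
            x = self.cnn_stem(x)                      # (B, 64, H, W)
            x = self.patch_embed(x)                   # (B, D, H/16, W/16)
            x = x.flatten(2).transpose(1, 2)          # (B, N, D) sequence of patch tokens
            cls = self.cls_token.expand(x.size(0), -1, -1)
            x = torch.cat([cls, x], dim=1) + self.pos_embed
            x = self.encoder(x)
            return self.head(x[:, 0])                 # classify from the [CLS] token

    if __name__ == "__main__":
        # num_classes would be set to the number of subjects in the ear dataset (assumed here).
        model = HybridEarViT(num_classes=100)
        logits = model(torch.randn(2, 3, 224, 224))
        print(logits.shape)                           # torch.Size([2, 100])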