Research on sign language recognition method based on deep learning


Yu Deshui

   Chongqing University of Technology   400000

Abstract: Sign language, a unique and expressive form of human communication, is a key component of communication within the hearing-impaired community. With the ongoing advances in deep learning, RGB video-based methods for sign language recognition have become increasingly mainstream in recent years. This paper proposes a sign language recognition network model based on the vision transformer, a temporal attention mechanism, and a BLSTM network (ViT-TA-BLSTM) to address the shortcomings of traditional convolutional temporal methods in dynamic isolated sign language recognition. The network is evaluated on the CSL-500 dataset. With RGB video input, the test results show that the ViT-TA-BLSTM model achieves an accuracy of 93.49%; compared with other isolated-word sign language recognition methods, the proposed method further improves recognition accuracy.

Keywords: BLSTM; sign language recognition; deep learning; convolutional neural network; temporal modeling

0 Introduction

The results of China's second national sample survey of people with disabilities show that the number of people with hearing disabilities has reached 27.8 million, and the broader hearing-impaired population is even larger, at approximately 72 million [1]. This indicates that the market prospects for sign language recognition technology are very broad. Yet according to the 2017 Hundred Cities Barrier-Free Facilities Survey Experience Report, the overall coverage rate of barrier-free facilities in China is only 40.6% [2].

This situation urgently needs to change, underscoring the necessity of developing sign language recognition technology. With the rapid advancement of AI, society as a whole looks forward to accessibility research driven by technological innovation. Continued research and development are expected to further advance societal accessibility and provide the hearing-impaired community with more practical and effective sign language recognition services.

1 Current status of the study

Sign language recognition has been the subject of extensive research, encompassing both fundamental theory and technology. Sign language recognition technologies can be classified into three research directions. The first uses wearable sensors to recognize signs. Data gloves were first introduced in 1977, when Tom DeFanti, Daniel Sandin, and others developed the Sayre Glove, a glove-based sensor system for gesture recognition. In 1983, the American inventor G. J. Grimes created a device capable of identifying 72 alphabetic characters. Using the "data glove" as their primary research tool, Fujitsu Laboratories in Japan carried out research on gesture recognition technology in 1991; their experiments achieved accurate and comprehensive recognition of 46 movements.

The second type of sign language recognition technology is based on traditional algorithms. ① Planning classification methods. These mainly include the Support Vector Machine (SVM), which can accomplish different recognition tasks by changing the kernel function; for example, quadratic and cubic SVMs have been used in sign language recognition. ② Temporal classification methods. These mainly include Dynamic Time Warping (DTW), the Hidden Markov Model (HMM), and Connectionist Temporal Classification (CTC). They are simple and robust, but training is slow and they are prone to misclassification and missed classification, so they are usually applied to static sign language recognition [3]. The DTW technique matches and recognizes sign language sequences by finding the shortest warping path and can be applied to dynamic sign language recognition.

The third direction is deep learning-based sign language recognition, in which neural networks are used for high-level feature extraction and classification. Maruyama et al. used a multi-stream framework to construct an I3D model that combines multiple features such as hand shape, facial expression, and skeleton information, and achieved a highest recognition accuracy of 87.47% on the WLASL 2000 dataset [4]. The I3D network has a deeper structure; it not only exhibits higher recognition accuracy on isolated-word datasets but also achieves stable convergence of parameters in complex contexts. Abdullahi et al., on the other hand, applied a generative model to discriminative classifiers using fast Fisher vectors to effectively represent high-dimensional features. They combined a bidirectional long short-term memory network, utilized 3D hand-skeleton motion, orientation, and angle information from a somatosensory system, and fused body feature information from videos into the training model to further improve the accuracy of sign language recognition [5].

A review of the current research status shows that traditional model networks and deep learning are the two main areas of sign language recognition research. Data sensors and machine learning techniques dominate traditional sign language research. Although sensors can collect data accurately in real time, expensive equipment and cumbersome wearing procedures seriously limit their adoption. Machine learning methods, in turn, handle limited datasets well, but when dealing with large-scale datasets they struggle to capture the intrinsic relationships within the data. Therefore, this paper explores new methods for sign language recognition from a deep learning perspective.

2 ViT-TA-BLSTM sign language recognition model

2.1 Modeling framework

The dynamic isolated sign language recognition task can, in essence, be regarded as a special form of video classification. In this paper, we propose a new dynamic isolated sign language recognition network model based on the vision transformer, temporal attention, and a BLSTM network, called ViT-TA-BLSTM. This scheme can effectively realize the dynamic isolated sign language recognition task; its network structure is detailed in Fig. 1.

Fig. 1 ViT-TA-BLSTM network structure

The ViT-TA-BLSTM dynamic isolated sign language recognition framework consists of three core modules, covering data input and output, image feature extraction, and temporal feature modeling. When processing sign language video data, the vision transformer (ViT) is first used to extract intermediate feature vectors from the sequence of images. To further enhance the model's ability to capture key information, a temporal attention layer is introduced between ViT and the BLSTM. This attention layer computes the correlation between queries and keys and weights the values, producing a more focused feature representation. The processed features are then fed into the BLSTM network to capture the temporal dependencies of the sign language actions. Finally, a Softmax classifier performs classification and recognition of the dynamic isolated signs.
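For concreteness, the following is a minimal PyTorch sketch of this data flow. It assumes a torchvision ViT-B/16 backbone as the per-frame feature extractor, uses nn.MultiheadAttention as a stand-in for the temporal attention layer detailed in Section 2.3, and adopts illustrative sizes (768-d features, 512 hidden units, 500 classes); it is not the authors' released implementation.

import torch
import torch.nn as nn
from torchvision.models import vit_b_16

class ViTTABLSTM(nn.Module):
    """Sketch of the ViT -> temporal attention -> BLSTM -> classifier pipeline.
    Backbone choice and all sizes are illustrative assumptions."""
    def __init__(self, feat_dim=768, hidden_dim=512, num_classes=500):
        super().__init__()
        self.vit = vit_b_16(weights=None)        # per-frame feature extractor
        self.vit.heads = nn.Identity()           # drop the ImageNet head, keep 768-d features
        # stand-in for the temporal attention layer of Section 2.3
        self.temporal_attn = nn.MultiheadAttention(feat_dim, num_heads=1, batch_first=True)
        self.blstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, frames):                   # frames: (B, T, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.vit(frames.flatten(0, 1)).view(b, t, -1)   # (B, T, 768)
        feats, _ = self.temporal_attn(feats, feats, feats)      # weight key time steps
        out, _ = self.blstm(feats)                               # (B, T, 2 * hidden_dim)
        return self.classifier(out[:, -1])       # logits; Softmax is applied in the loss

model = ViTTABLSTM()
logits = model(torch.randn(2, 16, 3, 224, 224))  # two clips of 16 RGB frames each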

2.2 Vision Transformer

Before the Vision Transformer, computer vision tasks had long relied on CNNs for feature extraction from image data, but CNNs struggle to model global information. By introducing self-attention and multi-head attention mechanisms, ViT better captures long-range dependencies within images and thus handles global information in image recognition tasks more effectively [6].

The ViT model consists of image serialization, linear transformation, positional encoding, a Transformer encoder, and a classifier composed of a fully connected network. First, image serialization splits the whole picture into nine patches; a linear transformation then flattens the image information into one-dimensional vectors and adjusts their size. To preserve positional information in subsequent operations, a classification token (class token) and position encoding (Position Embedding) are added to the vectors, which are then fed into the Transformer model as a complete sequence [7].
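As an illustration of the serialization step, the sketch below builds the patch, class-token, and position embeddings in PyTorch; a 96x96 input with 32x32 patches reproduces the nine-part split mentioned above, and all sizes are illustrative assumptions rather than the paper's exact configuration.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Image serialization + linear projection + class token + position embedding (sketch)."""
    def __init__(self, img_size=96, patch_size=32, in_ch=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2            # 3 x 3 = 9 patches
        # a convolution with stride == patch size splits and linearly projects in one step
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):                                      # x: (B, 3, 96, 96)
        patches = self.proj(x).flatten(2).transpose(1, 2)      # (B, 9, embed_dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)         # prepend the class token
        tokens = torch.cat([cls, patches], dim=1) + self.pos_embed
        return tokens                                          # fed to the Transformer encoder

tokens = PatchEmbedding()(torch.randn(2, 3, 96, 96))           # shape (2, 10, 768)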

2.3 Principle of temporal attention

The attention mechanism is a key technique in computer vision; its central idea is to give neural networks the ability to selectively focus on different parts of the input data, enhancing the model's perception of important information. The sign language recognition system in this paper adopts a temporal attention mechanism: the model learns relevance weights for each time step, helping it understand the temporal order and evolution of the data.

Fig. 2 Structural principle of temporal attention mechanism

As shown in Fig. 2, the three vectors are Q, K, and V. Through dot-product operations among these vectors, the temporal attention mechanism weights the time periods of interest, i.e., it captures the key information within a time period, thereby enhancing the LSTM's ability to capture dynamic features along the time dimension. This mechanism effectively mitigates the LSTM's weakness in extracting temporal features. The core formulation of the temporal attention mechanism is as follows:

$$V = S_t^{\top}$$
$$B = \mathrm{Softmax}(W_t V + b)$$
$$A = B\,S_t$$

where V is the transpose of the input matrix S_t, which is processed by the neural network to obtain the unnormalized temporal probability weight matrix; W_t is the weight matrix of the neural network; b is the bias vector; B is the temporal probability weight matrix normalized by the Softmax activation function, with the probabilities in each row summing to 1; and A is the final output of the temporal attention (i.e., Time Attention) [8].
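A minimal PyTorch sketch of this temporal weighting is given below. It scores each time step with a single linear layer (playing the role of W_t and b), normalizes the scores with Softmax, and re-weights the frame features before they enter the BLSTM; the single-layer scoring network is an assumption and may differ from the authors' exact layer.

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Sketch of temporal attention: score each time step, normalize with Softmax,
    and re-weight the input sequence (the scoring layer stands in for W_t and b)."""
    def __init__(self, feat_dim=768):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)       # one unnormalized weight per time step

    def forward(self, s):                         # s: (B, T, D) per-frame features S_t
        e = self.score(s).squeeze(-1)             # unnormalized temporal weights, (B, T)
        b_weights = torch.softmax(e, dim=1)       # B: weights over time steps sum to 1
        return s * b_weights.unsqueeze(-1)        # A: re-weighted sequence, fed to the BLSTM

attn = TemporalAttention()
weighted = attn(torch.randn(2, 16, 768))          # two clips, 16 frames, 768-d features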


2.4 Bidirectional Long Short-Term Memory Network (BLSTM)

The BLSTM, or Bidirectional Long Short-Term Memory network, is constructed jointly from a forward LSTM and a backward LSTM. In computer vision tasks, the forward LSTM captures forward contextual information, while the backward LSTM captures information from the reverse direction.

Fig. 3 BLSTM structure


In the BLSTM network shown in Fig. 3, the forward and backward layers are iterated continuously, and the two LSTM layers jointly determine the output. For time steps t = 1 to T, the computation is as follows:

$$\overrightarrow{h}_t = H\left(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}\right)$$
$$\overleftarrow{h}_t = H\left(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}}\right)$$
$$y_t = W_{\overrightarrow{h}y}\,\overrightarrow{h}_t + W_{\overleftarrow{h}y}\,\overleftarrow{h}_t + b_y$$

In these equations, H denotes the activation function, and W and b denote the weights and biases, respectively; $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ denote the states of the forward and backward LSTMs at time step t; and $y_t$ is the output at time step t. Owing to this special structure, information from both the forward and backward inputs is reflected at every moment, so the BLSTM network can fully learn time-series information [9].
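As a usage note, PyTorch's nn.LSTM with bidirectional=True runs this pair of forward and backward recurrences and concatenates their hidden states at every time step; the sizes in the sketch below are illustrative assumptions.

import torch
import torch.nn as nn

# Bidirectional LSTM: forward and backward states are concatenated per time step.
blstm = nn.LSTM(input_size=768, hidden_size=512, batch_first=True, bidirectional=True)
feats = torch.randn(2, 16, 768)        # (batch, time steps, per-frame feature dim)
out, (h_n, c_n) = blstm(feats)
print(out.shape)                       # torch.Size([2, 16, 1024]): forward + backward states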

3 Experimental analysis

3.1 Experimental dataset and evaluation indicators

The experiments in this paper use the CSL-500 dataset collected by the University of Science and Technology of China. The dataset covers 500 categories of isolated words, each recorded five times by each of 50 different signers, giving a total of 125,000 data samples [10]. This study uses the RGB video data from CSL-500; the video resolution is 1280×720 at a stable frame rate of 30 fps, which records the sign language actions clearly and smoothly.

This paper focuses on Chinese isolated-word recognition and uses the accuracy (ACC) metric to evaluate the Chinese isolated-word sign language recognition model.
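For clarity, the ACC metric used here is top-1 classification accuracy; a minimal sketch:

import torch

def accuracy(logits, labels):
    """Top-1 accuracy (ACC): fraction of clips whose predicted class matches the label."""
    return (logits.argmax(dim=1) == labels).float().mean().item()

# toy example with 4 clips and 500 isolated-word classes
print(accuracy(torch.randn(4, 500), torch.tensor([3, 17, 499, 0])))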

3.2 Experimental parameter settings

The network models designed in this paper are all built on the PyTorch deep learning framework, and training and testing were completed in a Windows 10 environment. The hardware used in the experiments is listed in Table 1.

Table 1 Hardware environment equipment list

Hardware            Model
CPU                 Intel(R) Core(TM) i5-9300H CPU @ 2.40 GHz
Clock frequency     2.4 GHz
RAM                 32 GB
Graphics card       NVIDIA GeForce RTX 3060
Disk                500 GB SSD + 1 TB mechanical HDD

In the deep learning experiments, the dataset was first divided into a training set (80% of the total) and a test set (20%, 25,000 samples). The network converged after 18 training epochs. To verify the superiority of the proposed method, the model is compared with other sign language recognition models on CSL-500, including C3D, CBAM-C3D, 3D-ResNet, CBAM-3D-ResNet, and CBAW-ResNet-BLSTM; the experiments demonstrate the superiority and feasibility of the proposed method. The specific parameter configuration is given in Table 2, followed by a minimal training-loop sketch under these settings.

Table 2 Experimental parameter setting table

Parameter                  Value
GPU                        4080
Operating system           Windows 11
Development language       Python 3.8.10
Deep learning framework    PyTorch 1.13.1
Learning rate              0.0001
Optimizer                  Adam
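The sketch below illustrates a training loop under the Table 2 settings (Adam optimizer, learning rate 0.0001, 18 epochs) and the 80/20 split described above; the dataset object and model are placeholders, not the authors' released code.

import torch
from torch.utils.data import DataLoader, random_split

def train(model, dataset, epochs=18, lr=1e-4, batch_size=8):
    """Train a sign language recognition model with the Table 2 settings (sketch)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    n_train = int(0.8 * len(dataset))                      # 80% training split
    train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()                # applies Softmax internally
    model.to(device).train()
    for _ in range(epochs):
        for frames, labels in loader:                      # frames: (B, T, 3, H, W)
            frames, labels = frames.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(frames), labels)
            loss.backward()
            optimizer.step()
    return model, test_set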

3.3 Experimental results and analysis

In order to verify the effectiveness of the BLSTM network model proposed in this paper that incorporates the vision transformer and temporal attention mechanism, this paper conducts comparative experiments between the ViT-TA-BLSTM model and other models on the CSL-500 dynamic isolated sign language dataset.

From the analysis of the experimental data in Table 3, it can be seen that deep learning networks have a large inherent advantage over the traditional HMM model, and that the accuracy of video feature extraction achieved by other researchers using C3D or ResNet networks keeps improving. Performance is further enhanced by networks that incorporate channel attention mechanisms. Thanks to their unique advantage in temporal learning, the sign language recognition models that add BLSTM networks all achieve better performance.

However, existing sign language recognition still faces several difficulties: it is hard to capture dynamic features in fast and brief sign language action videos, the key time periods of sign language actions receive insufficient weighting, and the coherence between actions is poor. Further improving recognition accuracy therefore requires more specialized sign language recognition techniques. ViT has been widely used in image recognition in recent years and is superior to CNNs on large-scale data, while temporal attention captures the subtle changes of short-duration sign language actions better than channel attention. This paper therefore innovatively designs a BLSTM network model integrating the vision transformer and a temporal attention mechanism (ViT-TA-BLSTM), replacing the ResNet network with ViT and channel attention with temporal attention. The superiority of the ViT-TA-BLSTM sign language recognition network designed in this paper is then verified experimentally.

Eventually, the accuracy of the sign language recognition model on the training and test sets gradually stabilizes after 18 training epochs. To visualize the trend of the experimental results, the corresponding curves were plotted using the TensorBoard plotting library under the Python framework, as shown in Figures 4 and 5.

Fig. 4 ACC curve of the ViT-TA-BLSTM network

Fig. 5 Loss curve of the ViT-TA-BLSTM network

Fig. 4 shows the ACC trend over the training process of the ViT-TA-BLSTM network model, where the horizontal axis represents the number of iterations and the vertical axis represents the accuracy, which lies within [0, 1]. The accuracy of the ViT-TA-BLSTM network on sign language isolated-word recognition reaches 93.49%, and the sign language recognition method proposed in this paper outperforms the other methods.

Table 3 Comparison of the final experimental results of this paper's method with other methods on the CSL-500 dataset

Method                                       Accuracy (%)
HMM (RGB) [11]                               68.90
C3D (RGB) [11]                               70.53
CBAM-C3D (RGB) [11]                          71.77
3D-ResNet (RGB) [11]                         79.52
CBAM-3D-ResNet (RGB) [11]                    82.90
ResNet-LSTM (RGB) [11]                       82.34
ResNet-BLSTM (RGB) [11]                      85.76
CBAW-ResNet-BLSTM (RGB) [12]                 89.90
ViT-TA-BLSTM (this paper's method, RGB)      93.49

4 Summary and outlook

This paper innovatively proposes the ViT-TA-BLSTM network model. Previous sign language recognition experiments mostly used convolutional networks for feature extraction, which makes it difficult to capture dynamic features in fast and brief sign language action videos. By introducing ViT, with its superior feature extraction performance on sign language video, the self-attention mechanism can attend more completely to the features of different sign language actions; the temporal attention and BLSTM modules then learn the temporal characteristics of sign language features while enhancing the attention weights along the time dimension, thereby addressing the weak action representation in sign language isolated-word recognition. The model aims to solve the problems of poor sign language characterization ability, improper allocation of weights to key time periods, and insufficient temporal modeling.

It should be noted that current research on sign language recognition is mostly based on large-scale datasets, and the designed recognition networks tend to be complex and subject to greater application limitations, so most research results remain confined to laboratory environments. Future research could further investigate how to improve the practicality and user experience of sign language recognition systems, and could attempt lighter-weight algorithms for sign language recognition.

References:

[1] WU Weihua, WANG Nian, ZHANG Yikai. Digital Youth in the Silent Wall--A Study of Information Accessibility and Knowledge Acquisition of Hearing-Impaired Youth in Beijing[J]. Journal of Education, 2023, 19(04): 143-155. DOI: 10.14082/j.cnki.1673-1298.2023.04.014.

[2] Qing Zujie. Reflections on the development of the cause of the disabled in the new era: background, problems and suggestions[J]. Modern Special Education, 2019, (08): 3-8.

[3] Tao Tangfei, Liu Tianyu. A review of sign language recognition techniques based on sign language expression content and expression features[J]. Journal of Electronics and Information, 2023, 45(10): 3439-3457.

[4] MARUYAMA M, GHOSE S, INOUE K, et al. Word-level sign language recognition with multi-stream neural networks focusing on local regions[J]. arXiv: 2106.15989, 2021.

[5] ABDULLAHI S B, CHAMNONGTHAI K. American sign language words recognition using spatio-temporal prosodic and angle features: A sequential learning approach[J]. IEEE Access, 2022, 10: 15911-15923.

[6] YU J, WANG Z, VASUDEVAN V, et al. CoCa: Contrastive Captioners are Image-Text Foundation Models[J]. arXiv: 2205.01917, 2022.

[7] Huang Muhao. Research on image classification algorithm based on ViT[D]. Hainan University, 2023. DOI: 10.27073/d.cnki.ghadu.2023.000249.

[8] Su Yan, Fu Jiayuan, Lin Chuan, et al. A dynamic deformation prediction model for dams based on temporal attention mechanism[J]. Journal of Hydropower Generation, 2022, 41(07): 72-84.

[9] Fan Zhe. Research and application of parallel acceleration for deep learning[D]. Hunan University, 2020. DOI: 10.27135/d.cnki.ghudu.2020.003419.

[10] Xiao Q, Chang X, Zhang X, et al. Multi-Information Spatial-Temporal LSTM Fusion Continuous Sign Language Neural Machine Translation[J]. IEEE Access, 2020, 8(12): 718-728.

[11]Wang Fanhua, Zhang Qiang, Huang Chao et al. Fusion of two-stream 3D convolution and attention mechanism for dynamic gesture recognition[J]. Journal of Electronics and Information, 2021, 43(05):1389-1396.

[12]Huang Yanglai. Research on dynamic Chinese sign language recognition based on deep learning[D]. Northeast Power University,2023.DOI:10.27008/d.cnki.gdbdc.2023.000381.

[13] Wang Z, Xiong C, Zhang Q. Enhancing the online estimation of finger kinematics from sEMG using LSTM with attention mechanisms[J]. Biomedical Signal Processing and Control, 2024, 92: 105971.

[14] Ur Rehman Z, Qiang Y, Wang L, et al. Effective lung nodule detection using deep CNN with dual attention mechanisms[J]. Scientific Reports, 2024, 14(1): 3934.

[15] Liu L. Comparative study of classifier evaluation indexes MCC, CEN and ACC[D]. Tianjin Normal University, 2019.

[16] PUGEAULT N, BOWDEN R. Spelling it out: Real-time ASL fingerspelling recognition[C]. 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, 2011: 1114-1119. DOI: 10.1109/ICCVW.2011.6130290.

[17] ESCALERA S, BARO X, GONZALEZ J, et al. ChaLearn Looking at People Challenge 2014: dataset and results[C]. The European Conference on Computer Vision, Zurich, Switzerland, 2015: 450-473. DOI: 10.1007/978-3-319-16178-5_32.

[18] Liu M, Bao Y, Liang Y, et al. Spatial-temporal Asynchronous Normalization for Unsupervised 3D Action Representation Learning[J]. IEEE Signal Processing Letters, 2022, 29(5): 632-636.

[19] Chen J, Chen S, Bai M, et al. Graph Decoupling Attention Markov Networks for Semi-supervised Graph Node Classification[J]. IEEE Transactions on Neural Networks and Learning Systems, 2022, 10(3): 102-115.

[20] Jia Z, Cai X, Jiao Z. Multi-modal Physiological Signals Based Squeeze-and-excitation Network with Domain Adversarial Learning for Sleep Staging[J]. IEEE Sensors Journal, 2022, 22(4): 3464-3471.

[21]Lu E, Hu X. Image Super-resolution Via Channel Attention and Spatial Attention[J]. Applied Intelligence, 2022, 52(2):2260-2268.