Incorporating Transformer Networks and Joint Distance Images into Skeleton-driven Human Activity Recognition

Document Type: Research Article

Authors

1 Department of Applied Mathematics, Graduate University of Advanced Technology, Kerman, Iran

2 Department of Artificial Intelligence, Faculty of Computer Engineering, University of Isfahan, Isfahan, Iran

3 Department of Electrical Engineering, Shahid Bahonar University of Kerman, Kerman, Iran

Abstract

Skeleton-based action recognition has attracted significant attention in computer vision. In recent years, Transformer networks have improved action recognition owing to their ability to capture long-range dependencies and relationships in sequential data. In this context, a novel approach is proposed to enhance skeleton-based activity recognition by introducing Transformer self-attention alongside Convolutional Neural Network (CNN) architectures. The proposed method exploits the 3D distances between pairwise joints to generate a Joint Distance Image (JDI) for each frame. These JDIs offer a relatively view-independent representation, allowing the model to discern fine details of human actions. To further enhance the model's understanding of spatial features and relationships, the JDIs extracted from different frames are either fed directly into the Transformer network or first passed through a CNN to extract salient spatial features. The resulting features, combined with positional embeddings, serve as input to a Transformer encoder, enabling the model to reconstruct the underlying structure of the action from the training data. Experimental results demonstrate the effectiveness of the proposed method, with performance comparable to other state-of-the-art Transformer-based approaches on benchmark datasets such as NTU RGB+D and NTU RGB+D 120. Incorporating Transformer networks and Joint Distance Images thus presents a promising avenue for advancing skeleton-based human action recognition, offering robust performance and improved generalization across diverse action datasets.
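
As a concrete illustration of the pipeline described above, the sketch below computes a per-frame Joint Distance Image from 3D joint coordinates and feeds the sequence of JDIs, together with learned positional embeddings, into a standard Transformer encoder. This is a minimal sketch only: the names joint_distance_image and JDITransformer, the 25-joint/64-frame dimensions, and the PyTorch nn.TransformerEncoder configuration are illustrative assumptions, not the authors' exact architecture (which may also route the JDIs through a CNN before the encoder).

# Minimal sketch of a JDI + Transformer pipeline; all module names,
# dimensions, and hyper-parameters are illustrative, not the paper's exact setup.
import numpy as np
import torch
import torch.nn as nn

def joint_distance_image(joints: np.ndarray) -> np.ndarray:
    """Pairwise 3D Euclidean distances between joints for one frame.

    joints: (N, 3) array of joint coordinates.
    Returns an (N, N) matrix normalized to [0, 1]; pairwise distances are
    invariant to camera rotation and translation, which gives the JDI its
    relative view-independence.
    """
    diff = joints[:, None, :] - joints[None, :, :]   # (N, N, 3) pairwise differences
    dist = np.linalg.norm(diff, axis=-1)             # (N, N) Euclidean distances
    return dist / dist.max() if dist.max() > 0 else dist

class JDITransformer(nn.Module):
    """Flatten each frame's JDI, project it to a token, add a learned
    positional embedding, and classify with a Transformer encoder."""

    def __init__(self, num_joints=25, num_frames=64, d_model=256,
                 nhead=8, num_layers=4, num_classes=60):
        super().__init__()
        self.proj = nn.Linear(num_joints * num_joints, d_model)      # per-frame token
        self.pos = nn.Parameter(torch.zeros(1, num_frames, d_model)) # positional embedding
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, jdis):                      # jdis: (B, T, N, N)
        B, T, N, _ = jdis.shape
        tokens = self.proj(jdis.reshape(B, T, N * N)) + self.pos[:, :T]
        encoded = self.encoder(tokens)            # (B, T, d_model)
        return self.head(encoded.mean(dim=1))     # average-pool over frames

# Illustrative usage with random stand-in data (64 frames, 25 joints as in NTU RGB+D).
T, N = 64, 25
sequence = np.random.rand(T, N, 3).astype(np.float32)               # stand-in skeleton data
jdis = np.stack([joint_distance_image(f) for f in sequence])        # (T, N, N)
logits = JDITransformer()(torch.from_numpy(jdis).unsqueeze(0))      # (1, num_classes)

Averaging the encoded frame tokens before classification is one simple temporal-pooling choice; a learnable class token, as in ViT, would serve equally well.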
