[1] L. M. Dang, K. Min, H. Wang, M. J. Piran, C. H. Lee, and H. Moon, "Sensor-based and vision-based human activity recognition: A comprehensive survey," Pattern Recognition, vol. 108, p. 107561, 2020.
[2] F. Shafizadegan, A. R. Naghsh-Nilchi, and E. Shabaninia, "Multimodal vision-based human action recognition using deep learning: A review," Artificial Intelligence Review, 2024, accepted for publication.
[3] N. Imanpour, A. R. Naghsh-Nilchi, A. Monadjemi, H. Karshenas, K. Nasrollahi, and T. B. Moeslund, "Memory- and time-efficient dense network for single-image super-resolution," IET Signal Processing, vol. 15, no. 2, pp. 141-152, 2021.
[4] M. Liu, H. Liu, and C. Chen, "Enhanced skeleton visualization for view invariant human action recognition," Pattern Recognition, vol. 68, pp. 346-362, 2017.
[5] C. Li, Y. Hou, P. Wang, and W. Li, "Joint distance maps based action recognition with convolutional neural networks," IEEE Signal Processing Letters, vol. 24, no. 5, pp. 624-628, 2017.
[6] Y. Kong and Y. Fu, "Human action recognition and prediction: A survey," International Journal of Computer Vision, vol. 130, no. 5, pp. 1366-1401, 2022.
[7] A. Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
[8] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[9] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, "Transformers in vision: A survey," ACM Computing Surveys, vol. 54, no. 10s, pp. 1-41, 2022.
[10] D. Neimark, O. Bar, M. Zohar, and D. Asselmann, "Video transformer network," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3163-3172.
[11] E. Shabaninia and H. Nezamabadi-pour, "Skeleton-based human action recognition using joint distance images and vision transformers," presented at the ICCE, 2023.
[12] P. Wang, W. Li, P. Ogunbona, J. Wan, and S. Escalera, "RGB-D-based human motion recognition with deep learning: A survey," Computer Vision and Image Understanding, vol. 171, pp. 118-139, 2018.
[13] Q. Ke, S. An, M. Bennamoun, F. Sohel, and F. Boussaid, "SkeletonNet: Mining deep part features for 3-D action recognition," IEEE Signal Processing Letters, vol. 24, no. 6, pp. 731-735, 2017.
[14] C. Caetano, J. Sena, F. Brémond, J. A. Dos Santos, and W. R. Schwartz, "SkeleMotion: A new representation of skeleton joint sequences based on motion information for 3D action recognition," in 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2019, pp. 1-8.
[15] J. Liu, N. Akhtar, and A. Mian, "Skepxels: Spatio-temporal image representation of human skeleton joints for action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[16] K. Cheng, Y. Zhang, X. He, W. Chen, J. Cheng, and H. Lu, "Skeleton-based action recognition with shift graph convolutional network," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 183-192.
[17] Y. Li, Z. He, X. Ye, Z. He, and K. Han, "Spatial temporal graph convolutional networks for skeleton-based dynamic hand gesture recognition," EURASIP Journal on Image and Video Processing, vol. 2019, no. 1, p. 78, 2019.
[18] S. Yan, Y. Xiong, and D. Lin, "Spatial temporal graph convolutional networks for skeleton-based action recognition," in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[19] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, "NTU RGB+D: A large scale dataset for 3D human activity analysis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1010-1019.
[20] J. Liu, G. Wang, L.-Y. Duan, K. Abdiyeva, and A. C. Kot, "Skeleton-based human action recognition with global context-aware attention LSTM networks," IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1586-1599, 2018.
[21] I. Lee, D. Kim, S. Kang, and S. Lee, "Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1012-1020.
[22] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng, "View adaptive neural networks for high performance skeleton-based human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 1963-1978, 2019.
[23] P. Zhang, J. Xue, C. Lan, W. Zeng, Z. Gao, and N. Zheng, "EleAtt-RNN: Adding attentiveness to neurons in recurrent neural networks," IEEE Transactions on Image Processing, vol. 29, pp. 1061-1073, 2019.
[24] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," in International Conference on Machine Learning, 2021, pp. 10347-10357.
[25] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in European Conference on Computer Vision, 2020, pp. 213-229.
[26] Y. Wang et al., "End-to-end video instance segmentation with transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8741-8750.
[27] X. Li, Y. Hou, P. Wang, Z. Gao, M. Xu, and W. Li, "Trear: Transformer-based RGB-D egocentric action recognition," IEEE Transactions on Cognitive and Developmental Systems, 2021.
[28] E. Shabaninia, H. Nezamabadi-pour, and F. Shafizadegan, "Multimodal action recognition: a comprehensive survey on temporal modeling," Multimedia Tools and Applications, pp. 1-51, 2023.
[29] C. Plizzari, M. Cannici, and M. Matteucci, "Spatial temporal transformer network for skeleton-based action recognition," in International Conference on Pattern Recognition, 2021, pp. 694-701.
[30] Y. Sun, Y. Shen, and L. Ma, "MSST-RT: Multi-stream spatial-temporal relative transformer for skeleton-based action recognition," Sensors, vol. 21, no. 16, p. 5339, 2021.
[31] Y.-B. Cheng, X. Chen, J. Chen, P. Wei, D. Zhang, and L. Lin, "Hierarchical transformer: Unsupervised representation learning for skeleton-based human action recognition," in 2021 IEEE International Conference on Multimedia and Expo (ICME), 2021, pp. 1-6.
[32] Y.-B. Cheng, X. Chen, D. Zhang, and L. Lin, "Motion-transformer: self-supervised pre-training for skeleton-based action recognition," in Proceedings of the 2nd ACM International Conference on Multimedia in Asia, 2021, pp. 1-6.
[33] F. Shafizadegan, A. R. Naghsh-Nilchi, and E. Shabaninia, "Hybrid embedding for few-frames action recognition using vision transformers," under review in International Journal of Multimedia Information Retrieval.
[34] J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C. Kot, "NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 10, pp. 2684-2701, 2019.
[35] J. Do and M. Kim, "SkateFormer: Skeletal-temporal transformer for human action recognition," arXiv preprint arXiv:2403.09508, 2024.
[36] B. L. N. Huu and T. Matsui, "STEP CATFormer: Spatial-temporal effective body-part cross attention transformer for skeleton-based action recognition," 2022.
[37] R. Vemulapalli, F. Arrate, and R. Chellappa, "Human action recognition by representing 3D skeletons as points in a Lie group," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 588-595.
[38] Y. Du, W. Wang, and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1110-1118.
[39] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, "ViViT: A video vision transformer," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6836-6846.
[40] Z. Tong, Y. Song, J. Wang, and L. Wang, "VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training," Advances in Neural Information Processing Systems, vol. 35, pp. 10078-10093, 2022.
[41] H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou, "Going deeper with image transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 32-42.