Self-supervised Action Recognition Based on Skeleton Data Augmentation and Double Nearest Neighbor Retrieval

doi:10.11896/jsjkx.230500158

Computer Science, 2023, Vol. 50, Issue (11): 97-106.

• Database & Big Data & Data Science •

WU Yushan^1,2, XU Zengmin^1,2, ZHANG Xuelian^1,2, WANG Tao^3

1 School of Mathematics and Computing Science, Guangxi Colleges and Universities Key Laboratory of Data Analysis and Computation, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
2 Center for Applied Mathematics of Guangxi (Guilin University of Electronic Technology), Guilin, Guangxi 541002, China
3 School of Architecture and Transportation Engineering, Guangxi Key Laboratory of ITS, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China

Received: 2023-05-23 Revised: 2023-08-28 Online: 2023-11-15 Published: 2023-11-06
Corresponding author: XU Zengmin (xzm@guet.edu.cn)
About author: WU Yushan (wuyushan2929@163.com), born in 1998, postgraduate, is a member of China Computer Federation. Her main research interests include action recognition, self-supervised learning and applied mathematics. XU Zengmin, born in 1981, Ph.D, associate professor, is a member of China Computer Federation. His main research interests include human action recognition, multimodal semantic understanding, computer vision and pattern recognition.
Supported by:
National Natural Science Foundation of China (61862015, 52262047), Science and Technology Project of Guangxi (AD23023002, AD21220114, AD20159035) and Guangxi Key Research and Development Program (AB17195025).

Abstract: Traditional self-supervised methods for skeleton data usually treat different augmentations of the same sample as positives and all remaining samples as negatives. This severely unbalances the ratio of positive to negative samples and limits the contribution of samples that share the same semantic information. To address this problem, this paper proposes DNNCLR, a double nearest neighbor retrieval action recognition algorithm in which positive samples are not restricted by data augmentation. First, a new joint-level spatial data augmentation, Bodypart augmentation, is designed based on the physical connections of human joints: the input skeleton sequence is randomly replaced with a normally distributed array to obtain high-level semantic embeddings. Second, to keep positive samples from being restricted by data augmentation, a more reasonable double nearest neighbor retrieval (DNN) positive-sample expansion strategy is proposed, together with a double nearest neighbor retrieval contrastive loss (DNN Loss). Specifically, a support set is used for global retrieval, which extends the search range of the positive sample set to new data points that ordinary data augmentation cannot cover. The negative sample set, however, contains misjudged positives, namely skeleton samples that come from different videos but carry the same semantic information. Nearest neighbor retrieval is therefore applied a second time to find these potential positives in the negative sample set and expand the positive sample set again; the DNN Loss built on this expanded set forces the model to learn more general feature representations and makes model optimization more reasonable. Finally, the DNNCLR algorithm is applied to the AimCLR model to obtain the AimDNNCLR model, which is evaluated with linear evaluation on the NTU-RGB+D dataset. Compared with state-of-the-art models, the proposed method improves accuracy by 3.6% on average.

Key words: Contrastive learning, Nearest neighbor retrieval, Data augmentation, Action recognition, Human skeleton
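The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the joint grouping, the retrieval details, or the exact form of DNN Loss. The grouping in `BODY_PARTS`, the function names `bodypart_augment` and `dnn_contrastive_loss`, the parameters `tau` and `topk`, and the multi-positive InfoNCE form of the loss are all assumptions made for illustration.

```python
import numpy as np

# Assumed grouping of the 25 NTU RGB+D joints (0-indexed) into body parts.
# The paper's actual grouping is not stated in the abstract.
BODY_PARTS = {
    "trunk": [0, 1, 2, 3, 20],
    "left_arm": [4, 5, 6, 7, 21, 22],
    "right_arm": [8, 9, 10, 11, 23, 24],
    "left_leg": [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}


def bodypart_augment(seq, rng, n_parts=1):
    """Replace the joints of randomly chosen body parts with N(0, 1) values.

    seq has shape (C, T, V): channels, frames, joints. A copy is returned.
    """
    out = seq.copy()
    names = list(BODY_PARTS)
    for i in rng.choice(len(names), size=n_parts, replace=False):
        joints = BODY_PARTS[names[i]]
        noise = rng.standard_normal((seq.shape[0], seq.shape[1], len(joints)))
        out[:, :, joints] = noise
    return out


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def dnn_contrastive_loss(q, k, support, queue, tau=0.07, topk=1):
    """Hypothetical double nearest neighbor contrastive loss.

    q, k: (N, D) embeddings of two augmented views of the same samples.
    support: (S, D) support set; queue: (M, D) nominal negatives.
    """
    q, k, support, queue = map(l2_normalize, (q, k, support, queue))

    # First retrieval: nearest neighbor of each key in the support set.
    nn1 = support[np.argmax(k @ support.T, axis=1)]          # (N, D)

    # Second retrieval: the top-k most similar queue entries are treated
    # as potential positives instead of negatives.
    sim_queue = q @ queue.T                                   # (N, M)
    top = np.argsort(-sim_queue, axis=1)[:, :topk]
    queue_pos = np.zeros(sim_queue.shape, dtype=bool)
    np.put_along_axis(queue_pos, top, True, axis=1)

    # Multi-positive InfoNCE over [support neighbor | queue].
    l_pos = np.sum(q * nn1, axis=1, keepdims=True)            # (N, 1)
    logits = np.concatenate([l_pos, sim_queue], axis=1) / tau
    pos_mask = np.concatenate(
        [np.ones((q.shape[0], 1), dtype=bool), queue_pos], axis=1
    )
    logits = logits - logits.max(axis=1, keepdims=True)       # stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_sample = -(log_prob * pos_mask).sum(axis=1) / pos_mask.sum(axis=1)
    return per_sample.mean()
```

In this sketch the first retrieval swaps the augmented key for its nearest neighbor in the support set, so the positive can be a different sample, and the second retrieval moves the `topk` most similar queue entries from the negative side to the positive side of the loss. A real training setup would compute the embeddings with an encoder and maintain `support` and `queue` as memory banks.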

CLC Number: TP391.41

WU Yushan, XU Zengmin, ZHANG Xuelian, WANG Tao. Self-supervised Action Recognition Based on Skeleton Data Augmentation and Double Nearest Neighbor Retrieval[J]. Computer Science, 2023, 50(11): 97-106. https://doi.org/10.11896/jsjkx.230500158

