Ministry of Education
ORCID: 0009-0003-3334-9508
Publishes on Human Pose and Action Recognition, Magnetism in coordination complexes, and Multimodal Machine Learning Applications. 10 papers and 226 citations.
coordination in the equatorial plane of Dy2 than that of Dy1, indicating that Dy2 has stronger axial anisotropy and a weaker transverse field than Dy1. Consistent with the above structural analysis, compound 2 exhibits a coexistence of QTM and magnetic relaxation through the second excited state under zero dc field, with a relaxation energy barrier of 58 K, higher than that of compound 1, and exhibits butterfly-shaped hysteresis loops below 7 K, in contrast to the weak hysteresis at 2 K for 1. Therefore, we utilized 1 and 2 as a class of model complexes to explore the significant impact of local distortion of the axially compressed pentagonal bipyramidal coordination geometry on single-ion magnetic performance under extremely similar coordination environments.
This paper focuses on improving the efficiency of action recognition frameworks by simplifying their complicated feature extraction pipelines and enhancing explainability, which benefits future adaptation to more complex visual understanding tasks (e.g., video captioning). To this end, we propose HSAR, a novel decoupled two-stream framework for action recognition that uses high-semantic features for increased efficiency and provides well-founded explanations in terms of spatial-temporal perception. The inputs are decoupled into spatial and temporal streams, each with a designated encoder that extracts only the most salient representations, yielding high-semantic features while greatly reducing computation costs. A lightweight Temporal Motion Transformer (TMT) module is proposed to model temporal features globally through self-attention, omitting redundant spatial features. The decoupled spatial and temporal embeddings are then merged dynamically by an attention fusion model to form a joint high-semantic representation. Visualizing the attention in each module offers intuitive interpretations of HSAR's behavior. Extensive experiments on three widely used benchmarks (Kinetics400, Kinetics600, and Sthv2) show that our framework achieves high prediction accuracy with significantly reduced computation (only 64.07 GFLOPs per clip), offering a favorable trade-off between accuracy and computational cost.
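The abstract does not specify how the attention fusion model is built, so the following is only a minimal sketch of one common way to merge two stream embeddings with attention weights. The function `attention_fuse`, the `query` vector, and the toy embeddings are all hypothetical names and values chosen for illustration; they are not taken from HSAR.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_fuse(spatial, temporal, query):
    """Fuse two stream embeddings into one vector.

    Assumption: each stream is scored by a scaled dot product with a
    (hypothetical) query vector, and the softmax of the two scores gives
    the mixing weights. The paper's actual fusion model may differ.
    """
    scale = math.sqrt(len(query))
    scores = [dot(query, spatial) / scale, dot(query, temporal) / scale]
    weights = softmax(scores)
    fused = [weights[0] * s + weights[1] * t for s, t in zip(spatial, temporal)]
    return fused, weights


# Toy 4-dimensional embeddings (illustrative values only).
spatial = [1.0, 0.0, 0.0, 0.0]
temporal = [0.0, 1.0, 0.0, 0.0]
query = [0.0, 2.0, 0.0, 0.0]  # aligned with the temporal stream

fused, weights = attention_fuse(spatial, temporal, query)
print(weights)
print(fused)
```

In this toy setup the query is aligned with the temporal embedding, so the temporal stream receives the larger weight. Inspecting `weights` per input is the kind of signal that attention visualizations build on.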