Human Activity Recognition (HAR) technology is pivotal in medical and surveillance systems. Traditional HAR methods that rely on a single sensor are limited in the range of actions they can classify because of that sensor's inherent constraints. This paper addresses multimodal HAR, which enables the classification of a broader range of activities by combining a camera with an inertial measurement unit (IMU), so that each sensor compensates for the other's shortcomings. The proposed method feeds images, acceleration, and angular velocity into a network that integrates an attention mechanism, which dynamically adjusts the importance of each sensor's features according to the input data. The network is also lightweight and high-performing, designed to run on edge devices with limited computational resources. Experimental results demonstrate that it classifies five actions with approximately 97% precision, significantly outperforming single-sensor baselines. For a robust evaluation, we further conducted experiments on the UESTC-MMEA-CL dataset, confirming that the multimodal approach is more accurate than single-sensor methods.
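
The abstract's central idea, attention-weighted fusion of camera and IMU features, can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the paper's actual architecture: the feature dimensions, layer names (`img_proj`, `imu_proj`, `attn`), and the use of PyTorch are hypothetical, and only the five-class output follows the text.

```python
# Minimal sketch of attention-based fusion of camera and IMU features.
# All dimensions and layer choices are illustrative assumptions.
import torch
import torch.nn as nn


class AttentionFusionHAR(nn.Module):
    def __init__(self, img_dim=512, imu_dim=64, fused_dim=128, num_classes=5):
        super().__init__()
        # Project each modality's features into a common dimension.
        self.img_proj = nn.Linear(img_dim, fused_dim)
        self.imu_proj = nn.Linear(imu_dim, fused_dim)
        # Attention head: scores the two modalities from the concatenated features.
        self.attn = nn.Linear(2 * fused_dim, 2)
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, img_feat, imu_feat):
        img = torch.relu(self.img_proj(img_feat))   # (B, fused_dim)
        imu = torch.relu(self.imu_proj(imu_feat))   # (B, fused_dim)
        # Per-sample modality weights, recomputed from the current input,
        # so the network can emphasize whichever sensor is more informative.
        weights = torch.softmax(self.attn(torch.cat([img, imu], dim=-1)), dim=-1)
        fused = weights[:, 0:1] * img + weights[:, 1:2] * imu
        return self.classifier(fused)               # (B, num_classes) logits


# Usage: a batch of 8 samples with precomputed per-modality feature vectors.
model = AttentionFusionHAR()
logits = model(torch.randn(8, 512), torch.randn(8, 64))
print(logits.shape)  # torch.Size([8, 5])
```

In this sketch the attention weights are a function of both modalities' features, so the contribution of the camera versus the IMU varies per input, which is the behavior the abstract attributes to the proposed mechanism.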