Attention-based Siamese networks have shown remarkable results for occlusion-aware single-camera Multi-Object Tracking (MOT) of persons, as they can effectively combine motion and appearance features. However, extending their use to multi-camera MOT in crowded areas such as train stations and airports is challenging. In such scenarios, the visual appearance of people varies more widely because they are observed from highly diverse viewpoints as they move, which adds to the already high variability caused by partial occlusions and body pose differences (standing, sitting, or lying). Moreover, attention-based MOT methods are computationally intensive and therefore difficult to scale to multiple cameras. To overcome these problems, in this paper we propose a method that exploits contextual information of the scenario, such as the viewpoint-, occlusion-, and pose-related visual appearance characteristics of persons, to improve the inter- and intra-class feature representations learned by attention-based Siamese networks. Our approach combines a context-aware training data batching and hard triplet mining strategy with an automated model complexity tuning procedure to train the model best suited to the scenario. This improves the fusion of motion and appearance features of persons in the data association cost matrix of the MOT algorithm. Experimental results on the MOT17 dataset demonstrate the effectiveness and efficiency of our approach, showing promising performance for real-world applications that require robust MOT in multi-camera setups.
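The following is a minimal sketch, not the authors' implementation, of how context-aware batching could be combined with batch-hard triplet mining as described above. The context bins (viewpoint, occlusion, pose), the sample dictionary layout, and all function names are hypothetical placeholders introduced for illustration.

```python
# Hypothetical sketch: context-aware batching + batch-hard triplet mining.
# The "context" field (e.g., viewpoint/occlusion/pose bin) and all names
# below are illustrative assumptions, not the paper's actual code.
import random
from collections import defaultdict

import torch
import torch.nn.functional as F


def build_context_aware_batch(samples, ids_per_batch=8, imgs_per_id=4):
    """Group samples of the same identity across different context bins
    into one batch, so mined triplets contrast appearance changes caused
    by viewpoint/occlusion/pose rather than by identity alone."""
    by_id = defaultdict(list)
    for s in samples:                      # s = {"img": ..., "pid": ..., "context": ...}
        by_id[s["pid"]].append(s)
    batch = []
    for pid in random.sample(list(by_id), min(ids_per_batch, len(by_id))):
        group = sorted(by_id[pid], key=lambda s: s["context"])
        step = max(1, len(group) // imgs_per_id)
        # Spread the picks over the sorted context bins.
        batch.extend(group[::step][:imgs_per_id])
    return batch


def batch_hard_triplet_loss(embeddings, pids, margin=0.3):
    """Standard batch-hard triplet loss: for each anchor, take its hardest
    positive (same id, farthest) and hardest negative (other id, closest)."""
    emb = F.normalize(embeddings, dim=1)
    dist = torch.cdist(emb, emb)                       # (N, N) pairwise distances
    same_id = pids.unsqueeze(0) == pids.unsqueeze(1)   # (N, N) identity mask
    hardest_pos = (dist * same_id.float()).max(dim=1).values
    hardest_neg = dist.masked_fill(same_id, float("inf")).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()
```

Sampling each identity across several context bins is one simple way to make the hardest positives in a batch correspond to viewpoint, occlusion, or pose changes, which is the effect the context-aware strategy aims for.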