Attention-based Siamese networks have shown remarkable results for occlusion-aware single-camera Multi-Object Tracking (MOT) of persons, as they can effectively combine motion and appearance features. However, extending them to multi-camera MOT in crowded areas such as train stations and airports is challenging. In such scenarios, the visual appearance of people varies strongly because they are observed from very diverse viewpoints as they move, which compounds the already high variability caused by partial occlusions and body pose differences (standing, sitting, or lying). Moreover, attention-based MOT methods are computationally intensive and therefore difficult to scale to multiple cameras. To overcome these problems, in this paper we propose a method that exploits contextual information of the scenario, such as the viewpoint-, occlusion-, and pose-related visual appearance characteristics of persons, to improve the inter- and intra-identity feature representations in attention-based Siamese networks. Our approach combines a context-aware training data batching and hard triplet mining strategy with an automated model complexity tuning procedure to train the optimal model for the scenario. This improves the fusion of motion and appearance features of persons in the data association cost matrix of the MOT algorithm. Experimental results on the MOT17 dataset demonstrate the effectiveness and efficiency of our approach, showing promising results for real-world applications requiring robust MOT in multi-camera setups.
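As an illustration of the hard triplet mining step mentioned in this abstract, below is a minimal batch-hard triplet loss sketch in PyTorch. This is not the authors' implementation: the margin value, the function name, and the assumption that each mini-batch has been assembled context-aware (every identity appearing under several viewpoints, occlusion levels, and poses) are our own.

```python
import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Batch-hard triplet loss over Siamese appearance embeddings.

    embeddings: (B, D) tensor; labels: (B,) identity ids. Assumes the
    batch was built context-aware, i.e. each identity is sampled from
    several viewpoints / occlusion levels / poses (hypothetical setup).
    """
    dist = torch.cdist(embeddings, embeddings, p=2)    # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-identity mask (includes self)
    # Hardest positive: farthest sample with the same identity (self has distance 0).
    hardest_pos = (dist * same.float()).max(dim=1).values
    # Hardest negative: closest sample with a different identity.
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values
    return torch.clamp(hardest_pos - hardest_neg + margin, min=0).mean()
```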
Current diffusion models could assist in creating training datasets for Deep Neural Network (DNN)-based person detectors by producing high-quality, realistic, and custom images of non-existent people and objects, avoiding privacy issues. However, these models have difficulties in generating images of people in a fully controlled way: problems such as abnormal proportions, distortions of the body or face, extra limbs, or elements that do not match the input text prompt may occur. Moreover, biases related to factors like gender, clothing types and colors, ethnicity, or age can also limit the control over the generated images. Both generative AI models and DNN-based person detectors need large sets of annotated images that reflect the diverse visual appearances expected in the application context. In this paper, we explore the capabilities of state-of-the-art text-to-image diffusion models for person image generation and propose a methodology to exploit them for training DNN-based person detectors. For the generation of virtual persons, this includes variations in the environment, such as illumination or background, and in people's characteristics, such as body pose, skin tone, gender, age, clothing types and colors, as well as multiple types of partial occlusion by other objects (or people). Our method leverages explainability techniques to gain a better understanding of the behaviour of the diffusion models and of the relation between inputs and outputs, improving the diversity of the person detection training dataset. Experimental results on the WiderPerson benchmark with a YOLOX detection model trained using the proposed methodology show the potential of this approach.
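As a rough illustration of how such a prompt-driven generation loop could look, the sketch below uses the Hugging Face diffusers library with Stable Diffusion. The checkpoint, attribute lists, negative prompt, and output paths are illustrative assumptions, not the configuration used in the paper.

```python
import itertools
import os
import torch
from diffusers import StableDiffusionPipeline

# Illustrative checkpoint; the paper's actual generative model is not specified here.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Hypothetical attribute grids covering the variations listed in the abstract.
POSES = ["standing", "sitting", "lying down"]
AGES = ["young", "middle-aged", "elderly"]
CLOTHING = ["a red jacket", "a dark business suit", "casual sportswear"]
SCENES = ["a train station", "an airport terminal", "a city street"]

os.makedirs("synthetic_persons", exist_ok=True)
for i, (pose, age, clothes, scene) in enumerate(
        itertools.product(POSES, AGES, CLOTHING, SCENES)):
    prompt = f"photo of a {age} person {pose}, wearing {clothes}, in {scene}"
    image = pipe(
        prompt,
        # Negative prompt targets the failure modes the abstract mentions.
        negative_prompt="deformed body, distorted face, extra limbs",
    ).images[0]
    image.save(f"synthetic_persons/person_{i:04d}.png")
```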
Runway and taxiway pavements are exposed to high stress during their projected lifetime, which inevitably degrades their condition over time. To ensure that airport pavements support uninterrupted and resilient operations, it is of utmost importance to monitor their condition and conduct regular inspections. UAV-based inspection has recently been gaining importance due to its wide-range monitoring capabilities and reduced cost. In this work, we propose a vision-based approach to automatically identify pavement distress in images captured by UAVs. The proposed method is based on Deep Learning (DL) to segment defects in the image. The DL architecture accommodates the low computational capacity of embedded systems on UAVs by using an optimised implementation of an EfficientNet feature extractor and a Feature Pyramid Network segmentation head. To deal with the lack of annotated training data, we have developed a synthetic dataset generation methodology to extend available distress datasets. We demonstrate that using a mixed dataset composed of synthetic and real training images yields better results when testing the trained models in real application scenarios.
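One way to realise the EfficientNet + Feature Pyramid Network combination described above is with the segmentation_models_pytorch library; the sketch below is our own minimal example, with an assumed binary (distress vs. background) output and tile size, not the paper's exact architecture.

```python
import torch
import segmentation_models_pytorch as smp

# FPN decoder on top of a lightweight EfficientNet-B0 encoder; the small
# encoder is what makes embedded (on-UAV) inference plausible.
model = smp.FPN(
    encoder_name="efficientnet-b0",
    encoder_weights="imagenet",  # pretrained encoder, fine-tune on distress data
    in_channels=3,
    classes=1,                   # assumed: binary distress-vs-background mask
)

model.eval()
with torch.no_grad():
    tile = torch.randn(1, 3, 512, 512)   # dummy UAV image tile (assumed size)
    logits = model(tile)                 # (1, 1, 512, 512) per-pixel logits
    mask = torch.sigmoid(logits) > 0.5   # binary distress mask
```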