Multi-modal fusion transformer for visual question answering in remote sensing

Tim Siebert; Kai Norman Clasen; Mahdyar Ravanbakhsh; Begüm Demir

doi:10.1117/12.2636276

26 October 2022

Proceedings Volume 12267, Image and Signal Processing for Remote Sensing XXVIII; 122670L (2022) https://doi.org/10.1117/12.2636276
Event: SPIE Remote Sensing, 2022, Berlin, Germany

Abstract

With the new generation of satellite technologies, the archives of remote sensing (RS) images are growing rapidly. To make the intrinsic information of each RS image easily accessible, visual question answering (VQA) has been introduced in RS. VQA allows a user to formulate a free-form question concerning the content of RS images to extract generic information. It has been shown that the fusion of the input modalities (i.e., image and text) is crucial for the performance of VQA systems. Most current fusion approaches use modality-specific representations in their fusion modules instead of joint representation learning. However, to discover the underlying relation between the image and question modalities, the model must learn a joint representation rather than simply combine (e.g., concatenate, add, or multiply) the modality-specific representations. We propose a multi-modal transformer-based architecture to overcome this issue. Our proposed architecture consists of three main modules: i) the feature extraction module for extracting the modality-specific features; ii) the fusion module, which leverages a user-defined number of multi-modal transformer layers of the VisualBERT model (VB); and iii) the classification module to obtain the answer. In contrast to recently proposed transformer-based models in RS VQA, the presented architecture (called VBFusion) is not limited to specific questions, e.g., questions concerning pre-defined objects. Experimental results obtained on the RSVQAxBEN and RSVQA-LR datasets (which are made up of RGB bands of Sentinel-2 images) demonstrate the effectiveness of VBFusion for VQA tasks in RS. To analyze the importance of using other spectral bands for the description of the complex content of RS images in the framework of VQA, we extend the RSVQAxBEN dataset to include all the spectral bands of Sentinel-2 images with 10 m and 20 m spatial resolution. Experimental results show the importance of utilizing these bands to characterize the land-use land-cover classes present in the images in the framework of VQA. The code of the proposed method is publicly available at https://git.tu-berlin.de/rsim/multimodal-fusion-transformer-for-vqa-in-rs.
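
The three-module structure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation and does not use VisualBERT: the feature extractors are replaced by random vectors, the fusion layer is a single-head self-attention with identity projections and a residual connection, and all names (`self_attention`, `fuse`, `classify`, `DIM`, `NUM_ANSWERS`) as well as the text-first token ordering are assumptions made for illustration. The point it shows is that, when both modalities share one token sequence, every question token can attend to every image token, which a plain concatenation of two pooled feature vectors does not provide.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Toy single-head self-attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def fuse(image_tokens, text_tokens, num_layers=2):
    """Joint fusion: one shared sequence, so attention crosses modalities."""
    sequence = text_tokens + image_tokens  # assumed ordering: text first
    for _ in range(num_layers):  # user-defined number of fusion layers
        attended = self_attention(sequence)
        # Residual connection around the attention step.
        sequence = [[x + y for x, y in zip(t, a)] for t, a in zip(sequence, attended)]
    return sequence


def classify(fused, head):
    """Linear head on the first token; returns the index of the top answer."""
    summary = fused[0]
    logits = [sum(w * x for w, x in zip(row, summary)) for row in head]
    return max(range(len(logits)), key=logits.__getitem__)


DIM, NUM_ANSWERS = 8, 4  # hypothetical sizes
rng = random.Random(0)


def random_tokens(n):
    """Stand-in for a modality-specific feature extractor."""
    return [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(n)]


image_tokens = random_tokens(5)  # placeholder for image encoder output
text_tokens = random_tokens(3)   # placeholder for question encoder output
head = random_tokens(NUM_ANSWERS)  # placeholder classifier weights

fused = fuse(image_tokens, text_tokens, num_layers=2)
answer_index = classify(fused, head)
```

In this sketch, replacing `image_tokens` changes the fused representation of the question tokens as well, which is the cross-modal interaction that joint representation learning relies on.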

Citation

Tim Siebert, Kai Norman Clasen, Mahdyar Ravanbakhsh, and Begüm Demir "Multi-modal fusion transformer for visual question answering in remote sensing", Proc. SPIE 12267, Image and Signal Processing for Remote Sensing XXVIII, 122670L (26 October 2022); https://doi.org/10.1117/12.2636276
