Categorizing dark web image content is critical for identifying and averting potential threats. However, this remains challenging due to the nature of the data, which spans multiple co-existing domains and exhibits strong intra-class variation, and due to the continual emergence of new classes driven by the rapidly growing criminal activity on the dark web. While many methods have been proposed to classify this image content, multi-label, multi-class continual-learning classification remains underexplored. In this paper, we propose a novel and efficient strategy for transforming a zero-shot single-label classifier into a few-shot multi-label classifier. This approach combines a label-empowering methodology with few-shot data. We use CLIP, a contrastive learning model trained on image-text pairs, to demonstrate the effectiveness of our strategy. Furthermore, we identify the most appropriate continual-learning methodology to overcome the challenges of accessing old data and retraining from scratch for each newly added class. Finally, we compare the performance against multi-label methodologies applied to CLIP, leading multi-label methods, and continual-learning approaches.
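To make the multi-label adaptation concrete, the sketch below shows how a zero-shot single-label CLIP classifier (argmax over image-text similarities) can be turned into a multi-label predictor by scoring each label independently against a per-label threshold. This is a minimal illustration only, assuming the Hugging Face `transformers` CLIP implementation; the label set, prompt template, and thresholds are hypothetical placeholders, not the paper's actual label-empowering configuration, and in the few-shot setting the thresholds would be calibrated on the available labeled examples.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical label set and prompt template, for illustration only.
labels = ["weapons", "drugs", "counterfeit documents", "credit cards"]
prompts = [f"a photo of {label}" for label in labels]

def predict_multilabel(image: Image.Image, thresholds: dict) -> list:
    """Return every label whose image-text similarity clears its threshold."""
    text_in = processor(text=prompts, return_tensors="pt", padding=True)
    img_in = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        text_emb = model.get_text_features(**text_in)
        img_emb = model.get_image_features(**img_in)
    # Normalize embeddings so the dot product is the cosine similarity.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ text_emb.T).squeeze(0)  # one similarity per label
    # Zero-shot single-label CLIP would take argmax over `sims`; for a
    # multi-label output, each label is instead accepted independently,
    # with thresholds calibrated on the few-shot examples.
    return [lab for lab, s in zip(labels, sims) if s.item() >= thresholds[lab]]

image = Image.open("sample.jpg")
print(predict_multilabel(image, thresholds={lab: 0.25 for lab in labels}))
```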
Visual indexing, the ability to search and analyze visual media such as images and videos, is important for law enforcement agencies because it can speed up criminal investigations. As ever more visual media is created and shared online, the ability to search and analyze this data effectively becomes increasingly important for investigators to do their job well. The major challenges for video captioning include accurately recognizing the objects and activities in each frame, understanding their relationships and context, generating natural and descriptive language, and ensuring the captions are relevant and useful. Near real-time processing is also required to support agile forensic decision making and prompt triage, hand-over, and reduction of the amount of data to be processed by investigators or subsequent processing tools. This paper presents a captioning-driven, efficient video analytics method that extracts accurate descriptions of image and video files. The proposed approach includes a temporal segmentation technique that selects the most relevant frames. Subsequently, an image captioning approach specialized for counter-terrorism and cybercrime visual media describes each relevant frame. Our proposed method achieves high consistency and correlation with human summaries on the SumMe dataset, outperforming previous similar methods.
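The overall pipeline structure (temporal segmentation to pick relevant frames, then per-frame captioning) can be sketched as below. This is only an illustrative stand-in: the paper's own segmentation technique and its counter-terrorism/cybercrime-specialized captioner are not reproduced here, so a basic frame-difference heuristic and a generic pretrained BLIP captioner are used instead, and the file names are placeholders.

```python
import cv2
from PIL import Image
from transformers import pipeline

# Generic image captioner standing in for the paper's specialized model.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def select_keyframes(video_path: str, diff_threshold: float = 30.0):
    """Yield frames whose mean absolute difference from the last kept frame is large."""
    cap = cv2.VideoCapture(video_path)
    prev_gray = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is None or cv2.absdiff(gray, prev_gray).mean() > diff_threshold:
            prev_gray = gray
            yield Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()

def caption_video(video_path: str) -> list:
    """Return one caption per selected keyframe."""
    return [captioner(frame)[0]["generated_text"] for frame in select_keyframes(video_path)]

print(caption_video("evidence_clip.mp4"))  # placeholder file name
```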