Most previous work on synset induction ignores visual data, which carries important semantic information. In this paper, we present an effective multi-modal solution to the tag synset induction task by leveraging the massive image-tag pairs found in image-centric social networks such as Instagram and Pinterest. The proposed method consists of three stages: the first learns textual and visual representations of tags; the second learns a distance between tags from these representations in a supervised fashion; and the third clusters tags into synsets based on the learned distance. To support these stages, we collect a suite of new datasets from social media: TagSet, with over 415 million sentences for textual representation learning; ImageSet, with nearly 16 million image-tag pairs for visual representation learning; and TagSynSet, with a total of 5,644 tags from 3,680 synsets for distance learning and tag synset induction. Extensive experiments on TagSynSet show that our proposed method is effective on the tag synset induction task.
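The three-stage pipeline above can be illustrated with a minimal sketch. All tag names, embedding values, and the threshold below are hypothetical; cosine distance stands in for the learned supervised distance, and a greedy single-link pass stands in for the clustering stage.

```python
import math

# Stage 1 (assumed output): each tag gets a multi-modal embedding,
# e.g. the concatenation of its textual and visual representations.
# These vectors are illustrative placeholders, not real data.
tag_vecs = {
    "puppy":  [0.90, 0.10, 0.80, 0.20],
    "dog":    [0.85, 0.15, 0.75, 0.25],
    "sunset": [0.10, 0.90, 0.20, 0.80],
    "dusk":   [0.12, 0.88, 0.25, 0.75],
}

def distance(u, v):
    # Stage 2 stand-in: a learned metric would go here; we use
    # cosine distance between the multi-modal embeddings instead.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def induce_synsets(vecs, threshold=0.05):
    # Stage 3: greedy single-link clustering -- a tag joins the first
    # synset containing any tag within the distance threshold,
    # otherwise it starts a new synset.
    synsets = []
    for tag, vec in vecs.items():
        for group in synsets:
            if any(distance(vec, vecs[t]) < threshold for t in group):
                group.append(tag)
                break
        else:
            synsets.append([tag])
    return synsets

print(induce_synsets(tag_vecs))  # -> [['puppy', 'dog'], ['sunset', 'dusk']]
```

In practice the distance and clustering choices matter far more than this sketch suggests; the point is only to show how the three stages compose into synsets.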