2 December 2022 Cross-lingual sentence embedding for mining low-resources parallel sentences
Chenhao Zhang, Yongzhong Huang, Yongqing Deng
Proceedings Volume 12288, International Conference on Computer, Artificial Intelligence, and Control Engineering (CAICE 2022); 122881P (2022) https://doi.org/10.1117/12.2641013
Event: International Conference on Computer, Artificial Intelligence, and Control Engineering (CAICE 2022), 2022, Zhuhai, China
Abstract
Mining parallel sentences for low-resource language pairs is useful in many natural language processing tasks. A large number of high-quality parallel sentences provides the data needed for tasks such as building bilingual parallel corpora and cross-language information retrieval. In this paper, we present an approach that uses cross-lingual sentence embedding to mine Thai-English parallel sentences from large documents. To evaluate the approach, we used two extensive bilingual corpora that provide gold-standard scores. On the TED and Tanzil sets, our approach improves AUC by nearly 0.73 points, reaching 96.5%. In the task of mining the BUCC parallel corpus, our approach uses less time and space while achieving an F1 score similar to that of the LaBSE model, which is state-of-the-art on BUCC. Our model thus addresses sentence alignment when resources are insufficient, and it does so in less time.
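To make the general idea of embedding-based mining concrete, the following is a minimal sketch, not the authors' method. It assumes that sentence embeddings have already been produced by some cross-lingual encoder (here replaced by hand-written toy vectors), that candidates are scored with cosine similarity, and that a pair is kept only when each sentence is the other's best match and the score clears a threshold. The function names, the threshold value of 0.8, and the toy vectors are all hypothetical; the abstract does not specify the scoring or filtering used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_index, tgt_index, score) for mutual best matches.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    # Full similarity matrix: sims[i][j] compares source i with target j.
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    pairs = []
    for i, row in enumerate(sims):
        # Best target for source sentence i.
        j = max(range(len(row)), key=row.__getitem__)
        # Best source for that target; keep the pair only if it is mutual.
        best_i = max(range(len(src_vecs)), key=lambda k: sims[k][j])
        if best_i == i and row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs


# Toy stand-ins for sentence embeddings (hypothetical values).
thai_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.5, 0.5, 0.5]]
english_vecs = [[0.1, 0.9, 0.1], [1.0, 0.0, 0.1], [-0.3, 0.2, -0.9]]

pairs = mine_pairs(thai_vecs, english_vecs)
for src_idx, tgt_idx, score in pairs:
    print(f"source {src_idx} <-> target {tgt_idx} (cosine {score:.3f})")
```

In this toy run, the first two source vectors each find a mutual best match, while the third is discarded because its preferred target is already a better match for another source sentence. A real system would replace the exhaustive similarity matrix with an approximate nearest-neighbor search to keep time and memory manageable on large corpora.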
© (2022) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Chenhao Zhang, Yongzhong Huang, and Yongqing Deng "Cross-lingual sentence embedding for mining low-resources parallel sentences", Proc. SPIE 12288, International Conference on Computer, Artificial Intelligence, and Control Engineering (CAICE 2022), 122881P (2 December 2022); https://doi.org/10.1117/12.2641013
KEYWORDS
Mining

Computer programming

Data mining
