Mining multilingual parallel sentences for low-resource languages is useful for natural language processing tasks. A large number of high-quality parallel sentences provides the data needed for tasks such as building bilingual parallel corpora and cross-language information retrieval. In this paper, we present an approach for mining Thai-English parallel sentences from large documents using cross-lingual sentence embeddings. To evaluate the approach, we use two extensive bilingual corpora that provide gold-standard scores. On the TED and Tanzil sets, our approach improves AUC by nearly 0.73 points, reaching 96.5%. On the task of mining the BUCC parallel corpus, our approach uses less time and space while achieving an F1 score similar to that of the LaBSE model, which has reached state-of-the-art performance on BUCC. Our model not only solves the problem of sentence alignment with insufficient resources but also takes less time.