Language-agnostic BERT Sentence Embedding (LaBSE)
Convert the original TF-Hub weights to the standard BERT checkpoint format.
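For reference, the rough idea behind the conversion is sketched below: read every weight out of the TF-Hub SavedModel, rename it to the standard BERT checkpoint naming scheme, and write a TF1-style checkpoint that bert4keras can read. This is only an illustrative sketch, not the actual conversion script; the paths and the renaming rule are placeholders, and the real name mapping has to be worked out by inspecting the exported variable names.

import tensorflow as tf

saved_model_path = '/root/kg/bert/labse_tfhub'        # hypothetical local copy of the TF-Hub SavedModel
output_checkpoint = '/root/kg/bert/labse/bert_model.ckpt'

# Load the SavedModel eagerly and pull every weight into NumPy
# (assumes the loaded object exposes its variables, as TF2 hub models usually do)
module = tf.saved_model.load(saved_model_path)
weights = {v.name.split(':')[0]: v.numpy() for v in module.variables}
print('\n'.join(sorted(weights)))                     # inspect the source names before mapping them

def rename(name):
    # purely illustrative renaming rule -- replace with a full source -> BERT name map
    return name.replace('transformer/', 'bert/encoder/')

# Re-create the weights under their new names and save a TF1-style checkpoint
# (note: very large embedding matrices may need placeholder-based initialization instead)
with tf.Graph().as_default():
    new_vars = [tf.Variable(array, name=rename(name)) for name, array in weights.items()]
    with tf.compat.v1.Session() as sess:
        sess.run(tf.compat.v1.variables_initializer(new_vars))
        tf.compat.v1.train.Saver(new_vars).save(sess, output_checkpoint)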
LaBSE
Original Introduction:
We adapt multilingual BERT to produce language-agnostic sentence embeddings for 109 languages. While English sentence embeddings have been obtained by fine-tuning a pretrained BERT model, such models have not been applied to multilingual sentence embeddings. Our model combines masked language model (MLM) and translation language model (TLM) pretraining with a translation ranking task using bi-directional dual encoders. The resulting multilingual sentence embeddings improve average bi-text retrieval accuracy over 112 languages to 83.7%, well above the 65.5% achieved by the prior state-of-the-art on Tatoeba. Our sentence embeddings also establish new state-of-the-art results on BUCC and UN bi-text retrieval.
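To make the "translation ranking task using bi-directional dual encoders" concrete, here is a toy NumPy sketch of an in-batch ranking loss: each source sentence embedding must score its own translation higher than the other translations in the batch, and vice versa. This is only an illustration of the objective (details such as the margin used in the paper and the actual encoders are omitted), not the authors' training code.

import numpy as np

def ranking_loss(src_emb, tgt_emb):
    """src_emb, tgt_emb: (batch, dim) L2-normalized embeddings where row i of
    each matrix is a translation pair."""
    scores = src_emb @ tgt_emb.T                    # (batch, batch) similarity matrix
    labels = np.arange(len(scores))                 # true pairs sit on the diagonal

    def softmax_xent(logits):
        logits = logits - logits.max(axis=1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # bi-directional: rank targets given sources and sources given targets
    return softmax_xent(scores) + softmax_xent(scores.T)

# toy usage with random unit vectors standing in for encoder outputs
rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8)); src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt = rng.normal(size=(4, 8)); tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
print(ranking_loss(src, tgt))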
Download
The converted weights can be downloaded at:
Baidu Netdisk: https://pan.baidu.com/s/17qUdDSrPhhNTvPnEeI56sg (extraction code: p52d)
or
Google Drive: https://drive.google.com/file/d/14Zaq8RE9NMyJb_9B-lkgFZQ9H1K-U-Nf
We can load it with bert4keras:
from bert4keras.backend import keras
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer
import numpy as np
# Paths to the converted checkpoint (adjust to your local setup)
config_path = '/root/kg/bert/labse/bert_config.json'
checkpoint_path = '/root/kg/bert/labse/bert_model.ckpt'
dict_path = '/root/kg/bert/labse/vocab.txt'

# Build the tokenizer and the model; with_pool='linear' adds the Pooler dense
# layer with a linear (instead of tanh) activation, so the model outputs a
# pooled sentence vector
tokenizer = Tokenizer(dict_path)
model = build_transformer_model(config_path, checkpoint_path, with_pool='linear')

# Encoding test: embed a sample sentence ('语言模型' is Chinese for 'language model')
token_ids, segment_ids = tokenizer.encode(u'语言模型')
print('\n ===== predicting =====\n')
print(model.predict([np.array([token_ids]), np.array([segment_ids])]))
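As a small follow-up to the snippet above, the pooled output can be used directly as a sentence embedding and compared by cosine similarity. The helper below reuses the tokenizer and model already built; the L2 normalization and the example sentence pair are illustrative additions, not part of the original snippet.

from bert4keras.snippets import sequence_padding

def encode(texts):
    """Embed a list of sentences with the model above and L2-normalize the result."""
    token_ids, segment_ids = [], []
    for text in texts:
        t, s = tokenizer.encode(text)
        token_ids.append(t)
        segment_ids.append(s)
    vecs = model.predict([sequence_padding(token_ids), sequence_padding(segment_ids)])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

vecs = encode([u'语言模型', u'language model'])
print('cosine similarity:', float(np.dot(vecs[0], vecs[1])))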