Language-agnostic BERT Sentence Embedding (LaBSE)
Convert the original TF-Hub weights to the standard BERT checkpoint format.
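For reference, the rough idea behind the conversion is sketched below: read every weight out of the TF-Hub SavedModel, rename it to the standard BERT checkpoint naming scheme, and write a TF1-style checkpoint that bert4keras can read. This is only an illustrative sketch, not the actual conversion script; the paths and the renaming rule are placeholders, and the real name mapping has to be worked out by inspecting the exported variable names.

import tensorflow as tf

saved_model_path = '/root/kg/bert/labse_tfhub'        # hypothetical local copy of the TF-Hub SavedModel
output_checkpoint = '/root/kg/bert/labse/bert_model.ckpt'

# Load the SavedModel eagerly and pull every weight into NumPy
# (assumes the loaded object exposes its variables, as TF2 hub models usually do)
module = tf.saved_model.load(saved_model_path)
weights = {v.name.split(':')[0]: v.numpy() for v in module.variables}
print('\n'.join(sorted(weights)))                     # inspect the source names before mapping them

def rename(name):
    # purely illustrative renaming rule -- replace with a full source -> BERT name map
    return name.replace('transformer/', 'bert/encoder/')

# Re-create the weights under their new names and save a TF1-style checkpoint
# (note: very large embedding matrices may need placeholder-based initialization instead)
with tf.Graph().as_default():
    new_vars = [tf.Variable(array, name=rename(name)) for name, array in weights.items()]
    with tf.compat.v1.Session() as sess:
        sess.run(tf.compat.v1.variables_initializer(new_vars))
        tf.compat.v1.train.Saver(new_vars).save(sess, output_checkpoint)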
LaBSE
Original Introduction:
We adapt multilingual BERT to produce language-agnostic sentence embeddings for 109 languages. While English sentence embeddings have been obtained by fine-tuning a pretrained BERT model, such models have not been applied to multilingual sentence embeddings. Our model combines masked language model (MLM) and translation language model (TLM) pretraining with a translation ranking task using bi-directional dual encoders. The resulting multilingual sentence embeddings improve average bi-text retrieval accuracy over 112 languages to 83.7%, well above the 65.5% achieved by the prior state-of-the-art on Tatoeba. Our sentence embeddings also establish new state-of-the-art results on BUCC and UN bi-text retrieval.
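To make the "translation ranking task using bi-directional dual encoders" concrete, here is a toy NumPy sketch of an in-batch ranking loss: each source sentence embedding must score its own translation higher than the other translations in the batch, and vice versa. This is only an illustration of the objective (details such as the margin used in the paper and the actual encoders are omitted), not the authors' training code.

import numpy as np

def ranking_loss(src_emb, tgt_emb):
    """src_emb, tgt_emb: (batch, dim) L2-normalized embeddings where row i of
    each matrix is a translation pair."""
    scores = src_emb @ tgt_emb.T                    # (batch, batch) similarity matrix
    labels = np.arange(len(scores))                 # true pairs sit on the diagonal

    def softmax_xent(logits):
        logits = logits - logits.max(axis=1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # bi-directional: rank targets given sources and sources given targets
    return softmax_xent(scores) + softmax_xent(scores.T)

# toy usage with random unit vectors standing in for encoder outputs
rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8)); src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt = rng.normal(size=(4, 8)); tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
print(ranking_loss(src, tgt))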
Download
The converted weights can be downloaded at:
Baidu Netdisk: https://pan.baidu.com/s/17qUdDSrPhhNTvPnEeI56sg (extraction code: p52d)
or
Google Drive: https://drive.google.com/file/d/14Zaq8RE9NMyJb_9B-lkgFZQ9H1K-U-Nf
We can load it with bert4keras:
from bert4keras.backend import keras
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer
import numpy as np
# Paths to the converted checkpoint (adjust to your local setup)
config_path = '/root/kg/bert/labse/bert_config.json'
checkpoint_path = '/root/kg/bert/labse/bert_model.ckpt'
dict_path = '/root/kg/bert/labse/vocab.txt'

# Build the tokenizer and the model; with_pool='linear' adds the Pooler dense
# layer with a linear (instead of tanh) activation, so the model outputs a
# pooled sentence vector
tokenizer = Tokenizer(dict_path)
model = build_transformer_model(config_path, checkpoint_path, with_pool='linear')

# Encoding test: embed a sample sentence ('语言模型' is Chinese for 'language model')
token_ids, segment_ids = tokenizer.encode(u'语言模型')
print('\n ===== predicting =====\n')
print(model.predict([np.array([token_ids]), np.array([segment_ids])]))
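As a small follow-up to the snippet above, the pooled output can be used directly as a sentence embedding and compared by cosine similarity. The helper below reuses the tokenizer and model already built; the L2 normalization and the example sentence pair are illustrative additions, not part of the original snippet.

from bert4keras.snippets import sequence_padding

def encode(texts):
    """Embed a list of sentences with the model above and L2-normalize the result."""
    token_ids, segment_ids = [], []
    for text in texts:
        t, s = tokenizer.encode(text)
        token_ids.append(t)
        segment_ids.append(s)
    vecs = model.predict([sequence_padding(token_ids), sequence_padding(segment_ids)])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

vecs = encode([u'语言模型', u'language model'])
print('cosine similarity:', float(np.dot(vecs[0], vecs[1])))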