• This repository has been archived on 06/Jul/2021
  • Stars
    138
  • Rank 263,008 (Top 6 %)
  • Language
    Python
  • License
    MIT License
  • Created over 5 years ago
  • Updated almost 5 years ago

Repository Details

bert chinese similarity

How to use

Prediction

For this project I improved the model through further training, so you can download the trained model and use it directly for prediction.

  • This project only supports sentences of up to 45 characters each.
  • Download the model file (pwd: vv1k).
  • Use it as follows:
    • First, create the predictor:

      bs = BertSim(gpu_no=0, log_dir='log/', bert_sim_dir='bert_sim_model\\', verbose=True)
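    • Second, call predict on a pair of sentences.

      Similar sentences:

      text_a = '技术侦查措施只能在立案后采取'  # "Technical investigation measures may only be taken after a case is filed"
      text_b = '未立案不可以进行技术侦查'  # "Technical investigation is not allowed before a case is filed"
      bs.predict([[text_a, text_b]])

      The result looks like this: [[0.00942544 0.99057454]]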

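      Dissimilar sentences:

      text_a = '华为还准备起诉美国政府'  # "Huawei is also preparing to sue the US government"
      text_b = '飞机出现后货舱火警信息'  # "The aircraft showed an aft cargo hold fire warning"
      bs.predict([[text_a, text_b]])

      The result looks like this: [[0.98687243 0.01312758]]

      Judging from these two examples, the first value appears to be the probability of "dissimilar" and the second the probability of "similar".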
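The two example outputs can be turned into a yes/no decision with a small helper. The sketch below is not part of the project: `check_pair`, `interpret`, and the 0.5 threshold are hypothetical names and values chosen for illustration. It assumes that column 1 of the output is the "similar" class, which is inferred only from the two examples above.

```python
import numpy as np

MAX_CHARS = 45  # length limit stated in the usage notes above


def check_pair(text_a, text_b, max_chars=MAX_CHARS):
    """Return True when both sentences fit within the stated length limit."""
    return len(text_a) <= max_chars and len(text_b) <= max_chars


def interpret(probs, threshold=0.5):
    """Map each row [p_dissimilar, p_similar] to a boolean 'similar' flag.

    Assumption: column 1 is the 'similar' class (inferred from the examples).
    """
    probs = np.asarray(probs, dtype=float)
    return (probs[:, 1] >= threshold).tolist()


# Values copied from the two example outputs above.
similar_out = [[0.00942544, 0.99057454]]
dissimilar_out = [[0.98687243, 0.01312758]]

print(interpret(similar_out))     # [True]
print(interpret(dissimilar_out))  # [False]
```

In practice the argument to `interpret` would be the return value of `bs.predict(...)`, and `check_pair` could be called beforehand to skip pairs that exceed the length limit.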

Parameter

name          type  detail
gpu_no        int   index of the GPU used to initialize the BERT graph
log_dir       str   log directory
verbose       bool  whether to show TensorFlow logs
bert_sim_dir  str   path to the BERT similarity model (e.g. 'bert_sim_model\\')

Train

Code

This project simply fine-tunes the pre-trained BERT model, so it uses the original BERT code. I tried writing my own version, but it ended up the same as the original code, so I gave up.

Dataset

My work is in judicial examination education, so I did not use a common public dataset. My dataset was labeled manually and contains 80,000+ samples: 50,000+ similar and 30,000+ dissimilar. For privacy reasons, I cannot open-source it.

Suggestion:

The original code only uses the model's pooled output. I thought there might be other ways to increase accuracy and tried several, and I found one that helps: concatenate the [CLS] embeddings of the last four layers (from the fourth-from-last to the last) in the encoder output list. If you want to use this approach, do the following.

  • Delete the following code:
output_layer = model.get_pooled_output()
  • Use the following code instead; it can increase accuracy by 1%.
output_layer = tf.concat([tf.squeeze(model.all_encoder_layers[i][:, 0:1, :], axis=1) for i in range(-4, 0, 1)], axis=-1)
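To see what the replacement line does to tensor shapes, here is a minimal NumPy sketch of the same indexing and concatenation. It is an illustration only, not the project's code: the random arrays stand in for `model.all_encoder_layers`, and the sizes (12 layers, hidden size 768, batch of 2, sequence length 45) are assumptions chosen to resemble a BERT-base setup.

```python
import numpy as np

# Assumed, illustrative sizes (not taken from the project).
batch, seq_len, hidden, num_layers = 2, 45, 768, 12

rng = np.random.default_rng(0)
# Stand-in for model.all_encoder_layers: one [batch, seq_len, hidden] array per layer.
all_encoder_layers = [
    rng.standard_normal((batch, seq_len, hidden)) for _ in range(num_layers)
]


def concat_last_cls(layers, n_last=4):
    """Concatenate the position-0 ([CLS]) vectors of the last n_last layers."""
    # layers[i][:, 0, :] matches tf.squeeze(layers[i][:, 0:1, :], axis=1).
    cls_vectors = [layers[i][:, 0, :] for i in range(-n_last, 0)]
    return np.concatenate(cls_vectors, axis=-1)


output_layer = concat_last_cls(all_encoder_layers)
print(output_layer.shape)  # (2, 3072)
```

The feature width grows from `hidden` to `4 * hidden`, so any layer that consumes `output_layer` has to size its weights from the new last dimension rather than from a hard-coded hidden size.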