  • Stars: 152
  • Rank: 244,685 (Top 5%)
  • Language: Java
  • License: MIT License
  • Created over 11 years ago
  • Updated about 4 years ago

simhash-java

A simple Java implementation of the simhash algorithm.

Features:

  1. Compute the simhash of a string.
  2. Compute the similarity between all the strings by building a smart index, so that large data sets can be handled.
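
For readers new to simhash, the two features above can be sketched as follows. This is a minimal illustration, not the code in this repository: the class name SimHashSketch, the FNV-1a token hash, and the equal weighting of tokens are all assumptions made for the example, and the index is not shown.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the repository's implementation.
final class SimHashSketch {

    // 64-bit FNV-1a hash of a token's UTF-8 bytes (an assumed choice of token hash).
    static long hash64(String token) {
        long h = 0xcbf29ce484222325L;          // FNV-1a 64-bit offset basis
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;               // FNV-1a 64-bit prime
        }
        return h;
    }

    // Each token votes +1 or -1 on every bit position;
    // a bit of the fingerprint is set when its vote total is positive.
    static long simhash(List<String> tokens) {
        int[] votes = new int[64];
        for (String token : tokens) {
            long h = hash64(token);
            for (int i = 0; i < 64; i++) {
                votes[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int i = 0; i < 64; i++) {
            if (votes[i] > 0) {
                fingerprint |= (1L << i);
            }
        }
        return fingerprint;
    }

    // Number of bit positions in which two fingerprints differ.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long a = simhash(Arrays.asList("the", "quick", "brown", "fox"));
        long b = simhash(Arrays.asList("the", "quick", "brown", "dog"));
        System.out.println("distance = " + hammingDistance(a, b));
    }
}
```

A small Hamming distance between two fingerprints suggests that the underlying documents share most of their tokens. How small counts as "similar" depends on the data and is left to the caller.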

How to use:

  • Run Main with inputfile and outputfile as arguments.

  • Format of inputfile (see src/test_in): one document per line, encoded in UTF-8.

  • Format of outputfile (see src/test_out):

      • start // start flag

      • first line // doc

      • second line // doc1\tdist, where dist is the Hamming distance between doc and doc1

      • end // end flag
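
To consume the output file from another program, a reader along the following lines may be enough. It is a sketch under assumptions: that the flags are the literal lines "start" and "end", and that a block may hold any number of doc1\tdist lines (check src/test_out to confirm both). The class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative reader for the output format; all names here are hypothetical.
final class OutputFormatSketch {

    // One "doc1\tdist" line: a similar document and its Hamming distance.
    static final class Neighbor {
        final String doc;
        final int dist;
        Neighbor(String doc, int dist) { this.doc = doc; this.dist = dist; }
    }

    // One start...end block: a document followed by its neighbors.
    static final class Group {
        final String doc;
        final List<Neighbor> neighbors = new ArrayList<>();
        Group(String doc) { this.doc = doc; }
    }

    static List<Group> parse(List<String> lines) {
        List<Group> groups = new ArrayList<>();
        Group current = null;
        boolean expectDoc = false;
        for (String line : lines) {
            if (line.equals("start")) {            // assumed literal start flag
                current = null;
                expectDoc = true;
            } else if (expectDoc) {                // the line right after the start flag is doc
                current = new Group(line);
                expectDoc = false;
            } else if (line.equals("end")) {       // assumed literal end flag
                if (current != null) {
                    groups.add(current);
                }
                current = null;
            } else if (current != null) {
                // Split on the last tab so that tabs inside the document text survive.
                int tab = line.lastIndexOf('\t');
                if (tab < 0) {
                    continue;                      // skip lines without a distance column
                }
                String doc = line.substring(0, tab);
                // Throws NumberFormatException if the distance column is not an integer.
                int dist = Integer.parseInt(line.substring(tab + 1).trim());
                current.neighbors.add(new Neighbor(doc, dist));
            }
        }
        return groups;
    }
}
```

The lines can be read with java.nio.file.Files.readAllLines using the UTF-8 charset before being passed to parse.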

Future:

  1. Build the project into a runnable jar.
  2. Improve the performance on large data sets.

Note:

  1. Before running Main.java, you should replace BinaryWordSeg with a better analyzer.
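
As one possible starting point for a replacement analyzer, the sketch below splits text on word boundaries using the JDK's java.text.BreakIterator. The class WordTokenizerSketch and its tokenize method are hypothetical; the interface that BinaryWordSeg actually exposes is not described here, so wiring this into Main is left as an assumption. For languages written without spaces, such as Chinese, a dedicated word segmenter is likely a better choice.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical tokenizer; not part of this repository.
final class WordTokenizerSketch {

    // Returns lower-cased word tokens, dropping whitespace and punctuation.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String piece = text.substring(start, end);
            // Keep only pieces that begin with a letter or digit.
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}
```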