• Stars
    1
  • Language
    Python
  • Created about 5 years ago
  • Updated about 5 years ago

Repository Details

This project indexes a very large collection of files and retrieves them in response to a search query, much like a search engine. It uses term frequency (TF) and inverse document frequency (IDF) to determine how important each word is in a text and which words carry the most information about it, so that the most relevant texts can be found for a given query.

The project indexes millions of files and stores the resulting TF-IDF index in a file on disk, because the amount of data is too large to keep in main memory. When a query is made, the program searches the index file for the documents with the highest TF-IDF scores for the words in the query. Keeping the data in files keeps the solution fast and avoids overloading main memory, which would happen if everything were held in program variables.

The index is also written to disk incrementally, so that main memory never holds too much data, which would slow down indexing. Instead of indexing all the files and then writing everything at once, the program writes its accumulated data after every fixed number of files and then clears main memory.

The goal is a program that indexes huge amounts of data and finds the most relevant files for a query as efficiently as possible, using indexing algorithms such as TF-IDF and query-resolution algorithms such as cosine similarity.
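The approach described above can be illustrated with a minimal sketch. It is not the repository's actual code: the function names, the JSON-lines index format, the `batch_size` parameter, and the smoothed IDF formula are all assumptions made for illustration. For brevity the sketch also keeps the document-frequency table in memory, which a real index over millions of files might store on disk as well.

```python
import json
import math
import os
import re
import tempfile
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def flush(buffer, out):
    """Append the buffered records to the index file and free the memory."""
    for record in buffer:
        out.write(json.dumps(record) + "\n")
    buffer.clear()


def build_index(docs, index_path, batch_size=2):
    """Index (doc_id, text) pairs, writing to disk every batch_size documents.

    Returns (doc_count, doc_freq); doc_freq maps a term to the number of
    documents that contain it (kept in memory here, an assumption).
    """
    doc_freq = Counter()
    buffer = []
    doc_count = 0
    with open(index_path, "w", encoding="utf-8") as out:
        for doc_id, text in docs:
            counts = Counter(tokenize(text))
            doc_freq.update(counts.keys())
            buffer.append({"id": doc_id, "tf": counts})
            doc_count += 1
            if len(buffer) >= batch_size:
                flush(buffer, out)
        flush(buffer, out)  # write whatever is left over
    return doc_count, doc_freq


def tfidf(counts, doc_count, doc_freq):
    """Convert raw term counts into a sparse TF-IDF vector (a dict)."""
    total = sum(counts.values())
    vec = {}
    for term, n in counts.items():
        # Smoothed IDF (an assumed variant) so unseen terms do not divide by zero.
        idf = math.log((1 + doc_count) / (1 + doc_freq.get(term, 0))) + 1.0
        vec[term] = (n / total) * idf
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, index_path, doc_count, doc_freq, top_k=3):
    """Stream the index file and rank documents by similarity to the query."""
    q_vec = tfidf(Counter(tokenize(query)), doc_count, doc_freq)
    scores = []
    with open(index_path, encoding="utf-8") as f:
        for line in f:  # one document at a time, never the whole index
            record = json.loads(line)
            score = cosine(q_vec, tfidf(record["tf"], doc_count, doc_freq))
            if score > 0:
                scores.append((score, record["id"]))
    scores.sort(reverse=True)
    return [doc_id for _, doc_id in scores[:top_k]]


if __name__ == "__main__":
    docs = [
        ("a.txt", "the cat sat on the mat"),
        ("b.txt", "dogs chase the cat"),
        ("c.txt", "stock markets fell sharply today"),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "index.jsonl")
        n, df = build_index(docs, path, batch_size=2)
        print(search("cat on a mat", path, n, df))
```

Because `search` reads the index file line by line, query-time memory use depends on the size of a single document record plus the list of matching scores, not on the size of the whole index. A linear scan like this would still be slow over millions of files; an inverted index keyed by term would let a query touch only the documents that contain its words.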