


local_llama

Interested in chatting with your PDFs entirely offline and free from OpenAI dependencies? Then you're in the right place. I made my other project, gpt_chatwithPDF, with the ultimate goal of local_llama in mind. This repo provides the same functionality as that project, but runs locally and can even be used in airplane mode. Drop a star if you like it!

Video demo here: https://www.reddit.com/user/Jl_btdipsbro/comments/13n6hbz/local_llama/?utm_source=share&utm_medium=ios_app&utm_name=ioscss&utm_content=2&utm_term=1

DISCLAIMER: This is an experimental repo, not an end-all solution. It is meant as a step toward the many use cases for local, offline use of LLMs.

Installation

On Windows you need Visual Studio with a C compiler installed. You also need a model; I used GPT4All-13B-snoozy.ggml.q4_0.bin, but any ggml model should work. I downloaded the model from https://huggingface.co/TheBloke/GPT4All-13B-snoozy-GGML/tree/main. Then install the Python dependencies with pip install -r requirements.txt
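
Once the dependencies and model are in place, a quick way to confirm everything loads is a minimal smoke test like the sketch below. This assumes the project relies on the llama-cpp-python bindings (the llama_print_timings output further down suggests llama.cpp under the hood) and that the .bin file sits in the project root; the actual loading code in local_llama.py may differ.

from llama_cpp import Llama

# Load the downloaded ggml model; adjust the path to wherever you saved it.
# Note: ggml-format files need an older llama-cpp-python release; newer
# releases expect gguf models instead.
llm = Llama(model_path="./GPT4All-13B-snoozy.ggml.q4_0.bin", n_ctx=512)

# Single-question smoke test to confirm the model loads and generates text.
output = llm("Q: What does this repo do? A:", max_tokens=64, stop=["Q:"])
print(output["choices"][0]["text"])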

Usage

Use the command python -m streamlit run "path/to/project/local_llama.py". This will start the app in your browser. Once you have uploaded your PDFs, refresh the browser, select a manual, and ask away!
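
For orientation, here is a stripped-down illustration of the upload-and-ask flow, not the actual local_llama.py (which indexes the uploaded manuals and retrieves relevant passages rather than stuffing raw text into the prompt). It assumes llama-cpp-python and pypdf are installed and the model file is next to the script; all names here are illustrative.

import streamlit as st
from pypdf import PdfReader
from llama_cpp import Llama

MODEL_PATH = "./GPT4All-13B-snoozy.ggml.q4_0.bin"  # adjust to your model location

@st.cache_resource
def load_model():
    # Load the ggml model once and reuse it across Streamlit reruns.
    return Llama(model_path=MODEL_PATH, n_ctx=2048)

st.title("Chat with your PDF (offline)")

uploaded = st.file_uploader("Upload a PDF", type="pdf")
question = st.text_input("Ask a question about the document")

if uploaded and question:
    # Naive approach: put the start of the PDF text directly into the prompt.
    reader = PdfReader(uploaded)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    prompt = f"Context:\n{text[:2000]}\n\nQuestion: {question}\nAnswer:"
    llm = load_model()
    result = llm(prompt, max_tokens=128, stop=["Question:"])
    st.write(result["choices"][0]["text"])

A sketch like this is launched the same way, e.g. python -m streamlit run sketch.py.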

Example CLI output showing inference times, running on my Alienware x14 with a 3060:

TIMES
/GPT4All-13B-snoozy.ggml.q4_0.bin
llama_model_load_internal: format = ggjt v2 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_print_timings: load time = 21283.78 ms
llama_print_timings: sample time = 3.08 ms / 13 runs ( 0.24 ms per token)
llama_print_timings: prompt eval time = 21283.70 ms / 177 tokens ( 120.25 ms per token)
llama_print_timings: eval time = 2047.03 ms / 12 runs ( 170.59 ms per token)
llama_print_timings: total time = 24057.21 ms

History

Credits

TheBloke, for the GPT4All-13B-snoozy.ggml.q4_0.bin model used in this project.

License

Apache 2.0