  • Stars
    3,608
  • Rank 11,760 (Top 0.3 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created about 1 year ago
  • Updated 10 days ago

Repository Details

H2O LLM Studio - a framework and no-code GUI for fine-tuning LLMs. Documentation: https://h2oai.github.io/h2o-llmstudio/

Welcome to H2O LLM Studio, a framework and no-code GUI designed for
fine-tuning state-of-the-art large language models (LLMs).

With H2O LLM Studio, you can

  • easily and effectively fine-tune LLMs without the need for any coding experience.
  • use a graphical user interface (GUI) specially designed for large language models.
  • fine-tune any LLM using a large variety of hyperparameters.
  • use recent fine-tuning techniques such as Low-Rank Adaptation (LoRA) and 8-bit model training with a low memory footprint.
  • use Reinforcement Learning (RL) to fine-tune your model (experimental).
  • use advanced evaluation metrics to judge the answers generated by the model.
  • track and compare your model performance visually. In addition, Neptune integration can be used.
  • chat with your model and get instant feedback on your model performance.
  • easily export your model to the Hugging Face Hub and share it with the community.
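
To give a rough sense of why Low-Rank Adaptation keeps the memory footprint low, the sketch below compares the number of trainable parameters in a dense weight matrix with the number in the two low-rank factors that LoRA trains instead. The layer size and rank are hypothetical values chosen for illustration; they are not H2O LLM Studio defaults.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in a dense weight matrix W of shape (d_out, d_in)."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, for illustration only.
d_out, d_in, rank = 4096, 4096, 8
full = full_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
print(f"dense: {full} parameters, LoRA: {lora} parameters ({lora / full:.2%} of dense)")
```

The dense matrix stays frozen during training, so only the much smaller factors need gradients and optimizer state.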

Quickstart

For questions, discussion, or just hanging out, come and join our Discord!

We offer several ways of getting started quickly.

Using the CLI for fine-tuning LLMs: notebooks are available on Kaggle and in Colab.

What's New

  • PR 328 RLHF is now a separate problem type. Note that starting a new RLHF experiment from an old experiment that used RLHF is no longer supported. To continue from a previous experiment, please start a new experiment and enter the settings from the previous experiment manually.
  • PR 308 Sequence to sequence models have been added as a new problem type.
  • PR 152 Add RLHF functionality for fine-tuning LLMs.
  • PR 132 Add 4bit training that allows training of larger LLM backbones with less GPU memory. See here for a comprehensive summary of this method.
  • PR 40 Added functionality for supporting nested conversations in data. A new parent_id_column can be selected for datasets to support tree-like structures in your conversational data. Additional augmentation settings have been added for this feature.
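
As a rough illustration of what tree-like conversational data means, the sketch below follows parent links to rebuild every root-to-leaf conversation. The rows and the column names "id", "prompt", and "answer" are assumptions made for this example; only parent_id_column comes from the note above, and the code is not taken from H2O LLM Studio itself.

```python
# Hypothetical rows; every column name except the parent id column is an assumption.
rows = [
    {"id": "a", "parent_id": None, "prompt": "Hi", "answer": "Hello!"},
    {"id": "b", "parent_id": "a", "prompt": "What is LoRA?", "answer": "A fine-tuning method."},
    {"id": "c", "parent_id": "a", "prompt": "What is RLHF?", "answer": "Learning from feedback."},
    {"id": "d", "parent_id": "b", "prompt": "Why use it?", "answer": "It saves memory."},
]


def conversation_chains(rows, id_column="id", parent_id_column="parent_id"):
    """Return one root-to-leaf list of row ids per leaf of the conversation tree.

    Assumes the parent links contain no cycles.
    """
    parent_of = {row[id_column]: row[parent_id_column] for row in rows}
    has_child = {parent for parent in parent_of.values() if parent is not None}
    chains = []
    for row in rows:
        node = row[id_column]
        if node in has_child:
            continue  # start only from leaves, so each conversation is listed once
        chain = []
        while node is not None:
            chain.append(node)
            node = parent_of.get(node)  # a missing parent ends the walk
        chains.append(chain[::-1])  # reverse so the root comes first
    return chains


print(conversation_chains(rows))
```

Here rows "b" and "c" both answer the same parent "a", so the data forms a tree with two branches rather than a single linear chat.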

Please note that, due to rapid ongoing development, we cannot guarantee full backwards compatibility of new functionality. We therefore recommend pinning the version of the framework to the one you used for your experiments. To reset, delete or back up your data and output folders.

Setup

H2O LLM Studio requires a machine with Ubuntu 16.04+ and at least one recent Nvidia GPU with Nvidia drivers version >= 470.57.02. For larger models, we recommend at least 24GB of GPU memory.
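
The driver requirement above can be checked with a simple version comparison. This is a minimal sketch, assuming the driver version is a dot-separated string of integers; the helper names are hypothetical and not part of H2O LLM Studio.

```python
MIN_DRIVER = "470.57.02"  # minimum Nvidia driver version quoted above


def parse_version(version: str) -> tuple:
    """Turn a string such as '470.82.01' into (470, 82, 1) for numeric comparison."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_meets_minimum(installed: str, minimum: str = MIN_DRIVER) -> bool:
    """True if the installed driver version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


print(driver_meets_minimum("470.82.01"))
print(driver_meets_minimum("465.19.01"))
```

On a machine with the drivers installed, the version string can typically be read with nvidia-smi --query-gpu=driver_version --format=csv,noheader and passed to driver_meets_minimum.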

For more information about installation prerequisites, see the Set up H2O LLM Studio guide in the documentation.

Recommended Install

The recommended way to install H2O LLM Studio is using pipenv with Python 3.10. To install Python 3.10 on Ubuntu 16.04+, execute the following commands:

System installs (Python 3.10)

sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.10
sudo apt-get install python3.10-distutils
curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10

Installing NVIDIA Drivers (if required)

If you are deploying on a 'bare metal' machine running Ubuntu, you may need to install the required Nvidia drivers and CUDA. The following commands show how to retrieve the drivers for a machine running Ubuntu 20.04 as an example. Adapt them to your OS as needed.

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.4.3/local_installers/cuda-repo-ubuntu2004-11-4-local_11.4.3-470.82.01-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-4-local_11.4.3-470.82.01-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-4-local/7fa2af80.pub
sudo apt-get -y update
sudo apt-get -y install cuda

Create virtual environment (pipenv)

The following command creates a virtual environment using pipenv and installs the dependencies into it:

make setup

Using requirements.txt

If you wish to use conda or another virtual environment, you can also install the dependencies using the requirements.txt file:

pip install -r requirements.txt

Run H2O LLM Studio GUI

You can start H2O LLM Studio using the following command:

make llmstudio

This command starts the H2O Wave server and app. Navigate to http://localhost:10101/ (we recommend using Chrome) to access H2O LLM Studio and start fine-tuning your models!

If you are running H2O LLM Studio with a custom environment other than Pipenv, you need to start the app as follows:

H2O_WAVE_MAX_REQUEST_SIZE=25MB \
H2O_WAVE_NO_LOG=true \
H2O_WAVE_PRIVATE_DIR="/download/@output/download" \
wave run app

Run H2O LLM Studio GUI using Docker from a nightly build

Install Docker first by following instructions from NVIDIA Containers. H2O LLM Studio images are stored in the h2oai GCR vorvan container repository.

mkdir -p `pwd`/data
mkdir -p `pwd`/output
docker run \
    --runtime=nvidia \
    --shm-size=64g \
    --init \
    --rm \
    -u `id -u`:`id -g` \
    -p 10101:10101 \
    -v `pwd`/data:/workspace/data \
    -v `pwd`/output:/workspace/output \
    -v ~/.cache:/home/llmstudio/.cache \
    gcr.io/vorvan/h2oai/h2o-llmstudio:nightly

Navigate to http://localhost:10101/ (we recommend using Chrome) to access H2O LLM Studio and start fine-tuning your models!

(Note: other helpful Docker commands are docker ps and docker kill.)

Run H2O LLM Studio GUI by building your own Docker image

docker build -t h2o-llmstudio .
docker run \
    --runtime=nvidia \
    --shm-size=64g \
    --init \
    --rm \
    -u `id -u`:`id -g` \
    -p 10101:10101 \
    -v `pwd`/data:/workspace/data \
    -v `pwd`/output:/workspace/output \
    -v ~/.cache:/home/llmstudio/.cache \
    h2o-llmstudio

Run H2O LLM Studio with command line interface (CLI)

You can also use H2O LLM Studio with the command line interface (CLI) and specify the configuration file that contains all the experiment parameters. To fine-tune using H2O LLM Studio with the CLI, activate the pipenv environment by running make shell, and then use the following command:

python train.py -C {path_to_config_file}

To run on multiple GPUs in DDP mode, run the following command:

bash distributed_train.sh {NR_OF_GPUS} -C {path_to_config_file}

By default, the framework will run on the first k GPUs. To run on specific GPUs instead, set the CUDA_VISIBLE_DEVICES environment variable before the command.
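
One way to script the GPU selection described above is to build the environment and argument list in Python and hand them to subprocess. This is a sketch only: the helper name, the GPU ids, and the config path are hypothetical, while distributed_train.sh, -C, and CUDA_VISIBLE_DEVICES are taken from the text above.

```python
import os


def distributed_train_command(gpu_ids, config_path):
    """Build (env, argv) for distributed_train.sh restricted to the given GPU ids."""
    env = dict(os.environ)  # copy, so the current process environment is untouched
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(gpu) for gpu in gpu_ids)
    # The first script argument is the number of GPUs, followed by -C and the config file.
    argv = ["bash", "distributed_train.sh", str(len(gpu_ids)), "-C", config_path]
    return env, argv


# Hypothetical GPU ids and config path, for illustration only.
env, argv = distributed_train_command([2, 3], "path/to/config.yaml")
print(env["CUDA_VISIBLE_DEVICES"], argv)
# To launch from the repository root: subprocess.run(argv, env=env, check=True)
```

Setting the variable in a copied environment keeps the restriction local to the launched training process.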

To start an interactive chat with your trained model, use the following command:

python prompt.py -e {experiment_name}

where experiment_name is the output folder of the experiment you want to chat with (see configuration). The interactive chat also works with models that were fine-tuned using the UI.

To publish the model to Hugging Face, use the following command:

make shell 

python publish_to_hugging_face.py -p {path_to_experiment} -d {device} -a {api_key} -u {user_id} -m {model_name} -s {safe_serialization}

  • path_to_experiment is the output folder of the experiment.
  • device is the target device for running the model, either 'cpu' or 'cuda:0'. Default is 'cuda:0'.
  • api_key is the Hugging Face API key. If the user is logged in, it can be omitted.
  • user_id is the Hugging Face user ID. If the user is logged in, it can be omitted.
  • model_name is the name of the model to be published on Hugging Face. It can be omitted.
  • safe_serialization is a flag indicating whether safe serialization should be used. Default is True.
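
The sketch below assembles the publish command from the parameters described above, leaving out the ones that may be omitted. The helper name and the experiment path are hypothetical, and passing safe_serialization as the text "True" or "False" is an assumption about how the script parses -s.

```python
def publish_command(path_to_experiment, device="cuda:0", api_key=None,
                    user_id=None, model_name=None, safe_serialization=True):
    """Build the argument list for publish_to_hugging_face.py."""
    argv = ["python", "publish_to_hugging_face.py",
            "-p", path_to_experiment, "-d", device]
    # api_key, user_id, and model_name are optional, so add them only when given.
    for flag, value in (("-a", api_key), ("-u", user_id), ("-m", model_name)):
        if value is not None:
            argv += [flag, value]
    # Assumption: the script accepts the boolean as the text "True" or "False".
    argv += ["-s", str(safe_serialization)]
    return argv


# Hypothetical experiment folder, for illustration only.
print(" ".join(publish_command("output/my_experiment", device="cpu")))
```

The resulting list can be passed to subprocess.run from inside the environment opened by make shell.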

Data format and example data

For details on the data format required when importing your data or example data that you can use to try out H2O LLM Studio, see Data format in the H2O LLM Studio documentation.

Training your model

With H2O LLM Studio, training your large language model is easy and intuitive. First, upload your dataset, then create an experiment to start training your model. You can then monitor and manage your experiment, compare experiments, or push the model to Hugging Face to share it with the community.

Example: Run on OASST data via CLI

As an example, you can run an experiment on the OASST data via the CLI. For instructions, see the Run an experiment on the OASST data guide in the H2O LLM Studio documentation.

Model checkpoints

All open-source datasets and models are posted on H2O.ai's Hugging Face page and our H2OGPT repository.

Documentation

Detailed documentation and frequently asked questions (FAQs) for H2O LLM Studio can be found at https://docs.h2o.ai/h2o-llmstudio/. If you wish to contribute to the docs, navigate to the /documentation folder of this repo and refer to the README.md for more information.

Contributing

We are happy to accept contributions to the H2O LLM Studio project. Please refer to the CONTRIBUTING.md file for more information.

License

H2O LLM Studio is licensed under the Apache 2.0 license. Please see the LICENSE file for more information.

More Repositories

1

h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/
Python
10,513
star
2

h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Jupyter Notebook
6,658
star
3

wave

Realtime Web Apps and Dashboards for Python and R
Python
3,820
star
4

h2o-2

Please visit https://github.com/h2oai/h2o-3 for latest H2O
Java
2,222
star
5

datatable

A Python package for manipulating 2-dimensional tabular data structures
C++
1,790
star
6

h2o-tutorials

Tutorials and training material for the H2O Machine Learning Platform
Jupyter Notebook
1,457
star
7

sparkling-water

Sparkling Water provides H2O functionality inside Spark cluster
Scala
954
star
8

mli-resources

H2O.ai Machine Learning Interpretability Resources
Jupyter Notebook
478
star
9

h2o4gpu

H2Oai GPU Edition
C++
453
star
10

h2o-meetups

Presentations from H2O meetups & conferences by the H2O.ai team
Jupyter Notebook
412
star
11

awesome-h2o

A curated list of research, applications and projects built using the H2O Machine Learning platform
353
star
12

db-benchmark

reproducible benchmark of database-like ops
R
299
star
13

deepwater

Deep Learning in H2O using Native GPU Backends
C++
285
star
14

pystacknet

Jupyter Notebook
284
star
15

h2o-wizardlm

Open-Source Implementation of WizardLM to turn documents into Q:A pairs for LLM fine-tuning
Python
242
star
16

driverlessai-recipes

Recipes for Driverless AI
Python
224
star
17

nitro

Create apps 10x quicker, without Javascript/HTML/CSS.
TypeScript
198
star
18

wave-apps

Sample AI Apps built with H2O Wave.
Python
139
star
19

h2o-flow

Web based interactive computing environment for H2O
CoffeeScript
131
star
20

tutorials

This is a repo for all the tutorials put out by H2O.ai. This includes learning paths for Driverless AI, H2O-3, Sparkling Water and more...
Jupyter Notebook
127
star
21

rsparkling

RSparkling: Use H2O Sparkling Water from R (Spark + R + Machine Learning)
R
64
star
22

steam

DEPRECATED Build, manage and deploy H2O's high-speed machine learning models.
Java
60
star
23

h2o-world-2014-training

training material
Java
47
star
24

h2o-sparkling

DEPRECATED! Use https://github.com/h2oai/sparkling-water repository! H2O and Spark interoperability based on Tachyon.
Scala
43
star
25

app-consumer-loan

HTML
41
star
26

h2o-kubeflow

Jsonnet
37
star
27

h2o-droplets

Templates for projects based on top of H2O.
Java
37
star
28

driverlessai-tutorials

H2OAI Driverless AI Code Samples and Tutorials
Jupyter Notebook
37
star
29

app-malicious-domains

Domain name classifier looking for good vs. possibly malicious providers
HTML
33
star
30

data-science-examples

A collection of data science examples implemented across a variety of languages and libraries.
CSS
33
star
31

xgboost-predictor

Java
32
star
32

wave-ml

Automatic Machine Learning (AutoML) for Wave Apps
Python
32
star
33

h2o-LLM-eval

Large-language Model Evaluation framework with Elo Leaderboard and A-B testing
Jupyter Notebook
28
star
34

Deep-Learning-with-h2o-in-R

Deep neural networks on over 50 classification problems from the UC Irvine Machine Learning Repository
R
23
star
35

h2o.js

Node.js bindings to H2O, the open-source prediction engine for big data science.
CoffeeScript
21
star
36

perf

Performance Benchmarks
Jupyter Notebook
21
star
37

typesentry

Python 2.7 & 3.5+ runtime type-checker
Python
20
star
38

covid19-datasets

20
star
39

h2o-kubernetes

H2O Open Source Kubernetes operator and a command-line tool to ease deployment (and undeployment) of H2O open-source machine learning platform H2O-3 to Kubernetes.
Rust
20
star
40

sql-sidekick

Experiment on QnA tabular data using LLMs and SQL
Python
18
star
41

AITD

Jupyter Notebook
17
star
42

dai-deployment-templates

Production ready templates for deploying Driverless AI (DAI) scorers. https://h2oai.github.io/dai-deployment-templates/
Java
17
star
43

qcon2015

Repository for SF QConf 2015 Workshop
Java
16
star
44

h2o3-sagemaker

Integrating H2O-3 AutoML with Amazon Sagemaker
Python
13
star
45

wave-image-styling-playground

A interactive playground to style and edit images, generate art and have fun.
Python
13
star
46

article-information-2019

Article for Special Edition of Information: Machine Learning with Python
Jupyter Notebook
13
star
47

genai-app-store-apps

GenAI apps from H2O made Wave
Python
12
star
48

social_ml

Python
12
star
49

challenge-wildfires

Starter kit for H2O.ai competition Challenge Wildfires.
Jupyter Notebook
11
star
50

h2o-jenkins-pipeline-lib

Library of different Jenkins pipeline building blocks.
Groovy
11
star
51

haic-tutorials

Jupyter Notebook
10
star
52

wave-h2o-automl

Wave App for H2O AutoML
Python
9
star
53

cvpr-multiearth-deforestation-segmentation

Jupyter Notebook
8
star
54

app-ask-craig

Ask Craig application
Scala
7
star
55

dai-deployment-examples

Examples for deploying Driverless AI (DAI) scorers.
Java
7
star
56

ml-security-audits

TeX
7
star
57

ht-catalog

Diverse collection of 100 Hydrogen Torch Use-Cases by different industries, data-types, and problem types
HTML
7
star
58

wave-big-data-visualizer

Python
6
star
59

xai_guidelines

Guidelines for the responsible use of explainable AI and machine learning
Jupyter Notebook
5
star
60

authn-py

Universal Token Provider
Python
5
star
61

fluid

Rapid application development for a more... civilized age.
CoffeeScript
5
star
62

h2o-scoring-service

Scoring service backend by model POJOs.
Java
5
star
63

app-news-classification

Scala
5
star
64

covid19-backtesting-publication

Jupyter Notebook
5
star
65

app-mojo-servlet

Example of putting a mojo zip file as a resource into a java servlet.
Java
5
star
66

cloud-discovery-py

H2O Cloud Discovery Client.
Python
4
star
67

jacocoHighlight

Java
4
star
68

h2o-automl-paper

H2O AutoML paper
R
4
star
69

docai-recipes

Jupyter Notebook
4
star
70

deepwater-nae

Python
3
star
71

h2oai-power-nae

Shell
3
star
72

nitro-matplotlib

Matplotlib plugin for H2O Nitro
Python
3
star
73

h2o-cloud

H2O Cloud code.
Jupyter Notebook
3
star
74

h2o-rf1-bench

Python
3
star
75

nitro-plotly

Plotly plugin for H2O Nitro
Python
3
star
76

residuals-vis

JavaScript
3
star
77

python-chat-ui

3
star
78

driverlessai-alt-containers

Shell
2
star
79

camelot

Modified version of https://github.com/camelot-dev/camelot
Python
2
star
80

nitro-bokeh

Bokeh plugin for H2O Nitro
Python
2
star
81

wave-amlb

Wave Dashboard for the OpenML AutoML Benchmark
Python
2
star
82

app-titanic

HTML
2
star
83

py-repo

Python package repository
HTML
2
star
84

roc-chart

JavaScript
2
star
85

h2o3-xgboost-nae

Shell
2
star
86

residuals-vis-example-project

JavaScript
2
star
87

wave-r-data-table

This wave application is a R data.table tutorial and interactive learning environment developed using the wave library for R.
R
2
star
88

h2o_genai_training

Repository for H2O.ai's Generative AI Training
Jupyter Notebook
2
star
89

dai-centos7-x86_64-nae

Dockerfile
1
star
90

correlation-graph

JavaScript
1
star
91

residuals-vis-data

JavaScript
1
star
92

pydart

Dart/Flutter <-> Python transpiler
Python
1
star
93

2017-06-21-hackathon

Meetup Hackathon 06/21/2017
HTML
1
star
94

h2o-health

An initiate of H2O.ai to build AI apps to solve complex healthcare and life science problems
Makefile
1
star
95

lightning

High performance, interactive statistical graphics engine for the web.
CoffeeScript
1
star
96

h2o-google-bigquery

Python
1
star
97

fiction

Yet another markdown-to-documentation generator
CoffeeScript
1
star
98

dallas-tutorials

Temporary repository for fast git cloning during the h2o dallas event.
Jupyter Notebook
1
star
99

pydata2016-h2o-loganalysis

Log Analysis Use Case for PyData2016
Java
1
star
100

aggregator-zoom

JavaScript
1
star