  • Stars: 118
  • Rank: 299,923 (Top 6%)
  • Language: Python
  • License: MIT License
  • Created over 1 year ago
  • Updated over 1 year ago

Repository Details

Reimplementation of the task generation part from the Alpaca paper

Alpaca Libre

🦙🗽 Small research project: how much would it cost to create an Alpaca-like dataset with 50k+ demonstrations, using a slightly different approach? All data byproducts are CC0/MIT-licensed.

🔥 The project also contains 100k+ MIT-licensed demonstrations from Anthropic's HH-RLHF repo, converted into an "Alpaca-compatible" format.
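How the conversion was done is not described here. As a rough sketch only, assuming the HH-RLHF records carry a "chosen" transcript made of "\n\nHuman:" and "\n\nAssistant:" turns, and assuming only the first exchange is kept (both the helper names and that single-turn choice are hypothetical), it could look like this:

```python
import re

# Turn markers as they appear in HH-RLHF style transcripts (assumed layout).
TURN_RE = re.compile(r"\n\n(Human|Assistant):")

def split_turns(transcript):
    """Split a transcript into (speaker, text) pairs."""
    parts = TURN_RE.split(transcript)
    # parts[0] is whatever precedes the first marker (usually empty);
    # after that, speaker names and utterances alternate.
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]

def hh_to_alpaca(record):
    """Hypothetical conversion: keep only the first Human/Assistant exchange."""
    turns = split_turns(record["chosen"])
    if len(turns) < 2 or turns[0][0] != "Human" or turns[1][0] != "Assistant":
        return None  # skip records that do not start with a full exchange
    return {"instruction": turns[0][1], "input": "", "output": turns[1][1]}

# Made-up record, for illustration only.
sample = {
    "chosen": "\n\nHuman: How do I boil an egg?"
              "\n\nAssistant: Put it in boiling water for about ten minutes.",
    "rejected": "\n\nHuman: How do I boil an egg?\n\nAssistant: I don't know.",
}
converted = hh_to_alpaca(sample)
print(converted)
```

Multi-turn dialogues would need a different policy (for example, folding earlier turns into the input field); this sketch simply ignores everything after the first reply.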

👉 Follow me on Twitter for news and updates.

🚫 Remember that releasing a model based on data you generated via a model API might violate the Terms of Service of that API's provider.

BTW: This repo shows how easy it is to fine-tune a Flan-T5-* model with PEFT (LoRA) on an Alpaca-like dataset.

[Image: an alpaca on the Altiplano grasslands with the Statue of Liberty in the background]

Usage

  1. Clone the repo: git clone https://github.com/mobarski/alpaca-libre && cd alpaca-libre
  2. Install the required Python modules: pip install -r requirements.txt
  3. View / edit generate.py
  4. Set the API key: export OPENAI_KEY=...
  5. Run the script: python3 generate.py

Attribution

  • data/seed_tasks.jsonl is from the Self-Instruct paper
  • data/alpaca_libre_prompt_v1.txt is from the Alpaca paper (with slight modifications)

Output

Files in the data/output directory are in the same format as the original Alpaca dataset.
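For reference, the original Alpaca dataset is a single JSON array whose objects have the keys instruction, input and output. A minimal sketch of checking data against that shape follows; the helper name and the sample records are illustrative and not part of this repo.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def check_alpaca_format(text):
    """Parse JSON text and return the records if each one looks Alpaca-like."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - set(rec)
        if missing:
            raise ValueError(f"record {i} is missing keys: {sorted(missing)}")
    return records

# Made-up sample data, for illustration only.
sample = json.dumps([
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
    {"instruction": "Name three primary colors.", "input": "", "output": "Red, yellow, blue."},
])
records = check_alpaca_format(sample)
print(len(records), "records look Alpaca-compatible")
```

To check a real file, pass its contents instead of the sample, for example the text read from one of the files under data/output.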

Files in the data/output/work directory are in the .jsonl format and:

  • contain one task (JSON object) per line,
  • also contain tasks that failed quality checks (status!='ok'),
    • these tasks might be marked as 'ok' after manual inspection
  • each task object has the following items:
    • status - anything other than 'ok' is bad
    • instruction - instruction part of the prompt
    • input - input part of the prompt
    • output - expected output
    • other - dictionary for other information (similarity, etc.)
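The layout described above can be consumed with a few lines of Python. The following is a sketch, not code from this repo: the function names are made up, and the non-'ok' status value and the "similarity" key inside other are assumed for the example. It keeps only tasks with status 'ok' and reduces them to the three Alpaca fields.

```python
import io
import json

def load_ok_tasks(lines):
    """Yield tasks whose status is 'ok' from an iterable of JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        task = json.loads(line)
        if task.get("status") == "ok":
            yield task

def to_alpaca(task):
    """Keep only the fields used by the Alpaca format."""
    return {key: task.get(key, "") for key in ("instruction", "input", "output")}

# Made-up work-file content; "too_similar" is a hypothetical failure status.
sample = io.StringIO(
    '{"status": "ok", "instruction": "Add the numbers.", "input": "2, 3", '
    '"output": "5", "other": {"similarity": 0.12}}\n'
    '{"status": "too_similar", "instruction": "Add numbers.", "input": "2, 3", '
    '"output": "5", "other": {"similarity": 0.97}}\n'
)
alpaca_records = [to_alpaca(task) for task in load_ok_tasks(sample)]
print(json.dumps(alpaca_records, indent=2))
```

With a real work file, an open file handle can be passed in place of the StringIO object, since both iterate line by line.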

Changelog

  • 0.4.2
    • MIT-licensed demonstrations from Anthropic's HH-RLHF repo
      • 104k human-preferred responses from the train datasets:
        • 41k harmless
        • 42k helpful
        • 21k helpful-online
  • 0.4.1
    • v4 dataset converted into the same format as the original Alpaca dataset
    • jsonl dataset moved into work dir
  • 0.4
    • grouping turns into rounds
    • basic input quality check
    • better <noinput> handling
    • <nooutput> handling
    • retry with backoff on API error
    • progressbars
    • fixed: typos in Alpaca prompt
    • fixed: whitespace handling after task number
  • 0.3
    • parallel main loop
    • better cli output
    • output format change (everything not essential is placed in the "other" object)
    • basic output quality check
    • fixed: multiline input/output handling
    • fixed: no initial space / empty section handling
    • fixed: <noinput>
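
The "retry with backoff on API error" entry from 0.4 refers to a common pattern. The repo's own implementation is not shown here; the following is a generic sketch with hypothetical names and parameters (exponential delay plus a little random jitter), where the injectable sleep argument exists only so the demo runs instantly.

```python
import random
import time

def retry_with_backoff(fn, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_tries):
        try:
            return fn()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
            sleep(delay)

# Demo: a stand-in for an API call that fails twice, then succeeds.
calls = {"count": 0}

def flaky_api_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated API error")
    return "ok"

result = retry_with_backoff(flaky_api_call, sleep=lambda seconds: None)
print(result, "after", calls["count"], "calls")
```

In real use one would normally catch only the API client's error types rather than every Exception.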
