  • Stars
    138
  • Rank 259,246 (Top 6 %)
  • Language
    Jupyter Notebook
  • License
    Apache License 2.0
  • Created 11 months ago
  • Updated 11 months ago

Repository Details

Simple-kNN-Gzip

A simplistic linear and multiprocessed approach to sentiment analysis using Gzip Normalized Compression Distances with k nearest neighbors

Original work that this concept is based on: https://aclanthology.org/2023.findings-acl.426.pdf
The paper's authors also have implementation code here: https://github.com/bazingagin/npc_gzip

This is not a fork of their work; I wrote this one myself based on what I read in the paper, just to see if I actually understood what was going on. They report a higher accuracy than I found on a separate dataset, but it would appear there's something interesting and useful about this methodology.
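For readers who want the gist without opening the notebook, here is a minimal sketch of the general technique (gzip-based Normalized Compression Distance plus a k-nearest-neighbors vote). It is not the code from this repository or from the paper's repository. The function names `clen`, `ncd`, and `knn_predict` are made up for illustration, and joining the two strings with a space is an assumption. The distance used is the standard NCD definition: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length.

```python
import gzip
from collections import Counter


def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized Compression Distance between two strings."""
    ca, cb = clen(a), clen(b)
    # Assumption: concatenate with a single space between the two texts.
    cab = clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def knn_predict(query, train_texts, train_labels, k=3):
    """Majority vote over the k training texts nearest to `query`.

    Assumes a non-empty training set. Ties go to the tied label whose
    neighbor is closest to the query.
    """
    # Indices of training texts, nearest first.
    order = sorted(range(len(train_texts)),
                   key=lambda i: ncd(query, train_texts[i]))
    top = [train_labels[i] for i in order[:k]]
    counts = Counter(top)
    best = max(counts.values())
    for label in top:  # `top` is ordered nearest first
        if counts[label] == best:
            return label
```

This version is linear: every prediction compresses the query against every training text, so the cost grows with the size of the training set. That per-query loop is the natural part to spread across processes.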

Ken Schutte also has a couple of writeups explaining at least 2 of the major issues with the original paper:

  1. k=2 tiebreaker issue: https://kenschutte.com/gzip-knn-paper/
  2. Dataset test leakage (test data was also in the training data): https://kenschutte.com/gzip-knn-paper2/
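To make the first issue concrete, here is a small sketch of why tie handling matters at k=2. It is not code from either repository, and the function names are hypothetical. It contrasts a strict rule, where the two neighbors vote and a tie falls back to the nearest neighbor, with an optimistic rule that counts a prediction as correct whenever either of the two neighbors carries the true label, which is effectively top-2 accuracy.

```python
def strict_k2_correct(neighbor_labels, true_label):
    """k=2 vote; on a tie, the nearest neighbor's label wins.

    `neighbor_labels` is ordered nearest first. With only two voters,
    the winner is always the first label: either both agree, or they
    tie and the nearest neighbor breaks the tie.
    """
    predicted = neighbor_labels[0]
    return predicted == true_label


def optimistic_k2_correct(neighbor_labels, true_label):
    """Counts as correct if either of the two nearest labels matches."""
    return true_label in neighbor_labels[:2]


def accuracy(rows, scorer):
    """`rows` is a list of (neighbor_labels, true_label) pairs."""
    return sum(scorer(labels, truth) for labels, truth in rows) / len(rows)
```

Under the strict rule, k=2 behaves exactly like k=1, so any gap between the two scores comes purely from how ties are scored.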

I don't think either of those things invalidates my findings here, though notably my accuracy is far below their reported accuracy anyway.

Future work here:

I wonder about further "feature extraction" based on this sort of "compression lengths" as features. For example, rather than NCDs, maybe the compression ratio from original string size to compressed size would be even more useful, since (I believe) the reason this works is statistical similarities in words/phrases and their syntactic uses, which Gzip exploits for compression.
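A rough sketch of what such a compression-ratio feature could look like is below. This is untested as a classifier input and is only meant to pin down the idea. The names `compression_ratio` and `ratio_features` are hypothetical, and defining the ratio as compressed bytes divided by raw bytes is an assumption (the inverse would work just as well).

```python
import gzip


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size, both in bytes.

    Lower values mean the text is more repetitive and compresses better.
    Returns 0.0 for empty input to avoid dividing by zero.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(gzip.compress(raw)) / len(raw)


def ratio_features(texts):
    """One compression-ratio feature per input text."""
    return [compression_ratio(t) for t in texts]
```

One caveat: the gzip container adds a fixed header and trailer (roughly 18 bytes), so very short strings come out with a ratio above 1. It may be worth comparing only texts of similar length, or subtracting that fixed overhead first.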

I am also curious whether it's remotely possible to use a compressor like gzip as a compressor and potentially a tokenizer for transformers. Surely this isn't a new idea and there's a great reason why this won't work, but I am tempted to try that next.

IDK. This just shouldn't work at all IMO :D

More Repositories

1

pygta5

Explorations of Using Python to play Grand Theft Auto 5.
Python
3,864
star
2

NNfSiX

Neural Networks from Scratch in various programming languages
C++
1,358
star
3

GANTheftAuto

Python
843
star
4

socialsentiment

Sentiment Analysis application created with Python and Dash, hosted at socialsentiment.net
Python
467
star
5

TermGPT

Giving LLMs like GPT-4 the ability to plan and execute terminal commands
Jupyter Notebook
395
star
6

Carla-RL

Reinforcement Learning codebase for self-driving car in Carla
Python
339
star
7

ChatGPT-at-Home

ChatGPT @ Home: Large Language Model (LLM) chatbot application, written by ChatGPT
Python
325
star
8

ChatGPT-API-Basics

Jupyter Notebook
292
star
9

BCI

Brain-Computer interface stuff
Python
285
star
10

nnfs_book

Sample code from the Neural Networks from Scratch book.
Python
261
star
11

BLOOM_Examples

Some quick BLOOM LLM examples
Jupyter Notebook
258
star
12

nnfs

Neural Networks from Scratch
Python
177
star
13

Falcon-LLM

Helper scripts and examples for exploring the Falcon LLM models
Jupyter Notebook
168
star
14

SC2RL

Reinforcement Learning + Starcraft 2
Python
139
star
15

QuantumComputing

Collection of Tutorials and other Quantum Computer programming related things.
Jupyter Notebook
134
star
16

cyberpython2077

Using Python to Play Cyberpunk 2077
Python
122
star
17

GPT-Journey

Building a text and image-based journey game powered by, and with, GPT 3.5
Python
79
star
18

OpenAssistant_API_Pythia_12B

Creating and Using an Open Assistant API locally (Pythia 12B GPT model)
Jupyter Notebook
75
star
19

neural-net-internals-visualized

Visualizing some of the internals of a neural network during training and inference.
Jupyter Notebook
59
star
20

reddit_spam_detector_bot

Bot that detects spam/affiliate marketing authors, and posts some stats on their threads.
Python
58
star
21

Together-API-Basics

Some information for working with the Together inference API for Open Source AI models
Jupyter Notebook
55
star
22

sentdebot

Code for Sentdebot in the Sentdex discord channel (discord.gg/sentdex)
Python
53
star
23

NEAT-samples

samples of neat code
Python
50
star
24

Lambda-Cloud

Helpers and such for working with Lambda Cloud
Python
49
star
25

LLM-Finetuning

Some helpers and examples for creating an LLM fine-tuning dataset
Jupyter Notebook
46
star
26

uarm

uArm Things
Python
29
star
27

satisfunctions

Fighting arthritis from Satisfactory one function at a time.
Python
23
star
28

PyGTA5_Reboot

Python Plays GTA V Reboot
18
star
29

TTSentdex9000

I am a human just like you!
16
star
30

chatbotrnd

working with chatbot response scoring.
Python
14
star
31

HF-Cache-Cleanup

cleanup cached models.
Python
10
star
32

cellvolution

Evolutionary cell-based simulation
Python
1
star