Sentdex/Simple-kNN-Gzip

Language
Jupyter Notebook
License
Apache License 2.0
Created over 1 year ago
Updated over 1 year ago


A simplistic linear and multiprocessed approach to sentiment analysis using Gzip Normalized Compression Distances with k nearest neighbors

Simple-kNN-Gzip


Original work that this concept is based on: https://aclanthology.org/2023.findings-acl.426.pdf

The paper's authors also have implementation code here: https://github.com/bazingagin/npc_gzip

This is not a fork of their work. I wrote this one myself, based on what I read in the paper, just to see if I actually understood what was going on. They achieve a higher accuracy than I found personally on a separate dataset, but it would appear there's something interesting and useful about this methodology.
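For readers who have not seen the technique, here is a minimal sketch of gzip-based Normalized Compression Distance with a k-nearest-neighbors vote. It is not the code from this repo or from the paper. The function names (`clen`, `ncd`, `knn_predict`), the space used to join the two strings, and the toy data in the usage note are all assumptions made for illustration.

```python
import gzip
from collections import Counter


def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized Compression Distance: (C(ab) - min(C(a), C(b))) / max(C(a), C(b))."""
    ca, cb = clen(a), clen(b)
    cab = clen(a + " " + b)  # joining with a space is an assumption
    return (cab - min(ca, cb)) / max(ca, cb)


def knn_predict(query: str, train: list, k: int = 3) -> str:
    """train is a list of (text, label) pairs; returns the majority label of the k nearest."""
    nearest = sorted(train, key=lambda pair: ncd(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Usage would look something like `knn_predict("what a great film", [("I loved this movie", "pos"), ("terrible, boring plot", "neg")], k=1)`. Every prediction compresses the query against every training text, which is the cost a multiprocessed version tries to spread across cores.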

Ken Schutte also has a couple of writeups explaining at least 2 of the major issues with the original paper:

k=2 tiebreaker issue: https://kenschutte.com/gzip-knn-paper/
Dataset test leakage (test data was also in the training data): https://kenschutte.com/gzip-knn-paper2/
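To make the k=2 concern concrete, here is a small sketch of two ways to treat a tied vote. With k=2, any disagreement between the two neighbors is a 1-1 tie, so the tie-breaking rule decides the prediction. Both functions are hypothetical illustrations and are not taken from the paper's code or from the writeups. `optimistic_hit` shows the kind of scoring that, as I read the first writeup, ends up closer to top-2 accuracy than to a real kNN prediction.

```python
from collections import Counter


def vote(neighbor_labels: list) -> str:
    """Majority vote over labels ordered nearest-first; ties go to the nearest neighbor."""
    counts = Counter(neighbor_labels)
    best = max(counts.values())
    tied = {label for label, n in counts.items() if n == best}
    # Walk the neighbors from nearest to farthest and take the first tied label.
    return next(label for label in neighbor_labels if label in tied)


def optimistic_hit(neighbor_labels: list, truth: str) -> bool:
    """Counts a hit if ANY label tied for the top vote equals truth (inflates accuracy)."""
    counts = Counter(neighbor_labels)
    best = max(counts.values())
    return any(n == best and label == truth for label, n in counts.items())
```

With neighbors `["neg", "pos"]` and a true label of `"pos"`, `vote` predicts `"neg"` (a miss), while `optimistic_hit` reports a hit.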

I don't think either of those things invalidates my findings here, though notably my accuracy is far below their reported accuracy anyway.

Future work here:

I wonder about further "feature extraction" based on this sort of "compression lengths" as features. For example, rather than NCDs, maybe the compression ratio from original string size to compressed size would be even more useful, since (I believe) the reason this works is statistical similarities in words/phrases and their syntactic uses, which Gzip uses for compression.
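A rough sketch of what a per-string compression-ratio feature could look like, purely as an illustration of the idea above. The function name `compression_ratio` and the choice of compressed bytes divided by raw bytes are assumptions, and nothing here has been evaluated for sentiment analysis.

```python
import gzip


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw UTF-8 size; lower means more internal redundancy."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(gzip.compress(raw)) / len(raw)
```

Note that gzip adds a fixed header and trailer, so very short strings can come out with a ratio above 1. Unlike an NCD, this is a single number per string rather than a pairwise distance, so it would feed a classifier as a feature rather than drive a nearest-neighbor lookup directly.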

I am also curious if it's remotely possible to use a compressor like gzip as a compressor and potentially a tokenizer for transformers. Surely this isn't a new idea and there's a great reason why this won't work, but I am tempted to try that next.

IDK. This just shouldn't work at all IMO :D

pygta5

Explorations of Using Python to play Grand Theft Auto 5.

NNfSiX

Neural Networks from Scratch in various programming languages

GANTheftAuto

socialsentiment

Sentiment Analysis application created with Python and Dash, hosted at socialsentiment.net

TermGPT

Giving LLMs like GPT-4 the ability to plan and execute terminal commands

Jupyter Notebook

Carla-RL

Reinforcement Learning codebase for self-driving car in Carla

ChatGPT-at-Home

ChatGPT @ Home: Large Language Model (LLM) chatbot application, written by ChatGPT

BCI

Brain-Computer interface stuff

ChatGPT-API-Basics

Jupyter Notebook

nnfs_book

Sample code from the Neural Networks from Scratch book.

BLOOM_Examples

Some quick BLOOM LLM examples

Jupyter Notebook

nnfs

Neural Networks from Scratch

Falcon-LLM

Helper scripts and examples for exploring the Falcon LLM models

Jupyter Notebook

SC2RL

Reinforcement Learning + Starcraft 2

QuantumComputing

Collection of Tutorials and other Quantum Computer programming related things.

Jupyter Notebook

cyberpython2077

Using Python to Play Cyberpunk 2077

GPT-Journey

Building a text and image-based journey game powered by, and with, GPT 3.5

OpenAssistant_API_Pythia_12B

Creating and Using an Open Assistant API locally (Pythia 12B GPT model)

Jupyter Notebook

neural-net-internals-visualized

Visualizing some of the internals of a neural network during training and inference.

Jupyter Notebook

reddit_spam_detector_bot

Bot that detects spam/affiliate marketing authors, and posts some stats on their threads.

Together-API-Basics

Some information for working with the Together inference API for Open Source AI models

Jupyter Notebook

LLM-Finetuning

Some helpers and examples for creating an LLM fine-tuning dataset

Jupyter Notebook

sentdebot

Code for Sentdebot in the Sentdex discord channel (discord.gg/sentdex)

Lambda-Cloud

Helpers and such for working with Lambda Cloud

NEAT-samples

samples of neat code

uarm

cellvolution

Evolutionary cell-based simulation

satisfunctions

Fighting arthritis from Satisfactory one function at a time.

PyGTA5_Reboot

Python Plays GTA V Reboot

TTSentdex9000

I am a human just like you!

chatbotrnd

working with chatbot response scoring.

HF-Cache-Cleanup

cleanup cached models.