rogeriochaves/driver

Stars: 117
Rank: 301,828 (Top 6 %)
Language: Python
License: MIT License
Created 12 months ago
Updated 11 months ago


Driver: GPT-V + OCR Screen Control

This project integrates GPT-V with OCR to address GPT-V's shortcomings in pointing precisely at something on the screen. The result is an AI that can see, understand and interact with your computer screen.

The way it works is by annotating each identified element with a label that GPT-V can use:

Screenshot	Annotated
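The labeling idea above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Element` class, the `A1`, `A2`, ... label scheme, and the `label_elements` and `resolve_click` helpers are all hypothetical names chosen for the example. It assumes OCR has already returned a bounding box for each piece of text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A detected on-screen element (hypothetical structure for illustration)."""
    text: str
    left: int
    top: int
    width: int
    height: int

    def center(self) -> tuple[int, int]:
        # Point to click: the middle of the bounding box.
        return (self.left + self.width // 2, self.top + self.height // 2)


def label_elements(elements: list[Element]) -> dict[str, Element]:
    """Assign short labels (A1, A2, ...) in reading order: top to bottom, then left to right."""
    ordered = sorted(elements, key=lambda e: (e.top, e.left))
    return {f"A{i}": element for i, element in enumerate(ordered, start=1)}


def resolve_click(labels: dict[str, Element], label: str) -> tuple[int, int]:
    """Translate a label chosen by the model back into screen coordinates."""
    return labels[label].center()


# Two made-up elements standing in for OCR output.
elements = [
    Element("Send", left=400, top=300, width=80, height=30),
    Element("Compose", left=20, top=100, width=120, height=40),
]
labels = label_elements(elements)
print(resolve_click(labels, "A1"))  # "Compose" comes first in reading order
```

The point of the indirection is that the model only has to name a label it can read on the annotated screenshot, while the exact pixel coordinates stay on the program's side.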

Demo

demo.mp4

Here is the link to the tweet

More demos

Playing Tic-Tac-Toe

Navigating in Chinese

Installation

Clone this repo:

git clone https://github.com/rogeriochaves/driver.git

Install the dependencies:

pip install -r requirements.txt

Now copy .env.example to .env. You need an OpenAI key plus a key for one OCR provider: Azure Vision, Google Cloud Vision or Baidu (if you want to contribute, DM me on Twitter and I can send you a key for Azure Vision).

# OpenAI key, used by GPT-V to "see" the screen
OPENAI_API_KEY=""
# Choose one OCR provider to help GPT-V find elements on the screen:
# Azure, Google Cloud Vision or Baidu (better for Chinese)
AZURE_VISION_API_KEY=""
AZURE_VISION_ENDPOINT=""
# GCLOUD_VISION_API_KEY=""
# BAIDU_OCR_API_KEY=""
# BAIDU_OCR_SECRET_KEY=""
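Before running, it can help to check that the configuration above is complete. The sketch below is only an illustration: the variable names come from .env.example, but `choose_ocr_provider`, `check_config`, and the Azure, Google, Baidu priority order are assumptions made for this example and may not match how the project itself selects a provider.

```python
from typing import Mapping, Optional

# Variable names are taken from .env.example; the priority order is an assumption.
PROVIDER_KEYS = {
    "azure": ("AZURE_VISION_API_KEY", "AZURE_VISION_ENDPOINT"),
    "google": ("GCLOUD_VISION_API_KEY",),
    "baidu": ("BAIDU_OCR_API_KEY", "BAIDU_OCR_SECRET_KEY"),
}


def choose_ocr_provider(env: Mapping[str, str]) -> Optional[str]:
    """Return the first provider whose variables are all set and non-empty."""
    for provider, keys in PROVIDER_KEYS.items():
        if all(env.get(key, "").strip() for key in keys):
            return provider
    return None


def check_config(env: Mapping[str, str]) -> list[str]:
    """Collect human-readable problems; an empty list means the config looks usable."""
    problems = []
    if not env.get("OPENAI_API_KEY", "").strip():
        problems.append("OPENAI_API_KEY is missing")
    if choose_ocr_provider(env) is None:
        problems.append("no OCR provider is fully configured")
    return problems
```

Once the values from .env are loaded into the process environment, `check_config(os.environ)` (after `import os`) would report anything that is still missing.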

Finally, ask it to do anything you want!

python main.py "hey there, please go to my gmail and send an email to Laura with a poem declaring my love"

Acknowledgments

Thanks to @MulongXie et al. for building the UIED algorithm, which is used together with OCR to identify GUI elements.

Contributing

There is A LOT that can be done. The project is very new, so contributions are very welcome! If you have suggestions for improvements or new features, open an issue. Feel free to fork the repository, make your changes, and submit a pull request.

License

This project is open-source and available under the MIT License. See the LICENSE file for more details.
