  • Stars: 4,914
  • Rank 8,546 (Top 0.2%)
  • Language: JavaScript
  • Created almost 11 years ago
  • Updated 11 months ago

Repository Details

A collection of small corpora of interesting data for the creation of bots and similar stuff.

Corpora

This project is a collection of static corpora (plural of "corpus") that are potentially useful in the creation of weird internet stuff. I've found that, as a creator, sometimes I am making something that needs access to a lot of adjectives, but not necessarily every adjective in the English language. So for the last year I've been copy/pasting an adjs.json file from project to project. This is kind of awful, so I'm hoping that this project will at least help me keep everything in one place.

I would like this to help with rapid prototyping of projects. For example: you might use nouns.json to start with, just to see if an idea you had was any good. Once you've built the project quickly around the nouns collection, you can then rip it out and replace it with a more complex or exhaustive data source.
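
One way to keep that later swap cheap is to hide the word list behind a single function from the start. The sketch below is only an illustration: the inline JSON stands in for a file such as nouns.json, its layout (an object with a "nouns" array) is an assumption, and the names load_nouns, make_project_name, and pick_from_corpus are hypothetical.

```python
import json
import random

# Stand-in for the contents of a small file such as nouns.json.
# The {"nouns": [...]} layout is an assumption made for this sketch.
SAMPLE_NOUNS_JSON = '{"nouns": ["anchor", "lantern", "orchard", "violin"]}'


def load_nouns(raw_json):
    """Parse a JSON document and return its list of nouns."""
    return json.loads(raw_json)["nouns"]


def make_project_name(pick_noun):
    """Build a two-noun name; pick_noun is any zero-argument callable returning a word."""
    return f"{pick_noun()} {pick_noun()}".title()


nouns = load_nouns(SAMPLE_NOUNS_JSON)
rng = random.Random(0)  # seeded so repeated runs behave the same


def pick_from_corpus():
    """Prototype picker backed by the small static list."""
    return rng.choice(nouns)


print(make_project_name(pick_from_corpus))
```

When the idea proves worthwhile, pick_from_corpus can be replaced by a function that queries a larger source, and make_project_name stays unchanged.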

I'm also hoping that this can be used as a teaching tool: maybe someone has three hours to teach how to make Twitter bots. That doesn't give the student much time to find/scrape/clean/parse interesting data. My hope is that students can be pointed to this project and they can pick and choose different interesting data sources to meld together for the creation of prototypes.

License

Since Corpora is more data than code, I have chosen to CC0 license this (rather than MIT license or similar).

To the extent possible under law, Darius Kazemi has waived all copyright and related or neighboring rights to Corpora. This work is published from: United States.

What is Corpora NOT?

This project is not meant to replace exhaustive APIs -- if you want nouns, and you want every noun in the English language, replete with metadata, consider Wordnik. If you want the title of every Wikipedia article, use the MediaWiki API.

What is Corpora?

  • Corpora is a repository of JSON files, meant to be language-neutral. If you want to create an NPM repo or whatever based on this, be my guest, but this repository will remain a collection of data files that can be interpreted by any language that can parse JSON.
  • Corpora is a collection of small files. It is not meant to be an exhaustive source of anything: a list of resources should contain somewhere in the vicinity of 1000 items.
    • For example, Corpora will not contain any complete "dictionary" style files. Instead we host a sampling of 1000 common nouns, adjectives, and verbs.
    • Some lists are small enough by nature that we may include a complete list of things in their category. For example, a list of heavily populated U.S. cities may only have 75 cities and be considered complete.
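
Because each corpus is plain JSON, reading one needs nothing more than a standard JSON parser. The following sketch writes a tiny stand-in file to a temporary directory, reads it back, and reports how many items each top-level list holds. The file name, its keys, and the helper summarize_corpus are all hypothetical; real files in the repository may be shaped differently.

```python
import json
import tempfile
from pathlib import Path


def summarize_corpus(path):
    """Return {key: item_count} for every top-level list in a JSON object file."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return {key: len(value) for key, value in data.items() if isinstance(value, list)}


# Write a small stand-in corpus so the example runs without the real repository.
with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = Path(tmp_dir) / "example_cities.json"  # hypothetical file name
    corpus_path.write_text(
        json.dumps({"description": "Made-up sample data", "cities": ["A", "B", "C"]}),
        encoding="utf-8",
    )
    summary = summarize_corpus(corpus_path)

print(summary)  # {'cities': 3}
```

A count like this is a quick way to see whether a list sits near the 1000-item range described above.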

List of Corpora-related tools

I have some data, how do I submit?

We accept pull requests to this repository. Some guidelines:

  • BY SUBMITTING DATA AS A PULL REQUEST, YOU AGREE TO OUR APPLYING A CC0 FREE CULTURE LICENSE TO THE DATA, MEANING THAT ANYONE CAN USE THE DATA FOR ANY REASON WITHOUT ATTRIBUTION IN PERPETUITY.
  • Please submit all data in JSON format, in a file with a .json extension, and please JSONLint your files before submitting. Thanks to Matt Rothenberg, we also have Travis-CI testing, which will jsonlint your pull request automatically. If you see a test failure notification in your PR after you submit, there's a problem with your JSON!
  • Keep individual files to about 1000 "things" maximum. Fewer than 1000 is fine, too.
  • If you'd like attribution, I'm happy to include your name in this Readme file. Just remember that nobody who uses this data is obligated to include attribution in their own projects.
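
Before opening a pull request, a quick local check along the lines of these guidelines can catch the most common problems. This is a rough sketch, not the project's Travis-CI test: check_submission and largest_list_length are hypothetical helpers, and treating the longest list found anywhere in the file as the count of "things", with a hard limit of exactly 1000, is an assumption.

```python
import json

MAX_ITEMS = 1000  # guideline: about 1000 "things" per file (hard cutoff assumed here)


def largest_list_length(value):
    """Return the length of the longest list found anywhere inside parsed JSON."""
    if isinstance(value, list):
        return max([len(value)] + [largest_list_length(item) for item in value])
    if isinstance(value, dict):
        return max([0] + [largest_list_length(item) for item in value.values()])
    return 0


def check_submission(filename, text):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if not filename.endswith(".json"):
        problems.append("file name should end in .json")
    try:
        data = json.loads(text)
    except json.JSONDecodeError as error:
        # Stop here: the size check needs successfully parsed data.
        problems.append(f"invalid JSON: {error.msg} at line {error.lineno}")
        return problems
    if largest_list_length(data) > MAX_ITEMS:
        problems.append(f"more than {MAX_ITEMS} items in a single list")
    return problems


print(check_submission("colors.json", '{"colors": ["red", "green"]}'))  # []
print(check_submission("colors.txt", '{"colors": ["red", "green",]}'))
```

The second call reports two problems: the wrong file extension and the trailing comma, which strict JSON does not allow.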

Contributors

By Darius Kazemi and Many Wonderful Contributors.
