visualization of common words in different programming languages

Common words

This visualization shows which words are used most often in different programming languages.

The index was built between the middle and the end of 2016 from ~3 million public open source GitHub repositories. Results are presented as word clouds and text:

[Image: demo]

Below is a description of the hows and whys. If you want to explore the visualizations, please click here: common words.

Tidbits

  • I store the most common words from many different programming languages as part of this repository. GitHub's language recognition treats this repository as mostly C++. It makes sense because many of those languages were inspired by C/C++. [Image: GitHub thinks it is C++]

  • License text is commonly put into comments in every programming language. Of all languages, Java was the winner, with 127 words out of 966 coming from license text. [Image: lots of license in Java]

    • In fact it was so overwhelming that I decided to filter out license text.
  • Lua is the only programming language that has a swear word in the top 1,000. Can you find it?

  • In Go err is as popular as return. Here is why.

If you find more interesting discoveries - please let me know. I'd be happy to include them here.

How?

I extracted individual words from the github_repos data set using BigQuery. A word is extracted along with the top 10 lines of code where this word appeared.

I apply several constraints before saving individual words:

  • The line where this word appears should be shorter than 120 characters. This helps me filter out code not written by a human, like minified JavaScript.
  • I ignore punctuation (, ; : .), operators (+ - * ...) and numbers. So if the line is a+b + 42, then only two words are extracted: a and b.
  • I ignore lines with "license markers" - words that predominantly appear inside license text (e.g. license, noninfringement, etc.). License text is very common in code. It was interesting to see at the beginning, but overwhelming at the end, so I filtered it out.
  • Words are case sensitive: This and this will be counted as two separate words.
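The constraints above can be sketched in JavaScript. This is only an illustration, not the BigQuery script that produced the data: the function name `extractWords`, the two-entry `LICENSE_MARKERS` set, and the case-insensitive matching of markers are assumptions made for the example.

```javascript
// Hypothetical marker list: the text names only these two examples.
const LICENSE_MARKERS = new Set(['license', 'noninfringement']);
const MAX_LINE_LENGTH = 120;

// Returns the words of a single line of code, applying the constraints above.
function extractWords(line) {
  // Long lines are likely generated code (e.g. minified JavaScript).
  if (line.length >= MAX_LINE_LENGTH) return [];

  // Splitting on anything that is not a letter, digit, or underscore
  // drops punctuation and operators in one step.
  const tokens = line.split(/[^A-Za-z0-9_]+/).filter(Boolean);

  // Ignore the whole line if it contains a license marker.
  // Matching markers case-insensitively is an assumption of this sketch.
  if (tokens.some(token => LICENSE_MARKERS.has(token.toLowerCase()))) return [];

  // Drop pure numbers. Case is preserved, so `This` and `this` stay distinct.
  return tokens.filter(token => !/^\d+$/.test(token));
}
```

With this sketch, `extractWords('a+b + 42')` yields `a` and `b`, matching the example in the list.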

How was the data collected?

In this section we take a deeper look at word extraction. If you are not interested, jump to the word clouds algorithm.

The data comes from GitHub's public data set, indexed by BigQuery: github_repos

BigQuery stores the contents of each indexed file in a table as plain text:

| File Id  | Content                                        |
|----------|------------------------------------------------|
| File 1.h | // File 1 content\n#ifndef FOO\n#define FOO... |
| File 2.h | // File 2 content\n#ifndef BAR\n#define BAR... |

To build a word cloud we need a weight to scale each word accordingly.

To get the weight we could split text into individual words, and then group table by each word:

| Word    | Count |
|---------|-------|
| File    | 2     |
| content | 2     |
| ...     | ...   |

Unfortunately, this naive approach does exactly what people don't like about word clouds - each word will be taken out of context.

I wanted to avoid this problem, and allow people to explore each word along with their contexts:

[Image: context demo]

To achieve this, I created a temporary table (code), that instead of counting individual words counts lines:

| Line              | Count |
|-------------------|-------|
| // File 1 content | 1     |
| #ifndef FOO       | 1     |
| #define FOO       | 1     |
| ...               | ...   |

This gave me "contexts" for each word and reduced the overall data size from a couple of terabytes to ~12GB.

To get top words from this table we can employ the previously mentioned technique of splitting line content into individual words, and then group the table by each word. We can also get a word's context if we keep the original line in an intermediate table:

| Line              | Word    |
|-------------------|---------|
| // File 1 content | File    |
| // File 1 content | content |
| #ifndef FOO       | ifndef  |
| #ifndef FOO       | FOO     |
| ...               | ...     |

From this intermediate representation we can use a SQL window function to group by word and get the top 10 lines for each word (more info here: Select top 10 records for each category).
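The same pipeline can be mimicked in memory to make the three tables concrete. The sketch below is not a translation of extract_words.sql: `countLines` and `topContexts` are hypothetical names, and counting a word once per distinct line (weighted by how often that line occurs) is an assumption.

```javascript
// Step 1: count identical lines across all files (the "Line | Count" table).
function countLines(fileContents) {
  const counts = new Map();
  for (const content of fileContents) {
    for (const line of content.split('\n')) {
      counts.set(line, (counts.get(line) || 0) + 1);
    }
  }
  return counts;
}

// Step 2: split every line into words (the "Line | Word" table), then
// group by word and keep only the `limit` most frequent lines per word.
function topContexts(lineCounts, tokenize, limit = 10) {
  const byWord = new Map(); // word -> { count, lines: [{ line, count }] }
  for (const [line, count] of lineCounts) {
    // A Set makes a word repeated inside one line count once for that line.
    for (const word of new Set(tokenize(line))) {
      let entry = byWord.get(word);
      if (!entry) {
        entry = { count: 0, lines: [] };
        byWord.set(word, entry);
      }
      entry.count += count;
      entry.lines.push({ line, count });
    }
  }
  for (const entry of byWord.values()) {
    entry.lines.sort((a, b) => b.count - a.count);
    entry.lines.length = Math.min(entry.lines.length, limit);
  }
  return byWord;
}
```

Keeping the line next to each word is what makes the context available later: the word's weight is the sum of its line counts, and the retained lines are the examples shown next to it.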

Current extraction code can be found here: extract_words.sql

Note 1: My SQL-fu is in kindergarten, so please let me know if you find an error or a more appropriate way to get the data. While the current script is working, I think there may be cases where results are slightly skewed.

Note 2: BigQuery is amazing. It is powerful, flexible, and fast. Huge kudos to the amazing people who work on it.

How are word clouds rendered?

At the heart of word clouds lies a very simple algorithm:

for each word `w`:
  repeat:
    place word `w` at random point (x, y)
  until `w` does not intersect any other word

To prevent the inner loop from running indefinitely we can try only a limited number of times and/or reduce the word's font size if it doesn't fit.
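A minimal runnable version of this loop, working on bounding rectangles instead of rendered text, could look like the following. The option names (`maxAttempts`, `shrink`, `minScale`) and their default values are assumptions for illustration, not values used by the website.

```javascript
// Axis-aligned rectangle overlap test.
function intersects(a, b) {
  return a.x < b.x + b.width && b.x < a.x + a.width &&
         a.y < b.y + b.height && b.y < a.y + a.height;
}

// Tries random positions for each rectangle. After `maxAttempts` failures the
// rectangle is scaled down (a stand-in for reducing the font size) and retried.
function placeRectangles(rects, canvasWidth, canvasHeight, options = {}) {
  const { maxAttempts = 200, shrink = 0.8, minScale = 0.2, random = Math.random } = options;
  const placed = [];
  for (const rect of rects) {
    let scale = 1;
    let spot = null;
    while (!spot && scale >= minScale) {
      const width = rect.width * scale;
      const height = rect.height * scale;
      if (width <= canvasWidth && height <= canvasHeight) {
        for (let attempt = 0; attempt < maxAttempts && !spot; attempt++) {
          const candidate = {
            word: rect.word,
            x: random() * (canvasWidth - width),
            y: random() * (canvasHeight - height),
            width,
            height
          };
          if (!placed.some(other => intersects(candidate, other))) spot = candidate;
        }
      }
      if (!spot) scale *= shrink;
    }
    // A rectangle that never fits, even at the minimum scale, is skipped.
    if (spot) placed.push(spot);
  }
  return placed;
}
```

Every candidate is checked against all previously placed rectangles, so the cost grows quickly as the canvas fills up.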

If we step back a little bit from the words, we can formulate this problem in terms of rectangles: for each rectangle, try to place it onto a canvas until it doesn't intersect any occupied pixel.

Obviously, when the canvas is heavily occupied, finding a spot for a new rectangle can become challenging or even impossible.

Various implementations tried to speed up this algorithm by indexing occupied space:

  • Use a summed area table to tell quickly, in O(1) time, whether a new candidate rectangle intersects anything under it. The downside of this method is that each canvas update requires updating the entire table, which hurts performance;
  • Maintain some sort of R-tree to tell quickly whether a new candidate rectangle intersects anything under it. Intersection lookup in this approach is slower than in summed area tables, but index maintenance is faster.
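To make the first option concrete, here is a small sketch of a summed area table over an occupancy mask. The flat array layout (one entry per pixel, row by row, truthy meaning occupied) and the function names are assumptions for the example.

```javascript
// Builds a summed area table: entry (x + 1, y + 1) holds the number of
// occupied pixels in the rectangle from (0, 0) to (x, y), inclusive.
// The extra zero row and column remove the need for bounds checks.
function buildSummedAreaTable(occupied, width, height) {
  const stride = width + 1;
  const table = new Uint32Array(stride * (height + 1));
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      table[(y + 1) * stride + (x + 1)] =
        (occupied[y * width + x] ? 1 : 0) +
        table[y * stride + (x + 1)] +
        table[(y + 1) * stride + x] -
        table[y * stride + x];
    }
  }
  return { table, stride };
}

// Counts occupied pixels inside the w x h rectangle at (x, y) with four lookups.
function occupiedCount(sat, x, y, w, h) {
  const { table, stride } = sat;
  return table[(y + h) * stride + (x + w)] -
         table[y * stride + (x + w)] -
         table[(y + h) * stride + x] +
         table[y * stride + x];
}
```

A candidate rectangle is free exactly when `occupiedCount` returns 0. The query takes four lookups, but placing a word changes the mask, and rebuilding the table touches every pixel again.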

I think the main downside of both of these methods is that we can still pick a wrong initial point many times before we find a spot that fits the new rectangle.

I wanted to try something different. I wanted to build an index that would let me quickly pick a rectangle large enough to fit my new incoming rectangles: an index of the free space, not the occupied one.

I chose a quadtree as my index. Each non-leaf node in the tree stores how many free pixels are available underneath it. At the most basic level this can immediately answer the question: "Is there enough space to fit M pixels?". If a quad has fewer than M available pixels, then there is no need to look inside.

Take a look at this quadtree for the JavaScript logo:

[Image: javascript quadtree]

Empty white rectangles are quads with available space. If our candidate rectangle is smaller than one of these empty quads, we can immediately place it inside that quad.
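Since the final quadtree code is not published, the following is only a sketch of the idea under simplifying assumptions: the canvas is a square whose side is a power of two, the mask is a flat array of occupied flags, and free pixels are recounted at every level (simple, not fast). The names `buildQuadtree` and `findEmptyQuad` are hypothetical.

```javascript
// Builds a quadtree over a square occupancy mask. Every node stores the number
// of free pixels underneath it. Fully free or fully occupied quads are leaves.
function buildQuadtree(occupied, width, x = 0, y = 0, size = width) {
  let free = 0;
  for (let row = y; row < y + size; row++) {
    for (let col = x; col < x + size; col++) {
      if (!occupied[row * width + col]) free++;
    }
  }
  const node = { x, y, size, free, children: null };
  if (free !== 0 && free !== size * size) {
    const half = size / 2;
    node.children = [
      buildQuadtree(occupied, width, x, y, half),
      buildQuadtree(occupied, width, x + half, y, half),
      buildQuadtree(occupied, width, x, y + half, half),
      buildQuadtree(occupied, width, x + half, y + half, half)
    ];
  }
  return node;
}

// Returns a completely empty quad that can hold a w x h rectangle, or null.
function findEmptyQuad(node, w, h) {
  // Not enough free pixels underneath: no need to look inside.
  if (node.free < w * h) return null;
  // The quad itself is too small, and its children are smaller still.
  if (node.size < w || node.size < h) return null;
  if (!node.children) {
    return node.free === node.size * node.size ? node : null;
  }
  for (const child of node.children) {
    const found = findEmptyQuad(child, w, h);
    if (found) return found;
  }
  return null;
}
```

The free pixel count prunes whole subtrees early. Note that this version only ever returns a single empty quad, so it cannot find a spot that spans several neighboring quads.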

A simple approach with a quadtree index gives decent results. However, it is also susceptible to visual artifacts. You can see the quadrant borders - no text can be placed across the border between two quads:

[Image: quad tree artifacts]

The largest quad approach can also miss opportunities. What if there is no single quad large enough to fit a new rectangle, but a fit can be found once it is united with neighboring quads?

Indeed, uniting quads helps to find spots for new words and also removes the visual artifacts. Many quads are united, and text can now appear on the border between two quads:

[Image: quad tree no artifacts]

My final code for quadtree word cloud generation is not released. I don't think it is ready to be reused anywhere else.

How was the website created?

Rendering text

Overall I was happy with the achieved speed of word cloud generation. Yet, it was still too slow for the common-words website.

I'm using SVG to render each word on the screen. Rendering so many text elements can, by itself, halt the UI thread for a couple of seconds. There is just not enough CPU time to squeeze in text layout computation as well. The good news - we don't have to.

Instead of computing the layout of words over and over again every time you open a page, I decided to compute the layout once and store the results in a JSON file. This let me focus on UI thread optimization.

To prevent blocking the UI for long periods of time, we need to add words asynchronously. Within one event loop cycle we add N words, and then let the browser handle user commands and updates. On the next cycle we add more, and so on. For this purpose I made anvaka/rafor, which is an asynchronous for loop iterator that adapts and distributes CPU load across multiple event loop cycles.
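The batching idea can be sketched without any library. This is not rafor's API: `forEachInBatches` and its options are hypothetical, and the batch size here is fixed, whereas rafor is described as adapting to the CPU load.

```javascript
// Visits `items` in batches, yielding to the event loop between batches.
// `schedule` decides when the next batch runs. It defaults to setTimeout;
// in a browser, callback => requestAnimationFrame(callback) is a natural choice.
function forEachInBatches(items, visit, options = {}) {
  const {
    batchSize = 50,
    schedule = callback => setTimeout(callback, 0),
    done = () => {}
  } = options;
  let index = 0;

  function processBatch() {
    const end = Math.min(index + batchSize, items.length);
    while (index < end) {
      visit(items[index], index);
      index++;
    }
    if (index < items.length) {
      schedule(processBatch); // let the browser handle input and repaint first
    } else {
      done();
    }
  }

  processBatch();
}
```

A call such as `forEachInBatches(words, addWordToScene, { batchSize: 100 })`, where `addWordToScene` is a hypothetical function that appends one SVG text element, would add the words in chunks instead of blocking the UI thread in one long loop.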

Pan and zoom

The website supports Google Maps-like navigation on the SVG scene. It is also mobile and keyboard friendly. All these features are implemented by the panzoom library.

Application structure

I'm using vue.js as my rendering framework. Mostly because it's very simple and fast. Single file components and hot reload make it fast to develop in.

The entire application state is stored in a single object, and individual language files are loaded when the user selects the corresponding element from a drop-down.

As my message dispatcher I'm using ngraph.events, a very small message passing library with a focus on speed.

I use anvaka/query-state to store the currently selected language in the query string.
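The underlying idea can be shown with the standard `URLSearchParams` class alone. This does not use or mirror the query-state API; the parameter name `lang` and the fallback value are assumptions for the example.

```javascript
// Reads the selected language from a query string such as "?lang=go".
function readLanguage(queryString, fallback = 'js') {
  const params = new URLSearchParams(queryString);
  return params.get('lang') || fallback;
}

// Returns a new query string with the selected language stored in it,
// leaving any other parameters untouched.
function writeLanguage(queryString, language) {
  const params = new URLSearchParams(queryString);
  params.set('lang', language);
  return '?' + params.toString();
}
```

Because the selection lives in the URL, a link copied from the address bar opens the same language for whoever follows it.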

[Image: query state]

Tools summary

Why word clouds?

Word clouds in general are considered bad for several reasons:

  • They take words out of their context. So good does not necessarily mean something is good (e.g. when the word not was dropped from the visualization)
  • They scale words to fit inside a picture. So the size of a word cannot be trusted
  • They drop some common words (like a, the, not, etc.)

However, I was always fascinated by algorithms that fit words inside a given shape to produce a word cloud.

I spent the last couple of months of my spare time developing my own word cloud algorithm. And this website was born. It was fun :).

Thank you!

Thank you, dear reader, for being curious. I hope you enjoyed this small exploration. Also special thanks to my co-worker, Ryan, who showed me word clouds in the first place. And to my lovely wife who inspires me and encourages me in all my pursuits.

PS

I also tried to bring word clouds into "real life" and created several printed products (T-Shirts, hoodies and mugs). However, I didn't like the T-Shirts very much, so I'm not going to show them here.

The javascript mug is, I think, my best real world word cloud:

[Image: js mug]

Feel free to buy it if you love javascript. I hope you enjoy it!
