Quick introduction to ggplot2 (no knowledge of R assumed)

Introduction

This is a bare-bones introduction to ggplot2, a visualization package in R. It assumes no knowledge of R.

There is also a literate programming version of this tutorial in ggplot2-tutorial.R.

Preview

Let's start with a preview of what ggplot2 can do.

Given Fisher's iris data set and one simple command...

qplot(Sepal.Length, Petal.Length, data = iris, color = Species)

...we can produce this plot of sepal length vs. petal length, colored by species.

Sepal vs. Petal, Colored by Species

Installation

You can download R from CRAN (https://cran.r-project.org). After installation, you can launch R in interactive mode by either typing R on the command line or opening the standard GUI (which should have been included in the download).

R Basics

Vectors

Vectors are a core data structure in R, and are created with c(). Elements in a vector must be of the same type.

numbers = c(23, 13, 5, 7, 31)
names = c("edwin", "alice", "bob")

Elements are indexed starting at 1, and are accessed with [] notation.

numbers[1] # 23
names[1] # edwin
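
Two consequences of these rules are worth seeing once. The sketch below uses only base R; the variable names mixed, first_two, and without_first are made up for illustration. It shows that mixing types inside c() quietly converts every element to one common type, and that [] also accepts a vector of positions or a negative index.

```r
# Mixing numbers and strings coerces every element to character.
mixed = c(1, "two", 3)
class(mixed)                   # "character"

numbers = c(23, 13, 5, 7, 31)

# Index with a vector of positions, or drop an element with a negative index.
first_two = numbers[c(1, 2)]   # 23 13
without_first = numbers[-1]    # 13 5 7 31
length(numbers)                # 5
```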

Data frames

Data frames are like matrices, but with named columns of different types (similar to database tables).

books = data.frame(
    title = c("harry potter", "war and peace", "lord of the rings"), # column named "title"
    author = c("rowling", "tolstoy", "tolkien"),
    num_pages = c(350, 875, 500) # numeric column (no quotes), so arithmetic works
)

You can access columns of a data frame with $.

books$title # c("harry potter", "war and peace", "lord of the rings")
books$author[1] # "rowling"

You can also create new columns with $.

books$num_bought_today = c(10, 5, 8)
books$num_bought_yesterday = c(18, 13, 20)

books$total_num_bought = books$num_bought_today + books$num_bought_yesterday
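
Once a data frame exists, it helps to check what it actually contains. Here is a minimal sketch using base R only; it rebuilds a smaller books data frame so it can run on its own, and long_books is a name chosen just for this example.

```r
books = data.frame(
    title = c("harry potter", "war and peace", "lord of the rings"),
    num_pages = c(350, 875, 500)  # numeric, so comparisons and arithmetic work
)

str(books)    # shows each column's name, type, and first few values
nrow(books)   # 3

# Keep only the rows where num_pages is greater than 400.
long_books = books[books$num_pages > 400, ]
long_books$title
```

The expression inside [] before the comma selects rows; leaving the part after the comma empty keeps all columns.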

read.table

Suppose you want to import a TSV file into R as a data frame.

tsv file without header

For example, consider the data/students.tsv file (with columns describing each student's age, test score, and name).

13   100 alice
14   95  bob
13   82  eve

We can import this file into R using read.table().

students = read.table("data/students.tsv", 
    header = F, # file does not contain a header (`F` is short for `FALSE`),
                # so we must manually specify column names                    
    sep = "\t", # file is tab-delimited        
    col.names = c("age", "score", "name") # column names
)

We can now access the different columns in the data frame with students$age, students$score, and students$name.

csv file with header

For an example of a file in a different format, look at the data/studentsWithHeader.tsv file.

age,score,name
13,100,alice
14,95,bob
13,82,eve

Here we have the same data, but now the file is comma-delimited and contains a header. We can import this file with

students = read.table("data/studentsWithHeader.tsv",
    sep = ",",
    header = T  # first line contains column names, so we can
)               # immediately call `students$age`

(Note: there is also a read.csv function.)
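
As a rough sketch of that alternative: read.csv() is read.table() with sep = "," and header = TRUE as defaults. To keep the example runnable without any file on disk, it passes the data inline through the text argument; csv_text and students_csv are names invented for this illustration.

```r
# Inline copy of the comma-delimited data, so no file on disk is needed.
csv_text = "age,score,name
13,100,alice
14,95,bob
13,82,eve"

# read.csv() defaults to sep = "," and header = TRUE.
students_csv = read.csv(text = csv_text)

students_csv$age          # 13 14 13
mean(students_csv$score)  # average test score
```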

help

There are many more options that read.table can take. For a list of these, just type help(read.table) (or ?read.table) at the prompt to access documentation.

# These work for other functions as well.
help(read.table)
?read.table

ggplot2

With these R basics in place, let's dive into the ggplot2 package.

Installation

One of R's greatest strengths is its excellent set of packages. To install a package, you can use the install.packages() function.

install.packages("ggplot2")

To load a package into your current R session, use library().

library(ggplot2)
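
If you rerun a script often, you may not want to reinstall the package every time. One common pattern, sketched here with base R's requireNamespace(), is to install only when the package is missing. It assumes a CRAN mirror is configured for the install step.

```r
# Install ggplot2 only if it is not already available, then load it.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
    install.packages("ggplot2")
}
library(ggplot2)
```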

Scatterplots with qplot()

Let's look at how to create a scatterplot in ggplot2. We'll use the iris data frame that's automatically loaded into R.

What does the data frame contain? We can use the head function to look at the first few rows.

head(iris) # by default, head displays the first 6 rows. see `?head`
head(iris, n = 10) # we can also explicitly set the number of rows to display

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
           5.1         3.5          1.4         0.2  setosa
           4.9         3.0          1.4         0.2  setosa
           4.7         3.2          1.3         0.2  setosa
           4.6         3.1          1.5         0.2  setosa
           5.0         3.6          1.4         0.2  setosa
           5.4         3.9          1.7         0.4  setosa

(The data frame actually contains three species: setosa, versicolor, and virginica.)
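
Since head() only shows setosa rows, a quick way to confirm the other species are there is to count rows per species. This uses base R only.

```r
levels(iris$Species)  # "setosa" "versicolor" "virginica"
table(iris$Species)   # 50 rows for each species
dim(iris)             # 150 rows, 5 columns
```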

Let's plot Sepal.Length against Petal.Length using ggplot2's qplot() function.

qplot(Sepal.Length, Petal.Length, data = iris)
# Plot Sepal.Length vs. Petal.Length, using data from the `iris` data frame.
# * First argument `Sepal.Length` goes on the x-axis.
# * Second argument `Petal.Length` goes on the y-axis.
# * `data = iris` means to look for this data in the `iris` data frame.    

Sepal Length vs. Petal Length

To see where each species is located in this graph, we can color each point by adding a color = Species argument.

qplot(Sepal.Length, Petal.Length, data = iris, color = Species) # dude!

Sepal vs. Petal, Colored by Species

Similarly, we can let the size of each point denote petal width, by adding a size = Petal.Width argument.

qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width)
# We see that Iris setosa flowers have the narrowest petals.

Sepal vs. Petal, Sized by Petal Width

qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width, alpha = I(0.7))
# By setting the alpha of each point to 0.7, we reduce the effects of overplotting.

Sepal vs. Petal, with Transparency
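
Why I(0.7) rather than a plain 0.7? As I understand qplot(), its arguments are treated as aesthetics to be mapped through a scale unless they are wrapped in I(), which marks a value to be used as-is. The sketch below assumes ggplot2 is installed; p_fixed and p_mapped are illustrative names.

```r
library(ggplot2)

class(I(0.7))  # "AsIs": base R's marker for "use this value unchanged"

# Fixed transparency: every point is drawn at alpha 0.7, with no alpha legend.
p_fixed = qplot(Sepal.Length, Petal.Length, data = iris, alpha = I(0.7))

# Without I(), 0.7 is treated like a data variable and gets its own legend.
p_mapped = qplot(Sepal.Length, Petal.Length, data = iris, alpha = 0.7)
```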

Finally, let's fix the axis labels and add a title to the plot.

qplot(Sepal.Length, Petal.Length, data = iris, color = Species,
    xlab = "Sepal Length", ylab = "Petal Length", 
    main = "Sepal vs. Petal Length in Fisher's Iris data")

Sepal vs. Petal, Titled
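
One caveat: qplot() has been marked as deprecated in recent ggplot2 releases (from 3.4.0 onward, as far as I know), so it may print a warning. It still works, but here is a sketch of what I believe is the equivalent plot written with ggplot() and an explicit point layer, assuming ggplot2 is installed.

```r
library(ggplot2)

# Same plot as the last qplot() call, written with ggplot() and layers.
p = ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
    geom_point() +
    labs(x = "Sepal Length", y = "Petal Length",
         title = "Sepal vs. Petal Length in Fisher's Iris data")
p
```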

Other common geoms

In the scatterplot examples above, we implicitly used a point geom, the default when you supply two arguments to qplot().

# These two invocations are equivalent.
qplot(Sepal.Length, Petal.Length, data = iris, geom = "point")
qplot(Sepal.Length, Petal.Length, data = iris)

But we can also easily use other types of geoms to create more kinds of plots.

Barcharts: geom = "bar"

movies = data.frame(
    director = c("spielberg", "spielberg", "spielberg", "jackson", "jackson"),
    movie = c("jaws", "avatar", "schindler's list", "lotr", "king kong"),
    minutes = c(124, 163, 195, 600, 187)
)

# Plot the number of movies each director has.
qplot(director, data = movies, geom = "bar", ylab = "# movies")
# By default, the height of each bar is simply a count.

# Movies

# But we can also supply a different weight.
# Here the height of each bar is the total running time of the director's movies.
qplot(director, weight = minutes, data = movies, geom = "bar", ylab = "total length (min.)")

Total Running Time
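
To check what the two bar charts are showing, we can compute the same numbers by hand in base R. This sketch rebuilds the relevant columns of movies so it runs on its own; counts and totals are names made up for the example.

```r
movies = data.frame(
    director = c("spielberg", "spielberg", "spielberg", "jackson", "jackson"),
    minutes = c(124, 163, 195, 600, 187)
)

# Unweighted bars: one count per row.
counts = table(movies$director)                        # jackson 2, spielberg 3

# Weighted bars: the sum of `minutes` for each director.
totals = tapply(movies$minutes, movies$director, sum)  # jackson 787, spielberg 482
```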

Line charts: geom = "line"

qplot(Sepal.Length, Petal.Length, data = iris, geom = "line", color = Species) 
# Using a line geom doesn't really make sense here, but hey.

Sepal vs. Petal, Lined

# `Orange` is another built-in data frame that describes the growth of orange trees.
qplot(age, circumference, data = Orange, geom = "line",
    colour = Tree,
    main = "How does orange tree circumference vary with age?")

Orange Tree Growth
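
A line geom fits this data because each tree was measured repeatedly as it aged, so connecting one tree's points traces its growth. A quick look at the structure in base R:

```r
str(Orange)          # columns: Tree, age, circumference
nrow(Orange)         # 35 rows in total
table(Orange$Tree)   # 7 measurements for each of the 5 trees
```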

# We can also plot both points and lines.
qplot(age, circumference, data = Orange, geom = c("point", "line"), colour = Tree)

Orange Tree with Points

And that's all I'll cover.

Next Steps

I skipped over a lot of aspects of R and ggplot2 in this intro.

For example,

  • There are many geoms (and other functionalities) in ggplot2 that I didn't cover, e.g., boxplots and histograms.
  • I didn't talk about ggplot2's layering system, or the grammar of graphics it's based on.
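
As a small taste of the first point, here is a sketch of a histogram and a boxplot using the same qplot() style as above. It assumes ggplot2 is installed; the binwidth of 0.25 is an arbitrary choice for illustration, and p_hist and p_box are made-up names.

```r
library(ggplot2)

# Histogram: distribution of a single variable.
p_hist = qplot(Sepal.Length, data = iris, geom = "histogram", binwidth = 0.25)

# Boxplot: distribution of Sepal.Length within each species.
p_box = qplot(Species, Sepal.Length, data = iris, geom = "boxplot")
```

Typing p_hist or p_box at the prompt draws the corresponding plot.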

So I'll end with some additional resources on R and ggplot2.

Edwin Chen :: @echen :: http://blog.echen.me
