
Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

Laika

Notes and experiments with satellite image data.

Synopsis

The goal of this repo is to research potential sources of satellite image data and to implement various algorithms for satellite image segmentation.

Table of contents

Running the code

The following steps describe the end-to-end flow for the current work in progress. The implementation makes use of a utility to help build a training dataset, and a SegNet encoder/decoder network for image segmentation.

Creating a training dataset

Install some OS dependencies:

brew install mapnik
brew install parallel

Clone and install the skynet-data project:

git clone https://github.com/developmentseed/skynet-data
cd skynet-data

The skynet-data project is a tool for sampling OSM QA tiles and associated satellite image tiles from MapBox.

The first task is to decide what classes to include in the dataset. These are specified in a JSON configuration file and follow the osm tag format. This project attempts to identify 6 types of land use and objects:

cd into classes and create a new configuration mine.json:

[{
  "name": "residential",
  "color": "#010101",
  "stroke-width": "1",
  "filter": "[landuse] = 'residential'",
  "sourceLayer": "osm"
}, {
  "name": "commercial",
  "color": "#020202",
  "stroke-width": "1",
  "filter": "[landuse] = 'commercial'",
  "sourceLayer": "osm"
}, {
  "name": "industrial",
  "color": "#030303",
  "stroke-width": "1",
  "filter": "[landuse] = 'industrial'",
  "sourceLayer": "osm"
}, {
  "name": "vegetation",
  "color": "#040404",
  "stroke-width": "1",
  "filter": "([natural] = 'wood') or 
             ([landuse] = 'forest') or 
             ([landuse] = 'tree_row') or 
             ([landuse] = 'tree') or 
             ([landuse] = 'scrub') or 
             ([landuse] = 'heath') or 
             ([landuse] = 'grassland') or 
             ([landuse] = 'orchard') or 
             ([landuse] = 'farmland') or 
             ([landuse] = 'allotments') or 
             ([surface] = 'grass') or 
             ([landuse] = 'meadow') or 
             ([landuse] = 'vineyard')",
  "sourceLayer": "osm"
},
{
  "name": "building",
  "color": "#050505",
  "stroke-width": "1",
  "filter": "[building].match('.+')",
  "sourceLayer": "osm"
},
{
  "name": "brownfield",
  "color": "#060606",
  "stroke-width": "1",
  "filter": "[landuse] = 'brownfield'",
  "sourceLayer": "osm"
}]

The skynet-data tool will use this configuration to create ground-truth labels for the specified classes. For each satellite image instance, its pixel-by-pixel ground truth will be encoded as an image with the same size as the satellite image. An individual class is encoded by colour, such that a specific pixel belonging to an individual class will assume one of 7 colour values corresponding to the above configuration.

For example, a pixel belonging to the vegetation class will assume the RGB colour #040404 and a building will assume the value #050505. Note that these can be any RGB colour. For convenience, I have chosen to encode the class number in each of the 3 RGB bytes so that it can be easily retrieved later on without the need for a lookup table.

Note that it is possible for a pixel to assume an unknown class, in which case it can be considered as "background". Thus, unknown pixels are encoded as #000000 (the 7th class).
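
With this encoding, a label tile can be decoded back into a per-pixel class matrix without a lookup table. A minimal sketch (the file path is hypothetical; the label tiles are produced later by the make targets below):

import numpy as np
from PIL import Image

# Load a ground-truth label tile (hypothetical path; see data/labels/color below).
label = np.array(Image.open('data/labels/color/example.png').convert('RGB'))

# Each class is encoded as #0c0c0c, so any one band holds the class number directly.
classes = label[:, :, 0]          # 0 = background/unknown, 1..6 = configured classes
print(np.unique(classes))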

Next, in the skynet-data parent directory, add the following to the Makefile:

QA_TILES?=united_kingdom
BBOX?='-3.3843,51.2437,-2.3923,51.848'
IMAGE_TILES?="tilejson+https://a.tiles.mapbox.com/v4/mapbox.satellite.json?access_token=$(MapboxAccessToken)"
TRAIN_SIZE?=10000
CLASSES?=classes/mine.json
ZOOM_LEVEL?=17

This will instruct the subsequent steps to download 10,000 images from within a bounding box (defined here as part of the South West). The images will be randomly sampled within the bounding box area. Zoom level 17 corresponds to approx. 1m per pixel resolution. To specify the bounding box area, this tool is quite handy. Note that coordinates are specified in the following form:

-lon, -lat, +lon, +lat (i.e. min lon, min lat, max lon, max lat: west, south, east, north)

Before following the next steps, go to MapBox and sign up for a developer key.

Having obtained your developer key from MapBox, store it in an env. variable:

export MapboxAccessToken="my_secret_token"

Then initiate the download process:

make clean
rm -f data/all_tiles.txt
make download-osm-tiles
make data/all_tiles.txt
make data/sample.txt
make data/labels/color
make data/images

You will end up with 10,000 images in data/images and 10,000 "ground truth" images in data/labels. data/sample-filtered.txt contains a list of the files in which at least one pixel belongs to a specified class.
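
As a quick sanity check, the image and label tiles can be paired up by filename. A rough sketch (the exact tile filenames and extensions depend on skynet-data; directory names as described above):

import glob
import os

images = sorted(glob.glob('data/images/*'))            # satellite tiles
pairs = []
for img in images:
    name = os.path.splitext(os.path.basename(img))[0]
    matches = glob.glob(os.path.join('data/labels/color', name + '.*'))
    if matches:                                        # keep tiles that have a label
        pairs.append((img, matches[0]))

print(len(pairs), 'image/label pairs')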

Image sampling

Note that there is a convenience tool in the skynet-data utility for quickly viewing the downloaded data. To use it, first install a local webserver (e.g., Nginx) and add an alias to the preview.html file. You can then visualise the sampled tiles via a URL of the following form:

http://localhost:8080/preview.html?accessToken=MAPBOX_KEY&prefix=data
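
A full web server is not strictly required; any static file server pointed at the skynet-data directory should also work, for example (this is an assumption, not part of the skynet-data instructions):

cd skynet-data
python3 -m http.server 8080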

See notebooks for a visual inspection of some of the data. The following shows some of the downloaded tiles with overlaid OSM labels:

OSM labels

The level of detail can be quite fine in places, while in others, quite sparse. This example shows a mix of industrial (yellow) and commercial (blue) land areas mixed in with buildings (red) and vegetation (green).

The model

The model implemented here is the SegNet encoder/decoder architecture. There are 2 variations of this architecture, of which the simplified version has been implemented here. See paper for details. Briefly, the architecture is suited to multi-class pixel-by-pixel segmentation and has been shown to be effective in scene-understanding tasks. Given this, it may also be suited to segmentation of satellite imagery.

Side note: The architecture has been shown to be very effective at segmenting images from car dashboard cameras, and is of immediate interest in our street-view related research.

The model, specified in model.py, consists of 2 main components. The first is an encoder which takes as input a 256x256 RGB image and compresses the image into a set of features. In fact, this component is the same as a VGG16 network without the final fully connected layer. In place of the final fully connected layer, the encoder is connected to a decoder. This decoder is a reverse image of the encoder, and acts to up-sample the features.

The final output of the model is an N*p matrix, where p = 256*256 corresponds to the original number of image pixels and N is the number of segment classes. As such, each pixel has an associated class probability vector. The predicted segment/class can be extracted by taking the max of these values.
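
As a rough illustration of the shape of the network described above, the following is a minimal Keras sketch of a simplified SegNet-style encoder/decoder. This is not the exact architecture in model.py; the layer sizes and the 7-class assumption are illustrative only:

from keras.models import Model
from keras.layers import (Input, Conv2D, BatchNormalization, Activation,
                          MaxPooling2D, UpSampling2D, Reshape)

N_CLASSES = 7                         # 6 land-use classes + background (assumption)

def conv_block(x, filters):
    x = Conv2D(filters, (3, 3), padding='same')(x)
    x = BatchNormalization()(x)
    return Activation('relu')(x)

inputs = Input(shape=(256, 256, 3))

# Encoder: VGG-style convolution/pooling stages compress the image into features.
x = conv_block(inputs, 64)
x = MaxPooling2D((2, 2))(x)
x = conv_block(x, 128)
x = MaxPooling2D((2, 2))(x)
x = conv_block(x, 256)
x = MaxPooling2D((2, 2))(x)

# Decoder: a mirror of the encoder, upsampling the features back to 256x256.
x = UpSampling2D((2, 2))(x)
x = conv_block(x, 256)
x = UpSampling2D((2, 2))(x)
x = conv_block(x, 128)
x = UpSampling2D((2, 2))(x)
x = conv_block(x, 64)

# One softmax probability vector per pixel: a (256*256, N_CLASSES) output per image.
x = Conv2D(N_CLASSES, (1, 1), padding='same')(x)
x = Reshape((256 * 256, N_CLASSES))(x)
outputs = Activation('softmax')(x)

model = Model(inputs, outputs)
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])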

Training

First install numpy, theano, keras and opencv2. Then:

python3 train.py

train.py will use the training data created with skynet-data in the previous step. Note that by default, train.py expects to find this data in ../skynet-data/data. Having loaded the raw training data and associated segment labels into a numpy array, the data are stored in HDF5 format in data/training.hdf5. On subsequent runs, the data loader will first look for this HDF5 data, to reduce the startup time. Note that data/training.hdf5 can be used by other models/frameworks/languages.
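
The caching behaviour described above follows a common pattern; a rough sketch (the dataset names and the tile-parsing helper are assumptions — the actual layout is defined in train.py):

import os
import h5py
import numpy as np

CACHE = 'data/training.hdf5'

def build_arrays_from_tiles(data_dir):
    # Hypothetical stand-in: in train.py this parses the raw image and label tiles.
    return (np.zeros((1, 256, 256, 3), dtype='float32'),
            np.zeros((1, 256 * 256, 7), dtype='float32'))

def load_training_data():
    if os.path.exists(CACHE):
        # Fast path: reuse the previously prepared arrays.
        with h5py.File(CACHE, 'r') as f:
            return f['images'][:], f['labels'][:]
    # Slow path: build the arrays from the raw tiles, then cache them.
    images, labels = build_arrays_from_tiles('../skynet-data/data')
    with h5py.File(CACHE, 'w') as f:
        f.create_dataset('images', data=images)
        f.create_dataset('labels', data=labels)
    return images, labels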

In its current form, all parameters are hard-coded. These are the default parameters:

Parameter      Default   Note
validation     0.2       fraction of the dataset used as the validation subset
epochs         10        number of training epochs
learning_rate  0.001     learning rate
momentum       0.9       momentum
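
For reference, the defaults above would map onto a Keras training call roughly as follows (a toy stand-in model and random data, purely to show where each parameter goes; not the actual train.py code):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

model = Sequential([Dense(7, activation='softmax', input_shape=(10,))])   # stand-in model
X = np.random.rand(100, 10)                                               # stand-in inputs
Y = np.eye(7)[np.random.randint(0, 7, 100)]                               # stand-in one-hot labels

model.compile(optimizer=SGD(lr=0.001, momentum=0.9),   # learning_rate, momentum
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X, Y, validation_split=0.2, epochs=10)        # validation, epochs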

As-is, the model converges at a slow rate:

[training loss/accuracy plot]

Training and validation errors (loss and accuracy) are stored in training_log.csv. On completion, the network weights are dumped to weights.hdf5. Note that these weights may be loaded into the same model implemented in another language/framework.

Validating

Having trained the model, validate it using the testing data held back in the data-preparation stage:

python3 validate.py

validate.py expects to find the trained model weights in weights.hdf5 in the current working directory. In addition to printing the validation results, the pixel-by-pixel class probabilities for each instance are stored in predictions.hdf5, which can be inspected to debug the model.
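
For example, the stored probabilities can be inspected with h5py (the dataset names inside predictions.hdf5 are defined by validate.py, so list them first rather than assuming them):

import h5py

with h5py.File('predictions.hdf5', 'r') as f:
    print(list(f.keys()))                 # discover the dataset names
    name = list(f.keys())[0]
    probs = f[name][:]

print(name, probs.shape)                  # e.g. (n_instances, 256*256, n_classes)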

Running

feed_forward.py takes as input trained weights, an input image and an output directory to produce pixel-by-pixel class predictions.

python3 feed_forward.py <hdf5 weights> <img location> <output dir>

Specifically, given an input satellite image, the script outputs the number of pixels belonging to each of the land-use classes (including background), such that the sum of class pixels equals the total number of pixels in the image. In addition, the script will output class heatmap and class segment visualisations into the <output dir>.

[class heatmaps and segmented output]

The class heatmaps (one image per class) show the model's confidence that a pixel belongs to a particular class (buildings and vegetation shown above). Given this confidence, the maximum value across classes is used to determine the final pixel segment, shown on the right in the above image. Some more visualisations can be found in this notebook.
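
The "maximum value" step amounts to an argmax over the class axis. A minimal numpy sketch (random stand-in probabilities; with the real model these would come from predictions.hdf5 or feed_forward.py):

import numpy as np

probs = np.random.rand(256 * 256, 7)                      # stand-in class probabilities
segments = probs.argmax(axis=1).reshape(256, 256)         # most likely class per pixel
counts = np.bincount(segments.ravel(), minlength=7)       # pixels per class
assert counts.sum() == 256 * 256                          # class pixels sum to all pixels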

Further work/notes

  • The model as-is is quite poor, trained to only 70% accuracy over the validation set.
  • The model has only been trained once: fine-tuning and hyperparameter search have not yet been completed.
  • The training data is very noisy: the segments are only partially labelled. As such, missing labels are assigned to "background".

Background research

Satellite themes of interest

In general, satellite image processing can be categorised into two main themes:

Earth Observation (EO)

The field of Earth observation is concerned with monitoring the status of the planet with various sensors, which includes, but is not limited to, the use of satellite data for monitoring large areas of the earth's surface at regular, frequent intervals.

EO is a broad area, which may cover water management, forestry, agriculture, urban fabric and land-use/cover in general.

A good example of EO is the use of the normalised difference vegetation index (NDVI) for monitoring nationwide vegetation levels.
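NDVI is computed per pixel from the red and near-infrared bands as (NIR - Red) / (NIR + Red), giving values in [-1, 1]. A minimal sketch:

import numpy as np

def ndvi(nir, red):
    nir = nir.astype('float64')
    red = red.astype('float64')
    return (nir - red) / (nir + red + 1e-9)   # small epsilon avoids division by zero

# For Sentinel-2 (see band table below) this would use band 08 (NIR) and band 04 (Red).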

This sub-field does not depend on very high resolution data since it is mainly concerned with quantifying some aspect of very large areas of land.

Object detection

In contrast to EO, the field of object detection from satellite data is principally concerned with the localisation of specific objects as opposed to general land-use cover.

A good example of object detection from satellite data is counting cars in car parks, from which various indicators can be derived, depending on the context. For example, it would be possible to derive some form of consumer/retail indicator by periodically counting cars in supermarket car parks.

This sub-field may require very high resolution data, depending on the application.

Computer vision themes of interest

In addition to the two main satellite image processing themes of interest (EO and object detection), there are four more general image processing sub-fields which may be applicable to a problem domain. From "easiest" to most difficult:

  1. Classification. At the lowest level, the task is to identify which objects are present in a scene. If there are many objects, the output may be an ordered list sorted by amount or likelihood of the object being present in the scene. The classification may also extend beyond objects to abstract concepts such as aesthetic value.

  2. Detection. The next level involves localisation of the entities/concepts in the scene. Typically this will include a bounding-box around identified objects, and/or object centroids.

  3. Segmentation. This level extends classification and detection to include pixel-by-pixel class labelling. Each pixel in the scene must be labelled with a particular class, such that the entire scene can be described. Segmentation is particularly appropriate for land-use cover. In addition, segmentation may be extended to provide a form of augmented bounding box: pixels outside the bounding-box area can be negatively weighted, pixels on the border weighted +1, and pixels inside the region weighted in [0, 1] inversely proportional to the distance from the bounding-box perimeter.

  4. Instance segmentation. Perhaps the most challenging theme: in addition to pixel-by-pixel segmentation, provide a segmented object hierarchy, such that objects/areas belonging to the same class may be individually segmented. E.g., individual cars and their models, or a commercial area, an office within that commercial area, and a roof-top belonging to a shop.

The initial version of this project focuses on option 3: image segmentation in the domain of both Earth Observation and object detection.

Data sources

There are three types of data of interest for this project.

  1. Raw image data. There are numerous sources of satellite image data, ranging from lower-resolution (open) data most suited to EO applications, through to high-resolution (mostly proprietary) datasets.

  2. Pre-labeled image data. For training an image classification, object detection or image segmentation supervised learning model, it is necessary to obtain ample training instances along with associated ground-truth labels. In the domain of general image classification, there exist plenty of datasets which are mainly used to benchmark various algorithms.

  3. Image labels. It will later be required to create training datasets with arbitrary labeled instances. For this reason, a source of ground-truth and/or a set of tools to facilitate image labeling and manual image segmentation will be necessary.

Raw image data

This project (to date) focuses specifically on open data. The 2 main data sources for EO grade images come from the Sentinel-2 and Landsat-8 satellites. Both satellites host a payload of multi-spectrum imaging equipment.

Sentinel 2 (ESA)

The Sentinel-2 satellite is capable of sensing the following wavelengths:

Band                      Wavelength (μm)   Resolution (m)
01 – Coastal aerosol      0.443             60
02 – Blue                 0.490             10
03 – Green                0.560             10
04 – Red                  0.665             10
05 – Vegetation Red Edge  0.705             20
06 – Vegetation Red Edge  0.740             20
07 – Vegetation Red Edge  0.783             20
08 – NIR                  0.842             10
8A – Narrow NIR           0.865             20
09 – Water vapour         0.945             60
10 – SWIR – Cirrus        1.375             60
11 – SWIR                 1.610             20
12 – SWIR                 2.190             20

"Sentinel 2"

The visible spectrum captured by Sentinel-2 is the highest-resolution open data available: 10 metres per pixel. Observations are frequent: every 5 days for the same viewing angle.

Landsat-8 (NASA)

The Landsat-8 satellite is limited to 30m resolution across all wavelengths, with the exception of its panchromatic sensor, which is capable of capturing 15m per pixel data. The revisit frequency for Landsat-8 is 16 days.

Band                            Wavelength (μm)   Resolution (m)
01 - Coastal / Aerosol          0.433 – 0.453     30
02 - Blue                       0.450 – 0.515     30
03 - Green                      0.525 – 0.600     30
04 - Red                        0.630 – 0.680     30
05 - Near Infrared              0.845 – 0.885     30
06 - Short Wavelength Infrared  1.560 – 1.660     30
07 - Short Wavelength Infrared  2.100 – 2.300     30
08 - Panchromatic               0.500 – 0.680     15
09 - Cirrus                     1.360 – 1.390     30

"Landsat 8"

Links

Papers

Other

Pre-labeled image data

There are numerous sources of pre-labeled image data available. Recently, there have been a number of satellite image related competitions hosted on Kaggle and TopCoder. This data may be useful to augment an existing dataset, to pre-train models or to train a model for later use in an ensemble.

Object and land-use labels

There exist a number of land-use and land-cover datasets. The main issue is dataset age: if the objective is to identify construction sites or urban sprawl, then a dataset more than a year old is next to useless, unless it can be matched to imagery from the same time period, in which case it would only be useful for creating a training dataset.

The most promising source (imo) is the OpenStreetMap project, since it is updated constantly and contains an extensive hierarchy of relational objects. There is also the possibility of contributing back to the OSM project should manual labelling be necessary.

Modeling

The proof of concept in this project makes use of concepts from deep learning. A review of the current state of the art, covering papers, articles, competitive data science platform results and open-source projects, indicates that the most recent advances have been in the area of image segmentation, most likely fuelled by research trends in autonomous driving.

From the image segmentation domain, the top performing, most recent developments tend to be some form of encoder/decoder neural network in which the outputs of a standard CNN topology (E.g., VGG16) are upsampled to form class probability matrices with the same dimensions as the original input image.

The following papers and resources provide a good overview of the field.

Modeling Papers

Satellite specific

2017
2016
2015
2014
2013

Modeling specific

2017
2016
2015

Model implementations

There exist a number of implementations of recent satellite image processing techniques. The following GitHub repositories are a good research starting point:

  1. random forests Java :)

  2. Multi-scale context aggregation by dilated convolutions (Tensorflow)

  3. Instance-aware semantic segmentation via multi-task network cascades (Caffe)

  4. Fully Convolutional Network (FCN) + Gradient Boosting Decision Tree (GBDT) (Keras)

  5. FCN (Keras)

Image segmentation model implementations

General

SegNet

Segnet architecture

Keras implementations:

(Note: these are all the basic version of the approach described in the paper.)

U-Net

Keras implementations:

DeepLab

Note: there seems to be no Keras implementation. TensorFlow implementations:

Dilated Convolutions

PSPNet (Pyramid Scene Parsing Network)

Bleeding edge. More details here

Competitive data science

Lots of interesting reading and projects.

DSTL - Kaggle

  1. U-Net

  2. not known

  3. Another U-Net

  4. modified U-Net

  5. pixel-wise logistic regression model

Understanding the Amazon from Space - Kaggle

  1. 11 convolutional neural networks - This is pretty nice, details winners' end-to-end architecture.

Current competitions

Tools/utilities

  • SpaceNetChallenge utilities - Github - Packages intended to assist in the pre-processing of the SpaceNet satellite imagery data corpus into a format consumable by machine learning algorithms.

  • Sentinelsat - Github - Utility to search and download Copernicus Sentinel satellite images.

Visualisations/existing tools

Projects using Sentinel data

Blogs

Other
