• This repository has been archived on 14/Dec/2023
  • Stars: 280
  • Rank 147,492 (Top 3 %)
  • Language: Python
  • License: GNU Affero Genera...
  • Created over 11 years ago
  • Updated 12 months ago

Repository Details

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.

This is the source code for the Media Cloud core system. Media Cloud, a joint project of the Berkman Center for Internet & Society at Harvard University and the Center for Civic Media at MIT, is an open source, open data platform that allows researchers to answer complex quantitative and qualitative questions about the content of online media.

For more information on Media Cloud, go to mediacloud.org.

Note: Most users prefer to use Media Cloud's API and public tools to query our data instead of running their own Media Cloud instance.

The code in this repository will be of interest to those users who wish to run their own Media Cloud instance and users of the public tools who want to understand how Media Cloud is implemented.

The Media Cloud code here does three things:

  • Runs a web app that allows you to manage a set of media sources and their feeds.

  • Periodically crawls the feeds set up within the web app and downloads any new stories found within the downloaded feeds.

  • Extracts the substantive text from the downloaded story content (minus the ads, navigation, comments, etc.) and associates a set of tags with each story based on that extracted text.
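
The crawl-and-extract steps above can be illustrated with a small standard-library sketch. This is not Media Cloud's actual implementation: the function names (`parse_feed`, `find_new_stories`, `extract_text`), the sample feed, and the list of skipped tags are all hypothetical, and a real extractor would use far more careful heuristics than dropping a few HTML elements.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample feed, used only for illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Source</title>
    <item><title>Story A</title><link>https://example.com/a</link></item>
    <item><title>Story B</title><link>https://example.com/b</link></item>
  </channel>
</rss>"""


def parse_feed(feed_xml):
    """Return (title, url) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    stories = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        url = (item.findtext("link") or "").strip()
        if url:  # an item without a link cannot be downloaded
            stories.append((title, url))
    return stories


def find_new_stories(feed_xml, seen_urls):
    """Return stories whose URL is not yet in seen_urls, and mark them seen."""
    new_stories = []
    for title, url in parse_feed(feed_xml):
        if url not in seen_urls:
            seen_urls.add(url)
            new_stories.append((title, url))
    return new_stories


class TextExtractor(HTMLParser):
    """Collect text while skipping elements that rarely hold story content."""

    # Assumed skip list; chosen for the example, not taken from Media Cloud.
    SKIP_TAGS = {"script", "style", "nav", "aside", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML page, minus skipped elements."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    seen = set()
    print(find_new_stories(SAMPLE_FEED, seen))  # first crawl: both stories
    print(find_new_stories(SAMPLE_FEED, seen))  # second crawl: nothing new
```

Keeping the set of seen URLs outside the parsing function means the same feed can be re-crawled on a schedule, and only unseen stories are passed on to the download and extraction steps.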

For very brief installation instructions, see INSTALL.markdown.

Please send us a note at [email protected] if you are using any of this code or if you have any questions. We are very interested in knowing who's using the code and for what.

History of the Project

Print newspapers are declaring bankruptcy nationwide. High-profile blogs are proliferating. Media companies are exploring new production techniques and business models in a landscape that is increasingly dominated by the Internet. In the midst of this upheaval, it is difficult to know what is actually happening to the shape of our news. Beyond one-off anecdotes or painstaking manual content analysis, there are few ways to examine the emerging news ecosystem.

The idea for Media Cloud emerged through a series of discussions between faculty and friends of the Berkman Center. The conversations would follow a predictable pattern: one person would ask a provocative question about what was happening in the media landscape, someone else would suggest interesting follow-on inquiries, and everyone would realize that a good answer would require heavy number crunching. Nobody had the time to develop a huge infrastructure and download all the news just to answer a single question. However, there were eventually enough of these questions that we decided to build a tool for everyone to use.

Some of the early driving questions included:

  • Do bloggers introduce storylines into mainstream media or the other way around?
  • What parts of the world are being covered or ignored by different media sources?
  • Where do stories begin?
  • How are competing terms for the same event used in different publications?
  • Can we characterize the overall mix of coverage for a given source?
  • How do patterns differ between local and national news coverage?
  • Can we track news cycles for specific issues?
  • Do online comments shape the news?

Media Cloud offers a way to quantitatively examine all of these challenging questions by collecting and analyzing the news stream of tens of thousands of online sources.

Using Media Cloud, academic researchers, journalism critics, policy advocates, media scholars, and others can examine which media sources cover which stories, what language different media outlets use in conjunction with different stories, and how stories spread from one media outlet to another.

Sponsors

Media Cloud is made possible by the generous support of the Ford Foundation, the Open Society Foundations, and the John D. and Catherine T. MacArthur Foundation.

Collaborators

Past and present collaborators include Morningside Analytics, Betaworks, and Bit.ly.

License

Media Cloud is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Media Cloud is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with Media Cloud. If not, see <http://www.gnu.org/licenses/>.

More Repositories

  1. sentence-splitter (Python, 223 stars): Text to sentence splitter using heuristic algorithm by Philipp Koehn and Josh Schroeder.
  2. cliff-annotator (Java, 119 stars): A lightweight server to allow HTTP requests to the Stanford Named Entity Recognizer and a heavily modified CLAVIN geoparser.
  3. api-client (Python, 68 stars): Public client for consuming content from the Media Cloud Online News Archive & Directory.
  4. web-tools (JavaScript, 63 stars): The shared repository for Media Cloud web apps (Explorer, Source Manager, Topic Mapper)
  5. date_guesser (Python, 42 stars): A library to extract a publication date from a web page, along with a measure of the accuracy.
  6. nyt-news-labeler (Python, 39 stars): Tag news stories based on models trained on the NYT corpus.
  7. api-tutorial-notebooks (Jupyter Notebook, 33 stars): A set of jupyter notebooks demonstrating how to use the Media Cloud API.
  8. feed_seeker (Python, 31 stars): Find rss, atom, xml, and rdf feeds on webpages
  9. metadata-lib (Python, 12 stars): How Media Cloud approaches extracting metadata from online news stories
  10. web-search (JavaScript, 9 stars): Code that drives the public web-based tools for the Media Cloud Online News Archive and Directory.
  11. copy-kvs (Perl, 8 stars): Copy a lot of objects between various key-value stores (MongoDB GridFS, PostgreSQL BLOBs, Amazon S3)
  12. rss-fetcher (Python, 5 stars): Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
  13. cliff-api-client (Python, 5 stars): A Python client for the CLIFF geoparsing tool
  14. email-templates (HTML, 4 stars): Templates for emails that Media Cloud sends.
  15. wayback-news-client (Python, 4 stars): A client library to access the Wayback Machine news archive search.
  16. word-embeddings-server (Python, 2 stars): Helpful micro-service to return results from word2vec models
  17. glimpse (Python, 2 stars): Get a glimpse of attention to a topic on social media.
  18. docker-compose-just-quieter (Python, 2 stars): Docker Compose CLI utility wrapper which makes `docker-compose` quieter.
  19. postgresql-citus-aws-graviton2 (2 stars): PostgreSQL built for AWS Graviton2
  20. sitemap-tools (Python, 2 stars): simple toolkit of tools for consuming sitemaps
  21. fernandos-csv-randomizer (Python, 1 star): Fernando's CSV randomizer -- reads a CSV file, picks a specified number of random rows and writes them to a separate file
  22. cliff-homepage (HTML, 1 star): A simple homepage for the CLIFF project
  23. hausastemmer (Python, 1 star): Hausa language stemmer (Bimba et al., 2015)
  24. clavin-build-geonames-index (1 star): Builds and releases CLAVIN GeoNames.org index as a binary
  25. sous-chef (Python, 1 star): Configurable Data Analytics Pipeline
  26. news-search-api (Python, 1 star): Internal API server that offers search access to the Media Cloud Online News Archive (in Elasticsearch).
  27. story-indexer (Python, 1 star): The core pipeline used to ingest online news stories in the Media Cloud archive.