
Repository Details

Data related to the investigation of realtime censorship

Overview

This repository contains keyword blocklists and lists of other content, such as URLs and images, used to trigger censorship in apps used in China. The WeChat, QQMail, Apple, and Bing lists were discovered through sample testing and thus do not completely cover the censored content on these platforms. The remaining lists in this repository were reverse engineered from each application's software and are exhaustive lists of the keywords used to trigger censorship on those platforms.

The full details on data collection and analysis methods and results are available below.

Chat apps

The research below tracks daily changes to censorship in three different chat apps used in China: TOM-Skype, Sina UC, and LINE. Overall, our chat app data consists of over 4,000 blocked keywords.

Data: TOM-Skype and Sina UC, LINE
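
Tracking changes between blocklist downloads amounts to comparing consecutive snapshots. The sketch below is a minimal illustration under assumptions, not the method used in the reports: the function name `diff_blocklists` and the placeholder keywords are hypothetical, and each snapshot is assumed to be a plain list of keyword strings.

```python
def diff_blocklists(previous, current):
    """Return (added, removed) keywords between two blocklist snapshots.

    Both arguments are assumed to be iterables of keyword strings;
    duplicates within a snapshot are ignored.
    """
    prev_set, curr_set = set(previous), set(current)
    added = sorted(curr_set - prev_set)    # present now, absent before
    removed = sorted(prev_set - curr_set)  # present before, absent now
    return added, removed


# Hypothetical placeholder snapshots, not real data from the repository.
day1 = ["keyword_a", "keyword_b", "keyword_c"]
day2 = ["keyword_b", "keyword_c", "keyword_d"]

added, removed = diff_blocklists(day1, day2)
print(added)    # ['keyword_d']
print(removed)  # ['keyword_a']
```

Comparing sets rather than file lines keeps the result independent of the order in which an app happens to store its keywords.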

Live-streaming apps

The research below tracks hourly changes to censorship in three different live-streaming apps in China: YY, Sina Show, and 9158. It also documents the keywords censored by GuaGua, which does not include a mechanism for downloading updates to its censorship blocklists. Overall, our live-streaming data consists of over 20,000 blocked keywords.

Data: Original live-streaming data (2015), Updated live-streaming data (2017), Coronavirus keywords (2020)

Mobile games

Our research on mobile games analyzes domestic Chinese games as well as international games that have been altered to comply with Chinese regulations. Overall, we found hundreds of mobile games performing censorship, collectively blocking over 100,000 unique keywords.

Data: Mobile games

Open source projects

This research analyzes Chinese censorship in open source projects. We extracted over 1,000 Chinese keyword blocklists from open source projects on GitHub, collectively spanning over 200,000 unique blocked keywords.

Data: Open source blocklists

WeChat

Our research on WeChat censorship uses sample testing to determine what type of content, such as words, URLs, and images, can be communicated over the platform and which content is censored. We have studied what categorical content WeChat generally filters in addition to what content WeChat filters in response to specific events.

Data: Keywords and URLs (November 2016), 709 Crackdown keywords and images (April 2017), Liu Xiaobo keywords and images (July 2017), 19th Party Congress keywords (November 2017), Image filtering test data (May 2018), Coronavirus keywords (March 2020)

Apple engravings

Our research measuring Apple's filtering of product engravings uses sample testing to discover keywords that cannot be engraved in each of six different regions: United States, Canada, Japan, Taiwan, Hong Kong, and mainland China. We found that part of Apple’s mainland China political censorship bleeds into both Hong Kong and Taiwan. Much of this censorship exceeds Apple’s legal obligations in Hong Kong, and we are aware of no legal justification for the political censorship of content in Taiwan.

Six months after our initial report, in a follow-up study, we found that Apple eliminated their Chinese political censorship in Taiwan. However, Apple continued to perform broad, keyword-based political censorship outside of mainland China in Hong Kong, despite human rights groups’ recommendations for American companies to resist blocking content. As other tech companies do not perform similar levels of political censorship in Hong Kong, we assess possible motivations Apple may have for performing it, including appeasement of the Chinese government.

Data: Keyword filtering rules

QQMail

On Tencent's QQMail, we discover that the presence of certain combinations of keywords in an email message triggers censorship of that message. However, the presence of other combinations, which we call extenuating combinations, deactivates the censorship of some censored keywords.

Data: Censored and extenuating keyword combinations
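
The interaction between censored and extenuating combinations can be sketched as follows. This is an illustrative model under stated assumptions, not QQMail's actual logic: it assumes plain substring matching, assumes each censored combination carries its own list of extenuating combinations, and uses the hypothetical name `is_censored` with placeholder keywords.

```python
def is_censored(message, rules):
    """Decide whether a message would be censored under a simplified model.

    `rules` is a list of (censored_combo, extenuating_combos) pairs.
    A rule fires when every keyword of its censored combination occurs
    in the message, unless every keyword of at least one of its
    extenuating combinations also occurs. Both the rule layout and the
    substring matching are assumptions made for illustration.
    """
    def present(combo):
        return all(keyword in message for keyword in combo)

    for censored_combo, extenuating_combos in rules:
        if present(censored_combo) and not any(
            present(combo) for combo in extenuating_combos
        ):
            return True
    return False


# Hypothetical placeholder rules, not taken from the dataset.
rules = [
    (("alpha", "beta"), [("gamma",)]),  # deactivated when "gamma" is present
    (("delta",), []),                   # no extenuating combination
]

print(is_censored("alpha and beta", rules))        # True
print(is_censored("alpha only", rules))            # False
print(is_censored("alpha, beta and gamma", rules)) # False
print(is_censored("delta with gamma", rules))      # True
```

In this model an extenuating combination only switches off the rules it is attached to, which mirrors the statement that it deactivates the censorship of some, not all, censored keywords.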

MY2022 Beijing Olympics App

MY2022, an app required to be installed by attendees of the 2022 Olympic Games, includes features that allow users to report “politically sensitive” content. We found that the app also includes a censorship keyword list, which, while presently inactive, targets a variety of political topics including domestic issues such as Xinjiang and Tibet as well as references to Chinese government agencies. It is unclear whether the list is inactive purposefully or in a bid to hide the extent of China’s censorship regime from outsiders.

Data: Inactive blocklist

Microsoft Bing

Testing Microsoft Bing's censorship of autosuggestions, we find Chinese political censorship of suggestions for individuals' names, such as Xi Jinping, not only in China but also in North America. The findings in this report again demonstrate that an Internet platform cannot facilitate free speech for one demographic of its users while applying extensive political censorship against another demographic of its users.

Data: Censored names

Keyword Content Analysis

Datasets include raw keyword lists collected from the applications. Many also include processed data, such as translations and categorizations of keywords. Keywords were translated to English using a combination of machine and human translation. Interpreting these translations with contextual information, we coded each keyword into content categories grouped under six general themes according to a code book.
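
Once keywords are coded, summarizing a dataset by theme is a simple tally. The sketch below assumes a code book represented as a mapping from category to theme; the category names, theme names, keywords, and the function name `count_by_theme` are all hypothetical placeholders rather than the code book used in the research.

```python
from collections import Counter

# Hypothetical code book: content category -> general theme.
CODEBOOK = {
    "category_1": "theme_A",
    "category_2": "theme_A",
    "category_3": "theme_B",
}

# Hypothetical coded keywords: (keyword, assigned category).
coded_keywords = [
    ("keyword_a", "category_1"),
    ("keyword_b", "category_2"),
    ("keyword_c", "category_3"),
]


def count_by_theme(coded, codebook):
    """Count coded keywords per theme by looking up each category's theme."""
    return Counter(codebook[category] for _, category in coded)


theme_counts = count_by_theme(coded_keywords, CODEBOOK)
print(theme_counts["theme_A"])  # 2
print(theme_counts["theme_B"])  # 1
```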

License

All data is provided under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license, which is available in full here and summarized here.
