anonymize-it

A general utility for anonymizing data

anonymize-it can be run as a script that accepts a config file specifying the type of source, the anonymization mappings, the destination, and an anonymizer pipeline. Individual pipeline components can also be imported into any Python program that needs to anonymize data.

Currently, anonymize-it supports two methods for anonymization:

  1. Faker-based: Relies on providers from Faker to mask fields. This method is suitable for one-off anonymization use cases, where correlation between data obtained from different sources (indices/clusters) is not necessary.

E.g.:

>>> from faker import Faker
>>> f = Faker()
>>> f.file_path()
'/break/Congress.json'
  2. Hash-based: Uses a unique user/customer ID as a salt to anonymize fields. This method is suitable when anonymization needs to be performed regularly and/or when correlation of data from different sources is crucial.

E.g.: A user wants to anonymize network events and process events stored in two separate indices, but still wants to correlate all activity for a particular host after anonymization.
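
The correlation property described above can be illustrated with a minimal sketch. It assumes a salted SHA-256 digest; the project's actual hashing scheme is not shown here, and the function name mask_value, the salt, and the sample documents are all hypothetical.

```python
import hashlib


def mask_value(value: str, salt: str, length: int = 16) -> str:
    """Return a deterministic pseudonym for value under the given salt."""
    digest = hashlib.sha256((salt + ":" + value).encode("utf-8")).hexdigest()
    return digest[:length]


# Hypothetical salt (e.g. a customer ID) and documents from two separate indices.
salt = "customer-1234"
network_event = {"host.name": "web-01", "destination.port": 443}
process_event = {"host.name": "web-01", "process.name": "nginx"}

masked_network = {**network_event, "host.name": mask_value(network_event["host.name"], salt)}
masked_process = {**process_event, "host.name": mask_value(process_event["host.name"], salt)}

# The same host maps to the same token in both documents, so activity can
# still be correlated after masking.
print(masked_network["host.name"] == masked_process["host.name"])
```

Because the token depends only on the salt and the original value, repeated runs with the same salt produce the same pseudonyms, while a different salt produces unrelated ones.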

Disclaimer

anonymize-it is intended to serve as a tool that replaces real data values with sensible artificial ones such that the semantics of the data are retained. It is not intended to satisfy the anonymization requirements of GDPR policies, but rather to aid pseudonymization efforts. The Faker implementation may also produce some collisions in high-cardinality datasets.

Instructions for use

Installation

This must be run in a virtual environment with the correct dependencies installed. These are enumerated in requirements.txt.

Install virtualenv globally:

[sudo] pip install virtualenv

Create a virtualenv and install the dependencies of anonymize-it:

virtualenv -p python3 venv
source venv/bin/activate
pip install -r requirements.txt

Then run:

python anonymize.py configs/config.json

Quick Start

anonymize.py is reproduced below to walk through a simple anonymization pipeline.

First load and parse the config file.

config_file = sys.argv[1]
config = read_config(config_file) # opens json file and stores as python dict
config = utils.parse_config(config) # utility function for parsing configuration and setting variables

Then, create the reader as defined in the configuration. reader_mapping is used as a dispatcher that maps human-readable reader types (e.g. elasticsearch) to reader classes (e.g. ESReader()).

reader = reader_mapping[config.source['type']]
reader = reader(config.source['params'], config.masked_fields, config.suppressed_fields)

Next, create the writer in the same way.

writer = writer_mapping[config.dest['type']]
writer = writer(config.dest['params'])

Finally, create an anonymizer by passing the reader and writer instances and run anonymize().

anon = Anonymizer(reader=reader, writer=writer)
anon.anonymize()
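
To make the dispatcher pattern in these steps concrete, here is a self-contained sketch that does not depend on the project's modules. ListReader, ListWriter, the "list" type name, and the inline masking loop are hypothetical stand-ins, not the project's ESReader, FSWriter, or Anonymizer.

```python
class ListReader:
    """Hypothetical reader that serves documents from an in-memory list."""

    def __init__(self, params, masked_fields, suppressed_fields):
        self.docs = params["docs"]
        self.masked_fields = masked_fields
        self.suppressed_fields = suppressed_fields

    def get_data(self):
        for doc in self.docs:
            # Suppressed fields never leave the reader.
            yield {k: v for k, v in doc.items() if k not in self.suppressed_fields}


class ListWriter:
    """Hypothetical writer that appends documents to an in-memory list."""

    def __init__(self, params):
        self.out = params["out"]

    def write_data(self, docs):
        self.out.extend(docs)


# Dispatchers: map a type name from the config to a class.
reader_mapping = {"list": ListReader}
writer_mapping = {"list": ListWriter}

source = {"type": "list", "params": {"docs": [{"user.name": "alice", "secret": "x", "bytes": 10}]}}
dest = {"type": "list", "params": {"out": []}}

reader = reader_mapping[source["type"]](source["params"], ["user.name"], ["secret"])
writer = writer_mapping[dest["type"]](dest["params"])

# Stand-in for the anonymization step: replace masked fields, then write.
masked = [
    {k: ("<masked>" if k in reader.masked_fields else v) for k, v in doc.items()}
    for doc in reader.get_data()
]
writer.write_data(masked)
print(dest["params"]["out"])
```

Looking the class up by type name keeps the script itself unchanged when a new reader or writer is added; only the mapping gains an entry.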

Creating your own anonymizer pipeline

An anonymizer requires a reader and a writer. Currently, only an Elasticsearch reader (readers.ESReader()) and a filesystem writer (writers.FSWriter()) are provided.

readers

Creating an instance of a reader requires the following:

  • a source object, which contains parameters about the source. Please note that each reader class requires a different set of parameters. Please consult docstrings for specific parameters.
  • masked_fields, which for Faker-based anonymization is a dictionary that maps each field name to be masked to the Faker provider used to mask it, e.g.: {"user.name": "user_name", "user.email": "email"}. For the hash-based implementation, masked_fields is simply a list of field names to be masked, e.g.: ["user.name", "user.email"].
  • suppressed_fields which is a list of fields that should NOT be included in anonymization.

masked_fields is required on the reader since the reader is responsible for enumerating the distinct values for each field to be used as a lookup for masking values in the faker-based anonymization.

suppressed_fields is required on the reader since we will explicitly exclude these from a search query.

Readers must implement the following methods:

  • get_data(), which is responsible for returning data from the source and passing it to the anonymizer.
  • (If using Faker-based anonymization), create_mappings(), which is responsible for generating a dictionary to be used by the anonymizer object. The dictionary is structured as follows:
    {
      "field.1": {
          "val1.1": None,
          "val1.2": None,
          ...,
          "val1.n": None
        },
      "field.2": {
          "val2.1": None,
          "val2.2": None,
          ...,
          "val2.m": None
        }
    }

where field.1 and field.2 are the fields to be anonymized and val1.1, val1.2, etc. are the distinct values for each field.
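
As a rough illustration of that structure, the sketch below builds such a dictionary from documents held in memory. It assumes flat documents keyed by dotted field names; the real ESReader enumerates distinct values from Elasticsearch, which is not reproduced here, and the sample data is hypothetical.

```python
def create_mappings(docs, masked_fields):
    """Collect the distinct values of each masked field, each mapped to None."""
    mappings = {field: {} for field in masked_fields}
    for doc in docs:
        for field in masked_fields:
            if field in doc:
                # None is a placeholder to be filled with a masked value later.
                mappings[field][doc[field]] = None
    return mappings


# Hypothetical documents.
docs = [
    {"user.name": "alice", "user.email": "alice@example.com"},
    {"user.name": "bob", "user.email": "bob@example.com"},
    {"user.name": "alice", "user.email": "alice@example.com"},
]
print(create_mappings(docs, ["user.name", "user.email"]))
```

Duplicate values collapse into a single key, so each distinct original value ends up with exactly one slot for its replacement.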

writers

Creating an instance of a writer requires the following:

  • A dest object, which contains parameters about the destination. Please note that each writer class requires a different set of parameters. Please consult docstrings for specific parameters.

Writers must implement the following methods:

  • write_data(), which sends anonymized data to the destination.
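
A writer along these lines might look like the following sketch. BaseWriter and JSONDirWriter are hypothetical names chosen for illustration; the project's own base class and FSWriter may differ in constructor arguments and file naming.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod


class BaseWriter(ABC):
    """Hypothetical base class: holds the dest params and requires write_data()."""

    def __init__(self, params):
        self.params = params

    @abstractmethod
    def write_data(self, data):
        """Send anonymized data to the destination."""


class JSONDirWriter(BaseWriter):
    """Hypothetical writer that stores each batch as a JSON file in a directory."""

    def __init__(self, params):
        super().__init__(params)
        self.directory = params["directory"]
        self.count = 0

    def write_data(self, data):
        os.makedirs(self.directory, exist_ok=True)
        path = os.path.join(self.directory, f"output-{self.count}.json")
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(data, fh)
        self.count += 1
        return path


# Write one batch to a temporary directory.
out_dir = tempfile.mkdtemp()
writer = JSONDirWriter({"directory": out_dir})
path = writer.write_data([{"user.name": "3f2a9c"}])
print(path)
```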

Run as Script

anonymizers

python anonymize.py configs/config.json

config.json defines the work to be done. Please see the template file at configs/config.json for guidance:

  • source defines the location of the original data to be anonymized along with the type of reader that should be invoked.
    • source.type: a reader type. one of:
      • "elasticsearch"
      • "csv" (TBD)
      • "json" (TBD)
    • source.params: parameters allowing for access of data. specific to the reader type.
      • "elasticsearch":
        • host
        • index
        • use_ssl
        • auth (native optional)
  • dest defines the location where the data should be written back to
    • dest.type: a writer type. one of:
      • "filesystem"
      • "csv" (TBD)
      • "elasticsearch" (TBD)
    • dest.params: parameters allowing for writing of data. specific to writer types
      • "json":
        • directory : directory to write output json files
  • anonymization: type of anonymization i.e. faker or hash
  • include: the fields to mask, along with the method for anonymization in the case of Faker-based anonymization. This is a dict with entries like {"field.name":"faker.provider.mask"}. Please see the Faker documentation for the available providers. For hash-based anonymization, this can be a list of fields to be masked, like ["field.name"].
  • exclude: specific fields to exclude
  • sensitive: included fields (apart from the masked fields) that should not be completely replaced by a faker/hash substitute, but should be searched for sensitive information
  • include_rest: {true|false}. If true, all fields except excluded fields will be written. If false, only fields specified in masks will be written.
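
Putting the keys above together, a configuration for Faker-based anonymization might look like the sketch below, written as a Python dict and dumped to JSON. Every value is a placeholder assumption (host, index, field names, and the list shape of exclude and sensitive); auth is omitted because its format is not described here. Treat configs/config.json in the repository as the authoritative template.

```python
import json

# Hypothetical values throughout; only the key names come from the list above.
config = {
    "source": {
        "type": "elasticsearch",
        "params": {
            "host": "localhost:9200",
            "index": "my-index",
            "use_ssl": False,
        },
    },
    "dest": {
        "type": "filesystem",
        "params": {"directory": "output"},
    },
    "anonymization": "faker",
    "include": {"user.name": "user_name", "user.email": "email"},
    "exclude": ["user.password"],
    "sensitive": ["message"],
    "include_rest": False,
}

print(json.dumps(config, indent=2))
```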

Important notes for Faker-based anonymization

  1. Set the provider_map class attribute for the Anonymizer class, which is a dict with entries like {"field.name":self.faker.provider.mask}. Refer to anonymizers.py for a test configuration of provider_map.
  2. If the fields being anonymized have high cardinality, set the high_cardinality_fields class attribute for the Anonymizer class, which is a dict with entries like {"field.name": [self.faker.provider.mask(10) for _ in range(10)]}.

Important notes for hash-based anonymization

  1. The user should have monitor privilege for the Elastic environment in which to run the anonymization.
  2. If you are a Cloud user and want to perform hash-based anonymization, you'll need to create an API key in the Elasticsearch Service Console and provide it as input when prompted. To create an API key, follow the instructions here.

In addition to the above settings, for more fine-grained control over the anonymization, you can also set the following class attributes for Anonymizer:

  1. user_regexes, which is a dict with entries like {"regex.name": "regex"}. These regexes are used to redact PII (apart from secrets, which are already taken care of) from the sensitive fields.
  2. keywords, which is a list like ["keyword1", "keyword2"]. Documents containing any of the keywords in any of the sensitive fields are dropped.
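
The effect of these two attributes can be sketched as follows. This is an illustration of the described behaviour under stated assumptions, not the Anonymizer's actual code: the scrub function, the email regex, the keyword, and the "<name>" replacement format are all hypothetical.

```python
import re

# Hypothetical settings mirroring the attributes described above.
user_regexes = {"email": r"[\w.+-]+@[\w-]+\.[\w.-]+"}
keywords = ["internal-only"]
sensitive_fields = ["message"]


def scrub(doc):
    """Return a redacted copy of doc, or None if the document should be dropped."""
    clean = dict(doc)
    for field in sensitive_fields:
        text = clean.get(field)
        if not isinstance(text, str):
            continue
        # Any keyword in a sensitive field drops the whole document.
        if any(keyword in text for keyword in keywords):
            return None
        # Otherwise each regex match is replaced by a tag naming the regex.
        for name, pattern in user_regexes.items():
            text = re.sub(pattern, f"<{name}>", text)
        clean[field] = text
    return clean


print(scrub({"message": "contact bob@example.com now"}))
print(scrub({"message": "internal-only memo"}))
```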

Adding Masks

For the faker-based anonymization, the anonymizer class only knows how to use providers that are enumerated in the provider_map class attribute. If you would like to add support for new faker providers, please add entries to this dict.

Adding Readers

Readers can be added to readers.py. Simply extend the base reader class and implement all abstract methods, then add a new entry to reader_mapping.
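
As a sketch of what extending and registering a reader could involve, the example below defines a stand-in abstract base class and a CSV reader that parses in-memory text. BaseReader, CSVReader, and the "text" parameter are hypothetical; the project's base reader class may declare different abstract methods and parameters.

```python
import csv
import io
from abc import ABC, abstractmethod


class BaseReader(ABC):
    """Hypothetical stand-in for the project's base reader class."""

    def __init__(self, source, masked_fields, suppressed_fields):
        self.source = source
        self.masked_fields = masked_fields
        self.suppressed_fields = suppressed_fields

    @abstractmethod
    def get_data(self):
        """Yield documents from the source."""


class CSVReader(BaseReader):
    """Hypothetical reader that parses CSV text passed in the source params."""

    def get_data(self):
        rows = csv.DictReader(io.StringIO(self.source["text"]))
        for row in rows:
            # Leave suppressed fields out of every document.
            yield {k: v for k, v in row.items() if k not in self.suppressed_fields}


# Register the new reader under a type name so the dispatcher can find it.
reader_mapping = {"csv": CSVReader}

reader = reader_mapping["csv"]({"text": "user.name,secret\nalice,x\n"}, ["user.name"], ["secret"])
print(list(reader.get_data()))
```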

Adding Writers

Writers can be added to writers.py. Simply extend the base writer class and implement all abstract methods, then add a new entry to writer_mapping.

General Notes

https://stackoverflow.com/questions/17486578/how-can-you-bundle-all-your-python-code-into-a-single-zip-file

Running Tests

To run the unit tests:

  1. Create a virtual environment and install dependencies in requirements.txt
  2. Execute py.test from the top-level repository directory
