  • Stars: 304
  • Rank: 137,274 (top 3%)
  • Language: Scala
  • License: Apache License 2.0
  • Created: over 10 years ago
  • Updated: about 4 years ago

Repository Details

Spark / GraphX implementation of the distributed Louvain modularity algorithm

dga-graphx

  • GraphX Algorithms

The dga-graphx package contains several pre-built, executable graph algorithms implemented on Spark using the GraphX framework.

pre-requisites

Spark (with GraphX) and Gradle are required to build and run these algorithms.

build

If necessary, edit the build.gradle file to set your versions of Spark and GraphX.

gradle clean dist

Check the build/dist folder for dga-graphx-0.1.jar.

Algorithms

Louvain

about louvain

Louvain distributed community detection is a parallelized version of this work:

Fast unfolding of communities in large networks, 
Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre, 
Journal of Statistical Mechanics: Theory and Experiment 2008 (10), P10008 (12pp)

In the original algorithm each vertex examines the communities of its neighbors and chooses a new community based on a function that maximizes the calculated change in modularity. In the distributed version all vertices make this choice simultaneously rather than in serial order, updating the graph state after each change. Because choices are made in parallel some choices will be incorrect and will not maximize modularity; however, after repeated iterations community choices become more stable and the results closely mirror those of the serial algorithm.
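
As a rough illustration of the simultaneous update (plain Scala rather than the repo's Spark/GraphX code, and with a simple edge-count "gain" standing in for the real modularity delta), every vertex evaluates its neighbors against the same snapshot of the community map before any assignment changes:

// Simplified sketch only: each vertex adopts the neighboring community it has
// the most edges into, and all vertices switch at once against one snapshot.
// The real algorithm scores candidate communities by change in modularity.
object SimultaneousChoiceSketch {
  def main(args: Array[String]): Unit = {
    val edges = Seq((1, 2), (2, 3), (3, 1), (4, 5), (5, 6), (6, 4), (3, 4))
    val neighbors: Map[Int, Seq[Int]] =
      (edges ++ edges.map(_.swap)).groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }

    // every vertex starts in its own community
    var community: Map[Int, Int] = neighbors.keys.map(v => v -> v).toMap

    for (iteration <- 1 to 5) {
      val snapshot = community                 // state all vertices read from
      community = snapshot.map { case (v, _) =>
        val best = neighbors(v)
          .groupBy(snapshot)                   // neighboring community -> neighbors in it
          .maxBy(_._2.size)._1                 // most edges into it wins
        v -> best
      }
      println(s"iteration $iteration: $community")
    }
  }
}

Because every vertex reads the same snapshot, two neighboring vertices can both decide to join each other's community in the same pass; the repeated iterations described above are what smooth those conflicts out.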

running louvain

After building the package (see above) you can execute the louvain algorithm against an edge list using the provided script:

bin/louvain

Usage: class com.soteradefense.dga.graphx.louvain.Main$ [options] [<property>=<value>....]

  -i <value> | --input <value>
        input file or path. Required.
  -o <value> | --output <value>
        output path. Required.
  -m <value> | --master <value>
        spark master, local[N] or spark://host:port. default=local
  -h <value> | --sparkhome <value>
        SPARK_HOME. Required to run on a cluster.
  -n <value> | --jobname <value>
        job name
  -p <value> | --parallelism <value>
        sets spark.default.parallelism and minSplits on the edge file. default=based on input partitions
  -x <value> | --minprogress <value>
        Number of vertices that must change communities for the algorithm to consider progress. default=2000
  -y <value> | --progresscounter <value>
        Number of times the algorithm can fail to make progress before exiting. default=1
  -d <value> | --edgedelimiter <value>
        specify input file edge delimiter. default=","
  -j <value> | --jars <value>
        comma-separated list of jars
  -z <value> | --ipaddress <value>
        Set to true to convert IP addresses to Long ids. default=false
  <property>=<value>....

To run a small local example execute:

bin/louvain -i examples/small_edges.tsv -o test_output --edgedelimiter "\t" 2> stderr.txt

Spark produces a lot of output, so sending stderr to a log file is recommended. Examine the test_output folder; you should see:

test_output/
├── level_0_edges
│   ├── _SUCCESS
│   └── part-00000
├── level_0_vertices
│   ├── _SUCCESS
│   └── part-00000
└── qvalues
    ├── _SUCCESS
    └── part-00000
cat test_output/level_0_vertices/part-00000 
(7,{community:8,communitySigmaTot:13,internalWeight:0,nodeWeight:3})
(4,{community:4,communitySigmaTot:21,internalWeight:0,nodeWeight:4})
(2,{community:4,communitySigmaTot:21,internalWeight:0,nodeWeight:4})
(6,{community:8,communitySigmaTot:13,internalWeight:0,nodeWeight:4})
(8,{community:8,communitySigmaTot:13,internalWeight:0,nodeWeight:3})
(5,{community:4,communitySigmaTot:21,internalWeight:0,nodeWeight:4})
(9,{community:8,communitySigmaTot:13,internalWeight:0,nodeWeight:3})
(3,{community:4,communitySigmaTot:21,internalWeight:0,nodeWeight:4})
(1,{community:4,communitySigmaTot:21,internalWeight:0,nodeWeight:5})

cat test_output/qvalues/part-00000 
(0,0.4134948096885813)

Note: the output is laid out as if you were in HDFS, even when running locally. For each level you see an edges directory and a vertices directory. The "level" refers to the number of times the graph has been "community compressed". At level 1, all of the level 0 vertices in community X are represented by a single vertex with the VertexID X. For the small example all modularity was maximized with no community compression, so only level 0 was computed. The vertices files show the state of each vertex, while the edges files specify the graph structure. The qvalues directory lists the modularity of the graph at each level of compression. For this example you should be able to see all of the vertices splitting off into two distinct communities (communities 4 and 8) with a final qvalue of ~0.413.
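
If you want to consume the vertex files elsewhere, a small helper along these lines (the object name and regex are ours, not something shipped in the repo) can pull (vertexId, community) pairs back out of the text format shown above:

// Hypothetical parser for the vertex output lines shown above; only the id
// and community fields are extracted here.
object VertexOutputParser {
  private val Line = """\((\d+),\{community:(\d+),.*\}\)""".r

  def parse(line: String): Option[(Long, Long)] = line match {
    case Line(id, community) => Some(id.toLong -> community.toLong)
    case _                   => None
  }

  def main(args: Array[String]): Unit = {
    val sample = "(7,{community:8,communitySigmaTot:13,internalWeight:0,nodeWeight:3})"
    println(parse(sample)) // prints Some((7,8))
  }
}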

running louvain on a cluster

To run on a cluster, be sure your input and output paths are of the form "hdfs:///path", and provide the --master and --sparkhome options. The --jars option is already set by the louvain script itself and does not need to be supplied.
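
For example, a cluster invocation might look like the following (the master URL, SPARK_HOME location, and HDFS paths are placeholders to adapt to your environment):

bin/louvain -i hdfs:///data/edges.tsv -o hdfs:///data/louvain_output -m spark://sparkmaster:7077 -h /opt/spark --edgedelimiter "\t" 2> stderr.txt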

parallelism

To change the level of parallelism, use the -p or --parallelism option. If this option is not set, parallelism is determined by the layout of the input data in HDFS: the number of partitions of the input file sets the level of parallelism.
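
For example, appending -p 200 (an illustrative value) sets spark.default.parallelism to 200 and reads the edge file with at least that many splits:

bin/louvain -i hdfs:///data/edges.tsv -o hdfs:///data/louvain_output -m spark://sparkmaster:7077 -h /opt/spark -p 200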

advanced

If you would like to include the Louvain algorithm in your own compute pipeline, create a custom output format, or similar, you can do so by extending the com.soteradefense.dga.graphx.louvain.LouvainHarness class. See HDFSLouvainRunner, which extends LouvainHarness and is called by Main in the example above.
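
A sketch of such an extension is below; the hook name and signature are assumptions based on the description above rather than the verified API, so check LouvainHarness and HDFSLouvainRunner in the source for the actual contract:

// Rough sketch only: saveLevel is assumed to be the per-level hook, mirroring
// what HDFSLouvainRunner does; verify the real method names before extending.
import org.apache.spark.SparkContext
import org.apache.spark.graphx.Graph
import com.soteradefense.dga.graphx.louvain.{LouvainHarness, VertexState}

class CustomLouvainRunner(minProgress: Int, progressCounter: Int, outputDir: String)
    extends LouvainHarness(minProgress, progressCounter) {

  // assumed hook: called once per level of community compression with the
  // current graph state and its modularity (q) value
  override def saveLevel(sc: SparkContext, level: Int, q: Double,
                         graph: Graph[VertexState, Long]): Unit = {
    graph.vertices.saveAsTextFile(s"$outputDir/level_${level}_vertices")
    graph.edges.saveAsTextFile(s"$outputDir/level_${level}_edges")
  }
}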

More Repositories

1. distributed-graph-analytics: Distributed Graph Analytics (DGA) is a compendium of graph analytics written for Bulk-Synchronous-Parallel (BSP) processing frameworks such as Giraph and GraphX. The analytics included are High Betweenness Set Extraction, Weakly Connected Components, Page Rank, Leaf Compression, and Louvain Modularity. (Java, 168 stars)
2. correlation-approximation: Spark implementation of the Google Correlate algorithm to quickly find highly correlated vectors in huge datasets. (Scala, 91 stars)
3. mitie-trainer: Model training tool for MITIE. (JavaScript, 77 stars)
4. newman: Quickly analyze and explore email with advanced analytics and visualization. (JavaScript, 55 stars)
5. pst-extraction: PST extraction and analytic pipeline. (Python, 37 stars)
6. distributed-louvain-modularity: Community detection and compression analytic for big graph data. (Java, 37 stars)
7. graphene: (JavaScript, 24 stars)
8. zephyr: Zephyr is a big data, platform-agnostic ETL API, with Hadoop MapReduce, Storm, and other big data bindings. (Java, 21 stars)
9. watchman: Watchman, an open-source social-media event-detection system. (JavaScript, 20 stars)
10. aggregate-micro-paths: Infer movement patterns from large amounts of geo-temporal data in a cloud environment. (Python, 14 stars)
11. track-communities: A series of analytics for creating networks from geo-temporal track data based on time/space co-occurrence. Includes a UI for visualization of communities and tracks. (JavaScript, 14 stars)
12. Datawake: Browser add-on and web server to support collection and analysis of web browsing data. (JavaScript, 13 stars)
13. Datawake-Legacy: This project is superseded by the current Datawake project but is maintained here for existing users. Browser extension and backend services aimed at enhancing Internet search with domain-specific knowledge, collaboration, and analysis. (JavaScript, 10 stars)
14. DatawakeDepot: Loopback web application for administration of Datawake networks. (JavaScript, 9 stars)
15. high-betweenness-set-extraction: Approximate betweenness centrality computation for big graph data. (Java, 8 stars)
16. rhipe-arima: An R/Hadoop ARIMA analytic using Rhipe to submit MapReduce jobs. (R, 8 stars)
17. GEQE: Geo Event Query by Example; leverage geo-located temporal text data in order to identify similar locations or events. (Python, 8 stars)
18. firmament: NodeJS script and Docker files to create a MySQL/MongoDB-backed AngularJS/Bootstrap web application. (JavaScript, 7 stars)
19. datawake-prefetch: (Python, 7 stars)
20. page-rank: (Java, 6 stars)
21. social-sandbox: Geo-temporal scraping of social media, unsupervised event detection. (JavaScript, 4 stars)
22. xdata-vm: Vagrant-Ubuntu VM serving as a platform for XDATA performer software integration. (Ruby, 4 stars)
23. xdata-nba: Tools to mine NBA data. (Python, 3 stars)
24. leaf-compression: (Java, 3 stars)
25. DatawakeManager-WebApp: DatawakeManager web server. (JavaScript, 2 stars)
26. newman-vm: Newman VM. (Shell, 2 stars)
27. interactive-graph-viewer: An R Shiny app for interactively viewing the results of the Louvain method for community detection. (JavaScript, 2 stars)
28. hive-common-udf: A collection of common Apache Hive UDFs. (Java, 2 stars)
29. triangle-counting: A port of the work at Sandia National Laboratories on approximate triangle counting via wedge sampling. (Scala, 2 stars)
30. merlin-stack: (Shell, 2 stars)
31. graphene-enron: (JavaScript, 2 stars)
32. go_watchman: github.com/watchman apps for which Go is specifically well suited. (Go, 2 stars)
33. graphene-walker: (Java, 2 stars)
34. Rmmtsne: A native R implementation of multiple maps t-distributed stochastic neighbor embedding (mmtsne). (R, 1 star)
35. twitter-cacher: Twitter scraper. (Java, 1 star)
36. zephyr-sample-project: A sample project (or, rather, sample projects) showing various ways of using Zephyr; generally a good starting point for your own Zephyr implementations. (Java, 1 star)
37. vande: (Java, 1 star)
38. sotera.github.io: (CSS, 1 star)
39. DatawakeManager-Loopback: DatawakeManager data layer. (JavaScript, 1 star)
40. newman-research: Tools to be evaluated prior to integration into Newman. (Python, 1 star)
41. graphene-instagram: A version of Graphene that runs on scraped Instagram data. (Java, 1 star)
42. DatawakeFFPlugin: JMI-based Datawake plugin for Firefox 38+. (JavaScript, 1 star)
43. zephyr-contrib: Useful classes for functions outside the scope of Zephyr's ETL but still used in many scenarios (generally with extensive dependencies that probably shouldn't be in the core API). (Java, 1 star)
44. DatawakeSuite: (1 star)
45. micropath-kml: For creating KML to visualize aggregate micro-path output. (Java, 1 star)