Albedo
A recommender system for discovering GitHub repos, built with Apache Spark.
Albedo is a fictional character in Dan Simmons's Hyperion Cantos series. Councilor Albedo is the TechnoCore's AI advisor to the Hegemony of Man.
Setup
$ git clone https://github.com/vinta/albedo.git
$ cd albedo
$ make up
Collect Data
You need to create your own GITHUB_PERSONAL_TOKEN
on your GitHub settings page.
# get into the main container
$ make attach
# this step might take a few hours to complete
# depends on how many repos you starred and how many users you followed
$ (container) python manage.py migrate
$ (container) python manage.py collect_data -t GITHUB_PERSONAL_TOKEN -u GITHUB_USERNAME
# or
$ (container) wget https://s3-ap-northeast-1.amazonaws.com/files.albedo.one/albedo.sql
$ (container) mysql -h mysql -u root -p123 albedo < albedo.sql
# username: albedo
# password: hyperion
$ make run
$ open http://127.0.0.1:8000/admin/
Start a Spark Cluster
You could also create a Spark cluster on Google Cloud Dataproc.
# start a local Spark cluster in Standalone mode
$ make spark_start
Use Popularity as the Recommendation Baseline
See PopularityRecommenderBuilder.scala for complete code.
$ spark-submit \
--master spark://localhost:7077 \
--packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
--class ws.vinta.albedo.PopularityRecommenderTrainer \
target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.002017744675282716
Build the User Profile for Feature Engineering
See UserProfileBuilder.scala for complete code.
$ spark-submit \
--master spark://localhost:7077 \
--packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
--class ws.vinta.albedo.UserProfileBuilder \
target/albedo-1.0.0-SNAPSHOT.jar
Build the Item Profile for Feature Engineering
See RepoProfileBuilder.scala for complete code.
$ spark-submit \
--master spark://localhost:7077 \
--packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
--class ws.vinta.albedo.RepoProfileBuilder \
target/albedo-1.0.0-SNAPSHOT.jar
Train an ALS Model for Candidate Generation
See ALSRecommenderBuilder.scala for complete code.
$ spark-submit \
--master spark://localhost:7077 \
--packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
--class ws.vinta.albedo.ALSRecommenderBuilder \
target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.05209047292612741
Build a Content-based Recommender for Candidate Generation
Elasticsearch's More Like This API will do the tricks.
$ (container) python manage.py sync_data_to_es
See ContentRecommenderBuilder.scala for complete code.
$ spark-submit \
--master spark://localhost:7077 \
--packages "com.github.fommil.netlib:all:1.1.2,org.apache.httpcomponents:httpclient:4.5.2,org.elasticsearch.client:elasticsearch-rest-high-level-client:5.6.2,mysql:mysql-connector-java:5.1.41" \
--class ws.vinta.albedo.ContentRecommenderBuilder \
target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.002559563451967487
Train a Word2Vec Model for Text Vectorization
See Word2VecCorpusBuilder.scala for complete code.
$ spark-submit \
--master spark://localhost:7077 \
--packages "com.github.fommil.netlib:all:1.1.2,com.hankcs:hanlp:portable-1.3.4,mysql:mysql-connector-java:5.1.41" \
--class ws.vinta.albedo.Word2VecCorpusBuilder \
target/albedo-1.0.0-SNAPSHOT.jar
Train a Logistic Regression Model for Ranking
See LogisticRegressionRanker.scala for complete code.
$ spark-submit \
--master spark://localhost:7077 \
--packages "com.github.fommil.netlib:all:1.1.2,com.hankcs:hanlp:portable-1.3.4,mysql:mysql-connector-java:5.1.41" \
--class ws.vinta.albedo.LogisticRegressionRanker \
target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.021114356461615493
TODO
- Build a recommender system with Spark: Factorization Machine
- Build a recommender system with Spark: GDBT for Feature Learning
- Build a recommender system with Spark: Item2Vec
- Build a recommender system with Spark: PageRank and GraphX
- Build a recommender system with Spark: XGBoost