  • Stars: 171
  • Rank: 220,994 (Top 5%)
  • Language: Python
  • License: MIT License
  • Created: about 9 years ago
  • Updated: about 9 years ago


Repository Details

An artist recommendation engine, from feeding Spotify playlists through word2vec

artistrecs

A similar artist recommendation engine powered by Spotify playlists and word2vec.

This proof of concept was inspired by two pieces and my own longstanding belief that, given enough of them, the transitions between songs in playlists are a valuable source of insight.

Some bathroom reading:

Also, a quick reminder: this is a proof of concept! It works and is pretty cool, but the tools are not complete, the layout and design decisions still have kinks, and things will change. In fact, plan on things changing for as long as this message is here.


This application consists of two major components:

  • A Celery-backed extraction setup for ingesting playlists from Spotify en masse. Celery workers are responsible for importing playlists and extracting artist names.
  • Helper scripts for training and querying data extracted from said playlists.

NOTE - project layout will be changing shortly. See TODO for what's up.

Setup

Before getting started with this project, please ensure you have the following installed: Python (with virtualenv and pip), Redis, and the word2vec binaries (word2vec installation is covered below).

Please also have your OAuth client ID and secret ready from the Spotify application you wish to use. If you need to register a Spotify app, do so here before continuing.
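
If you want to sanity-check those credentials before wiring them in, the snippet below (a standalone sketch using the requests library against Spotify's client-credentials token endpoint; the project itself may go through a Spotify client library instead) will confirm they can mint an access token:

import os

import requests

# Exchange the client id/secret for a short-lived bearer token via the
# client-credentials flow. A 200 response means the credentials are valid.
resp = requests.post(
    "https://accounts.spotify.com/api/token",
    data={"grant_type": "client_credentials"},
    auth=(os.environ["SPOTIFY_CLIENT_ID"], os.environ["SPOTIFY_CLIENT_SECRET"]),
)
resp.raise_for_status()
print("token OK:", resp.json()["access_token"][:12] + "...")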

Install word2vec bindings

OSX using Homebrew

$ brew install --HEAD homebrew/head-only/word2vec


Install requirements

Check out this repo into a virtualenv, and then install its Python requirements using pip:

$ pip install -r requirements.txt

Set up environment variables

This application's configuration data is taken from environment variables.

  • SPOTIFY_CLIENT_ID: Spotify OAuth client id
  • SPOTIFY_CLIENT_SECRET: Spotify OAuth client secret
  • ARTISTRECS_BROKER_URL: Celery broker URL. Defaults to localhost:6379.
  • ARTISTRECS_RESULT_BACKEND: Celery result backend URL. Defaults to localhost:6379.

How you set these variables is up to you.

TIP: envdir is great for this.
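
For reference, here is a minimal sketch of how the application side might pick these up (variable names match the list above; the project's actual config module and defaults may differ):

import os

# Spotify OAuth credentials (required).
SPOTIFY_CLIENT_ID = os.environ["SPOTIFY_CLIENT_ID"]
SPOTIFY_CLIENT_SECRET = os.environ["SPOTIFY_CLIENT_SECRET"]

# Celery connection settings, falling back to a local Redis instance
# (the localhost:6379 default above, spelled out as a redis:// URL).
BROKER_URL = os.environ.get("ARTISTRECS_BROKER_URL", "redis://localhost:6379")
RESULT_BACKEND = os.environ.get("ARTISTRECS_RESULT_BACKEND", "redis://localhost:6379")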

Workflow and output

Extraction workflow

First, a quick overview of what the Celery workers are doing, from start to finish:

  1. playlist_generator task receives a request to search Spotify's playlists for a specific term. For each playlist found, playlist_generator creates a new task: resolve_playlist.

    If recycle is enabled, playlist_generator will respawn itself to collect more information. This is controlled by the max_recycles parameter - if left undefined, playlist_generator will collect all playlists available for the given term.

  2. resolve_playlist receives a username and playlist id and fetches the playlist's tracks. It then compiles the artist names from every track in the playlist into a single list [1]. Once all names have been collected, resolve_playlist hands the list off to export_artist_sentence_from_playlist.

  3. export_artist_sentence_from_playlist accepts a user id, playlist id, and a list of artist names. It JSON-encodes them and then writes the object to a file.

    This task is run in a separate queue to avoid overlapping file writes.

Once all jobs have been processed, you should have a text file ready to be processed by the parser.py script.

[1] If the artist name about to be collected is also the last entry in the current list, it is ignored. This is a naive way of protecting against entire albums influencing transitional frequencies.
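
Reduced to code, the task chain looks roughly like the sketch below. Task and queue names follow the description above, but the Spotify helpers are stubbed and the signatures are assumptions rather than the project's exact code (recycling via max_recycles is omitted):

import json
import os

from celery import Celery

app = Celery(
    "artistrecs",
    broker=os.environ.get("ARTISTRECS_BROKER_URL", "redis://localhost:6379"),
    backend=os.environ.get("ARTISTRECS_RESULT_BACKEND", "redis://localhost:6379"),
)

SENTENCE_OUTPUT_PATH = os.environ.get("SENTENCE_OUTPUT_PATH", "sentences.jsonl")  # assumed

def search_playlists(term):
    """Hypothetical helper: yield (user_id, playlist_id) for playlists matching term."""
    raise NotImplementedError

def iter_track_artists(user_id, playlist_id):
    """Hypothetical helper: yield one artist name per track in the playlist."""
    raise NotImplementedError

@app.task(queue="extraction")
def playlist_generator(term):
    # Fan out one resolve_playlist task per playlist found for the search term.
    for user_id, playlist_id in search_playlists(term):
        resolve_playlist.delay(user_id, playlist_id)

@app.task(queue="extraction")
def resolve_playlist(user_id, playlist_id):
    # Build the "sentence": artist names in playlist order, skipping a name that
    # repeats the previous entry so whole albums don't dominate the transitions.
    sentence = []
    for name in iter_track_artists(user_id, playlist_id):
        if sentence and sentence[-1] == name:
            continue
        sentence.append(name)
    export_artist_sentence_from_playlist.delay(user_id, playlist_id, sentence)

@app.task(queue="writer")
def export_artist_sentence_from_playlist(user_id, playlist_id, sentence):
    # Runs on its own single-worker queue so file appends never interleave.
    with open(SENTENCE_OUTPUT_PATH, "a") as fh:
        fh.write(json.dumps({"playlist_id": playlist_id,
                             "user_id": user_id,
                             "sentence": sentence}) + "\n")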

Right now, there are many moving pieces. As this project matures past "proof of concept", noise will be reduced.

Output format

word2vec works by analyzing sentences. That's a gross generalization. Here's something from the official docs:

The word2vec tool takes a text corpus as input and produces the word vectors as output. It first constructs a vocabulary from the training text data and then learns vector representation of words. The resulting word vector file can be used as features in many natural language processing and machine learning applications.

gensim, the word2vec implementation used here, expects sentences to be delivered as lists of words. In our particular use case, we're constructing sentences from artist names. Wild, right?

At the end of a run through the extraction workflow described above, you'll have a file in which every line is a sentence constructed from the artists in one playlist. The format for this output is:

{
    "playlist_id": "<spotify playlist id>",
    "user_id": "<spotify user id>",
    "sentence": [
        "<artist name>",
        "<artist name>",
        ...
    ]
}

This file can be found at SENTENCE_OUTPUT_PATH.
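
Getting from that file back to the list-of-words form gensim expects is a line-by-line json.loads; a small sketch (the path is whatever SENTENCE_OUTPUT_PATH points at):

import json

def load_sentences(path):
    # Each line is one JSON object; its "sentence" field is already a list of
    # artist names, which is exactly the shape gensim's Word2Vec consumes.
    with open(path) as fh:
        return [json.loads(line)["sentence"] for line in fh if line.strip()]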

Running

Once you've completed the above, you're ready to begin.

Start Redis

If redis-server is not already running in the background, fire it up.

$ redis-server

Start extraction queues

In one shell inside your virtualenv, fire up the extraction queue. This queue is responsible for the tasks concerned with querying Spotify and compiling artist names from tracks inside of playlists.

$ celery worker -A artistrecs -l info -Q extraction

In another shell, start the writer queue with a single worker. This is a sloppy workaround to prevent overlapping writes caused by unlocked append access to a single file.

$ celery worker -A artistrecs -l info -Q writer

Insert a task

To query Spotify for playlists, insert a task using insert_task.py. This helper script accepts two parameters and expects exactly one of them to be provided:

$ python insert_task.py -t <term>

See insert_task.py's --help print-out for extended usage details.
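
Conceptually the script boils down to dropping a playlist_generator task onto the extraction queue; a rough sketch of that idea (the argument handling and import path here are assumptions, not the script's actual contents):

import argparse

from artistrecs import playlist_generator  # assumed import path

parser = argparse.ArgumentParser()
parser.add_argument("-t", "--term", help="playlist search term to queue")
args = parser.parse_args()

# Enqueue the search; the Celery workers started above do the rest.
playlist_generator.delay(args.term)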

Parsing output

Data extracted from playlists may be parsed using parser.py.

$ python parser.py -i <path to output file> -t <artist name>

This will emit a JSON-encoded ranked list of similar artists. Make sure that the artist name you're checking was, in fact, a member of at least a few of the playlists brought in during extraction.

See --help for extended usage details.
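
Under the hood, this kind of ranking boils down to training a gensim Word2Vec model on the artist sentences and asking for nearest neighbours. A hedged sketch of the idea (parameter choices are illustrative, not parser.py's actual settings):

import json

from gensim.models import Word2Vec

def rank_similar(path, artist, topn=10):
    # Load one artist "sentence" per line, train embeddings over them, and rank
    # other artists by cosine similarity to the query artist.
    with open(path) as fh:
        sentences = [json.loads(line)["sentence"] for line in fh if line.strip()]
    model = Word2Vec(sentences, min_count=1)
    pairs = model.wv.most_similar(artist, topn=topn)
    return [(name, float(score)) for name, score in pairs]  # plain floats for JSON output

if __name__ == "__main__":
    # Illustrative usage only; see parser.py --help for the real interface.
    print(json.dumps(rank_similar("sentences.jsonl", "Radiohead"), indent=2))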

TODO

  • Add setup.py, which should also install helper scripts as console scripts
  • Parser should default to using SENTENCE_OUTPUT_PATH env var when -i is not given.

Issues

Want to contribute? File an issue or a pull request.

More Repositories

  1. react-helmet-example: A no-frills example of using react, react-router, and react-helmet together in a universal context (JavaScript, 74 stars)
  2. hash-ring-ctypes: Fast ctypes-based wrapper around libhashring (Python, 28 stars)
  3. python-songkick: Python asked Songkick to the dance. Songkick said yes! (Python, 26 stars)
  4. retrosheet-sql-queries: Retrosheet SQL cookbook (PLSQL, 17 stars)
  5. baseball-brooks-pitch-importer: Flexible importer for Brooks Baseball's corrected Pitch f/x data (Python, 17 stars)
  6. python-spotify-api: Quick pass at wrapping the Spotify Metadata search API. A work in progress. (Python, 10 stars)
  7. baseball-normalize-player-ids: Normalize MLB player id registries (Python, 9 stars)
  8. mlbam-utils: mlbam download helper, and more (Python, 9 stars)
  9. python-amazon-mp3-api: Python wrapper for the Amazon MP3 API (Python, 8 stars)
  10. baseball-projection-schematics: Rewrite and normalize baseball player projections (Python, 7 stars)
  11. baseball-fangraphs-steamer-importer: Extracts full Steamer projections from Fangraphs (Python, 6 stars)
  12. 2017-sloan-data-scraping: Scripts used in the Sloan scraping demo (Python, 6 stars)
  13. django-active-login-required: Django view decorator that ensures a user is both active and authenticated (Python, 6 stars)
  14. sloan-2019-scraping-code: MIT SSAC 2019 scraping session code (Python, 6 stars)
  15. retrosheet-pitch-sequences: Extracts pass-through counts from Retrosheet pitch sequences (Python, 5 stars)
  16. universal-redux-boilerplate: An example of a universal app with redux, react-router, and webpack; hot loading included (JavaScript, 5 stars)
  17. text-fingerprinting: Implementation of Winnowing: Local Algorithms for Document Fingerprinting (Python, 4 stars)
  18. homers: Every home run (CSS, 4 stars)
  19. baseball-pagerank: Baseball links for the PageRank god, or a quick and dirty look at surfacing research through graph centrality (Python, 2 stars)
  20. retrosheet-downloader: Bulk Retrosheet event file downloader (Go, 2 stars)
  21. baseball-bb-america-players: Imports basic player details from Baseball America (Python, 2 stars)
  22. jones: "Collects" art while the servers aren't looking (Python, 2 stars)
  23. baseball-seamheads-park-dimensions: Exports active park dimensions from Seamheads (Python, 2 stars)
  24. rsd: Record Store Day API (2 stars)
  25. python-acoustid-api: Wrapper around AcoustID v2. Presently covers lookups. Comes with tests. (Python, 2 stars)
  26. corrupt: Python port of Recyclism's image corruption tool (Python, 2 stars)
  27. django-sugarjs: Django admin + sugar.js integration (JavaScript, 2 stars)
  28. modular-grid-scraper: Extracts all inventory and prices from modular grid's modular inventory (Python, 2 stars)
  29. pipe-to-slack: Follow a file, pipe to Slack (via incoming webhook) (Python, 1 star)
  30. bpm (Python, 1 star)
  31. article-club: Public repo for the article club project @ sndmakes (Python, 1 star)
  32. mattreduce: Tiny mapreduce implementation for tiny data (Python, 1 star)
  33. homebrew-chadwick: Homebrew recipe for Chadwick (Ruby, 1 star)
  34. also-in-c: for drew (Python, 1 star)
  35. cbs-keeper-sweeper: A Chrome extension which removes kept players from CBS Fantasy Baseball draft rankings (JavaScript, 1 star)
  36. projections: Scratch repo for building a private, customizable stats UI (Python, 1 star)