• Stars: 396
• Rank: 108,801 (Top 3%)
• Language: C++
• License: Apache License 2.0
• Created over 7 years ago
• Updated about 1 month ago

Repository Details

Python package to accelerate the sparse matrix multiplication and top-n similarity selection

sparse_dot_topn:

sparse_dot_topn provides a fast way to perform a sparse matrix multiplication followed by selection of the top-n multiplication results.

Comparing very large feature vectors and picking the best matches often comes down, in practice, to a sparse matrix multiplication followed by selecting the top-n multiplication results. In this package, we implement a customized Cython function for this purpose. Compared with doing the same with SciPy and NumPy functions, our Cython approach improves speed by about 40% and reduces memory consumption.
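To make the operation concrete, here is a minimal pure-Python sketch of "sparse multiplication followed by top-n selection". It is an illustration only, not the package's Cython implementation: the function name sparse_dot_topn_sketch and the dict-based row representation are assumptions made for this example, and treating lower_bound as a strict threshold is likewise an assumption.

```python
import heapq


def sparse_dot_topn_sketch(a_rows, b_rows, ntop, lower_bound=0.0):
    """Multiply two sparse matrices and keep the ntop largest entries per row.

    a_rows: list of dicts, a_rows[i][k] = A[i, k] (non-zero entries only)
    b_rows: list of dicts, b_rows[k][j] = B[k, j] (non-zero entries only)
    Returns, for each row of A, a list of (column, value) pairs sorted by
    descending value. This layout is hypothetical, chosen for readability.
    """
    result = []
    for a_row in a_rows:
        # Accumulate one row of the product A @ B, touching only non-zeros.
        acc = {}
        for k, a_val in a_row.items():
            for j, b_val in b_rows[k].items():
                acc[j] = acc.get(j, 0.0) + a_val * b_val
        # Discard values not above the threshold, then keep the ntop largest.
        candidates = [(j, v) for j, v in acc.items() if v > lower_bound]
        result.append(heapq.nlargest(ntop, candidates, key=lambda jv: jv[1]))
    return result


a = [{0: 1.0, 2: 2.0}, {1: 3.0}]
b = [{0: 0.5, 1: 0.25}, {0: 0.125}, {1: 1.0, 2: 0.375}]
top = sparse_dot_topn_sketch(a, b, ntop=2, lower_bound=0.01)
```

Selecting the top-n values row by row means the full product matrix never has to be held in memory at once, which is the intuition behind the reduced memory consumption mentioned above.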

This package is made by the ING Wholesale Banking Advanced Analytics team. Two blog posts explain how we implemented it.

Example

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import rand
from sparse_dot_topn import awesome_cossim_topn

N = 10
a = rand(100, 1000000, density=0.005, format='csr')
b = rand(1000000, 200, density=0.005, format='csr')

# The default precision type is np.float64, but you can downcast to np.float32 for a smaller memory footprint and faster execution
# Remark: these are the only 2 types supported for now. We assume that float16 would be difficult to implement and slower, because C doesn't support a 16-bit float type on most PCs
a = a.astype(np.float32)
b = b.astype(np.float32)

# Use standard implementation
c = awesome_cossim_topn(a, b, N, 0.01)

# Use parallel implementation with 4 threads
d = awesome_cossim_topn(a, b, N, 0.01, use_threads=True, n_jobs=4)

# Use parallel implementation with 4 threads and also compute best_ntop: the value of ntop needed to capture all results above lower_bound
d, best_ntop = awesome_cossim_topn(a, b, N, 0.01, use_threads=True, n_jobs=4, return_best_ntop=True)

You can also find code that compares our accelerated method with calling the SciPy and NumPy functions directly in example/comparison.py
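The best_ntop value returned above is described as the ntop needed to capture all results above lower_bound. One way to read that definition, sketched here in pure Python, is "the largest number of above-threshold values found in any single row". The helper name best_ntop_sketch, the dict-based rows, and the strict comparison are assumptions for illustration; the package computes this internally and may differ in detail.

```python
def best_ntop_sketch(product_rows, lower_bound):
    """Return the largest per-row count of values strictly above lower_bound.

    product_rows: list of dicts mapping column index -> value for one row of
    the product matrix (a hypothetical layout used only for this sketch).
    """
    return max(
        (sum(1 for v in row.values() if v > lower_bound) for row in product_rows),
        default=0,
    )


rows = [{0: 0.5, 1: 2.25, 2: 0.75}, {0: 0.375}, {}]
needed = best_ntop_sketch(rows, lower_bound=0.4)
```

If the value obtained this way is larger than the ntop you passed, some rows had more matches above the threshold than were returned, and rerunning with the larger ntop would capture them all.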

Dependency and Install

Install numpy and cython before installing this package. Then:

pip install sparse_dot_topn

From version 0.3.0 onwards, we no longer actively support Python 2.7. However, you should still be able to install this package under Python 2.7. If you encounter gcc compilation issues, please refer to these discussions and set the CFLAGS and CXXFLAGS variables.

Uninstall

pip uninstall sparse_dot_topn

Local development

python setup.py clean --all
python setup.py develop
pytest
python -m build
cd dist/
pip install sparse_dot_topn-*.tar.gz

Release strategy

From version 0.3.2, we use GitHub Actions to build wheels for different operating systems and Python environments with cibuildwheel, and to release automatically. Hopefully this will solve many installation-related issues. The build and publish pipeline is configured in .github/workflows/wheels.yml. When a new release is needed, please follow these steps:

  1. Create a test branch named test/x.x.x from the main branch.
  2. In the test/x.x.x branch, update the version number to a release candidate of the form x.x.x.rcx (e.g. 0.3.4.rc0) in setup.py, and update the changelog in the CHANGES.md file.
  3. Push the test/x.x.x branch; the build and publish pipeline is triggered automatically, and the new release is uploaded to Test PyPI: https://test.pypi.org/project/sparse-dot-topn/.
  4. Do a sanity check on the Test PyPI release.
  5. Update the changelog in CHANGES.md.
  6. Create a branch on top of the test branch.
  7. Modify the version number by removing the rcx suffix in setup.py.
  8. Push the branch; the build and publish pipeline is triggered automatically, and the new release is uploaded to PyPI: https://pypi.org/project/sparse-dot-topn
  9. Merge the release branch back to master.
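The version change between the release-candidate and final steps can be sketched as a small helper. This is a hypothetical convenience for illustration, not part of the repository: the name to_release_version and the regular expression are assumptions based on the x.x.x.rcx pattern given in the steps.

```python
import re


def to_release_version(rc_version):
    """Strip a trailing release-candidate suffix such as '.rc0'.

    Versions without an rc suffix are returned unchanged.
    """
    return re.sub(r"\.?rc\d+$", "", rc_version)


final_version = to_release_version("0.3.4.rc0")
```

The resulting string is what would be written into setup.py before pushing the release branch.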
