• Stars
    star
    379
  • Rank 108,969 (Top 3 %)
  • Language
    C++
  • License
    Apache License 2.0
  • Created almost 7 years ago
  • Updated about 2 months ago

Repository Details

Python package to accelerate the sparse matrix multiplication and top-n similarity selection

sparse_dot_topn:

sparse_dot_topn provides a fast way to perform a sparse matrix multiplication followed by selection of the top-n multiplication results.

Comparing very large feature vectors and picking the best matches often comes down, in practice, to a sparse matrix multiplication followed by selecting the top-n multiplication results. In this package, we implement a customized Cython function for this purpose. Compared with doing the same thing using SciPy and NumPy functions, our Cython approach improves speed by about 40% and reduces memory consumption.

This package was made by the ING Wholesale Banking Advanced Analytics team. Two blog posts explain how we implemented it.
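To make the operation concrete, here is a minimal pure-Python sketch of a row-by-row sparse product that keeps only the top-n results per row. It is an illustration of the idea, not the package's Cython implementation: the function name `sparse_dot_topn_sketch`, the dict-based matrix layout, and the strict `>` comparison against `lower_bound` are all assumptions made for this example.

```python
import heapq
from collections import defaultdict


def sparse_dot_topn_sketch(a_rows, b_rows, ntop, lower_bound=0.0):
    """Multiply sparse A by sparse B and keep the ntop largest results per row.

    a_rows: list of {column: value} dicts, one per row of A (hypothetical layout).
    b_rows: {row: {column: value}} dict describing B (hypothetical layout).
    Returns one list of (column, value) pairs per row of A, largest value first.
    """
    result = []
    for a_row in a_rows:
        # Accumulate the non-zero entries of this row of A @ B.
        sums = defaultdict(float)
        for k, a_val in a_row.items():
            for j, b_val in b_rows.get(k, {}).items():
                sums[j] += a_val * b_val
        # Drop results at or below the bound, then keep the ntop largest.
        candidates = [(v, j) for j, v in sums.items() if v > lower_bound]
        top = heapq.nlargest(ntop, candidates)
        result.append([(j, v) for v, j in top])
    return result


a_rows = [{0: 1.0, 2: 2.0}, {1: 3.0}]
b_rows = {0: {0: 0.5, 1: 1.0}, 1: {1: 2.0}, 2: {0: 1.0, 2: 4.0}}
top2 = sparse_dot_topn_sketch(a_rows, b_rows, ntop=2)
print(top2)
```

Because each output row is reduced to its top-n entries as soon as it is computed, the full product matrix never has to be held in memory, which is one plausible source of the memory saving described above.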

Example

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import rand
from sparse_dot_topn import awesome_cossim_topn

N = 10
a = rand(100, 1000000, density=0.005, format='csr')
b = rand(1000000, 200, density=0.005, format='csr')

# Default precision type is np.float64, but you can downcast to np.float32 for a smaller memory footprint and faster execution
# Remark: these are the only 2 types supported now, since we assume that float16 will be difficult to implement and will be slower, because C doesn't support a 16-bit float type on most PCs
a = a.astype(np.float32)
b = b.astype(np.float32)

# Use standard implementation
c = awesome_cossim_topn(a, b, N, 0.01)

# Use parallel implementation with 4 threads
d = awesome_cossim_topn(a, b, N, 0.01, use_threads=True, n_jobs=4)

# Use parallel implementation with 4 threads and also compute best_ntop: the value of ntop needed to capture all results above lower_bound
d, best_ntop = awesome_cossim_topn(a, b, N, 0.01, use_threads=True, n_jobs=4, return_best_ntop=True)

Code that compares our accelerated method with calling SciPy and NumPy functions directly is available in example/comparison.py.
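For orientation, a plain SciPy and NumPy baseline for the same task might look roughly like the sketch below. It is not the code from example/comparison.py; the helper name `scipy_topn` is hypothetical, and it assumes `ntop >= 1` and a strict `>` comparison against `lower_bound`.

```python
import numpy as np
from scipy.sparse import csr_matrix


def scipy_topn(a, b, ntop, lower_bound=0.0):
    """Baseline sketch: compute the full product, then select top-n per row."""
    product = (a @ b).tocsr()  # the full product is materialized first
    rows, cols, vals = [], [], []
    for i in range(product.shape[0]):
        start, end = product.indptr[i], product.indptr[i + 1]
        row_vals = product.data[start:end]
        row_cols = product.indices[start:end]
        # Keep only entries above the bound.
        keep = row_vals > lower_bound
        row_vals, row_cols = row_vals[keep], row_cols[keep]
        # Keep the ntop largest entries (in no particular order within the row).
        if row_vals.size > ntop:
            top = np.argpartition(row_vals, -ntop)[-ntop:]
            row_vals, row_cols = row_vals[top], row_cols[top]
        rows.extend([i] * row_vals.size)
        cols.extend(row_cols.tolist())
        vals.extend(row_vals.tolist())
    return csr_matrix((vals, (rows, cols)), shape=product.shape)


a_small = csr_matrix(np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]))
b_small = csr_matrix(np.array([[0.5, 1.0, 0.0], [0.0, 2.0, 0.0], [1.0, 0.0, 4.0]]))
baseline = scipy_topn(a_small, b_small, ntop=2)
print(baseline.toarray())
```

The baseline stores every non-zero entry of the product before discarding most of them, which is the overhead a fused multiply-and-select routine avoids.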

Dependency and Install

Install numpy and cython first before installing this package. Then,

pip install sparse_dot_topn

From version 0.3.0 onward, we don't actively support Python 2.7. However, you should still be able to install this package on Python 2.7. If you encounter gcc compilation issues, please refer to these discussions and set the CFLAGS and CXXFLAGS variables.

Uninstall

pip uninstall sparse_dot_topn

Local development

python setup.py clean --all
python setup.py develop
pytest
python -m build
cd dist/
pip install sparse_dot_topn-*.tar.gz

Release strategy

From version 0.3.2, we use GitHub Actions to build wheels for different OS and Python environments with cibuildwheel, and to release automatically. Hopefully this will solve many installation issues. The build and publish pipeline is configured in .github/workflows/wheels.yml. When a new release is needed, please follow these steps:

  1. Create a test branch named test/x.x.x from the main branch.
  2. In the test/x.x.x branch, set the version number in setup.py to a release candidate such as x.x.x.rcx (e.g. 0.3.4.rc0), and update the changelog in CHANGES.md.
  3. Push the test/x.x.x branch. The build and publish pipeline will be triggered automatically, and the new release will be uploaded to PyPI test: https://test.pypi.org/project/sparse-dot-topn/.
  4. Do a sanity check on the PyPI test release.
  5. Update the changelog in CHANGES.md.
  6. Create a branch on top of the test branch.
  7. Modify the version number in setup.py by removing the rcx suffix.
  8. Push the branch. The build and publish pipeline will be triggered automatically, and the new release will be uploaded to PyPI: https://pypi.org/project/sparse-dot-topn
  9. Merge the release branch back to master.

More Repositories

1

lion

Fundamental white label web component features for your design system.
JavaScript
1,680
star
2

popmon

Monitor the stability of a Pandas or Spark dataframe ⚙︎
Python
478
star
3

baker

Orchestrate microservice-based process flows
Scala
317
star
4

threshold-signatures

Threshold Signature Scheme for ECDSA
Rust
198
star
5

zkrp

Reusable library for creating and verifying zero-knowledge range proofs and set membership proofs.
Go
170
star
6

flink-deployer

A tool that helps automate deployment to an Apache Flink cluster
Go
149
star
7

probatus

Validation (like Recursive Feature Elimination for SHAP) of (multiclass) classifiers & regressors and data used to develop them.
Python
121
star
8

scruid

Scala + Druid: Scruid. A library that allows you to compose queries in Scala, and parse the result back into typesafe classes.
Scala
115
star
9

skorecard

scikit-learn compatible tools for building credit risk acceptance models
Python
78
star
10

rokku

Rokku project. This project acts as a proxy on top of any S3 storage solution providing services like authentication, authorization, short-term tokens, and lineage.
Scala
65
star
11

cassandra-jdbc-wrapper

A JDBC wrapper of Java Driver for Apache Cassandra®, which offers a simple JDBC compliant API to work with CQL3.
Java
57
star
12

ing-open-banking-cli

Shell
42
star
13

bdd-mobile-security-automation-framework

Mobile Security testing Framework
Ruby
40
star
14

ing-open-banking-sdk

Mustache
34
star
15

doing-cli

CLI tool to simplify the development workflow on azure devops
Python
32
star
16

spark-matcher

Record matching and entity resolution at scale in Spark
Python
28
star
17

industry2vec

Jupyter Notebook
27
star
18

gohateoas

Plug-and-play HATEOAS for REST API's written in Go
Go
24
star
19

rokku-dev-apache-atlas

Apache Atlas development image for the Rokku project: https://github.com/ing-bank/rokku
Shell
20
star
20

vscode-psl

For distributing plugins to the community and foster further developments with the community going forward.
TypeScript
19
star
21

apache-ranger-s3-plugin

Apache Ranger Plugin for S3
Java
18
star
22

EntityMatchingModel

Entity Matching Model solves the problem of matching company names between two possibly very large datasets.
Python
16
star
23

skafos

Kubernetes operator framework in Python
Python
13
star
24

zkkrypto

Collection of ZKP-related cryptographic primitives
Kotlin
12
star
25

zkflow

The ZKFlow consensus protocol enables private transactions on Corda for arbitrary smart contracts using Zero Knowledge Proofs
Kotlin
11
star
26

quota-scaler

Kubernetes Autoscaling operator
Go
10
star
27

rokku-sts

STS service for the Rokku project: https://github.com/ing-bank/rokku
Scala
9
star
28

tsforecast

A pipeline to execute time series forecasts and visualize them in a dashboard. The employable forecast models fall into three categories: simple heuristics (mean of last 12 months, last 3 months etc), classical time series econometrics (ARIMA, Holt-Winters, Kalman filters etc.) and machine learning (Neural networks, Facebook’s Prophet etc.)
R
7
star
29

prometheus-scenarios

This repo contains a collection of learning scenarios. Each scenario is meant to teach a topic through explanation and practical exercises.
Go
6
star
30

orchestration-pkg

orchestration-pkg
Go
5
star
31

rokku-dev-apache-ranger

Apache Ranger development image for the Rokku project: https://github.com/ing-bank/rokku
Shell
5
star
32

ing-ideal-connectors-java

Opensource tools and API to connect webshops and merchants to ING using iDeal
Java
4
star
33

ing-ideal-connectors-php

Opensource tools and API to connect webshops and merchants to ING using iDeal
PHP
4
star
34

ing-ideal-connectors-net

Opensource tools and API to connect webshops and merchants to ING using iDeal
C#
4
star
35

mint

An automated exploratory testing tool for Android
Kotlin
4
star
36

gormtestutil

Utilities for writing unit-tests with Gorm
Go
4
star
37

tsclean

Takes a time series, possibly with multiple groupings, and starts up a dashboard to visualize potential anomalies. Anomalies are detected via several algorithms. If anomalies are erroneous, the user can correct them from within the dashboard
R
4
star
38

psl-linter

TypeScript
3
star
39

ginerr

Error registry for Gin, translate rough error messages to user-friendly objects and status codes
Go
3
star
40

psl-parser

TypeScript
2
star
41

tstools

A set of helper functions, geared mostly towards data with a time dimension
R
2
star
42

rokku-dev-keycloak

Keycloak development image for the Rokku project: https://github.com/ing-bank/rokku
Dockerfile
1
star
43

gintestutil

Utilities for writing unit-tests with Gin
Go
1
star
44

rokku-dev-mariadb

MariaDB development image for the Rokku project: https://github.com/ing-bank/rokku
Dockerfile
1
star