🔥 HyperLogLog for Rails and Postgres
For fast, approximate count-distinct queries
First, install the hll extension on your database server:
cd /tmp
curl -L https://github.com/citusdata/postgresql-hll/archive/refs/tags/v2.18.tar.gz | tar xz
cd postgresql-hll-2.18
make
make install # may need sudo
Then add this line to your application’s Gemfile:
gem "active_hll"
And run:
bundle install
rails generate active_hll:install
rails db:migrate
HLLs provide an approximate count of unique values (like unique visitors). By rolling up data by day, you can quickly get an approximate count over any date range.
Create a table with an hll
column
class CreateEventRollups < ActiveRecord::Migration[7.1]
def change
create_table :event_rollups do |t|
t.date :time_bucket, index: {unique: true}
t.hll :visitor_ids
end
end
end
You can use batch and stream approaches to build HLLs
To generate HLLs from existing data, use the hll_agg
method
hlls = Event.group_by_day(:created_at).hll_agg(:visitor_id)
Install Groupdate to use the
group_by_day
method
And store the result
EventRollup.upsert_all(
hlls.map { |k, v| {time_bucket: k, visitor_ids: v} },
unique_by: [:time_bucket]
)
For a large number of HLLs, use SQL to generate and upsert in a single statement
To add new data to HLLs, use the hll_add
method
EventRollup.where(time_bucket: Date.current).hll_add(visitor_ids: ["visitor1", "visitor2"])
or the hll_upsert
method (experimental)
EventRollup.hll_upsert({time_bucket: Date.current, visitor_ids: ["visitor1", "visitor2"]})
Get approximate unique values for a time range
EventRollup.where(time_bucket: 30.days.ago.to_date..Date.current).hll_count(:visitor_ids)
Get approximate unique values by time bucket
EventRollup.group(:time_bucket).hll_count(:visitor_ids)
Get approximate unique values by month
EventRollup.group_by_month(:time_bucket, time_zone: false).hll_count(:visitor_ids)
Get the union of multiple HLLs
EventRollup.hll_union(:visitor_ids)
Cardinality estimators like HyperLogLog do not preserve privacy, so protect hll
columns the same as you would the raw data.
For instance, you can check membership with a good probability with:
SELECT
time_bucket,
visitor_ids = visitor_ids || hll_hash_text('visitor1') AS likely_member
FROM
event_rollups;
Data should only be retained for as long as it’s needed. Delete older data with:
EventRollup.where("time_bucket < ?", 2.years.ago).delete_all
There’s not a way to remove data from an HLL, so to delete data for a specific user, delete the underlying data and recalculate the rollup.
The hll
extension is available on a number of hosted providers.
View the changelog
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
git clone https://github.com/ankane/active_hll.git
cd active_hll
bundle install
bundle exec rake test