• Stars
    star
    145
  • Rank 254,144 (Top 6 %)
  • Language PLpgSQL
  • License
    BSD 3-Clause "New...
  • Created about 2 years ago
  • Updated 8 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Faceted query acceleration for PostgreSQL using roaring bitmaps

pgfaceting

PostgreSQL extension to quickly calculate facet counts using inverted index built with roaring bitmaps. Requires pg_roaringbitmap to be installed.

Faceting means counting number occurrences of each value in a result set for a set of attributes. Typical example of faceting is be a web shop where you can see how many items are remaining after filtering your search by red, green or blue, and how many when filtering by size small, medium or large.

Work on this project has been sponsored by Xenit.

Build and install

make install
make installcheck
psql -c "create extension roaringbitmap" -c "create extension pgfaceting"

Usage

pgfaceting creates and maintains two extra tables for your main table. tbl_facets contains for each facet and value combination a list of id values for rows containing that combination. The list is stored as a roaring bitmap for quick intersection and cardinality operations. Because updating this list is a heavy operation any changes to the main table get stored in tbl_facets_deltas as a combination of facet, value, id and +1 or -1 depending on the kind of update. A periodic maintenance job is responsible for merging deltas into the main facets table.

Currently only 32bit integer id columns are supported. When pg_roaringbitmap adds support for 64bit bitmaps then int8 and possibly ctid could be supported.

Adding faceting to a table

SELECT faceting.add_faceting_to_table(
    'documents',
    key => 'id',
    facets => array[
        faceting.datetrunc_facet('created', 'month'),
        faceting.datetrunc_facet('finished', 'month'),
        faceting.plain_facet('category_id'),
        faceting.plain_facet('type'),
        faceting.bucket_facet('size', buckets => array[0,1000,5000,10000,50000,100000,500000])
    ]
);

The add_faceting_to_table() adds the facets tables and populates the contents. It takes an array of facets to extract from each row.

  • plain_facet(col name) - Takes the column value as is as the facet value.
  • datetrunc_facet(col name, precision text) - Applies a date_trunc function on a column to get the facet value. Useful for timebucketing (yearly, monthly, etc.)
  • bucket_facet(col name, buckets anyarray) - Assigns a continuous variable (price, weight, etc.) to a set of buckets and stores the index of the chosen bucket as the facet value.

For merging changes create a periodic job that runs:

CALL faceting.run_maintenance();

This will run delta merging on all faceted tables. There is also a function for maintaining a single table:

SELECT faceting.merge_deltas('documents'::regclass);

Querying facets

Getting top 10 values for each kind of facet:

SELECT * FROM faceting.top_values('documents'::regclass, n => 10);

We can also filter by some facets and get the results of other facets:

SELECT * FROM faceting.count_results('documents'::regclass,
                                     filters => array[row('category_id', '24'),
                                                      row('type', 'image/jpeg')]::faceting.facet_filter[]);

For advanced usage the inverted index tables can be accessed directly.

WITH lookup AS (
    SELECT id >> 20 AS chunk_id, rb_build_agg(id) postinglist
    FROM documents
    WHERE ...
)
SELECT facet_id, facet_value, sum(rb_and_cardinality(flt.postinglist, fct.postinglist))
FROM lookup flt JOIN documents_facets USING (chunk_id)
GROUP BY 1, 2;

How fast is it

Calculating facets for 61% of rows in 100M row table:

-- 24 vcore parallel seq scan
postgres=# SELECT facet_name, count(distinct facet_value), sum(cardinality)
FROM (SELECT facet_name, facet_value, COUNT(*) cardinality
      FROM test2.documents d, LATERAL (VALUES
            ('created', date_trunc('month', created)::text),
            ('finished', date_trunc('month', finished)::text),
            ('type', type::text),
            ('size', width_bucket(size, array[0,1000,5000,10000,50000,100000,500000])::text)
          ) t(facet_name, facet_value)
      WHERE category_id = 24
      GROUP BY 1, 2) count_results
GROUP BY 1;
 facet_name | count |   sum    
------------+-------+----------
 created    |   154 | 60812252
 finished   |   154 | 60812252
 size       |     7 | 60812252
 type       |     8 | 60812252
(4 rows)

Time: 18440.061 ms (00:18.440)

-- Single core only
postgres=# SET max_parallel_workers_per_gather = 0;
SET
Time: 0.206 ms
postgres=# SELECT facet_name, count(distinct facet_value), sum(cardinality)
FROM (SELECT facet_name, facet_value, COUNT(*) cardinality
      FROM test2.documents d, LATERAL (VALUES
            ('created', date_trunc('month', created)::text),
            ('finished', date_trunc('month', finished)::text),
            ('type', type::text),
            ('size', width_bucket(size, array[0,1000,5000,10000,50000,100000,500000])::text)
          ) t(facet_name, facet_value)
      WHERE category_id = 24
      GROUP BY 1, 2) count_results
GROUP BY 1;
 facet_name | count |   sum    
------------+-------+----------
 created    |   154 | 60812252
 finished   |   154 | 60812252
 size       |     7 | 60812252
 type       |     8 | 60812252
(4 rows)

Time: 222019.758 ms (03:42.020)

-- Using facets index
postgres=# SELECT facet_name, count(distinct facet_value), sum(cardinality)
FROM faceting.count_results('documents'::regclass,
    filters => array[row('category_id', 24)]::faceting.facet_filter[])
GROUP BY 1;
 facet_name | count |   sum    
------------+-------+----------
 created    |   154 | 60812252
 finished   |   154 | 60812252
 size       |     7 | 60812252
 type       |     8 | 60812252
(4 rows)

 Time: 155.228 ms

More Repositories

1

pgwatch2

PostgreSQL metrics monitor/dashboard
PLpgSQL
1,792
star
2

pg_timetable

pg_timetable: Advanced scheduling for PostgreSQL
Go
1,072
star
3

pg_squeeze

A PostgreSQL extension for automatic bloat cleanup
C
462
star
4

vip-manager

Manages a virtual IP based on state kept in etcd or Consul
Go
195
star
5

pg_show_plans

Show query plans of all currently running SQL statements
C
186
star
6

postgres-showcase

Postgres features showcase (commented SQL samples) for beginners
164
star
7

pg_permissions

A simple set of views to see ALL permissions in a PostgreSQL database
PLpgSQL
143
star
8

ora_migrator

Tools for Oracle to PostgreSQL migration
PLpgSQL
109
star
9

zheap

⚑️ Development status and progress reporting.
HTML
93
star
10

rjsf-material-ui

[Archived] Material UI theme for react-jsonschema-form.
TypeScript
62
star
11

pgwatch

πŸ”¬PGWATCH: PostgreSQL metrics monitor/dashboard
Go
59
star
12

postgres

PostgreSQL with Transparent Data Encryption (TDE)
C
49
star
13

pg_cgroups

PostgreSQL extension to manage Linux Control Groups
C
36
star
14

pg_rewrite

Perform maintenance tasks which require a table to be rewritten (i.e. the table data to be copied to a new storage) and which are expected to limit the access to the table as little as possible
C
34
star
15

pg_crash

Periodically or randomly crash your database
C
26
star
16

safe-backup

Pre- and post-backup scripts for a safe PostgreSQL online file system backup
Shell
25
star
17

cybertec_migrator

CYBERTEC Migrator πŸ—„πŸ”€πŸ—„
Shell
22
star
18

patroni-packaging

Shell
20
star
19

db_migrator

PLpgSQL
18
star
20

react-database-diagram

A react component to render nice database diagram using storm-react-diagrams
TypeScript
17
star
21

patroni-windows-packaging

Automate installing and launching of Patroni under Windows
Batchfile
15
star
22

pg_timetable_gui

GUI for pg_timetable
Pascal
13
star
23

CYBERTEC-pg-operator

Go
12
star
24

pg_sequence_fixer

Fixing PostgreSQL sequences which got out of sync with the data
PLpgSQL
12
star
25

walbouncer

A proxy server for PostgreSQL replication connections with the capability to filter out user defined subsets of data
C
12
star
26

scripts

Various scripts around PostgreSQL management
Python
10
star
27

generic-plan

generate a generic plan for a parameterized SQL statement
PLpgSQL
10
star
28

pg_remote_exec

Run shell commands in SQL prompt
C
8
star
29

drop_role_helper

Generate SQL to revoke all privileges of a role in the current database
PLpgSQL
7
star
30

poc-plpgsql-analyzer

Proof of concept for tooling to migrate PL/SQL code to PL/pgSQL written in Rust
Rust
7
star
31

PES

Patroni Environment Setup
Pascal
6
star
32

walbouncer-companion

A tool to perform a selective Postgres basebackup
Python
6
star
33

essence

β˜• breaks during database dump extraction? Not on our watch!
Rust
4
star
34

merge_ips

PLpgSQL
3
star
35

today-i-learned

TypeScript
3
star
36

python_template

Template repository for Python command line apps.
Python
3
star
37

.github

Repository to share actions and workflows
3
star
38

CYBERTEC-operator-tutorials

3
star
39

yaim

yaim - yet another ip manager
Go
2
star
40

postgresql-action

:octocat: GitHub Action for PostgreSQL
2
star
41

layman

fuse-overlayfs within Docker for volume mounts.
Shell
2
star
42

debezium2postgres

Application to apply CDC log from Debezium to the destination PostgreSQL
Go
2
star
43

pgbackrest-tde

pgbackrest fork, patched for PostgreSQL-TDE
C
2
star
44

patroni-infoblox-integration

Patroni callbacks for integrating into Infoblox DNS services
Python
2
star
45

CYBERTEC-pg-container

Python
1
star
46

today-i-learned-content

This repository contains all the content for Today I learned @Cybertec-postgresql.
Shell
1
star
47

postgresql

A fork of https://salsa.debian.org/postgresql/postgresql/
1
star