Skip to content

Self-Maintaining Filter Index (SMFI)

Overview

HeliosDB-Lite implements a Self-Maintaining Filter Index (SMFI) architecture that provides automatic storage-level filtering with zero memory overhead. Unlike traditional database indexes that require manual creation and maintenance, SMFI structures are automatically updated during DML operations and intelligently created based on query patterns.

Key Benefits

Benefit Description
Zero Memory Overhead All filter structures stored on disk, loaded on-demand
Self-Maintaining Automatic updates during INSERT/UPDATE/DELETE
CPU-Aware Background consolidation respects system load
Intelligent Auto-creates filters based on query patterns
Parallel-Ready Full parallel query support with rayon

Architecture

                    DML OPERATIONS
        INSERT ───────────┼──────────── UPDATE ──────────── DELETE
              ┌───────────────────────┐
              │  Filter Delta Tracker │  (~150ns overhead)
              │  - Bloom filter bits  │
              │  - Zone map updates   │
              │  - Delta log append   │
              └───────────────────────┘
            ┌─────────────┴─────────────┐
            ▼                           ▼
   ┌─────────────────┐        ┌─────────────────────────┐
   │ Speculative     │        │ Background Consolidation│
   │ Filter Manager  │        │ Worker (CPU-aware)      │
   │ - Pattern track │        │ - Merge deltas          │
   │ - Auto-create   │        │ - Rebuild when idle     │
   └─────────────────┘        └─────────────────────────┘
              ┌───────────────────────┐
              │  QUERY EXECUTION      │
              │  Phase 1: Bloom check │
              │  Phase 2: Zone prune  │
              │  Phase 3: SIMD filter │
              └───────────────────────┘

Components

1. Bloom Filters

Probabilistic data structures for fast membership testing. Ideal for equality predicates.

How It Works: - Each value is hashed to multiple bit positions - Query: Check all bit positions (if any 0, definitely not present) - False positive rate configurable (default: 1%)

When Used:

-- Bloom filter effective for equality checks
SELECT * FROM orders WHERE customer_id = 12345;
SELECT * FROM users WHERE email = 'user@example.com';

Performance: - Check time: ~10ns per value - Memory: ~10 bits per distinct value - False positive rate: ~1% (configurable)

2. Zone Maps

Block-level min/max statistics for range predicates. Enables block pruning.

How It Works: - Each block tracks min/max values per column - Query: Skip blocks where predicate can't match - Zero false positives (exact pruning)

When Used:

-- Zone map effective for range queries
SELECT * FROM events WHERE timestamp > '2025-01-01';
SELECT * FROM products WHERE price BETWEEN 100 AND 500;

Performance: - Check time: ~1ns per block - Storage: ~100 bytes per block per column - Pruning: 50-99% of blocks skipped (data-dependent)

3. Columnar Zone Summaries (CZS)

Enhanced per-block column statistics with cardinality estimation.

Statistics Tracked: - Min/Max values - Null count - Approximate distinct count (HyperLogLog) - Most Common Values (MCV) - Histogram buckets

When Used:

-- Optimizer uses CZS for cost estimation
EXPLAIN SELECT * FROM orders WHERE status = 'pending';
-- Shows selectivity estimate from MCV

-- HyperLogLog for distinct estimation
SELECT COUNT(DISTINCT customer_id) FROM orders;
-- Uses pre-computed approximate distinct count

4. Speculative Filter Indexes

Automatically created filters based on query patterns.

How It Works: 1. Query pattern tracker records predicate usage 2. High-frequency low-selectivity patterns detected 3. Bloom filter automatically created for column 4. Unused filters dropped after configurable period

Example Pattern Detection:

Query: SELECT * FROM orders WHERE customer_id = ?
Executions: 847 in last hour
Selectivity: 0.01% (very selective)
→ Auto-create bloom filter for orders.customer_id

5. Parallel Filter Evaluation

Three-phase parallel filtering using rayon.

Execution Phases:

Phase 1: Parallel Bloom Filter Checks
         └─ Eliminate non-matching rows early

Phase 2: Parallel Zone Map Block Pruning
         └─ Skip entire blocks that can't match

Phase 3: Parallel SIMD Predicate Evaluation
         └─ High-throughput vectorized filtering

Parallelism: - Uses all available CPU cores - Adaptive chunk sizing based on CPU load - Work-stealing for load balancing

Performance Impact

Query Performance

Scenario Without SMFI With SMFI Improvement
Point lookup (1M rows) 100ms 5ms 20x
Range scan (10M rows, 1% match) 500ms 50ms 10x
Multi-predicate (100M rows) 5s 200ms 25x
Parallel 8-core (100M rows) 5s 30ms 166x

Write Overhead

Operation SMFI Overhead Total Impact
INSERT ~150ns < 0.02%
UPDATE ~200ns < 0.02%
DELETE ~100ns < 0.02%

Storage Overhead

Table Size Bloom Filters Zone Maps Total
1 GB 10 MB (1%) 5 MB (0.5%) ~1.5%
10 GB 100 MB 50 MB ~1.5%
100 GB 1 GB 500 MB ~1.5%

Implicit Benefits

SMFI provides several implicit benefits that require no configuration:

1. Automatic Query Acceleration

All SELECT queries with WHERE clauses automatically benefit:

-- These queries automatically use SMFI filtering
SELECT * FROM orders WHERE customer_id = 123;
SELECT * FROM events WHERE timestamp > '2025-01-01';
SELECT * FROM products WHERE category = 'electronics' AND price < 1000;

2. Reduced I/O

Block pruning via zone maps reduces disk reads:

-- Query on ordered timestamp column
SELECT * FROM logs WHERE timestamp > NOW() - INTERVAL '1 hour';
-- With zone maps: Only reads recent blocks
-- Without zone maps: Full table scan

3. Memory Efficiency

All filter structures are disk-resident:

  • 100 concurrent connections use same filter structures
  • No per-connection memory allocation for filters
  • Filters loaded on-demand, evicted after use

4. Automatic Index Suggestions

Speculative Filter Manager tracks queries and suggests optimizations:

-- View automatically detected patterns
SELECT * FROM pg_speculative_filters();

-- Shows: table, column, pattern_type, frequency, selectivity

5. Parallel Query Execution

Multi-core utilization is automatic:

-- Large scan automatically parallelized
SELECT COUNT(*) FROM orders WHERE amount > 1000;
-- Uses all available CPU cores for filtering

Configuration

Filter Consolidation Settings

-- Maximum CPU for background consolidation (default: 15%)
SET smfi_max_cpu_percent = 15;

-- Delta threshold before consolidation (default: 1000)
SET smfi_delta_threshold = 1000;

-- Time threshold in seconds (default: 300)
SET smfi_time_threshold = 300;

Speculative Filter Settings

-- Minimum query frequency for auto-filter (default: 10)
SET smfi_min_query_frequency = 10;

-- Days before dropping unused filter (default: 7)
SET smfi_drop_after_days = 7;

-- Maximum speculative filters per table (default: 10)
SET smfi_max_filters_per_table = 10;

Parallel Filtering Settings

-- Enable/disable parallel filtering (default: on)
SET smfi_parallel_enabled = on;

-- Minimum rows for parallel execution (default: 10000)
SET smfi_parallel_threshold = 10000;

-- Maximum parallel workers (default: CPU cores)
SET smfi_max_workers = 8;

Bloom Filter Settings

-- False positive rate (default: 0.01 = 1%)
SET smfi_bloom_fpr = 0.01;

-- Number of hash functions (default: 7)
SET smfi_bloom_hashes = 7;

System Views

pg_smfi_status()

Overview of SMFI system status:

SELECT * FROM pg_smfi_status();
Column Description
tables_tracked Number of tables with SMFI
bloom_filters Total bloom filters
zone_maps Total zone maps
pending_deltas Deltas awaiting consolidation
last_consolidation Last consolidation timestamp

pg_smfi_table_stats()

Per-table SMFI statistics:

SELECT * FROM pg_smfi_table_stats();
Column Description
table_name Table name
bloom_filter_count Number of bloom filters
zone_map_blocks Number of zone map blocks
speculative_filters Auto-created filters
bloom_checks Total bloom filter checks
bloom_hits Bloom filter positive matches
blocks_pruned Blocks skipped via zone maps

pg_speculative_filters()

Currently active speculative filters:

SELECT * FROM pg_speculative_filters();
Column Description
table_name Table name
column_name Column name
filter_type Filter type (bloom, zone)
pattern_type Query pattern (equality, range)
query_frequency Times pattern executed
selectivity Estimated selectivity
created_at Filter creation time
last_used Last usage time

Best Practices

1. Let SMFI Work Automatically

SMFI is designed to be zero-configuration. Avoid over-tuning:

-- Don't do this unless you have specific needs
SET smfi_bloom_fpr = 0.001;  -- Very low FPR wastes space

-- Let defaults work for most cases
RESET smfi_bloom_fpr;

2. Monitor Speculative Filters

Check which filters are being auto-created:

-- See what SMFI has learned about your workload
SELECT table_name, column_name, query_frequency, selectivity
FROM pg_speculative_filters()
ORDER BY query_frequency DESC;

3. Use EXPLAIN to Verify

Check if SMFI is being used:

EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
-- Look for: "Bloom Filter Check" or "Zone Map Prune"

4. Bulk Load Optimization (Automatic)

SMFI automatically detects and optimizes bulk load operations:

-- Bulk INSERT (100+ rows) - SMFI auto-suspends
INSERT INTO orders VALUES
  (1, 'A', 100),
  (2, 'B', 200),
  -- ... 100+ rows
  (1000, 'Z', 999);
-- SMFI auto-resumes and schedules rebuild

-- INSERT ... SELECT - auto-suspends (when implemented)
INSERT INTO archive SELECT * FROM orders WHERE date < '2024-01-01';

-- COPY FROM - auto-suspends (when implemented)
COPY orders FROM '/data/orders.csv';

How It Works: 1. Detects operations with ≥100 rows 2. Suspends per-row tracking for affected table 3. Counts rows modified during suspension 4. Resumes tracking when operation completes 5. Schedules filter rebuild automatically

Manual Override (Optional):

For very large imports, you can still manually control:

-- Temporarily disable for bulk insert
SET smfi_tracking_enabled = off;

-- Bulk load data
COPY orders FROM '/data/orders.csv';

-- Re-enable and rebuild
SET smfi_tracking_enabled = on;
CALL smfi_rebuild_table('orders');

Comparison with Traditional Indexes

Feature B-Tree Index SMFI Bloom Filter SMFI Zone Map
Creation Manual CREATE INDEX Automatic Automatic
Maintenance Manual REINDEX Self-maintaining Self-maintaining
Memory In-memory pages Disk-only Disk-only
False Positives None ~1% None
Best For Range + equality Equality Range
Write Overhead 10-30% < 0.02% < 0.02%

Troubleshooting

SMFI Not Being Used

-- Check if SMFI is enabled
SHOW smfi_enabled;

-- Verify table has been tracked
SELECT * FROM pg_smfi_table_stats() WHERE table_name = 'orders';

-- Force SMFI rebuild
CALL smfi_rebuild_table('orders');

High Consolidation CPU Usage

-- Lower max CPU for consolidation
SET smfi_max_cpu_percent = 10;

-- Increase delta threshold (less frequent consolidation)
SET smfi_delta_threshold = 5000;

Speculative Filters Not Created

-- Check query frequency threshold
SHOW smfi_min_query_frequency;

-- Lower threshold if needed
SET smfi_min_query_frequency = 5;

See Also