Self-Maintaining Filter Index (SMFI)¶
Overview¶
HeliosDB-Lite implements a Self-Maintaining Filter Index (SMFI) architecture that provides automatic storage-level filtering with zero memory overhead. Unlike traditional database indexes that require manual creation and maintenance, SMFI structures are automatically updated during DML operations and intelligently created based on query patterns.
Key Benefits¶
| Benefit | Description |
|---|---|
| Zero Memory Overhead | All filter structures stored on disk, loaded on-demand |
| Self-Maintaining | Automatic updates during INSERT/UPDATE/DELETE |
| CPU-Aware | Background consolidation respects system load |
| Intelligent | Auto-creates filters based on query patterns |
| Parallel-Ready | Full parallel query support with rayon |
Architecture¶
DML OPERATIONS
│
INSERT ───────────┼──────────── UPDATE ──────────── DELETE
│
▼
┌───────────────────────┐
│ Filter Delta Tracker │ (~150ns overhead)
│ - Bloom filter bits │
│ - Zone map updates │
│ - Delta log append │
└───────────────────────┘
│
┌─────────────┴─────────────┐
▼ ▼
┌─────────────────┐ ┌─────────────────────────┐
│ Speculative │ │ Background Consolidation│
│ Filter Manager │ │ Worker (CPU-aware) │
│ - Pattern track │ │ - Merge deltas │
│ - Auto-create │ │ - Rebuild when idle │
└─────────────────┘ └─────────────────────────┘
│
▼
┌───────────────────────┐
│ QUERY EXECUTION │
│ Phase 1: Bloom check │
│ Phase 2: Zone prune │
│ Phase 3: SIMD filter │
└───────────────────────┘
Components¶
1. Bloom Filters¶
Probabilistic data structures for fast membership testing. Ideal for equality predicates.
How It Works: - Each value is hashed to multiple bit positions - Query: Check all bit positions (if any 0, definitely not present) - False positive rate configurable (default: 1%)
When Used:
-- Bloom filter effective for equality checks
SELECT * FROM orders WHERE customer_id = 12345;
SELECT * FROM users WHERE email = 'user@example.com';
Performance: - Check time: ~10ns per value - Memory: ~10 bits per distinct value - False positive rate: ~1% (configurable)
2. Zone Maps¶
Block-level min/max statistics for range predicates. Enables block pruning.
How It Works: - Each block tracks min/max values per column - Query: Skip blocks where predicate can't match - Zero false positives (exact pruning)
When Used:
-- Zone map effective for range queries
SELECT * FROM events WHERE timestamp > '2025-01-01';
SELECT * FROM products WHERE price BETWEEN 100 AND 500;
Performance: - Check time: ~1ns per block - Storage: ~100 bytes per block per column - Pruning: 50-99% of blocks skipped (data-dependent)
3. Columnar Zone Summaries (CZS)¶
Enhanced per-block column statistics with cardinality estimation.
Statistics Tracked: - Min/Max values - Null count - Approximate distinct count (HyperLogLog) - Most Common Values (MCV) - Histogram buckets
When Used:
-- Optimizer uses CZS for cost estimation
EXPLAIN SELECT * FROM orders WHERE status = 'pending';
-- Shows selectivity estimate from MCV
-- HyperLogLog for distinct estimation
SELECT COUNT(DISTINCT customer_id) FROM orders;
-- Uses pre-computed approximate distinct count
4. Speculative Filter Indexes¶
Automatically created filters based on query patterns.
How It Works: 1. Query pattern tracker records predicate usage 2. High-frequency low-selectivity patterns detected 3. Bloom filter automatically created for column 4. Unused filters dropped after configurable period
Example Pattern Detection:
Query: SELECT * FROM orders WHERE customer_id = ?
Executions: 847 in last hour
Selectivity: 0.01% (very selective)
→ Auto-create bloom filter for orders.customer_id
5. Parallel Filter Evaluation¶
Three-phase parallel filtering using rayon.
Execution Phases:
Phase 1: Parallel Bloom Filter Checks
└─ Eliminate non-matching rows early
Phase 2: Parallel Zone Map Block Pruning
└─ Skip entire blocks that can't match
Phase 3: Parallel SIMD Predicate Evaluation
└─ High-throughput vectorized filtering
Parallelism: - Uses all available CPU cores - Adaptive chunk sizing based on CPU load - Work-stealing for load balancing
Performance Impact¶
Query Performance¶
| Scenario | Without SMFI | With SMFI | Improvement |
|---|---|---|---|
| Point lookup (1M rows) | 100ms | 5ms | 20x |
| Range scan (10M rows, 1% match) | 500ms | 50ms | 10x |
| Multi-predicate (100M rows) | 5s | 200ms | 25x |
| Parallel 8-core (100M rows) | 5s | 30ms | 166x |
Write Overhead¶
| Operation | SMFI Overhead | Total Impact |
|---|---|---|
| INSERT | ~150ns | < 0.02% |
| UPDATE | ~200ns | < 0.02% |
| DELETE | ~100ns | < 0.02% |
Storage Overhead¶
| Table Size | Bloom Filters | Zone Maps | Total |
|---|---|---|---|
| 1 GB | 10 MB (1%) | 5 MB (0.5%) | ~1.5% |
| 10 GB | 100 MB | 50 MB | ~1.5% |
| 100 GB | 1 GB | 500 MB | ~1.5% |
Implicit Benefits¶
SMFI provides several implicit benefits that require no configuration:
1. Automatic Query Acceleration¶
All SELECT queries with WHERE clauses automatically benefit:
-- These queries automatically use SMFI filtering
SELECT * FROM orders WHERE customer_id = 123;
SELECT * FROM events WHERE timestamp > '2025-01-01';
SELECT * FROM products WHERE category = 'electronics' AND price < 1000;
2. Reduced I/O¶
Block pruning via zone maps reduces disk reads:
-- Query on ordered timestamp column
SELECT * FROM logs WHERE timestamp > NOW() - INTERVAL '1 hour';
-- With zone maps: Only reads recent blocks
-- Without zone maps: Full table scan
3. Memory Efficiency¶
All filter structures are disk-resident:
- 100 concurrent connections use same filter structures
- No per-connection memory allocation for filters
- Filters loaded on-demand, evicted after use
4. Automatic Index Suggestions¶
Speculative Filter Manager tracks queries and suggests optimizations:
-- View automatically detected patterns
SELECT * FROM pg_speculative_filters();
-- Shows: table, column, pattern_type, frequency, selectivity
5. Parallel Query Execution¶
Multi-core utilization is automatic:
-- Large scan automatically parallelized
SELECT COUNT(*) FROM orders WHERE amount > 1000;
-- Uses all available CPU cores for filtering
Configuration¶
Filter Consolidation Settings¶
-- Maximum CPU for background consolidation (default: 15%)
SET smfi_max_cpu_percent = 15;
-- Delta threshold before consolidation (default: 1000)
SET smfi_delta_threshold = 1000;
-- Time threshold in seconds (default: 300)
SET smfi_time_threshold = 300;
Speculative Filter Settings¶
-- Minimum query frequency for auto-filter (default: 10)
SET smfi_min_query_frequency = 10;
-- Days before dropping unused filter (default: 7)
SET smfi_drop_after_days = 7;
-- Maximum speculative filters per table (default: 10)
SET smfi_max_filters_per_table = 10;
Parallel Filtering Settings¶
-- Enable/disable parallel filtering (default: on)
SET smfi_parallel_enabled = on;
-- Minimum rows for parallel execution (default: 10000)
SET smfi_parallel_threshold = 10000;
-- Maximum parallel workers (default: CPU cores)
SET smfi_max_workers = 8;
Bloom Filter Settings¶
-- False positive rate (default: 0.01 = 1%)
SET smfi_bloom_fpr = 0.01;
-- Number of hash functions (default: 7)
SET smfi_bloom_hashes = 7;
System Views¶
pg_smfi_status()¶
Overview of SMFI system status:
| Column | Description |
|---|---|
| tables_tracked | Number of tables with SMFI |
| bloom_filters | Total bloom filters |
| zone_maps | Total zone maps |
| pending_deltas | Deltas awaiting consolidation |
| last_consolidation | Last consolidation timestamp |
pg_smfi_table_stats()¶
Per-table SMFI statistics:
| Column | Description |
|---|---|
| table_name | Table name |
| bloom_filter_count | Number of bloom filters |
| zone_map_blocks | Number of zone map blocks |
| speculative_filters | Auto-created filters |
| bloom_checks | Total bloom filter checks |
| bloom_hits | Bloom filter positive matches |
| blocks_pruned | Blocks skipped via zone maps |
pg_speculative_filters()¶
Currently active speculative filters:
| Column | Description |
|---|---|
| table_name | Table name |
| column_name | Column name |
| filter_type | Filter type (bloom, zone) |
| pattern_type | Query pattern (equality, range) |
| query_frequency | Times pattern executed |
| selectivity | Estimated selectivity |
| created_at | Filter creation time |
| last_used | Last usage time |
Best Practices¶
1. Let SMFI Work Automatically¶
SMFI is designed to be zero-configuration. Avoid over-tuning:
-- Don't do this unless you have specific needs
SET smfi_bloom_fpr = 0.001; -- Very low FPR wastes space
-- Let defaults work for most cases
RESET smfi_bloom_fpr;
2. Monitor Speculative Filters¶
Check which filters are being auto-created:
-- See what SMFI has learned about your workload
SELECT table_name, column_name, query_frequency, selectivity
FROM pg_speculative_filters()
ORDER BY query_frequency DESC;
3. Use EXPLAIN to Verify¶
Check if SMFI is being used:
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
-- Look for: "Bloom Filter Check" or "Zone Map Prune"
4. Bulk Load Optimization (Automatic)¶
SMFI automatically detects and optimizes bulk load operations:
-- Bulk INSERT (100+ rows) - SMFI auto-suspends
INSERT INTO orders VALUES
(1, 'A', 100),
(2, 'B', 200),
-- ... 100+ rows
(1000, 'Z', 999);
-- SMFI auto-resumes and schedules rebuild
-- INSERT ... SELECT - auto-suspends (when implemented)
INSERT INTO archive SELECT * FROM orders WHERE date < '2024-01-01';
-- COPY FROM - auto-suspends (when implemented)
COPY orders FROM '/data/orders.csv';
How It Works: 1. Detects operations with ≥100 rows 2. Suspends per-row tracking for affected table 3. Counts rows modified during suspension 4. Resumes tracking when operation completes 5. Schedules filter rebuild automatically
Manual Override (Optional):
For very large imports, you can still manually control:
-- Temporarily disable for bulk insert
SET smfi_tracking_enabled = off;
-- Bulk load data
COPY orders FROM '/data/orders.csv';
-- Re-enable and rebuild
SET smfi_tracking_enabled = on;
CALL smfi_rebuild_table('orders');
Comparison with Traditional Indexes¶
| Feature | B-Tree Index | SMFI Bloom Filter | SMFI Zone Map |
|---|---|---|---|
| Creation | Manual CREATE INDEX | Automatic | Automatic |
| Maintenance | Manual REINDEX | Self-maintaining | Self-maintaining |
| Memory | In-memory pages | Disk-only | Disk-only |
| False Positives | None | ~1% | None |
| Best For | Range + equality | Equality | Range |
| Write Overhead | 10-30% | < 0.02% | < 0.02% |
Troubleshooting¶
SMFI Not Being Used¶
-- Check if SMFI is enabled
SHOW smfi_enabled;
-- Verify table has been tracked
SELECT * FROM pg_smfi_table_stats() WHERE table_name = 'orders';
-- Force SMFI rebuild
CALL smfi_rebuild_table('orders');
High Consolidation CPU Usage¶
-- Lower max CPU for consolidation
SET smfi_max_cpu_percent = 10;
-- Increase delta threshold (less frequent consolidation)
SET smfi_delta_threshold = 5000;
Speculative Filters Not Created¶
-- Check query frequency threshold
SHOW smfi_min_query_frequency;
-- Lower threshold if needed
SET smfi_min_query_frequency = 5;
See Also¶
- SQL Settings Reference - All SMFI settings
- Query Optimization - General optimization techniques
- EXPLAIN Guide - Query plan analysis
- Scalability Report - Detailed benchmarks