Self-Maintaining Filter Index (SMFI)¶

Overview¶

HeliosDB-Lite implements a Self-Maintaining Filter Index (SMFI) architecture that provides automatic storage-level filtering with zero memory overhead. Unlike traditional database indexes that require manual creation and maintenance, SMFI structures are automatically updated during DML operations and intelligently created based on query patterns.

Key Benefits¶

Benefit	Description
Zero Memory Overhead	All filter structures stored on disk, loaded on-demand
Self-Maintaining	Automatic updates during INSERT/UPDATE/DELETE
CPU-Aware	Background consolidation respects system load
Intelligent	Auto-creates filters based on query patterns
Parallel-Ready	Full parallel query support with rayon

Architecture¶

                    DML OPERATIONS
                          │
        INSERT ───────────┼──────────── UPDATE ──────────── DELETE
                          │
                          ▼
              ┌───────────────────────┐
              │  Filter Delta Tracker │  (~150ns overhead)
              │  - Bloom filter bits  │
              │  - Zone map updates   │
              │  - Delta log append   │
              └───────────────────────┘
                          │
            ┌─────────────┴─────────────┐
            ▼                           ▼
   ┌─────────────────┐        ┌─────────────────────────┐
   │ Speculative     │        │ Background Consolidation│
   │ Filter Manager  │        │ Worker (CPU-aware)      │
   │ - Pattern track │        │ - Merge deltas          │
   │ - Auto-create   │        │ - Rebuild when idle     │
   └─────────────────┘        └─────────────────────────┘
                          │
                          ▼
              ┌───────────────────────┐
              │  QUERY EXECUTION      │
              │  Phase 1: Bloom check │
              │  Phase 2: Zone prune  │
              │  Phase 3: SIMD filter │
              └───────────────────────┘

Components¶

1. Bloom Filters¶

Probabilistic data structures for fast membership testing. Ideal for equality predicates.

How It Works: - Each value is hashed to multiple bit positions - Query: Check all bit positions (if any 0, definitely not present) - False positive rate configurable (default: 1%)

When Used:

-- Bloom filter effective for equality checks
SELECT * FROM orders WHERE customer_id = 12345;
SELECT * FROM users WHERE email = 'user@example.com';

Performance: - Check time: ~10ns per value - Memory: ~10 bits per distinct value - False positive rate: ~1% (configurable)

2. Zone Maps¶

Block-level min/max statistics for range predicates. Enables block pruning.

How It Works: - Each block tracks min/max values per column - Query: Skip blocks where predicate can't match - Zero false positives (exact pruning)

When Used:

-- Zone map effective for range queries
SELECT * FROM events WHERE timestamp > '2025-01-01';
SELECT * FROM products WHERE price BETWEEN 100 AND 500;

Performance: - Check time: ~1ns per block - Storage: ~100 bytes per block per column - Pruning: 50-99% of blocks skipped (data-dependent)

3. Columnar Zone Summaries (CZS)¶

Enhanced per-block column statistics with cardinality estimation.

Statistics Tracked: - Min/Max values - Null count - Approximate distinct count (HyperLogLog) - Most Common Values (MCV) - Histogram buckets

When Used:

-- Optimizer uses CZS for cost estimation
EXPLAIN SELECT * FROM orders WHERE status = 'pending';
-- Shows selectivity estimate from MCV

-- HyperLogLog for distinct estimation
SELECT COUNT(DISTINCT customer_id) FROM orders;
-- Uses pre-computed approximate distinct count

4. Speculative Filter Indexes¶

Automatically created filters based on query patterns.

How It Works: 1. Query pattern tracker records predicate usage 2. High-frequency low-selectivity patterns detected 3. Bloom filter automatically created for column 4. Unused filters dropped after configurable period

Example Pattern Detection:

Query: SELECT * FROM orders WHERE customer_id = ?
Executions: 847 in last hour
Selectivity: 0.01% (very selective)
→ Auto-create bloom filter for orders.customer_id

5. Parallel Filter Evaluation¶

Three-phase parallel filtering using rayon.

Execution Phases:

Phase 1: Parallel Bloom Filter Checks
         └─ Eliminate non-matching rows early

Phase 2: Parallel Zone Map Block Pruning
         └─ Skip entire blocks that can't match

Phase 3: Parallel SIMD Predicate Evaluation
         └─ High-throughput vectorized filtering

Parallelism: - Uses all available CPU cores - Adaptive chunk sizing based on CPU load - Work-stealing for load balancing

Performance Impact¶

Query Performance¶

Scenario	Without SMFI	With SMFI	Improvement
Point lookup (1M rows)	100ms	5ms	20x
Range scan (10M rows, 1% match)	500ms	50ms	10x
Multi-predicate (100M rows)	5s	200ms	25x
Parallel 8-core (100M rows)	5s	30ms	166x

Write Overhead¶

Operation	SMFI Overhead	Total Impact
INSERT	~150ns	< 0.02%
UPDATE	~200ns	< 0.02%
DELETE	~100ns	< 0.02%

Storage Overhead¶

Table Size	Bloom Filters	Zone Maps	Total
1 GB	10 MB (1%)	5 MB (0.5%)	~1.5%
10 GB	100 MB	50 MB	~1.5%
100 GB	1 GB	500 MB	~1.5%

Implicit Benefits¶

SMFI provides several implicit benefits that require no configuration:

1. Automatic Query Acceleration¶

All SELECT queries with WHERE clauses automatically benefit:

-- These queries automatically use SMFI filtering
SELECT * FROM orders WHERE customer_id = 123;
SELECT * FROM events WHERE timestamp > '2025-01-01';
SELECT * FROM products WHERE category = 'electronics' AND price < 1000;

2. Reduced I/O¶

Block pruning via zone maps reduces disk reads:

-- Query on ordered timestamp column
SELECT * FROM logs WHERE timestamp > NOW() - INTERVAL '1 hour';
-- With zone maps: Only reads recent blocks
-- Without zone maps: Full table scan

3. Memory Efficiency¶

All filter structures are disk-resident:

100 concurrent connections use same filter structures
No per-connection memory allocation for filters
Filters loaded on-demand, evicted after use

4. Automatic Index Suggestions¶

Speculative Filter Manager tracks queries and suggests optimizations:

-- View automatically detected patterns
SELECT * FROM pg_speculative_filters();

-- Shows: table, column, pattern_type, frequency, selectivity

5. Parallel Query Execution¶

Multi-core utilization is automatic:

-- Large scan automatically parallelized
SELECT COUNT(*) FROM orders WHERE amount > 1000;
-- Uses all available CPU cores for filtering

Configuration¶

Filter Consolidation Settings¶

-- Maximum CPU for background consolidation (default: 15%)
SET smfi_max_cpu_percent = 15;

-- Delta threshold before consolidation (default: 1000)
SET smfi_delta_threshold = 1000;

-- Time threshold in seconds (default: 300)
SET smfi_time_threshold = 300;

Speculative Filter Settings¶

-- Minimum query frequency for auto-filter (default: 10)
SET smfi_min_query_frequency = 10;

-- Days before dropping unused filter (default: 7)
SET smfi_drop_after_days = 7;

-- Maximum speculative filters per table (default: 10)
SET smfi_max_filters_per_table = 10;

Parallel Filtering Settings¶

-- Enable/disable parallel filtering (default: on)
SET smfi_parallel_enabled = on;

-- Minimum rows for parallel execution (default: 10000)
SET smfi_parallel_threshold = 10000;

-- Maximum parallel workers (default: CPU cores)
SET smfi_max_workers = 8;

Bloom Filter Settings¶

-- False positive rate (default: 0.01 = 1%)
SET smfi_bloom_fpr = 0.01;

-- Number of hash functions (default: 7)
SET smfi_bloom_hashes = 7;

System Views¶

pg_smfi_status()¶

Overview of SMFI system status:

SELECT * FROM pg_smfi_status();

Column	Description
tables_tracked	Number of tables with SMFI
bloom_filters	Total bloom filters
zone_maps	Total zone maps
pending_deltas	Deltas awaiting consolidation
last_consolidation	Last consolidation timestamp

pg_smfi_table_stats()¶

Per-table SMFI statistics:

SELECT * FROM pg_smfi_table_stats();

Column	Description
table_name	Table name
bloom_filter_count	Number of bloom filters
zone_map_blocks	Number of zone map blocks
speculative_filters	Auto-created filters
bloom_checks	Total bloom filter checks
bloom_hits	Bloom filter positive matches
blocks_pruned	Blocks skipped via zone maps

pg_speculative_filters()¶

Currently active speculative filters:

SELECT * FROM pg_speculative_filters();

Column	Description
table_name	Table name
column_name	Column name
filter_type	Filter type (bloom, zone)
pattern_type	Query pattern (equality, range)
query_frequency	Times pattern executed
selectivity	Estimated selectivity
created_at	Filter creation time
last_used	Last usage time

Best Practices¶

1. Let SMFI Work Automatically¶

SMFI is designed to be zero-configuration. Avoid over-tuning:

-- Don't do this unless you have specific needs
SET smfi_bloom_fpr = 0.001;  -- Very low FPR wastes space

-- Let defaults work for most cases
RESET smfi_bloom_fpr;

2. Monitor Speculative Filters¶

Check which filters are being auto-created:

-- See what SMFI has learned about your workload
SELECT table_name, column_name, query_frequency, selectivity
FROM pg_speculative_filters()
ORDER BY query_frequency DESC;

3. Use EXPLAIN to Verify¶

Check if SMFI is being used:

EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
-- Look for: "Bloom Filter Check" or "Zone Map Prune"

4. Bulk Load Optimization (Automatic)¶

SMFI automatically detects and optimizes bulk load operations:

-- Bulk INSERT (100+ rows) - SMFI auto-suspends
INSERT INTO orders VALUES
  (1, 'A', 100),
  (2, 'B', 200),
  -- ... 100+ rows
  (1000, 'Z', 999);
-- SMFI auto-resumes and schedules rebuild

-- INSERT ... SELECT - auto-suspends (when implemented)
INSERT INTO archive SELECT * FROM orders WHERE date < '2024-01-01';

-- COPY FROM - auto-suspends (when implemented)
COPY orders FROM '/data/orders.csv';

How It Works: 1. Detects operations with ≥100 rows 2. Suspends per-row tracking for affected table 3. Counts rows modified during suspension 4. Resumes tracking when operation completes 5. Schedules filter rebuild automatically

Manual Override (Optional):

For very large imports, you can still manually control:

-- Temporarily disable for bulk insert
SET smfi_tracking_enabled = off;

-- Bulk load data
COPY orders FROM '/data/orders.csv';

-- Re-enable and rebuild
SET smfi_tracking_enabled = on;
CALL smfi_rebuild_table('orders');

Comparison with Traditional Indexes¶

Feature	B-Tree Index	SMFI Bloom Filter	SMFI Zone Map
Creation	Manual CREATE INDEX	Automatic	Automatic
Maintenance	Manual REINDEX	Self-maintaining	Self-maintaining
Memory	In-memory pages	Disk-only	Disk-only
False Positives	None	~1%	None
Best For	Range + equality	Equality	Range
Write Overhead	10-30%	< 0.02%	< 0.02%

Troubleshooting¶

SMFI Not Being Used¶

-- Check if SMFI is enabled
SHOW smfi_enabled;

-- Verify table has been tracked
SELECT * FROM pg_smfi_table_stats() WHERE table_name = 'orders';

-- Force SMFI rebuild
CALL smfi_rebuild_table('orders');

High Consolidation CPU Usage¶

-- Lower max CPU for consolidation
SET smfi_max_cpu_percent = 10;

-- Increase delta threshold (less frequent consolidation)
SET smfi_delta_threshold = 5000;

Speculative Filters Not Created¶

-- Check query frequency threshold
SHOW smfi_min_query_frequency;

-- Lower threshold if needed
SET smfi_min_query_frequency = 5;

Self-Maintaining Filter Index (SMFI)¶

Overview¶

Key Benefits¶

Architecture¶

Components¶

1. Bloom Filters¶

2. Zone Maps¶

3. Columnar Zone Summaries (CZS)¶

4. Speculative Filter Indexes¶

5. Parallel Filter Evaluation¶

Performance Impact¶

Query Performance¶

Write Overhead¶

Storage Overhead¶

Implicit Benefits¶

1. Automatic Query Acceleration¶

2. Reduced I/O¶

3. Memory Efficiency¶

4. Automatic Index Suggestions¶

5. Parallel Query Execution¶

Configuration¶

Filter Consolidation Settings¶

Speculative Filter Settings¶

Parallel Filtering Settings¶

Bloom Filter Settings¶

System Views¶

pg_smfi_status()¶

pg_smfi_table_stats()¶

pg_speculative_filters()¶

Best Practices¶

1. Let SMFI Work Automatically¶

2. Monitor Speculative Filters¶

3. Use EXPLAIN to Verify¶

4. Bulk Load Optimization (Automatic)¶

Comparison with Traditional Indexes¶

Troubleshooting¶

SMFI Not Being Used¶

High Consolidation CPU Usage¶

Speculative Filters Not Created¶

See Also¶