Molniya Roadmap

This document tracks features that are documented but not yet fully implemented in the library.

Enhanced String Handling (NEW - In Progress) ✨

Completed ✅

[x] StringView implementation for high-cardinality strings
[x] AdaptiveStringColumn with cardinality sampling and automatic encoding selection
[x] EnhancedChunk supporting mixed StringView + Dictionary columns
[x] ColumnStorageAdapter for unified column access
[x] UTF-8 aware operations (prefix/suffix filtering, sorting)
[x] Shared resource management (dictionaries and buffers)
[x] Comprehensive unit tests (45/45 passing)
[x] Integration tests with DataFrame (basic operations working)
[x] Chunk integration complete (EnhancedChunk, EnhancedChunkBuilder)
[x] Public API exports (all enhanced string classes exported)
[x] fromRecords() integration: Uses AdaptiveStringColumn for automatic cardinality detection
- High-cardinality strings auto-detected (cardinality > 10%)
- Automatically chooses Dictionary vs StringView based on data
- Seamlessly converts back to dictionary for pipeline compatibility
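
The detection step above can be sketched roughly as follows. This is a minimal illustration, not the actual AdaptiveStringColumn code: `chooseEncoding`, `StringEncoding`, and the default sample size of 1000 are hypothetical, and only the 10% threshold comes from this roadmap.

```typescript
// Hypothetical names throughout; not Molniya's actual API.
type StringEncoding = "dictionary" | "stringView";

function chooseEncoding(
  values: readonly (string | null)[],
  sampleSize = 1000, // assumed sample size
  threshold = 0.1, // the "> 10%" cut-off mentioned above
): StringEncoding {
  const distinct = new Set<string>();
  let sampled = 0;
  for (const v of values) {
    if (sampled >= sampleSize) break;
    if (v === null) continue; // nulls do not count toward cardinality
    distinct.add(v);
    sampled++;
  }
  if (sampled === 0) return "dictionary"; // nothing to measure, keep the legacy encoding
  // Ratio of distinct values to sampled values decides the encoding.
  return distinct.size / sampled > threshold ? "stringView" : "dictionary";
}
```

Only the first non-null values are inspected, so the cost stays bounded no matter how long the column is.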

Partially Complete 🔄

CSV Integration: Dictionary encoding works for categorical data; StringView is not wired in yet
- ✅ Dictionary encoding for low-cardinality strings (existing implementation)
- ❌ StringView for high-cardinality strings (TODO comment added for future work)
- Future: Use AdaptiveStringColumn in streaming CSV parser
String Operations: Forward-looking code exists but depends on EnhancedColumnBuffer
- ✅ Architecture designed in src/ops/string-ops.ts
- ✅ StringView and Dictionary implementations drafted
- ❌ EnhancedColumnBuffer class doesn't exist yet
- ❌ Integration with DataFrame string methods incomplete
- Current: Dictionary-only string operations work for existing code

Not Yet Implemented ❌

[ ] EnhancedColumnBuffer class (wrapper to unify ColumnBuffer + AdaptiveStringColumn)
[ ] DataFrame execution pipeline StringView support (currently converts to dict)
[ ] Expression compiler updates for StringView comparisons
[ ] GroupBy/Join optimizations with shared dictionaries
[ ] String operations (trim, replace) working with StringView columns

Implementation Notes

What Works Now:

AdaptiveStringColumn correctly detects cardinality and chooses encoding
fromRecords() uses adaptive encoding, then converts high-cardinality data back to dictionary for pipeline compatibility
All existing DataFrame operations work (filter, select, groupBy, etc.)
Dictionary encoding efficiently handles categorical strings

Architecture Gap: The main gap is that we have two separate column systems:

Legacy: ColumnBuffer (used by Chunk, Pipeline, all operators)
New: AdaptiveStringColumn (used only in fromRecords, then converted)

To fully integrate StringView:

Create EnhancedColumnBuffer that wraps both systems
Update Pipeline/Operators to handle EnhancedColumnBuffer
Update execution.ts toArray() to read StringView columns
Complete string operation implementations
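
One possible shape for such a wrapper is a discriminated union over the two storage layouts with a single row accessor on top. Everything below is an assumption made for illustration: `UnifiedStringColumn`, `DictionaryStorage`, and `StringViewStorage` are hypothetical names, and the real ColumnBuffer and AdaptiveStringColumn layouts may differ.

```typescript
// Hypothetical shapes; the real ColumnBuffer and AdaptiveStringColumn may differ.
interface DictionaryStorage {
  kind: "dictionary";
  codes: Uint32Array; // one dictionary index per row
  dictionary: string[]; // distinct values
}

interface StringViewStorage {
  kind: "stringView";
  bytes: Uint8Array; // concatenated UTF-8 data
  offsets: Uint32Array; // byte offset where each row starts
  lengths: Uint32Array; // byte length of each row
}

type StringStorage = DictionaryStorage | StringViewStorage;

class UnifiedStringColumn {
  private readonly storage: StringStorage;
  private readonly decoder = new TextDecoder();

  constructor(storage: StringStorage) {
    this.storage = storage;
  }

  get length(): number {
    const s = this.storage;
    return s.kind === "dictionary" ? s.codes.length : s.offsets.length;
  }

  // Callers never need to know which encoding backs the column.
  get(row: number): string {
    const s = this.storage;
    if (s.kind === "dictionary") return s.dictionary[s.codes[row]];
    const start = s.offsets[row];
    return this.decoder.decode(s.bytes.subarray(start, start + s.lengths[row]));
  }
}
```

Operators written against `get()` and `length` would keep working with either encoding, which is the property the pipeline needs before StringView columns can flow through it.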

Core API Gaps

DataFrame Creation

[x] fromColumns() - Create DataFrame from column-oriented data with TypedArrays ✅
[x] fromArrays() - Create DataFrame from arrays of values ✅
[x] range() - Create DataFrame with sequence of numbers ✅
[ ] DataFrame.empty() - Static method needs verification

String Operations (ColumnRef methods)

[x] col().length() - String length ✅
[x] col().substring(start, len) - Extract substring ✅
[x] col().upper() / col().lower() - Case conversion ✅
[x] col().trim() - Remove whitespace ✅
[x] col().replace() - Replace substring ✅
[x] col().contains() - Check if contains substring ✅
[x] col().startsWith() / col().endsWith() - Prefix/suffix checks ✅

Date/Time Operations (ColumnRef methods)

[x] col().year() - Extract year ✅
[x] col().month() - Extract month ✅
[x] col().day() - Extract day ✅
[x] col().dayOfWeek() - Extract day of week ✅
[x] col().quarter() - Extract quarter ✅
[x] col().hour() / col().minute() / col().second() - Extract time components ✅
[x] col().addDays() / col().subDays() - Date arithmetic ✅
[x] col().diffDays() - Difference between dates ✅
[x] col().truncateDate() - Truncate to period ✅

Math Functions

[x] col().round(decimals) - Round to decimal places ✅
[x] col().floor() - Floor function ✅
[x] col().ceil() - Ceiling function ✅
[x] col().abs() - Absolute value ✅
[x] col().sqrt() - Square root ✅
[x] col().pow(exp) - Power/exponentiation ✅

Aggregation Functions

[x] std() - Standard deviation ✅
[x] var() - Variance ✅
[x] median() - Median value ✅
[x] countDistinct() - Count unique values ✅

Sorting Enhancements

[ ] asc().nullsLast() - Sort nulls last option
[ ] desc().nullsLast() - Sort nulls last option
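
The intended semantics can be sketched with a comparator wrapper. `compareNullsLast` is a hypothetical helper, not part of the asc()/desc() builders; it only illustrates that nulls stay at the end regardless of sort direction.

```typescript
// Hypothetical helper; not the asc()/desc() builders themselves.
function compareNullsLast<T>(
  compare: (a: T, b: T) => number,
): (a: T | null, b: T | null) => number {
  return (a, b) => {
    if (a === null && b === null) return 0;
    if (a === null) return 1; // nulls sort after every value
    if (b === null) return -1;
    return compare(a, b); // direction only applies to non-null values
  };
}

const ascending = (a: number, b: number) => a - b;
const descending = (a: number, b: number) => b - a;

const values: (number | null)[] = [3, null, 1, 2];
const sortedAsc = [...values].sort(compareNullsLast(ascending)); // [1, 2, 3, null]
const sortedDesc = [...values].sort(compareNullsLast(descending)); // [3, 2, 1, null]
```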

Join Operations

[ ] rightJoin() - Right outer join
[ ] fullJoin() - Full outer join
[ ] Multi-column join support (array syntax)
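
As a reference for the expected behavior, here is a small full outer join over plain row objects with a multi-column key. It is a sketch under simplifying assumptions, not the planned operator: `Row`, `keyOf`, and `fullJoin` are hypothetical, null keys are treated as equal (unlike SQL), and unmatched rows simply lack the other side's columns instead of carrying explicit nulls.

```typescript
type Row = Record<string, string | number | null>;

// Composite key from several columns. JSON.stringify keeps ["a", 1]
// distinct from ["a1"] and from ["a", "1"].
function keyOf(row: Row, keys: readonly string[]): string {
  return JSON.stringify(keys.map((k) => row[k] ?? null));
}

function fullJoin(left: Row[], right: Row[], keys: readonly string[]): Row[] {
  // Index the right side by composite key.
  const index = new Map<string, Row[]>();
  for (const r of right) {
    const k = keyOf(r, keys);
    const bucket = index.get(k);
    if (bucket) bucket.push(r);
    else index.set(k, [r]);
  }

  const matched = new Set<Row>();
  const out: Row[] = [];
  for (const l of left) {
    const bucket = index.get(keyOf(l, keys)) ?? [];
    if (bucket.length === 0) out.push({ ...l }); // left-only row
    for (const r of bucket) {
      matched.add(r);
      out.push({ ...r, ...l }); // left values win on column-name clashes
    }
  }
  for (const r of right) {
    if (!matched.has(r)) out.push({ ...r }); // right-only row
  }
  return out;
}
```

A right outer join follows from the same structure by dropping the left-only rows.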

DataFrame Methods

[x] df.union() - Union with deduplication ✅
[x] df.unionAll() - Union without deduplication ✅
[x] df.offset() / df.slice() - Pagination support ✅
[x] df.tail(n) - Last n rows ✅
[x] df.shuffle() - Random shuffle ✅
[x] df.sample(fraction) - Random sample ✅
[x] df.explode() - Explode array column into rows ✅

Utility Functions

[x] when().otherwise() - Complete conditional expression ✅
[x] coalesce() - First non-null value ✅
[x] between() - Range check ✅
[x] isIn() - Check if in array ✅

I/O Features

CSV

[x] readCsv() with projection option ✅
[x] readCsv() with filter predicate pushdown ✅

Parquet

[x] readParquet() with projection option ✅
[x] readParquet() with filter predicate pushdown ✅
[x] Complete Parquet type mapping (INT96 timestamps, complex types) ✅

Type System

Schema

[x] createSchema() export verification ✅
[x] Schema validation utilities ✅
[x] Schema comparison functions ✅

Type Casting

[x] toDate() / toTimestamp() - String to date parsing ✅
[x] formatDate() - Date to string formatting ✅
[x] parseJson() - Parse JSON strings ✅

Documentation Notes

The following documentation files contain API that may not be fully implemented:


guide/strings-dates.md - Many date/string methods documented
guide/expressions.md - Some expression builders documented
guide/aggregations.md - Statistical aggregations documented
api/column-ops.md - Many column operations documented
api/sort-limit.md - Some sorting options documented
api/groupby.md - Pivot functionality documented
api/joins.md - Right and full joins documented

Implementation Priority

High Priority (Core functionality)

String operations on ColumnRef (length, substring, upper, lower, trim)
Date extraction methods (year, month, day)
Math functions (round, floor, ceil, abs)
when().otherwise() completion

Medium Priority (Convenience)

fromColumns() / fromArrays() creation methods
df.slice() / df.offset() for pagination
coalesce() function
between() and isIn() operators

Low Priority (Advanced)

Statistical aggregations (std, var, median)
rightJoin() / fullJoin()
df.shuffle() / df.sample()
df.explode()

✅ What We've Implemented

1. Enhanced String Handling System

StringView class for high-cardinality strings (string-view.ts)
- UTF-8 byte storage with offset/length arrays
- O(1) insertion, zero V8 heap pressure
- Shared data buffers across columns
AdaptiveStringColumn (adaptive-string.ts)
- Cardinality sampling (first N strings)
- Auto-selection between Dictionary and StringView
- User override via DType hints
- No automatic O(N) switching during operations
Enhanced DType system (dtypes.ts)
- Added StringDict, StringView, String (adaptive) types
- Updated type mappings and constructors
Resource Management (string-resources.ts)
- Shared dictionary/buffer pooling
- Reference counting and lifecycle management
- Tag-based resource discovery
UTF-8 Operations (utf8-ops.ts)
- Byte-level vs character-level operations
- Fast prefix/suffix filtering without materialization
- Sorting, grouping, duplicate detection
EnhancedColumnBuffer (enhanced-column-buffer.ts)
- Unified interface for all column types
- Transparent adaptive string handling
Comprehensive Tests (enhanced-strings.test.ts, enhanced-strings-benchmark.test.ts)
- Unit tests for all components
- Integration tests for workflows
- Performance benchmarks

❌ What's Missing - Integration Work

1. Chunk Integration ✅ COMPLETED

~~The existing Chunk class (chunk.ts) only knows about DTypeKind.String~~

Status: DONE

✅ EnhancedChunk supports both StringView and Dictionary columns
✅ ColumnStorageAdapter provides unified access
✅ EnhancedChunkBuilder for easy construction
✅ All 8 integration tests passing
✅ Exported in public API

2. String Operations Update ✅ COMPLETED

String ops (string-ops-enhanced.ts) now work with both Dictionary and StringView:

```typescript
trimColumnEnhanced(column: EnhancedColumnBuffer)
replaceColumnEnhanced(column, pattern, replacement, all)
toLowerColumnEnhanced / toUpperColumnEnhanced
substringColumnEnhanced / padColumnEnhanced
```

Implementation:

✅ StringView-aware versions of trim, replace, toLowerCase, toUpperCase, etc.
✅ Detection of encoding type and dispatch to appropriate implementation
✅ UTF8Utils integration for byte-level operations
✅ Full backward compatibility with existing Dictionary operations
✅ Comprehensive test suite (29 tests) verifying both encodings
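
The detect-and-dispatch pattern can be illustrated with a trim operation. This is a sketch, not the code in string-ops-enhanced.ts: `StringColumn`, `trimColumn`, and the storage shapes are hypothetical, and only ASCII space, tab, LF, and CR are treated as whitespace so that both branches behave identically.

```typescript
interface DictColumn {
  kind: "dictionary";
  codes: Uint32Array;
  dictionary: string[];
}
interface ViewColumn {
  kind: "stringView";
  bytes: Uint8Array;
  offsets: Uint32Array;
  lengths: Uint32Array;
}
type StringColumn = DictColumn | ViewColumn;

const isAsciiSpace = (b: number): boolean =>
  b === 0x20 || b === 0x09 || b === 0x0a || b === 0x0d;

function trimColumn(column: StringColumn): StringColumn {
  if (column.kind === "dictionary") {
    // Work is proportional to the number of distinct values, not rows.
    const dictionary = column.dictionary.map((s) =>
      s.replace(/^[ \t\n\r]+|[ \t\n\r]+$/g, ""),
    );
    return { ...column, dictionary };
  }
  // StringView: narrow each row's window; the byte buffer is shared, not copied.
  const offsets = new Uint32Array(column.offsets.length);
  const lengths = new Uint32Array(column.lengths.length);
  for (let row = 0; row < offsets.length; row++) {
    let start = column.offsets[row];
    let end = start + column.lengths[row];
    while (start < end && isAsciiSpace(column.bytes[start])) start++;
    while (end > start && isAsciiSpace(column.bytes[end - 1])) end--;
    offsets[row] = start;
    lengths[row] = end - start;
  }
  return { kind: "stringView", bytes: column.bytes, offsets, lengths };
}
```

Note that trimming a dictionary can leave duplicate entries (for example `" a"` and `"a "` both become `"a"`); a real implementation would have to decide whether to re-deduplicate.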

3. DataFrame Integration ✅ COMPLETED

What works:

✅ fromRecords() now uses AdaptiveStringColumn for optimal string encoding
✅ fromColumns() now uses AdaptiveStringColumn for optimal string encoding
✅ Automatic detection of high vs low cardinality strings
✅ Support for explicit encoding hints (DType.stringDict, DType.stringView)
✅ Proper handling of nullable string columns
✅ String materialization in toArray() for all encoding types
✅ Basic DataFrame operations (filter, select) work with enhanced strings
✅ 19 comprehensive integration tests passing

Implementation Details:

Both fromRecords and fromColumns now use AdaptiveStringColumn to:
- Sample first N strings to detect cardinality
- Auto-select Dictionary encoding for low cardinality
- Auto-select StringView encoding for high cardinality
- Respect user hints (DType.stringDict / DType.stringView)
- Handle nullable string columns properly
Updated toArray() execution to materialize StringDict and StringView types
Added comprehensive test coverage (test/enhanced-dataframe.test.ts)

What's tested:

✅ Dictionary encoding for low-cardinality strings
✅ StringView encoding for high-cardinality strings
✅ Explicit DType hints (stringDict, stringView)
✅ Null value handling in string columns
✅ Mixed-length string efficiency
✅ TypedArray support in fromColumns
✅ Filtering on string columns
✅ Selecting string columns
✅ Large dataset handling (10,000 rows)
✅ Edge cases (empty DataFrames, single rows, empty strings, unicode)

4. CSV Reader Updates ❌ NOT IMPLEMENTED

src/io/csv/source.ts currently:

Always uses Dictionary for strings
No cardinality detection during parsing

Missing:

Sampling during CSV parsing
Creating StringView columns for high-cardinality data
User hints for string encoding in CsvSchemaSpec

5. Parquet Reader Updates ❌ NOT IMPLEMENTED

The Parquet reader has similar gaps to the CSV reader and needs StringView support.

6. Expression Compiler ❌ NOT IMPLEMENTED

compiler.ts needs to:

Handle StringView columns in expressions
Generate code that works with both Dictionary and StringView
Use UTF8Utils for string comparisons when available

7. GroupBy/Join Operations ❌ NOT IMPLEMENTED

groupby.ts and src/ops/joins.ts need to:

Leverage shared dictionaries for faster joins
Hash grouping on StringView bytes
Detect when columns share dictionaries
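
Hash grouping on StringView bytes could look roughly like this. The sketch uses 32-bit FNV-1a as the hash, which is my choice for illustration and not something the roadmap specifies; `fnv1a`, `bytesEqual`, and `groupRows` are hypothetical names.

```typescript
// 32-bit FNV-1a over a byte range.
function fnv1a(bytes: Uint8Array, start: number, length: number): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = start; i < start + length; i++) {
    hash ^= bytes[i];
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0;
}

function bytesEqual(bytes: Uint8Array, a: number, b: number, length: number): boolean {
  for (let i = 0; i < length; i++) {
    if (bytes[a + i] !== bytes[b + i]) return false;
  }
  return true;
}

// Assigns a dense group id to every row without decoding any string.
function groupRows(bytes: Uint8Array, offsets: Uint32Array, lengths: Uint32Array): Uint32Array {
  const groupOfRow = new Uint32Array(offsets.length);
  // hash -> first row of each group that produced this hash
  const buckets = new Map<number, number[]>();
  let nextGroup = 0;
  for (let row = 0; row < offsets.length; row++) {
    const hash = fnv1a(bytes, offsets[row], lengths[row]);
    let bucket = buckets.get(hash);
    if (!bucket) {
      bucket = [];
      buckets.set(hash, bucket);
    }
    // Equal hashes are confirmed byte by byte, so collisions cannot merge groups.
    const rep = bucket.find(
      (r) => lengths[r] === lengths[row] && bytesEqual(bytes, offsets[r], offsets[row], lengths[row]),
    );
    if (rep === undefined) {
      groupOfRow[row] = nextGroup++;
      bucket.push(row);
    } else {
      groupOfRow[row] = groupOfRow[rep];
    }
  }
  return groupOfRow;
}
```

When two columns share a dictionary, this step is unnecessary: equal codes already imply equal strings, so grouping and joining can run on the integer codes directly.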

8. Type Casting ❌ NOT IMPLEMENTED

cast.ts needs to:

Support casting to/from StringDict and StringView
Preserve encoding when possible

9. Buffer Pool ❌ NOT IMPLEMENTED

pool.ts needs to:

Pool EnhancedColumnBuffer instances
Reuse StringView buffers
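
A minimal sketch of buffer reuse, assuming power-of-two size classes. `ByteBufferPool` is hypothetical and says nothing about how pool.ts is actually organized; released buffers are not zeroed, so callers must track how many bytes they have written.

```typescript
// Hypothetical pool; pool.ts is not shown in this roadmap.
class ByteBufferPool {
  // capacity -> idle buffers of exactly that capacity
  private readonly free = new Map<number, Uint8Array[]>();

  // Round up to a power of two so buffers of similar size are interchangeable.
  private static capacityFor(minBytes: number): number {
    let capacity = 64;
    while (capacity < minBytes) capacity *= 2;
    return capacity;
  }

  acquire(minBytes: number): Uint8Array {
    const capacity = ByteBufferPool.capacityFor(minBytes);
    const idle = this.free.get(capacity);
    // Reuse an idle buffer when one exists, otherwise allocate.
    return idle?.pop() ?? new Uint8Array(capacity);
  }

  release(buffer: Uint8Array): void {
    const idle = this.free.get(buffer.length);
    if (idle) idle.push(buffer);
    else this.free.set(buffer.length, [buffer]);
  }
}
```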

10. Export/Index Updates ✅ COMPLETED

~~index.ts: Export enhanced string classes~~

✅ All classes exported in src/buffer/index.ts

🎯 Implementation Goals

Primary Goals

Backward Compatibility - Existing code continues to work
Zero-Cost Abstraction - No performance penalty for non-string columns
Transparent Optimization - Users get benefits without code changes
Opt-in Control - Advanced users can force specific encodings

Performance Goals

Dictionary: < 50MB memory for 10M rows with 100 unique values
StringView: < 200MB memory for 10M rows with 10M unique values
Adaptive: Choose correctly within 1000 samples
UTF-8 ops: 2-10x faster than string materialization
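
A back-of-the-envelope check of the two memory targets, under assumptions that are mine rather than the roadmap's: 4-byte dictionary codes, 4-byte offsets and 4-byte lengths per StringView row, decimal megabytes, and no null bitmap or allocator overhead.

```typescript
const MB = 1_000_000; // decimal megabytes (assumption; the targets do not say MB vs MiB)
const rowCount = 10_000_000;

// Dictionary: one 4-byte code per row plus the distinct values themselves.
function dictionaryBytes(rows: number, distinct: number, avgValueBytes: number): number {
  return rows * 4 + distinct * avgValueBytes;
}

// StringView: 4-byte offset + 4-byte length per row, plus the UTF-8 data.
function stringViewBytes(rows: number, avgValueBytes: number): number {
  return rows * (4 + 4) + rows * avgValueBytes;
}

// 100 distinct values of about 16 bytes each: 40,001,600 bytes, about 40 MB.
const dictMB = dictionaryBytes(rowCount, 100, 16) / MB;

// 12-byte average strings: 200,000,000 bytes, exactly 200 MB.
const viewMB = stringViewBytes(rowCount, 12) / MB;
```

Under these assumptions the dictionary target has roughly 10 MB of headroom, while the StringView target only holds when the average string is shorter than about 12 bytes.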

API Design Goals

Simple default: DType.string → automatic (starts with StringView)
Explicit control: DType.stringDict / DType.stringView
Shared resources: { sharedDictionaryId: "cities" }
Conversion API: column.convertToDictionary() for user control
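
How a shared dictionary id might resolve to a reference-counted dictionary is sketched below. `SharedDictionary` and `DictionaryRegistry` are hypothetical names, not the classes in string-resources.ts; the point is only that two columns using the same id receive the same codes for the same strings.

```typescript
// Hypothetical classes; string-resources.ts is not shown in this roadmap.
class SharedDictionary {
  private readonly codeOf = new Map<string, number>();
  readonly values: string[] = [];

  // Returns the existing code for a value, or assigns the next free one.
  encode(value: string): number {
    let code = this.codeOf.get(value);
    if (code === undefined) {
      code = this.values.length;
      this.values.push(value);
      this.codeOf.set(value, code);
    }
    return code;
  }
}

class DictionaryRegistry {
  private readonly entries = new Map<string, { dictionary: SharedDictionary; refs: number }>();

  // Every acquire must be paired with a release.
  acquire(id: string): SharedDictionary {
    let entry = this.entries.get(id);
    if (!entry) {
      entry = { dictionary: new SharedDictionary(), refs: 0 };
      this.entries.set(id, entry);
    }
    entry.refs++;
    return entry.dictionary;
  }

  // The dictionary is dropped once the last user releases it.
  release(id: string): void {
    const entry = this.entries.get(id);
    if (!entry) return;
    entry.refs--;
    if (entry.refs === 0) this.entries.delete(id);
  }

  has(id: string): boolean {
    return this.entries.has(id);
  }
}
```

Because both columns encode through one dictionary, an equality join between them can compare integer codes instead of strings.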

📊 Migration Strategy

Phase 1: Core Infrastructure ✅ DONE

[x] Enhanced DType system
[x] StringView implementation
[x] AdaptiveStringColumn
[x] Resource management
[x] UTF-8 operations
[x] Tests

Phase 2: Integration ✅ COMPLETED

[x] Update Chunk to support StringView columns ✅
[x] Export enhanced string classes ✅
[x] Update string-ops.ts to work with both encodings ✅
[x] Update DataFrame creation functions (fromRecords, fromColumns) ✅
[x] Add comprehensive integration tests ✅
[ ] Update CSV/Parquet readers

Phase 3: Optimization ⏳ NOT STARTED

[ ] Expression compiler optimizations
[ ] GroupBy/Join leveraging shared dictionaries
[ ] Buffer pooling for EnhancedColumnBuffer

Phase 4: Polish ⏳ NOT STARTED

[ ] Documentation updates
[ ] Migration guide
[ ] Performance tuning
[ ] Real-world benchmarks

🚧 Immediate Next Steps

~~Update Chunk class to handle StringView columns~~ ✅ DONE
~~Create adapter layer in string-ops.ts to detect encoding and dispatch~~ ✅ DONE
- Implemented as string-ops-enhanced.ts with full encoding detection and dispatch
- All 29 tests passing for both Dictionary and StringView encodings
~~Update fromRecords/fromColumns to use EnhancedColumnBuffer~~ ✅ DONE
- Both functions now use AdaptiveStringColumn
- Automatic cardinality detection
- Support for explicit encoding hints
- 19 comprehensive tests passing
~~Add integration tests that exercise full DataFrame pipeline~~ ✅ DONE
~~Update index.ts to export enhanced string classes~~ ✅ DONE

Next priorities:

CSV reader integration - Update CSV reader to use AdaptiveStringColumn during parsing
Parquet reader integration - Update Parquet reader to support StringView
Expression compiler - Update compiler to handle StringView in expressions
GroupBy/Join optimizations - Leverage shared dictionaries for performance

📖 Key Design Decisions

| Decision | Rationale |
| --- | --- |
| StringView as default | O(1) insertion, handles any cardinality |
| No auto-switching | Avoids O(N) latency spikes during operations |
| Sampling-based | Non-intrusive cardinality detection |
| Shared dictionaries | Enable fast integer-based joins |
| UTF-8 byte ops | Avoid V8 string materialization tax |
| Separate null bitmap | SIMD-friendly null handling |
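
The separate null bitmap can be pictured as one bit per row stored apart from the values. `NullBitmap` is a hypothetical sketch with an assumed bit order (least significant bit first within each byte), not the library's layout.

```typescript
// Hypothetical sketch: one validity bit per row, stored apart from the values.
class NullBitmap {
  private readonly bits: Uint8Array;

  constructor(rowCount: number) {
    // One bit per row, rounded up to whole bytes; all rows start as non-null.
    this.bits = new Uint8Array(Math.ceil(rowCount / 8));
  }

  setNull(row: number): void {
    this.bits[row >> 3] |= 1 << (row & 7);
  }

  isNull(row: number): boolean {
    return (this.bits[row >> 3] & (1 << (row & 7))) !== 0;
  }

  // Counts nulls a byte at a time without touching the value buffers.
  countNulls(): number {
    let count = 0;
    for (let i = 0; i < this.bits.length; i++) {
      let byte = this.bits[i];
      while (byte !== 0) {
        byte &= byte - 1; // clear the lowest set bit
        count++;
      }
    }
    return count;
  }
}
```

Keeping nulls out of the value buffers means the values stay densely packed, which is what makes wide, branch-free scans over them practical.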

The core infrastructure is solid. The main work ahead is integration with existing DataFrame infrastructure to make the enhanced string handling transparent to users while maintaining backward compatibility.
