Molniya Roadmap
This document tracks features that are documented but not yet fully implemented in the library.
Enhanced String Handling (NEW - In Progress) ✨
Completed ✅
- [x] StringView implementation for high-cardinality strings
- [x] AdaptiveStringColumn with cardinality sampling and automatic encoding selection
- [x] EnhancedChunk supporting mixed StringView + Dictionary columns
- [x] ColumnStorageAdapter for unified column access
- [x] UTF-8 aware operations (prefix/suffix filtering, sorting)
- [x] Shared resource management (dictionaries and buffers)
- [x] Comprehensive unit tests (45/45 passing)
- [x] Integration tests with DataFrame (basic operations working)
- [x] Chunk integration complete (EnhancedChunk, EnhancedChunkBuilder)
- [x] Public API exports (all enhanced string classes exported)
- [x] fromRecords() integration: Uses AdaptiveStringColumn for automatic cardinality detection
  - High-cardinality strings auto-detected (cardinality > 10%)
  - Automatically chooses Dictionary vs StringView based on data
  - Seamlessly converts back to dictionary for pipeline compatibility
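
The adaptive selection described in this list can be sketched as follows. This is a minimal illustration, not the library's `AdaptiveStringColumn`: the function name, the `StringEncoding` type, and the null handling are assumptions. The 10% threshold comes from the list above, and the default sample size of 1000 echoes the "within 1000 samples" performance goal stated later in this document.

```typescript
// Hypothetical sketch of sampling-based encoding selection.
type StringEncoding = "dictionary" | "stringView";

function chooseStringEncoding(
  values: ReadonlyArray<string | null>,
  sampleSize = 1000, // assumption: sample the first N non-null strings
  threshold = 0.1, // from the roadmap: cardinality > 10% counts as high
): StringEncoding {
  const distinct = new Set<string>();
  let sampled = 0;
  for (let i = 0; i < values.length && sampled < sampleSize; i++) {
    const value = values[i];
    if (value === null) continue; // nulls say nothing about cardinality
    distinct.add(value);
    sampled++;
  }
  // Assumption: with nothing to measure, fall back to dictionary encoding.
  if (sampled === 0) return "dictionary";
  return distinct.size / sampled > threshold ? "stringView" : "dictionary";
}

// Three distinct values over 500 rows: ratio 0.006, so dictionary.
const categorical = Array.from({ length: 500 }, (_, i) =>
  i % 3 === 0 ? "Oslo" : i % 3 === 1 ? "Lima" : "Pune",
);
// Every value unique: ratio 1.0, so StringView.
const identifiers = Array.from({ length: 500 }, (_, i) => `user-${i}`);

const categoricalEncoding = chooseStringEncoding(categorical);
const identifierEncoding = chooseStringEncoding(identifiers);
```

Sampling only a prefix keeps detection cheap, at the cost of misjudging columns whose early rows are unrepresentative.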
Partially Complete 🔄
CSV Integration: Dictionary encoding works perfectly for categorical data
- ✅ Dictionary encoding for low-cardinality strings (existing implementation)
- ❌ StringView for high-cardinality strings (TODO comment added for future work)
- Future: Use AdaptiveStringColumn in streaming CSV parser
String Operations: Forward-looking code exists but depends on EnhancedColumnBuffer
- ✅ Architecture designed in `src/ops/string-ops.ts`
- ✅ StringView and Dictionary implementations drafted
- ❌ EnhancedColumnBuffer class doesn't exist yet
- ❌ Integration with DataFrame string methods incomplete
- Current: Dictionary-only string operations work for existing code
Not Yet Implemented ❌
- [ ] EnhancedColumnBuffer class (wrapper to unify ColumnBuffer + AdaptiveStringColumn)
- [ ] DataFrame execution pipeline StringView support (currently converts to dict)
- [ ] Expression compiler updates for StringView comparisons
- [ ] GroupBy/Join optimizations with shared dictionaries
- [ ] String operations (trim, replace) working with StringView columns
Implementation Notes
What Works Now:
- AdaptiveStringColumn correctly detects cardinality and chooses encoding
- fromRecords() uses adaptive encoding, converting high-cardinality data to dictionary
- All existing DataFrame operations work (filter, select, groupBy, etc.)
- Dictionary encoding efficiently handles categorical strings
Architecture Gap: The main gap is that we have two separate column systems:
- Legacy: ColumnBuffer (used by Chunk, Pipeline, all operators)
- New: AdaptiveStringColumn (used only in fromRecords, then converted)
To fully integrate StringView:
- Create EnhancedColumnBuffer that wraps both systems
- Update Pipeline/Operators to handle EnhancedColumnBuffer
- Update execution.ts toArray() to read StringView columns
- Complete string operation implementations
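
One way to picture the wrapper step is a small read interface that both column systems can satisfy. The sketch below is an assumption about shape only: `StringColumnReader`, `DictionaryReader`, `ByteViewReader`, and `materialize` are hypothetical names, and the real `ColumnBuffer` and `AdaptiveStringColumn` APIs are not shown in this roadmap.

```typescript
// Hypothetical common read interface; not the library's EnhancedColumnBuffer.
interface StringColumnReader {
  readonly length: number;
  get(row: number): string;
}

// Adapter over a dictionary-encoded column: integer codes index a string table.
class DictionaryReader implements StringColumnReader {
  private readonly codes: Int32Array;
  private readonly dictionary: readonly string[];
  constructor(codes: Int32Array, dictionary: readonly string[]) {
    this.codes = codes;
    this.dictionary = dictionary;
  }
  get length(): number {
    return this.codes.length;
  }
  get(row: number): string {
    if (row < 0 || row >= this.codes.length) throw new RangeError(`row ${row} out of range`);
    return this.dictionary[this.codes[row]];
  }
}

// Adapter over a byte-oriented column: UTF-8 bytes plus offset/length arrays.
class ByteViewReader implements StringColumnReader {
  private static readonly decoder = new TextDecoder();
  private readonly bytes: Uint8Array;
  private readonly offsets: Uint32Array;
  private readonly lengths: Uint32Array;
  constructor(bytes: Uint8Array, offsets: Uint32Array, lengths: Uint32Array) {
    this.bytes = bytes;
    this.offsets = offsets;
    this.lengths = lengths;
  }
  get length(): number {
    return this.offsets.length;
  }
  get(row: number): string {
    if (row < 0 || row >= this.offsets.length) throw new RangeError(`row ${row} out of range`);
    const start = this.offsets[row];
    return ByteViewReader.decoder.decode(this.bytes.subarray(start, start + this.lengths[row]));
  }
}

// A toArray()-style consumer only needs the shared interface.
function materialize(column: StringColumnReader): string[] {
  const out: string[] = [];
  for (let row = 0; row < column.length; row++) out.push(column.get(row));
  return out;
}

const fromDictionary = materialize(
  new DictionaryReader(Int32Array.from([0, 1, 0]), ["red", "blue"]),
);
const fromBytes = materialize(
  new ByteViewReader(
    new TextEncoder().encode("redbluered"),
    Uint32Array.from([0, 3, 7]),
    Uint32Array.from([3, 4, 3]),
  ),
);
```

With an interface like this, operators could stay unaware of the encoding, which is the property the integration steps above are after.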
Core API Gaps
DataFrame Creation
- [x] `fromColumns()` - Create DataFrame from column-oriented data with TypedArrays ✅
- [x] `fromArrays()` - Create DataFrame from arrays of values ✅
- [x] `range()` - Create DataFrame with sequence of numbers ✅
- [ ] `DataFrame.empty()` - Static method needs verification
String Operations (ColumnRef methods)
- [x] `col().length()` - String length ✅
- [x] `col().substring(start, len)` - Extract substring ✅
- [x] `col().upper()` / `col().lower()` - Case conversion ✅
- [x] `col().trim()` - Remove whitespace ✅
- [x] `col().replace()` - Replace substring ✅
- [x] `col().contains()` - Check if contains substring ✅
- [x] `col().startsWith()` / `col().endsWith()` - Prefix/suffix checks ✅
Date/Time Operations (ColumnRef methods)
- [x] `col().year()` - Extract year ✅
- [x] `col().month()` - Extract month ✅
- [x] `col().day()` - Extract day ✅
- [x] `col().dayOfWeek()` - Extract day of week ✅
- [x] `col().quarter()` - Extract quarter ✅
- [x] `col().hour()` / `col().minute()` / `col().second()` - Extract time components ✅
- [x] `col().addDays()` / `col().subDays()` - Date arithmetic ✅
- [x] `col().diffDays()` - Difference between dates ✅
- [x] `col().truncateDate()` - Truncate to period ✅
Math Functions
- [x] `col().round(decimals)` - Round to decimal places ✅
- [x] `col().floor()` - Floor function ✅
- [x] `col().ceil()` - Ceiling function ✅
- [x] `col().abs()` - Absolute value ✅
- [x] `col().sqrt()` - Square root ✅
- [x] `col().pow(exp)` - Power/exponentiation ✅
Aggregation Functions
- [x] `std()` - Standard deviation ✅
- [x] `var()` - Variance ✅
- [x] `median()` - Median value ✅
- [x] `countDistinct()` - Count unique values ✅
Sorting Enhancements
- [ ] `asc().nullsLast()` - Sort nulls last option
- [ ] `desc().nullsLast()` - Sort nulls last option
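
Since `nullsLast()` is not implemented yet, the intended semantics can be illustrated with a plain comparator wrapper. This is a hedged sketch using only standard `Array.prototype.sort`; `compareNullsLast` is a hypothetical helper, not part of the library.

```typescript
// Wraps a comparator so that nulls sort after every non-null value,
// regardless of whether the wrapped order is ascending or descending.
function compareNullsLast<T>(
  compare: (a: T, b: T) => number,
): (a: T | null, b: T | null) => number {
  return (a, b) => {
    if (a === null && b === null) return 0;
    if (a === null) return 1; // a goes after b
    if (b === null) return -1; // b goes after a
    return compare(a, b);
  };
}

const ascending = (a: number, b: number): number => a - b;
const descending = (a: number, b: number): number => b - a;

const input: Array<number | null> = [3, null, 1, null, 2];

const sortedAsc = [...input].sort(compareNullsLast(ascending));
const sortedDesc = [...input].sort(compareNullsLast(descending));
```

The point of the wrapper is that a descending sort cannot be obtained by negating the ascending comparator, because that would move nulls to the front.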
Join Operations
- [ ] `rightJoin()` - Right outer join
- [ ] `fullJoin()` - Full outer join
- [ ] Multi-column join support (array syntax)
DataFrame Methods
- [x] `df.union()` - Union with deduplication ✅
- [x] `df.unionAll()` - Union without deduplication ✅
- [x] `df.offset()` / `df.slice()` - Pagination support ✅
- [x] `df.tail(n)` - Last n rows ✅
- [x] `df.shuffle()` - Random shuffle ✅
- [x] `df.sample(fraction)` - Random sample ✅
- [x] `df.explode()` - Explode array column into rows ✅
Utility Functions
- [x] `when().otherwise()` - Complete conditional expression ✅
- [x] `coalesce()` - First non-null value ✅
- [x] `between()` - Range check ✅
- [x] `isIn()` - Check if in array ✅
I/O Features
CSV
- [x] `readCsv()` with `projection` option ✅
- [x] `readCsv()` with `filter` predicate pushdown ✅
Parquet
- [x] `readParquet()` with `projection` option ✅
- [x] `readParquet()` with `filter` predicate pushdown ✅
- [x] Complete Parquet type mapping (INT96 timestamps, complex types) ✅
Type System
Schema
- [x] `createSchema()` export verification ✅
- [x] Schema validation utilities ✅
- [x] Schema comparison functions ✅
Type Casting
- [x] `toDate()` / `toTimestamp()` - String to date parsing ✅
- [x] `formatDate()` - Date to string formatting ✅
- [x] `parseJson()` - Parse JSON strings ✅
Documentation Notes
The following documentation files contain API that may not be fully implemented:
- `guide/strings-dates.md` - Many date/string methods documented
- `guide/expressions.md` - Some expression builders documented
- `guide/aggregations.md` - Statistical aggregations documented
- `api/column-ops.md` - Many column operations documented
- `api/sort-limit.md` - Some sorting options documented
- `api/groupby.md` - Pivot functionality documented
- `api/joins.md` - Right and full joins documented
Implementation Priority
High Priority (Core functionality)
- String operations on ColumnRef (length, substring, upper, lower, trim)
- Date extraction methods (year, month, day)
- Math functions (round, floor, ceil, abs)
- `when().otherwise()` completion
Medium Priority (Convenience)
- `fromColumns()` / `fromArrays()` creation methods
- `df.slice()` / `df.offset()` for pagination
- `coalesce()` function
- `between()` and `isIn()` operators
Low Priority (Advanced)
- Statistical aggregations (std, var, median)
- `rightJoin()` / `fullJoin()`
- `df.shuffle()` / `df.sample()`
- `df.explode()`
✅ What We've Implemented
1. Enhanced String Handling System
StringView class for high-cardinality strings (string-view.ts)
- UTF-8 byte storage with offset/length arrays
- O(1) insertion, zero V8 heap pressure
- Shared data buffers across columns
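
The storage layout in these bullets (UTF-8 bytes plus per-row offset and length) can be sketched as below. This is an illustration under assumptions, not the contents of `string-view.ts`: the class name `ByteStringColumn`, the doubling growth policy, and the use of plain `number[]` for offsets and lengths are all hypothetical simplifications.

```typescript
const viewEncoder = new TextEncoder();
const viewDecoder = new TextDecoder();

// Hypothetical append-only column: one shared byte buffer, one (offset, length) pair per row.
class ByteStringColumn {
  private bytes: Uint8Array;
  private used = 0;
  private readonly offsets: number[] = [];
  private readonly lengths: number[] = [];

  constructor(initialCapacity = 64) {
    this.bytes = new Uint8Array(Math.max(1, initialCapacity));
  }

  // Amortized O(1): the buffer doubles when full, so copies are rare.
  append(value: string): void {
    const encoded = viewEncoder.encode(value);
    const needed = this.used + encoded.length;
    if (needed > this.bytes.length) {
      let capacity = this.bytes.length;
      while (capacity < needed) capacity *= 2;
      const grown = new Uint8Array(capacity);
      grown.set(this.bytes.subarray(0, this.used));
      this.bytes = grown;
    }
    this.bytes.set(encoded, this.used);
    this.offsets.push(this.used);
    this.lengths.push(encoded.length);
    this.used = needed;
  }

  get rowCount(): number {
    return this.offsets.length;
  }

  // Length in UTF-8 bytes, available without decoding the string.
  byteLength(row: number): number {
    this.checkRow(row);
    return this.lengths[row];
  }

  // Decoding allocates a JS string only when a caller asks for one.
  get(row: number): string {
    this.checkRow(row);
    const start = this.offsets[row];
    return viewDecoder.decode(this.bytes.subarray(start, start + this.lengths[row]));
  }

  private checkRow(row: number): void {
    if (row < 0 || row >= this.offsets.length) throw new RangeError(`row ${row} out of range`);
  }
}

const names = new ByteStringColumn(8); // small capacity to exercise growth
names.append("héllo");
names.append("");
names.append("日本語");
```

Keeping the bytes in a typed array means a row is materialized as a JS string only on demand, which is what lets byte-level operations skip that step.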
AdaptiveStringColumn (adaptive-string.ts)
- Cardinality sampling (first N strings)
- Auto-selection between Dictionary and StringView
- User override via DType hints
- No automatic O(N) switching during operations
Enhanced DType system (dtypes.ts)
- Added `StringDict`, `StringView`, `String` (adaptive) types
- Updated type mappings and constructors
Resource Management (string-resources.ts)
- Shared dictionary/buffer pooling
- Reference counting and lifecycle management
- Tag-based resource discovery
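
Reference counting with tag lookup, as listed above, might look roughly like this. The sketch is an assumption about behavior, not the API of `string-resources.ts`: `RefCountedPool` and its method names are hypothetical.

```typescript
interface PoolEntry<T> {
  resource: T;
  refCount: number;
  tags: Set<string>;
}

// Hypothetical pool: a resource is created on first acquire and dropped
// when its last holder releases it.
class RefCountedPool<T> {
  private readonly entries = new Map<string, PoolEntry<T>>();

  acquire(id: string, create: () => T, tags: string[] = []): T {
    let entry = this.entries.get(id);
    if (entry === undefined) {
      entry = { resource: create(), refCount: 0, tags: new Set(tags) };
      this.entries.set(id, entry);
    }
    entry.refCount++;
    return entry.resource;
  }

  // Returns true when this release freed the resource.
  release(id: string): boolean {
    const entry = this.entries.get(id);
    if (entry === undefined) throw new Error(`unknown resource: ${id}`);
    entry.refCount--;
    if (entry.refCount > 0) return false;
    this.entries.delete(id);
    return true;
  }

  // Tag-based discovery: ids of all live resources carrying the tag.
  findByTag(tag: string): string[] {
    const ids: string[] = [];
    this.entries.forEach((entry, id) => {
      if (entry.tags.has(tag)) ids.push(id);
    });
    return ids;
  }
}

const dictionaryPool = new RefCountedPool<string[]>();
const first = dictionaryPool.acquire("cities", () => ["Oslo", "Lima"], ["geo"]);
const second = dictionaryPool.acquire("cities", () => ["unused"], ["geo"]);
const sharedInstance = first === second; // both columns see the same dictionary
```

Two columns that acquire the same id share one dictionary instance, which is the precondition for the integer-based joins mentioned under the design decisions.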
UTF-8 Operations (utf8-ops.ts)
- Byte-level vs character-level operations
- Fast prefix/suffix filtering without materialization
- Sorting, grouping, duplicate detection
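
Prefix filtering without materialization can be sketched as a byte comparison against a needle that is encoded once. The names `PackedStrings`, `packStrings`, and `rowsWithPrefix` are hypothetical and do not reflect the actual `utf8-ops.ts` API. The approach relies on a property of UTF-8: a string starts with a given prefix exactly when its encoded bytes start with the prefix's encoded bytes (no Unicode normalization is applied here).

```typescript
const prefixEncoder = new TextEncoder();

interface PackedStrings {
  bytes: Uint8Array;
  offsets: Uint32Array;
  lengths: Uint32Array;
}

// Helper for the example: packs JS strings into one contiguous UTF-8 buffer.
function packStrings(values: string[]): PackedStrings {
  const encoded = values.map((v) => prefixEncoder.encode(v));
  const total = encoded.reduce((sum, e) => sum + e.length, 0);
  const bytes = new Uint8Array(total);
  const offsets = new Uint32Array(values.length);
  const lengths = new Uint32Array(values.length);
  let cursor = 0;
  encoded.forEach((e, row) => {
    bytes.set(e, cursor);
    offsets[row] = cursor;
    lengths[row] = e.length;
    cursor += e.length;
  });
  return { bytes, offsets, lengths };
}

// Returns the row indices whose value starts with `prefix`,
// comparing raw bytes and never building a JS string per row.
function rowsWithPrefix(column: PackedStrings, prefix: string): number[] {
  const needle = prefixEncoder.encode(prefix); // encoded once, reused for every row
  const matches: number[] = [];
  for (let row = 0; row < column.offsets.length; row++) {
    if (column.lengths[row] < needle.length) continue; // too short to match
    const start = column.offsets[row];
    let equal = true;
    for (let i = 0; i < needle.length; i++) {
      if (column.bytes[start + i] !== needle[i]) {
        equal = false;
        break;
      }
    }
    if (equal) matches.push(row);
  }
  return matches;
}

const fruit = packStrings(["apple", "apricot", "banana", "ápple", "ap"]);
const startingWithAp = rowsWithPrefix(fruit, "ap");
```

The length check rejects short rows before any byte is read, and the early `break` stops at the first mismatching byte.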
EnhancedColumnBuffer (enhanced-column-buffer.ts)
- Unified interface for all column types
- Transparent adaptive string handling
Comprehensive Tests (enhanced-strings.test.ts, enhanced-strings-benchmark.test.ts)
- Unit tests for all components
- Integration tests for workflows
- Performance benchmarks
❌ What's Missing - Integration Work
1. Chunk Integration ✅ COMPLETED
The existing Chunk class (chunk.ts) only knows about DTypeKind.String
Status: DONE
- ✅ EnhancedChunk supports both StringView and Dictionary columns
- ✅ ColumnStorageAdapter provides unified access
- ✅ EnhancedChunkBuilder for easy construction
- ✅ All 8 integration tests passing
- ✅ Exported in public API
2. String Operations Update ✅ COMPLETED
String ops (string-ops-enhanced.ts) now work with both Dictionary and StringView:
- `trimColumnEnhanced(column: EnhancedColumnBuffer)`
- `replaceColumnEnhanced(column, pattern, replacement, all)`
- `toLowerColumnEnhanced` / `toUpperColumnEnhanced`
- `substringColumnEnhanced` / `padColumnEnhanced`
Implementation:
- ✅ StringView-aware versions of trim, replace, toLowerCase, toUpperCase, etc.
- ✅ Detection of encoding type and dispatch to appropriate implementation
- ✅ UTF8Utils integration for byte-level operations
- ✅ Full backward compatibility with existing Dictionary operations
- ✅ Comprehensive test suite (29 tests) verifying both encodings
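
The "detect encoding, then dispatch" pattern can be sketched with a discriminated union. This is not the code in `string-ops-enhanced.ts`: the `StringColumn` type and `trimColumn` function are hypothetical, and the StringView branch trims ASCII whitespace only, whereas `String.prototype.trim` also removes other Unicode whitespace.

```typescript
type DictStrings = { kind: "dictionary"; codes: Int32Array; dictionary: string[] };
type ViewStrings = {
  kind: "stringView";
  bytes: Uint8Array;
  offsets: Uint32Array;
  lengths: Uint32Array;
};
type StringColumn = DictStrings | ViewStrings;

// Space, tab, LF, VT, FF, CR.
const isAsciiSpace = (b: number): boolean => b === 0x20 || (b >= 0x09 && b <= 0x0d);

function trimColumn(column: StringColumn): StringColumn {
  switch (column.kind) {
    case "dictionary":
      // Work is proportional to the number of unique values, not rows.
      // Note: distinct entries may become equal after trimming; this
      // sketch does not re-deduplicate the dictionary.
      return { ...column, dictionary: column.dictionary.map((s) => s.trim()) };
    case "stringView": {
      // No bytes are copied: only the per-row window is narrowed.
      const offsets = new Uint32Array(column.offsets.length);
      const lengths = new Uint32Array(column.lengths.length);
      for (let row = 0; row < column.offsets.length; row++) {
        let start = column.offsets[row];
        let end = start + column.lengths[row];
        while (start < end && isAsciiSpace(column.bytes[start])) start++;
        while (end > start && isAsciiSpace(column.bytes[end - 1])) end--;
        offsets[row] = start;
        lengths[row] = end - start;
      }
      return { ...column, offsets, lengths };
    }
  }
}

const trimmedDict = trimColumn({
  kind: "dictionary",
  codes: Int32Array.from([0, 1, 0]),
  dictionary: ["  a ", "b"],
});

// Rows: "  hi " (bytes 0..5), "x" (bytes 5..6), "   " (bytes 6..9).
const trimmedView = trimColumn({
  kind: "stringView",
  bytes: new TextEncoder().encode("  hi x   "),
  offsets: Uint32Array.from([0, 5, 6]),
  lengths: Uint32Array.from([5, 1, 3]),
});
```

Each encoding gets its cheapest strategy: the dictionary branch touches only unique values, and the StringView branch rewrites two integers per row.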
3. DataFrame Integration ✅ COMPLETED
What works:
- ✅ `fromRecords()` now uses AdaptiveStringColumn for optimal string encoding
- ✅ `fromColumns()` now uses AdaptiveStringColumn for optimal string encoding
- ✅ Automatic detection of high vs low cardinality strings
- ✅ Support for explicit encoding hints (`DType.stringDict`, `DType.stringView`)
- ✅ Proper handling of nullable string columns
- ✅ String materialization in `toArray()` for all encoding types
- ✅ Basic DataFrame operations (filter, select) work with enhanced strings
- ✅ 19 comprehensive integration tests passing
Implementation Details:
- Both `fromRecords` and `fromColumns` now use `AdaptiveStringColumn` to:
  - Sample first N strings to detect cardinality
  - Auto-select Dictionary encoding for low cardinality
  - Auto-select StringView encoding for high cardinality
  - Respect user hints (DType.stringDict / DType.stringView)
  - Handle nullable string columns properly
- Updated `toArray()` execution to materialize StringDict and StringView types
- Added comprehensive test coverage (test/enhanced-dataframe.test.ts)
What's tested:
- ✅ Dictionary encoding for low-cardinality strings
- ✅ StringView encoding for high-cardinality strings
- ✅ Explicit DType hints (stringDict, stringView)
- ✅ Null value handling in string columns
- ✅ Mixed-length string efficiency
- ✅ TypedArray support in fromColumns
- ✅ Filtering on string columns
- ✅ Selecting string columns
- ✅ Large dataset handling (10,000 rows)
- ✅ Edge cases (empty DataFrames, single rows, empty strings, unicode)
4. CSV Reader Updates ❌ NOT IMPLEMENTED
src/io/csv/source.ts currently:
- Always uses Dictionary for strings
- No cardinality detection during parsing
Missing:
- Sampling during CSV parsing
- Creating StringView columns for high-cardinality data
- User hints for string encoding in CsvSchemaSpec
5. Parquet Reader Updates ❌ NOT IMPLEMENTED
The Parquet reader has the same gaps as the CSV reader and needs StringView support.
6. Expression Compiler ❌ NOT IMPLEMENTED
compiler.ts needs to:
- Handle StringView columns in expressions
- Generate code that works with both Dictionary and StringView
- Use UTF8Utils for string comparisons when available
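
For ordering comparisons, a compiled expression could compare encoded bytes directly. The sketch below uses a hypothetical `compareUtf8` helper (the real `UTF8Utils` API is not shown in this roadmap). One detail worth knowing: byte-wise order over UTF-8 matches Unicode code point order, while JavaScript's `<` on strings compares UTF-16 code units, so the two can disagree for characters outside the Basic Multilingual Plane.

```typescript
// Lexicographic comparison of two UTF-8 byte sequences.
// Negative: a sorts first. Positive: b sorts first. Zero: equal.
function compareUtf8(a: Uint8Array, b: Uint8Array): number {
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    const diff = a[i] - b[i];
    if (diff !== 0) return diff;
  }
  return a.length - b.length; // the shorter string sorts first on a tie
}

const cmpEncoder = new TextEncoder();

const fullwidthTilde = "\uFF5E"; // U+FF5E, inside the BMP
const grinningFace = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair in JS

// Code point order: U+FF5E comes before U+1F600.
const byteOrder = compareUtf8(
  cmpEncoder.encode(fullwidthTilde),
  cmpEncoder.encode(grinningFace),
);

// UTF-16 code unit order: 0xD83D < 0xFF5E, so the emoji sorts first.
const jsOrderPutsEmojiFirst = grinningFace < fullwidthTilde;
```

If Dictionary columns are compared as JS strings and StringView columns as bytes, the compiler would need to pick one ordering so both encodings sort identically.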
7. GroupBy/Join Operations ❌ NOT IMPLEMENTED
groupby.ts and src/ops/joins.ts need to:
- Leverage shared dictionaries for faster joins
- Hash-group on StringView bytes
- Detect when columns share dictionaries
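
The shared-dictionary idea can be illustrated with an inner join that never touches string data. This is a sketch under assumptions: `DictKeyColumn`, its `dictionaryId` field, and `joinOnSharedDictionary` are hypothetical and are not taken from `groupby.ts` or `src/ops/joins.ts`.

```typescript
// Hypothetical key column: integer codes plus the id of the dictionary they refer to.
interface DictKeyColumn {
  dictionaryId: string;
  codes: Int32Array;
}

// Inner join returning [leftRow, rightRow] pairs. Because both sides use the
// same dictionary, equal codes imply equal strings and no decoding is needed.
function joinOnSharedDictionary(
  left: DictKeyColumn,
  right: DictKeyColumn,
): Array<[number, number]> {
  if (left.dictionaryId !== right.dictionaryId) {
    throw new Error("columns do not share a dictionary; fall back to a string join");
  }
  // Build phase: index right-side rows by code.
  const rowsByCode = new Map<number, number[]>();
  right.codes.forEach((code, row) => {
    const rows = rowsByCode.get(code);
    if (rows === undefined) rowsByCode.set(code, [row]);
    else rows.push(row);
  });
  // Probe phase: look up each left-side code.
  const pairs: Array<[number, number]> = [];
  left.codes.forEach((code, leftRow) => {
    const rows = rowsByCode.get(code);
    if (rows === undefined) return;
    for (let i = 0; i < rows.length; i++) pairs.push([leftRow, rows[i]]);
  });
  return pairs;
}

const joined = joinOnSharedDictionary(
  { dictionaryId: "cities", codes: Int32Array.from([0, 1, 2]) },
  { dictionaryId: "cities", codes: Int32Array.from([2, 0, 0]) },
);
```

The id check stands in for "detect when columns share dictionaries"; without a shared dictionary the codes are not comparable and a string-based path is required.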
8. Type Casting ❌ NOT IMPLEMENTED
cast.ts:
- Support casting to/from StringDict and StringView
- Preserve encoding when possible
9. Buffer Pool ❌ NOT IMPLEMENTED
pool.ts:
- Pool EnhancedColumnBuffer instances
- Reuse StringView buffers
10. Export/Index Updates ✅ COMPLETED
index.ts: Export enhanced string classes
- ✅ All classes exported in src/buffer/index.ts
🎯 Implementation Goals
Primary Goals
- Backward Compatibility - Existing code continues to work
- Zero-Cost Abstraction - No performance penalty for non-string columns
- Transparent Optimization - Users get benefits without code changes
- Opt-in Control - Advanced users can force specific encodings
Performance Goals
- Dictionary: < 50MB memory for 10M rows with 100 unique values
- StringView: < 200MB memory for 10M rows with 10M unique values
- Adaptive: Choose correctly within 1000 samples
- UTF-8 ops: 2-10x faster than string materialization
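
A back-of-the-envelope check of the two memory targets, written as code so the arithmetic is explicit. Every layout detail here is an assumption, not a measurement of the library: 4-byte dictionary codes, 4-byte offsets and 4-byte lengths for StringView, 1 MB taken as 1,000,000 bytes, and null bitmaps and object overhead ignored.

```typescript
const BYTES_PER_MB = 1_000_000; // assumption: decimal megabytes
const ROWS = 10_000_000;

// Dictionary: one 4-byte code per row plus the unique strings themselves.
function dictionaryBytes(rows: number, uniqueValues: number, avgStringBytes: number): number {
  return rows * 4 + uniqueValues * avgStringBytes;
}

// StringView: a 4-byte offset and 4-byte length per row plus every string's bytes.
function stringViewBytes(rows: number, avgStringBytes: number): number {
  return rows * (4 + 4) + rows * avgStringBytes;
}

// Largest average string size that still fits a StringView budget.
function maxAvgStringBytes(rows: number, budgetBytes: number): number {
  return budgetBytes / rows - 8;
}

// 10M rows, 100 unique values of ~16 bytes: 40,001,600 bytes, about 40 MB.
const dictTotal = dictionaryBytes(ROWS, 100, 16);

// 10M unique values of ~10 bytes: 180,000,000 bytes, 180 MB.
const viewTotal = stringViewBytes(ROWS, 10);

// Under these assumptions a 200 MB budget allows 12 bytes per string on average.
const viewBreakEven = maxAvgStringBytes(ROWS, 200 * BYTES_PER_MB);
```

Under these assumptions the Dictionary target leaves roughly 10 MB of headroom, while the StringView target only holds for strings averaging under 12 bytes.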
API Design Goals
- Simple default: `DType.string` → automatic (starts with StringView)
- Explicit control: `DType.stringDict` / `DType.stringView`
- Shared resources: `{ sharedDictionaryId: "cities" }`
- Conversion API: `column.convertToDictionary()` for user control
📊 Migration Strategy
Phase 1: Core Infrastructure ✅ DONE
- [x] Enhanced DType system
- [x] StringView implementation
- [x] AdaptiveStringColumn
- [x] Resource management
- [x] UTF-8 operations
- [x] Tests
Phase 2: Integration ✅ COMPLETED
- [x] Update Chunk to support StringView columns ✅
- [x] Export enhanced string classes ✅
- [x] Update string-ops.ts to work with both encodings ✅
- [x] Update DataFrame creation functions (fromRecords, fromColumns) ✅
- [x] Add comprehensive integration tests ✅
- [ ] Update CSV/Parquet readers
Phase 3: Optimization ⏳ NOT STARTED
- [ ] Expression compiler optimizations
- [ ] GroupBy/Join leveraging shared dictionaries
- [ ] Buffer pooling for EnhancedColumnBuffer
Phase 4: Polish ⏳ NOT STARTED
- [ ] Documentation updates
- [ ] Migration guide
- [ ] Performance tuning
- [ ] Real-world benchmarks
🚧 Immediate Next Steps
- Update Chunk class to handle StringView columns ✅ DONE
- Create adapter layer in string-ops.ts to detect encoding and dispatch ✅ DONE
  - Implemented as `string-ops-enhanced.ts` with full encoding detection and dispatch
  - All 29 tests passing for both Dictionary and StringView encodings
- Update fromRecords/fromColumns to use EnhancedColumnBuffer ✅ DONE
  - Both functions now use AdaptiveStringColumn
  - Automatic cardinality detection
  - Support for explicit encoding hints
  - 19 comprehensive tests passing
- Add integration tests that exercise full DataFrame pipeline ✅ DONE
- Update index.ts to export enhanced string classes ✅ DONE
Next priorities:
- CSV reader integration - Update CSV reader to use AdaptiveStringColumn during parsing
- Parquet reader integration - Update Parquet reader to support StringView
- Expression compiler - Update compiler to handle StringView in expressions
- GroupBy/Join optimizations - Leverage shared dictionaries for performance
📖 Key Design Decisions
| Decision | Rationale |
|---|---|
| StringView as default | O(1) insertion, handles any cardinality |
| No auto-switching | Avoids O(N) latency spikes during operations |
| Sampling-based | Non-intrusive cardinality detection |
| Shared dictionaries | Enable fast integer-based joins |
| UTF-8 byte ops | Avoid V8 string materialization tax |
| Separate null bitmap | SIMD-friendly null handling |
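
The "separate null bitmap" decision can be illustrated with a one-bit-per-row structure kept apart from the value buffers. `NullBitmap` is a hypothetical class, and the convention that a set bit means null is an assumption (some columnar formats use the opposite convention, where a set bit means valid).

```typescript
// Hypothetical bitmap: bit i is set when row i is null.
class NullBitmap {
  private readonly bits: Uint8Array;
  readonly rowCount: number;

  constructor(rowCount: number) {
    this.rowCount = rowCount;
    this.bits = new Uint8Array(Math.ceil(rowCount / 8)); // 8 rows per byte
  }

  get byteLength(): number {
    return this.bits.length;
  }

  setNull(row: number): void {
    this.checkRow(row);
    this.bits[row >> 3] |= 1 << (row & 7);
  }

  isNull(row: number): boolean {
    this.checkRow(row);
    return (this.bits[row >> 3] & (1 << (row & 7))) !== 0;
  }

  // Counts set bits one byte at a time, clearing the lowest set bit per step.
  nullCount(): number {
    let count = 0;
    for (let i = 0; i < this.bits.length; i++) {
      let b = this.bits[i];
      while (b !== 0) {
        b &= b - 1;
        count++;
      }
    }
    return count;
  }

  private checkRow(row: number): void {
    if (row < 0 || row >= this.rowCount) throw new RangeError(`row ${row} out of range`);
  }
}

const nulls = new NullBitmap(20); // 20 rows fit in 3 bytes
nulls.setNull(0);
nulls.setNull(9);
nulls.setNull(19);
```

Because nullness lives outside the value buffers, null checks and counts scan a compact byte array and never touch string data.
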
The core infrastructure is solid. The main work ahead is integration with existing DataFrame infrastructure to make the enhanced string handling transparent to users while maintaining backward compatibility.