Skip to content

Molniya Roadmap

This document tracks features that are documented but not yet fully implemented in the library.

Enhanced String Handling (NEW - In Progress) ✨

Completed ✅

  • [x] StringView implementation for high-cardinality strings
  • [x] AdaptiveStringColumn with cardinality sampling and automatic encoding selection
  • [x] EnhancedChunk supporting mixed StringView + Dictionary columns
  • [x] ColumnStorageAdapter for unified column access
  • [x] UTF-8 aware operations (prefix/suffix filtering, sorting)
  • [x] Shared resource management (dictionaries and buffers)
  • [x] Comprehensive unit tests (45/45 passing)
  • [x] Integration tests with DataFrame (basic operations working)
  • [x] Chunk integration complete (EnhancedChunk, EnhancedChunkBuilder)
  • [x] Public API exports (all enhanced string classes exported)
  • [x] fromRecords() integration: Uses AdaptiveStringColumn for automatic cardinality detection
    • High-cardinality strings auto-detected (cardinality > 10%)
    • Automatically chooses Dictionary vs StringView based on data
    • Seamlessly converts back to dictionary for pipeline compatibility

Partially Complete 🔄

  • CSV Integration: Dictionary encoding works perfectly for categorical data

    • ✅ Dictionary encoding for low-cardinality strings (existing implementation)
    • ❌ StringView for high-cardinality strings (TODO comment added for future work)
    • Future: Use AdaptiveStringColumn in streaming CSV parser
  • String Operations: Forward-looking code exists but depends on EnhancedColumnBuffer

    • ✅ Architecture designed in src/ops/string-ops.ts
    • ✅ StringView and Dictionary implementations drafted
    • ❌ EnhancedColumnBuffer class doesn't exist yet
    • ❌ Integration with DataFrame string methods incomplete
    • Current: Dictionary-only string operations work for existing code

Not Yet Implemented ❌

  • [ ] EnhancedColumnBuffer class (wrapper to unify ColumnBuffer + AdaptiveStringColumn)
  • [ ] DataFrame execution pipeline StringView support (currently converts to dict)
  • [ ] Expression compiler updates for StringView comparisons
  • [ ] GroupBy/Join optimizations with shared dictionaries
  • [ ] String operations (trim, replace) working with StringView columns

Implementation Notes

What Works Now:

  1. AdaptiveStringColumn correctly detects cardinality and chooses encoding
  2. fromRecords() uses adaptive encoding, converting high-cardinality data to dictionary
  3. All existing DataFrame operations work (filter, select, groupBy, etc.)
  4. Dictionary encoding efficiently handles categorical strings

Architecture Gap: The main gap is that we have two separate column systems:

  • Legacy: ColumnBuffer (used by Chunk, Pipeline, all operators)
  • New: AdaptiveStringColumn (used only in fromRecords, then converted)

To fully integrate StringView:

  1. Create EnhancedColumnBuffer that wraps both systems
  2. Update Pipeline/Operators to handle EnhancedColumnBuffer
  3. Update execution.ts toArray() to read StringView columns
  4. Complete string operation implementations

Core API Gaps

DataFrame Creation

  • [x] fromColumns() - Create DataFrame from column-oriented data with TypedArrays ✅
  • [x] fromArrays() - Create DataFrame from arrays of values ✅
  • [x] range() - Create DataFrame with sequence of numbers ✅
  • [ ] DataFrame.empty() - Static method needs verification

String Operations (ColumnRef methods)

  • [x] col().length() - String length ✅
  • [x] col().substring(start, len) - Extract substring ✅
  • [x] col().upper() / col().lower() - Case conversion ✅
  • [x] col().trim() - Remove whitespace ✅
  • [x] col().replace() - Replace substring ✅
  • [x] col().contains() - Check if contains substring ✅
  • [x] col().startsWith() / col().endsWith() - Prefix/suffix checks ✅

Date/Time Operations (ColumnRef methods)

  • [x] col().year() - Extract year ✅
  • [x] col().month() - Extract month ✅
  • [x] col().day() - Extract day ✅
  • [x] col().dayOfWeek() - Extract day of week ✅
  • [x] col().quarter() - Extract quarter ✅
  • [x] col().hour() / col().minute() / col().second() - Extract time components ✅
  • [x] col().addDays() / col().subDays() - Date arithmetic ✅
  • [x] col().diffDays() - Difference between dates ✅
  • [x] col().truncateDate() - Truncate to period ✅

Math Functions

  • [x] col().round(decimals) - Round to decimal places ✅
  • [x] col().floor() - Floor function ✅
  • [x] col().ceil() - Ceiling function ✅
  • [x] col().abs() - Absolute value ✅
  • [x] col().sqrt() - Square root ✅
  • [x] col().pow(exp) - Power/exponentiation ✅

Aggregation Functions

  • [x] std() - Standard deviation ✅
  • [x] var() - Variance ✅
  • [x] median() - Median value ✅
  • [x] countDistinct() - Count unique values ✅

Sorting Enhancements

  • [ ] asc().nullsLast() - Sort nulls last option
  • [ ] desc().nullsLast() - Sort nulls last option

Join Operations

  • [ ] rightJoin() - Right outer join
  • [ ] fullJoin() - Full outer join
  • [ ] Multi-column join support (array syntax)

DataFrame Methods

  • [x] df.union() - Union with deduplication ✅
  • [x] df.unionAll() - Union without deduplication ✅
  • [x] df.offset() / df.slice() - Pagination support ✅
  • [x] df.tail(n) - Last n rows ✅
  • [x] df.shuffle() - Random shuffle ✅
  • [x] df.sample(fraction) - Random sample ✅
  • [x] df.explode() - Explode array column into rows ✅

Utility Functions

  • [x] when().otherwise() - Complete conditional expression ✅
  • [x] coalesce() - First non-null value ✅
  • [x] between() - Range check ✅
  • [x] isIn() - Check if in array ✅

I/O Features

CSV

  • [x] readCsv() with projection option ✅
  • [x] readCsv() with filter predicate pushdown ✅

Parquet

  • [x] readParquet() with projection option ✅
  • [x] readParquet() with filter predicate pushdown ✅
  • [x] Complete Parquet type mapping (INT96 timestamps, complex types) ✅

Type System

Schema

  • [x] createSchema() export verification ✅
  • [x] Schema validation utilities ✅
  • [x] Schema comparison functions ✅

Type Casting

  • [x] toDate() / toTimestamp() - String to date parsing ✅
  • [x] formatDate() - Date to string formatting ✅
  • [x] parseJson() - Parse JSON strings ✅

Documentation Notes

The following documentation files contain API that may not be fully implemented:

  • guide/strings-dates.md - Many date/string methods documented
  • guide/expressions.md - Some expression builders documented
  • guide/aggregations.md - Statistical aggregations documented
  • api/column-ops.md - Many column operations documented
  • api/sort-limit.md - Some sorting options documented
  • api/groupby.md - Pivot functionality documented
  • api/joins.md - Right and full joins documented

Implementation Priority

High Priority (Core functionality)

  1. String operations on ColumnRef (length, substring, upper, lower, trim)
  2. Date extraction methods (year, month, day)
  3. Math functions (round, floor, ceil, abs)
  4. when().otherwise() completion

Medium Priority (Convenience)

  1. fromColumns() / fromArrays() creation methods
  2. df.slice() / df.offset() for pagination
  3. coalesce() function
  4. between() and isIn() operators

Low Priority (Advanced)

  1. Statistical aggregations (std, var, median)
  2. rightJoin() / fullJoin()
  3. df.shuffle() / df.sample()
  4. df.explode()

What We've Implemented

1. Enhanced String Handling System

  • StringView class for high-cardinality strings (string-view.ts)

    • UTF-8 byte storage with offset/length arrays
    • O(1) insertion, zero V8 heap pressure
    • Shared data buffers across columns
  • AdaptiveStringColumn (adaptive-string.ts)

    • Cardinality sampling (first N strings)
    • Auto-selection between Dictionary and StringView
    • User override via DType hints
    • No automatic O(N) switching during operations
  • Enhanced DType system (dtypes.ts)

    • Added StringDict, StringView, String (adaptive) types
    • Updated type mappings and constructors
  • Resource Management (string-resources.ts)

    • Shared dictionary/buffer pooling
    • Reference counting and lifecycle management
    • Tag-based resource discovery
  • UTF-8 Operations (utf8-ops.ts)

    • Byte-level vs character-level operations
    • Fast prefix/suffix filtering without materialization
    • Sorting, grouping, duplicate detection
  • EnhancedColumnBuffer (enhanced-column-buffer.ts)

    • Unified interface for all column types
    • Transparent adaptive string handling
  • Comprehensive Tests (enhanced-strings.test.ts, enhanced-strings-benchmark.test.ts)

    • Unit tests for all components
    • Integration tests for workflows
    • Performance benchmarks

What's Missing - Integration Work

1. Chunk IntegrationCOMPLETED

The existing Chunk class (chunk.ts) only knows about DTypeKind.String

Status: DONE

  • ✅ EnhancedChunk supports both StringView and Dictionary columns
  • ✅ ColumnStorageAdapter provides unified access
  • ✅ EnhancedChunkBuilder for easy construction
  • ✅ All 8 integration tests passing
  • ✅ Exported in public API

2. String Operations UpdateCOMPLETED

String ops (string-ops-enhanced.ts) now work with both Dictionary and StringView:

typescript
trimColumnEnhanced(column: EnhancedColumnBuffer)
replaceColumnEnhanced(column, pattern, replacement, all)
toLowerColumnEnhanced / toUpperColumnEnhanced
substringColumnEnhanced / padColumnEnhanced

Implementation:

  • ✅ StringView-aware versions of trim, replace, toLowerCase, toUpperCase, etc.
  • ✅ Detection of encoding type and dispatch to appropriate implementation
  • ✅ UTF8Utils integration for byte-level operations
  • ✅ Full backward compatibility with existing Dictionary operations
  • ✅ Comprehensive test suite (29 tests) verifying both encodings

3. DataFrame IntegrationCOMPLETED

What works:

  • fromRecords() now uses AdaptiveStringColumn for optimal string encoding
  • fromColumns() now uses AdaptiveStringColumn for optimal string encoding
  • ✅ Automatic detection of high vs low cardinality strings
  • ✅ Support for explicit encoding hints (DType.stringDict, DType.stringView)
  • ✅ Proper handling of nullable string columns
  • ✅ String materialization in toArray() for all encoding types
  • ✅ Basic DataFrame operations (filter, select) work with enhanced strings
  • 19 comprehensive integration tests passing

Implementation Details:

  • Both fromRecords and fromColumns now use AdaptiveStringColumn to:
    • Sample first N strings to detect cardinality
    • Auto-select Dictionary encoding for low cardinality
    • Auto-select StringView encoding for high cardinality
    • Respect user hints (DType.stringDict / DType.stringView)
    • Handle nullable string columns properly
  • Updated toArray() execution to materialize StringDict and StringView types
  • Added comprehensive test coverage (test/enhanced-dataframe.test.ts)

What's tested:

  • ✅ Dictionary encoding for low-cardinality strings
  • ✅ StringView encoding for high-cardinality strings
  • ✅ Explicit DType hints (stringDict, stringView)
  • ✅ Null value handling in string columns
  • ✅ Mixed-length string efficiency
  • ✅ TypedArray support in fromColumns
  • ✅ Filtering on string columns
  • ✅ Selecting string columns
  • ✅ Large dataset handling (10,000 rows)
  • ✅ Edge cases (empty DataFrames, single rows, empty strings, unicode)

4. CSV Reader UpdatesNOT IMPLEMENTED

src/io/csv/source.ts currently:

  • Always uses Dictionary for strings
  • No cardinality detection during parsing

Missing:

  • Sampling during CSV parsing
  • Creating StringView columns for high-cardinality data
  • User hints for string encoding in CsvSchemaSpec

5. Parquet Reader UpdatesNOT IMPLEMENTED

Similar issues with Parquet reader - needs StringView support

6. Expression CompilerNOT IMPLEMENTED

compiler.ts needs to:

  • Handle StringView columns in expressions
  • Generate code that works with both Dictionary and StringView
  • Use UTF8Utils for string comparisons when available

7. GroupBy/Join OperationsNOT IMPLEMENTED

groupby.ts and src/ops/joins.ts:

  • Leverage shared dictionaries for faster joins
  • Hash grouping on StringView bytes
  • Detect when columns share dictionaries

8. Type CastingNOT IMPLEMENTED

cast.ts:

  • Support casting to/from StringDict and StringView
  • Preserve encoding when possible

9. Buffer PoolNOT IMPLEMENTED

pool.ts:

  • Pool EnhancedColumnBuffer instances
  • Reuse StringView buffers

10. Export/Index UpdatesCOMPLETED

index.ts: Export enhanced string classes

  • ✅ All classes exported in src/buffer/index.ts

🎯 Implementation Goals

Primary Goals

  1. Backward Compatibility - Existing code continues to work
  2. Zero-Cost Abstraction - No performance penalty for non-string columns
  3. Transparent Optimization - Users get benefits without code changes
  4. Opt-in Control - Advanced users can force specific encodings

Performance Goals

  1. Dictionary: < 50MB memory for 10M rows with 100 unique values
  2. StringView: < 200MB memory for 10M rows with 10M unique values
  3. Adaptive: Choose correctly within 1000 samples
  4. UTF-8 ops: 2-10x faster than string materialization

API Design Goals

  1. Simple default: DType.string → automatic (starts with StringView)
  2. Explicit control: DType.stringDict / DType.stringView
  3. Shared resources: { sharedDictionaryId: "cities" }
  4. Conversion API: column.convertToDictionary() for user control

📊 Migration Strategy

Phase 1: Core Infrastructure ✅ DONE

  • [x] Enhanced DType system
  • [x] StringView implementation
  • [x] AdaptiveStringColumn
  • [x] Resource management
  • [x] UTF-8 operations
  • [x] Tests

Phase 2: IntegrationCOMPLETED

  • [x] Update Chunk to support StringView columns ✅
  • [x] Export enhanced string classes ✅
  • [x] Update string-ops.ts to work with both encodings ✅
  • [x] Update DataFrame creation functions (fromRecords, fromColumns) ✅
  • [x] Add comprehensive integration tests ✅
  • [ ] Update CSV/Parquet readers

Phase 3: Optimization ⏳ NOT STARTED

  • [ ] Expression compiler optimizations
  • [ ] GroupBy/Join leveraging shared dictionaries
  • [ ] Buffer pooling for EnhancedColumnBuffer

Phase 4: Polish ⏳ NOT STARTED

  • [ ] Documentation updates
  • [ ] Migration guide
  • [ ] Performance tuning
  • [ ] Real-world benchmarks

🚧 Immediate Next Steps

  1. Update Chunk class to handle StringView columnsDONE
  2. Create adapter layer in string-ops.ts to detect encoding and dispatchDONE
    • Implemented as string-ops-enhanced.ts with full encoding detection and dispatch
    • All 29 tests passing for both Dictionary and StringView encodings
  3. Update fromRecords/fromColumns to use EnhancedColumnBufferDONE
    • Both functions now use AdaptiveStringColumn
    • Automatic cardinality detection
    • Support for explicit encoding hints
    • 19 comprehensive tests passing
  4. Add integration tests that exercise full DataFrame pipelineDONE
  5. Update index.ts to export enhanced string classesDONE

Next priorities:

  1. CSV reader integration - Update CSV reader to use AdaptiveStringColumn during parsing
  2. Parquet reader integration - Update Parquet reader to support StringView
  3. Expression compiler - Update compiler to handle StringView in expressions
  4. GroupBy/Join optimizations - Leverage shared dictionaries for performance

📖 Key Design Decisions

DecisionRationale
StringView as defaultO(1) insertion, handles any cardinality
No auto-switchingAvoids O(N) latency spikes during operations
Sampling-basedNon-intrusive cardinality detection
Shared dictionariesEnable fast integer-based joins
UTF-8 byte opsAvoid V8 string materialization tax
Separate null bitmapSIMD-friendly null handling

The core infrastructure is solid. The main work ahead is integration with existing DataFrame infrastructure to make the enhanced string handling transparent to users while maintaining backward compatibility.