Core Concepts

Understanding these fundamental concepts will help you use Molniya effectively.

Columnar Storage

Molniya stores data in a columnar format, meaning each column is stored as a contiguous array rather than storing rows together:

Row-oriented (traditional):
[ {id: 1, name: "Alice"}, {id: 2, name: "Bob"} ]

Columnar (Molniya):
ids:    [1, 2]
names:  ["Alice", "Bob"]  // Actually dictionary-encoded indices

This provides several advantages:

Better cache locality when processing single columns
Vectorized operations using SIMD instructions
Efficient compression of similar values within a column
Reduced memory bandwidth for analytical queries

Lazy Evaluation

Molniya uses lazy evaluation, which means operations don't execute immediately. Instead, they build a logical execution plan:

typescript

const df = await readCsv("data.csv", schema);

// These just build a plan:
const step1 = df.filter(col("age").gt(18));
const step2 = step1.select("name", "salary");
const step3 = step2.limit(100);

// Execution happens here:
await step3.show();

The plan might be optimized before execution:

Predicate pushdown (filter before reading all columns)
Projection pushdown (only read needed columns)
Operator fusion (combine multiple operations)

Streaming vs Materialization

Streaming Operations

These process data in chunks and work with datasets larger than memory:

filter() - Row filtering
project() / select() - Column selection
limit() - Row limiting

Materializing Operations

These load all data into memory:

sort() - Requires all data to sort
groupBy() - Currently buffers all data
join() - Loads both DataFrames
collect() - Explicitly materializes

When working with large files, structure your pipeline to do filtering and projection before materializing operations:

typescript

// Good: Filter first, then sort a smaller dataset
const result = await df
  .filter(col("year").eq(2024))  // Streaming
  .sort("amount")                 // Materializes filtered data only
  .limit(10)
  .collect();

// Less efficient: Sorts everything first
const result = await df
  .sort("amount")                 // Materializes all data
  .filter(col("year").eq(2024))  // Then filters
  .limit(10)
  .collect();

Dictionary Encoding

String columns use dictionary encoding by default:

Original: ["apple", "banana", "apple", "cherry", "banana"]
Dictionary: ["apple", "banana", "cherry"]
Encoded:   [0, 1, 0, 2, 1]

Benefits:

Memory efficiency: Repeated strings stored once
Cache efficiency: Integer comparisons are faster
Vectorization: Integer arrays work better with SIMD

Type System

Molniya has a strict type system built on DType:

typescript

import { DType } from "molniya";

// Non-nullable types
DType.int8
DType.int16
DType.int32
DType.int64      // bigint in JS
DType.uint8
DType.uint16
DType.uint32
DType.uint64     // bigint in JS
DType.float32
DType.float64
DType.boolean
DType.string
DType.date       // Days since epoch
DType.timestamp  // Milliseconds since epoch

// Nullable variants
DType.nullable.int32
DType.nullable.string

TypeScript can infer the resulting types from operations:

typescript

const df = await readCsv("data.csv", {
  id: DType.int32,
  amount: DType.float64
});

// TypeScript knows 'id' is number, 'amount' is number
type RowType = { id: number; amount: number };

The Expression System

Expressions are the building blocks of transformations:

typescript

import { col, lit, and, sum, avg } from "molniya";

// Column references
col("name")

// Literals
lit(100)
lit("hello")
lit(null)

// Arithmetic
col("price").mul(1.1)
add(col("a"), col("b"))

// Comparisons
col("age").gte(18)

// Logical
and(col("a").gt(0), col("b").lt(100))

// Aggregations (only valid in groupBy context)
sum("amount")
avg("age")
count()

Expressions are compiled to optimized JavaScript functions at execution time.

Memory Management

Molniya uses a chunk-based memory model:

Data is processed in chunks (default ~64K rows)
Chunks are pooled and reused when possible
Streaming operations recycle chunks as they go

For most users, this is transparent. But if you're processing extremely large files:

typescript

// The pipeline automatically manages memory
const result = await readCsv("huge.csv", schema)
  .filter(col("active").eq(true))   // Chunks filtered and recycled
  .select("id", "name")              // Only needed columns kept
  .toArray();                        // Materializes result

Error Handling

Molniya uses a Result type internally but throws errors for user-facing APIs:

typescript

try {
  const df = await readCsv("missing.csv", schema);
  await df.show();
} catch (error) {
  // Error will have a descriptive message
  console.error(error.message);
}

Common errors:

Column not found
Type mismatch in operations
File not found or unreadable
Invalid expression

Next Steps

Now that you understand the core concepts:

Learn about Data Types in detail
See practical patterns in the Cookbook
Browse the API Reference

Core Concepts ​

Columnar Storage ​

Lazy Evaluation ​

Streaming vs Materialization ​

Streaming Operations ​

Materializing Operations ​

Dictionary Encoding ​

Type System ​

The Expression System ​

Memory Management ​

Error Handling ​

Next Steps ​

Core Concepts

Columnar Storage

Lazy Evaluation

Streaming vs Materialization

Streaming Operations

Materializing Operations

Dictionary Encoding

Type System

The Expression System

Memory Management

Error Handling

Next Steps