Core Concepts
Understanding these fundamental concepts will help you use Molniya effectively.
Columnar Storage
Molniya stores data in a columnar format, meaning each column is stored as a contiguous array rather than storing rows together:
Row-oriented (traditional):
[ {id: 1, name: "Alice"}, {id: 2, name: "Bob"} ]
Columnar (Molniya):
ids: [1, 2]
names: ["Alice", "Bob"] // Actually dictionary-encoded indicesThis provides several advantages:
- Better cache locality when processing single columns
- Vectorized operations using SIMD instructions
- Efficient compression of similar values within a column
- Reduced memory bandwidth for analytical queries
Lazy Evaluation
Molniya uses lazy evaluation, which means operations don't execute immediately. Instead, they build a logical execution plan:
const df = await readCsv("data.csv", schema);
// These just build a plan:
const step1 = df.filter(col("age").gt(18));
const step2 = step1.select("name", "salary");
const step3 = step2.limit(100);
// Execution happens here:
await step3.show();The plan might be optimized before execution:
- Predicate pushdown (filter before reading all columns)
- Projection pushdown (only read needed columns)
- Operator fusion (combine multiple operations)
Streaming vs Materialization
Streaming Operations
These process data in chunks and work with datasets larger than memory:
filter()- Row filteringproject()/select()- Column selectionlimit()- Row limiting
Materializing Operations
These load all data into memory:
sort()- Requires all data to sortgroupBy()- Currently buffers all datajoin()- Loads both DataFramescollect()- Explicitly materializes
When working with large files, structure your pipeline to do filtering and projection before materializing operations:
// Good: Filter first, then sort a smaller dataset
const result = await df
.filter(col("year").eq(2024)) // Streaming
.sort("amount") // Materializes filtered data only
.limit(10)
.collect();
// Less efficient: Sorts everything first
const result = await df
.sort("amount") // Materializes all data
.filter(col("year").eq(2024)) // Then filters
.limit(10)
.collect();Dictionary Encoding
String columns use dictionary encoding by default:
Original: ["apple", "banana", "apple", "cherry", "banana"]
Dictionary: ["apple", "banana", "cherry"]
Encoded: [0, 1, 0, 2, 1]Benefits:
- Memory efficiency: Repeated strings stored once
- Cache efficiency: Integer comparisons are faster
- Vectorization: Integer arrays work better with SIMD
Type System
Molniya has a strict type system built on DType:
import { DType } from "molniya";
// Non-nullable types
DType.int8
DType.int16
DType.int32
DType.int64 // bigint in JS
DType.uint8
DType.uint16
DType.uint32
DType.uint64 // bigint in JS
DType.float32
DType.float64
DType.boolean
DType.string
DType.date // Days since epoch
DType.timestamp // Milliseconds since epoch
// Nullable variants
DType.nullable.int32
DType.nullable.stringTypeScript can infer the resulting types from operations:
const df = await readCsv("data.csv", {
id: DType.int32,
amount: DType.float64
});
// TypeScript knows 'id' is number, 'amount' is number
type RowType = { id: number; amount: number };The Expression System
Expressions are the building blocks of transformations:
import { col, lit, and, sum, avg } from "molniya";
// Column references
col("name")
// Literals
lit(100)
lit("hello")
lit(null)
// Arithmetic
col("price").mul(1.1)
add(col("a"), col("b"))
// Comparisons
col("age").gte(18)
// Logical
and(col("a").gt(0), col("b").lt(100))
// Aggregations (only valid in groupBy context)
sum("amount")
avg("age")
count()Expressions are compiled to optimized JavaScript functions at execution time.
Memory Management
Molniya uses a chunk-based memory model:
- Data is processed in chunks (default ~64K rows)
- Chunks are pooled and reused when possible
- Streaming operations recycle chunks as they go
For most users, this is transparent. But if you're processing extremely large files:
// The pipeline automatically manages memory
const result = await readCsv("huge.csv", schema)
.filter(col("active").eq(true)) // Chunks filtered and recycled
.select("id", "name") // Only needed columns kept
.toArray(); // Materializes resultError Handling
Molniya uses a Result type internally but throws errors for user-facing APIs:
try {
const df = await readCsv("missing.csv", schema);
await df.show();
} catch (error) {
// Error will have a descriptive message
console.error(error.message);
}Common errors:
- Column not found
- Type mismatch in operations
- File not found or unreadable
- Invalid expression
Next Steps
Now that you understand the core concepts:
- Learn about Data Types in detail
- See practical patterns in the Cookbook
- Browse the API Reference