Performance Guide
Molniya is built for speed, but how you use it affects performance. Here is how to keep your pipelines flying.
1. Filter Early, Filter Often
The fastest data to process is data you don't process. Reduce the dataset size as early as possible in your chain.
Bad:
typescript
df
.sort('date') // Sorting 1M rows
.apply(complexMath) // Calculating on 1M rows
.filter(r => r.active); // Oh wait, we only needed 10k rowsGood:
typescript
df
.filter(r => r.active) // Drop 99% of rows immediately
.sort('date') // Sorting only 10k rows (FAST)
.apply(complexMath); // Calculating on 10k rows2. Select Columns Early
DataFrames store data in columns. If you have a 100-column CSV but only need 3, drop the rest immediately. This frees up massive amounts of memory.
typescript
// Reading only what you need (if supported by loader)
// or select immediately after load
const df = await readCsv('huge.csv');
const workingSet = df.select('id', 'value', 'date');
// The other 97 columns are now eligible for garbage collection3. Avoid Row Loops
Accessing data row-by-row is slower than column-based operations because of JavaScript object allocation overhead.
Slow (Row-based):
typescript
// Creates 1 million objects only to derive one value
const total = df.rows().reduce((sum, row) => sum + row.price, 0);Fast (Column-based):
typescript
// Uses a tight loop over a typed array
const total = df.col('price').sum();4. Large Datasets? Use scanCsv
If your CSV is bigger than your RAM (e.g., 2GB+ file on a laptop), readCsv will crash. Use scanCsv to create a LazyFrame.
typescript
// Doesn't load anything yet
const lazy = await scanCsv('massive_log.csv');
// Operations are recorded, not executed
const result = lazy
.filter(row => row.status === 500)
.select('url', 'latency')
.head(100) // We only need top 100
.collect(); // NOW it reads just enough of the file to answer the query5. Type Selection
- Strings are expensive: They take more memory and are slower to compare than numbers. If you have a categorical column like "Status" with values "OPEN", "CLOSED", consider mapping them to integers
0and1if you are memory constraint. - Float64 vs Int32: Molniya defaults to
float64for numbers for safety. If you have massive arrays of small integers, forcingint32can save memory (though JS engines are erratic about this).
Benchmarking
Always measure.
typescript
console.time('analysis');
// ... your code ...
console.timeEnd('analysis');