Grouping & Aggregating
Summarize data by categories using groupBy() and aggregation functions.
Basic GroupBy
Simple Count
import { col, count } from "molniya";
// Count rows per category
const result = df
.groupBy("category")
.count();
await result.show();
// Output: category, countMultiple Aggregations
import { col, sum, avg, min, max, count } from "molniya";
const result = df
.groupBy("department", [
{ name: "total_salary", expr: sum("salary") },
{ name: "avg_salary", expr: avg("salary") },
{ name: "min_salary", expr: min("salary") },
{ name: "max_salary", expr: max("salary") },
{ name: "employee_count", expr: count() }
]);
await result.show();Aggregation Functions
sum()
Calculate the total of a numeric column.
import { sum, col } from "molniya";
// Sum of a column
df.groupBy("category", [
{ name: "total", expr: sum("amount") }
])
// Sum of an expression
df.groupBy("category", [
{ name: "total_with_tax", expr: sum(col("amount").mul(1.1)) }
])avg()
Calculate the average (mean) of a numeric column.
import { avg } from "molniya";
df.groupBy("category", [
{ name: "average_price", expr: avg("price") }
])Null Handling
avg() automatically excludes null values from the calculation. If all values are null, the result is null.
min() / max()
Find the minimum or maximum value.
import { min, max } from "molniya";
df.groupBy("category", [
{ name: "lowest_price", expr: min("price") },
{ name: "highest_price", expr: max("price") }
])count()
Count rows or non-null values.
import { count } from "molniya";
// Count all rows
df.groupBy("category", [
{ name: "row_count", expr: count() }
])
// Count non-null values in a column
df.groupBy("category", [
{ name: "valid_emails", expr: count("email") }
])first() / last()
Get the first or last value in each group.
import { first, last } from "molniya";
df.groupBy("category", [
{ name: "first_order", expr: first("order_date") },
{ name: "last_order", expr: last("order_date") }
])Ordering
first() and last() return values based on the physical order of data in the file. Use sort() before grouping if you need specific ordering.
Multi-Column Grouping
Group by multiple columns for more granular summaries:
df.groupBy(["year", "month", "category"], [
{ name: "total_sales", expr: sum("amount") },
{ name: "order_count", expr: count() }
])Shortcut Methods
The RelationalGroupedDataset returned by groupBy() has shortcut methods:
const grouped = df.groupBy("category");
// Instead of: grouped.agg([{ name: "sum_amount", expr: sum("amount") }])
grouped.sum("amount"); // Creates column "sum(amount)"
// Similarly:
grouped.avg("price"); // Column: "mean(price)"
grouped.min("price"); // Column: "min(price)"
grouped.max("price"); // Column: "max(price)"
grouped.count(); // Column: "count"Practical Examples
Sales Report by Category
const report = await df
.groupBy("category", [
{ name: "total_revenue", expr: sum(col("price").mul(col("quantity"))) },
{ name: "total_orders", expr: count() },
{ name: "avg_order_value", expr: avg(col("price").mul(col("quantity"))) },
{ name: "unique_customers", expr: count("customer_id") }
])
.sort(desc("total_revenue"))
.limit(10)
.toArray();Daily Active Users
const dau = await df
.filter(col("event_type").eq("login"))
.groupBy("date", [
{ name: "active_users", expr: count("user_id") }
])
.sort("date")
.toArray();Cohort Analysis
const cohorts = await df
.groupBy(["signup_month", "activity_month"], [
{ name: "retained_users", expr: count("user_id") }
])
.toArray();Pareto Analysis (80/20 Rule)
// Find top 20% of customers generating 80% of revenue
const customerTotals = await df
.groupBy("customer_id", [
{ name: "total_spent", expr: sum("amount") }
])
.sort(desc("total_spent"))
.toArray();
// Calculate cumulative percentage
// (This would need to be done in JavaScript after collection)Combining with Other Operations
Filter Before Grouping
// Only include completed orders
const result = await df
.filter(col("status").eq("completed"))
.groupBy("category", [
{ name: "total", expr: sum("amount") }
])
.collect();Transform After Grouping
const result = await df
.groupBy("category", [
{ name: "total", expr: sum("amount") }
])
.withColumn("percentage",
col("total").div(col("total").sum())
)
.collect();Post-GroupBy Operations
Currently, operations after groupBy() are limited. Complex transformations may need to be done after .collect().
Common Patterns
Top N per Group
Molniya doesn't have a built-in window function for "top N per group" yet. Workaround:
// Get all data, then process in JavaScript
const allData = await df
.groupBy(["category", "product"], [
{ name: "sales", expr: sum("amount") }
])
.toArray();
// Process in JS to get top 3 per category
const topPerCategory = allData.reduce((acc, row) => {
const cat = row.category;
if (!acc[cat]) acc[cat] = [];
acc[cat].push(row);
acc[cat].sort((a, b) => b.sales - a.sales);
acc[cat] = acc[cat].slice(0, 3);
return acc;
}, {});Running Totals
Running totals also require post-processing:
const daily = await df
.groupBy("date", [
{ name: "sales", expr: sum("amount") }
])
.sort("date")
.toArray();
// Calculate running total in JS
let runningTotal = 0;
const withRunningTotal = daily.map(row => ({
...row,
running_total: runningTotal += row.sales
}));Performance Considerations
- GroupBy materializes data: The entire dataset is loaded into memory for grouping
- Filter first: Reduce data volume before grouping
- Use appropriate types: Integer grouping keys are faster than strings
- Limit groups: Extremely high cardinality grouping (millions of unique keys) may cause memory issues
// Good: Filter first
const result = await df
.filter(col("year").eq(2024))
.groupBy("category", [...])
.collect();
// Less efficient: Groups all years
const result = await df
.groupBy(["year", "category"], [...])
.filter(col("year").eq(2024))
.collect();