GroupBy API
API reference for grouping and aggregating data.
groupBy()
Group DataFrame by one or more columns for aggregation.
groupBy(column: string): GroupedDataFrame<T>
groupBy(columns: string[]): GroupedDataFrame<T>
groupBy(column: string, aggregations: AggregationSpec[]): DataFrame<T>
groupBy(columns: string[], aggregations: AggregationSpec[]): DataFrame<T>Parameters:
column/columns- Column(s) to group byaggregations- Array of aggregation specifications (optional, for immediate aggregation)
Returns: GroupedDataFrame for further operations, or DataFrame if aggregations provided
Example:
import { col, sum, count } from "molniya";
// Group by single column
df.groupBy("category")
// Group by multiple columns
df.groupBy(["year", "month"])
// Group with immediate aggregation
df.groupBy("category", [
{ name: "total", expr: sum("amount") },
{ name: "count", expr: count() }
])AggregationSpec
Specification for an aggregation within a groupBy:
interface AggregationSpec {
name: string; // Name of the output column
expr: Expr; // Aggregation expression
}Example:
{ name: "total_sales", expr: sum("amount") }
{ name: "avg_price", expr: avg("price") }
{ name: "record_count", expr: count() }GroupedDataFrame Methods
When you call groupBy() without aggregations, you get a GroupedDataFrame with these methods:
agg()
Apply aggregations to grouped data.
agg(aggregations: AggregationSpec[]): DataFrame<T>Example:
import { col, sum, avg, count } from "molniya";
df.groupBy("category").agg([
{ name: "total", expr: sum("amount") },
{ name: "average", expr: avg("price") },
{ name: "count", expr: count() }
])count()
Count rows per group (convenience method).
count(): DataFrame<T>Example:
df.groupBy("category").count()
// Equivalent to: groupBy("category").agg([{ name: "count", expr: count() }])pivot()
Pivot groups into columns (cross-tabulation).
pivot(pivotColumn: string, values?: unknown[]): GroupedDataFrame<T>Example:
// Pivot sales by year
// Input: year | category | amount
// Output: category | 2023 | 2024
df.groupBy("category")
.pivot("year")
.agg([{ name: "total", expr: sum("amount") }])Aggregation Functions
sum()
Sum of values.
sum(column: string | ColumnRef): AggregationExprExample:
import { sum, col } from "molniya";
sum("amount")
sum(col("price").mul(col("quantity")))avg() / mean()
Average (arithmetic mean).
avg(column: string | ColumnRef): AggregationExpr
mean(column: string | ColumnRef): AggregationExprExample:
import { avg } from "molniya";
avg("score")
avg(col("total"))min()
Minimum value.
min(column: string | ColumnRef): AggregationExprExample:
import { min } from "molniya";
min("price")
min(col("date"))max()
Maximum value.
max(column: string | ColumnRef): AggregationExprExample:
import { max } from "molniya";
max("price")
max(col("date"))count()
Count of rows or non-null values.
count(): AggregationExpr // Count all rows
count(column: string | ColumnRef): AggregationExpr // Count non-null valuesExample:
import { count } from "molniya";
count() // Total rows per group
count("email") // Rows with non-null email per groupfirst()
First value in group.
first(column: string | ColumnRef): AggregationExprExample:
import { first } from "molniya";
first("name")
first(col("created_at"))Ordering
Use sort() before groupBy() to control which value is "first".
last()
Last value in group.
last(column: string | ColumnRef): AggregationExprExample:
import { last } from "molniya";
last("status")
last(col("updated_at"))std()
Standard deviation.
std(column: string | ColumnRef): AggregationExprExample:
import { std } from "molniya";
std("value")var()
Variance.
var(column: string | ColumnRef): AggregationExprExample:
import { var as variance } from "molniya";
variance("value")median()
Median value.
median(column: string | ColumnRef): AggregationExprExample:
import { median } from "molniya";
median("income")Conditional Aggregations
Filter aggregations within groups:
import { sum, count, col } from "molniya";
df.groupBy("region", [
{
name: "high_value_total",
expr: sum("amount").filter(col("amount").gt(1000))
},
{
name: "high_value_count",
expr: count().filter(col("amount").gt(1000))
},
{
name: "total_count",
expr: count()
}
])Multiple GroupBy Columns
Group by computed or multiple columns:
import { col, year, month, sum } from "molniya";
// Group by year and month
df.withColumn("year", year(col("date")))
.withColumn("month", month(col("date")))
.groupBy(["year", "month"], [
{ name: "revenue", expr: sum("amount") }
])
// Group by age bracket
df.withColumn("age_group",
when(col("age").lt(25), "18-24")
.when(col("age").lt(35), "25-34")
.otherwise("35+")
)
.groupBy("age_group", [
{ name: "count", expr: count() },
{ name: "avg_income", expr: avg("income") }
])Common Patterns
Summary Statistics
import { sum, avg, min, max, count, std } from "molniya";
df.groupBy("category", [
{ name: "total", expr: sum("amount") },
{ name: "average", expr: avg("amount") },
{ name: "min", expr: min("amount") },
{ name: "max", expr: max("amount") },
{ name: "count", expr: count() },
{ name: "std_dev", expr: std("amount") }
])Count Distinct
import { countDistinct } from "molniya";
df.groupBy("region", [
{ name: "unique_customers", expr: countDistinct("customer_id") },
{ name: "total_orders", expr: count() }
])First and Last
import { first, last } from "molniya";
df.sort(asc("date"))
.groupBy("customer_id", [
{ name: "first_order", expr: first("order_date") },
{ name: "last_order", expr: last("order_date") },
{ name: "latest_status", expr: last("status") }
])Performance Notes
- GroupBy operations require materializing data
- Pre-filter data before grouping when possible
- Too many unique group keys can cause memory issues
- Aggregations are computed in a single pass when possible