Joining DataFrames
Combine data from multiple sources using join operations.
Overview
Joins combine rows from two DataFrames based on matching values in specified columns. Molniya supports several join types:
| Join Type | Description | Use Case |
|---|---|---|
innerJoin | Only matching rows from both | Find orders with valid customers |
leftJoin | All left rows, matching right | All orders with customer info if available |
semiJoin | Left rows with matches in right | Find customers who have orders |
antiJoin | Left rows without matches | Find customers with no orders |
crossJoin | All combinations | Generate combinations |
Join Performance
Joins currently load both DataFrames into memory. For large datasets, filter before joining when possible.
Inner Join
Returns only rows where the join key exists in both DataFrames.
import { readCsv, DType, col } from "molniya";
const orders = await readCsv("orders.csv", {
orderId: DType.int32,
customerId: DType.int32,
amount: DType.float64
});
const customers = await readCsv("customers.csv", {
id: DType.int32,
name: DType.string,
email: DType.string
});
// Join orders with customer info
const withCustomers = await orders.innerJoin(
customers,
"customerId", // Column in orders
"id" // Column in customers
);
await withCustomers.select("orderId", "name", "amount").show();With Different Column Names
const result = await orders.innerJoin(
customers,
"customerId", // Left column
"customer_id" // Right column (different name)
);With Suffix for Overlapping Columns
const result = await orders.innerJoin(
customers,
"customerId",
"id",
"_right" // Suffix for overlapping column names
);Left Join
Returns all rows from the left DataFrame, with matching rows from the right (null if no match).
// All orders, with customer info where available
const withCustomers = await orders.leftJoin(
customers,
"customerId",
"id"
);
// Orders without customers will have null customer columnsSemi Join
Returns rows from the left DataFrame where the join key exists in the right.
// Find customers who have placed orders
const customersWithOrders = await customers.semiJoin(
orders,
"id" // customers.id matches orders.customerId
);
// Result only has columns from customersSemi Join vs Inner Join
Use semiJoin when you only need columns from the left DataFrame. It's more efficient than innerJoin followed by select().
Anti Join
Returns rows from the left DataFrame where the join key does NOT exist in the right.
// Find customers who haven't placed orders
const inactiveCustomers = await customers.antiJoin(
orders,
"id"
);
// Find orders without valid customers
const orphanedOrders = await orders.antiJoin(
customers,
"customerId"
);Cross Join
Returns the Cartesian product of both DataFrames (every combination).
const products = await readCsv("products.csv", {
productId: DType.int32,
name: DType.string
});
const regions = await readCsv("regions.csv", {
regionId: DType.int32,
regionName: DType.string
});
// All product-region combinations
const combinations = await products.crossJoin(regions);Cross Join Size
A cross join of two DataFrames with 1000 rows each produces 1,000,000 rows. Use with caution!
Multi-Column Joins
Currently, joins only support single-column keys. For multi-column joins, combine keys first:
// Workaround: Create composite key
const ordersWithKey = orders.withColumn(
"join_key",
col("customerId").mul(1000000).add(col("regionId"))
);
const customersWithKey = customers.withColumn(
"join_key",
col("customerId").mul(1000000).add(col("regionId"))
);
const result = await ordersWithKey.innerJoin(
customersWithKey,
"join_key"
);Practical Examples
Enriching Data
// Add category names to products
const productsWithCategory = await products
.innerJoin(categories, "categoryId", "id")
.select("productId", "productName", "categoryName", "price");Finding Missing Data
// Products without categories
const uncategorized = await products.antiJoin(
categories,
"categoryId"
);
// Categories without products
const emptyCategories = await categories.antiJoin(
products,
"id",
"categoryId"
);Many-to-Many Relationships
// Tags and products (through junction table)
const productTags = await products
.innerJoin(productTagJunction, "productId", "productId")
.innerJoin(tags, "tagId", "id");Self Join
// Find employees and their managers
const employeesWithManagers = await employees
.innerJoin(
employees, // Join with itself
"managerId", // Employee's manager
"employeeId", // Manager's ID
"_manager" // Suffix for manager columns
)
.select(
"name", // Employee name
"name_manager" // Manager name
);Performance Tips
- Filter before joining - Reduce data volume on both sides
- Use semi/anti joins - When you only need one side's columns
- Avoid cross joins - Unless you truly need all combinations
- Consider alternatives - Sometimes
withColumn()with a lookup is faster
// Good: Filter first
const recentOrders = orders.filter(col("date").gte("2024-01-01"));
const result = await recentOrders.innerJoin(customers, "customerId");
// Less efficient: Join then filter
const result = await orders
.innerJoin(customers, "customerId")
.filter(col("date").gte("2024-01-01"));Common Patterns
Lookup Table Pattern
// Instead of joining, use withColumn for simple lookups
const statusNames = {
1: "Pending",
2: "Processing",
3: "Shipped",
4: "Delivered"
};
// This requires collecting and processing in JS
const ordersWithStatus = await orders.toArray().then(rows =>
rows.map(row => ({
...row,
statusName: statusNames[row.statusId] || "Unknown"
}))
);Master-Detail Pattern
// Get order headers with their line items
const orderDetails = await orders
.innerJoin(orderLines, "orderId", "orderId")
.groupBy("orderId", [
{ name: "total", expr: sum(col("price").mul(col("quantity"))) },
{ name: "itemCount", expr: count() }
]);