To run this notebook, click on the molab shield above or run the following command at the terminal:
uvx marimo edit --sandbox --mcp --no-token --watch https://github.com/nll-ai/kirin/blob/main/docs/tutorials/commits.py
import marimo as mo
Working with Commits
This tutorial is a deep dive into Kirin's commit system. You'll learn how commits work and how to navigate commit history, compare commits, and use commits effectively in your data science workflows.
What You'll Learn
- Understanding commit structure and properties
- Navigating linear commit history
- Comparing commits to see what changed
- Working with specific commits
- Best practices for commit messages and workflows
Prerequisites
- Completed Your First Dataset tutorial
- Basic understanding of datasets and files
import tempfile
from datetime import datetime
from pathlib import Path
from kirin import Catalog
# Create a temporary directory for our tutorial
# In production, you might use: Catalog(root_dir="s3://my-bucket/data")
temp_dir = Path(tempfile.mkdtemp(prefix="kirin_commits_tutorial_"))
catalog = Catalog(root_dir=temp_dir)
# Create a new dataset
commit_demo_dataset = catalog.create_dataset(
"commit_demo", description="Demo dataset for commit tutorial"
)
# Create a directory for our data files
data_dir = temp_dir / "sample_data"
data_dir.mkdir(exist_ok=True)
print(f"✅ Created dataset: {commit_demo_dataset.name}")
print(f" Dataset root: {commit_demo_dataset.root_dir}")
Step 1: Understanding Commit Structure
A commit in Kirin is an immutable snapshot of files at a specific point in time. Unlike Git, Kirin uses a linear commit history: each commit has exactly one parent (the initial commit has none), creating a simple chain:
Initial Commit → Commit 2 → Commit 3 → Commit 4
Let's create some commits and explore what makes up a commit.
# Create first commit
file1 = data_dir / "data.csv"
file1.write_text("name,value\nA,10\nB,20\n")
commit_msg1 = f"Initial data - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
commit_hash1 = commit_demo_dataset.commit(
message=commit_msg1, add_files=[str(file1)]
)
print(f"✅ Created first commit: {commit_hash1[:8]}")
# Create second commit with updated data (same filename - versioning!)
file2 = data_dir / "data.csv"
file2.write_text("name,value\nA,10\nB,20\nC,30\n")
commit_msg2 = f"Add more data - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
commit_hash2 = commit_demo_dataset.commit(
message=commit_msg2, add_files=[str(file2)]
)
print(f"✅ Created second commit: {commit_hash2[:8]}")
print(" Note: Same filename - versioning handled by commits!")
# Get the current commit and explore its properties
current_commit = commit_demo_dataset.current_commit
if current_commit:
print("📊 Commit Properties:")
print(f" Hash: {current_commit.hash}")
print(f" Short hash: {current_commit.short_hash}")
print(f" Message: {current_commit.message}")
print(f" Timestamp: {current_commit.timestamp}")
parent_hash_display = (
current_commit.parent_hash[:8]
if current_commit.parent_hash
else "None (initial)"
)
print(f" Parent hash: {parent_hash_display}")
print(f" Files: {current_commit.list_files()}")
print(f" File count: {current_commit.get_file_count()}")
print(f" Total size: {current_commit.get_total_size()} bytes")
print(f" Is initial: {current_commit.is_initial}")
Key properties:
- Hash: Unique identifier (SHA256) for the commit
- Message: Human-readable description of what changed
- Timestamp: When the commit was created
- Parent hash: Reference to the previous commit (None for initial commit)
- Files: Dictionary of File objects in this commit (see the sketch below)
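Because a commit's files are exposed as a dictionary of File objects, you can iterate them directly. A minimal sketch, assuming the attribute is named files and that File objects expose the size and hash attributes used later in this tutorial:
# Iterate the files dictionary on the current commit.
# Assumes commit.files maps filename -> File, with .size and .hash attributes.
if current_commit:
    for filename, file_obj in current_commit.files.items():
        print(f"  {filename}: {file_obj.size} bytes (hash {file_obj.hash[:8]})")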
Step 2: Viewing Commit History
The commit history is a linear sequence. Let's explore it to see how commits are organized.
# Get all commits (newest to oldest)
history = commit_demo_dataset.history()
# Reverse to show oldest first for tutorial clarity
history_oldest_first = list(reversed(history))
print(f"📊 Total commits: {len(history)}")
print("\nCommit History (oldest → newest):")
print("=" * 50)
for step_num, history_commit in enumerate(history_oldest_first, 1):
print(f"\n{step_num}. {history_commit.short_hash}: {history_commit.message}")
print(f" Date: {history_commit.timestamp.strftime('%Y-%m-%d %H:%M:%S')}")
print(f" Files: {', '.join(history_commit.list_files())}")
parent_display = (
history_commit.parent_hash[:8]
if history_commit.parent_hash
else "None (initial)"
)
print(f" Parent: {parent_display}")
Understanding the history:
- History is returned from newest to oldest (latest commit first)
- Each commit (except the first) has a parent
- The latest commit is history[0], or use dataset.current_commit
- To display oldest first, reverse the list: list(reversed(history))
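As a quick sanity check, the two ways of reaching the latest commit should agree. A minimal sketch, valid as long as no older commit is currently checked out:
# history() is newest-first, so history[0] should match current_commit
# (assuming the dataset is positioned at its latest commit).
latest_from_history = commit_demo_dataset.history()[0]
assert latest_from_history.hash == commit_demo_dataset.current_commit.hash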
Step 3: Limiting History
For large datasets with many commits, you can limit how many commits to retrieve. This is useful for performance and focusing on recent changes.
# Get only the 5 most recent commits
recent_commits = commit_demo_dataset.history(limit=5)
print(f"📊 Recent commits: {len(recent_commits)}")
for recent_commit in recent_commits:
print(f" {recent_commit.short_hash}: {recent_commit.message}")
Step 4: Getting Specific Commits
You can retrieve a specific commit by its hash. This is useful when you know exactly which commit you want to work with.
# Get a specific commit (using the oldest commit for demonstration)
oldest_commit_hash = history[-1].hash # Oldest is last in newest-first list
retrieved_commit = commit_demo_dataset.get_commit(oldest_commit_hash)
if retrieved_commit:
print(f"✅ Retrieved commit: {retrieved_commit.short_hash}")
print(f" Message: {retrieved_commit.message}")
print(f" Files: {retrieved_commit.list_files()}")
else:
print("❌ Commit not found")
Step 5: Checking Out Commits
"Checking out" a commit means switching the dataset to that commit's state. This lets you see what files were available at that point in time.
# Checkout the latest commit (default)
commit_demo_dataset.checkout()
current_commit_hash = (
commit_demo_dataset.current_commit.short_hash
if commit_demo_dataset.current_commit
else "None"
)
print(f"📂 Current commit: {current_commit_hash}")
print(f" Files: {list(commit_demo_dataset.files.keys())}")
# Checkout a specific commit (using the oldest commit)
oldest_commit_for_checkout = history[-1] # Oldest is last in newest-first list
commit_demo_dataset.checkout(oldest_commit_for_checkout.hash)
print("\n📂 After checkout:")
print(f" Current commit: {commit_demo_dataset.current_commit.short_hash}")
print(f" Files: {list(commit_demo_dataset.files.keys())}")
# Checkout latest again
commit_demo_dataset.checkout() # No argument = latest
print("\n📂 Back to latest:")
print(f" Current commit: {commit_demo_dataset.current_commit.short_hash}")
print(f" Files: {list(commit_demo_dataset.files.keys())}")
Important: Checking out a commit doesn't delete anything - it just changes which files are "current" in the dataset. All commits and their files remain accessible.
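You can verify this yourself with a minimal sketch: check out the oldest commit, then confirm the newest commit is still retrievable via get_commit.
# Checking out an old commit does not delete newer commits.
commit_demo_dataset.checkout(history[-1].hash)  # Jump to the oldest commit
still_there = commit_demo_dataset.get_commit(history[0].hash)  # Newest commit
print(f"Newest commit still accessible: {still_there.short_hash}")
commit_demo_dataset.checkout()  # Return to the latest commit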
Step 6: Comparing Commits
One of the most powerful features is comparing commits to see what changed between them. This helps you understand how your dataset evolved over time.
# Get two commits to compare (oldest vs newest)
oldest_commit = history[-1] # Oldest is last in newest-first list
newest_commit = history[0] # Newest is first in newest-first list
# Compare them (oldest first, then newest)
comparison = commit_demo_dataset.compare_commits(
oldest_commit.hash, newest_commit.hash
)
print("📊 Commit Comparison:")
print("=" * 50)
commit1_info = comparison["commit1"]
commit2_info = comparison["commit2"]
print(f"Commit 1: {commit1_info['hash'][:8]} - {commit1_info['message']}")
print(f"Commit 2: {commit2_info['hash'][:8]} - {commit2_info['message']}")
# File changes - compute manually by comparing file lists
print("\n📁 File Changes:")
files1 = set(oldest_commit.list_files())
files2 = set(newest_commit.list_files())
added_files = list(files2 - files1)
removed_files = list(files1 - files2)
common_files = list(files1 & files2)  # Present in both (content may differ)
if added_files:
print(f" Added: {added_files}")
if removed_files:
print(f" Removed: {removed_files}")
if common_files:
    print(f"  In both (content may differ): {common_files}")
# Metadata changes (if any)
if comparison.get("metadata_diff"):
metadata_diff = comparison["metadata_diff"]
if metadata_diff.get("added"):
print(f"\n📋 Metadata Added: {metadata_diff['added']}")
if metadata_diff.get("changed"):
print(f"📋 Metadata Changed: {metadata_diff['changed']}")
# Tag changes (if any)
if comparison.get("tags_diff"):
tags_diff = comparison["tags_diff"]
if tags_diff.get("added"):
print(f"\n🏷️ Tags Added: {tags_diff['added']}")
if tags_diff.get("removed"):
print(f"🏷️ Tags Removed: {tags_diff['removed']}")
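The manual file-list diff above comes up often enough to factor into a helper. A minimal sketch using only list_files(), which the Commit objects above already provide:
def diff_commit_files(older_commit, newer_commit):
    """Return (added, removed, common) filename lists between two commits."""
    older_files = set(older_commit.list_files())
    newer_files = set(newer_commit.list_files())
    return (
        sorted(newer_files - older_files),  # Added going older -> newer
        sorted(older_files - newer_files),  # Removed going older -> newer
        sorted(older_files & newer_files),  # Present in both
    )

added, removed, common = diff_commit_files(oldest_commit, newest_commit)
print(f"Added: {added}, Removed: {removed}, In both: {common}")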
Step 7: Understanding File Changes
Let's see how files change between commits by creating more commits and comparing them step by step.
# Create commits with file changes
file3 = data_dir / "data_v3.csv"
file3.write_text("name,value\nA,10\nB,20\nC,30\nD,40\n")
commit_msg3 = f"Add more rows - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
commit_demo_dataset.commit(message=commit_msg3, add_files=[str(file3)])
# Remove a file
commit_msg4 = f"Remove old file - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
commit_demo_dataset.commit(message=commit_msg4, remove_files=["data.csv"])
print("✅ Created additional commits with file changes")
# Update history
updated_history = commit_demo_dataset.history()
# Compare adjacent commits (each pair shown older → newer)
print("📊 Comparing adjacent commits (older → newer):")
print("=" * 50)
for step_index in range(len(updated_history) - 1):
    commit_newer = updated_history[step_index]  # Newer commit
    commit_older = updated_history[step_index + 1]  # Older commit
    # Compute file differences manually (what changed going older → newer)
    files_newer = set(commit_newer.list_files())
    files_older = set(commit_older.list_files())
    added_files_result = list(files_newer - files_older)
    removed_files_result = list(files_older - files_newer)
    print(f"\n{commit_older.short_hash} → {commit_newer.short_hash}:")
    if added_files_result:
        print(f"  + Added: {added_files_result}")
    if removed_files_result:
        print(f"  - Removed: {removed_files_result}")
Step 8: Commit Metadata and Tags
Commits can have metadata and tags for better organization. This is especially useful for tracking experiments, model versions, or data releases.
# Create a commit with metadata and tags
file4 = data_dir / "model_v1.pkl"
file4.write_text("fake model data")
commit_msg5 = f"Add trained model - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
commit_demo_dataset.commit(
message=commit_msg5,
add_files=[str(file4)],
metadata={
"accuracy": 0.92,
"framework": "sklearn",
"model_type": "RandomForest",
},
tags=["model", "v1.0", "production"],
)
print("✅ Created commit with metadata and tags")
# Access metadata and tags
latest_commit = commit_demo_dataset.current_commit
if latest_commit:
print(f"📊 Commit: {latest_commit.short_hash}")
print(f" Metadata: {latest_commit.metadata}")
print(f" Tags: {latest_commit.tags}")
Step 9: Finding Commits
You can search for commits using various criteria: tags, metadata filters, or message content. This makes it easy to find specific commits in large datasets.
# Find commits by tags
production_commits = commit_demo_dataset.find_commits(tags=["production"])
print(f"🏷️ Production commits: {len(production_commits)}")
for prod_commit in production_commits:
print(f" {prod_commit.short_hash}: {prod_commit.message}")
# Find commits by metadata filter
high_accuracy_commits = commit_demo_dataset.find_commits(
metadata_filter=lambda m: m.get("accuracy", 0) > 0.9
)
print(f"\n📊 High accuracy commits: {len(high_accuracy_commits)}")
for acc_commit in high_accuracy_commits:
accuracy_value = acc_commit.metadata.get("accuracy")
commit_info = f"{acc_commit.short_hash}: {acc_commit.message}"
print(f" {commit_info} (accuracy: {accuracy_value})")
# Find commits by message content
def find_by_message(commit_demo_dataset, search_term):
history = commit_demo_dataset.history()
return [c for c in history if search_term.lower() in c.message.lower()]
model_commits = find_by_message(commit_demo_dataset, "model")
print(f"\n🔍 Commits with 'model' in message: {len(model_commits)}")
for msg_commit in model_commits:
print(f" {msg_commit.short_hash}: {msg_commit.message}")
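You can also combine criteria. A minimal sketch that avoids assuming find_commits accepts multiple filters at once, by intersecting the results of two separate searches by hash:
# Intersect two searches: production-tagged commits with accuracy > 0.9.
production_hashes = {
    c.hash for c in commit_demo_dataset.find_commits(tags=["production"])
}
combined_commits = [
    c
    for c in commit_demo_dataset.find_commits(
        metadata_filter=lambda m: m.get("accuracy", 0) > 0.9
    )
    if c.hash in production_hashes
]
print(f"Production commits with accuracy > 0.9: {len(combined_commits)}")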
Step 10: Commit Statistics
Let's analyze commit patterns to understand how your dataset has evolved over time.
def analyze_commits(commit_demo_dataset):
history = commit_demo_dataset.history()
if not history:
print("No commits found")
return
print("📊 Commit Statistics:")
print("=" * 50)
# Basic stats
total_commits = len(history)
total_size = sum(c.get_total_size() for c in history)
avg_size = total_size / total_commits if total_commits > 0 else 0
print(f"Total commits: {total_commits}")
print(f"Total size: {total_size / (1024 * 1024):.2f} MB")
print(f"Average commit size: {avg_size / 1024:.2f} KB")
# File frequency
file_counts = {}
for commit in history:
for filename in commit.list_files():
file_counts[filename] = file_counts.get(filename, 0) + 1
print("\nMost frequently changed files:")
sorted_files = sorted(file_counts.items(), key=lambda x: x[1], reverse=True)[:5]
for filename, count in sorted_files:
print(f" {filename}: appears in {count} commits")
    # Time span (history is newest-first, so the oldest commit is last)
    if len(history) > 1:
        first_date = history[-1].timestamp
        last_date = history[0].timestamp
        time_span = last_date - first_date
        print(f"\nTime span: {time_span.days} days")
        print(f"  First commit: {first_date.strftime('%Y-%m-%d %H:%M')}")
        print(f"  Last commit: {last_date.strftime('%Y-%m-%d %H:%M')}")
analyze_commits(commit_demo_dataset)
Step 11: Working with Commit Files
You can access files from specific commits by checking out that commit and then using the standard file access methods.
# Get files from a specific commit (using oldest for demonstration)
target_commit = updated_history[-1] # Oldest is last in newest-first list
commit_demo_dataset.checkout(target_commit.hash)
print(f"📁 Files in commit {target_commit.short_hash}:")
with commit_demo_dataset.local_files() as local_files:
    for filename, local_path in local_files.items():
        file_obj = commit_demo_dataset.get_file(filename)
        print(f"  {filename}:")
        print(f"    Size: {file_obj.size} bytes")
        print(f"    Hash: {file_obj.hash[:16]}...")
        print(f"    Local path: {local_path}")
Step 12: Commit Workflows
Here are common commit workflow patterns that you can use in your data science projects.
Pattern 1: Linear Data Processing
Sequential data processing pipeline where each step commits its output.
# Sequential data processing pipeline
commit_demo_dataset.commit(
message="Add raw data", add_files=["raw_data.csv"]
)
# ... process data ...
commit_demo_dataset.commit(
message="Add cleaned data", add_files=["cleaned_data.csv"]
)
# ... analyze data ...
commit_demo_dataset.commit(
message="Add analysis results", add_files=["results.csv"]
)
Pattern 2: Experiment Tracking
Track different experiments with metadata and tags for easy comparison.
# Track different experiments
commit_demo_dataset.commit(
message="Experiment 1: Random Forest",
add_files=["rf_model.pkl", "rf_results.csv"],
metadata={"model": "RandomForest", "accuracy": 0.85},
tags=["experiment", "rf"]
)
commit_demo_dataset.commit(
message="Experiment 2: Gradient Boosting",
add_files=["gb_model.pkl", "gb_results.csv"],
metadata={"model": "GradientBoosting", "accuracy": 0.90},
tags=["experiment", "gb"]
)
Pattern 3: Versioned Releases
Version your data releases with tags for easy reference.
# Version your data releases
commit_demo_dataset.commit(
message="Release v1.0: Initial dataset",
add_files=["dataset_v1.csv"],
tags=["release", "v1.0"]
)
commit_demo_dataset.commit(
message="Release v1.1: Added features",
add_files=["dataset_v1.1.csv"],
tags=["release", "v1.1"]
)
Step 13: Best Practices
Following best practices helps you maintain a clean, understandable commit history that makes it easy to track changes and understand your dataset's evolution.
Write Clear Commit Messages
Good commit messages are descriptive and specific. They explain what changed and why, making it easy to understand the dataset's history.
# ✅ Good: Descriptive and specific
commit_demo_dataset.commit(
message="Add Q1 2024 sales data with customer demographics",
add_files=["sales_q1_2024.csv"]
)
# ✅ Good: Explains the change
commit_demo_dataset.commit(
message="Fix data quality issues: remove duplicates and handle missing values",
add_files=["customers_cleaned.csv"]
)
# ❌ Bad: Vague and unhelpful
commit_demo_dataset.commit(message="Update", add_files=["data.csv"])
commit_demo_dataset.commit(message="Fix", add_files=["file.csv"])
Make Atomic Commits
Each commit should represent a single logical change. This makes it easier to understand what changed and to revert specific changes if needed.
# ✅ Good: Single logical change
commit_demo_dataset.commit(message="Add customer data", add_files=["customers.csv"])
# ✅ Good: Related changes together
commit_demo_dataset.commit(
message="Update customer data and add validation rules",
add_files=["customers_updated.csv", "validation_rules.json"]
)
# ❌ Bad: Unrelated changes
commit_demo_dataset.commit(
message="Add customer data and fix bug",
add_files=["customers.csv", "bug_fix.py"]
)
Commit Regularly
Commit after each logical step in your workflow. This creates a clear history of how your dataset evolved and makes it easier to track changes.
# ✅ Good: Commit after each logical step
commit_demo_dataset.commit(
message="Add raw data", add_files=["raw_data.csv"]
)
# ... process data ...
commit_demo_dataset.commit(
message="Add cleaned data", add_files=["cleaned_data.csv"]
)
# ❌ Bad: Too many changes in one commit
# ... many processing steps ...
commit_demo_dataset.commit(
message="All changes",
add_files=["file1.csv", "file2.csv", "file3.csv", ...],
)
Step 14: Troubleshooting
Sometimes you need to find a specific commit or recover to a previous state. Here are some helpful techniques.
Finding Lost Commits
If you have a partial commit hash, you can search for the full commit.
def find_commit_by_hash(commit_demo_dataset, partial_hash):
history = commit_demo_dataset.history()
for commit in history:
if commit.hash.startswith(partial_hash):
return commit
return None
# Find commit using a partial hash (using oldest for demonstration)
found_commit = None
partial_hash = None
if updated_history:
partial_hash = updated_history[-1].hash[:8] # Oldest is last
found_commit = find_commit_by_hash(commit_demo_dataset, partial_hash)
if found_commit:
print(f"✅ Found: {found_commit.short_hash} - {found_commit.message}")
Recovering from Mistakes
You can recover your dataset to a specific commit state by checking out that commit.
def recover_to_commit(commit_demo_dataset, commit_hash):
commit = commit_demo_dataset.get_commit(commit_hash)
if not commit:
print(f"❌ Commit {commit_hash} not found")
return False
# Checkout the commit
commit_demo_dataset.checkout(commit_hash)
# Verify
if (
commit_demo_dataset.current_commit
and commit_demo_dataset.current_commit.hash == commit_hash
):
print(f"✅ Successfully recovered to commit {commit_hash[:8]}")
print(f" Message: {commit_demo_dataset.current_commit.message}")
return True
else:
print("❌ Failed to recover")
return False
# Recover to a known good commit (using oldest for demonstration)
recover_to_commit(commit_demo_dataset, updated_history[-1].hash)
Summary
Congratulations! You've learned how to work with commits in Kirin:
- ✅ Understanding commit structure - Hash, message, timestamp, parent, files
- ✅ Navigating history - Viewing, limiting, and finding commits
- ✅ Checking out commits - Switching between different commit states
- ✅ Comparing commits - Seeing what changed between commits
- ✅ Using metadata and tags - Organizing commits with additional information
- ✅ Finding commits - Searching by tags, metadata, or message
- ✅ Best practices - Writing good commit messages and workflows
Key Concepts
- Linear History: Each commit has one parent, creating a simple chain
- Immutable Snapshots: Commits are immutable - they never change
- Content-Addressed Files: Files are stored by content hash, not filename
- Checkout: Switching the dataset to a specific commit's state
Next Steps
- Cloud Storage Overview - Learn about using cloud storage backends
- Web UI Basics - Use the web interface to browse commits
- Track Model Training Data - See commits in action with ML workflows