To run this notebook, click the molab shield above or run the following command in your terminal:
uvx marimo edit --sandbox --mcp --no-token --watch https://github.com/nll-ai/kirin/blob/main/docs/tutorials/first-dataset.py
import marimo as mo
Your First Dataset
This tutorial will guide you through creating and working with your first Kirin dataset. By the end, you'll understand the core concepts of datasets and commits, and know how to work with versioned files.
What You'll Learn
- How to create a dataset
- How to add files to a dataset
- How to view commit history
- How to access files from different commits
- How to update your dataset with new files
Prerequisites
- Python 3.13 or higher
- Kirin installed (see Installation Guide)
Step 1: Understanding Datasets
A dataset in Kirin is a collection of versioned files. Think of it like a Git repository, but specifically designed for data files. Each dataset has:
- A name that identifies it
- A linear commit history that tracks changes over time
- Files that are stored using content-addressed storage
import tempfile
from pathlib import Path
from kirin import Catalog
# Create a temporary directory for our tutorial
# In production, you might use: Catalog(root_dir="s3://my-bucket/data")
temp_dir = Path(tempfile.mkdtemp(prefix="kirin_tutorial_"))
catalog = Catalog(root_dir=temp_dir)
# Create a new dataset
my_dataset = catalog.create_dataset(
    "my_first_dataset", description="My first Kirin dataset for learning"
)
my_dataset
Tip: When you display a dataset in a notebook cell (like my_dataset
above), Kirin shows an interactive HTML view with a "Copy Code to Access"
button for each file. The copied code uses "dataset" by default, but you
can customize it by setting my_dataset._repr_variable_name = "my_dataset".
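For example, to make the copied snippets reference my_dataset:
# Make the copied "Copy Code to Access" snippets reference `my_dataset`
# instead of the default `dataset` (per the tip above)
my_dataset._repr_variable_name = "my_dataset"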
Step 2: Preparing Your First Files
Before we can commit files, let's create some sample data files to work with.
# Create a directory for our data files
data_dir = temp_dir / "sample_data"
data_dir.mkdir(exist_ok=True)
# Create a simple CSV file
csv_file = data_dir / "data.csv"
csv_file.write_text("""name,age,city
Alice,28,New York
Bob,35,San Francisco
Carol,42,Chicago
""")
# Create a JSON configuration file
config_file = data_dir / "config.json"
config_file.write_text("""{
    "version": "1.0",
    "description": "Sample dataset configuration",
    "columns": ["name", "age", "city"]
}
""")
print("✅ Created files:")
print(f" - {csv_file.name}")
print(f" - {config_file.name}")
Step 3: Making Your First Commit
Now let's add these files to our dataset. This creates your first commit.
# Commit files to the dataset
my_dataset.commit(
    message="Initial commit: Add sample data and configuration",
    add_files=[str(csv_file), str(config_file)],
)
# Display the current commit with rich HTML
my_dataset
What just happened?
- Kirin calculated content hashes for each file
- Files were stored in content-addressed storage
- A commit was created that references these files
- The commit was added to the dataset's linear history
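You can also inspect the result programmatically: the current commit is exposed on the dataset. The sketch below uses only its hash attribute, which this tutorial relies on later; other attribute names may vary between Kirin versions.
# Inspect the commit object directly
commit = my_dataset.current_commit
print(f"Current commit hash: {commit.hash}")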
Step 4: Viewing Your Commit History
Let's see what we've created. You should see your first commit listed with the files you added.
# Display the dataset which shows files in the current commit
my_dataset
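For a programmatic view, history() returns the dataset's commits as a list, newest first (we rely on that ordering in Step 8 to grab the oldest commit). A minimal sketch that prints each commit's hash:
# history() returns commits newest-first; the oldest commit is last
for position, commit in enumerate(my_dataset.history()):
    print(f"{position}: {commit.hash}")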
Step 5: Accessing Files from a Commit
Now let's access the files from the current commit. The recommended way to
work with files is using the local_files() context manager. This downloads
files on-demand and cleans them up automatically.
Key points:
- Files are only downloaded when you access them (lazy loading)
- Files are automatically cleaned up when you exit the context manager
- You can use standard Python libraries (pandas, polars, etc.) with the local paths
# Access files as local paths
with my_dataset.local_files() as local_files:
    # Files are lazily downloaded when accessed
    csv_path = local_files["data.csv"]
    config_path = local_files["config.json"]

    # Now you can use standard Python file operations
    print("📂 Local file paths:")
    print(f" CSV: {csv_path}")
    print(f" Config: {config_path}")

    # Read file content
    csv_content = Path(csv_path).read_text()
    print("\n📝 CSV content:")
    print(csv_content)

    # Or use with data science libraries
    import pandas as pd

    df = pd.read_csv(csv_path)
    print("\n📊 DataFrame:")
    print(f" Shape: {df.shape}")
    print(df)
Step 6: Adding More Files
Let's add another file to see how the commit history grows.
from datetime import datetime
# Create a new file
results_file = data_dir / "results.txt"
results_file.write_text("""Analysis Results
================
Total records: 3
Average age: 35.0
Cities: New York, San Francisco, Chicago
""")
# Commit the new file
commit_msg = (
    f"Add analysis results - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
)
my_dataset.commit(
    message=commit_msg,
    add_files=[str(results_file)],
)
# Display the current commit with rich HTML
my_dataset.current_commit
Step 7: Viewing the Updated History
Let's see how the history has changed. Notice that the commit history is linear: each commit builds on the previous one.
# Fetch the updated commit history, then display the dataset
updated_history = my_dataset.history()
my_dataset
Step 8: Checking Out Different Commits
You can check out any commit to see which files were available at that point.
# Get the first commit
first_commit = updated_history[-1] # Oldest commit is last in history
# Checkout the first commit and display it
my_dataset.checkout(first_commit.hash)
first_commit
# Checkout the latest commit and display the dataset
my_dataset.checkout() # No argument = latest commit
my_dataset
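Checkout changes which commit's files the dataset exposes, so you can combine it with the local_files() context manager from Step 5 to read a file exactly as it existed at an earlier commit. A small sketch using only the APIs shown above:
# Read data.csv as it existed at the first commit, then return to latest
my_dataset.checkout(first_commit.hash)
with my_dataset.local_files() as local_files:
    print(Path(local_files["data.csv"]).read_text())
my_dataset.checkout()  # back to the latest commit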
Step 9: Understanding Content-Addressed Storage
One of Kirin's key features is content-addressed storage. This means:
- Files are stored by their content hash, not by filename
- Identical files are automatically deduplicated
- File integrity is guaranteed by the hash
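Conceptually, a content hash is just a digest of the file's raw bytes. Here is a minimal illustration using hashlib; Kirin's actual hash algorithm and encoding are implementation details and may differ:
import hashlib

def content_hash(path: Path) -> str:
    # Identical bytes always produce the same digest, which is what
    # makes deduplication possible
    return hashlib.sha256(path.read_bytes()).hexdigest()

print(content_hash(csv_file))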
Now let's demonstrate deduplication in Kirin itself by creating a duplicate file with the same content. Even though the two files have different names, they have the same content hash, so Kirin stores the content only once!
# Create a file with the same content as data.csv
duplicate_file = data_dir / "data_copy.csv"
duplicate_file.write_text(csv_file.read_text())
# Commit the duplicate
my_dataset.commit(
    message="Add duplicate file", add_files=[str(duplicate_file)]
)
# Check the file objects
original = my_dataset.get_file("data.csv")
duplicate = my_dataset.get_file("data_copy.csv")
print("🔍 Content-Addressed Storage Demo:")
print(f" Original file hash: {original.hash}")
print(f" Duplicate file hash: {duplicate.hash}")
print(f" Same content = Same hash: {original.hash == duplicate.hash}")
Step 10: Removing Files
You can also remove files from a dataset.
# Remove a file
my_dataset.commit(
    message="Remove duplicate file", remove_files=["data_copy.csv"]
)
# Display the dataset to see updated state
my_dataset
Step 11: Combining Operations
You can add and remove files in the same commit.
# Create a summary report file
summary_report = data_dir / "monthly_summary.json"
summary_report.write_text("""{
    "period": "2024-01",
    "total_records": 3,
    "average_age": 35.0,
    "cities": ["New York", "San Francisco", "Chicago"],
    "generated_at": "2024-01-15T10:00:00Z"
}
""")
# Add the summary report and remove the older analysis results file
my_dataset.commit(
    message="Add monthly summary, remove analysis results",
    add_files=[str(summary_report)],
    remove_files=["results.txt"],
)
# Display the dataset to see updated state
my_dataset
Summary
Congratulations! You've learned the fundamentals of working with Kirin datasets:
- ✅ Created a dataset using a catalog
- ✅ Made commits to track file changes
- ✅ Viewed commit history to see how your dataset evolved
- ✅ Accessed files from different commits
- ✅ Worked with files locally using the context manager
- ✅ Understood content-addressed storage and deduplication
- ✅ Updated datasets by adding and removing files
Key Concepts
- Dataset: A collection of versioned files with linear commit history
- Commit: A snapshot of files at a point in time
- Content-addressed storage: Files stored by content hash for integrity and deduplication
- Linear history: Simple, sequential commits without branching complexity
Next Steps
- Working with Commits - Deep dive into commit operations and history
- Cloud Storage Overview - Learn about using cloud storage backends
- Track Model Training Data - See a real-world example with ML models