Skip to content

Open in molab

To run this notebook, click on the molab shield above or run the following command at the terminal:

uvx marimo edit --sandbox --mcp --no-token --watch https://github.com/nll-ai/kirin/blob/main/docs/tutorials/first-dataset.py
import marimo as mo

Your First Dataset

This tutorial will guide you through creating and working with your first Kirin dataset. By the end, you'll understand the core concepts of datasets, commits, and how to work with versioned files.

What You'll Learn

  • How to create a dataset
  • How to add files to a dataset
  • How to view commit history
  • How to access files from different commits
  • How to update your dataset with new files

Prerequisites

Step 1: Understanding Datasets

A dataset in Kirin is a collection of versioned files. Think of it like a Git repository, but specifically designed for data files. Each dataset has:

  • A name that identifies it
  • A linear commit history that tracks changes over time
  • Files that are stored using content-addressed storage
import tempfile
from pathlib import Path

from kirin import Catalog

# Create a temporary directory for our tutorial
# In production, you might use: Catalog(root_dir="s3://my-bucket/data")
temp_dir = Path(tempfile.mkdtemp(prefix="kirin_tutorial_"))
catalog = Catalog(root_dir=temp_dir)

# Create a new dataset
my_dataset = catalog.create_dataset(
    "my_first_dataset", description="My first Kirin dataset for learning"
)
my_dataset

Tip: When you display a dataset in a notebook cell (like my_dataset above), Kirin shows an interactive HTML view with a "Copy Code to Access" button for each file. The copied code uses "dataset" by default, but you can customize it by setting my_dataset._repr_variable_name = "my_dataset".

Step 2: Preparing Your First Files

Before we can commit files, let's create some sample data files to work with.

# Create a directory for our data files
data_dir = temp_dir / "sample_data"
data_dir.mkdir(exist_ok=True)

# Create a simple CSV file
csv_file = data_dir / "data.csv"
csv_file.write_text("""name,age,city
Alice,28,New York
Bob,35,San Francisco
Carol,42,Chicago
""")

# Create a JSON configuration file
config_file = data_dir / "config.json"
config_file.write_text("""{
"version": "1.0",
"description": "Sample dataset configuration",
"columns": ["name", "age", "city"]
}
""")

print("✅ Created files:")
print(f"   - {csv_file.name}")
print(f"   - {config_file.name}")

Step 3: Making Your First Commit

Now let's add these files to our dataset. This creates your first commit.

# Commit files to the dataset
my_dataset.commit(
    message="Initial commit: Add sample data and configuration",
    add_files=[str(csv_file), str(config_file)],
)

# Display the current commit with rich HTML
my_dataset

What just happened?

  • Kirin calculated content hashes for each file
  • Files were stored in content-addressed storage
  • A commit was created that references these files
  • The commit was added to the dataset's linear history

Step 4: Viewing Your Commit History

Let's see what we've created. You should see your first commit listed with the files you added.

# Display the dataset which shows files in the current commit
my_dataset

Step 5: Accessing Files from a Commit

Now let's access the files from the current commit. The recommended way to work with files is using the local_files() context manager. This downloads files on-demand and cleans them up automatically.

Key points:

  • Files are only downloaded when you access them (lazy loading)
  • Files are automatically cleaned up when you exit the context manager
  • You can use standard Python libraries (pandas, polars, etc.) with the local paths
# Access files as local paths
with my_dataset.local_files() as local_files:
    # Files are lazily downloaded when accessed
    csv_path = local_files["data.csv"]
    config_path = local_files["config.json"]

    # Now you can use standard Python file operations
    print("📂 Local file paths:")
    print(f"   CSV: {csv_path}")
    print(f"   Config: {config_path}")

    # Read file content
    csv_content = Path(csv_path).read_text()
    print("\n📝 CSV content:")
    print(csv_content)

    # Or use with data science libraries
    import pandas as pd

    df = pd.read_csv(csv_path)
    print("\n📊 DataFrame:")
    print(f"   Shape: {df.shape}")
    print(df)

Step 6: Adding More Files

Let's add another file to see how the commit history grows.

from datetime import datetime

# Create a new file
results_file = data_dir / "results.txt"
results_file.write_text("""Analysis Results
================
Total records: 3
Average age: 35.0
Cities: New York, San Francisco, Chicago
""")

# Commit the new file
commit_msg = (
    f"Add analysis results - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
)
my_dataset.commit(
    message=commit_msg,
    add_files=[str(results_file)],
)

# Display the current commit with rich HTML
my_dataset.current_commit

Step 7: Viewing the Updated History

Let's see how the history has changed. Notice how the commit history is linear - each commit builds on the previous one.

# Display the dataset which shows updated commit history
updated_history = my_dataset.history()
my_dataset

Step 8: Checking Out Different Commits

You can checkout any commit to see what files were available at that point.

# Get the first commit
first_commit = updated_history[-1]  # Oldest commit is last in history

# Checkout the first commit and display it
my_dataset.checkout(first_commit.hash)
first_commit

# Checkout the latest commit and display the dataset
my_dataset.checkout()  # No argument = latest commit
my_dataset

Step 9: Understanding Content-Addressed Storage

One of Kirin's key features is content-addressed storage. This means:

  • Files are stored by their content hash, not by filename
  • Identical files are automatically deduplicated
  • File integrity is guaranteed by the hash

Let's demonstrate this by creating a duplicate file with the same content. Even though the files have different names, they have the same content hash, so Kirin stores them only once!

# Create a file with the same content as data.csv
duplicate_file = data_dir / "data_copy.csv"
duplicate_file.write_text(csv_file.read_text())

# Commit the duplicate
my_dataset.commit(
    message="Add duplicate file", add_files=[str(duplicate_file)]
)

# Check the file objects
original = my_dataset.get_file("data.csv")
duplicate = my_dataset.get_file("data_copy.csv")

print("🔍 Content-Addressed Storage Demo:")
print(f"   Original file hash: {original.hash}")
print(f"   Duplicate file hash: {duplicate.hash}")
print(f"   Same content = Same hash: {original.hash == duplicate.hash}")

Step 10: Removing Files

You can also remove files from a dataset.

# Remove a file
my_dataset.commit(
    message="Remove duplicate file", remove_files=["data_copy.csv"]
)

# Display the dataset to see updated state
my_dataset

Step 11: Combining Operations

You can add and remove files in the same commit.

# Create a summary report file
summary_report = data_dir / "monthly_summary.json"
summary_report.write_text("""{
"period": "2024-01",
"total_records": 3,
"average_age": 35.0,
"cities": ["New York", "San Francisco", "Chicago"],
"generated_at": "2024-01-15T10:00:00Z"
}
""")

# Add summary report and remove detailed processing log
# These are different types of files: summary vs detailed logs
my_dataset.commit(
    message="Add monthly summary, remove detailed processing logs",
    add_files=[str(summary_report)],
    remove_files=["results.txt"],
)

# Display the dataset to see updated state
my_dataset

Summary

Congratulations! You've learned the fundamentals of working with Kirin datasets:

  • Created a dataset using a catalog
  • Made commits to track file changes
  • Viewed commit history to see how your dataset evolved
  • Accessed files from different commits
  • Worked with files locally using the context manager
  • Understood content-addressed storage and deduplication
  • Updated datasets by adding and removing files

Key Concepts

  • Dataset: A collection of versioned files with linear commit history
  • Commit: A snapshot of files at a point in time
  • Content-addressed storage: Files stored by content hash for integrity and deduplication
  • Linear history: Simple, sequential commits without branching complexity

Next Steps