Kirin API Reference

Core Classes

Dataset

The main class for working with Kirin datasets.

from kirin import Dataset

# Create or load a dataset
dataset = Dataset(root_dir="/path/to/data", name="my-dataset")

Constructor Parameters

Dataset(
    root_dir: Union[str, Path],           # Root directory for the dataset
    name: str,                           # Name of the dataset
    description: str = "",               # Description of the dataset
    fs: Optional[fsspec.AbstractFileSystem] = None,  # Filesystem to use
    # Cloud authentication parameters
    aws_profile: Optional[str] = None,   # AWS profile for S3 authentication
    gcs_token: Optional[Union[str, Path]] = None,  # GCS service account token
    gcs_project: Optional[str] = None,   # GCS project ID
    azure_account_name: Optional[str] = None,  # Azure account name
    azure_account_key: Optional[str] = None,    # Azure account key
    azure_connection_string: Optional[str] = None,  # Azure connection string
)

Basic Operations

  • commit(message, add_files=None, remove_files=None, metadata=None, tags=None) - Commit changes to the dataset
  • checkout(commit_hash=None) - Switch to a specific commit (latest if None)
  • files - Dictionary of files in the current commit
  • local_files() - Context manager for accessing files as local paths
  • history(limit=None) - Get commit history
  • get_file(filename) - Get a file from the current commit
  • read_file(filename) - Read file content as text
  • download_file(filename, target_path) - Download file to local path

Model Versioning Operations

  • find_commits(tags=None, metadata_filter=None, limit=None) - Find commits matching criteria
  • compare_commits(hash1, hash2) - Compare metadata between two commits
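
Since metadata_filter accepts any callable over the metadata dict, filters can be composed with plain Python before being passed to find_commits. A minimal sketch (the all_of helper is illustrative, not part of Kirin):

```python
def all_of(*predicates):
    """Combine predicates so a commit's metadata must satisfy every one."""
    return lambda metadata: all(p(metadata) for p in predicates)

high_accuracy = lambda m: m.get("accuracy", 0) > 0.9
is_pytorch = lambda m: m.get("framework") == "pytorch"

combined = all_of(high_accuracy, is_pytorch)
# combined({"accuracy": 0.95, "framework": "pytorch"}) → True
```

The composed predicate can then be passed directly as metadata_filter=combined.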

Notebook Integration

Kirin provides rich HTML representations for datasets, commits, and catalogs that display beautifully in Jupyter and Marimo notebooks.

HTML Representation

When you display a Dataset, Commit, or Catalog object in a notebook cell, Kirin automatically generates an interactive HTML view with:

  • File Lists: Click on any file to preview its contents or reveal code snippets
  • File Previews:
      • CSV files: displayed as interactive tables with proper formatting
      • JSON files: formatted JSON with syntax highlighting
      • Text files: displayed in code blocks with proper formatting
      • Files larger than 100 KB are not previewed, for performance
  • Copy Code to Access Button: Each file has a button that copies Python code to your clipboard with the correct variable name
  • Commit History: Visual display of commit history with metadata
  • File Metadata: File sizes, content types, and icons

Example:

from kirin import Dataset

dataset = Dataset(root_dir="/path/to/data", name="my_dataset")

# Display in notebook - shows interactive HTML
dataset

Variable Names in Code Snippets

By default, code snippets use generic variable names ("dataset", "commit", or "catalog") based on the class type. You can customize the variable name used in code snippets by setting the _repr_variable_name attribute.

Default Behavior:

dataset = Dataset(root_dir="/path/to/data", name="my_dataset")

# Display in notebook - code snippets use "dataset" by default
dataset

When you click "Copy Code to Access" on a file, the copied code will use the default variable name:

# Get path to local clone of file
with dataset.local_files() as files:
    file_path = files["data.csv"]

Custom Variable Names:

If you want code snippets to use a different variable name, set the _repr_variable_name attribute:

my_dataset = Dataset(root_dir="/path/to/data", name="my_dataset")
my_dataset._repr_variable_name = "my_dataset"

# Now code snippets will use "my_dataset" instead of "dataset"
my_dataset  # Display in notebook

When you click "Copy Code to Access", the copied code will use your custom variable name:

# Get path to local clone of file
with my_dataset.local_files() as files:
    file_path = files["data.csv"]

Note: The _repr_variable_name attribute is only used for HTML representation and doesn't affect the actual dataset object.

Known Limitation (as of December 2025): The "Copy Code to Access" button does not work within Marimo notebooks running inside VS Code due to clipboard API restrictions. The button works correctly when notebooks are viewed in a web browser.

Method Details

commit(message, add_files=None, remove_files=None, metadata=None, tags=None)

Create a new commit with changes to the dataset.

Enhanced for ML artifacts:

  • Model objects: If add_files contains scikit-learn model objects, they are automatically serialized, and hyperparameters/metrics are extracted and added to metadata.

  • Plot objects: If add_files contains matplotlib or plotly figure objects, they are automatically converted to files (SVG for vector plots, WebP for raster plots) with format auto-detection.

Parameters:

  • message (str): Commit message describing the changes
  • add_files (List[Union[str, Path, Any]], optional): List of files (paths), model objects, or plot objects to add. Can include:
      • File paths (str or Path): regular files
      • scikit-learn model objects: automatically serialized, with hyperparameters and metrics extracted
      • matplotlib/plotly figure objects: automatically converted to SVG/WebP with format auto-detection
  • remove_files (List[str], optional): List of filenames to remove
  • metadata (Dict[str, Any], optional): Metadata dictionary (merged with auto-extracted metadata). For model-specific metadata, use metadata["models"][var_name] structure.
  • tags (List[str], optional): List of tags for staging/versioning

Returns:

  • str: Hash of the new commit

Examples:

# Basic commit with file paths
dataset.commit("Add new data", add_files=["data.csv"])

# Commit with scikit-learn model object (automatic serialization)
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

dataset.commit(
    message="Initial model",
    add_files=[model],  # Auto-serialized as "model.pkl"
    metadata={"accuracy": 0.95}  # Data-dependent metrics
)

# Multiple models with model-specific metadata
from sklearn.linear_model import LogisticRegression

rf_model = RandomForestClassifier(n_estimators=100)
rf_model.fit(X_train, y_train)
rf_accuracy = rf_model.score(X_test, y_test)

lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
lr_accuracy = lr_model.score(X_test, y_test)

dataset.commit(
    message="Compare models",
    add_files=[rf_model, lr_model],
    metadata={
        "models": {
            "rf_model": {"accuracy": rf_accuracy},  # Model-specific
            "lr_model": {"accuracy": lr_accuracy},  # Model-specific
        },
        "dataset": "iris",  # Shared metadata
    },
)

# Commit with matplotlib plot object (automatic conversion)
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])
ax.set_title("Training Loss")

dataset.commit(
    message="Add training plot",
    add_files=[fig],  # Auto-converted to SVG
)

# Mixed: model objects, plot objects, and file paths
dataset.commit(
    message="Model with plots",
    add_files=[
        model,  # Auto-serialized model
        fig,  # Auto-converted plot
        "config.json",  # Regular file path
    ],
)

# Traditional model versioning (still works)
dataset.commit(
    message="Improved model v2.0",
    add_files=["model.pt", "config.json"],
    metadata={
        "framework": "pytorch",
        "accuracy": 0.92,
        "hyperparameters": {"lr": 0.001, "epochs": 10}
    },
    tags=["production", "v2.0"]
)

Format Auto-Detection for Plots:

When plot objects are committed, Kirin automatically detects the optimal format:

  • SVG (vector): Default for matplotlib and plotly figures. Best for line plots, scatter plots, and other vector-based visualizations; scales to any resolution without quality loss.

  • WebP (raster): Used for plots with raster elements (e.g., images, heatmaps). Provides good compression while maintaining quality.

The format is chosen automatically from the plot's contents; SVG, the default for both libraries, suits most scientific visualizations.

Metadata Structure:

When model objects are committed, metadata is automatically structured as:

{
    "models": {
        "model_name": {
            "model_type": "RandomForestClassifier",  # Auto-extracted
            "hyperparameters": {...},  # Auto-extracted via get_params()
            "metrics": {...},  # Auto-extracted (feature_importances_, etc.)
            "sklearn_version": "1.3.0",  # Auto-extracted
            "accuracy": 0.95,  # User-provided, model-specific
            "source_file": "ml-workflow.py",  # Auto-detected
            "source_hash": "..."  # Auto-detected
        }
    },
    "dataset": "iris"  # Top-level metadata (shared)
}

Metadata Merging:

  • Auto-extracted metadata (hyperparameters, metrics, sklearn_version, source info) is added to each model's entry
  • User-provided model-specific metadata (via metadata["models"][var_name]) is merged into each model's entry
  • Top-level metadata (outside models dict) applies to the entire commit
  • User-provided metadata wins on conflicts
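
The merge rules above can be sketched in plain Python. This helper is illustrative, not Kirin's actual implementation:

```python
def merge_model_metadata(auto_extracted, user_provided):
    """Merge auto-extracted and user-provided per-model metadata.

    User-provided keys win on conflicts, per the rules above.
    """
    merged = dict(auto_extracted)
    merged.update(user_provided)  # later update overrides on key conflicts
    return merged

auto = {"model_type": "RandomForestClassifier", "sklearn_version": "1.3.0"}
user = {"accuracy": 0.95, "sklearn_version": "1.3.2"}  # conflicting key
merged = merge_model_metadata(auto, user)
# merged["sklearn_version"] → "1.3.2" (user-provided wins)
```
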

find_commits(tags=None, metadata_filter=None, limit=None)

Find commits matching specified criteria.

Parameters:

  • tags (List[str], optional): Filter by tags (commits must have ALL specified tags)
  • metadata_filter (Callable[[Dict], bool], optional): Function that takes metadata dict and returns bool
  • limit (int, optional): Maximum number of commits to return

Returns:

  • List[Commit]: List of matching commits (newest first)

Examples:

# Find production models
production_models = dataset.find_commits(tags=["production"])

# Find high-accuracy models
high_accuracy = dataset.find_commits(
    metadata_filter=lambda m: m.get("accuracy", 0) > 0.9
)

# Find PyTorch production models
pytorch_prod = dataset.find_commits(
    tags=["production"],
    metadata_filter=lambda m: m.get("framework") == "pytorch"
)

compare_commits(hash1, hash2)

Compare metadata between two commits.

Parameters:

  • hash1 (str): First commit hash
  • hash2 (str): Second commit hash

Returns:

  • dict: Dictionary with comparison results including metadata and tag differences

Example:

comparison = dataset.compare_commits("abc123", "def456")
print("Metadata changes:", comparison["metadata_diff"]["changed"])
print("Tag changes:", comparison["tags_diff"])
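
The shape of the metadata_diff["changed"] entry can be illustrated with a small pure-Python sketch; key names mirror the example above, and the real return value may carry additional detail:

```python
def diff_metadata(meta1, meta2):
    """Collect metadata keys whose values differ between two commits."""
    changed = {
        key: (meta1.get(key), meta2.get(key))
        for key in set(meta1) | set(meta2)
        if meta1.get(key) != meta2.get(key)
    }
    return {"metadata_diff": {"changed": changed}}

old = {"accuracy": 0.90, "framework": "pytorch"}
new = {"accuracy": 0.92, "framework": "pytorch"}
result = diff_metadata(old, new)
# result["metadata_diff"]["changed"] → {"accuracy": (0.90, 0.92)}
```
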

Examples

# Basic usage
dataset = Dataset(root_dir="/data", name="project")
dataset.commit("Initial commit", add_files=["data.csv"])

# Cloud storage with authentication
dataset = Dataset(
    root_dir="s3://my-bucket/data",
    name="project",
    aws_profile="my-profile"
)

# GCS with service account
dataset = Dataset(
    root_dir="gs://my-bucket/data",
    name="project",
    gcs_token="/path/to/service-account.json",
    gcs_project="my-project"
)

# Azure with connection string
import os

dataset = Dataset(
    root_dir="az://my-container/data",
    name="project",
    azure_connection_string=os.getenv("AZURE_CONNECTION_STRING")
)

Catalog

The main class for managing collections of datasets.

from kirin import Catalog

# Create or load a catalog
catalog = Catalog(root_dir="/path/to/data")

Catalog Constructor Parameters

Catalog(
    root_dir: Union[str, fsspec.AbstractFileSystem],  # Root directory for the catalog
    fs: Optional[fsspec.AbstractFileSystem] = None,  # Filesystem to use
    # Cloud authentication parameters
    aws_profile: Optional[str] = None,   # AWS profile for S3 authentication
    gcs_token: Optional[Union[str, Path]] = None,  # GCS service account token
    gcs_project: Optional[str] = None,   # GCS project ID
    azure_account_name: Optional[str] = None,  # Azure account name
    azure_account_key: Optional[str] = None,    # Azure account key
    azure_connection_string: Optional[str] = None,  # Azure connection string
)

Catalog Basic Operations

  • datasets() - List all datasets in the catalog
  • get_dataset(name) - Get a specific dataset
  • create_dataset(name, description="") - Create a new dataset
  • __len__() - Number of datasets in the catalog

Catalog Examples

# Basic usage
catalog = Catalog(root_dir="/data")
datasets = catalog.datasets()
dataset = catalog.get_dataset("my-dataset")

# Cloud storage with authentication
catalog = Catalog(
    root_dir="s3://my-bucket/data",
    aws_profile="my-profile"
)

# GCS with service account
catalog = Catalog(
    root_dir="gs://my-bucket/data",
    gcs_token="/path/to/service-account.json",
    gcs_project="my-project"
)

Web UI

The web UI provides a graphical interface for Kirin operations.

Routes

  • / - Home page for catalog management
  • /catalogs/add - Add new catalog
  • /catalog/{catalog_id} - View catalog and datasets
  • /catalog/{catalog_id}/{dataset_name} - View specific dataset
  • /catalog/{catalog_id}/{dataset_name}/commit - Commit interface

Catalog Management

The web UI supports cloud authentication through CatalogConfig:

from kirin.web.config import CatalogConfig

# Create catalog config with cloud auth
config = CatalogConfig(
    id="my-catalog",
    name="My Catalog",
    root_dir="s3://my-bucket/data",
    aws_profile="my-profile"
)

# Convert to runtime catalog
catalog = config.to_catalog()

Cloud Authentication in Web UI

The web UI handles cloud authentication automatically:

  1. You create a catalog with a cloud storage URL (s3://, gs://, az://)
  2. The system prompts for authentication parameters
  3. Credentials are stored securely in the catalog configuration

Storage Format

Kirin uses a simplified Git-like storage format:

data/
├── data/                 # Content-addressed file storage
│   └── {hash[:2]}/{hash[2:]}
└── datasets/
    └── my-dataset/
        └── commits.json  # Linear commit history
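
The content-addressed layout splits a file's hash into a two-character directory prefix and the remainder. A minimal sketch of the mapping, assuming SHA-256 (Kirin's actual hash function may differ):

```python
import hashlib

def content_path(data: bytes) -> str:
    """Map file content to its storage path under data/."""
    digest = hashlib.sha256(data).hexdigest()
    return f"data/{digest[:2]}/{digest[2:]}"

path = content_path(b"hello")
# → "data/2c/f24dba..." (SHA-256 of b"hello" starts with "2cf24dba")
```

Splitting on the first two hex characters keeps any single directory from accumulating too many entries.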

Error Handling

Common Exceptions

  • ValueError - Invalid operations (file not found, invalid commit hash, etc.)
  • FileNotFoundError - File not found in dataset
  • HTTPException - Web UI errors (catalog not found, validation errors)

Example Error Handling

try:
    dataset.checkout("nonexistent-commit")
except ValueError as e:
    print(f"Checkout failed: {e}")

try:
    content = dataset.read_file("nonexistent.txt")
except FileNotFoundError as e:
    print(f"File not found: {e}")

Best Practices

Dataset Naming

  • Use descriptive names: user-data, ml-experiments, production-models
  • Avoid generic names: test, data, temp

Workflow Patterns

  • Commit changes regularly with descriptive messages
  • Use linear commit history for simplicity
  • Keep datasets focused on specific use cases
  • Use catalogs to organize related datasets

File Management

  • Use local_files() context manager for library compatibility
  • Commit changes after adding/removing files
  • Use descriptive commit messages

Advanced Features

Context Managers

# Access files as local paths
with dataset.local_files() as local_files:
    df = pd.read_csv(local_files["data.csv"])
    # Files automatically cleaned up

Commit

Represents an immutable snapshot of files at a point in time with optional metadata and tags.

Commit Notebook Integration

Commits also support rich HTML representation in notebooks. When you display a Commit object, you'll see:

  • Commit Metadata: Hash, message, timestamp, and parent commit
  • File List: All files in the commit with interactive access
  • Copy Code to Access: Each file has a button that copies code including a checkout step

Example:

from kirin import Dataset

dataset = Dataset(root_dir="/path/to/data", name="my_dataset")
commit = dataset.get_commit(commit_hash)

# Display in notebook - shows interactive HTML
commit

Commit Code Snippets:

When you click "Copy Code to Access" on a file in a commit, the code includes a checkout step:

# Checkout this commit first
dataset.checkout("commit_hash")
# Get path to local clone of file
with dataset.local_files() as files:
    file_path = files["data.csv"]

Note: Commits are frozen dataclasses, so you cannot set _repr_variable_name on them. Code snippets will use "dataset" as the default variable name.

Properties

  • hash (str): Unique commit identifier
  • message (str): Commit message
  • timestamp (datetime): When the commit was created
  • parent_hash (Optional[str]): Hash of the parent commit (None for initial commit)
  • files (Dict[str, File]): Dictionary of files in this commit
  • metadata (Dict[str, Any]): Metadata dictionary for model versioning
  • tags (List[str]): List of tags for staging/versioning

Methods

  • get_file(name) - Get a file by name
  • list_files() - List all file names
  • has_file(name) - Check if file exists
  • get_file_count() - Get number of files
  • get_total_size() - Get total size of all files
  • to_dict() - Convert to dictionary representation
  • from_dict(data, storage) - Create from dictionary

Commit Examples

# Access commit properties
commit = dataset.current_commit
print(f"Commit: {commit.short_hash}")
print(f"Message: {commit.message}")
print(f"Files: {len(commit.files)}")
print(f"Metadata: {commit.metadata}")
print(f"Tags: {commit.tags}")

# Check if commit has specific metadata
if commit.metadata.get("accuracy", 0) > 0.9:
    print("High accuracy model!")

# Check if commit has specific tags
if "production" in commit.tags:
    print("Production model")

Commit History

# Get commit history
history = dataset.history(limit=10)
for commit in history:
    print(f"{commit.hash}: {commit.message}")

File Operations

# Add files to commit
dataset.commit("Add new data", add_files=["new_data.csv"])

# Remove files from commit
dataset.commit("Remove old data", remove_files=["old_data.csv"])

# Combined operations
dataset.commit("Update dataset",
              add_files=["new_data.csv"],
              remove_files=["old_data.csv"])

Cloud Storage Integration

# AWS S3 with profile
dataset = Dataset(
    root_dir="s3://my-bucket/data",
    name="my-dataset",
    aws_profile="production"
)

# GCS with service account
dataset = Dataset(
    root_dir="gs://my-bucket/data",
    name="my-dataset",
    gcs_token="/path/to/service-account.json",
    gcs_project="my-project"
)

# Azure with connection string
import os

dataset = Dataset(
    root_dir="az://my-container/data",
    name="my-dataset",
    azure_connection_string=os.getenv("AZURE_CONNECTION_STRING")
)

For detailed examples and cloud storage setup, see the Cloud Storage Authentication Guide.