Kirin API Reference
Core Classes
Dataset
The main class for working with Kirin datasets.
from kirin import Dataset
# Create or load a dataset
dataset = Dataset(root_dir="/path/to/data", name="my-dataset")
Constructor Parameters
Dataset(
    root_dir: Union[str, Path],                     # Root directory for the dataset
    name: str,                                      # Name of the dataset
    description: str = "",                          # Description of the dataset
    fs: Optional[fsspec.AbstractFileSystem] = None, # Filesystem to use
    # Cloud authentication parameters
    aws_profile: Optional[str] = None,              # AWS profile for S3 authentication
    gcs_token: Optional[Union[str, Path]] = None,   # GCS service account token
    gcs_project: Optional[str] = None,              # GCS project ID
    azure_account_name: Optional[str] = None,       # Azure account name
    azure_account_key: Optional[str] = None,        # Azure account key
    azure_connection_string: Optional[str] = None,  # Azure connection string
)
Basic Operations
- commit(message, add_files=None, remove_files=None, metadata=None, tags=None) - Commit changes to the dataset
- checkout(commit_hash=None) - Switch to a specific commit (latest if None)
- files - Dictionary of files in the current commit
- local_files() - Context manager for accessing files as local paths
- history(limit=None) - Get commit history
- get_file(filename) - Get a file from the current commit
- read_file(filename) - Read file content as text
- download_file(filename, target_path) - Download file to local path
Model Versioning Operations
- find_commits(tags=None, metadata_filter=None, limit=None) - Find commits matching criteria
- compare_commits(hash1, hash2) - Compare metadata between two commits
Notebook Integration
Kirin provides rich HTML representations for datasets, commits, and catalogs that display beautifully in Jupyter and Marimo notebooks.
HTML Representation
When you display a Dataset, Commit, or Catalog object in a notebook cell,
Kirin automatically generates an interactive HTML view with:
- File Lists: Click on any file to preview its contents or reveal code snippets
- File Previews:
- CSV files: Displayed as interactive tables with proper formatting
- JSON files: Formatted JSON with syntax highlighting
- Text files: Displayed in code blocks with proper formatting
- Files larger than 100KB are not previewed for performance
- Copy Code to Access Button: Each file has a button that copies Python code to your clipboard with the correct variable name
- Commit History: Visual display of commit history with metadata
- File Metadata: File sizes, content types, and icons
Example:
from kirin import Dataset
dataset = Dataset(root_dir="/path/to/data", name="my_dataset")
# Display in notebook - shows interactive HTML
dataset
Variable Names in Code Snippets
By default, code snippets use generic variable names ("dataset", "commit", or
"catalog") based on the class type. You can customize the variable name used
in code snippets by setting the _repr_variable_name attribute.
Default Behavior:
dataset = Dataset(root_dir="/path/to/data", name="my_dataset")
# Display in notebook - code snippets use "dataset" by default
dataset
When you click "Copy Code to Access" on a file, the copied code will use the default variable name:
# Get path to local clone of file
with dataset.local_files() as files:
    file_path = files["data.csv"]
Custom Variable Names:
If you want code snippets to use a different variable name, set the
_repr_variable_name attribute:
my_dataset = Dataset(root_dir="/path/to/data", name="my_dataset")
my_dataset._repr_variable_name = "my_dataset"
# Now code snippets will use "my_dataset" instead of "dataset"
my_dataset # Display in notebook
When you click "Copy Code to Access", the copied code will use your custom variable name:
# Get path to local clone of file
with my_dataset.local_files() as files:
    file_path = files["data.csv"]
Note: The _repr_variable_name attribute is only used for HTML
representation and doesn't affect the actual dataset object.
Known Limitation (as of December 2025): The "Copy Code to Access" button does not work within Marimo notebooks running inside VSCode due to clipboard API restrictions. The button works correctly when viewing notebooks in a web browser.
Method Details
commit(message, add_files=None, remove_files=None, metadata=None, tags=None)
Create a new commit with changes to the dataset.
Enhanced for ML artifacts:
- Model objects: If add_files contains scikit-learn model objects, they are automatically serialized, and their hyperparameters and metrics are extracted and added to metadata.
- Plot objects: If add_files contains matplotlib or plotly figure objects, they are automatically converted to files (SVG for vector plots, WebP for raster plots) with format auto-detection.
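The type-based dispatch described above can be sketched in plain Python. This is an illustrative heuristic, not Kirin's actual implementation; the duck-typing signals (`get_params`/`fit` for scikit-learn estimators, `savefig` for matplotlib figures) are assumptions:

```python
from pathlib import Path

def classify_artifact(obj):
    """Guess how a commit item would be handled (illustrative heuristic only)."""
    if isinstance(obj, (str, Path)):
        return "file_path"      # regular file on disk, added as-is
    if hasattr(obj, "get_params") and hasattr(obj, "fit"):
        return "sklearn_model"  # serialize + extract hyperparameters/metrics
    if hasattr(obj, "savefig"):
        return "plot"           # convert to SVG/WebP
    raise TypeError(f"Unsupported artifact type: {type(obj)!r}")
```

For example, `classify_artifact("data.csv")` returns `"file_path"`, while a fitted estimator would be routed through model serialization.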
Parameters:
- message (str): Commit message describing the changes
- add_files (List[Union[str, Path, Any]], optional): List of files (paths), model objects, or plot objects to add. Can include:
  - File paths (str or Path): Regular files
  - scikit-learn model objects: Automatically serialized with hyperparameters and metrics extracted
  - matplotlib/plotly figure objects: Automatically converted to SVG/WebP with format auto-detection
- remove_files (List[str], optional): List of filenames to remove
- metadata (Dict[str, Any], optional): Metadata dictionary (merged with auto-extracted metadata). For model-specific metadata, use the metadata["models"][var_name] structure.
- tags (List[str], optional): List of tags for staging/versioning
Returns:
str: Hash of the new commit
Examples:
# Basic commit with file paths
dataset.commit("Add new data", add_files=["data.csv"])
# Commit with scikit-learn model object (automatic serialization)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
dataset.commit(
    message="Initial model",
    add_files=[model],           # Auto-serialized as "model.pkl"
    metadata={"accuracy": 0.95}  # Data-dependent metrics
)
# Multiple models with model-specific metadata
rf_model = RandomForestClassifier(n_estimators=100)
rf_model.fit(X_train, y_train)
rf_accuracy = rf_model.score(X_test, y_test)
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
lr_accuracy = lr_model.score(X_test, y_test)
dataset.commit(
    message="Compare models",
    add_files=[rf_model, lr_model],
    metadata={
        "models": {
            "rf_model": {"accuracy": rf_accuracy},  # Model-specific
            "lr_model": {"accuracy": lr_accuracy},  # Model-specific
        },
        "dataset": "iris",  # Shared metadata
    },
)
# Commit with matplotlib plot object (automatic conversion)
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])
ax.set_title("Training Loss")
dataset.commit(
    message="Add training plot",
    add_files=[fig],  # Auto-converted to SVG
)
# Mixed: model objects, plot objects, and file paths
dataset.commit(
    message="Model with plots",
    add_files=[
        model,          # Auto-serialized model
        fig,            # Auto-converted plot
        "config.json",  # Regular file path
    ],
)
# Traditional model versioning (still works)
dataset.commit(
    message="Improved model v2.0",
    add_files=["model.pt", "config.json"],
    metadata={
        "framework": "pytorch",
        "accuracy": 0.92,
        "hyperparameters": {"lr": 0.001, "epochs": 10}
    },
    tags=["production", "v2.0"]
)
Format Auto-Detection for Plots:
When plot objects are committed, Kirin automatically detects the optimal format:
- SVG (vector): Default for matplotlib and plotly figures. Best for line plots, scatter plots, and other vector-based visualizations. Provides infinite scalability without quality loss.
- WebP (raster): Used for plots with raster elements (e.g., images, heatmaps). Provides good compression while maintaining quality.
The format is automatically chosen based on the plot type. Matplotlib and plotly figures default to SVG, which is optimal for most scientific visualizations.
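One plausible heuristic for this auto-detection, sketched without matplotlib as a dependency: inspect the class names of the artists a figure contains and fall back to WebP only when a raster-like artist is present. The specific class names (`AxesImage`, `QuadMesh`, etc.) are real matplotlib artist types, but treating their presence as the deciding signal is an assumption, not Kirin's documented logic:

```python
def choose_plot_format(artist_type_names):
    """Pick 'svg' unless any artist is raster-like (illustrative heuristic)."""
    RASTER_ARTISTS = {"AxesImage", "QuadMesh", "PcolorImage", "FigureImage"}
    if RASTER_ARTISTS & set(artist_type_names):
        return "webp"  # raster content compresses poorly as SVG
    return "svg"       # vector output scales without quality loss
```

A line plot (`["Line2D"]`) maps to `"svg"`; a heatmap containing an `AxesImage` maps to `"webp"`.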
Metadata Structure:
When model objects are committed, metadata is automatically structured as:
{
    "models": {
        "model_name": {
            "model_type": "RandomForestClassifier",  # Auto-extracted
            "hyperparameters": {...},                # Auto-extracted via get_params()
            "metrics": {...},                        # Auto-extracted (feature_importances_, etc.)
            "sklearn_version": "1.3.0",              # Auto-extracted
            "accuracy": 0.95,                        # User-provided, model-specific
            "source_file": "ml-workflow.py",         # Auto-detected
            "source_hash": "..."                     # Auto-detected
        }
    },
    "dataset": "iris"  # Top-level metadata (shared)
}
Metadata Merging:
- Auto-extracted metadata (hyperparameters, metrics, sklearn_version, source info) is added to each model's entry
- User-provided model-specific metadata (via metadata["models"][var_name]) is merged into each model's entry
- Top-level metadata (outside the models dict) applies to the entire commit
- User-provided metadata wins on conflicts
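The precedence rule above ("user-provided metadata wins") can be sketched as a plain dict merge. This is a hypothetical helper for illustration, not Kirin's API; `auto` stands in for auto-extracted fields and `user` for the caller's model-specific metadata:

```python
def merge_model_metadata(auto: dict, user: dict) -> dict:
    """Merge auto-extracted and user-provided metadata; user values win on conflicts."""
    merged = dict(auto)
    merged.update(user)  # user-provided keys override auto-extracted ones
    return merged

auto = {"model_type": "RandomForestClassifier", "sklearn_version": "1.3.0", "accuracy": 0.90}
user = {"accuracy": 0.95}  # conflicts with the auto-extracted value
result = merge_model_metadata(auto, user)
# result["accuracy"] is 0.95 -- the user-provided value wins,
# while non-conflicting auto-extracted keys are kept
```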
find_commits(tags=None, metadata_filter=None, limit=None)
Find commits matching specified criteria.
Parameters:
- tags (List[str], optional): Filter by tags (commits must have ALL specified tags)
- metadata_filter (Callable[[Dict], bool], optional): Function that takes a metadata dict and returns a bool
- limit (int, optional): Maximum number of commits to return
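The way the two filters combine (every requested tag must be present, AND the metadata predicate must pass) can be sketched in plain Python. This is illustrative, not Kirin's internals:

```python
def matches(commit_tags, commit_metadata, tags=None, metadata_filter=None):
    """Return True if a commit passes both the tag filter and the metadata filter."""
    if tags is not None and not set(tags).issubset(commit_tags):
        return False  # ALL requested tags must be present on the commit
    if metadata_filter is not None and not metadata_filter(commit_metadata):
        return False  # the predicate over the metadata dict must hold
    return True       # with no filters given, every commit matches
```

For example, a commit tagged `["production", "v2.0"]` matches `tags=["production"]`, but a commit tagged only `["staging"]` does not.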
Returns:
List[Commit]: List of matching commits (newest first)
Examples:
# Find production models
production_models = dataset.find_commits(tags=["production"])
# Find high-accuracy models
high_accuracy = dataset.find_commits(
    metadata_filter=lambda m: m.get("accuracy", 0) > 0.9
)
# Find PyTorch production models
pytorch_prod = dataset.find_commits(
    tags=["production"],
    metadata_filter=lambda m: m.get("framework") == "pytorch"
)
compare_commits(hash1, hash2)
Compare metadata between two commits.
Parameters:
- hash1 (str): First commit hash
- hash2 (str): Second commit hash
Returns:
dict: Dictionary with comparison results including metadata and tag differences
Example:
comparison = dataset.compare_commits("abc123", "def456")
print("Metadata changes:", comparison["metadata_diff"]["changed"])
print("Tag changes:", comparison["tags_diff"])
Examples
# Basic usage
dataset = Dataset(root_dir="/data", name="project")
dataset.commit("Initial commit", add_files=["data.csv"])
# Cloud storage with authentication
dataset = Dataset(
    root_dir="s3://my-bucket/data",
    name="project",
    aws_profile="my-profile"
)

# GCS with service account
dataset = Dataset(
    root_dir="gs://my-bucket/data",
    name="project",
    gcs_token="/path/to/service-account.json",
    gcs_project="my-project"
)

# Azure with connection string
import os

dataset = Dataset(
    root_dir="az://my-container/data",
    name="project",
    azure_connection_string=os.getenv("AZURE_CONNECTION_STRING")
)
Catalog
The main class for managing collections of datasets.
from kirin import Catalog
# Create or load a catalog
catalog = Catalog(root_dir="/path/to/data")
Catalog Constructor Parameters
Catalog(
    root_dir: Union[str, fsspec.AbstractFileSystem], # Root directory for the catalog
    fs: Optional[fsspec.AbstractFileSystem] = None,  # Filesystem to use
    # Cloud authentication parameters
    aws_profile: Optional[str] = None,               # AWS profile for S3 authentication
    gcs_token: Optional[Union[str, Path]] = None,    # GCS service account token
    gcs_project: Optional[str] = None,               # GCS project ID
    azure_account_name: Optional[str] = None,        # Azure account name
    azure_account_key: Optional[str] = None,         # Azure account key
    azure_connection_string: Optional[str] = None,   # Azure connection string
)
Catalog Basic Operations
- datasets() - List all datasets in the catalog
- get_dataset(name) - Get a specific dataset
- create_dataset(name, description="") - Create a new dataset
- __len__() - Number of datasets in the catalog
Catalog Examples
# Basic usage
catalog = Catalog(root_dir="/data")
datasets = catalog.datasets()
dataset = catalog.get_dataset("my-dataset")
# Cloud storage with authentication
catalog = Catalog(
    root_dir="s3://my-bucket/data",
    aws_profile="my-profile"
)

# GCS with service account
catalog = Catalog(
    root_dir="gs://my-bucket/data",
    gcs_token="/path/to/service-account.json",
    gcs_project="my-project"
)
Web UI
The web UI provides a graphical interface for Kirin operations.
Routes
- / - Home page for catalog management
- /catalogs/add - Add new catalog
- /catalog/{catalog_id} - View catalog and datasets
- /catalog/{catalog_id}/{dataset_name} - View specific dataset
- /catalog/{catalog_id}/{dataset_name}/commit - Commit interface
Catalog Management
The web UI supports cloud authentication through CatalogConfig:
from kirin.web.config import CatalogConfig
# Create catalog config with cloud auth
config = CatalogConfig(
    id="my-catalog",
    name="My Catalog",
    root_dir="s3://my-bucket/data",
    aws_profile="my-profile"
)
# Convert to runtime catalog
catalog = config.to_catalog()
Cloud Authentication in Web UI
The web UI handles cloud authentication automatically:
- When you create a catalog with a cloud storage URL (s3://, gs://, az://), the system prompts for authentication parameters
- Credentials are stored securely in the catalog configuration
Storage Format
Kirin uses a simplified Git-like storage format:
data/
├── data/                        # Content-addressed file storage
│   └── {hash[:2]}/{hash[2:]}
└── datasets/
    └── my-dataset/
        └── commits.json         # Linear commit history
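The `{hash[:2]}/{hash[2:]}` layout above shards stored files by content hash, so identical content is stored once and directories stay small. A sketch of computing such a path (SHA-256 is an assumption here; the reference does not name Kirin's hash algorithm):

```python
import hashlib
from pathlib import Path

def storage_path(content: bytes, root: str = "data/data") -> Path:
    """Content-addressed path: first two hex chars as directory, rest as filename."""
    digest = hashlib.sha256(content).hexdigest()
    return Path(root) / digest[:2] / digest[2:]

p = storage_path(b"hello")
# The directory name is the first two hex characters of the digest;
# the filename is the remaining 62 characters.
```

Because the path is derived purely from the bytes, committing the same file twice yields the same path, which is what makes the storage deduplicating.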
Error Handling
Common Exceptions
- ValueError - Invalid operations (file not found, invalid commit hash, etc.)
- FileNotFoundError - File not found in dataset
- HTTPException - Web UI errors (catalog not found, validation errors)
Example Error Handling
try:
    dataset.checkout("nonexistent-commit")
except ValueError as e:
    print(f"Checkout failed: {e}")

try:
    content = dataset.read_file("nonexistent.txt")
except FileNotFoundError as e:
    print(f"File not found: {e}")
Best Practices
Dataset Naming
- Use descriptive names: user-data, ml-experiments, production-models
- Avoid generic names: test, data, temp
Workflow Patterns
- Commit changes regularly with descriptive messages
- Use linear commit history for simplicity
- Keep datasets focused on specific use cases
- Use catalogs to organize related datasets
File Management
- Use the local_files() context manager for library compatibility
- Commit changes after adding/removing files
- Use descriptive commit messages
Advanced Features
Context Managers
# Access files as local paths
import pandas as pd

with dataset.local_files() as local_files:
    df = pd.read_csv(local_files["data.csv"])
# Files are automatically cleaned up when the block exits
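The general mechanics of such a context manager can be sketched with the standard library: materialize content into a temporary directory, yield the paths, and remove everything on exit. This is illustrative of the pattern, not Kirin's implementation:

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def local_copies(contents: dict):
    """Materialize {filename: bytes} into a temp dir, yield paths, clean up on exit."""
    tmp = Path(tempfile.mkdtemp())
    try:
        paths = {}
        for name, data in contents.items():
            path = tmp / name
            path.write_bytes(data)
            paths[name] = path
        yield paths
    finally:
        shutil.rmtree(tmp)  # temp files removed even if the body raises
```

The `finally` clause is what guarantees cleanup, which is why the context-manager form is safer than handing out raw temp paths.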
Commit
Represents an immutable snapshot of files at a point in time with optional metadata and tags.
Commit Notebook Integration
Commits also support rich HTML representation in notebooks. When you display a
Commit object, you'll see:
- Commit Metadata: Hash, message, timestamp, and parent commit
- File List: All files in the commit with interactive access
- Copy Code to Access: Each file has a button that copies code including a checkout step
Example:
from kirin import Dataset
dataset = Dataset(root_dir="/path/to/data", name="my_dataset")
commit = dataset.get_commit(commit_hash)  # e.g. a hash taken from dataset.history()
# Display in notebook - shows interactive HTML
commit
Commit Code Snippets:
When you click "Copy Code to Access" on a file in a commit, the code includes a checkout step:
# Checkout this commit first
dataset.checkout("commit_hash")
# Get path to local clone of file
with dataset.local_files() as files:
    file_path = files["data.csv"]
Note: Commits are frozen dataclasses, so you cannot set
_repr_variable_name on them. Code snippets will use "dataset" as the default
variable name.
Properties
- hash (str): Unique commit identifier
- message (str): Commit message
- timestamp (datetime): When the commit was created
- parent_hash (Optional[str]): Hash of the parent commit (None for the initial commit)
- files (Dict[str, File]): Dictionary of files in this commit
- metadata (Dict[str, Any]): Metadata dictionary for model versioning
- tags (List[str]): List of tags for staging/versioning
Methods
- get_file(name) - Get a file by name
- list_files() - List all file names
- has_file(name) - Check if file exists
- get_file_count() - Get number of files
- get_total_size() - Get total size of all files
- to_dict() - Convert to dictionary representation
- from_dict(data, storage) - Create from dictionary
Commit Examples
# Access commit properties
commit = dataset.current_commit
print(f"Commit: {commit.short_hash}")
print(f"Message: {commit.message}")
print(f"Files: {len(commit.files)}")
print(f"Metadata: {commit.metadata}")
print(f"Tags: {commit.tags}")
# Check if commit has specific metadata
if commit.metadata.get("accuracy", 0) > 0.9:
    print("High accuracy model!")

# Check if commit has specific tags
if "production" in commit.tags:
    print("Production model")
Commit History
# Get commit history
history = dataset.history(limit=10)
for commit in history:
    print(f"{commit.hash}: {commit.message}")
File Operations
# Add files to commit
dataset.commit("Add new data", add_files=["new_data.csv"])
# Remove files from commit
dataset.commit("Remove old data", remove_files=["old_data.csv"])
# Combined operations
dataset.commit(
    "Update dataset",
    add_files=["new_data.csv"],
    remove_files=["old_data.csv"],
)
Cloud Storage Integration
# AWS S3 with profile
dataset = Dataset(
    root_dir="s3://my-bucket/data",
    name="my-dataset",
    aws_profile="production"
)

# GCS with service account
dataset = Dataset(
    root_dir="gs://my-bucket/data",
    name="my-dataset",
    gcs_token="/path/to/service-account.json",
    gcs_project="my-project"
)

# Azure with connection string
import os

dataset = Dataset(
    root_dir="az://my-container/data",
    name="my-dataset",
    azure_connection_string=os.getenv("AZURE_CONNECTION_STRING")
)
For detailed examples and cloud storage setup, see the Cloud Storage Authentication Guide.