To run this notebook, click on the molab shield above or run the following command at the terminal:
uvx marimo edit --sandbox --mcp --no-token --watch https://github.com/nll-ai/kirin/blob/main/docs/tutorials/cloud-storage.py
import marimo as mo
Cloud Storage Overview
This tutorial introduces you to using Kirin with cloud storage backends. You'll learn how Kirin works seamlessly with S3, GCS, Azure, and other cloud storage providers, and understand the key concepts for working with remote datasets.
What You'll Learn
- Understanding cloud storage backends and how Kirin uses them
- How to create catalogs and datasets with cloud storage
- Authentication methods for different cloud providers
- Working with remote files using the same API as local files
- Key differences and considerations when using cloud storage
- Best practices for cloud storage workflows
Prerequisites
- Completed Your First Dataset tutorial
- Basic understanding of datasets and commits
- Familiarity with cloud storage concepts (S3, GCS, Azure)
import tempfile
from pathlib import Path
from kirin import Catalog
# For this tutorial, we'll use local storage to demonstrate concepts
# In production, you would use: Catalog(root_dir="s3://my-bucket/data")
temp_dir = Path(tempfile.mkdtemp(prefix="kirin_cloud_tutorial_"))
catalog = Catalog(root_dir=temp_dir)
print(f"✅ Created catalog with root: {catalog.root_dir}")
print(" (In production, this would be a cloud URL like s3://bucket/data)")
Understanding Cloud Storage Backends
Kirin supports multiple cloud storage backends through the fsspec library. This means you can use the same API whether you're working with local files or cloud storage.
Supported Backends:
- AWS S3:
s3://bucket/path - Google Cloud Storage:
gs://bucket/path - Azure Blob Storage:
az://container/path - And many more: Dropbox, Google Drive, FTP, etc.
The key insight is that Kirin treats all storage backends the same way - you use the same methods and patterns regardless of where your data is stored.
Step 1: Creating Catalogs with Cloud Storage
A catalog is a collection of datasets. When you create a catalog with a cloud storage URL, all datasets in that catalog will be stored in the cloud.
The API is identical whether you're using local or cloud storage - you
just change the root_dir parameter.
# Local storage (what we're using in this tutorial)
catalog = Catalog(root_dir="/path/to/local/data")
# AWS S3
catalog = Catalog(
root_dir="s3://{{ bucket_name }}/data",
aws_profile="{{ aws_profile }}"
)
# Google Cloud Storage
catalog = Catalog(
root_dir="gs://{{ bucket_name }}/data",
gcs_token="/path/to/service-account.json",
gcs_project="{{ project_id }}"
)
# Azure Blob Storage
catalog = Catalog(
root_dir="az://{{ container_name }}/data",
azure_connection_string="{{ connection_string }}"
)
Key points:
- The URL scheme (
s3://,gs://,az://) tells Kirin which backend to use - Authentication parameters are passed when creating the catalog
- Once created, you use the catalog the same way regardless of backend
# Create a dataset in the catalog (works the same for local or cloud)
dataset = catalog.create_dataset(
"cloud_demo", description="Demo dataset for cloud storage tutorial"
)
print(f"✅ Created dataset: {dataset.name}")
print(f" Dataset root: {dataset.root_dir}")
print(" (This would be in cloud storage if using s3://, gs://, etc.)")
Step 2: The Same API for Local and Cloud
One of Kirin's strengths is that the API is identical whether you're working with local files or cloud storage. This means:
- You can develop locally and deploy to cloud without code changes
- The same patterns work everywhere
- You don't need to learn different APIs for different backends
# Create some sample files
data_dir = temp_dir / "sample_data"
data_dir.mkdir(exist_ok=True)
# Create a CSV file
csv_file = data_dir / "sales_data.csv"
csv_file.write_text("""date,product,revenue
2024-01-01,Widget A,1000
2024-01-02,Widget B,1500
2024-01-03,Widget A,1200
""")
# Create a JSON config file
config_file = data_dir / "config.json"
config_file.write_text("""{
"version": "1.0",
"region": "us-east-1",
"currency": "USD"
}
""")
print("✅ Created sample files:")
print(f" - {csv_file.name}")
print(f" - {config_file.name}")
# Commit files - same API whether local or cloud!
commit_hash = dataset.commit(
message="Initial commit: Add sales data and configuration",
add_files=[str(csv_file), str(config_file)],
)
print(f"✅ Created commit: {commit_hash[:8]}")
print(" (Files are now stored in content-addressed storage)")
print(" (If using cloud, files are uploaded to cloud storage)")
What just happened?
- Files were committed to the dataset
- If using cloud storage, files were uploaded to the cloud backend
- Files are stored using content-addressed storage (by content hash)
- The commit was created and stored in the dataset's history
The process is identical whether you're using local or cloud storage!
Step 3: Working with Remote Files
When you access files from a cloud storage dataset, Kirin handles the
complexity of downloading files on-demand. You use the same local_files()
context manager as with local storage.
# Access files - same API for local and cloud!
with dataset.local_files() as local_files:
print("📂 Files available locally:")
for filename, local_path in local_files.items():
print(f" {filename} -> {local_path}")
# Read file content (works the same for local or cloud)
csv_path = local_files["sales_data.csv"]
csv_content = Path(csv_path).read_text()
print("\n📝 CSV content:")
print(csv_content[:200] + "...")
How it works with cloud storage:
- When you access a file, Kirin downloads it from cloud storage
- Files are cached locally in a temporary directory
- When you exit the context manager, temporary files are cleaned up
- This is called lazy loading - files are only downloaded when needed
Benefits:
- Efficient: Only download what you use
- Automatic cleanup: Temporary files are managed for you
- Same API: Works identically for local and cloud storage
Understanding Authentication
Different cloud providers use different authentication methods. Kirin supports the standard authentication patterns for each provider.
Key concept: Authentication is configured when you create the catalog or dataset, not when you access files. Once authenticated, all operations use those credentials automatically.
AWS S3 Authentication Methods
1. AWS Profile (Recommended for Development)
catalog = Catalog(
root_dir="s3://{{ bucket_name }}/data",
aws_profile="{{ aws_profile }}"
)
2. Environment Variables
export AWS_ACCESS_KEY_ID={{ access_key_id }}
export AWS_SECRET_ACCESS_KEY={{ secret_access_key }}
export AWS_DEFAULT_REGION={{ region }}
Then use without explicit credentials:
catalog = Catalog(root_dir="s3://{{ bucket_name }}/data")
3. IAM Roles (Production)
When running on EC2, ECS, or Lambda, IAM roles are used automatically:
# No credentials needed - uses IAM role
catalog = Catalog(root_dir="s3://{{ bucket_name }}/data")
Google Cloud Storage Authentication Methods
1. Service Account Key File
catalog = Catalog(
root_dir="gs://{{ bucket_name }}/data",
gcs_token="/path/to/service-account.json",
gcs_project="{{ project_id }}"
)
2. Application Default Credentials
# One-time setup
gcloud auth application-default login
Then use without explicit credentials:
catalog = Catalog(root_dir="gs://{{ bucket_name }}/data")
3. Workload Identity (GKE/Kubernetes)
When running on GKE with Workload Identity configured, Application Default Credentials automatically detect credentials from the metadata server (which is how Workload Identity works):
# No credentials needed - uses ADC (which includes Workload Identity on GKE)
catalog = Catalog(root_dir="gs://{{ bucket_name }}/data")
Azure Blob Storage Authentication Methods
1. Connection String
import os
catalog = Catalog(
root_dir="az://{{ container_name }}/data",
azure_connection_string=os.getenv("AZURE_CONNECTION_STRING")
)
2. Account Name and Key
catalog = Catalog(
root_dir="az://{{ container_name }}/data",
azure_account_name="{{ account_name }}",
azure_account_key="{{ account_key }}"
)
3. Azure CLI Authentication
# One-time setup
az login
Then use without explicit credentials:
catalog = Catalog(root_dir="az://{{ container_name }}/data")
Step 4: Key Differences from Local Storage
While the API is the same, there are some important differences to understand when working with cloud storage.
Network Latency
Cloud storage operations involve network requests, which adds latency:
- Local storage: Instant file access
- Cloud storage: Network round-trip time (typically 10-100ms)
Impact: Operations like listing files or checking if a file exists take longer with cloud storage.
Mitigation: Use batch operations when possible, and cache results when appropriate.
File Download Behavior
With cloud storage, files are downloaded on-demand:
- Local storage: Files are already on disk
- Cloud storage: Files are downloaded when accessed via
local_files()
Impact: First access to a file takes longer (download time).
Mitigation: Files are cached during the local_files() context, so
multiple accesses to the same file don't re-download.
Cost Considerations
Cloud storage has usage-based costs:
- Storage costs: Pay for data stored
- Request costs: Pay for API requests (PUT, GET, LIST)
- Data transfer costs: Pay for data downloaded (in some cases)
Best practices:
- Batch operations to reduce request counts
- Use appropriate storage classes (S3 Standard, IA, Glacier)
- Compress files before storing
- Monitor usage through cloud provider dashboards
Authentication Requirements
Cloud storage requires authentication, while local storage does not:
- Local storage: No authentication needed
- Cloud storage: Must provide credentials (profile, keys, etc.)
Best practices:
- Use IAM roles/service accounts in production (not access keys)
- Rotate credentials regularly
- Use least privilege (only grant necessary permissions)
- Never commit credentials to version control
Step 5: Best Practices for Cloud Storage
Following best practices helps you build efficient, secure, and cost-effective cloud storage workflows.
Security Best Practices
1. Use IAM Roles/Service Accounts
In production, use managed identities instead of access keys:
# ✅ Good: Uses IAM role automatically (on EC2/ECS/Lambda)
catalog = Catalog(root_dir="s3://{{ bucket_name }}/data")
# ❌ Avoid: Hardcoded credentials
catalog = Catalog(
root_dir="s3://{{ bucket_name }}/data",
aws_access_key_id="AKIA...",
aws_secret_access_key="..."
)
2. Least Privilege
Grant only the permissions needed:
- Read-only access for datasets that won't be modified
- Write access only where necessary
- Use bucket policies to restrict access
3. Monitor Access
Enable audit logging in your cloud provider to track access patterns.
Performance Best Practices
1. Batch Operations
Group multiple file operations together:
# ✅ Good: Single commit with multiple files
dataset.commit(
message="Add multiple files",
add_files=["file1.csv", "file2.csv", "file3.csv"]
)
# ❌ Avoid: Multiple separate commits
dataset.commit(message="Add file1", add_files=["file1.csv"])
dataset.commit(message="Add file2", add_files=["file2.csv"])
dataset.commit(message="Add file3", add_files=["file3.csv"])
2. Use Appropriate Regions
Store data in the same region as your compute resources to minimize latency.
3. Process Files in Chunks
For large files, use chunked processing:
with dataset.local_files() as local_files:
if "large_data.csv" in local_files:
local_path = local_files["large_data.csv"]
# Process in chunks
for chunk in pd.read_csv(local_path, chunksize=10000):
process_chunk(chunk)
Cost Optimization Best Practices
1. Use Appropriate Storage Classes
Different storage classes have different costs:
- Standard: Fast access, higher cost
- Infrequent Access (IA): Lower cost, slightly slower
- Glacier/Archive: Lowest cost, slow retrieval
2. Enable Lifecycle Policies
Automatically move old data to cheaper storage classes:
# Example: Move files older than 90 days to IA
# (Configured in cloud provider console, not in Kirin)
3. Compress Files
Compress text files before storing to reduce storage costs:
import gzip
import shutil
# Compress before committing
with open("data.csv", "rb") as f_in:
with gzip.open("data.csv.gz", "wb") as f_out:
shutil.copyfileobj(f_in, f_out)
dataset.commit(message="Add compressed data", add_files=["data.csv.gz"])
Step 6: Common Workflows
Here are some common patterns for working with cloud storage in Kirin.
Development to Production Workflow
Pattern: Develop locally, deploy to cloud
# Development (local)
dev_catalog = Catalog(root_dir="./local_data")
dev_dataset = dev_catalog.create_dataset("my_dataset")
# Production (cloud)
prod_catalog = Catalog(
root_dir="s3://{{ bucket_name }}/data",
aws_profile="{{ aws_profile }}"
)
prod_dataset = prod_catalog.create_dataset("my_dataset")
# Same code works for both!
commit_hash = dataset.commit(
message="Add data",
add_files=["data.csv"]
)
The API is identical, so you can develop locally and deploy to cloud without code changes.
Multi-Environment Workflow
Pattern: Use different catalogs for different environments
# Environment-specific catalogs
catalogs = {
"dev": Catalog(root_dir="./local_data"),
"staging": Catalog(root_dir="s3://staging-bucket/data"),
"prod": Catalog(root_dir="s3://prod-bucket/data")
}
# Use appropriate catalog for environment
env = os.getenv("ENVIRONMENT", "dev")
catalog = catalogs[env]
dataset = catalog.get_dataset("my_dataset")
This pattern allows you to use the same code across environments while keeping data isolated.
Hybrid Workflow
Pattern: Mix local and cloud storage
# Local catalog for development
local_catalog = Catalog(root_dir="./local_data")
# Cloud catalog for shared/production data
cloud_catalog = Catalog(
root_dir="s3://{{ bucket_name }}/data",
aws_profile="{{ aws_profile }}"
)
# Use local for development, cloud for production
if os.getenv("ENVIRONMENT") == "production":
catalog = cloud_catalog
else:
catalog = local_catalog
This allows you to develop locally while using cloud storage for production workloads.
Step 7: Troubleshooting Common Issues
Here are solutions to common problems when working with cloud storage.
SSL Certificate Errors
Problem: SSLCertVerificationError when connecting to cloud storage
Solution: Set up SSL certificates for isolated Python environments:
python -m kirin.setup_ssl
This is especially important when using pixi, uv, or other isolated Python environments.
Authentication Failures
Problem: "Access Denied" or "Anonymous caller" errors
Solutions:
-
Check credentials are configured:
# AWS aws configure list # GCP gcloud auth list # Azure az account show -
Verify credentials have correct permissions:
- Check IAM policies for your user/role
-
Ensure bucket/container permissions are correct
-
Test credentials directly:
# AWS import boto3 s3 = boto3.client('s3') s3.list_buckets() # Should work without errors
Performance Issues
Problem: Slow operations with cloud storage
Solutions:
- Use same region: Store data in the same region as compute
- Batch operations: Group multiple operations together
- Cache results: Cache file lists and metadata when appropriate
- Use appropriate storage class: Standard for frequently accessed data
Problem: Large file downloads are slow
Solutions:
- Process in chunks: Don't download entire large files at once
- Use streaming: For very large files, consider streaming approaches
- Compress files: Smaller files download faster
Summary
Congratulations! You've learned the fundamentals of using Kirin with cloud storage:
- ✅ Understanding cloud backends - S3, GCS, Azure, and more
- ✅ Creating cloud catalogs - Same API for local and cloud
- ✅ Authentication methods - Different approaches for each provider
- ✅ Working with remote files - Lazy loading and automatic cleanup
- ✅ Key differences - Network latency, costs, authentication
- ✅ Best practices - Security, performance, cost optimization
- ✅ Common workflows - Development to production patterns
- ✅ Troubleshooting - Solutions to common issues
Key Concepts
- Unified API: Same methods work for local and cloud storage
- Content-addressed storage: Files stored by hash, works the same everywhere
- Lazy loading: Files downloaded on-demand when accessed
- Authentication: Configured at catalog/dataset creation, not per operation
- Backend-agnostic: Switch between backends by changing the URL
Next Steps
- Setup Cloud Storage - Complete guide for setting up AWS S3, Google Cloud Storage, and Azure Blob Storage
- Track Model Training Data - See cloud storage in action with ML workflows