Cloud Storage

Set up and use Kirin with cloud storage backends like S3, GCS, and Azure.

Overview

Kirin supports multiple cloud storage backends through the fsspec library. You can use the same API whether you're working with local files or cloud storage.

Supported Backends:

  • AWS S3: s3://bucket/path
  • Google Cloud Storage: gs://bucket/path
  • Azure Blob Storage: az://container/path
  • Any other fsspec-compatible backend: Dropbox, Google Drive, etc.
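
Because the backend is chosen from the URL scheme, switching between local and cloud storage is typically just a change of root_dir. A minimal sketch (bucket, path, and file names are placeholders):

from kirin import Catalog

# Local filesystem
local_catalog = Catalog(root_dir="/tmp/kirin-data")

# AWS S3: same API, only the root_dir changes
cloud_catalog = Catalog(root_dir="s3://my-bucket/data")

# The rest of the workflow is identical for both backends
dataset = cloud_catalog.create_dataset("example_dataset")
dataset.commit(message="Initial data", add_files=["data.csv"])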

Authentication Methods

AWS/S3 Authentication

from kirin import Catalog, Dataset

# Using AWS profile
catalog = Catalog(
    root_dir="s3://{{ bucket_name }}/data",
    aws_profile="{{ aws_profile }}"
)

# Using Dataset with AWS profile
dataset = Dataset(
    root_dir="s3://my-bucket/data",
    name="{{ dataset_name }}",
    aws_profile="my-profile"
)

Using Environment Variables

Set environment variables in your shell or system:

# Set AWS credentials
export AWS_ACCESS_KEY_ID={{ access_key_id }}
export AWS_SECRET_ACCESS_KEY={{ secret_access_key }}
export AWS_DEFAULT_REGION={{ region }}

# Set Azure credentials
export AZURE_CONNECTION_STRING={{ azure_connection_string }}

Then use without explicit credentials:

# Environment variables are automatically detected
catalog = Catalog(root_dir="s3://{{ bucket_name }}/data")

Using IAM Roles (EC2/ECS/Lambda)

# No explicit credentials needed - uses IAM role automatically
catalog = Catalog(root_dir="s3://my-bucket/data")

Using AWS SSO

# After running: aws sso login
catalog = Catalog(
    root_dir="s3://{{ bucket_name }}/data",
    aws_profile="{{ sso_profile_name }}"
)

GCP/GCS Authentication

Using Service Account Key File

from kirin import Catalog, Dataset

# Using service account key file
catalog = Catalog(
    root_dir="gs://{{ bucket_name }}/data",
    gcs_token="/path/to/service-account.json",
    gcs_project="{{ project_id }}"
)

# Using Dataset with GCS credentials
dataset = Dataset(
    root_dir="gs://{{ bucket_name }}/data",
    name="{{ dataset_name }}",
    gcs_token="/path/to/service-account.json",
    gcs_project="{{ project_id }}"
)

Using Application Default Credentials

# Set up ADC (one-time setup)
# gcloud auth application-default login

# Use without explicit credentials (automatically detects ADC)
catalog = Catalog(root_dir="gs://my-bucket/data")

Using Environment Variables (GCS)

import os

# Set environment variable
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"

# Use without explicit credentials (automatically detects environment)
catalog = Catalog(root_dir="gs://my-bucket/data")

Using Workload Identity (GKE/Kubernetes)

# No explicit credentials needed - uses workload identity
catalog = Catalog(root_dir="gs://my-bucket/data")

Azure Blob Storage Authentication

Using Connection String

import os

from kirin import Catalog, Dataset

# Using connection string
catalog = Catalog(
    root_dir="az://{{ container_name }}/data",
    azure_connection_string=os.getenv("AZURE_CONNECTION_STRING")
)

# Using Dataset with connection string
dataset = Dataset(
    root_dir="az://{{ container_name }}/data",
    name="{{ dataset_name }}",
    azure_connection_string=os.getenv("AZURE_CONNECTION_STRING")
)

Using Account Name and Key

catalog = Catalog(
    root_dir="az://{{ container_name }}/data",
    azure_account_name="myaccount",
    azure_account_key="mykey"
)

Using Environment Variables (Azure)

import os

# Set environment variables
os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = "myaccount"
os.environ["AZURE_STORAGE_ACCOUNT_KEY"] = "mykey"

# Use without explicit credentials (automatically detects environment)
catalog = Catalog(root_dir="az://my-container/data")

Using Azure CLI Authentication

# After running: az login
# No explicit credentials needed - uses Azure CLI authentication
catalog = Catalog(root_dir="az://my-container/data")

Using Managed Identity (Azure VMs/App Service)

# No explicit credentials needed - uses managed identity
catalog = Catalog(root_dir="az://my-container/data")

Working with Cloud Storage

Basic Operations

# Create catalog with cloud storage
catalog = Catalog(root_dir="s3://my-bucket/data")

# Create dataset
dataset = catalog.create_dataset("cloud_dataset")

# Add files (same API as local storage)
commit_hash = dataset.commit(
    message="Add data to cloud",
    add_files=["data.csv", "config.json"]
)

# Work with files (same API as local storage)
with dataset.local_files() as local_files:
    for filename, local_path in local_files.items():
        print(f"{filename} -> {local_path}")
        # Process files normally

Performance Considerations

Processing Large Files

# For large files, use chunked processing
import pandas as pd

with dataset.local_files() as local_files:
    if "large_data.csv" in local_files:
        local_path = local_files["large_data.csv"]
        # Read in chunks so the whole file never sits in memory at once
        for chunk in pd.read_csv(local_path, chunksize=10000):
            print(f"Processing chunk with {len(chunk)} rows")
            process_chunk(chunk)  # replace with your own processing function

Batch Operations

# Batch multiple operations for better performance
files_to_add = ["file1.csv", "file2.csv", "file3.csv"]
dataset.commit(
    message="Add multiple files",
    add_files=files_to_add
)

Error Handling

from botocore.exceptions import ClientError

try:
    catalog = Catalog(root_dir="s3://my-bucket/data")
    dataset = catalog.get_dataset("my-dataset")
except ClientError as e:
    if e.response['Error']['Code'] == 'NoSuchBucket':
        print("Bucket does not exist")
    elif e.response['Error']['Code'] == 'AccessDenied':
        print("Access denied - check your credentials")
    else:
        print(f"AWS error: {e}")
except Exception as e:
    print(f"General error: {e}")

Web UI Cloud Integration

Setting Up Cloud Catalogs

The web UI supports cloud storage through a simple interface:

  1. Authenticate with your cloud provider using their CLI tools:
# AWS
aws configure

# GCP
gcloud auth login

# Azure
az login
  2. Create the catalog in the web UI:
     • Click "Add Catalog" in the web interface
     • Enter catalog details (ID, name, root directory)
     • For S3: select an AWS profile from the dropdown
     • For GCS/Azure: ensure credentials are configured via environment variables or CLI

  3. Authentication handling:
     • S3: the web UI provides profile selection
     • GCS/Azure: requires pre-configured credentials (environment variables, CLI auth, etc.)

Cloud Authentication in Web UI

  1. Create catalog with cloud URL: Use s3://, gs://, or az:// URLs
  2. AWS Profile Selection: Web UI provides AWS profile dropdown for S3 authentication
  3. Other Cloud Providers: For GCS and Azure, authentication must be configured programmatically or via environment variables
  4. Credentials stored securely: AWS profiles saved in catalog configuration
  5. Automatic authentication: Subsequent uses authenticate automatically

Troubleshooting

Common Issues

SSL Certificate Errors

# Set up SSL certificates for isolated Python environments
python -m kirin.setup_ssl

Authentication Failures

# Check your credentials
import boto3

# Test AWS credentials
session = boto3.Session(profile_name="my-profile")
s3 = session.client('s3')
s3.list_buckets()  # Should work without errors
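
Equivalent checks for the other providers, assuming the google-auth and azure-identity packages are installed:

# Test GCP credentials (Application Default Credentials)
import google.auth

credentials, project = google.auth.default()
print(f"GCP credentials found for project: {project}")

# Test Azure credentials (CLI login, managed identity, environment variables, ...)
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
credential.get_token("https://storage.azure.com/.default")  # raises if no credential is available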

Permission Issues

# Check bucket permissions
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
try:
    s3.head_bucket(Bucket='my-bucket')
    print("Bucket accessible")
except ClientError as e:
    print(f"Bucket not accessible: {e}")

Performance Optimization

Use Appropriate Regions

# Use same region as your compute resources
catalog = Catalog(
    root_dir="s3://{{ bucket_name }}/data",
    aws_profile="{{ aws_profile }}"
)
# Ensure bucket is in same region as your compute
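
To confirm where a bucket actually lives, you can query its region with boto3 (bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")
location = s3.get_bucket_location(Bucket="my-bucket")
# The API reports us-east-1 as None
region = location["LocationConstraint"] or "us-east-1"
print(f"Bucket region: {region}")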

Optimize File Sizes

# For very large files, consider chunking
# Split large files into smaller chunks
dataset.commit(
    message="Add chunked data",
    add_files=["chunk_001.csv", "chunk_002.csv", "chunk_003.csv"]
)

Use Compression

# Compress files before adding to reduce storage costs
import gzip
import shutil

# Compress file
with open("data.csv", "rb") as f_in:
    with gzip.open("data.csv.gz", "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

# Add compressed file
dataset.commit(
    message="Add compressed data",
    add_files=["data.csv.gz"]
)

Best Practices

Security

  • Use IAM roles when possible instead of access keys
  • Rotate credentials regularly
  • Use least privilege - only grant necessary permissions (see the policy sketch after this list)
  • Monitor access through cloud provider audit logs
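
As an illustration of least privilege, a bucket policy that grants a single role read/write access only under the catalog prefix, applied with boto3 (the account ID, role, bucket, and prefix are placeholders):

import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/kirin-data-role"},
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::my-bucket/data/*"
        },
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/kirin-data-role"},
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::my-bucket"
        }
    ]
}

boto3.client("s3").put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))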

Cost Optimization

  • Use appropriate storage classes (S3 Standard, IA, Glacier)
  • Enable lifecycle policies for automatic archival (see the sketch after this list)
  • Monitor usage through cloud provider dashboards
  • Use compression for text files
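
A lifecycle rule that moves older objects to cheaper storage classes can be applied with boto3 (bucket name, prefix, and day thresholds are placeholders):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-data",
                "Filter": {"Prefix": "data/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)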

Performance

  • Use same region as your compute resources
  • Batch operations when possible
  • Use chunked processing for large files
  • Consider CDN for frequently accessed data

Next Steps