Core Concepts

Understanding the fundamental concepts behind Kirin's data versioning system.

What is Kirin?

Kirin is a simplified tool for version-controlling data using content-addressed storage. It provides linear commit history for datasets without the complexity of branching and merging.

Key Concepts

Datasets

A dataset is a logical collection of files that you want to version together. Think of it as a folder that tracks changes over time.

from kirin import Dataset

# Create a dataset
dataset = Dataset(root_dir="/path/to/data", name="my_dataset")

Characteristics:

Contains multiple files
Has a linear commit history
Can be shared and collaborated on
Maintains data integrity through content-addressing

Files

A file in Kirin represents a versioned file with content-addressed storage. Files are immutable once created and identified by their content hash.

# Get a file from the current commit
file_obj = dataset.get_file("data.csv")
print(f"File hash: {file_obj.hash}")
print(f"File size: {file_obj.size} bytes")
print(f"Content type: {file_obj.content_type}")
print(f"Short hash: {file_obj.short_hash}")

Key Properties:

Content-addressed: Identified by content hash, not filename
Immutable: Cannot be changed once created
Deduplicated: Identical content stored only once
Backend-agnostic: Works with any storage backend

Commits

A commit represents an immutable snapshot of files at a point in time. Commits form a linear history with single parent relationships.

# Create a commit
commit_hash = dataset.commit(
    message="Add new data",
    add_files=["data.csv", "config.json"]
)

# Get commit information
commit = dataset.get_commit(commit_hash)
print(f"Commit: {commit.hash}")
print(f"Message: {commit.message}")
print(f"Timestamp: {commit.timestamp}")
print(f"Files: {commit.list_files()}")
print(f"Short hash: {commit.short_hash}")

Characteristics:

Linear history: No branching, simple parent-child relationships
Immutable: Cannot be changed once created
Atomic: All files in a commit are added/removed together
Traceable: Full history of changes over time

Content-Addressed Storage

Content-addressed storage means files are stored and identified by their content hash, not their filename or location.

Benefits:

Data integrity: Files cannot be corrupted without detection
Deduplication: Identical content stored only once
Efficient storage: Saves space by avoiding duplicate files
Tamper-proof: Any change to content changes the hash

Storage Layout:

data/
├── ab/                  # First two characters of hash
│   └── cdef1234...      # Rest of hash (no file extensions)
└── ...

Important: Files are stored without extensions in the content store. Original extensions are preserved as metadata in the File entity's name attribute and restored when files are accessed.

Catalogs

A catalog is a collection of datasets that you want to manage together. It's like a workspace for multiple related datasets.

from kirin import Catalog

# Create a catalog
catalog = Catalog(root_dir="/path/to/data")

# List all datasets
datasets = catalog.datasets()
print(f"Available datasets: {datasets}")

# Get a specific dataset
dataset = catalog.get_dataset("my_dataset")

Use Cases:

Project organization: Group related datasets
Team collaboration: Share multiple datasets
Workflow management: Organize data processing pipelines

How It Works

1. File Storage

When you add a file to Kirin:

Hash calculation: File content is hashed (SHA256)
Content storage: File stored at data/{hash[:2]}/{hash[2:]}
Metadata tracking: Original filename stored as metadata
Deduplication: If file already exists, no duplicate storage

2. Commit Process

When you create a commit:

File staging: Files to be added/removed are identified
Hash resolution: Content hashes are calculated/resolved
Commit creation: New commit object created with file references
History update: Commit added to linear history

3. Data Access

When you access files:

Commit resolution: Current commit or specific commit identified
File lookup: File references resolved to content hashes
Content retrieval: Files retrieved from content store
Extension restoration: Original filenames restored

Linear vs. Branching

Kirin uses linear commit history instead of Git's branching model:

Linear History (Kirin):

Commit A → Commit B → Commit C → Commit D

Branching History (Git):

Commit A → Commit B → Commit C
         ↘
           Commit D → Commit E

Benefits of Linear History:

Simpler: No merge conflicts or complex branching
Clearer: Easy to understand data evolution
Safer: No risk of losing data through complex merges
Faster: No need to resolve merge conflicts

Creating "Branches" with New Datasets

If you need branching-like functionality, create a new dataset using existing files:

# Original dataset
original_dataset = catalog.get_dataset("experiment_v1")

# Create a "branch" by starting a new dataset with existing files
branch_dataset = catalog.create_dataset("experiment_v2")

# Copy files from the original dataset to the new one
with original_dataset.local_files() as local_files:
    for filename, local_path in local_files.items():
        # Copy file to new dataset
        import shutil
        shutil.copy2(local_path, filename)

# Commit the copied files to the new dataset
branch_dataset.commit(
    message="Branch from experiment_v1",
    add_files=["data.csv", "config.json"]
)

# Now you can develop independently in each dataset
original_dataset.commit("Continue original work", add_files=["new_data.csv"])
branch_dataset.commit("Try different approach", add_files=["alternative_data.csv"])

Benefits of Dataset-based "Branching":

Clear separation: Each dataset is independent
Easy comparison: Compare datasets side by side
No conflicts: No merge conflicts between datasets
Flexible: Can share files between datasets as needed

Backend-Agnostic Design

Kirin works with any storage backend through the fsspec library:

Supported Backends:

Local filesystem: /path/to/data
AWS S3: s3://bucket/path
Google Cloud Storage: gs://bucket/path
Azure Blob Storage: az://container/path
And many more: Dropbox, Google Drive, etc. (sync/auth handled by backend)

Benefits:

Flexibility: Use any storage backend
Scalability: Scale from local to cloud
Portability: Move between backends easily
Cost optimization: Choose the right storage for your needs

Zero-Copy Operations

Kirin is designed with zero-copy philosophy for efficient large file handling:

Zero-Copy Features:

Memory-mapped files: Avoid loading entire files into memory
Chunked processing: Process data incrementally using libraries like pandas
Direct transfers: Stream between storage backends
Reference-based operations: Use references instead of copying

Benefits:

Memory efficient: Handle files larger than RAM
Fast operations: No unnecessary data copying
Scalable: Work with datasets of any size

Next Steps

Quickstart - Try Kirin with a simple example
Basic Usage Guide - Learn common workflows
Working with Files - File operations and patterns