Core Concepts
Understanding the fundamental concepts behind Kirin's data versioning system.
What is Kirin?
Kirin is a simplified tool for version-controlling data using content-addressed storage. It provides linear commit history for datasets without the complexity of branching and merging.
Key Concepts
Datasets
A dataset is a logical collection of files that you want to version together. Think of it as a folder that tracks changes over time.
from kirin import Dataset
# Create a dataset
dataset = Dataset(root_dir="/path/to/data", name="my_dataset")
Characteristics:
- Contains multiple files
- Has a linear commit history
- Can be shared and collaborated on
- Maintains data integrity through content-addressing
Files
A file in Kirin represents a versioned file with content-addressed storage. Files are immutable once created and identified by their content hash.
# Get a file from the current commit
file_obj = dataset.get_file("data.csv")
print(f"File hash: {file_obj.hash}")
print(f"File size: {file_obj.size} bytes")
print(f"Content type: {file_obj.content_type}")
print(f"Short hash: {file_obj.short_hash}")
Key Properties:
- Content-addressed: Identified by content hash, not filename
- Immutable: Cannot be changed once created
- Deduplicated: Identical content stored only once
- Backend-agnostic: Works with any storage backend
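Deduplication follows directly from content-addressing: two files with identical bytes resolve to the same hash and are stored once. A minimal sketch using only the calls shown above (the filenames a.csv and b.csv are hypothetical and assumed to contain identical content):
# Two files with identical content share a single content hash
dataset.commit(message="Add duplicate files", add_files=["a.csv", "b.csv"])
assert dataset.get_file("a.csv").hash == dataset.get_file("b.csv").hash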
Commits
A commit represents an immutable snapshot of files at a point in time. Commits form a linear history with single parent relationships.
# Create a commit
commit_hash = dataset.commit(
    message="Add new data",
    add_files=["data.csv", "config.json"]
)
# Get commit information
commit = dataset.get_commit(commit_hash)
print(f"Commit: {commit.hash}")
print(f"Message: {commit.message}")
print(f"Timestamp: {commit.timestamp}")
print(f"Files: {commit.list_files()}")
print(f"Short hash: {commit.short_hash}")
Characteristics:
- Linear history: No branching, simple parent-child relationships
- Immutable: Cannot be changed once created
- Atomic: All files in a commit are added/removed together
- Traceable: Full history of changes over time
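Because each commit has a single parent, the full history can be traversed by walking parent links backwards from any commit. A sketch, assuming each commit exposes a parent reference (the attribute name parent_hash is hypothetical, not confirmed API):
# Walk the linear history from a commit back to the first commit
commit = dataset.get_commit(commit_hash)
while commit is not None:
    print(f"{commit.short_hash}: {commit.message}")
    commit = dataset.get_commit(commit.parent_hash) if commit.parent_hash else None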
Content-Addressed Storage
Content-addressed storage means files are stored and identified by their content hash, not their filename or location.
Benefits:
- Data integrity: Files cannot be corrupted without detection
- Deduplication: Identical content stored only once
- Efficient storage: Saves space by avoiding duplicate files
- Tamper-proof: Any change to content changes the hash
Storage Layout:
data/
├── ab/              # First two characters of hash
│   └── cdef1234...  # Rest of hash (no file extensions)
└── ...
Important: Files are stored without extensions in the content store. Original extensions are preserved as metadata in the File entity's name attribute and restored when files are accessed.
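The principle is easy to see with Python's standard hashlib; this illustrates the idea rather than Kirin's internal code:
import hashlib

# Identity is derived from content: identical bytes hash to the same
# address, and changing a single byte produces a different address
original = b"id,value\n1,10\n"
tampered = b"id,value\n1,99\n"
print(hashlib.sha256(original).hexdigest() == hashlib.sha256(original).hexdigest())  # True: same content, same address
print(hashlib.sha256(original).hexdigest() == hashlib.sha256(tampered).hexdigest())  # False: any change is detectable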
Catalogs
A catalog is a collection of datasets that you want to manage together. It's like a workspace for multiple related datasets.
from kirin import Catalog
# Create a catalog
catalog = Catalog(root_dir="/path/to/data")
# List all datasets
datasets = catalog.datasets()
print(f"Available datasets: {datasets}")
# Get a specific dataset
dataset = catalog.get_dataset("my_dataset")
Use Cases:
- Project organization: Group related datasets
- Team collaboration: Share multiple datasets
- Workflow management: Organize data processing pipelines
How It Works
1. File Storage
When you add a file to Kirin:
- Hash calculation: File content is hashed (SHA256)
- Content storage: File stored at data/{hash[:2]}/{hash[2:]}
- Metadata tracking: Original filename stored as metadata
- Deduplication: If file already exists, no duplicate storage
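The storage location can be derived directly from the content hash as described above. A sketch of the scheme (illustrative only, not Kirin's internal implementation):
import hashlib

# Hash the file content and derive its content-store location
with open("data.csv", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
storage_path = f"data/{digest[:2]}/{digest[2:]}"
print(storage_path)  # e.g. data/ab/cdef1234... (no file extension)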
2. Commit Process
When you create a commit:
- File staging: Files to be added/removed are identified
- Hash resolution: Content hashes are calculated/resolved
- Commit creation: New commit object created with file references
- History update: Commit added to linear history
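Conceptually, a commit is a small immutable record that ties a message and timestamp to a set of file references and a single parent. A sketch of such a record (field names are illustrative, not Kirin's internal schema):
from dataclasses import dataclass, field

@dataclass(frozen=True)
class CommitRecord:
    message: str                # human-readable description of the change
    timestamp: float            # when the commit was created
    parent_hash: str | None     # single parent, or None for the first commit
    files: dict[str, str] = field(default_factory=dict)  # filename -> content hash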
3. Data Access
When you access files:
- Commit resolution: Current commit or specific commit identified
- File lookup: File references resolved to content hashes
- Content retrieval: Files retrieved from content store
- Extension restoration: Original filenames restored
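Putting these steps together with the calls shown earlier:
# Resolve a commit and list the files it references
commit = dataset.get_commit(commit_hash)
print(f"Files in {commit.short_hash}: {commit.list_files()}")

# Retrieve a file from the current commit; the original filename and
# extension come back from metadata
file_obj = dataset.get_file("data.csv")
print(f"data.csv -> {file_obj.short_hash}, {file_obj.size} bytes")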
Linear vs. Branching
Kirin uses linear commit history instead of Git's branching model:
Linear History (Kirin):
Commit A → Commit B → Commit C → Commit D
Branching History (Git):
Commit A → Commit B → Commit C
                   ↘
                     Commit D → Commit E
Benefits of Linear History:
- Simpler: No merge conflicts or complex branching
- Clearer: Easy to understand data evolution
- Safer: No risk of losing data through complex merges
- Faster: No need to resolve merge conflicts
Creating "Branches" with New Datasets
If you need branching-like functionality, create a new dataset using existing files:
import shutil

# Original dataset
original_dataset = catalog.get_dataset("experiment_v1")

# Create a "branch" by starting a new dataset with existing files
branch_dataset = catalog.create_dataset("experiment_v2")

# Copy files from the original dataset into the working directory
copied_files = []
with original_dataset.local_files() as local_files:
    for filename, local_path in local_files.items():
        shutil.copy2(local_path, filename)
        copied_files.append(filename)

# Commit the copied files to the new dataset
branch_dataset.commit(
    message="Branch from experiment_v1",
    add_files=copied_files
)

# Now you can develop independently in each dataset
original_dataset.commit(message="Continue original work", add_files=["new_data.csv"])
branch_dataset.commit(message="Try different approach", add_files=["alternative_data.csv"])
Benefits of Dataset-based "Branching":
- Clear separation: Each dataset is independent
- Easy comparison: Compare datasets side by side
- No conflicts: No merge conflicts between datasets
- Flexible: Can share files between datasets as needed
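For side-by-side comparison, a sketch using the local_files() helper shown above:
# Compare the two datasets by filename
with original_dataset.local_files() as original, branch_dataset.local_files() as branch:
    print("Only in experiment_v1:", sorted(set(original) - set(branch)))
    print("Only in experiment_v2:", sorted(set(branch) - set(original)))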
Backend-Agnostic Design
Kirin works with any storage backend through the fsspec library:
Supported Backends:
- Local filesystem: /path/to/data
- AWS S3: s3://bucket/path
- Google Cloud Storage: gs://bucket/path
- Azure Blob Storage: az://container/path
- And many more: Dropbox, Google Drive, etc. (sync/auth handled by backend)
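Switching backends is a matter of changing the root path. A sketch, assuming root_dir accepts an fsspec-style URL and that the relevant credentials (e.g. for s3fs) are already configured; the bucket name is hypothetical:
from kirin import Catalog, Dataset

# Local filesystem
local_catalog = Catalog(root_dir="/path/to/data")

# AWS S3 (authentication handled by the backend, e.g. via environment variables)
s3_dataset = Dataset(root_dir="s3://my-bucket/kirin-data", name="my_dataset")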
Benefits:
- Flexibility: Use any storage backend
- Scalability: Scale from local to cloud
- Portability: Move between backends easily
- Cost optimization: Choose the right storage for your needs
Zero-Copy Operations
Kirin is designed with zero-copy philosophy for efficient large file handling:
Zero-Copy Features:
- Memory-mapped files: Avoid loading entire files into memory
- Chunked processing: Process data incrementally using libraries like pandas
- Direct transfers: Stream between storage backends
- Reference-based operations: Use references instead of copying
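For example, a large CSV can be processed in chunks from the local path provided by local_files() rather than loaded whole. A sketch, assuming data.csv exists in the dataset's current commit:
import pandas as pd

# Stream the CSV in chunks instead of loading it all into memory
row_count = 0
with dataset.local_files() as local_files:
    for chunk in pd.read_csv(local_files["data.csv"], chunksize=100_000):
        row_count += len(chunk)
print(f"Total rows: {row_count}")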
Benefits:
- Memory efficient: Handle files larger than RAM
- Fast operations: No unnecessary data copying
- Scalable: Work with datasets of any size
Next Steps
- Quickstart - Try Kirin with a simple example
- Basic Usage Guide - Learn common workflows
- Working with Files - File operations and patterns