Architecture Overview
High-level system architecture and design principles for Kirin.
System Architecture
Kirin implements a simplified content-addressed storage system with the following key components:
```mermaid
graph TB
    subgraph "Kirin Core Components"
        DA["Dataset API<br/>• File operations<br/>• Commit management<br/>• Versioning"]
        CS["Commit Store<br/>• Linear history<br/>• Metadata<br/>• References"]
        COS["Content Store<br/>• Content hash<br/>• Deduplication<br/>• Backend agnostic"]
    end
    subgraph "Storage Layer"
        FS["FSSpec Layer<br/>• Local FS<br/>• S3<br/>• GCS<br/>• Azure<br/>• Other backends"]
    end
    DA --> CS
    DA --> COS
    CS --> COS
    COS --> FS
```
Core Design Principles
1. Simplified Data Versioning
Kirin is a simplified "git" for data: it follows Git conventions but keeps history strictly linear:
- Linear Commits: Simple, linear commit history without branching complexity
- Content-Addressed Storage: Files stored by content hash for integrity and deduplication
- Ergonomic Python API: Focus on ease of use and developer experience
- Backend-Agnostic: Works with any storage backend via fsspec
- No Branching: Linear-only commit history to avoid complexity
2. Content-Addressed Storage Design
CRITICAL: Files are stored without file extensions in the content-addressed storage system:
- Storage Path: `root_dir/data/{hash[:2]}/{hash[2:]}` (e.g., `data/ab/cdef1234...`; see the sketch after this list)
- No Extensions: Original `.csv`, `.txt`, `.json` extensions are not preserved in storage
- Metadata Storage: File extensions are stored as metadata in the `File` entity's `name` attribute
- Extension Restoration: When files are downloaded or accessed, they get their original names back
- Content Integrity: Files are identified purely by content hash, ensuring data integrity
- Deduplication: Identical content (regardless of original filename) is stored only once
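The path scheme above can be sketched in a few lines of Python. Note that SHA-256 is an assumption here; this document does not state which hash algorithm Kirin actually uses.

```python
import hashlib


def storage_path(content: bytes, root_dir: str = ".") -> str:
    """Derive the content-addressed path for a blob.

    Illustrative sketch only: SHA-256 is an assumed hash algorithm,
    not necessarily the one Kirin uses.
    """
    digest = hashlib.sha256(content).hexdigest()
    # Blobs live at root_dir/data/{hash[:2]}/{hash[2:]}, with no extension.
    return f"{root_dir}/data/{digest[:2]}/{digest[2:]}"


# Identical bytes always map to the same path, so duplicates are stored once.
print(storage_path(b"column_a,column_b\n1,2\n"))
```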
3. File Access Patterns
Kirin provides simple file access through standard Python file operations; the sketch after this list illustrates the pattern:
- Temporary file downloads: Files are downloaded to temporary locations when accessed
- Standard file handles: Files are accessed through normal Python file objects
- Automatic cleanup: Temporary files are automatically cleaned up when file handles are closed
- Streaming support: Large files can be streamed through fsspec backends for efficient transfer
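The pattern these bullets describe can be sketched with the standard library plus fsspec. The `open_tracked_file` helper below is a hypothetical illustration of the mechanics, not Kirin's actual implementation.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager

import fsspec


@contextmanager
def open_tracked_file(blob_url: str, original_name: str):
    """Download a content-addressed blob to a temp file and yield a
    standard file handle. Hypothetical sketch of the access pattern."""
    tmp_dir = tempfile.mkdtemp()
    local_path = os.path.join(tmp_dir, original_name)  # restore original name
    with fsspec.open(blob_url, "rb") as src, open(local_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks via the backend
    try:
        with open(local_path, "rb") as handle:
            yield handle  # an ordinary Python file object
    finally:
        shutil.rmtree(tmp_dir)  # cleanup once the handle is closed


# Usage (paths are placeholders): the blob is stored by hash,
# but the caller sees the original name "train.csv".
with open_tracked_file("file:///tmp/kirin/data/ab/cdef1234", "train.csv") as fh:
    first_line = fh.readline()
```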
Key Benefits
For Data Scientists
- Linear Data Versioning: Track changes to datasets with simple, linear commits
- Content-Addressed Storage: Ensure data integrity and enable deduplication
- Multi-Backend Support: Work with S3, GCS, Azure, local filesystem, and more
- Serverless Architecture: No dedicated servers required
- Ergonomic Python API: Focus on ease of use and developer experience
- File Versioning: Track changes to individual files over time
For Data Engineers
- Backend-agnostic: Works with any storage backend via fsspec
- Automatic deduplication: Identical files stored once, saving space
- Content integrity: Files stored by content hash for data integrity
- Performance optimized: Chunked processing for large files
- Extensible: Easy to add new backends and features
User Personas and Jobs to be Done
Data Scientist / ML Engineer
Jobs to be Done:
- Track Experiment Data: "I need to keep track of which datasets were used in which experiments so I can reproduce my results."
- Find and Use the Right Data Version: "I need to identify and access specific versions of datasets for training models."
- Collaborate with Team Members: "I need to share datasets with colleagues in a way that ensures we're all using the same exact data."
- Document Data Transformations: "I need to track how raw data is transformed into model-ready data."
Data Engineer
Jobs to be Done:
- Manage Data Pipelines: "I need to ensure data pipelines produce consistent, traceable outputs."
- Optimize Storage Usage: "I need to handle large datasets efficiently without wasting storage."
- Support Multiple Storage Solutions: "I need to work with data across various storage systems our organization uses."
- Ensure Data Governance: "I need to track who accesses what data and how it's used."
Data Team Manager / Lead
Jobs to be Done:
- Ensure Reproducibility: "I need to guarantee that our team's work is reproducible for scientific integrity and audit purposes."
- Manage Technical Debt: "I need to understand data dependencies to prevent cascading failures when data changes."
- Accelerate Onboarding: "I need new team members to quickly understand our data ecosystem."
- Support Regulatory Compliance: "I need to demonstrate data provenance for regulatory compliance."
MLOps Engineer
Jobs to be Done:
- Deploy Models with Data Dependencies: "I need to package models with their exact data dependencies."
- Monitor Data Drift: "I need to compare production data against training data to detect drift."
- Implement Data-Centric CI/CD: "I need automated tests that verify data quality across pipeline stages."
- Roll Back Data When Needed: "I need to quickly revert to previous data versions if issues arise."
Feature-to-Job Mapping
Content-Addressed Storage
- Jobs: Track Experiment Data, Find and Use the Right Data Version, Collaborate with Team Members, Ensure Reproducibility
- Users: Data Scientist, ML Engineer, Team Lead, Laboratory Scientist
Automatic Lineage Tracking
- Jobs: Document Data Transformations, Manage Data Pipelines, Track Sample Lineage, Manage Technical Debt
- Users: Data Scientist, Data Engineer, Laboratory Scientist
Backend-Agnostic Storage
- Jobs: Support Multiple Storage Solutions, Optimize Storage Usage, Manage Collaborative Research
- Users: Data Engineer, MLOps Engineer, Laboratory Scientist
Dataset Versioning
- Jobs: Deploy Models with Data Dependencies, Roll Back Data When Needed, Monitor Data Drift, Ensure Experimental Reproducibility
- Users: MLOps Engineer, Data Engineer, Laboratory Scientist
Usage Tracking
- Jobs: Document Data Usage, Ensure Data Governance, Support Regulatory Compliance, Document Methods and Parameters
- Users: Team Lead, Laboratory Scientist
Streaming File Access
- Jobs: Optimize Storage Usage, Handle Large Datasets
- Users: Data Engineer, MLOps Engineer
Data Catalog
- Jobs: Accelerate Onboarding, Find and Use the Right Data Version, Manage Collaborative Research
- Users: Team Lead, Data Scientist, Laboratory Scientist
Path-Based API
- Jobs: Implement Data-Centric CI/CD, Manage Data Pipelines
- Users: MLOps Engineer, Data Engineer
System Flow
1. Data Ingestion Flow
- User provides files to be tracked
- Files are hashed and stored in content store
- A commit is created with references to file versions
- The commit is recorded in the linear history (see the sketch below)
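A minimal sketch of this flow using `hashlib` and fsspec; the hash algorithm, chunk size, and local root URL are all assumptions for illustration.

```python
import hashlib

import fsspec


def ingest_file(local_path: str, root: str = "file:///tmp/kirin") -> str:
    """Hash a file in chunks and store it content-addressed.

    Sketch only: SHA-256 and the 1 MiB chunk size are assumptions.
    """
    h = hashlib.sha256()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    fs, path = fsspec.core.url_to_fs(f"{root}/data/{digest[:2]}/{digest[2:]}")
    fs.makedirs(path.rsplit("/", 1)[0], exist_ok=True)
    fs.put(local_path, path)  # upload the blob to the backend
    return digest  # the commit records this hash against the file's name
```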
2. Data Access Flow
- User requests a specific version of a file or dataset
- System resolves the logical path to a content hash
- Content is retrieved from the storage backend
- Content is provided to the user (see the sketch below)
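Continuing the sketch above, resolution can be modeled as a lookup in a commit's file manifest; the manifest shape (logical path mapped to content hash) is an assumed representation of a commit's file references.

```python
import fsspec


def read_version(manifest: dict, logical_path: str,
                 root: str = "file:///tmp/kirin") -> bytes:
    """Resolve a logical path to a content hash, then fetch the blob.

    Sketch only: the manifest is an assumed stand-in for the commit's
    file references.
    """
    digest = manifest[logical_path]  # logical path -> content hash
    url = f"{root}/data/{digest[:2]}/{digest[2:]}"
    with fsspec.open(url, "rb") as f:
        return f.read()
```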
3. Data Processing Flow
- User accesses input data files
- Processing is performed on the data
- Output files are stored in Kirin
- A new commit is created with the updated files (see the sketch below)
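Putting the pieces together, a processing round trip reuses `ingest_file` and `read_version` from the sketches above; commit bookkeeping is reduced here to updating the manifest.

```python
# Sketch of a processing round trip (assumes data.csv exists locally).
manifest = {"raw/data.csv": ingest_file("data.csv")}

raw = read_version(manifest, "raw/data.csv")
cleaned = raw.replace(b"\r\n", b"\n")  # stand-in for real processing

with open("cleaned.csv", "wb") as f:
    f.write(cleaned)

# Storing the output and recording its hash yields the next (linear) commit.
manifest["processed/cleaned.csv"] = ingest_file("cleaned.csv")
```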
Linear vs. Branching
Kirin uses a linear commit history instead of Git's branching model; a minimal sketch of the commit model closes this section:
Linear History (Kirin):
```mermaid
graph LR
    A[Commit A] --> B[Commit B]
    B --> C[Commit C]
    C --> D[Commit D]
```
Branching History (Git):
```mermaid
graph TD
    A[Commit A] --> B[Commit B]
    B --> C[Commit C]
    B --> D[Commit D]
    D --> E[Commit E]
```
Benefits of Linear History:
- Simpler: No merge conflicts or complex branching
- Clearer: Easy to understand data evolution
- Safer: No risk of losing data through complex merges
- Faster: No need to resolve merge conflicts
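To make the contrast concrete, here is a minimal model of linear history: each commit records at most one parent, so the chain can never fork or merge. The field names are illustrative assumptions, not Kirin's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    """One link in a linear history (illustrative field names)."""
    hash: str
    parent: Optional[str]  # exactly one parent (None for the root commit),
                           # so history cannot branch or merge
    message: str


root = Commit("a1b2", None, "Initial data")
nxt = Commit("c3d4", root.hash, "Clean column names")
```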
Backend-Agnostic Design
Kirin works with any storage backend through the fsspec library (see the sketch after the backend list):
Supported Backends:
- Local filesystem: `/path/to/data`
- AWS S3: `s3://bucket/path`
- Google Cloud Storage: `gs://bucket/path`
- Azure Blob Storage: `az://container/path`
- S3-compatible services: Minio, Backblaze B2, DigitalOcean Spaces, Wasabi
- And many more: Dropbox, Google Drive, etc. (sync/auth handled by backend)
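Because each backend is addressed by a URL scheme, one code path serves them all. The short sketch below uses fsspec's `url_to_fs` helper; the cloud schemes additionally require the `s3fs` or `gcsfs` packages to be installed.

```python
import fsspec

# The scheme selects the backend; the rest of the code is identical.
for url in [
    "file:///tmp/kirin/data/ab/cdef1234",
    "s3://bucket/data/ab/cdef1234",   # needs s3fs installed
    "gs://bucket/data/ab/cdef1234",   # needs gcsfs installed
]:
    fs, path = fsspec.core.url_to_fs(url)  # resolve backend from scheme
    print(type(fs).__name__, path)
```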
Benefits:
- Flexibility: Use any storage backend
- Scalability: Scale from local to cloud
- Portability: Move between backends easily
- Cost optimization: Choose the right storage for your needs
System Flow Diagrams
Data Ingestion Flow
```mermaid
sequenceDiagram
    participant U as User
    participant D as Dataset API
    participant CS as Content Store
    participant CM as Commit Store
    U->>D: Provide files to track
    D->>CS: Hash and store files
    CS-->>D: Return content hashes
    D->>CM: Create commit with file references
    CM-->>D: Confirm commit saved to linear history
    D-->>U: Return commit hash
```
Data Access Flow
```mermaid
sequenceDiagram
    participant U as User
    participant D as Dataset API
    participant CS as Content Store
    participant FS as FSSpec Layer
    U->>D: Request specific file version
    D->>D: Resolve logical path to content hash
    D->>CS: Retrieve content by hash
    CS->>FS: Read from storage backend
    FS-->>CS: Return file content
    CS-->>D: Return content bytes
    D-->>U: Provide file to user
```
Next Steps
- API Reference - Complete API documentation
- Storage Format - Technical storage details