Storage Format
Technical details about Kirin's storage format and data structures.
Overview
Kirin uses a simplified Git-like storage format optimized for data versioning. The storage is organized into two main areas:
- Content Store: Content-addressed file storage
- Dataset Store: Commit history and metadata
Storage Layout
<root>/
├── data/                      # Content-addressed storage
│   ├── ab/                    # First two characters of hash
│   │   └── cdef1234...        # Rest of hash (no file extensions)
│   └── ...
└── datasets/                  # Dataset storage
    ├── dataset1/              # Dataset directory
    │   └── commits.json       # Linear commit history
    └── ...
Content Store
File Storage
Files are stored in the content store using their content hash:
Storage Path: data/{hash[:2]}/{hash[2:]}
Example:
- File hash: abc123def456...
- Storage path: data/ab/c123def456...
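The path split is a simple string operation. A minimal sketch of how a hash maps to its location in the content store (the helper name storage_path_for is illustrative, not part of Kirin's API):

def storage_path_for(hash_value: str) -> str:
    """Map a content hash to its path in the content store.

    The first two hex characters become the directory name, which keeps
    any single directory from accumulating too many entries.
    """
    return f"data/{hash_value[:2]}/{hash_value[2:]}"

# Example (hypothetical hash)
print(storage_path_for("abc123def456"))  # data/ab/c123def456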
Critical Design: Extension-less Storage
Files are stored WITHOUT file extensions in the content store:
- Storage Path: data/ab/cdef1234... (no .csv, .txt, etc.)
- Original Extensions: Stored as metadata in the File entity's name attribute
- Extension Restoration: Original filenames are restored when files are accessed
- Content Integrity: Files are identified purely by content hash
- Deduplication: Identical content is stored only once, regardless of original filename
Benefits of Extension-less Storage
- Content Integrity: Files identified by content, not filename
- Deduplication: Identical content stored once regardless of original name
- Tamper-proof: Any change to content changes the hash
- Efficient Storage: No duplicate storage for identical content
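As a rough illustration of the points above, the sketch below stores a file without its extension and restores its original name on the way out. The helpers store_with_name and restore_to are hypothetical, not Kirin API; they only show how the filename travels with the metadata rather than the storage path.

import hashlib
import shutil
from pathlib import Path

def store_with_name(file_path: Path, root: Path) -> dict:
    """Store content without its extension and keep the name as metadata."""
    content = file_path.read_bytes()
    hash_value = hashlib.sha256(content).hexdigest()
    target = root / "data" / hash_value[:2] / hash_value[2:]   # no extension
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)
    return {"hash": hash_value, "name": file_path.name, "size": len(content)}

def restore_to(meta: dict, root: Path, out_dir: Path) -> Path:
    """Copy content back out under its original filename."""
    source = root / "data" / meta["hash"][:2] / meta["hash"][2:]
    destination = out_dir / meta["name"]                        # extension restored
    shutil.copyfile(source, destination)
    return destination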
Dataset Store
Commit History Format
Each dataset maintains a single JSON file with linear commit history:
File: datasets/{dataset_name}/commits.json
{
  "dataset_name": "my_dataset",
  "commits": [
    {
      "hash": "abc123...",
      "message": "Initial commit",
      "timestamp": "2024-01-01T12:00:00",
      "parent_hash": null,
      "files": {
        "data.csv": {
          "hash": "def456...",
          "name": "data.csv",
          "size": 1024,
          "content_type": "text/csv"
        }
      }
    },
    {
      "hash": "ghi789...",
      "message": "Add processed data",
      "timestamp": "2024-01-01T13:00:00",
      "parent_hash": "abc123...",
      "files": {
        "data.csv": {
          "hash": "def456...",
          "name": "data.csv",
          "size": 1024,
          "content_type": "text/csv"
        },
        "processed.csv": {
          "hash": "jkl012...",
          "name": "processed.csv",
          "size": 2048,
          "content_type": "text/csv"
        }
      }
    }
  ]
}
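Reading the history back is a plain JSON load. A minimal sketch (load_history is an illustrative helper, not Kirin's internal loader):

import json
from pathlib import Path
from typing import List

def load_history(root: Path, dataset_name: str) -> List[dict]:
    """Load the linear commit history for a dataset as a list of dicts."""
    commits_path = root / "datasets" / dataset_name / "commits.json"
    with open(commits_path) as f:
        data = json.load(f)
    return data["commits"]   # the newest commit is the last entry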
Commit Structure
Each commit contains:
- hash: SHA256 hash of the commit
- message: Human-readable commit message
- timestamp: ISO 8601 timestamp
- parent_hash: Hash of parent commit (null for first commit)
- files: Dictionary mapping filename to file metadata
File Metadata
Each file entry contains:
- hash: Content hash of the file
- name: Original filename (including extension)
- size: File size in bytes
- content_type: MIME type of the file
Data Structures
File Entity
@dataclass(frozen=True)
class File:
    """Represents a versioned file with content-addressed storage."""

    hash: str                            # Content hash (SHA256)
    name: str                            # Original filename
    size: int                            # File size in bytes
    content_type: Optional[str] = None   # MIME type

    def read_bytes(self) -> bytes: ...
    def open(self, mode: str = "rb") -> Union[BinaryIO, TextIO]: ...
    def download_to(self, path: Union[str, Path]) -> str: ...
    def exists(self) -> bool: ...
    def to_dict(self) -> dict: ...
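A typical interaction with a File object, assuming dataset is an existing Dataset instance (the filename and target path are illustrative):

file = dataset.get_file("data.csv")          # File entity from the current commit
if file is not None and file.exists():
    raw = file.read_bytes()                  # bytes pulled from the content store
    local_path = file.download_to("/tmp/data.csv")
    print(file.hash, file.size, file.content_type)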
Commit Entity
@dataclass(frozen=True)
class Commit:
    """Represents an immutable snapshot of files at a point in time."""

    hash: str                    # Commit hash
    message: str                 # Commit message
    timestamp: datetime          # Creation timestamp
    parent_hash: Optional[str]   # Parent commit hash
    files: Dict[str, File]       # filename -> File mapping

    def get_file(self, name: str) -> Optional[File]: ...
    def list_files(self) -> List[str]: ...
    def has_file(self, name: str) -> bool: ...
    def get_total_size(self) -> int: ...
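A short usage sketch against a Commit instance, again assuming dataset is an existing Dataset:

commit = dataset.head                        # latest Commit, or None if empty
if commit is not None:
    print(commit.message, commit.timestamp)
    for name in commit.list_files():
        f = commit.get_file(name)
        print(name, f.size)
    print("total bytes:", commit.get_total_size())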
Dataset Entity
class Dataset:
    """Represents a logical collection of files with linear history."""

    def __init__(self, root_dir: Union[str, Path], name: str,
                 description: str = "",
                 fs: Optional[fsspec.AbstractFileSystem] = None,
                 # AWS/S3 authentication
                 aws_profile: Optional[str] = None,
                 # GCP/GCS authentication
                 gcs_token: Optional[Union[str, Path]] = None,
                 gcs_project: Optional[str] = None,
                 # Azure authentication
                 azure_account_name: Optional[str] = None,
                 azure_account_key: Optional[str] = None,
                 azure_connection_string: Optional[str] = None): ...

    def commit(self, message: str, add_files: List[Union[str, Path]] = None,
               remove_files: List[str] = None) -> str: ...
    def checkout(self, commit_hash: Optional[str] = None) -> None: ...
    def get_file(self, name: str) -> Optional[File]: ...
    def list_files(self) -> List[str]: ...
    def has_file(self, name: str) -> bool: ...
    def read_file(self, name: str, mode: str = "r") -> Union[str, bytes]: ...
    def download_file(self, name: str, target_path: Union[str, Path]) -> str: ...
    def open_file(self, name: str, mode: str = "rb") -> Union[BinaryIO, TextIO]: ...
    def local_files(self): ...  # Context manager for local file access
    def history(self, limit: Optional[int] = None) -> List[Commit]: ...
    def get_commit(self, commit_hash: str) -> Optional[Commit]: ...
    def get_commits(self) -> List[Commit]: ...
    def is_empty(self) -> bool: ...
    def cleanup_orphaned_files(self) -> int: ...
    def get_info(self) -> dict: ...
    def to_dict(self) -> dict: ...

    # Properties
    @property
    def current_commit(self) -> Optional[Commit]: ...
    @property
    def head(self) -> Optional[Commit]: ...  # Alias for current_commit
    @property
    def files(self) -> Dict[str, File]: ...  # Files from current commit
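A short end-to-end sketch of the Dataset workflow with a local root directory. The import path, root directory, dataset name, and file names are assumptions for illustration only:

from kirin import Dataset   # assumed import path

ds = Dataset(root_dir="/tmp/kirin-root", name="my_dataset")
commit_hash = ds.commit("Initial commit", add_files=["data.csv"])
print(ds.list_files())                 # ['data.csv']
print(ds.read_file("data.csv")[:80])   # first characters of the file
ds.checkout(commit_hash)               # pin the working view to a specific commit
for c in ds.history():
    print(c.hash[:8], c.message)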
Catalog Entity
@dataclass
class Catalog:
    """Represents a collection of datasets."""

    root_dir: Union[str, fsspec.AbstractFileSystem]
    fs: Optional[fsspec.AbstractFileSystem] = None
    # AWS/S3 authentication
    aws_profile: Optional[str] = None
    # GCP/GCS authentication
    gcs_token: Optional[Union[str, Path]] = None
    gcs_project: Optional[str] = None
    # Azure authentication
    azure_account_name: Optional[str] = None
    azure_account_key: Optional[str] = None
    azure_connection_string: Optional[str] = None

    def datasets(self) -> List[str]: ...  # List dataset names
    def get_dataset(self, dataset_name: str) -> Dataset: ...  # Get existing dataset
    def create_dataset(self, dataset_name: str,
                       description: str = "") -> Dataset: ...  # Create new dataset
    def __len__(self) -> int: ...  # Number of datasets
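Catalog usage follows the same pattern. The import path and names below are assumptions for illustration:

from kirin import Catalog   # assumed import path

catalog = Catalog(root_dir="/tmp/kirin-root")
ds = catalog.create_dataset("my_dataset", description="Example dataset")
print(catalog.datasets())   # ['my_dataset']
print(len(catalog))         # 1
same_ds = catalog.get_dataset("my_dataset")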
Content Addressing
Hash Calculation
Files are hashed using SHA256 directly on content bytes:
import hashlib
from pathlib import Path

def calculate_hash(content: bytes) -> str:
    """Calculate SHA256 hash of content bytes."""
    return hashlib.sha256(content).hexdigest()

# Example usage in storage
def store_file(file_path: Path) -> str:
    with open(file_path, "rb") as f:
        content = f.read()
    return hashlib.sha256(content).hexdigest()
Deduplication
Identical content is stored only once:
# Two files with identical content
file1_content = b"Hello, World!"
file2_content = b"Hello, World!"

# Both files get the same hash
hash1 = hashlib.sha256(file1_content).hexdigest()
hash2 = hashlib.sha256(file2_content).hexdigest()
assert hash1 == hash2  # Same hash = same storage location
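In practice, deduplication falls out of the path scheme: before writing, the store can check whether the hash's path already exists. A sketch under that assumption (store_if_absent is an illustrative helper, not Kirin API):

import hashlib

def store_if_absent(fs, content: bytes) -> str:
    """Write content only if it is not already present in the content store."""
    hash_value = hashlib.sha256(content).hexdigest()
    storage_path = f"data/{hash_value[:2]}/{hash_value[2:]}"
    if not fs.exists(storage_path):          # identical content already stored?
        fs.write_bytes(storage_path, content)
    return hash_value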
Content Integrity
Any change to file content changes the hash:
# Original content
content1 = b"Hello, World!"
hash1 = hashlib.sha256(content1).hexdigest()

# Modified content (even a single character change)
content2 = b"Hello, World?"
hash2 = hashlib.sha256(content2).hexdigest()
assert hash1 != hash2  # Different hash = different storage location
Commit Hash Generation
Commit hashes are generated from the file hashes, commit message, parent hash, and timestamp:
def generate_commit_hash(files: Dict[str, File], message: str,
                         parent_hash: Optional[str],
                         timestamp: datetime) -> str:
    """Generate commit hash from file hashes, message, and timestamp."""
    import hashlib

    # Sort file hashes for consistency
    file_hashes = sorted(file.hash for file in files.values())
    parent_hash = parent_hash or ""

    # Combine all components
    content = (
        "\n".join(file_hashes) + "\n" +
        message + "\n" +
        parent_hash + "\n" +
        str(timestamp)
    )

    # Generate hash
    hasher = hashlib.sha256()
    hasher.update(content.encode("utf-8"))
    return hasher.hexdigest()
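For example, with the File entity defined above (all values here are illustrative):

from datetime import datetime

files = {"data.csv": File(hash="def456", name="data.csv", size=1024,
                          content_type="text/csv")}
commit_hash = generate_commit_hash(
    files=files,
    message="Initial commit",
    parent_hash=None,
    timestamp=datetime(2024, 1, 1, 12, 0, 0),
)
print(commit_hash)  # deterministic: the same inputs always produce the same hash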
Backend Integration
FSSpec Backends
Kirin supports any fsspec backend:
import fsspec

# Local filesystem
fs = fsspec.filesystem("file")

# S3
fs = fsspec.filesystem("s3", profile="my-profile")

# GCS
fs = fsspec.filesystem("gcs", token="/path/to/key.json")

# Azure
fs = fsspec.filesystem("az", connection_string="...")
Storage Operations
import hashlib
from pathlib import Path

# Store file content
def store_file(fs, file_path: Path) -> str:
    """Store file and return content hash."""
    with open(file_path, "rb") as f:
        content = f.read()
    hash_value = hashlib.sha256(content).hexdigest()
    storage_path = f"data/{hash_value[:2]}/{hash_value[2:]}"
    fs.write_bytes(storage_path, content)
    return hash_value

# Retrieve file content
def retrieve_file(fs, hash_value: str) -> bytes:
    """Retrieve file content by hash."""
    storage_path = f"data/{hash_value[:2]}/{hash_value[2:]}"
    return fs.read_bytes(storage_path)
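A round-trip usage sketch for the two helpers above with a local filesystem (the file path is illustrative):

import hashlib
import fsspec
from pathlib import Path

fs = fsspec.filesystem("file")
hash_value = store_file(fs, Path("data.csv"))
content = retrieve_file(fs, hash_value)
assert hashlib.sha256(content).hexdigest() == hash_value  # round-trip integrity check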
Performance Considerations
Zero-Copy Operations
Kirin is designed around a zero-copy philosophy:
- Reference-based operations: Use File objects as references instead of copying content
- Lazy loading: File content is only downloaded when accessed
- Deduplication: Identical content is stored only once regardless of filename
Caching
Kirin implements commit-level caching for improved performance:
from typing import Dict

# Commit objects are cached in memory
class CommitStore:
    def __init__(self):
        self._commits_cache: Dict[str, Commit] = {}

    def get_commit(self, commit_hash: str) -> Commit:
        # Try cache first
        if commit_hash in self._commits_cache:
            return self._commits_cache[commit_hash]
        # Load from storage if not cached
        # ...
Lazy Loading
Content is loaded only when it is needed:
# Lazy file loading
class LazyFile:
    def __init__(self, fs, hash_value: str, name: str):
        self.fs = fs
        self.hash_value = hash_value
        self.name = name
        self._content = None

    def read_bytes(self) -> bytes:
        if self._content is None:
            self._content = retrieve_file(self.fs, self.hash_value)
        return self._content
Migration and Backup
Backup Strategies
# Backup content store
rsync -av data/ backup/data/
# Backup dataset metadata
rsync -av datasets/ backup/datasets/
Migration Between Backends
import os
import fsspec

# Migrate from local to S3
local_fs = fsspec.filesystem("file")
s3_fs = fsspec.filesystem("s3", profile="my-profile")

# Copy content store
for root, dirs, files in os.walk("data"):
    for file in files:
        local_path = os.path.join(root, file)
        s3_path = f"s3://my-bucket/{local_path}"
        s3_fs.put(local_path, s3_path)
Security Considerations
Access Control
- File permissions: Respect filesystem permissions
- Cloud IAM: Use appropriate cloud permissions
- Encryption: Support for encrypted storage backends
Data Integrity
- Hash verification: Verify content hashes on retrieval
- Tamper detection: Detect any content changes
- Audit trails: Track all storage operations
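Hash verification on retrieval is straightforward with content addressing. A minimal sketch (verify_and_read is an illustrative helper, not Kirin API):

import hashlib

def verify_and_read(fs, hash_value: str) -> bytes:
    """Read content by hash and fail loudly if it has been tampered with."""
    storage_path = f"data/{hash_value[:2]}/{hash_value[2:]}"
    content = fs.read_bytes(storage_path)
    if hashlib.sha256(content).hexdigest() != hash_value:
        raise ValueError(f"Content hash mismatch for {storage_path}")
    return content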
Next Steps
- API Reference - Complete API documentation
- Architecture Overview - System architecture