AcornDiskWorker Storage

Filesystem-backed content-addressable storage with Bloom filter fast rejection, POSIX I/O, SHA-256 integrity verification, and state persistence.

.package(url: "https://github.com/treehauslabs/AcornDiskWorker.git", from: "1.0.0")

DiskCASWorker

An actor conforming to AcornCASWorker.

Generic over FileSystemProvider for testability. Use the convenience initializer for production (defaults to DefaultFileSystem).

public actor DiskCASWorker<F: FileSystemProvider>: AcornCASWorker {
    public init(
        directory: URL,
        capacity: Int? = nil,
        maxBytes: Int? = nil,
        halfLife: Duration = .seconds(300),
        sampleSize: Int = 5,
        timeout: Duration? = nil,
        verifyReads: Bool = true
    ) throws
}

Parameters

| Parameter | Type | Description |
|---|---|---|
| directory | URL | Root directory for the CAS store. 256 shard subdirectories (00–ff) are created automatically. |
| capacity | Int? | Maximum number of entries before LFU eviction. nil for unbounded. |
| maxBytes | Int? | Maximum total bytes on disk. nil for unbounded. |
| halfLife | Duration | LFU score decay half-life. Default 300 seconds. |
| sampleSize | Int | Eviction candidate sample size. Default 5. |
| verifyReads | Bool | Verify data with SHA-256 on every read. Corrupted files are auto-deleted. Default true. |
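The halfLife and sampleSize parameters suggest an exponentially decayed access score combined with sampled eviction. The source does not give the exact formula, so the following is only a minimal sketch under that assumption. DecayingScore and pickVictim are hypothetical names and are not part of the package.

```swift
import Foundation

// Hypothetical sketch: an access score that halves every `halfLife` seconds.
struct DecayingScore {
    var value: Double = 0
    var lastUpdate: Double = 0   // seconds on any monotonic clock

    // Score as seen at time `now`, after exponential decay.
    func decayed(at now: Double, halfLife: Double) -> Double {
        let elapsed = max(0, now - lastUpdate)
        return value * pow(0.5, elapsed / halfLife)
    }

    // Record one access: decay the old score, then add 1.
    mutating func touch(at now: Double, halfLife: Double) {
        value = decayed(at: now, halfLife: halfLife) + 1
        lastUpdate = now
    }
}

// Sampled eviction: look at up to `sampleSize` random entries and
// return the one with the lowest decayed score.
func pickVictim(from scores: [String: DecayingScore],
                sampleSize: Int,
                now: Double,
                halfLife: Double) -> String? {
    let sample = scores.keys.shuffled().prefix(sampleSize)
    return sample.min(by: { a, b in
        scores[a]!.decayed(at: now, halfLife: halfLife)
            < scores[b]!.decayed(at: now, halfLife: halfLife)
    })
}
```

Sampling a handful of candidates instead of sorting every entry keeps eviction cost independent of store size, at the price of occasionally evicting an entry that is not the globally coldest.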

Directory Layout

Files are sharded by the first two hex characters of the CID into 256 subdirectories:

<directory>/
├── 00/
│   └── 00a3f7b2...64chars...
├── 01/
│   └── 01e8c4d1...64chars...
├── ...
├── ff/
│   └── ff29b10c...64chars...
├── .bloom    ← serialized Bloom filter
└── .sizes    ← JSON map of CID → byte size
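The sharding rule above can be sketched in a few lines. This assumes CIDs are lowercase hex strings; shardNames and shardedPath are hypothetical helpers for illustration, not the package's API.

```swift
// Hypothetical helper: the 256 shard directory names, "00" through "ff".
let shardNames: [String] = (0..<256).map { i in
    let hex = String(i, radix: 16)
    return hex.count == 1 ? "0" + hex : hex
}

// Hypothetical helper: map a hex CID to its on-disk path.
// The first two characters of the CID select the shard directory.
func shardedPath(root: String, cid: String) -> String {
    let shard = String(cid.prefix(2))
    return "\(root)/\(shard)/\(cid)"
}
```

Spreading files over 256 directories keeps any single directory listing small, even when the store holds a large number of entries.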

Methods

| Method | Behavior |
|---|---|
| has(cid:) | Bloom filter check first (~80ns for definite miss). Falls back to access() syscall on Bloom "maybe". |
| getLocal(cid:) async | Bloom filter → read file → optional SHA-256 verify → return data. Auto-deletes corrupted files. |
| storeLocal(cid:data:) async | Write to temp file → atomic rename (POSIX). Triggers LFU eviction if needed. Updates Bloom filter. |
| delete(cid:) | Remove file from disk. Update cache, size tracking, and metrics. |
| persistState() | Serialize Bloom filter to .bloom and item sizes to .sizes. Enables fast restart. |

Properties

| Property | Type | Description |
|---|---|---|
| metrics | CASMetrics | Hits, misses, stores, evictions, deletions, corruption detections |
| totalBytes | Int | Running total of all stored data on disk |

Bloom Filter

DiskCASWorker uses a Bloom filter to avoid unnecessary filesystem calls. For CIDs that definitely don't exist on disk, the Bloom filter returns false in ~80 nanoseconds — avoiding a ~100µs disk seek.

Bloom filter false positives
A Bloom filter can say "maybe" when a CID doesn't exist, but never says "no" when it does. False positives cause an unnecessary access() syscall, but never data loss.
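To make the "maybe / definitely not" behavior concrete, here is a minimal Bloom filter sketch. The bit-array size, the number of hashes, and the FNV-1a double-hashing scheme are illustrative assumptions. The source does not say which hash functions or parameters DiskCASWorker uses.

```swift
// Minimal Bloom filter sketch (hypothetical; not the package's implementation).
struct BloomFilter {
    private var bits: [UInt64]
    private let bitCount: Int
    private let hashCount: Int

    init(bitCount: Int, hashCount: Int) {
        self.bitCount = bitCount
        self.hashCount = hashCount
        self.bits = Array(repeating: 0, count: (bitCount + 63) / 64)
    }

    // FNV-1a over the UTF-8 bytes, with a seed mixed into the offset basis.
    private func fnv1a(_ s: String, seed: UInt64) -> UInt64 {
        var h: UInt64 = 0xcbf29ce484222325 ^ seed
        for byte in s.utf8 {
            h ^= UInt64(byte)
            h = h &* 0x100000001b3
        }
        return h
    }

    // Double hashing: index_i = (h1 + i * h2) mod bitCount.
    private func indices(for key: String) -> [Int] {
        let h1 = fnv1a(key, seed: 0)
        let h2 = fnv1a(key, seed: 0x9e3779b97f4a7c15) | 1
        return (0..<hashCount).map { i in
            Int((h1 &+ UInt64(i) &* h2) % UInt64(bitCount))
        }
    }

    mutating func insert(_ key: String) {
        for i in indices(for: key) {
            bits[i / 64] |= UInt64(1) << UInt64(i % 64)
        }
    }

    // false means "definitely absent"; true means "maybe present".
    func mightContain(_ key: String) -> Bool {
        indices(for: key).allSatisfy { i in
            (bits[i / 64] & (UInt64(1) << UInt64(i % 64))) != 0
        }
    }
}
```

Because insert only ever sets bits, a key that was inserted always tests as "maybe present". A key that was never inserted can also test as "maybe present" if other keys happened to set the same bits, which is the false-positive case described above.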

Integrity Verification

When verifyReads is true (the default), every getLocal() recomputes the SHA-256 hash of the file contents and compares it to the CID. If they don't match:

  1. The corrupted file is deleted from disk
  2. metrics.corruptionDetections is incremented
  3. The method returns nil (as if the data doesn't exist)

This catches bit rot, incomplete writes, and filesystem corruption automatically.
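The three steps above can be sketched with an in-memory stand-in for the disk. VerifyingStore is a hypothetical type, and the digest function is injected so the sketch runs without a crypto dependency. A real store would pass a SHA-256 hex digest; on Apple platforms, CryptoKit's SHA256.hash(data:) is one way to compute it.

```swift
import Foundation

// Hypothetical stand-in for the verified read path; not the package's code.
struct VerifyingStore {
    var files: [String: Data] = [:]    // cid -> bytes, standing in for the disk
    var corruptionDetections = 0
    let digest: (Data) -> String       // assumed to return the CID for valid data

    mutating func get(cid: String) -> Data? {
        guard let data = files[cid] else { return nil }
        if digest(data) != cid {
            files[cid] = nil               // 1. delete the corrupted file
            corruptionDetections += 1      // 2. count the detection
            return nil                     // 3. report it as missing
        }
        return data
    }
}
```

Returning nil instead of throwing lets callers treat corruption like any other miss and fetch the content again from another source.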

Atomic Writes

Stores use the POSIX temp-file + rename pattern:

  1. Write data to a temporary file in the shard directory
  2. Call rename() to atomically move it to the final path

Because rename() is atomic within a single filesystem, getLocal() sees either the complete file or no file at all, never a partially written one, even if the process crashes mid-write.
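A minimal sketch of the temp-file + rename pattern follows. atomicWrite and AtomicWriteError are hypothetical names, and the temp-file naming scheme is an assumption. The sketch also omits fsync(), which a durable implementation may need.

```swift
import Foundation

enum AtomicWriteError: Error {
    case renameFailed(errno: Int32)
}

// Hypothetical helper illustrating temp-file + rename; not the package's code.
func atomicWrite(_ data: Data, to finalPath: String) throws {
    // The temp file sits next to the target so rename() stays on one filesystem.
    let tmpPath = finalPath + ".tmp-" + UUID().uuidString
    try data.write(to: URL(fileURLWithPath: tmpPath))

    // rename() atomically replaces the destination; readers never see a partial file.
    if rename(tmpPath, finalPath) != 0 {
        let code = errno
        try? FileManager.default.removeItem(atPath: tmpPath)
        throw AtomicWriteError.renameFailed(errno: code)
    }
}
```

Keeping the temporary file in the same directory matters: rename() across filesystems fails with EXDEV instead of moving the file.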

FileSystemProvider Protocol

DiskCASWorker is generic over filesystem implementation for testability:

public protocol FileSystemProvider: Sendable {
    func fileExists(atPath: String) -> Bool
    func createDirectory(atPath: String) throws
    func contentsOfFile(atPath: String) throws -> Data
    func writeFile(_ data: Data, toPath: String) throws
    func removeItem(atPath: String) throws
    func contentsOfDirectory(atPath: String) throws -> [String]
    func fileSize(atPath: String) -> Int?
}
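For tests, the protocol can be satisfied by an in-memory double. The sketch below repeats the protocol (without the public modifier) so that it compiles on its own. InMemoryFileSystem and NoSuchFile are hypothetical names, and contentsOfDirectory lists only files, not nested directories, which is a simplification.

```swift
import Foundation

// Protocol repeated from the listing above so the sketch is self-contained.
protocol FileSystemProvider: Sendable {
    func fileExists(atPath: String) -> Bool
    func createDirectory(atPath: String) throws
    func contentsOfFile(atPath: String) throws -> Data
    func writeFile(_ data: Data, toPath: String) throws
    func removeItem(atPath: String) throws
    func contentsOfDirectory(atPath: String) throws -> [String]
    func fileSize(atPath: String) -> Int?
}

struct NoSuchFile: Error {}

// Hypothetical in-memory test double; a lock guards the mutable state.
final class InMemoryFileSystem: FileSystemProvider, @unchecked Sendable {
    private let lock = NSLock()
    private var files: [String: Data] = [:]
    private var directories: Set<String> = []

    func fileExists(atPath path: String) -> Bool {
        lock.lock(); defer { lock.unlock() }
        return files[path] != nil || directories.contains(path)
    }

    func createDirectory(atPath path: String) throws {
        lock.lock(); defer { lock.unlock() }
        directories.insert(path)
    }

    func contentsOfFile(atPath path: String) throws -> Data {
        lock.lock(); defer { lock.unlock() }
        guard let data = files[path] else { throw NoSuchFile() }
        return data
    }

    func writeFile(_ data: Data, toPath path: String) throws {
        lock.lock(); defer { lock.unlock() }
        files[path] = data
    }

    func removeItem(atPath path: String) throws {
        lock.lock(); defer { lock.unlock() }
        guard files.removeValue(forKey: path) != nil else { throw NoSuchFile() }
    }

    func contentsOfDirectory(atPath path: String) throws -> [String] {
        lock.lock(); defer { lock.unlock() }
        let prefix = path.hasSuffix("/") ? path : path + "/"
        return files.keys
            .filter { $0.hasPrefix(prefix) }
            .map { String($0.dropFirst(prefix.count)) }
            .filter { !$0.contains("/") }   // direct children only
            .sorted()
    }

    func fileSize(atPath path: String) -> Int? {
        lock.lock(); defer { lock.unlock() }
        return files[path]?.count
    }
}
```

A double like this lets tests exercise eviction, corruption handling, and restart logic without touching the real disk.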

DefaultFileSystem

The production implementation uses raw POSIX syscalls (open, read, write, rename, unlink, stat, access) for maximum performance, bypassing Foundation's file I/O overhead.

State Persistence

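persistState() writes two files to the cache directory:

  1. .bloom, the serialized Bloom filter
  2. .sizes, a JSON map of CID → byte size

If these files exist on init, the worker restores its state from them with two file reads instead of walking the store. If they are missing, it falls back to scanning all shard directories.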
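As an illustration of the .sizes half of this round trip, the sketch below encodes a CID-to-size dictionary as JSON and reads it back. The exact JSON schema is not shown in the source, so a flat [String: Int] object is an assumption. saveSizes and loadSizes are hypothetical helpers.

```swift
import Foundation

// Hypothetical sketch: persist the CID -> byte size map as a flat JSON object.
func saveSizes(_ sizes: [String: Int], toDirectory dir: String) throws {
    let data = try JSONEncoder().encode(sizes)
    try data.write(to: URL(fileURLWithPath: dir + "/.sizes"), options: .atomic)
}

// Returns nil when the file is missing or unreadable, which signals the
// caller to fall back to scanning the shard directories.
func loadSizes(fromDirectory dir: String) -> [String: Int]? {
    guard let data = FileManager.default.contents(atPath: dir + "/.sizes") else {
        return nil
    }
    return try? JSONDecoder().decode([String: Int].self, from: data)
}
```

With the map loaded, a running byte total can be recovered as the sum of its values without stat-ing any file.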

CASMetrics

public struct CASMetrics: Sendable, Equatable {
    public var hits: Int
    public var misses: Int
    public var stores: Int
    public var evictions: Int
    public var deletions: Int
    public var corruptionDetections: Int
}

DiskWorker's CASMetrics includes an additional corruptionDetections field not present in MemoryWorker's version.