Blobs
Iroh works with content-addressed blobs of opaque data, which are often the bytes of a file.
A blob is a sequence of bytes. Those bytes can be a JPEG image, a text file, a video, or anything else that is a fixed, finite sequence of bytes.
Every blob in iroh is referred to by the BLAKE3 [1] hash of its content. BLAKE3 is a tree hashing algorithm that splits its input into uniform chunks and arranges them as the leaves of a binary tree, producing intermediate chunk hashes that accumulate up to a root hash. Iroh uses the 32-byte root hash (or just “hash”) as an immutable blob identifier.
Iroh leverages the tree hash structure of BLAKE3 as a basis for incremental verification when data is sent over the network, as described by Section 6.4 “Verified Streaming” in [1] and implemented in [2]. Iroh caches all chunk hashes as external metadata, leaving the unaltered input blob as the canonical source of bytes. Verified streaming also facilitates range requests: fetching a verifiable contiguous subsequence of a blob by streaming only the portions of the BLAKE3 binary tree required to verify the designated subsequence.
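As a rough illustration of the range-request idea, the sketch below works out which 1 KiB BLAKE3 chunks overlap a requested byte range. The function name chunks_for_range is hypothetical and not part of iroh's API. A real verified stream would also carry the parent hashes needed to check those chunks against the root hash, which this sketch leaves out.

```python
CHUNK_SIZE = 1024  # BLAKE3 chunk size in bytes


def chunks_for_range(offset: int, length: int, blob_size: int) -> range:
    """Return the indices of the chunks overlapping [offset, offset + length).

    Hypothetical helper for illustration only. The range is clamped to the
    blob size, and an empty range is returned when nothing overlaps.
    """
    if offset < 0 or length < 0 or blob_size < 0:
        raise ValueError("offset, length and blob_size must be non-negative")
    end = min(offset + length, blob_size)
    if offset >= end:
        return range(0)
    first = offset // CHUNK_SIZE
    last = (end - 1) // CHUNK_SIZE
    return range(first, last + 1)
```

A request for 100 bytes starting at offset 1000 straddles a chunk boundary, so it needs chunks 0 and 1 even though it asks for far less than one chunk of data.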
Chunk hashes are distinct from root hashes and are only used during data transfer. The chunk group size of BAO is a tunable constant that defaults to 1 KiB, which results in roughly 6% overhead relative to the size of the input blob. Increasing the chunk group size reduces overhead, at the cost of requiring more data transfer before an incremental verification checkpoint is reached. The chunk group size can be changed, and the chunk hashes recalculated, without affecting the root hash. This opens the door to experimenting with different chunk group sizes, even at runtime. We intend to investigate chunk size optimization in future work.
Root hashes are expressed as a Content Identifier (CID) as defined in [3], making iroh an IPFS system capable of interoperating with other systems that use CIDs. In contrast to other IPFS systems, only root hashes are valid content identifiers, which enforces a strict one-to-one relationship between a Content Identifier and a blob. This one-to-one relationship brings iroh into alignment with common whole-file checksum systems. A naive implementation of iroh can skip verified streaming entirely and use the CID as a whole-file checksum.
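The overhead figure can be checked with some arithmetic. The sketch below assumes the outboard layout described in the bao specification [2]: an 8-byte length header followed by one 64-byte parent node (two 32-byte child hashes) for every internal node of the tree, and a binary tree with n leaves has n - 1 internal nodes. The function names are hypothetical, and the numbers are an estimate rather than a measurement of iroh itself.

```python
HASH_SIZE = 32               # bytes per BLAKE3 hash
PARENT_SIZE = 2 * HASH_SIZE  # a parent node stores its two child hashes
HEADER_SIZE = 8              # assumed length header, per the bao layout


def outboard_size(blob_size: int, group_size: int = 1024) -> int:
    """Estimate the bytes of hash metadata for a blob of blob_size bytes."""
    # Ceiling division; an empty blob still counts as a single leaf.
    groups = max(1, -(-blob_size // group_size))
    return HEADER_SIZE + PARENT_SIZE * (groups - 1)


def overhead_ratio(blob_size: int, group_size: int = 1024) -> float:
    """Metadata size as a fraction of the blob size."""
    if blob_size <= 0:
        raise ValueError("blob_size must be positive")
    return outboard_size(blob_size, group_size) / blob_size
```

Under these assumptions a 1 MiB blob with 1 KiB groups needs 8 + 64 * 1023 = 65,480 bytes of metadata, about 6.2% of the blob. With 16 KiB groups the same blob needs 4,040 bytes, under 0.4%, but a receiver must then accept 16 KiB before it can verify anything.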
Collections
A collection is an ordered list of blob hashes.
Iroh uses collections as immutable, ordered lists of blobs. Collections themselves are blobs, serialized as a hash sequence of one hash after another, with no separators or headers. Because all hashes in iroh are 32-byte BLAKE3 hashes, the byte length of a collection will always be a multiple of 32.
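Because the format is nothing more than concatenated 32-byte hashes, reading and writing it takes only a few lines. The following is a minimal sketch, not iroh's implementation; the function names are hypothetical, and the example hashes in the usage note are placeholder bytes rather than real BLAKE3 digests.

```python
from typing import List

HASH_SIZE = 32  # every hash in a hash sequence is a 32-byte BLAKE3 hash


def serialize_hash_sequence(hashes: List[bytes]) -> bytes:
    """Concatenate hashes back to back, with no separators or headers."""
    for h in hashes:
        if len(h) != HASH_SIZE:
            raise ValueError("every hash must be exactly 32 bytes")
    return b"".join(hashes)


def parse_hash_sequence(data: bytes) -> List[bytes]:
    """Split a serialized hash sequence into its 32-byte hashes."""
    if len(data) % HASH_SIZE != 0:
        raise ValueError("length must be a multiple of 32 bytes")
    return [data[i:i + HASH_SIZE] for i in range(0, len(data), HASH_SIZE)]
```

For instance, serializing three placeholder hashes yields a 96-byte blob, and parsing that blob returns the same three hashes in the same order. The n-th hash always lives at byte offset 32 * n, which is what makes seeking into a large collection cheap.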
A collection can hold anywhere from zero to billions of links. Collections are true lists, and should not be nested to form graphs. While it's totally possible to put the hash of a collection within a collection, iroh's internal garbage collector, which keeps track of which blobs can be deleted, does not follow such nested references, and will prune away data that isn't explicitly known to it.
Collections and documents can both be used to group blobs together. The core difference is that collections are immutable, while documents are mutable.
Collection Metadata
Formally, the serialized list of hashes stored as a blob is a hash sequence, which has no metadata. Iroh defines a collection as a hash sequence whose first element points to a CollectionV0 metadata blob. The metadata blob always starts with the UTF-8 string CollectionV0, followed by a list of strings that are the names of the links in the collection. For example, the metadata for a collection with the links foo, bar, and baz would look like this:
"CollectionV0"
["foo", "bar", "baz", ...]
The length of the list must match the number of elements in the hash sequence (minus the metadata element). The metadata blob is always the first element of the hash sequence. Iroh can issue sparse requests to determine the byte length of each blob in the collection, which combine with the names to give a baseline metadata of "file names" and sizes. This design has a few nice advantages:
- There is nothing left to remove from the definition of a hash sequence. We consider this specification finished.
- The "metadata as first element" is a convention. It's completely acceptable to build a custom collection definition that includes different metadata. Iroh will still understand how to transfer & seek into the hash sequence. This opens the door to building efficient, specialized, & immutable compound data structures on iroh.
- BLAKE3 chunks data along 1 KiB boundaries and each hash is 32 bytes, which means a full BLAKE3 incremental verification block of a hash sequence always holds exactly 32 hashes.
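The length rule for collection metadata can be sketched as a small check. The example assumes the list of names has already been decoded from the metadata blob, since the exact byte encoding of that blob is not covered here. The function name pair_names_with_hashes is hypothetical, and the hashes in the usage note are placeholder bytes.

```python
from typing import List, Tuple


def pair_names_with_hashes(
    names: List[str], hash_sequence: List[bytes]
) -> Tuple[bytes, List[Tuple[str, bytes]]]:
    """Match each link name to its blob hash.

    hash_sequence[0] is the metadata blob's hash; the remaining entries
    are the links, in the same order as names.
    """
    if not hash_sequence:
        raise ValueError("a collection needs at least the metadata element")
    metadata_hash, links = hash_sequence[0], hash_sequence[1:]
    if len(names) != len(links):
        raise ValueError(
            f"{len(names)} names but {len(links)} links in the hash sequence"
        )
    return metadata_hash, list(zip(names, links))
```

With the names foo, bar, and baz, the hash sequence must contain four hashes: one for the metadata blob followed by one per name. Anything else is rejected as malformed.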
When to use a collection
Collections are the right tool to reach for when you need a "snapshot" of a set of blobs. For example, a collection is a good way to represent a directory of files. Collections are not the right tool if the data you're working with is changing. For mutable, named grouping, use documents.
References
- blake3
https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blake3.pdf
- bao
https://github.com/oconnor663/bao
- CID (Content IDentifier): Self-describing content-addressed identifiers for distributed systems
https://github.com/multiformats/cid
- multihashes
https://multiformats.io/multihash/