Blog Index

A richer tags api

by Rüdiger Klaehn

We are working on a big refactor of the iroh-blobs code base to better support multiprovider downloads, and to provide a long-term stable API for blobs 1.0.

Before that lands, however, we've shipped a richer API for manipulating tags! This seems like a good opportunity to explain again what tags are and why they are useful.

So what is a tag, anyway?

A tag is a name you can give to content to mark it as important.

The tag name can be any byte string. Usually, you will use an UTF-8 string. But there are use cases where it is very useful to store arbitrary bytes.

The tag value is a tuple consisting of the BLAKE3 hash of the data, and a blob format which can currently only be Raw or HashSeq.

There are two main purposes of tags.

First of all, they make it much easier to sort and retrieve content. You don't have to store a mapping from name to hash in some external database, but can assign a friendly name to your data.

Second, tagging data is the only thing that prevents it from being freed by the blob store garbage collection. Any blob that isn't directly or indirectly tagged will sooner or later be deleted!

Tagging a hash expresses that if you had that data, you would like to keep it. It does not matter if you have it yet. E.g. you first tag the hash of a giant wikipedia snapshot, then browse that snapshot. You will keep all the ranges of content you download while browsing.

Hash sequences

Let's say you have a large number of files that belong together — a folder of pictures from a family vacation, for example. You could of course tag them all individually. An alternative is to put all the BLAKE3 hashes into a blob (the blob will have a size that is a multiple of 32 bytes) and tag just that blob with the blob format HashSeq. All the pictures are now indirectly tagged via the hash sequence and therefore protected.

Again, there is a similarity to IPFS. This is similar to recursively tagging a Dag-CBOR CID. But in iroh-blobs we only support one level of indirection, not arbitrary DAGs. If you have to store DAGs, we have a way to do this.

But we have found that in most cases a single level of indirection is enough, and we like to keep things simple. Below is a section about how to deal with metadata in hash sequences.

Tagging API

Tags are being added automatically when you add content locally or download content from a remote node. So the previous API for tags is pretty minimal. You just have the ability to list all tags and delete individual tags. While this is fine for simple use cases where you have a small number of tags to e.g. manage some manually added data, it is insufficient for more complex use cases.

So the current API has been extended to give the full capability of a key-value store for tags. You can get the value of individual tags, list them by range or prefix, and even bulk delete them by range or prefix. In addition, we added the ability to atomically rename a tag.

Why do you need such a complex API? Once you start working with lists of tags, this becomes more obvious. As soon as you have a very large number of tags, list-by-range allows you to page through the tags instead of getting them all. List-by-prefix allows you to organize the tags in something like a directory structure.

But what about bulk deletion?

Use case: expiring tags

Let's say you've got some sort of storage service. Customers can upload data and it will be automatically assigned a tag. You want to delete data for non-paying customers after some time and only preserve data for paying customers. With expiring tags this would be easy. Data gets an expiring tag on upload, and you assign an additional non-expiring tag on payment.

Expiring tags are not natively supported, but with the new API they can be easily implemented. You just encode the expiry date in the tag, either as a human readable form that properly sorts, such as ISO8601 UTC timestamps, or in a more compact form as the number of seconds since the UNIX epoch, as a big endian encoded 64 bit integer.

So an expiring tag would use a prefix to distinguish it from non-expiring tags, and the encoded expiry time, for example

expiring-<ISO8601 date>

If you want to store additional information such as the customer id, you can just add that to the tag as well.

expiring-<ISO8601 date>-<customer id>

Then you have a small task somewhere, possibly triggered by a cron job or some application timer, that just deletes all expiring tags up to the current date. This can be done either by listing expiring tags and deleting them one by one to have more control about what is deleted, or just by doing a bulk delete from expiring- to expiring-<current time>.

Use case: binary data in tags

Let's say you store data for entities that are identified by a 32 byte iroh node id and a 64 bit integer entity id, and you want to easily access data by node id and entity id. Possibly you also want to query by entity id range.

If tags were limited to UTF8 strings, you would have to encode the node id and process id to an UTF8 friendly format like hex. So a tag would look like this:

<hex encoded node id (64 bytes)>-<hex encoded big-endian u64 (16 bytes)>

To a javascript developer this feels very natural, but it is inefficient (tags are bigger than they have to be) and requires additional code. With our tags you don't have to hex encode anything. A tag simply becomes

<node id (32 bytes)><big-endian u64 (8 bytes)>

Tags are half as long, and encoding (node_id, entity_id) tuples to tags becomes simpler and faster. In most use cases this won't matter. But iroh and iroh-blobs are meant to be used as embedded components for demanding applications, so sometimes it will matter. Also, we want to provide a friendly API to developers like myself that have a strong dislike for needlessly stringifying things.

Rust API

Currently, the tags API is available as an in-memory RPC client to the Blobs protocol handler. To get the API, you need to enable the rpc feature flag.

The API is done in the following way: for complex operations such as list or delete, there are functions list_with_opts and delete_with_opts that allow you to pass in an options struct that contains all the various options such as start and end tag.

    pub async fn list_with_opts(
        &self,
        options: ListOptions,
    ) -> Result<impl Stream<Item = Result<TagInfo>>> { ... }

In addition there are functions for common use cases such as get, list, list_range, list_prefix that internally use list_with_opts. These functions do not use fixed types but impl Into or impl AsRef<[u8]> for maximum convenience.

    pub async fn list_range<R, E>(&self, range: R) -> Result<impl Stream<Item = Result<TagInfo>>>
    where
        R: RangeBounds<E>,
        E: AsRef<[u8]>,
    { ... }

The advantage of this convenience fn is that you can now use any type that implements AsRef<[u8]> as range bounds, and you can use any kind of range (open, half-open, bounds included or excluded).

    let res = tags.list_range("a"..="z")?;

There is a small performance downside. E.g. if you already have a Tag object for start and end of a range, using the convenience API will lead to new allocations because fn list_range can not know that the bounds are already tags.

In 99.9% of all cases this is completely fine, but if you have a very demanding use case where you care about allocations, you can always use the list_with_opts fn directly.

   let from: Tag = ...;
   let to: Tag = ...;
   let res = tags.list_with_opts(ListOptions { from, to, raw: true, hash_seq: true })?;

Tags are usually very small, so the allocation won't be very measurable. This distinction between the convenience fns and the options API becomes more relevant when you deal with large in-memory blobs that you already have as bytes::Bytes.

Temp tags

Sometimes you have a situation where you perform a complex operation and want to only assign a permanent tag once the operation is complete. E.g. you are adding a large number of files from a local directory. You do not want to assign permanent tags yet, either for performance reasons or because they would point to incomplete data and you want to assign a permanent tag only once the operation is complete.

For such cases, iroh-blobs has anonymous temporary tags that only live in memory and are therefore extremely lightweight. A temp tag protects the content it points to until it goes out of scope and gets dropped.

Temp tags can be used to perform atomic changes. E.g. you got a tag family-pictures-2024 that points to a HashSeq containing a directory full of pictures. Now you add an updated directory containing additional family pictures. This directory might be large, so adding a tag for every item would be wasteful. In addition, you still want the family-pictures-2024 tag to point to the old pictures until the import is complete, in case there is a crash during the import.

Temp tags allow you to work with the data in process without having to care about tag names or having to touch the tags table. Then once you are done with the operation, you assign a single permanent tag and drop the temp tags.

How to deal with metadata in hash sequences

A hash sequence is just a sequence of BLAKE3 hashes. So what if you want to add some kind of metadata to the entries, like file names?

More complex formats such as DAG-CBOR allow you to freely mix data and hashes/links. But that is not without downsides. Every time you want to traverse a DAG-CBOR blob, you need to parse the DAG-CBOR haystack and look for CID needles. This is not a big problem for small blobs, but for iroh-blobs we treasure the ability to deal with arbitrary large blobs, and we want to extend this to blobs that contain links. For hash sequences, parsing is not needed, and you can easily access a link at an offset even without loading the entire hash sequence into memory.

But still, how do you deal with metadata? We have come up with a convention for this that works for many cases. We store the metadata as the first child of the HashSeq. E.g. to represent a directory in [sendme], the first child of the hash sequence contains a blob with a header followed by a sequence of file paths in a simple format — just the concatenated UTF-8 strings. A hash sequence with this kind of metadata is called a collection.

Depending on your use case, you might follow the convention of having the metadata in the first child, but have a different metadata format. Or you might come up with an entirely different convention. E.g. if you have a lot of metadata and require random access for the metadata, you might have 2 hashseq entries per content blob. Or you might have fixed size metadata in the metadata blob to allow random access for the metadata.

There is no one size fits all solution, but whatever you do, iroh-blobs does not have to know about any of this. It only knows raw blobs and hash sequences.

History of the iroh-blobs tagging concept

Most of the concepts in this blog post date back to my work on ipfs-embed in 2020. I did a talk about this at IPFS camp 2022. Compared to ipfs-embed, iroh-blobs simplifies things by focusing exclusively on BLAKE3 hashed blobs of arbitrary size, and taking away the ability to traverse arbitrary DAGs in favour of the extremely simple concept of hash sequences.

Iroh is a dial-any-device networking library that just works. Compose from an ecosystem of ready-made protocols to get the features you need, or go fully custom on a clean abstraction over dumb pipes. Iroh is open source, and already running in production on hundreds of thousands of devices.
To get started, take a look at our docs, dive directly into the code, or chat with us in our discord channel.