Blob store design challenges
by Rüdiger Klaehn

We've been hard at work with one of our protocols that you can pick up off the shelf with iroh: iroh-blobs. You might've used it before if you tried out sendme, which is just a small CLI wrapper over iroh with iroh-blobs.
Put briefly, this protocol helps you transfer arbitrary data ("blobs") from one device to another in a way that allows these transfers to be interrupted, resumed, and verified by content-addressing blobs using the BLAKE3 hashing function.
It also allows you to exchange the backing storage it uses via a blob store abstraction. The actual hashes and blobs might be stored as files on disk, stored fully in-memory or even fully remotely by utilizing some cloud storage.
BLAKE3 hashed data and BLAKE3 verified streaming/bao are fully deterministic, and the iroh-blobs protocol is implemented in this spirit.
Even for complex requests that involve multiple ranges or entire hash sequences, there is exactly one sequence of bytes that is the correct answer, and the requester will notice incorrect data after at most 16 KiB.
So why is designing a blob store for BLAKE3 hashed data so complex? What are the challenges?
Terminology
The job of a blob store is to store two pieces of data per BLAKE3 hash: the actual data itself and an outboard containing the flattened hash tree that connects each chunk of data to the root hash. Data and outboard are kept separate so that the data can be used as-is.

While the bao crate stores tree hashes all the way down to ≤ 1024 byte BLAKE3 chunks, we store the outboard only down to 16 KiB chunk groups to reduce the outboard size.
In memory store
In the simplest case, you store data in memory. If you want to serve complete blobs, you just need a map from hash to data and outboard, stored in a cheaply cloneable container like bytes::Bytes. Even if you want to also handle incomplete blobs, you use a map from hash and chunk number to a (≤ 1024 byte) data block, and that's it. Such a store is relatively easy to implement using a crate like bao-tree and a simple BTreeMap wrapped in a Mutex.
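To make this concrete, here is a minimal sketch of such a store, assuming a 32 byte hash type; the names are ours, and the actual in-memory store in iroh-blobs is more elaborate.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

use bytes::Bytes;

/// Stand-in for a BLAKE3 hash.
type Hash = [u8; 32];

/// A complete blob: data and outboard, both cheap to clone thanks to Bytes.
struct CompleteEntry {
    data: Bytes,
    outboard: Bytes,
}

/// Minimal in-memory store: complete blobs in one map, incomplete blobs as a
/// map from (hash, chunk number) to a ≤ 1024 byte chunk.
#[derive(Default)]
struct MemStore {
    complete: Mutex<BTreeMap<Hash, CompleteEntry>>,
    partial: Mutex<BTreeMap<(Hash, u64), Bytes>>,
}
```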
But the whole point of a blob store - in particular for iroh-blobs, where we don't have an upper size limit for blobs - is to store a lot of data persistently. So while an in memory store implementation is included in iroh-blobs, an in memory store won't do for all use cases.
Ok, fine. Let's use a database
We want a database that works on the various platforms we support for iroh, and that does not come with a giant dependency tree or non-Rust dependencies. So our current choice is an embedded database called redb. But as we will see, the exact choice of database does not matter that much.
A naive implementation of a blob store using a relational or key value database would just duplicate the above approach: a map from hash and chunk number to a ≤ 1024 byte data block, except now in a persistent table.
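As a sketch, and assuming redb's support for tuple keys, the naive schema would look roughly like this (the real store does not use this layout):

```rust
use redb::TableDefinition;

// One row per ≤ 1024 byte chunk, keyed by (hash, chunk number).
// Works fine for small blobs, but adds per-row overhead and forces all reads
// of large blobs through the database instead of the file system.
const CHUNKS: TableDefinition<(&[u8], u64), &[u8]> = TableDefinition::new("chunks");
```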
But this approach has multiple problems. While it would work very well for small blobs, it would be very slow for large blobs. First of all, the database will have significant storage overhead for each chunk. Second, operating systems are very efficient at accessing large files. Whenever you do disk IO on a computer, you will always go through the file system, so anything you add on top will only slow things down.
E.g. playing back videos from this database would be orders of magnitude slower than playing back from disk. And any interaction with the data would have to go through the iroh node, while with a file system we could directly use the file on disk.
So, no database?
This is a perfectly valid approach when dealing exclusively with complete, large files. You just store the data file and the outboard file in the file system, using the hash as file name. But this approach has multiple downsides as well.
First of all, it is very inefficient if you deal with a large number of tiny blobs, like we frequently do when working with iroh-docs or iroh-willow documents. Just the file system metadata for a tiny file will vastly exceed the storage needed for the data itself.
Also, now we are very much dependent on the quirks of whatever file system our target operating system has. Many older file systems like FAT32 or EXT2 are notoriously bad at handling directories with millions of files. And we can't just limit ourselves to e.g. Linux servers with modern file systems, since we also want to support mobile platforms and Windows PCs.
And last but not least, creating the metadata for a file is very expensive compared to writing a few bytes. We would be limited to a pathetically low download speed when bulk downloading millions of blobs, like for example an iroh collection containing the Linux source code. For very small files, embedded databases are frequently faster than the file system.
There are additional problems when dealing with large incomplete files. Enough to fill a book. More on that later.
What about a hybrid approach?
It seems that databases are good for small blobs, and the file system works great for large blobs. So what about a hybrid approach where small blobs are stored entirely inline in a database, but large blobs are stored as files in the file system? This is complex, but provides very good performance both for millions of small blobs and for a few giant blobs.
We won't ever have a directory with millions of files, since tiny files never make it to the file system in the first place. And we won't need to read giant files through the bottleneck of a database, so access to large files will be as fast and as concurrent as the hardware allows.
This is exactly the approach that we currently take in the iroh-blobs file based store. For both data files and outboards, there is a configurable threshold below which the data will be stored inline in the database. The configuration can be useful for debugging and testing, e.g. you can force all data on disk by setting both thresholds to 0. But the defaults of 16 KiB for inline outboard and data size make a lot of sense and will rarely have to be changed in production.
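A sketch of what this configuration amounts to; the struct and field names here are assumptions for illustration, not the exact iroh-blobs API:

```rust
/// Thresholds below which data and outboards are stored inline in the
/// metadata database instead of as files on disk.
struct InlineOptions {
    /// data of this size or smaller is stored in the database
    max_data_inlined: u64,
    /// outboards of this size or smaller are stored in the database
    max_outboard_inlined: u64,
}

impl Default for InlineOptions {
    fn default() -> Self {
        // the defaults discussed above: 16 KiB for both
        Self {
            max_data_inlined: 1024 * 16,
            max_outboard_inlined: 1024 * 16,
        }
    }
}

/// For debugging or testing: force everything onto the file system.
fn all_on_disk() -> InlineOptions {
    InlineOptions {
        max_data_inlined: 0,
        max_outboard_inlined: 0,
    }
}
```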
While this is the most complex approach, there is really no alternative if you want good performance both for a large number of tiny blobs and a small number of giant blobs.
The hybrid blob store in detail
What do we have to store?
We want to have the ability to quickly list all blobs or determine if we have data for a hash. Small blobs live completely in the database, so for these this is not a problem. But even for large blobs we want to keep a bit of metadata in the database, such as whether they are complete and, if not, whether we at least have a verified size. But we also don't want to keep more than this in the metadata database, so we don't have to update it frequently as new data comes in.
An individual database update is very fast, but syncing the update to disk is extremely slow no matter how small the update is. So the secret to fast download speeds is to not touch the metadata database at all if possible.
Let's take a look at the metadata in detail.
The metadata table
The metadata table is a mapping from a BLAKE3 hash to an entry state. The rough entry state can be either complete or partial.
Complete entry state
Data can either be inline in the metadata db, owned in a canonical location in the file system (in this case we don't have to store the path, since it can be computed from the hash), or external in a number of user owned paths that have to be explicitly stored. In the owned or external case we store the size in the metadata, since getting it from a file is an IO operation with non-zero cost.
Owned data files above the inline threshold are by default stored in the data directory of the blob store, with a file name <BLAKE3 HASH HEX>.data.
An outboard can either be inline in the metadata db, owned in a canonical location in the file system, or not needed in case the data is no larger than a chunk group. Any data ≤ 16 KiB will have an outboard of size 0, which we don't bother to store at all. Outboards are never in user owned paths, so we never have to store a path for them.
Owned outboard files above the inline threshold are by default stored in the data directory of the blob store, with a file name <BLAKE3 HASH HEX>.obao4. Why .obao4? .obao is the file extension chosen for the normal bao format, and .obao4 indicates that we are using a chunk group size of 2⁴.
Partial entry state
For a partial entry, there are just two possible states. Once we have the last chunk we have a verified size, so we store it here. If we do not yet have the last chunk, we do not have a size.
Data and outboard for partial entries are always stored on disk, even when they are below the inline threshold. The reason for this is that updating inline blobs in the database would have very bad performance.
However, please keep in mind that entries below the chunk group size, which should be the majority of tiny blobs, will go directly from unknown to complete without ever having the partial state in the metadata database.
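Putting the complete and partial cases together, the entry state can be sketched roughly as follows; variant and field names are ours, and the real types differ in detail:

```rust
use std::path::PathBuf;

/// Per-hash entry state in the metadata table (sketch).
enum EntryState {
    Complete {
        data: DataLocation,
        outboard: OutboardLocation,
    },
    Partial {
        /// verified size, known once the last chunk has been received
        size: Option<u64>,
    },
}

enum DataLocation {
    /// stored directly in the metadata database
    Inline(Vec<u8>),
    /// stored in the canonical <hash>.data file; the path is derived from the hash
    Owned { size: u64 },
    /// stored in one or more user owned paths that must be recorded explicitly
    External { paths: Vec<PathBuf>, size: u64 },
}

enum OutboardLocation {
    /// stored directly in the metadata database
    Inline(Vec<u8>),
    /// stored in the canonical <hash>.obao4 file
    Owned,
    /// data is at most one chunk group, so the outboard is empty
    NotNeeded,
}
```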
Inline data and inline outboard table
The inline data table is kept separate from the metadata table since it will tend to be much bigger. Mixing inline data with metadata would make iterating over metadata entries slower. For the same reason, the outboard table is separate. You could argue that the inline outboard and data should be combined in a single table, but I decided against it to keep a simple table structure. Also, with the default settings, if the data is inline, no outboard will be needed at all. Likewise, when an outboard is needed, the data won't be inline.
For the inline data and outboard table, you will have small keys stored together with up to 16 KiB values. This will lead to suboptimal performance. Here a database with key value separation such as fjall might be an option in the future.
Tags table
The tags table is just a simple CRUD table that is completely independent of the other tables. It is just kept together with the other tables for simplicity.
A note about sizes
A complication with the hybrid approach is that when we have a hash to get data for, we do not know its size yet. So we cannot immediately decide whether it should go into a file or the database. Even once we request the data from a remote peer, we don't know the exact size. A remote node could lie about the size, e.g. tell us that the blob has an absurdly large size. As we get validated chunks, the uncertainty about the size decreases. After the first validated chunk of data, the uncertainty is only a factor of 2, and once we receive the last chunk we know the exact size.
This is not the end of the world, but it does mean that we need to keep the data for a hash in some weird indeterminate state before we have enough data to decide if it will be stored in memory or on disk.
Blob lifecycle
There are two fundamentally different ways data can be added to a blob store.
Adding local files by name
If we add data locally, e.g. from a file, we have the data but don't know the hash. We have to compute the complete outboard, the root hash, and then atomically move the file into the store under the root hash. Depending on the size of the file, data and outboard will end up in the file system or in the database. If there was partial data there before, we can just replace it with the new complete data and outboard. Once the data is stored under the hash, neither the data nor the outboard will be changed.
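A rough sketch of that flow, with a placeholder compute_outboard standing in for the real outboard computation done with bao-tree; paths and names are assumptions:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Placeholder: in reality the outboard is computed with the bao-tree crate,
/// streaming the file instead of reading it into memory.
fn compute_outboard(path: &Path) -> io::Result<(blake3::Hash, Vec<u8>)> {
    let data = fs::read(path)?;
    Ok((blake3::hash(&data), Vec::new()))
}

/// Import a local file into the store by moving it under its root hash.
fn import_file(src: &Path, data_dir: &Path) -> io::Result<()> {
    // 1. the expensive part: hash the file and compute the outboard
    let (hash, outboard) = compute_outboard(src)?;
    let hex = hash.to_hex();
    // 2. write the outboard, then atomically move the data into place
    //    (rename requires src to be on the same device as data_dir)
    fs::write(data_dir.join(format!("{}.obao4", hex)), &outboard)?;
    fs::rename(src, data_dir.join(format!("{}.data", hex)))?;
    // 3. finally, the metadata entry for the hash would be set to complete
    //    (not shown here)
    Ok(())
}
```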
Syncing remote blobs by hash
If we sync data from a remote node, we do know the hash but don't have the data. In case the blob is small, we can request and atomically write the data, so we have a similar situation as above. But as soon as the data is larger than a chunk group size (16 KiB), we will have a situation where the data has to be written incrementally. This is much more complex than adding local files. We now have to keep track of which chunks of the blob we have locally, and which chunks of the blob we can prove to have locally (not the same thing if you have a chunk group size > 1).
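A sketch of the bookkeeping this requires; the representation is an assumption, and the real store uses a more compact bitfield:

```rust
/// State for a blob that is being downloaded (sketch).
struct PartialBlob {
    /// which ≤ 1024 byte chunks have been written to the data file
    chunks_present: Vec<bool>,
    /// which 16 KiB chunk groups we can actually prove against the outboard;
    /// with a chunk group size > 1 this lags behind chunks_present
    groups_provable: Vec<bool>,
    /// verified size, known once the last chunk has arrived
    verified_size: Option<u64>,
}
```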
Using the data
As soon as data is added, we can serve it over the network or use it locally. We don't have to wait for completion for this. We can start sharing a large blob even before we are finished downloading it.
When a range request comes in that contains ranges that are not yet present locally, we will notice that the local data is incomplete and stop answering the request - outgoing data is always validated.
Alternatively we can rely on a bitfield to make sure that all requested data is present locally.
Blob deletion
On creation, blobs are tagged with a temporary tag that prevents them from being deleted for as long as the process lives. They can then be tagged with a persistent tag that prevents them from being deleted even after a restart. And last but not least, large groups of blobs can be protected from deletion in bulk by putting a sequence of hashes into a blob and tagging that blob as a hash sequence.
We also provide a way to explicitly delete blobs by hash, but that is meant to be used only in case of an emergency. You have some data that you want gone no matter how dire the consequences are.
Deletion always applies to a blob as a whole. There is currently no mechanism to protect or delete ranges of a file. This makes a number of things easier. A chunk of a blob can only go from not present (all zeroes) to present (whatever the correct data is for the chunk), not the other way around. And this means that all changes due to syncing from a remote source commute, which makes dealing with concurrent downloads from multiple sources much easier.
Deletion runs in the background, and conceptually you should think of it as always being on. Currently there is a small delay between a blob being untagged and it being deleted, but in the future blobs will be immediately deleted as soon as they are no longer tagged.
Directories and files
As mentioned above, files that exceed the inline threshold and are owned by the store are stored in the data directory of the blob store. Data files have a .data extension, outboard files an .obao4 extension.
There is a temp directory where large files are staged and the outboard is computed while they are being added to the store. This directory is configurable, but must be on the same device as the data directory so that files can be atomically moved into the data directory when ready.
The metadata database itself is called blobs.db. This is currently a redb database file.
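Putting the names from this section together, an owned store directory looks roughly like this (the temp directory is configurable and may live elsewhere on the same device):

```
<blob store root>
├── blobs.db                # redb metadata database
├── data/
│   ├── <hash in hex>.data  # raw blob data
│   ├── <hash in hex>.obao4 # outboard, chunk group size 2⁴
│   └── ...
└── temp/                   # staging area for files being added
```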
Even more detail
By now we know in pretty high detail where the data is stored, what metadata we keep, and how blobs enter and leave the store. But let's go one level deeper. This is the point where this turns into a bit of a rant.
Metadata write batching
Relational databases as well as most key value stores usually have strong durability guarantees. Unless you change the default config, after a transaction commit you have a guarantee that the modifications in the transaction will be permanent and will be visible even if the process or the entire computer crashes immediately afterwards.
The downside of this is that data needs to be synced to disk on commit. Even on a computer with SSDs, a sync will take on the order of milliseconds, no matter how tiny the update is. So creating a write transaction for each metadata update would limit us to at most a few thousand updates per second, even on very fast computers. This is independent of the database used: two very different embedded databases, redb and sqlite, have very similar performance limitations for small write transactions. When using sqlite as a blob store for small blobs, which I did for a previous project, you would have to disable syncing the write ahead log to get acceptable performance.
Redb does not use a write ahead log and for very good reasons does not allow you to tweak durability settings. So to get around this limitation, we combine multiple metadata updates in a single transaction up to some number and time limit. The downside of this approach is that it reduces durability guarantees - if you have a crash a few milliseconds after adding a blob, after the crash the blob will be gone. In most cases this is acceptable, and if you want guaranteed durability you can explicitly call sync. Since we are still using transactions with full durability guarantees, we can at least be sure that the database will be in a consistent state after a crash.
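A sketch of such batching with redb; the table layout and batch representation are assumptions, not the actual iroh-blobs code:

```rust
use redb::{Database, TableDefinition};

// illustrative layout: hash -> serialized entry state
const METADATA: TableDefinition<&[u8], &[u8]> = TableDefinition::new("metadata");

/// Write a whole batch of metadata updates in a single transaction, so the
/// expensive sync to disk happens once per batch instead of once per update.
/// The caller collects updates until a count or time limit is reached.
fn flush_batch(
    db: &Database,
    batch: &[([u8; 32], Vec<u8>)],
) -> Result<(), Box<dyn std::error::Error>> {
    let txn = db.begin_write()?;
    {
        let mut table = txn.open_table(METADATA)?;
        for (hash, entry) in batch {
            table.insert(hash.as_slice(), entry.as_slice())?;
        }
    }
    // one commit, one sync, full durability for everything in the batch
    txn.commit()?;
    Ok(())
}
```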
Incomplete files and the file system
File system APIs are not very rich. They don't provide any transactional semantics or even ordering guarantees. And what few guarantees they do provide differ from platform to platform.
This is not much of a problem when dealing with complete files. You have a file for the data and a file for the outboard, and that's it. You never write to them, and reading ranges from files is something even ancient file systems like FAT32 or EXT2 can reliably handle.
But it gets much more complex when dealing with incomplete files. E.g. you start downloading a large file from a remote node, and add chunks one by one (or, in the case of n0 flavoured bao, in chunk groups of 16 KiB).
The challenge now is to write to the file system in such a way that the data remains consistent even when the write operation is rudely interrupted by the process being killed or the entire computer crashing. Losing a bit of data that has been written immediately before the crash is not great, but tolerable. But we don't ever want to have a situation after a crash where we think we have some data, but actually we are missing something about the data or the outboard that is needed to verify the data.
BLAKE3 is very fast, so for small to medium files we can just validate the blob on node startup. For every chunk of data in the file, compute the hash and check it against the data in the outboard. This functionality exists in the bao-tree crate. valid_ranges will give you an iterator of chunk ranges that you have locally. But not even BLAKE3 is fast enough to do this for giant files like ML models or disk images. For such files it would take many minutes at startup to compute the validity bitfield, so we need a way to persist it.
And that comes with its own rabbit hole of problems. As it turns out, files are hard. In particular writing to files.
Files are hard
As discussed above, we want to keep track of data and outboard in the file system for large blobs. Let's say we use three files: <hash>.data, <hash>.outboard and <hash>.bitfield. As we receive chunks of verified data via bao, we write the hashes to the outboard file, the data to the data file, and then update the bitfield file with the newly valid chunks. But we need to do the updates in exactly this order, otherwise we might have a situation where the bitfield file has been updated before the data and outboard files, meaning that the bitfield file is inconsistent with the other files. The only way to fix this would be to do a full validation, which is what we wanted to avoid.
We can of course enforce the order by calling fsync on the data and outboard file before updating the bitfield file, after each chunk update. But as we have seen with the metadata database, that is horribly expensive no matter how fast your computer is. Just to put some rough numbers on this: a sync to the file system will take on the order of milliseconds, even on SATA SSDs. So if you sync after each chunk, and a sync takes 5 ms, your write speed is now limited to 200 kilobytes per second. Even if you sync only after each 16 KiB chunk group, your download speed is limited to 3.2 megabytes per second.
But if we don't do the sync, there is no guarantee in what order, or whether at all, the data makes it to disk. In case of a crash our bitfield could be inconsistent with the data, and we would be better off not persisting the bitfield at all.
Most operating systems have platform dependent ways to finely control syncing for memory mapped files. E.g. on POSIX systems there is the msync API, and on Windows the similar FlushViewOfFile API. But first of all we are not confident that memory mapping will work consistently on all platforms we are targeting. And also, this API is not as useful as it might seem.

While it provides fine grained control over syncing individual ranges of memory mapped files, it does not provide a way to guarantee that no sync happens unless you explicitly call it. So if you map the three files in memory and then update them, even if you never call msync, there is no guarantee in which order the data makes it to disk. The operating system page cache does not know about files or about write order; it will persist pages to disk in whatever order is most convenient for its internal data structures.
So what's the solution? We can write to the data and outboard file in any order and without any sync calls, as often as we want. These changes will be invisible after a crash, provided that the bitfield file is not updated.
We keep changes to the bitfield file fully in memory, in a separate data structure. In regular intervals we sync the data and outboard file to make sure the data is present on disk, and then write the changed bitfield ranges to disk and sync the bitfield file.
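A sketch of such a checkpoint using only standard file APIs; the bitfield layout and the function itself are hypothetical:

```rust
use std::fs::File;
use std::io::{self, Seek, SeekFrom, Write};

/// Persist the newly valid chunk ranges, but only after the data and outboard
/// they refer to are known to be on disk.
fn persist_checkpoint(
    data: &mut File,
    outboard: &mut File,
    bitfield: &mut File,
    bitfield_bytes: &[u8],
) -> io::Result<()> {
    // 1. make sure everything written to data and outboard so far is durable
    data.sync_all()?;
    outboard.sync_all()?;
    // 2. only now overwrite the bitfield file with the in-memory state...
    bitfield.seek(SeekFrom::Start(0))?;
    bitfield.write_all(bitfield_bytes)?;
    // 3. ...and sync it, so the bitfield never claims data we might not have
    bitfield.sync_all()?;
    Ok(())
}
```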
To get started, take a look at our docs, dive directly into the code, or chat with us in our discord channel.