Irmin: an OCaml Library / Christine Rose

Published on 26th October 2021

WHAT IS IRMIN?
Irmin is an open-source, distributed, version-controlled storage system, best thought of as a database with a Git-like data model. In the simplest terms, Irmin is an OCaml library that can be embedded into application code to store data. With Irmin, you can store data on a remote server, locally on your own file system, or even in memory. You choose the storage location in your OCaml code, so where your data lives depends solely on your application's needs. The Irmin library then keeps track of your data, and applications built on top of Irmin can be deployed on any platform that OCaml can target.

HISTORY OF IRMIN
Thomas Gazagnaire created Irmin in 2013 at the University of Cambridge, as part of a collective that would later become Tarides. Irmin is a component of the broader MirageOS project, which provides a robust software framework for developing portable networked applications. Tarides develops MirageOS and Irmin specifically to address issues with modern systems, namely cloud-centric system architecture, inadequate developer tools, and layered complexity.

There are fundamental flaws in the current cloud architecture: high latency, security concerns, and data privacy issues. Therefore, Irmin focuses on an offline-first architecture to improve performance, significantly reduce security concerns, and prioritize data privacy.

Cloud architecture often comes with high latency: a delay between a request for information and the cloud service's response. We've become accustomed to instantaneous results over the internet. If users must wait more than a few seconds to retrieve information, we risk losing them. High latency increases frustration and decreases both enjoyment and efficiency. Irmin confronts high latency through its offline-first architecture, which tracks data changes locally before merging them externally. This offline-first approach also benefits places like remote hospitals, where the system can continue to work despite an unstable internet connection or a complete disconnection.

Cybersecurity has become increasingly important as the world turns online for data storage. We’ve addressed security concerns by making the Irmin audit log tamper-proof, which provides transparency, and you can build complex data-flow pipelines using a reactive language to define policies. Irmin itself is also written in OCaml, a modern, type-safe functional programming language, to ensure its own logic is as robust as possible.

We place great importance on protecting data privacy, so Irmin has created a system that is only synchronized when needed. Data is processed locally by default, making it more secure and reducing cost. You can schedule data access via defined data-flow policies—all customizable for the user’s needs.

Irmin provides a portable and modular storage toolkit. You have the flexibility to choose between any platform that OCaml can target, including native Unix-like or Windows systems, unikernels based on Xen or Solo5, or JavaScript targets, depending on your specific needs and platform.

Learn more about how Irmin reduces security concerns and prioritizes data privacy, as well as about MirageOS overall, in Thomas's "Packaging Tezos as a Mirage OS Unikernel" presentation.

HOW IRMIN IS USED
Irmin is mostly used as a versioned storage stack: it does for application data what Git does for source code. Some backends are fully compatible with Git, and all of them follow the Git philosophy of version control. Unlike most storage systems, which are limited by the capabilities of the OS kernel, Irmin is very portable and can be deployed as a unikernel. Unikernels are specialised applications that target a wide range of deployment environments, for instance a minimal virtual machine (VM) running on top of a hypervisor or a pure JavaScript library inside a browser.

Two things make Irmin distinct from most other storage libraries:

It uses a Git-like data model. (In this respect, it's similar to what blockchains now do, but very different from SQL, for example.)
It's designed to be plug-and-play with different storage backends.
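
To illustrate what "plug-and-play with different storage backends" can look like in OCaml, here is a minimal, self-contained sketch. It does not use Irmin itself: the BACKEND signature, the two toy backends, and the Make_store functor are hypothetical names invented for this example. The idea it shows is that application logic is written once against a signature, and the storage implementation is swapped in through a functor.

```ocaml
(* Conceptual sketch only: none of these names come from Irmin's API. *)

(* Hypothetical backend signature: anything that can read and write
   string values under string keys. *)
module type BACKEND = sig
  type t
  val create : unit -> t
  val set : t -> string -> string -> unit
  val find : t -> string -> string option
end

(* One possible backend: a mutable in-memory hash table. *)
module Mem_backend : BACKEND = struct
  type t = (string, string) Hashtbl.t
  let create () = Hashtbl.create 16
  let set t k v = Hashtbl.replace t k v
  let find t k = Hashtbl.find_opt t k
end

(* Another possible backend: an append-only log, where the newest
   binding for a key shadows the older ones. *)
module Log_backend : BACKEND = struct
  type t = (string * string) list ref
  let create () = ref []
  let set t k v = t := (k, v) :: !t
  let find t k = List.assoc_opt k !t
end

(* Application logic, written once against the signature. *)
module Make_store (B : BACKEND) = struct
  let roundtrip key value =
    let t = B.create () in
    B.set t key value;
    B.find t key
end

(* Choosing a backend is a one-line change. *)
module Mem_store = Make_store (Mem_backend)
module Log_store = Make_store (Log_backend)

let () =
  assert (Mem_store.roundtrip "greeting" "hello" = Some "hello");
  assert (Log_store.roundtrip "greeting" "hello" = Some "hello")
```

Both stores behave the same from the application's point of view, even though the data is kept in different structures underneath.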

The complexity of most contemporary system stacks makes it impossible to perform a full system analysis. Every distributed system needs persistence for fault tolerance and scalability, scheduling of communication between nodes, and tracing across nodes for debugging and profiling. Irmin addresses these needs with version control and data persistence, which allows a single process to utilize memory on startup. Irmin stores data in a tree-like structure. These forked data structures follow a Git-like workflow, making it possible to revert to a previous branch and merge forked objects together. In fact, Irmin offers more than Git in this respect: its snapshot feature is built on these underlying immutable objects, so it doesn't have the limitations of other modern storage systems.
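
The relationship between immutable snapshots, branches, and reverting can be sketched with a small toy model. This is not Irmin's implementation or API: the commit and branch types and the set, get, fork, and revert functions are hypothetical names chosen for illustration, and merging is left out to keep the example short.

```ocaml
(* Conceptual sketch only: a toy model, not Irmin's implementation or API. *)

module String_map = Map.Make (String)

(* A commit is an immutable snapshot of the data plus a link to its parent. *)
type commit = {
  parent : commit option;
  data : string String_map.t; (* persistent map: old versions stay valid *)
  message : string;
}

(* A branch is a mutable name pointing at an immutable commit. *)
type branch = { name : string; mutable head : commit }

let root = { parent = None; data = String_map.empty; message = "init" }

(* Committing builds a new snapshot; older commits are never modified. *)
let set branch ~message key value =
  let data = String_map.add key value branch.head.data in
  branch.head <- { parent = Some branch.head; data; message }

let get branch key = String_map.find_opt key branch.head.data

(* Forking is cheap: the new branch shares the existing commit. *)
let fork branch name = { name; head = branch.head }

(* Reverting moves the head back to the parent snapshot, if there is one. *)
let revert branch =
  match branch.head.parent with
  | Some c -> branch.head <- c
  | None -> ()

let main = { name = "main"; head = root }
let () = set main ~message:"add a" "a" "1"

let dev = fork main "dev"
let () = set dev ~message:"change a" "a" "2"

let () =
  assert (get main "a" = Some "1"); (* main is unaffected by dev *)
  assert (get dev "a" = Some "2");
  revert dev;
  assert (get dev "a" = Some "1")
```

Because every commit is immutable, forking and reverting only move a pointer; no data is copied or overwritten.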

With Irmin, you can clone repositories across nodes and commit local operations within each node, yet Irmin doesn't record every operation permanently. Instead, it relies on efficient merge and sync processes. This way, if a process crashes, you can resume it from its current state, even if you didn't commit!

----

Tarides and OCaml Labs work in collaboration with the OCaml community to continuously improve the language's performance and safety through multicore, modular implicits, and interactive proof of OCaml code.
