This is a transcript of my talk on Diving into Merkle Trees that I will giveat Lambda Days and ScaleConf Colombia. Slides and video should be up soon!

Introduced in 1979 by Ralph C. Merkle in his Thesis: Secrecy, Authentications,and Public Key Systems, the Merkle Tree, also known as a binary hash tree, is adata structure used for efficiently summarizing and verifying the integrity oflarge sets of data enabling users to verify the authenticity of their receivedresponses.
“The general idea in the new system is to use aninfinite tree of one-time signatures. […] Each node of the tree performs threefunctions: (1) it authenticates the left sub-node (2) it authenticates the rightsub-node (3) it signs a single message.”
Initially, it was used for the purpose of one-time signatures and authenticatedpublic key distribution, namely providing authenticated responses as to thevalidity ofhttps://ordep.dev/assets/images/digital-signature.pngssets/images/digital-signature.png" alt="digital-signature">
Ralph C. Merkle described a new digital signature that was able to sign anunlimited number of messages and the signature size would increaselogarithmically as a function of the number of messages signed.
At this point we can identify the two main purposes of a Merkle Tree: (1)summarize large sets of data and (2) verify that a specific piece of databelongs to a larger data set.
One-Way Hashing Functions
Before diving into one-time signatures lets first get confortable withone-way functions. Usually, a one-way function is a mathematical algorithmthat take inputs and provide unique outputshttps://ordep.dev/assets/images/one-way-hashing-functions.pngg src="/assets/images/one-way-hashing-functions.png" alt="one-way-hashing-functions">
A one-way function F is a function that iseasy to compute, but difficult to invert. Given x and F, it is easy tocompute y=F(x), but given y and F, it is effectively impossible to computex.
One-way hashing functions are especially useful within Merkle Trees for twoobvious reasons; storage and privacy.
With systems that contain massive amounts of data, the benefits of beingable to store and identify data with a fixed length output can create vaststorage savings and help to increase efficiency.
The person who computes y=F(x) is the only person who knows x. If y ispublicly revealed, only the originator of y knows x, and can choose toreveal or conceal x at his whim!
One-time Signatures
Also in 1979, Leslie Lamport published his concept of One-time Signatures.Most signature schemes rely in part on one-way functions, typically hashfunctions, for their security proofs. The beauty of Lamport scheme was that thissignature was only relying on the security of these one-way functions!
One time signatures are practical between a single pairof users who are willing to exchange the large amount of data necessary but theyare not practical for most applications without further refinements.
If 1000 messages are to be signed before new publicauthentication data is needed, over 20,000,000 bits or 2.5 megabytes must bestored as public information.
If B had to keep 2.5 megabytes of data for 1000 other users, B would have tostore 2.5 gigabytes of data. With further increases in the number of users, orin the number of message each user wants to be able to sign, the system wouldeventually become burdensome.
Improving One-time Signatures
Merkle focused on how to eliminate the huge storage requirements in the Lamportmethod and proposed an improved One-time Signature that reduced the size ofsigned messages by almost a factor of 2.
This improved method was easy to implement and cutted the size of the signedmessage almost in half, although this was still too large for most applications;instead of storing 2.5 gigabytes of data, B only had to store
The method is called tree authentication because it’s computation forms a binarytree of recursive calls. Using this method, requires only log2 ntransmissions. A close look at the algorithm will reveal that half thetransmissions are redundant since we’re able to compute a given parent node Afrom their children A1 and A2, so there’s really no need to send A.
How to compute a Merkle Root?
Given that we have a data file represented by a set of blocks [L1, L2].

We start by applying a one-way hashing function to https://ordep.devhttps://ordep.dev/assets/images/building-blocks-02.pngntext highlighter-rouge">h(📄L1) = 9ec4.

The next step is to apply the same function to h(📄L2) = 7e6a.

To calculate the parent node, we need always to concatenate both child hashesh(📄L1) and h(📄L2) before applying, once again, the h(h(📄L1) || h(📄L2)) = aea9.

At this point we know the building blocks of a Merkle Tree; let’s represent itin Elixir.
Building a Merkle-Tree
In order to build a Merkle Tree, we need to define three new types: Leaf,Node, and the MerkleTree itself. Let’s start by defining Leaf – itshould contain the hash and the value of a given data block.
defmodule MerkleTree.Leaf do defstruct [:hash, :value] @type hash :: String.t @type value :: String.t @type t :: %MerkleTree.Leaf{ hash: hash, value: value }end
The next type is Node – it should contain the left and right childnodes, and the hash value of the concatenation of both child hashes.
defmodule MerkleTree.Node do defstruct [:hash, :left, :right] @type hash :: String.t @type left :: MerkleTree.Node.t | MerkleTree.Leaf.t @type right :: MerkleTree.Node.t | MerkleTree.Leaf.t @type t :: %MerkleTree.Node{ hash: hash, left: left, right: right }end
And, finally, the Merkle Tree itself.
defmodule MerkleTree do defstruct [:root] @type root :: MerkleTree.Node.t @type t :: %MerkleTree{ root: root }end
Hashing the data blocks
The first step to build a Merkle Tree is to hash the data blocks andconverting them to leaves. In order to hash something we need to define a newmodule Crypto with a single function hash that should accept an input and isresponsible for encoding it using the appropriate one-way hashing function.
defmodule MerkleTree.Crypto do def hash(input, type \\ :sha256) do type |> :crypto.hash("#{input}") |> Base.encode16 endend
Having blocks = ["L1", "L2", "L3", "L4"], the expected output would be:
[ %MerkleTree.Leaf{ value: "L1", hash: "DFFE8596427FC50E8F64654A609AF134D45552F18BBECEF90B31135A9E7ACAA0" }, %MerkleTree.Leaf{ value: "L2", hash: "D76354D8457898445BB69E0DC0DC95FB74CC3CF334F8C1859162A16AD0041F8D" }, %MerkleTree.Leaf{ value: "L3", hash: "842983DE8FB1D277A3FAD5C8295C7A14317C458718A10C5A35B23E7F992A5C80" }, %MerkleTree.Leaf{ value: "L4", hash: "4A5A97C6433C4C062457E9335709D57493E75527809D8A9586C141E591AC9F2C" }]
By defining a function new that accepts blocks, we should be able to hashthe data blocks and convert them into Leafs.
defmodule MerkleTree do alias MerkleTree.Leaf alias MerkleTree.Crypto def new(blocks) do blocks |> Enum.map(&%Leaf.build(&1, Crypto.hash(&1))) endend
Where Leaf.build/2 is just:
defmodule MerkleTree.Leaf do def build(value, hash) do %Leaf{value: value, hash: hash} endend
Calling the function above MerkleTree.new ["L1", "L2", "L3", "L4"] shouldyield the expected output. Although, we’re not done yet.
Hashing the nodes
Remember that for creating a Node we need to join both child hashes andapply the one-way hashing function once again?
defmodule MerkleTree.Node do alias MerkleTree.Crypto def new(nodes) do nodes |> Enum.map_join(&(&1.hash)) |> Crypto.hash |> build(nodes) end def build(hash, [left, right]) do %Node{hash: hash, left: left, right: right} endend
That’s basically what new is doing before calling build(nodes). Once wehave the Node hash value, we’re ready to create a new Node with hash,left, and right. As an example, by calling the function above with these twoleaves:
MerkleTree.Node.new([ %MerkleTree.Leaf{ value: "L1", hash: "DFFE8596427FC50E8F64654A609AF134D45552F18BBECEF90B31135A9E7ACAA0" }, %MerkleTree.Leaf{ value: "L2", hash: "D76354D8457898445BB69E0DC0DC95FB74CC3CF334F8C1859162A16AD0041F8D" }])
Would yield the following Node:
%MerkleTree.Node{ hash: "8C569660D98A20D59DE10E134D81A8CE55D48DD71E21B8919F4AD5A9097A98C8", left: %MerkleTree.Leaf{ value: "L1", hash: "DFFE8596427FC50E8F64654A609AF134D45552F18BBECEF90B31135A9E7ACAA0" }, right: %MerkleTree.Leaf{ value: "L2", hash: "D76354D8457898445BB69E0DC0DC95FB74CC3CF334F8C1859162A16AD0041F8D" }}
All the way up
Having a way to build a Node from a pair of nodes, we can now make use ofrecursion to calculate the remaining nodes up to the Merkle root. Let’scomplete the new function:
defmodule MerkleTree do alias MerkleTree.Leaf alias MerkleTree.Crypto def new(blocks) do blocks |> Enum.map(&%Leaf.build(&1, Crypto.hash(&1))) |> build endend
Currently, this new function it’s yielding a list of leaves. Let us define abuild function that accepts that list of leaf nodes with the goal ofgrouping it into several pairs of leaf nodes in order to build the parentNode by concatenating both left and right child hashes.
defp build(nodes) do nodes |> Enum.chunk_every(2) |> Enum.map(&MerkleTree.Node.new(&1)) |> buildend
Note that we’re making use of tail recursion to build our Merkle Tree from theground up to the root. Still, we need to stop that recursive processing.
defp build([root]) do %MerkleTree{root: root.hash, tree: root}enddefp build(nodes) do nodes |> Enum.chunk_every(2) |> Enum.map(&MerkleTree.Node.new(&1)) |> buildend
By pattern matching a single element root in the list of nodes, we’re now ableto stop the processing and return a MerkleTree.
The final new function would look like this:
defmodule MerkleTree do alias MerkleTree.Leaf alias MerkleTree.Crypto def new(blocks) do blocks |> Enum.map(&%Leaf.build(&1, Crypto.hash(&1))) |> build endend
Finally, by calling the function MerkleTree.new ["L1", "L2", "L3", "L4"] wouldyield the following result:
%MerkleTree{ root: "B8EC6582F5B8ED1CDE7712275C02C8E4CC0A2AC569A23F6B7738E6B69BF132E6", tree: %MerkleTree.Node{ hash: "B8EC6582F5B8ED1CDE7712275C02C8E4CC0A2AC569A23F6B7738E6B69BF132E6", left: %MerkleTree.Node{ hash: "8C569660D98A20D59DE10E134D81A8CE55D48DD71E21B8919F4AD5A9097A98C8", left: %MerkleTree.Leaf{ value: "L1", hash: "DFFE8596427FC50E8F64654A609AF134D45552F18BBECEF90B31135A9E7ACAA0" }, right: %MerkleTree.Leaf{ value: "L2", hash: "D76354D8457898445BB69E0DC0DC95FB74CC3CF334F8C1859162A16AD0041F8D" } }, right: %MerkleTree.Node{ hash: "29C5146A0AABBC4444D91087D91D2637D8EB4620A686CF6179CCD7A0BFB9B8EF", left: %MerkleTree.Leaf{ value: "L3", hash: "842983DE8FB1D277A3FAD5C8295C7A14317C458718A10C5A35B23E7F992A5C80" }, right: %MerkleTree.Leaf{ value: "L4", hash: "4A5A97C6433C4C062457E9335709D57493E75527809D8A9586C141E591AC9F2C" } } }}
Audit Proof
As we already know, we can use a Merkle Tree to verify that a specific piece ofdata belongs to a larger data set. The Merkle audit proof is the missing nodesrequired to compute all of the nodes between the data block and the Merkle root.If a Merkle audit proof fails to prodhttps://ordep.devhttps://ordep.dev/assets/images/audit-proof.pngmatches theoriginal Merkle root hash, it means that our data block is not present in thetree.

In this example, we need to provide a proof that the data block L1 exists inthe tree. Since we already know the hash value of L1, we’ll need the hashvalue of L2 in order to compute P1. Now that we are able to compute P1 wefinally need to get P2 to compute R. In this specific case the Merkle auditproof is a list of nodes [H2, P2].
The use of tree authentication is now fairly clear. A given userA transmits R to another user B. A then transmits the authentication path forYi. B knows R, the root of the authentication tree, by prior arrangement. B canthen authenticate Yi, and can accept any Ynth from A as genuine.
How they are useful?
Merkle trees are especially useful in distributed, peer-to-peer systems wherethe same data should exist in multiple places. By using Merkle Trees we candetect inconsistencies between replicas, reduce the amount of transferred dataenabling peer-to-peer file sharing, and maintaining several versions of the sametree, also called persistent data-structures.
Detect inconsistencies
Having a data file represented by a data structure we’re able to detectinconsistencies between replicas of that same tree. Take for example threereplicas of the same Mhttps://ordep.devhttps://ordep.dev/assets/images/replicas-00.pnge root nodes we can makesure that those trees are not the same, or in this case, there areinconsistencies between them.

By using an Anti-entropy mechanismonly the data neededto repair the inconsistent tree.

To compare the state of two nodes, they exchange the corresponding Merkle Treesby levelhttps://ordep.devhttps://ordep.dev/assets/images/replicas-02.pngn the tree if the corresponding hashes aredifferent. If two corresponding leaf nodes have different hashes, then there areobjects which must be repaired.

This is actually used by Dynamo, Riak, and Cassandra to repair bad replicas!
Peer-to-peer file sharing
The principal advantage of Merkle Trees is that each branch of the tree can bechecked independently without requiring nodes to download thhttps://ordep.dev/assets/images/peer-to-peer-01.pngpeer-to-peer file sharing another good use for Merkle Trees, wherewe start by fetching the root of the tree from a trusted source to access agiven file.

Since we can fetch single parts of a tree, reducing the amount of transferreddata, we then fetch chunks of data from untrusted sources.

We start by fetching L3 and deriving its hash, b2d0. To allow us to get tothe root, we must fetch the hash value from the right leaf, 8f14. With thesetwo nodes, we can derive the next hash value, 165f. By fetching the lasthash, 165f, to derive the root hash,which is indeed 9cee.

We were building a partial tree having just the root hash and a given datablock. If the root computed from the audit path matches the true root, then theaudit path is proof that the data block exists in the tree.
L4, thebranch that links to it must calculate new hashes all the way up to theroot, although, the other branches stay intact.

Wrapping up

Merkle trees are just binary trees containing an infinite number ofcryptographic hashes where leaves contain hashes of data blocks and nodescontain hashes of their children. They also produce a Merkle Root that summarizesthe entire data set and that’s publicly distributed across the network and can easilyprove that a given data block exists in the tree.
You can find them in Cassandra, IPFS, Riak, Ethereum, Bitcoin, Open ZFS, andmuch more. Also, you have lots of papers to read as well if you want to diveeven deeper.
Have fun!
References
