Geo sharding a DAG

For true horizontal scalability.

Not surprisingly, this article is about sharding a DAG, specifically a DAG in the distributed setting of DLTs and blockchains. As a further disclaimer, this article focuses on sharding the data structure, the DAG itself, and not the financial ledger! With that said, this is not a scientific paper but a general description of some key elements for sharding a DAG geographically. I nevertheless believe that if you can’t shard/scale the data structure, you can’t scale the ledger.

The basis of sharding

In the database world, and in scaling in general, consistent hashing is one of the main techniques for deterministically assigning any piece of data to a shard; a shard is the ‘location’ where your data ends up. Say you want 16 shards in your system: you simply compute Hash(data) mod 16 (take the hash as a number and keep the remainder after dividing by 16, aka modulo) to identify where you want to store the information. To know where to find it again, you find a node/machine/server that satisfies Hash(IP) mod 16 == Hash(data) mod 16. This way you know, in a simple deterministic way, where to find your data. There is, of course, a bit more to it, and if you would like to get into the details please read this article.
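
To make this concrete, below is a minimal Python sketch of the hash-mod-N assignment described above. It is illustrative only; the function names and the use of SHA-256 are my assumptions, not part of any particular system.

import hashlib

NUM_SHARDS = 16

def shard_of(data: bytes) -> int:
    # Hash(data) mod 16: the remainder after division picks one of the 16 shards.
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest, "big") % NUM_SHARDS

def node_shard(ip: str) -> int:
    # A node is responsible for the shard that matches Hash(IP) mod 16.
    return shard_of(ip.encode())

# Data is stored on (and retrieved from) any node whose shard id equals the data's shard id.
print(shard_of(b"some transaction"))   # e.g. 11
print(node_shard("10.0.0.42"))         # compare with the data's shard id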

The difficult part of this approach comes when an unknown number of nodes want to divide the data into more and more shards, in other words, dynamic sharding. Dynamically changing the shards causes a so-called rebalance of the network, because data needs to be sent all around, and that is not something you want to do too often. For this reason, DLT solutions like Radix pick a VERY VERY large number of shards with the idea that the shard space never runs out. We now mod with a very large number, but face the problem that there will never be enough devices to fill this shard space one-to-one, as we potentially could with only 16 shards. This situation is handled by so-called hash rings and Distributed Hash Tables: they let us discover what our neighbors hold without knowing the entire network topology.
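
As a rough illustration of the ‘very large shard space’ idea, here is a hedged Python sketch of a consistent-hash ring with 2^64 positions, where data is handled by the first node found clockwise from the data’s position. The ring size and lookup rule are assumptions for illustration, not Radix’s actual parameters.

import bisect
import hashlib

RING_SIZE = 2**64  # shard space so large it effectively never runs out

def ring_position(key: bytes) -> int:
    return int.from_bytes(hashlib.sha256(key).digest(), "big") % RING_SIZE

class HashRing:
    def __init__(self, node_ips):
        # Each node sits at Hash(IP) on the ring and covers the arc before it.
        self.points = sorted((ring_position(ip.encode()), ip) for ip in node_ips)
        self.positions = [p for p, _ in self.points]

    def node_for(self, data: bytes) -> str:
        # Walk clockwise from the data's position to the first node on the ring.
        idx = bisect.bisect_left(self.positions, ring_position(data))
        return self.points[idx % len(self.points)][1]

ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(ring.node_for(b"some transaction"))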

Now, this works fine for key-value pairs, but not all data is just keys and values. Most real-world data, including distributed ledgers, is linked data in some form or another. It is contextual and only means something within that context; the balance of an address, for example, is the result of all transactions sending something to it. With just a distributed key-value store you would have to hop from location to location across the distributed hash table and consistent hash rings to retrieve all of this grouped and related data. If everyone needs to do this all the time, it would clog up all the network bandwidth and the entire point of sharding would be lost.

What I am trying to sketch here are some basic requirements for a distributed linked/related data structure.

  1. Given the data we are looking for, we must be able to identify its ‘location’.
  2. Given a location, we must know what network address maps to it.
  3. We want related data to be in relative proximity to its relations.

To shard a DAG while keeping related data close together, we let the vertices contain location information and only reference (via edges) vertices that are within their ‘proximity’. The problem with this approach is that it becomes trivial to fill up a single location with a lot of data if that space is imaginary (like shards in consistent hashing). This is because there is no incentive to play nice, and because a ‘location’ is arbitrarily picked and no longer a function of the data itself (a hash).

To still let vertices pick where they are attached, we need some way of limiting who can do this.

Geolocation based sharding and networking

Now that some context is created let us dive straight into the actual ideas.

A vertex requires a minimal set of fields for sharding; fields that are required for other purposes, like timestamps, are left out here.

Vertex structure:

Vertex: {
    id: Hash(fields),
    location: "gbsuv7ztr",
    vertex_ref_1: "gbsuv7ztq-v1ID",
    vertex_ref_2: "gbsuv7zts-v2ID",
    nodeID: "gbsuv7ztr-nodeID",
    nodeSig: "0xfff88...",
    nonce: 999,
    payload: {"whatever your application is"}
}

The location is defined as a Geohash: a concise description of a small area in the latitude/longitude GPS space, so not a fixed point but a small bounded region. There are always and only two edges, similar to IOTA’s branch and trunk. However, the identifier of these edges includes the location of the vertex it is referencing.
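
To make the location field concrete, here is a self-contained sketch of standard Geohash encoding in Python. The coordinates and precision are arbitrary examples; the useful property is that nearby points tend to share a prefix, which is what makes a geohash-prefixed reference like "gbsuv7ztq-v1ID" a cheap proximity hint.

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat: float, lon: float, precision: int = 9) -> str:
    # Standard geohash: alternate longitude/latitude bisections, 5 bits per character.
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, even = [], True  # start with longitude
    while len(bits) < precision * 5:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        even = not even
    return "".join(
        BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, len(bits), 5)
    )

a = geohash(52.3676, 4.9041)   # one point
b = geohash(52.3702, 4.8952)   # a few kilometres away: typically shares a prefix with a
print(a, b)
print(a + "-v1ID")             # a reference in the "<geohash>-<vertexID>" form used above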

Network nodes

Nodes in this setup need to identify themselves. A NodeID comprises at least a public key and a Geohash, which is freely picked but not arbitrary. This node Geohash is important!

A node can start attaching by issuing a vertex declaring its NodeID and its intended sharding scope. This sharding scope contains the NodeIDs of potential neighbors and is defined by a surface function (radius/square/polygon etc.) relative to the NodeID’s self-declared geohash.
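
A hedged sketch of what such a declaration could look like in Python; every field name here is my assumption rather than a specification, and the radius-based surface function is just one of the options mentioned above.

from dataclasses import dataclass

@dataclass
class ShardingScope:
    center_geohash: str        # the node's self-declared geohash
    radius_km: float           # surface function: a simple radius around the center
    neighbor_node_ids: list    # NodeIDs of potential neighbors around that surface

@dataclass
class NodeDeclaration:
    public_key: str            # a NodeID comprises at least a public key and a geohash
    geohash: str               # freely picked, but must survive the latency testing below
    scope: ShardingScope

declaration = NodeDeclaration(
    public_key="0xabc...",     # illustrative placeholder
    geohash="gbsuv7ztr",
    scope=ShardingScope(center_geohash="gbsuv7ztr", radius_km=50.0,
                        neighbor_node_ids=["gbsuv7ztq-nodeA", "gbsuv7zts-nodeB"]),
)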

Triangulation

This vertex only gets added by other nodes that perform strong testing of the node’s location, using multilateration/triangulation based on latency. How to do this exactly requires another article, but previous research has been performed on this subject.

Because of the simple limitation of the speed of information/light, it is possible to derive a confidence area for a node’s location relative to one’s own node. A NodeID’s geohash must lie within this confidence area for other nodes to accept it. This needs to be a continuous process of testing nodes for them to stay honest.
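
As a minimal illustration of the speed-of-light argument, the sketch below checks whether a peer’s claimed coordinates are even reachable within the measured round-trip time. The propagation speed, the single-peer check and the coordinates are simplifying assumptions; real multilateration would combine measurements from several peers into a confidence area.

import math

C_FIBER_KM_PER_MS = 200.0  # ~2/3 of the speed of light, a rough rule of thumb for fibre

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two lat/lon points in kilometres.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def claim_is_plausible(my_latlon, claimed_latlon, rtt_ms) -> bool:
    # A claimed location farther away than light in fibre can travel in RTT/2 is rejected.
    max_km = (rtt_ms / 2) * C_FIBER_KM_PER_MS
    return haversine_km(*my_latlon, *claimed_latlon) <= max_km

# With a 10 ms round trip, anything beyond roughly 1000 km is physically implausible.
print(claim_is_plausible((52.37, 4.90), (48.85, 2.35), rtt_ms=10))    # ~430 km apart -> True
print(claim_is_plausible((52.37, 4.90), (40.71, -74.0), rtt_ms=10))   # ~5800 km apart -> False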

As the throughput of the network increases and more nodes join, the precision of this method increases. The method incentivizes strong node distribution and prevents nodes from clustering in cloud providers.

The amount of surface area a node covers will depend on the throughput of the network and the capabilities of its machine. When vertex throughput increases, more nodes will need to join and existing nodes will shrink their surface area.

Location restrictions

A vertex references two other vertices. However, there are further restrictions on this in order to achieve sharding. First: a node is not allowed to issue a vertex with a location outside its own surface area, whose maximum size can be derived from the DAG’s throughput and the neighboring nodes.

Transaction reference surface

Besides a node only being able to issue a vertex within its self-declared and peer-tested geolocation, the vertex itself must declare a location that lies within the surface defined by the vertices it wants to reference. If a vertex’s given location is not available this way, a node will need to issue an out-of-bound vertex with increased proof of work so the network can start including those vertices.
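
Putting both restrictions together, here is a hedged validation sketch (reusing the haversine_km helper from the triangulation sketch above). The concrete surface functions, a radius for the node’s own area and a bounding box with a small margin for the referenced vertices, are my assumptions; the text above leaves them open.

def within_node_area(vertex_latlon, node_latlon, node_radius_km) -> bool:
    # Restriction 1: a node may only issue vertices located inside its own surface area.
    return haversine_km(*vertex_latlon, *node_latlon) <= node_radius_km

def within_reference_surface(vertex_latlon, ref1_latlon, ref2_latlon, margin_deg=0.1) -> bool:
    # Restriction 2: the vertex's location must fall inside the surface spanned by
    # the two vertices it references (here: their bounding box plus a margin).
    lat, lon = vertex_latlon
    lat_lo = min(ref1_latlon[0], ref2_latlon[0]) - margin_deg
    lat_hi = max(ref1_latlon[0], ref2_latlon[0]) + margin_deg
    lon_lo = min(ref1_latlon[1], ref2_latlon[1]) - margin_deg
    lon_hi = max(ref1_latlon[1], ref2_latlon[1]) + margin_deg
    return lat_lo <= lat <= lat_hi and lon_lo <= lon <= lon_hi

def can_attach(vertex_latlon, node_latlon, node_radius_km, ref1_latlon, ref2_latlon) -> bool:
    return (within_node_area(vertex_latlon, node_latlon, node_radius_km)
            and within_reference_surface(vertex_latlon, ref1_latlon, ref2_latlon))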

By incorporating NodeIDs with testable relative locations and some extra restrictions on vertices, we create a shard per node, where each node can decide its own throughput based on geolocation and resources.

Deriving the financial ledger

This article, as stated before, is not about explaining the further consensus mechanisms needed to make a financial ledger work. However, it is important to note that if you have a location-aware data structure, the financial layer on top of it must be location-aware as well. This means that money gets a location assigned to it, almost like real cash ;)

Conclusion

All in all, this generally allows for interesting properties that current DLT systems don’t have.

  • A network split is not a problem; it is only potentially problematic at the actual physical location of the split.
  • It allows for geo-aware consensus mechanisms, like a token or coin that, by its consensus mechanism, is only available in a certain city or area: the general concept of physical geofencing.
  • It allows for true horizontal scalability: more machines = more throughput.
  • It adds physical location to data, just as real objects have.
  • And many more.

For those interested, a few months back I did a small simulation using a 1D circle (a lower-dimensional stand-in for a 2D planetary surface, for easier visualization), placed nodes evenly on this circle, and let them start issuing transactions according to the rules described above. The white vertices in the video are the ‘view’ of a single sharded node. You can view the video here:

For questions feel free to contact me on:

Telegram: @ovanwijk

