There are multiple classes of geospatial and routing algorithms within Graph Theory, each designed for different use-cases from finding the shortest-path between a single start and end location, to generating permutations of optimal routes covering all nodes in the network, to optimizing graphs through graph transformation, and many more. For example, one of the most famous problems in graph theory is the Four-Color-Problem, determining the possibility of coloring a map with just four colors where no region borders another region with colored the same color. In this article, we will focus on the shortest-path class of algorithms with a single start location as it is fundamental to the mapping navigation and ETA system we will build in a future architecture series article.
The shortest-path problem uses a graph to model the network of roads where each node typically represents an intersection and each edge a road between two intersections. The edges have weights representing either the time or distance between two nodes and include additional metadata such as one-way streets, turn restrictions, or speed limits. This data is typically persisted to a no-sql or graph database along with in-memory storage for quick lookups, though we’ll dive into this more in the section below on Geospatial Indexing. There are several well-known shortest-path algorithms to traverse a graph and evaluate the shortest most efficient route between two points. At the most basic one may try a simple breadth-first-search (BFS) but this is far too inefficient at scale. The first algorithm we discuss is similar to BFS but uses the graph weights to determine which node to evaluate next while skipping nodes already visited.
Dijkstra’s Algorithm
One of the most famous shortest-path routing algorithms and one which many subsequent algorithms are based off of is Dijkstra’s algorithm. Dijkstra’s algorithm was invented in the late 1950’s and has since been iterated on substantially. It is not restricted to a single destination node and can be used to find the shortest path to all nodes in a graph, but we will focus on a single destination as that is our use case. Dijkstra’s Algorithm starts with tracking the set of nodes which have not been visited and marking all nodes but the start node as unvisited with an infinite tentative distance. Multiple iterations are then run, where at each iteration the algorithm considers each unvisited node adjacent to the current node and updates each which will have a lower total tentative distance from the current node than its current tentative distance. Once all unvisited adjacent nodes have been covered, the current node is marked as visited and removed from the unvisited set of nodes. Finally, the unvisited node reachable from any of the already visited nodes with the lowest total tentative distance is set as the new current node; this is often implemented as a min-priority queue. Once a node is marked as visited the node is not checked again; this works as we always visit the reachable unvisited node with the smallest total tentative distance first so any visits after must be of a longer distance. At the end of the iteration, if the destination node has been marked visited (or really once the destination node has the shortest tentative distance among all unvisited nodes) the algorithm is complete, otherwise the next iteration begins from the newly marked current node.
A* Algorithm
The A* (“A-Star”) Algorithm can be thought of as an optimized Dijkstra’s Algorithm using heuristics focused on a single start-location and a single end-location while updating the graph edge weights in real-time. A key component of A* is its cost function which can be customized to the problem statement (e.g. travel distance, straight line distance, travel duration, fuel consumption). This cost function determines which node is visited next at each iteration and involves the sum of two inputs - the total distance from the start node and a heuristic calculating the estimated distance to the destination node. Two common heuristics are the Euclidean distance using the pythagorean theorem or the Manhattan distance using the sum of the latitude and longitude deltas. In fact, Dijkstra’s algorithm can be thought of as A* where the heuristic calculating the estimated distance to the destination node is always zero so only the total distance from the start node is considered as part of the cost function. The cost function and the included heuristics address the flaw in Dijkstra’s Algorithm where it wastes significant time exploring in directions unlikely to ever be the shortest path. A*, similar to Dijkstra’s algorithm, starts with a list of all the unvisited nodes in a priority queue and each node marked unvisited. Multiple iterations are then run, where at each iteration the algorithm follows similar logic to Dijkstra’s algorithm but using the cost function to determine the next node to visit “F = G + H”, where G is the total distance from the start node and H is the heuristic calculating the estimated distance to the destination node.
A* is a significant improvement over Dijkstra’s algorithm and can be further improved with optimizations like landmarks and preprocessing of the graph using contraction hierarchies. This is required as these shortest-path routing algorithms are too slow by themselves to be used in real-time at scale with customers, especially with longer routes. Contraction Hierarchies contract nodes together so queries only consider the “important” nodes and skip over the “unimportant” nodes. As distances between non-adjacent nodes are computed they are persisted to speed up subsequent queries. For example, when finding the shortest path between two cities on opposite coasts of the country we likely don’t need to consider taking every exit as a routing option and can contract these nodes into the distance between major city exits. The preprocessing step for generating the contraction hierarchy can be sped up by sharding the routing graph into parallelizable chunks per region to be computed. With A* and contraction hierarchies in-hand, we conclude with the ALT algorithm (A* Search, Landmarks, Triangle inequality) which combines the A* Algorithm with landmarks and triangle inequality in conjunction with contraction hierarchies for quick querying. The shortest distance from each node to key landmarks is precomputed; then during querying, the landmark information can be combined with triangle inequality to get a lower-bound distance between any two nodes in the graph. Together these optimizations result in an efficient shortest-path routing algorithm with the ability to evaluate queries orders of magnitude faster, in only hundreds of milliseconds.
Keep an eye out for the next upcoming system architecture post on designing a Mapping, Navigation, and ETA system, buildng on the foundation in this article.