Insights

Over-squashing, Bottlenecks, and Graph Ricci curvature

and

Friday, 17 June 2022

Over-squashing is a common plight of Graph Neural Networks occurring when message passing fails to propagate information efficiently on the graph. In this post, we discuss how this phenomenon can be understood and remedied through the concept of Ricci curvature, borrowed from the field of differential geometry.

This post was co-authored with Jake Topping and is based on the paper J. Topping, F. Di Giovanni et al., Understanding over-squashing and bottlenecks on graphs via curvature (2022) ICLR (outstanding paper honorable mention), a collaboration between Twitter Cortex and the University of Oxford.

This post is unavailable

This post is unavailable.

The majority of Graph Neural Network (GNN) architectures are based on the message-passing paradigm, in which information is propagated between the nodes of the graph along its edges. Traditionally, the input graph is used both as part of the data along with the node features, as well as the computational structure for information propagation. However, recent works showed that certain graphs in certain situations tend to be “unfriendly” for message passing [1].

More specifically, Message Passing Neural Networks (MPNNs) [2] tend to perform poorly when the learned task requires long-range dependencies (in other words, the output of an MPNN depends on representations of distant nodes interacting with each other) and at the same time, the structure of the graph results in exponentially many long-range neighboring nodes (in other words, the “receptive field” of a node grows exponentially with the radius of the neighborhood). In such situations, messages coming from non-adjacent nodes need to be propagated and compressed into fixed-size vectors, causing a phenomenon of information over-squashing [3]. The structural characteristics of the graph responsible for over-squashing are referred to as “bottlenecks”. While extensively observed empirically, the theoretical understanding of these phenomena is currently rather limited.

Characterizing Over-squashing

In a recent paper [4], we introduced tools to study over-squashing and bottlenecks by a characterization of the local geometric properties of the graph. Suppose we have a multi-layer MPNN. We can calculate the sensitivity of the hidden features computed by the MPNN to the input features at a remote node. Given a node p and another node s that is r-distant from it [5], the Jacobian ∂hₚ⁽ʳ⁺¹⁾/∂xₛ tells us how a change in the input feature xₛ will be propagated by the MPNN and affect the (r+1)st layer output hₚ⁽ʳ⁺¹⁾. Small absolute values of this Jacobian are indicative of poor information propagation, or over-squashing [6].

This post is unavailable

This post is unavailable.

Over-squashing can be defined as the lack of sensitivity of the output of an MPNN at node p to the input features at an r-distant node s, quantified by the Jacobian |∂hₚ⁽ʳ⁺¹⁾/∂xₛ|. A tree is an example of a pathological graph where this phenomenon is particularly acute.

The Jacobian can be bounded by a term expressed through the powers of the normalized adjacency matrix of the graph [7]. However, besides indicating that somehow the graph structure – a “bottleneck” – is to blame for the over-squashing, such an expression is not very illuminating. In order to better understand the nature of bottlenecks, we need to take a closer look at the fine-grain geometric properties of the graph.

Curvature

Intuitively, we expect very different behavior in graphs that look like grids, where the receptive field of a node grows polynomially, as is the case with convolutional neural networks, and graphs that look like trees, with exponentially growing receptive fields. In differential geometry, a natural object that allows us to distinguish different geometries is the Ricci curvature, named after the early differential geometer Gregorio Ricci-Curbastro [8]. Roughly speaking [9], it determines the “geodesic dispersion”, namely, whether geodesics starting at nearby points with the “same” velocity remain parallel (as in Euclidean space in which the curvature is zero), converge (spherical space with positive curvature), or diverge (hyperbolic space with negative curvature).

This post is unavailable

This post is unavailable.

Ricci curvature of a manifold can be characterized by “geodesic dispersion”, namely, whether two parallel geodesics shot from nearby points converge (this happens in positively-curved spaces locally resembling a sphere), remain parallel (flat or Euclidean case), or diverge (negative curvature giving rise to hyperbolic geometry).

The Ricci curvature of a manifold is one of the most fundamental geometric objects studied in differential geometry as well as in geometric theories of gravitation. In general relativity, the Ricci curvature appears in Einstein’s field equation relating the curvature of spacetime to energy and momentum. It has also been extensively explored as a way to deform metric structures through a family of geometric partial differential equations introduced by Richard Hamilton in the 1980s under the name of “Ricci flow” [10]. This method gained notoriety even outside pure mathematics when two decades later Grigori Perelman completed Hamilton's program in his celebrated proof of the Poincaré Conjecture [11].

As an analogy of geodesic dispersion on graphs, consider an edge p~q and two edges starting at p and q, respectively. In a discrete spherical geometry, the edges would meet at another node s to form a triangle (a 3-clique). In a discrete Euclidean geometry, the edges would stay parallel and form a rectangle (4-cycle) based at p~q (orthogonal grid). Finally, in a discrete hyperbolic geometry, the mutual distance of the edge endpoints would grow compared to that of p and q, which is the case of a tree.

We defined a new notion of edge-based curvature named “Balanced Forman curvature” [12] and denoted by Ric(p,q) for an edge p~q, based on counting the aforementioned local structures, triangles, rectangles, and outgoing edges. Our discrete curvature reproduces the geodesic dispersion behavior in the continuous case:

This post is unavailable

This post is unavailable.

Our definition of discrete Ricci curvature on a graph emulates the “geodesic dispersion” in the continuous case by capturing local structures formed by edges, marked in bold black, emanating from nodes p and q, where p~q is an edge. In a discrete analogy of spherical geometry, these edges tend to form triangles, pqk in this example. In Euclidean geometry, they remain “parallel”, staying at the same 1-hop distance and forming 4-cycles. In hyperbolic geometry, the edges diverge. The distance between their endpoints increases, resulting in tree-like structures.

Negative Curvature is Responsible for Over-squashing

The main theoretical result of our paper is the bound on the Jacobian quantifying the sensitivity of the MPNN output to input features through the Balanced Forman curvature [13], thus relating the curvature of the graph to the over-squashing phenomenon. Our conclusion is that negatively curved edges are those causing the graph bottlenecks responsible for over-squashing.

One of the practical implications of our result is a conceptual method for curvature-based graph rewiring dubbed “Stochastic Discrete Ricci Flow.” The continuous Ricci flow evolves the metric of a Riemannian manifold proportionally to the Ricci curvature. Inspired by this approach and in light of our characterization of bottlenecks, we use the discrete Balanced Forman curvature to guide a rewiring on the graph where edges are added locally around the most negatively-curved edges p~q in order to reduce the bottleneck. Done at the pre-processing stage, such a rewiring makes the graph friendlier to message passing and improves the performance of Graph Neural Networks. At the same time, the “surgical” nature of our procedure makes sure that the new graph is not dramatically different from the original one.

This post is unavailable

This post is unavailable.

Top: a dumbbell-shaped Riemannian manifold with a negatively-curved bottleneck, where negative curvature is coded by cold colors, undergoing curvature-based metric evolution becomes “rounder” and less “bottlenecked”. Bottom: an analogous process of curvature-based graph rewiring, reducing the bottlenecks and making the graph friendlier for message passing.

Besides practical implications for understanding a common phenomenon encountered in GNNs, our work highlights the deep links between deep learning on graphs and differential geometry. We believe that methods and tools from this field of mathematics will lead to novel insights and better graph ML architectures.

---

[1] U. Alon and E. Yahav, On the bottleneck of graph neural networks and its practical implications (2020) arXiv:2006.05205.

[2] J. Gilmer et al., Neural message passing for quantum chemistry (2017) ICML.

[3] Over-squashing phenomena were previously observed in traditional (RNN-based) seq2seq models . These models had to “squash” the entire input sequence into a fixed-size vector. However, while in RNNs operating on 1D grids the receptive field of a node grows linearly, in GNNs this growth is typically exponential and thus of much greater concern.

[4] J. Topping, F. Di Giovanni, et al., Understanding over-squashing and bottlenecks on graphs via curvature (2022) ICLR.

[5] In other words, the shortest path, or geodesic, between p and s consists of r hops.

[6] Similar sensitivity analysis was done by Xu et al., Representation learning on graphs with Jumping Knowledge Networks (2018) ICML, but to the best of our knowledge, we are the first to apply it to over-squashing.

[7] Equation 2 in our paper [4].

[8] G. Ricci, Direzioni e invarianti principali in una varietà qualunque (1903–1904) Atti R. Inst. Veneto 63 (2):1233–1239. Signed with a shortened surname “Ricci”, rather than “Ricci-Curbastro,” an ancient armigerous Italian family, it sometimes causes confusion as to whether these are two different people.

[9] Formally, Ricci curvature assigns to the tangent space at each point of a Riemannian manifold a symmetric bilinear form (tensor) that expresses how different the metric is from the flat (Euclidean) one. More generally, it is possible to measure the Ricci curvature of an arbitrary affine connection. Geodesic dispersion can also be related to volume growth, which is polynomial in Euclidean spaces and exponential in hyperbolic ones.

[10] R. S. Hamilton, Three-manifolds with positive Ricci curvature (1982) Journal of Differential Geometry 17(2):255–306 introduced the Ricci flow. Ricci flow is a partial differential equation of the form ∂g/∂t=−2R governing the evolution of the Riemannian metric tensor g of the manifold proportionally to the Ricci curvature tensor R that bears structurally similar to the diffusion equation.

[11] The proof appeared in a series of arXiv papers, G. Perelman, The entropy formula for the Ricci flow and its geometric applications (2002) arXiv:0211159, G. Perelman, Ricci flow with surgery on three-manifolds (2003) arXiv:0303109, and G. Perelman, Finite extinction time for the solutions to the Ricci flow on certain three-manifolds (2003) arXiv:0307245. Perelman used “Ricci flow with surgery” to prove a more general result known as the Thurston Geometrization Conjecture, of which the Poincaré Conjecture is a special case.

[12] The definition of the Balanced Forman curvature is given in equation 3 in our paper [4]. There exist several graph analogies of the Ricci curvature, among which the constructions introduced by R. Forman, Bochner’s method for cell complexes and combinatorial Ricci curvature (2003) Discrete and Computational Geometry 29:323–374 and Y. Ollivier, Ricci curvature of metric spaces (2007) Comptes Rendus Mathématique 345(11):643–646, referred to as the Forman- and Ollivier curvatures, respectively. The Ollivier curvature has a richer theory due to its relation to optimal transport but at the same time is expensive to compute. The advantage of the Forman curvature is in its combinatorial definition and lower computational complexity. Its disadvantage is in its bias towards negative curvature. For example, the curvature of a grid according to Forman’s definition is negative. For this reason, we introduce a new version of graph Ricci curvature similar in spirit to Forman’s construction but accounting for 4-cycle contributions , hence the name “Balanced Forman” . We show that our curvature lower-bounds the Ollivier one (Theorem 2). A consequence of this result, together with known facts about the Ollivier curvature, is that in positively-curved graphs the volume grows polynomially (Corollary 3).

[13] Theorem 4 in our paper [4].

---

We are grateful to Davide Eynard for proofreading and insightful comments.

This post is unavailable

This post is unavailable.