Improving Graph Generation by Restricting Graph Bandwidth Nathaniel Diamant 1 Alex M. Tseng 1 Kangway V. Chuang 1 Tommaso Biancalani 1 Gabriele Scalia 1 Abstract tive modeling has recently been proven capable of learning both global and fine-grained structural properties, along with their complex interdependencies (Guo & Zhao, 2023). These results suggest many promising applications for deep graph generative modeling, which include the biomedical and pharmaceutical domains, where essential objects such as molecules, gene networks, and cell-level tissue organization can be represented as graphs (Li et al., 2022). Despite these promises, several open challenges remain. Deep graph generative modeling has proven capable of learning the distribution of complex, multiscale structures characterizing real-world graphs. However, one of the main limitations of existing methods is their large output space, which limits generation scalability and hinders accurate modeling of the underlying distribution. To overcome these limitations, we propose a novel approach that significantly reduces the output space of existing graph generative models. Specifically, starting from the observation that many real-world graphs have low graph bandwidth, we restrict graph bandwidth during training and generation. Our strategy improves both generation scalability and quality without increasing architectural complexity or reducing expressiveness. Our approach is compatible with existing graph generative methods, and we describe its application to both autoregressive and one-shot models. We extensively validate our strategy on synthetic and real datasets, including molecular graphs. Our experiments show that, in addition to improving generation efficiency, our approach consistently improves generation quality and reconstruction accuracy. The implementation is made available1 . Applications in these domains require modeling graphs with a high number of nodes N that leads to a large output space, where the number of possible edges is in O(N 2 ). At the same time, many real-world graphs are sparse, and they are characterized by a small number of semantically rich connections which need to be accurately modeled (e.g., a small subset of chemical bonds can confer radically different properties in a molecular graph). Recent research has focused on accurately learning complex dependencies leveraging generative models such as variational autoencoders (VAEs) (Grover et al., 2019), recurrent neural networks (RNNs) (You et al., 2018), normalizingflow models (Shi et al., 2020) and score-based models (Niu et al., 2020). A general limitation of these approaches is their high time complexity and output space O(N 2 ). This limits both their scalability and makes accurate prediction of sparse connections challenging, as the ratio of observed to possible edges can be extremely small. 1. Introduction For this reason, more tractable methods have been proposed, which leverage different architectures (Li et al., 2018; Liu et al., 2018; Dai et al., 2020), generate coarse-grained motifs (Jin et al., 2018; Liao et al., 2019), or change the output representation, such as transforming graphs to sequences (Goyal et al., 2020) or using domain-specific encodings such as molecular SMILES (Gómez-Bombarelli et al., 2018). Although these approaches are more scalable, they trade-off efficiency with model complexity, expressiveness, or have limited applicability because of domain-specific choices. Learning the underlying distribution of graphs for generative purposes finds important applications in diverse fields, where objects can be naturally described through their structures (Hamilton et al., 2017). Computational approaches to capture graph statistical properties have long been established (Albert & Barabási, 2002), whereas deep genera1 Department of Artificial Intelligence and Machine Learning, Research and Early Development, Genentech, USA. Correspondence to: Nathaniel Diamant , Gabriele Scalia . To overcome the limitations of existing approaches, we propose a novel strategy: BwR (Bandwidth-Restricted) graph generation. BwR leverages a permutation of the adjacency matrix to restrict the graph bandwidth, reducing both the time complexity and the output space from O(N 2 ) to O(N · φ̂), where φ̂ is the estimated bandwidth. As will Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). 1 https://github.com/Genentech/ bandwidth-graph-generation 1 Bandwidth-Restricted Graph Generation 2. Related Work be shown, φ̂ is low for many classes of real-world graphs, such as those characterizing the biomedical domain. During training, BwR leverages bandwidth-restricted adjacency matrices; during sampling, it constrains the generation within a bandwidth-restricted space, which reduces both time complexity and output space without losing expressiveness. 2.1. Graph Generative Models Graph generative models seek to learn the underlying distribution of graph datasets. A model is trained on a set of observed graphs G = {G1 , . . . , GS } ∼ p (G), where each graph Gi = (Vi , Ei ) is defined by its set of nodes Vi = (v1 , . . . , vN ) and edges Ei ⊆ Vi × Vi . The model learns the distribution pmodel (G) ≈ p (G) that allows sampling new graphs. Broadly, graph generative models can be categorized as autoregressive (You et al., 2018; Shi et al., 2020; Goyal et al., 2020; Li et al., 2018; Liu et al., 2018) or one-shot (Kipf & Welling, 2016; Ma et al., 2018; Grover et al., 2019; Niu et al., 2020) models. A common way to represent the graph topology is through its adjacency matrix Aπ ∈ N × N . The adjacency matrix depends on a specific node ordering π, defined as a bijective function π : V → [1, N ]. This strategy brings two key advantages. First, reducing time complexity improves generation scalability (i.e., time and memory requirements). Second, reducing the output space simplifies learning the underlying data distribution, while also making the ratio of observed to possible edges less imbalanced, with a positive impact on generation quality. Importantly, BwR can be easily integrated into virtually all existing graph generative methods, as it is orthogonal to the generative model architecture. Therefore, it does not increase model complexity nor add domain-specific constraints. We describe our strategy in the context of three recent models: an autoregressive model based on GraphRNN (You et al., 2018), a one-shot VAE-based model based on Graphite (Grover et al., 2019), and a one-shot score-based model based on EDP-GNN (Niu et al., 2020). Autoregressive models treat graph generation as a sequential decision process, factorizing pmodel (G) into the joint probability of its components (e.g., nodes or motifs). Nodebased autoregressive methods (You et al., 2018; Shi et al., 2020) generate each node and its edges in a predefined order, conditioning each new node on the already-generated graph, with time complexity and output space in O(N 2 ). Solutions have been proposed to separately output actions corresponding to the number of new edges and their identities for each new node, reducing time complexity (Li et al., 2018; Liu et al., 2018). However, these methods increase model complexity without reducing the output space. In contrast, BwR reduces both the time complexity and output space of existing node-based autoregressive models to be in O(N · φ̂). This reduction is achieved without losing expressiveness and without increasing model complexity. We experimentally validate BwR by evaluating the reconstruction accuracy and generation quality on both synthetic and real datasets. Additionally, we analyze memory and time improvements. We include molecular datasets— spanning both small molecules and larger peptides—to evaluate the advantages of BwR for de novo molecular generation. Our results show that BwR consistently achieves superior or competitive generative performance to the standard baselines, at a fraction of the time/space complexity. Contributions. We summarize our main contributions as follows: • We show that many real-world classes of graphs, such as molecules, have low graph bandwidth. One-shot models sample the whole topology of the graph from a latent distribution. Many one-shot models use permutation-invariant functions to output the graph topology. This class of generative models can be split into two main categories: adjacency-matrix-based models (Ma et al., 2018; Niu et al., 2020) directly output Aπ , while node-embeddingbased models (Kipf & Welling, 2016; Grover et al., 2019) sample node embeddings from the latent distribution and compute Aπ based on pairwise relationships between them. In both cases, these methods have time complexity and output space in O(N 2 ), as they need to consider edges between every pair of nodes. In contrast, BwR allows reducing both the time complexity and output space to be in O(N · φ̂), with no loss of expressiveness. • Building on this property, we propose BwR (Bandwidth-Restricted) graph generation, a novel strategy for graph generation that constrains the bandwidth, drastically reducing time complexity and output space. • BwR can be applied to virtually all existing graph generation methods. We describe its application to an autoregressive method, GraphRNN (You et al., 2018); a one-shot VAE-based method, Graphite (Grover et al., 2019); and a one-shot score-based method, EDP-GNN (Niu et al., 2020). • We validate BwR on both synthetic and real-world datasets, with a focus on real-world molecular datasets. Our results show that, in addition to being more efficient in terms of time and memory used, BwR consistently improves reconstruction accuracy and generative quality across datasets and methods. Last, we notice how our approach is orthogonal and compatible with other methods proposed to increase GNN efficiency—such as graph partitioning (Jia et al., 2020)—as it is largely independent of the generative method. For the 2 Bandwidth-Restricted Graph Generation Standard generation 𝑨 ∈ 𝑵×𝑵 Training Training Sampling Canonical ordering (e.g. BFS, DFS) output ∈ 𝑵×𝑵 Time complexity: 𝓞 𝑵𝟐 Output space: 𝓞 𝑵𝟐 𝒇𝜽 Autoregressive or one-shot Bandwidth-restricted generation 𝝋 𝑨 ∈ 𝑵×𝑵 Training Cuthill–McKee algorithm 𝑨𝐨𝐩𝐭 ∈ 𝑵×𝝋 Flatten Sampling 𝒇𝜽 Ordering reducing max 𝑑𝑖𝑠𝑡(𝑣' , 𝑣( ) output ∈ 𝑵×𝝋 Time complexity: 𝓞 𝝋𝑵 Output space: 𝓞 𝝋𝑵 Autoregressive or one-shot ("!,"")∈ℰ Output space No generation Figure 1. Overview of our strategy and comparison with standard generation methods. (Top) In a standard graph generative method, the model is trained on adjacency matrices A derived through a specific canonical ordering on the graph (e.g., BFS or DFS). During  sampling, the model needs to predict edges from a space in O N 2 . (Bottom) Our bandwidth-restricted graph generation leverages the Cuthill-Mckee (C-M) ordering (Cuthill & McKee, 1969) to reduce the bandwidth φ(A) of each adjacency matrix. The C-M order results in an adjacency matrix that is a band matrix, with all-zero entries outside a φ(A)-sized band. A is re-parameterized as Aopt ∈ N × φ(A), which is used for training. During sampling, only edges in an N × φ(A) space (yellow) are considered as candidates, thus drastically reducing the output space to O (N × φ(A)). present work, we focus on node-based generation, but our approach can be extended to coarser motif-based generation (Liao et al., 2019). clear advantages over depth-first-search (DFS), even though the latter is more often used to define a canonical order for molecular structures (Mercado et al., 2021). A more fundamental issue raised by non-unique graph representations is that choosing a specific ordering π does not rigorously correspond to maximum-likelihood estimation (MLE), thus preventing exact likelihood evaluation (Liao et al., 2019; Chen et al., 2021). Additionally, it can make the reconstruction loss ambiguous. As discussed by Liao et al. (2019), training on random orderings in a specific canonical family (e.g., BFS) optimizes a variational lower bound of the true log-likelihood tighter than any single arbitrary ordering. For autoregressive models, a tighter lower bound is derived by Chen et al. (2021) by performing approximate posterior inference over the node ordering. For one-shot models, Winter et al. (2021) addresses reconstruction ambiguity by training a permuter model to reorder generated graphs alongside a standard encoder/decoder architecture. We notice how these works are orthogonal with respect to our contribution. Indeed, bandwidth-optimized graphs define a canonical family of node orderings, and existing methods could be used to improve likelihood estimation. This integration will be investigated in future work. 2.2. Graph Ordering A unique challenge of graph generative models is that the set of all possible orderings leads to up to N ! different adjacency matrices for the same graph (Liao et al., 2019). Given that our method imposes a specific node ordering, it is related to other works that have investigated ordering in graph generation. Ordering is crucial in autoregressive models. For any particular ordering π, if the index distance |π (vi ) − π (vj )| between two connected nodes (vi , vj ) ∈ E is high, the model is required to handle long-term dependencies. This issue can be addressed by choosing a specific canonical ordering. For example, GraphRNN (You et al., 2018) is trained using random breadth-first-search (BFS) node orderings for each graph, such that new nodes can only be connected to existing nodes at the frontier of the BFS. GraphRNN and its comparison with our bandwidth-restricted version are further discussed in Section 4.2. It has been shown that the choice of node ordering impacts graph generation for specific applications. For example, in the context of autoregressive molecular generation, BFS had 3 Bandwidth-Restricted Graph Generation for very restricted classes of graphs, and approximate superpolynomial solutions have been proposed for general graphs (Feige, 2000; Cygan & Pilipczuk, 2010). However, in practice, efficient heuristic approaches work well for general graphs and are routinely leveraged in applications. Molecule BFS ordering C-M ordering Edge One such heuristic is the Cuthill-McKee (C-M) algorithm (1969), which is based on a variation of BFS search, has linear time complexity O(|E|) (Chan & George, 1980), and has been extensively studied from a theoretical perspective (Turner, 1986). Extensions of this algorithm and other heuristics have been proposed that improve efficiency and/or theoretical guarantees, though results are dataset dependent (Gonzaga de Oliveira et al., 2018). For the present work, we leverage the C-M algorithm initialized to start at a pseudo peripheral node (Gibbs et al., 1976). However, our proposed strategy is independent of the choice of the bandwidth-minimization algorithm, and other approaches— potentially even learned—will be explored in future work. Interestingly, as discussed in Section 2, BFS has been shown to be superior to DFS for autoregressive molecular generation. Given that C-M can be seen as a special case of BFS, using C-M allows us to retain the benefits of BFS, while also further reducing max(vi ,vj )∈E distπ (vi , vj ). We define the bandwidth of the adjacency matrix derived with the C-M algorithm as φ̂. 𝝋(𝑨) = 𝟑 𝝋(𝑨) = 𝟖 𝝋-constrained space Figure 2. Bandwidth of two adjacency matrices of the same molecular graph. The left adjacency matrix is given by a BFS ordering, the right adjacency matrix is given by a Cuthill-McKee ordering. 3. Graph Bandwidth Background We start providing a definition of bandwidth through the graph bandwidth problem (Unger, 1998). Intuitively, the graph bandwidth problem can be seen as placing the nodes of a graph on a line such that the “length” of the longest edge in the graph is minimized. The bandwidth of the graph is then simply the length of the longest edge. A key observation that motivates the present work is that many real-world graphs, such as those characterizing the biomedical domain, have low φ̂. Table 1 shows the empirical bandwidth computed with the C-M algorithm for a diverse set of chemical and biological datasets (see Appendix C for more details on the datasets). For each dataset, a savings factor summarizes the space reduction. The savings factor is calculated as the ratio of the number of edges in the bandwidth-restricted graph to the number of edges in the complete graph (i.e. the size of a non-bandwidth-restricted adjacency matrix). As shown, a bandwidth reparameterization leads to a savings factor > 3 for most small-molecule datasets (e.g., 3.8 ± 1.0 for ZINC250k). The savings factor is even higher for datasets including larger molecules (e.g., 13.5±6.5 for the Peptides-func dataset). Significant savings are also confirmed on non-molecular datasets, such as brain networks (KKI, OHSU). The high savings factor of molecular datasets is due to the existence of an intrinsic upper bound on the bandwidth of molecule-like graphs, which we derive and further discuss in Appendix E. Given a graph G = (V, E) on N vertices, each ordering π : V → [1, N ] defines a graph linearization. We define the distance between nodes vi and vj in the ordering π as distπ (vi , vj ) = |π (vi ) − π (vj )|. The bandwidth of the ordering π is defined as the maximum stretch of any edge on the linearization, i.e. φ (π) = max(vi ,vj )∈E distπ (vi , vj ). The bandwidth of a graph G is the minimum bandwidth across all possible orderings, i.e.: φ (G) = min π:V→[1,N ] φ (π) . (1) Importantly, an ordering that minimizes φ results in an adjacency matrix where all non-zero entries lie in a narrow band along the diagonal (hence the term “bandwidth”). This enables a compact matrix representation of N × φ instead of N 2 (Figure 1) and drastically reduces the space required to represent the graph when φ ≪ N . The maximum size of the off-diagonal band of the adjacency matrix is known as the matrix’s bandwidth, which we denote as φ(A). Figure 2 shows the bandwidth of two adjacency matrices of a molecular graph. We note that for an ordering π, φ(π) = φ(Aπ ). The C-M algorithm consistently reduces φ(π) compared to the orderings routinely used in graph generation (BFS, DFS, etc.). Figure 2 compares the adjacency matrices of a molecular graph given by BFS and C-M, and their respective bandwidths. The bandwidth decreases from 8 to 3, which translates into a two-fold reduction of the output space. Additional examples of adjacency matrices for molecular The graph bandwidth problem has been shown to be NP-hard for general graphs (Papadimitriou, 1976), and also for simpler classes of graphs such as trees and even caterpillar trees (Monien, 1986). Exact polynomial solutions exist 4 Bandwidth-Restricted Graph Generation 4.1. Restricting Graph Bandwidth Table 1. Number of nodes, C-M bandwidth, and savings factor of a set of chemical and biological datasets. The first section of the table consists of small molecules, the second section consists of large molecules, and the third section consists of brain networks. See Appendix C for more details on the datasets. Average and standard deviation calculated across graphs for each dataset. Dataset N φ̂ Savings Factor ZINC250k AIDS Alchemy MCF-7 MOLT-4 Mutagenicity NCI1 NCI-H23 OVCAR-8 PC-3 QM9 SF-295 SN12C Tox21 UACC257 Yeast 23.2 ± 4.5 14.0 ± 10.4 10.1 ± 0.7 26.1 ± 10.7 25.8 ± 10.3 28.5 ± 14.1 29.3 ± 13.4 25.8 ± 10.3 25.8 ± 10.3 26.1 ± 10.6 18.0 ± 2.9 25.8 ± 10.3 25.8 ± 10.3 16.8 ± 10.1 25.8 ± 10.3 21.1 ± 8.8 3.3 ± 0.8 3.1 ± 1.1 2.8 ± 0.7 4.1 ± 1.3 4.0 ± 1.3 5.5 ± 2.2 4.3 ± 1.5 4.0 ± 1.3 4.0 ± 1.3 4.1 ± 1.4 5.3 ± 1.5 4.0 ± 1.3 4.0 ± 1.3 3.0 ± 1.2 4.0 ± 1.3 3.6 ± 1.2 3.9 ± 1.0 2.4 ± 1.0 2.1 ± 0.4 3.5 ± 1.2 3.5 ± 1.2 2.9 ± 1.1 3.7 ± 1.3 3.5 ± 1.2 3.5 ± 1.2 3.5 ± 1.2 2.0 ± 0.4 3.5 ± 1.2 3.5 ± 1.2 3.0 ± 1.3 3.5 ± 1.2 3.3 ± 1.1 Peptides-func DD ENZYMES 150.9 ± 84.5 277.7 ± 217.3 31.7 ± 13.3 5.7 ± 2.6 36.0 ± 20.7 5.4 ± 2.2 13.5 ± 6.5 4.1 ± 1.4 3.3 ± 1.2 KKI OHSU 27.0 ± 19.5 82.0 ± 43.7 7.2 ± 5.1 20.0 ± 13.2 2.2 ± 0.6 2.4 ± 0.7 Starting from the observation that many real-world graphs have low bandwidth (Section 3), we propose to reduce the output space of a graph generative model from N × N to N × φ̂2 . As the goal of graph generation is to learn a distribution of the data p (G), we assume that, given φ̂data the maximum empirical bandwidth on the training set Gtrain , we can reduce the output space of the generative model to N × φ̂data without losing expressiveness (i.e., without losing the ability to generate in-distribution graphs). In general, we achieve this reduction through two complementary mechanisms: (1) imposing a bandwidth-reducing ordering and (2) restricting the output space to a datasetspecific bandwidth φ̂data or a graph-specific bandwidth φ̂ (Figure 1, bottom). During training, for each graph we use the precomputed ad∗ jacency matrix Aπ with the ordering π ∗ computed through the previously introduced C-M algorithm. We remark that, given the linear time complexity of the C-M algorithm, our preprocessing does not introduce any additional overhead compared to using standard orderings such as BFS/DFS. Notably, just restricting training examples to a specific canonical ordering does not guarantee that such an ordering will be respected during generation (Chen et al., 2021), and does not provide any advantage in time complexity or output space reduction, since the complete adjacency matrix needs to be generated. Therefore, we also re-parameterize the ∗ adjacency matrix as Aπopt ∈ N × φ̂data (or N × φ̂ for each graph), dropping the zeros outside the bandwidth. During sampling, only edges belonging to the reduced matrix are considered as candidates, thus constraining the output space and reducing the time complexity, without losing expressiveness. graphs given by BFS, DFS, RDKit (Landrum, 2006), and C-M orderings are presented in Figure 4 (Appendix). As shown, the C-M ordering consistently leads to the lowest φ. Overall, the C-M algorithm allows reducing the bandwidths of all the considered datasets. For example, 95% of the molecules in ZINC250k have φ̂ ≤ 4 (Figure 5, Appendix). Below, we discuss the details of our strategy applied to different models and highlight model-specific choices and advantages. To the best of our knowledge, the concept of graph bandwidth has been used in the context of GNNs only by Balog et al. (2019), with the explicit purpose of improving dense implementations on custom hardware. Their work does not leverage the bandwidth in the model itself and does not target graph generation. 4.2. Autoregressive Graph Generation Autoregressive graph generation approaches recursively generate the edges of a single node (You et al., 2018) or a group of nodes, (Liao et al., 2019) conditioned on the previously generated subgraph. This can be viewed as generating the adjacency matrix row-by-row or block-by-block. We will focus on GraphRNN (You et al., 2018), although a similar approach could be applied to virtually any autoregressive graph generative model. 4. Bandwidth-Restricted Graph Generation In this section we describe BwR, our novel approach for improving graph generation. As BwR can be combined with different existing generative methods, we first describe its general strategy and principles in Section 4.1. Then, we detail the strategies for bandwidth-restricted graph generation applied to an autoregressive model based on GraphRNN (You et al., 2018) in Section 4.2, and two distinct one-shot models based on Graphite (Grover et al., 2019) and EDPGNN (Niu et al., 2020) in Section 4.3. In GraphRNN, the probability of node vi being connected to node vj , with π(vj ) < π(vi ), is parameterized by an 2 Technically, as the adjacency matrix is symmetric, only a triangular matrix is generated both in the standard formulation and in our bandwidth-restricted re-parameterization. 5 Bandwidth-Restricted Graph Generation output function fθ applied to the hidden state of an RNN over the previous rows of the adjacency matrix: (2) Adjacency-matrix-based models directly output the edgeprobability graph. We focus on both categories, considering a model based on Graphite (Grover et al., 2019) and a model based on EDP-GNN (Niu et al., 2020). p[(vi , vj ) ∈ E] = fθ (RNNϕ (Aπ1:i−1 ))j , where θ and ϕ are parameters learned to maximize the likelihood of the data. In particular, we focus on the GraphRNN-S variant (additional model details are provided in Appendix B.3). Node-embedding-based one-shot generation. Graphite (Grover et al., 2019) is a VAE-based approach with one latent vector zi ∈ Rd per node with standard Gaussian prior. The encoder network uses graph convolution layers (Kipf & Welling, 2017) on the input adjacency matrix A ∈ RN ×N and node features X ∈ RN ×k to derive the mean and standard deviation of the variational marginals: We note that a d-unit fθ can generate graphs with at most bandwidth d. Potentially, we could set d to be the maximum number of nodes N of any graph we want to generate, which would ensure maximum expressiveness. Instead, we set d equal to the maximum bandwidth of any Aπ we would want to generate, greatly increasing efficiency and reducing training signal sparsity for low bandwidth graphs. We find the order π for each graph G by using the C-M algorithm and set d = φ̂data as the maximum bandwidth across all the Aπ in the training data. Compared to generating N rows of length d = N , we generate O(N/φ̂data )-times fewer edges (corresponding to the savings factor, Table 1). µ, σ = GNNϕ (A, X), (2) where µ, σ ∈ RN ×d . We will focus on the case without additional node features, so X can be a positional encoding dependent on the node ordering. The decoder network outputs edge probabilities: p[(vi , vj ) ∈ E] = GNNθ (Â, X, Z)i,j , (3) where  is fully-connected and Z are samples from N (µ, σ). Further architectural details are described in Appendix B.4. In our bandwidth-restricted version of Graphite, we constrain the bandwidth of Â. For a graph G, we get the bandwidth φ(Aπ ) for π found using the C-M algorithm and build a new edge set: In the original GraphRNN model, You et al. (2018) used a random BFS ordering during training, and set d = Mdata < N , with Mdata defined as the maximum number of nodes in the BFS queue at any time in the training data. Mdata is estimated empirically by sampling 100,000 BFS orderings per dataset and setting Mdata to be roughly the 99.9 percentile of the empirical distribution of maximum queue sizes. Critically, we observe that Mdata derived in You et al. (2018) approximates the maximum bandwidth across all possible BFS orderings. In contrast, φ̂data is derived by explicitly reducing the bandwidth. Notably, our approach allows significantly shrinking d, thus directly reducing the output space and the time complexity. For example, on the DD dataset (Dobson & Doig, 2003) of 918 protein graphs, the authors set d = Mdata = 230, whereas we derive d = φ̂data = 122, a nearly two-fold reduction. Ê = {(i, j) | 1 ≤ |i − j| ≤ φ(Aπ )}, (4) where 1 ≤ i, j ≤ N . With the new edge set, the decoder message passing steps are reduced from complexity O(N 2 ) to O(N · φ̂). During generation, the standard Graphite model samples the number of nodes from the empirical distribution of the training data. In our bandwidth-restricted version, both the number of nodes and the bandwidth φdata are sampled from the empirical distribution of the training data, thus further reducing the output space. Adjacency-matrix-based one-shot generation. EDP-GNN (Niu et al., 2020) is a score-based model in which a GNN is trained to match the score function of the data distribution. Intuitively, EDP-GNN learns to denoise the upper-right triangle of the adjacency matrix. Our modification denoises the bandwidth-restricted upper-right triangle, thus reducing the modeled output space (Figure 6, Appendix). In EDPGNN, a multi-layer perceptron (MLP) predicts the final edge features from intermediate edge features  built by message passing layers and edge-update layers: 4.3. One-shot Graph Generation One-shot graph generative models sample the entire graph topology simultaneously. Typically the topology is represented by putting edge probabilities on each pair of nodes, resulting in a complete graph with a probability placed on each edge by the generative model (Simonovsky & Komodakis, 2018; Grover et al., 2019; Kipf & Welling, 2016). We will call this graph the edge-probability graph. Our key insight is that the complete edge probability graph can be replaced with a bandwidth-restricted edge-probability graph (Figure 1, bottom). sθ (A)i,j = MLP(Âij ). (5) To constrain the bandwidth, we restrict (i, j) ∈ Ê (Eq. 4) in the final MLP and in all of the message passing and edge-update layers using the same approach described for Graphite. This drastically reduces the time complexity and output space. Further details are included in Appendix B.5. One-shot models can be divided into two main categories: (1) Node-embedding-based models sample node embeddings from the latent distribution and compute the edgeprobability graph based on pairwise relationships, and 6 Bandwidth-Restricted Graph Generation 5. Experiments p-value ≥ 0.05). We experimentally validate our method on both synthetic and real graphs, comparing bandwidth-constrained architectures and non-constrained baselines. 5.3. Generic Graph Generation Datasets. We evaluated our models on six standard graph generation datasets, including both synthetic and real-world graphs: (1) Community2, 1500 two-community graphs generated using an Erdös–Renyi model with 60–160 nodes; (2) Planar, 1500 random planar graphs with 64 nodes made using Delaunay triangulation; (3) Grid2d, 66 distinct twodimensional grids with side lengths between 10 and 20; (4) DD, 918 protein graphs with amino acids as nodes (Dobson & Doig, 2003); (5) Enzymes, 556 protein graphs of enzyme tertiary structures from the BRENDA database (Schomburg et al., 2004); and (6) Proteins, 904 protein graphs from the Protein Data Bank (Dobson & Doig, 2003). Additional details on the datasets are provided in Appendix C.2. 5.1. Metrics We measure two goals of the models: closeness of the sample distribution to the true distribution of graphs, and reconstruction quality. To measure the quality of the sample distribution, we use two metrics. First, consistently with the recent literature, we use the Maximum Mean Discrepancy (MMD) between sampled and test graph statistics (You et al., 2018). The graph statistics we compare are degree, clustering coefficient, node orbit counts following You et al. (2018), and spectrum following Liao et al. (2019). To track overall sample quality, we compute the mean MMD2 across all four MMD metrics3 . Additionally, as recommended by recent work on evaluation metrics for graph generative modeling (Thompson et al., 2022), we use a precision–recall metric (Kynkäänniemi et al., 2019) which accounts for both sample quality and variety. We report the harmonic mean of precision and recall as F1-PR. Results. Table 2 summarizes the results for generic graph generation. As shown, our approach consistently achieves superior or competitive performance across datasets and methods as measured by mean MMD, F1-PR and/or AUPRC. GraphRNN+BwR and Graphite+BwR outperform their standard counterparts in five out of six datasets, with comparable performance (i.e., not statistically significant or mixed) on the sixth. BwR improves EDP-GNN generation quality in four out of six datasets, with comparable performance on the others. Notably, the dataset with the largest graphs, DD (mean 277.7 nodes per graph), could not be trained using standard EDP-GNN due to out-of-memory issues. In contrast, we are able to train our EDP-GNN+BwR, thus highlighting its lower memory complexity. We observe a statistically significant improvement in AUPRC given by BwR in almost all the experiments, thus showing how our approach improves the accuracy of the reconstructed graphs even when bulk statistical distributions are comparable. Still, BwR results in an improvement in the quality of the sample distribution (MMD and/or F1-PR) in the majority of the experiments, with comparable performance in the others. We observe a significant improvement in log-likelihood compared to standard models across all the experiments. Reconstruction quality is measured by comparing the reconstructed graph and the original test graph using the Area Under the Precision–Recall Curve (AUPRC). For GraphRNN, we compare the reconstructed row and the original row; for Graphite, we compare the reconstructed graph and the original graph4 . We chose to use AUPRC because it does not depend on true negatives and because it is well suited to class imbalance (Davis & Goadrich, 2006). Additionally, we report the estimated log-likelihood for test graphs. Although log-likelihood has known limitations for evaluating (graph) generative models (Thompson et al., 2022), it is well suited (in combination with the other metrics) to demonstrating the benefits of BwR ’s reduced output space. 5.2. Experimental Setup We show examples of reconstructed adjacency matrices in Figure 3. In the PROTEINS example (left), the standard model must predict edges in a much larger and imbalanced output space. In the Grid2d example (right), the standard model predicts edges far outside the bandwidth, which is inherently impossible with BwR. Overall, results show that BwR consistently improves or matches generation quality at a fraction of the time/memory cost. We compare our bandwidth-restricted versions (+BwR) of the models based on GraphRNN (You et al., 2018), Graphite (Grover et al., 2019), and EDP-GNN (Niu et al., 2020) to their standard baselines (i.e. without BwR). Our models are described in Section 4 and additional details are provided in Appendix B. Each model architecture is individually hyperoptimized (details in Appendix B.2). All the experiments are repeated five times and significance is determined by Welch’s t-test (models are considered comparable when 5.4. Molecular Generation We note that previous works usually report MMD2 but they indicate MMD in the results. 4 Reconstruction quality is less easily definable for score-based models. Therefore, we omit AUPRC in EDP-GNN evaluation. 3 Datasets. We evaluate our models on real-world molecular datasets to show the benefits of BwR for de novo molecular generation. We consider two datasets: 7 Bandwidth-Restricted Graph Generation Table 2. Graph generation results on generic datasets. Bold indicates best results compared to the other model of the same type and dataset. Significance was determined by Welch’s t-test with five replicates per model. Models are considered comparable when p ≥ 0.05. MMD2 denotes average MMD2 across four metrics (degree, cluster, orbit, spectra). Due to space limitations, we provide all the individual metrics in Table 5 (Appendix). OOM denotes out-of-memory issues. Hyphen (–) denotes not applicable metric/model. COMMUNITY 2 GRID 2 D PLANAR ↓ MMD2 ↑ F1-PR ↑ ℓℓ ↑ AUPRC ↓ MMD2 ↑ F1-PR ↑ ℓℓ ↑ AUPRC ↓ MMD2 ↑ F1-PR ↑ ℓℓ ↑ AUPRC GraphRNN Standard BwR [ours] 0.028 0.024 0.648 0.729 -1990 -1940 0.376 0.421 0.265 0.233 0.029 0.139 -400.0 -309.0 0.545 0.652 0.323 0.240 0.250 0.909 -540.0 -36.4 0.642 0.999 Graphite Standard BwR [ours] 0.055 0.047 0.507 0.423 -3560 -3450 0.706 0.747 0.361 0.468 0.010 0.023 -595.0 -554.0 0.975 0.990 0.649 0.528 0.051 0.666 -2320 -358.0 0.453 0.915 EDP-GNN Standard BwR [ours] 0.030 0.040 0.621 0.581 -211000 -168000 - 0.459 0.474 0.172 0.450 -57800 -36400 - 0.645 0.590 0.548 0.528 -622000 -97900 - ↓ MMD2 ↑ F1-PR ↑ ℓℓ ↑ AUPRC ↓ MMD2 ↑ F1-PR ↑ ℓℓ ↑ AUPRC ↓ MMD2 ↑ F1-PR ↑ ℓℓ ↑ AUPRC GraphRNN Standard BwR [ours] 0.174 0.234 0.426 0.579 -1460 -1400 0.299 0.318 0.023 0.016 0.948 0.963 -199.0 -177.0 0.445 0.602 0.017 0.020 0.971 0.964 -173.0 -117.0 0.533 0.716 Graphite Standard BwR [ours] 0.368 0.273 0.003 0.008 -4360 -3430 0.804 0.840 0.107 0.038 0.459 0.916 -204.0 -164.0 0.925 0.950 0.153 0.037 0.523 0.889 -220.0 -166.0 0.950 0.933 EDP-GNN Standard BwR [ours] OOM 0.299 OOM 0.106 OOM -269000 OOM - 0.092 0.027 0.726 0.914 -18000 -7320 - 0.077 0.024 0.782 0.944 -23400 -7590 - DD ENZYMES PROTEINS Table 3. Graph generation results on molecular datasets. Bold indicates best results compared to the other model of the same type and dataset. Significance was determined by Welch’s t-test with five replicates per model. Models are considered comparable when p ≥ 0.05. MMD2 denotes average MMD2 across four metrics (degree, cluster, orbit, spectra). Due to space limitations, we provide all the individual metrics in Table 5 (Appendix). Hyphen (–) denotes not applicable metric/model. ZINC 250 K 2 PEPTIDES - FUNC 2 ↓ MMD ↑ F1-PR ↑ ℓℓ ↑ AUPRC ↓ MMD ↑ F1-PR ↑ ℓℓ ↑ AUPRC GraphRNN Standard BwR [ours] 0.038 0.029 0.803 0.900 -38.9 -31.0 0.668 0.783 0.031 0.033 0.477 0.526 -168.0 -112.0 0.809 0.903 Graphite Standard BwR [ours] 0.153 0.084 0.178 0.511 -57.0 -39.4 0.999 0.995 0.182 0.120 0.031 0.186 -742.0 -362.0 0.987 0.993 EDP-GNN Standard BwR [ours] 0.106 0.143 0.172 0.750 -7510 -3180 - 0.115 0.143 0.041 0.109 -634000 -32200 - (1) ZINC250k (Irwin et al., 2012) includes 249,455 drug-like small molecules extracted from the ZINC database, averaging 23.14 heavy atoms (nodes) each. (2) Peptides-func (Dwivedi et al., 2022) is a recently published dataset of peptide structures that includes 15,535 molecules, averaging 150.94 heavy atoms (nodes) each. Compared to common small-molecule benchmarks, this dataset includes significantly larger molecular graphs and functional motifs (amino acids). Therefore, it allows us to test the advantages of a reduced time complexity and output space given by our bandwidth-constrained generation. Additional datasets detailes are provided in Appendix C.3. line. Graphite+BwR leads to a significantly better generation quality with comparable or better reconstruction accuracy with respect to the standard baseline. Finally, EDP-GNN+BwR, achieves significantly improved generation quality on ZINC250k, and comparable results on Peptides-func. We observe a significant improvement in log-likelihood compared to standard models across all the experiments. We remark that, even when generation quality is comparable, our approach still significantly reduces time/memory complexity. Results. Table 3 shows the results on molecular graph generation. BwR achieves superior or competitive performance on both datasets across all methods. GraphRNN+BwR significantly improves reconstruction accuracy with a comparable generation quality with respect to the standard base- We evaluate whether the lower time complexity and output space reduction actually translate into reduced time and memory footprint. For this analysis, we consider all the datasets and models previously introduced, for a total of 24 experiments. We measure the following metrics: (1) Mem- 5.5. Memory and Time Efficiency 8 Bandwidth-Restricted Graph Generation Ground truth Graphite Ground truth Predicted (AUPRC = 0.82) Ground truth Predicted (AUPRC = 0.74) Graphite+BwR Predicted (AUPRC = 0.17) Ground truth Predicted (AUPRC = 0.89) Graphite+BwR Graphite PROTEINS GRID2D Figure 3. Comparison of Graphite reconstructions with and without BwR on samples from PROTEINS (left) and Grid2D (right). Table 4. Computational cost results. Bold indicates best results compared to the other model of the same type and dataset. Significance was determined by Welch’s t-test with five replicates per model. Models are considered comparable when p ≥ 0.05. OOM denotes out-of-memory issues. COMMUNITY 2 PEPTIDES - FUNC GRID 2 D DD ↓ sample (s) ↓ mem. (GB) ↓ sample (s) ↓ mem. (GB) ↓ sample (s) ↓ mem. (GB) ↓ sample (s) ↓ mem. (GB) GraphRNN Standard BwR [ours] 0.673 0.538 0.059 0.061 7.690 7.680 0.082 0.080 0.490 0.520 0.115 0.112 1.470 1.500 0.142 0.149 Graphite Standard BwR [ours] 1.460 1.050 0.805 0.553 3.520 0.268 1.740 0.170 6.400 0.842 3.270 0.447 10.90 2.380 5.870 1.510 EDP-GNN Standard BwR [ours] 12.50 8.410 1.600 1.080 69.80 4.880 6.280 0.547 50.80 6.550 6.730 0.837 OOM 19.90 OOM 2.620 6. Conclusion ory usage, as the maximum GPU utilization during a training batch; (2) Training time, as the average wall time to train on one batch; (3) Sample time, as the average wall time to sample 256 graphs from the generative model. We presented BwR, a novel approach to reduce the time/space complexity and output space of graph generative models. Leveraging the observation that many real-world graphs have low graph bandwidth, our method restricts the bandwidth of the adjacency matrices during training and generation. Our method is compatible with virtually all existing graph generative models, and we described its application to autoregressive, VAE-based, and score-based models. Our extensive results on both synthetic, biological, and chemical datasets showed that our strategy consistently achieves superior or comparable generation quality compared to the standard methods, while significantly reducing time/space complexity. Currently, our strategy leverages random Cuthill-McKee orderings to reduce the bandwidth. Future work will explore other—potentially even learned— bandwidth-minimizing orderings, while further theoretically studying the distribution of orderings induced by our approach. Future work will also extend our strategy to additional settings, such as conditional generation, larger graphs, and new state-of-the-art models. Table 4 shows memory usage and sample time for four datasets. The remaining results are included in Table 6 (Appendix). As shown, BwR significantly reduces memory usage across all datasets and all methods besides GraphRNN5 , up to a factor of 11x for larger graphs. Additionally, it improves sample time (up to a factor of 13x-14x) in 14 out of 24 experiments (with comparable results in the others) and training time in 8 out of 24 experiments (with comparable results in the others). Overall, BwR achieves a consistent reduction in computational costs. 5.6. Impact of Output Space Reduction to Performance We analyze the relationship between the savings factor, which summarizes the space reduction given by the C-M bandwidth reparameterization (Table 1), and the performance/computational improvement. Results are included in Appendix D.3. Results suggest that we can estimate the empirical improvement given by BwR for a specific dataset in advance, without actually training a model, by computing simple statistical properties of the dataset. Acknowledgements We thank the anonymous reviewers for their suggestions which have further strengthened the conclusions of the paper. We thank members of the AI/ML department in Genentech for helpful feedback and discussions. All authors are employees of Genentech, Inc. and shareholders of Roche. 5 We believe that GraphRNN has close to the same time/memory usage with and without BwR because most of the computation happens in the hidden layers of the GRU as opposed to the single linear readout layer which predicts the next row. 9 Bandwidth-Restricted Graph Generation References Eppstein, D. Isometric diamond subgraphs. In Graph Drawing: 16th International Symposium, GD 2008, Heraklion, Crete, Greece, September 21-24, 2008. Revised Papers 16, pp. 384–389. Springer, 2009. Albert, R. and Barabási, A.-L. Statistical mechanics of complex networks. Reviews of modern physics, 74(1):47, 2002. Feige, U. Coping with the NP-hardness of the graph bandwidth problem. In Scandinavian Workshop on Algorithm Theory, pp. 10–19. Springer, 2000. Balog, M., van Merriënboer, B., Moitra, S., Li, Y., and Tarlow, D. Fast training of sparse graph neural networks on dense hardware. arXiv preprint arXiv:1906.11786, 2019. Fey, M. and Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019. Biewald, L. Experiment tracking with weights and biases, 2020. URL https://www.wandb.com/. Software available from wandb.com. Gibbs, N. E., Poole, W. G., and Stockmeyer, P. K. An algorithm for reducing the bandwidth and profile of a sparse matrix. SIAM Journal on Numerical Analysis, 13 (2):236–250, 1976. Böttcher, J., Pruessmann, K. P., Taraz, A., and Würfl, A. Bandwidth, expansion, treewidth, separators and universality for bounded-degree graphs. European Journal of Combinatorics, 31(5):1217–1227, 2010. Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., and Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. ACS central science, 4(2):268–276, 2018. Chan, W.-M. and George, A. A linear time implementation of the reverse Cuthill-McKee algorithm. BIT Numerical Mathematics, 20(1):8–14, 1980. Chen, X., Han, X., Hu, J., Ruiz, F., and Liu, L. Order matters: probabilistic modeling of node sequence for graph generation. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 1630–1639. PMLR, 2021. Gonzaga de Oliveira, S. L., Bernardes, J. A., and Chagas, G. O. An evaluation of low-cost heuristics for matrix bandwidth and profile reductions. Computational and Applied Mathematics, 37(2):1412–1471, 2018. Cuthill, E. and McKee, J. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th national conference, pp. 157–172, 1969. Goyal, N., Jain, H. V., and Ranu, S. GraphGen: a scalable approach to domain-agnostic labeled graph generation. In Proceedings of The Web Conference 2020, pp. 1253– 1263, 2020. Cygan, M. and Pilipczuk, M. Exact and approximate bandwidth. Theoretical Computer Science, 411(40-42):3701– 3713, 2010. Grover, A., Zweig, A., and Ermon, S. Graphite: iterative generative modeling of graphs. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2434–2444. PMLR, 2019. Dai, H., Nazi, A., Li, Y., Dai, B., and Schuurmans, D. Scalable deep generative modeling for sparse graphs. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 2302–2312. PMLR, 2020. Guo, X. and Zhao, L. A systematic survey on deep generative models for graph generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5):5370– 5390, 2023. Davis, J. and Goadrich, M. The relationship between Precision-Recall and ROC curves. In International Conference on Machine Learning, pp. 233–240. Association for Computing Machinery, 2006. Hamilton, W. L., Ying, R., and Leskovec, J. Representation learning on graphs: Methods and applications. IEEE Data Eng. Bull., 40(3):52–74, 2017. Dobson, P. D. and Doig, A. J. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771–783, 2003. Hendrycks, D. and Gimpel, K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016. Dwivedi, V. P., Rampášek, L., Galkin, M., Parviz, A., Wolf, G., Luu, A. T., and Beaini, D. Long range graph benchmark. In Advances in Neural Information Processing Systems, volume 35, pp. 22326–22340. Curran Associates, Inc., 2022. Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pp. 6840–6851. Curran Associates, Inc., 2020. 10 Bandwidth-Restricted Graph Generation Hu, W., Liu, B., Gomes, J., Zitnik, M., Liang, P., Pande, V., and Leskovec, J. Strategies for pre-training graph neural networks. In International Conference on Learning Representations, 2020. Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017. Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. Irwin, J. J., Sterling, T., Mysinger, M. M., Bolstad, E. S., and Coleman, R. G. ZINC: a free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7):1757–1768, 2012. Ma, T., Chen, J., and Xiao, C. Constrained generation of semantically valid graphs via regularizing variational autoencoders. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. Jia, Z., Lin, S., Gao, M., Zaharia, M., and Aiken, A. Improving the accuracy, scalability, and performance of graph neural networks with ROC. Proceedings of Machine Learning and Systems, 2:187–198, 2020. Mercado, R., Bjerrum, E. J., and Engkvist, O. Exploring graph traversal algorithms in graph-based molecular generation. Journal of Chemical Information and Modeling, 62(9):2093–2100, 2021. Jin, W., Barzilay, R., and Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2323–2332. PMLR, 2018. Monien, B. The bandwidth minimization problem for caterpillars with hair length 3 is NP-complete. SIAM Journal on Algebraic Discrete Methods, 7(4):505–512, 1986. Morris, C., Kriege, N. M., Bause, F., Kersting, K., Mutzel, P., and Neumann, M. TUDataset: a collection of benchmark datasets for learning with graphs. In ICML 2020 Workshop Graph Representation Learning and Beyond, 2020. Kipf, T. N. and Welling, M. Variational graph auto-encoders. In Neural Information Processing Systems Workshop on Bayesian Deep Learning, 2016. Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017. Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 8162–8171. PMLR, 2021. Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., and Aila, T. Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. Niu, C., Song, Y., Song, J., Zhao, S., Grover, A., and Ermon, S. Permutation invariant graph generation via score-based generative modeling. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 4474–4484. PMLR, 2020. Landrum, G. RDKit: Open-source cheminformatics, 2006. URL https://rdkit.org. Li, M. M., Huang, K., and Zitnik, M. Graph representation learning in biomedicine and healthcare. Nature biomedical engineering, 6(12):1353–1369, 2022. Otachi, Y. and Suda, R. Bandwidth and pathwidth of threedimensional grids. Discrete Mathematics, 311(10-11): 881–887, 2011. Li, Y., Vinyals, O., Dyer, C., Pascanu, R., and Battaglia, P. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018. Papadimitriou, C. H. The NP-completeness of the bandwidth minimization problem. Computing, 16(3):263–270, 1976. Liao, R., Li, Y., Song, Y., Wang, S., Hamilton, W., Duvenaud, D. K., Urtasun, R., and Zemel, R. Efficient graph generation with graph recurrent attention networks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: an imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. Liu, Q., Allamanis, M., Brockschmidt, M., and Gaunt, A. Constrained graph variational autoencoders for molecule design. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. 11 Bandwidth-Restricted Graph Generation Schomburg, I., Chang, A., Ebeling, C., Gremse, M., Heldt, C., Huhn, G., and Schomburg, D. BRENDA, the enzyme database: updates and major new developments. Nucleic acids research, 32(suppl 1):D431–D433, 2004. Shi, C., Xu, M., Zhu, Z., Zhang, W., Zhang, M., and Tang, J. GraphAF: a flow-based autoregressive model for molecular graph generation. In International Conference on Learning Representations, 2020. Simonovsky, M. and Komodakis, N. Graphvae: Towards generation of small graphs using variational autoencoders. In Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., and Maglogiannis, I. (eds.), Artificial Neural Networks and Machine Learning – ICANN 2018, pp. 412– 422. Springer International Publishing, 2018. Thompson, R., Knyazev, B., Ghalebi, E., Kim, J., and Taylor, G. W. On evaluation metrics for graph generative models. In International Conference on Learning Representations, 2022. Turner, J. S. On the probable performance of heuristics for bandwidth minimization. SIAM journal on computing, 15(2):561–580, 1986. Unger, W. The complexity of the approximation of the bandwidth problem. In Proceedings 39th Annual Symposium on Foundations of Computer Science (Cat. No. 98CB36280), pp. 82–91. IEEE, 1998. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. Wester, M. J., Pollock, S. N., Coutsias, E. A., Allu, T. K., Muresan, S., and Oprea, T. I. Scaffold topologies. 2. analysis of chemical databases. Journal of Chemical Information and Modeling, 48(7):1311–1324, 2008. Winter, R., Noe, F., and Clevert, D.-A. Permutationinvariant variational autoencoder for graph-level representation learning. In Advances in Neural Information Processing Systems, volume 34, pp. 9559–9573. Curran Associates, Inc., 2021. You, J., Ying, R., Ren, X., Hamilton, W., and Leskovec, J. GraphRNN: generating realistic graphs with deep autoregressive models. In International Conference on Machine Learning, pp. 5708–5717. PMLR, 2018. 12 Bandwidth-Restricted Graph Generation A. Bandwidth Visualization In Figure 4, we show the adjacency matrix A and the bandwidth φ(A) for a random set of 10 molecules from ZINC250k. The RDKit ordering is computed using the canonical atom ranking provided by the RDKit library. For the BFS, DFS and C-M, we randomly sample 100 orderings and plot the one with the highest φ (this approximates the output space needed to correctly model the graph for each ordering). In Figure 5, we show the distribution of bandwidth of ZINC250k adjacency matrices using the C-M algorithm and the canonical SMILES order. B. Model Details Each model was re-implemented and somewhat modified to facilitate the pairwise comparison with and without the BwR modification. B.1. Optimization All models were trained for 100 epochs of 30 training batches and nine validation batches. The batch size was fixed at 32. The AdamW optimizer (Loshchilov & Hutter, 2019) was used with a cosine annealed learning rate (Loshchilov & Hutter, 2017). The initial learning rate was hyperoptimized (Appendix B.2), and the weight decay parameter was either set to zero or hyperoptimized. GraphRNN and Graphite were trained using binary cross entropy to measure reconstruction accuracy. EDP-GNN was trained using mean squared error loss. B.2. Hyperoptimization Hyperparameters were separately optimized for each combination of node order, model, and dataset. The hyperparameters for MMD and AUPRC results were chosen to minimize mean validation MMD2 in the case of GraphRNN and EDP-GNN, and MMD2 − AUPRC in the case of Graphite. The hyperparameters for log-likelihood and F1-PR results were chosen to maximize F1-PR. All hyperoptimizations used 20 outer-loop steps of the Weights and Biases (Biewald, 2020) Bayesian hyperoptimizer. For the specific hyperparameter ranges, see the model details below. B.3. GraphRNN GraphRNN (You et al., 2018) is an autoregressive model for generating adjacency matrices. We used the GraphRNN-S variant which uses an MLP to predict a whole row at once of the adjacency matrix from the RNN’s hidden state. GraphRNN data pre-processing. GraphRNN was trained using teacher forcing to autogressively predict the next row of the re-parameterized adjacency matrices Aopt ∈ N × φ̂data (Figure 1, bottom). In order to prepare the data, a row of zeros was prepended and appended to each Aopt to serve as the initial inputs and final outputs for the model. Next, a column with an indicator for whether the row was the first or last was prepended. This resulted in training data of the form:  1 0   .. .  0 1 0 Aopt 0,0 .. . Aopt N,0 0 ··· ··· .. . ··· ··· 0   Aopt 0,φ̂data  .. . Aopt N,φ̂data  .   (6) 0 In order to train the model, the data were placed into PackedSequence objects in PyTorch (Paszke et al., 2019), enabling batched training with variable sequence lengths. GraphRNN architecture. The architecture is a two layer MLP followed by a four layer GRU followed by two layer MLP: GraphRNN layers 1. Linear(φ̂data + 1 7→ 128) 2. BatchNorm1D 13 Bandwidth-Restricted Graph Generation 3. ReLU 4. Linear(128 7→ 128) 5. GRU(4 layers, 128 7→ 128) 6. Linear(128 7→ 128) 7. BatchNorm1D 8. ReLU 9. Linear(128 7→ φ̂data + 1) GraphRNN hyperoptimization. The GraphRNN hyperparameter ranges were: • Learning rate ∼ Log Uniform [10−4 , 10−2 ] • Weight decay ∼ Log Uniform [10−5 , 10−1 ]. GraphRNN sampling. Rows of the data matrix constructed in Eq. 6 were sampled according to the logits ℓ output at the last layer of the model adjusted by a temperature parameter, τ . That yielded row i edge probabilities:  p[(vi , vj ) ∈ E] ∼ Bernoulli  1 ℓi,j+1 . τ (7) τ was selected for each model to minimize mean MMD on the validation data. The sampling process was halted when an indicator was sampled. B.4. Graphite Graphite (Grover et al., 2019) is a VAE adapted for graph data with one set of latent variables per node. Graphite predicts edge probabilities using a pairwise kernel between node representations at the end of the decoder. We kept the general design while making a few architectural changes for simplicity and performance. Graphite data pre-processing. For every graph G = (V, E) the node order was selected using either BFS (standard variation) or C-M (BwR variation). In the C-M case, we also found the bandwidth resulting from the order φ̂. We constructed node features X ∈ RN ×16 using transformer-style positional encodings (Vaswani et al., 2017). In the original work, Grover et al. (2019) used one-hot positional encodings when there were no node features. For each graph, an edge set was constructed for the decoder model. In the BFS case, the edge set corresponding to a fully connected graph was used. In the C-M case, the edge set for a graph with bandwidth φ̂ was defined as in Eq. 4. Graphite architecture. The architecture was implemented using PyTorch Geometric (Fey & Lenssen, 2019). With respect to the original implementation, we used GINE layers (Hu et al., 2020) rather than graph convolutions, and GELU activation (Hendrycks & Gimpel, 2016) rather than ReLU. Each GINE layer contained a two-layer MLP with the following architecture: GINE MLP layers with hidden dimension h 1. Linear(h 7→ 2h) 2. BatchNorm1D 3. GELU 4. Linear(2h 7→ h) 5. BatchNorm1D 6. GELU. 14 Bandwidth-Restricted Graph Generation We made use of a stack of GINE layers, which we refer to as GINEStack (Algorithm 1). The Graphite encoder was a GINEStack with h = 32. The variational marginals µ, σ (Eq. 2) were 32-dimensional and computed using a single linear layer each from the output of the encoder. The decoder also employed a GINEStack with h = 32 with a few modifications (Algorithm 2). The most consequential change in our experiments was replacing the final edge probability layer. Rather than a dot product as in the original implementation, we observed better performance and lower variance with a two-layer MLP that takes as input concatenated pairs of node embeddings (Algorithm 2, final three lines). Algorithm 1 GINEStack Inputs: node features X ∈ RN ×d , edge list E, edge features E ∈ R|E|×k hidden dimension h Outputs: updated node features X̂ X = Linear(d 7→ h)(X) X = BatchNorm1D(X) X = GELU(X) X0 = GINE(X, E, E) X1 = GINE(X0 , E, E) X2 = GINE(X1 , E, E) X̂ = [X0 |X1 |X2 ] Algorithm 2 Graphite decoder. ◦ denotes function composition. Inputs: node features X ∈ RN ×16 , embeddings Z ∈ RN ×32 , edge list E, edge features E ∈ R|E|×k Outputs: edge probability logits ℓ ∈ R|E| P = Linear(16 7→ 32)(X) P = GELU(BatchNorm1D)(P )) X0 = Z + P X1 = GINEStack(X0 , E, E) X2 = [P |X1 ] X3 = GELU ◦ BatchNorm1D ◦ Linear(112 7→ 32)(X2 ) K = [X|X3 ] ℓ1i,j = Linear(32 7→ 1) ◦ GELU ◦ Linear(96 7→ 32)[Ki |Kj ] ℓ2i,j = Linear(32 7→ 1) ◦ GELU ◦ Linear(96 7→ 32)[Kj |Ki ] # j, i order to preserve symmetry ℓi,j = ℓ1i,j + ℓ2i,j Graphite loss function. The loss was the standard VAE loss function with a hyperoptimized weight β on the KL divergence term:  N  β X 1 1 2 2 µ + σ − log[σ] − L(E, ℓ, µ, σ) = + BCE(ℓ, E), (8) |E| i=1 2 i |E| where ℓ denotes the model predicted edge logits and BCE is the binary cross entropy. Graphite hyperoptimization. The Graphite hyperparameter ranges were: • Learning rate ∼ Log Uniform [10−4 , 10−2 ] • KL-divergence weight (β) ∼ Log Uniform [1, 10−5 ]. Graphite sampling. The latent variables Z were sampled independently from the standard normal distribution. Edge probabilities were then sampled from the decoder’s edge probabilities (Eq. 3, Algorithm 2). The number of nodes and bandwidths used to construct the decoder’s message passing graph were sampled from the empirical distribution of the training data. 15 Bandwidth-Restricted Graph Generation B.5. EDP-GNN EDP-GNN (Niu et al., 2020) is a permutation invariant score-based generative model for graphs. Niu et al. (2020) used annealed Langevin dynamics to sample from their model and used a variance schedule with six time steps. We switched to the DDPM (Ho et al., 2020) framework, which we found led to more reliable results and faster sampling in preliminary experiments. EDP-GNN data pre-processing. For every graph G = (V, E) a node order was selected using either BFS (standard variation) or C-M (BwR variation). In the C-M case, we also found the bandwidth φ̂. We then constructed an edge set E ′ of edges not in the original graph, which the model was trained to distinguish from the real edges. In the BFS case, these were all of the edges not in E, i.e., E ′ = {(i, j)∀i ̸= j} − E. In the C-M case, these were the edges included in a graph ′ with φ(G) = φ̂ (Eq. 4), and not in E, that is, E ′ = Ê − E. Edge features E ∈ R|E|+|E | were constructed to encode whether ′ ′ each edge is in E or E with 1 to indicate ∈ E and -1 to indicate ∈ E . Node features X ∈ RN ×16 were constructed using transformer-style positional encodings (Vaswani et al., 2017). Time embedding features T used for time conditioning were constructed using 128-dimensional positional encodings. EDP-GNN diffusion hyperparameters. We used a cosine variance schedule (Nichol & Dhariwal, 2021) with 200 steps. The effect of this schedule on a restricted adjacency matrix of a planar graph is shown in Figure 6. We used the noise predicting parameterization ϵθ introduced by Ho et al. (2020). EDP-GNN architecture. Our implementation of EDP-GNN was built using the previously introduced GINEStack (Algorithm 1) followed by a two layer MLP operating on edge features pairs of node representations to predict the sampled noise ϵ. The resultant architecture for the molecular datasets is shown in Algorithm 3. We used a node embedding size of 64 for the generic datasets instead of 128. Algorithm 3 Modified EDP-GNN architecture. ◦ denotes function composition. Inputs: node features X ∈ RN ×16 , edge list E∪ = E ∪ E ′ , edge features E ∈ R|E∪ | , time embeddings T ∈ R128 ′ Outputs: noise predictions ϵθ ∈ R|E∪E | P = GELU ◦ Linear(16 7→ 128)(X) T0 = GELU ◦ Linear(128 7→ 128)(T ) X0 = T0 + P X1 = GINEStack(X0 , E∪ , E) X2 = [P |T0 |X1 ] X3 = GELU ◦ BatchNorm1D ◦ Linear(768 7→ 128)(X2 ) K = [X|X3 ] E 0 = GELU ◦ Linear(1 7→ 128)(E) 0 ϵ1i,j = Linear(128 7→ 1) ◦ GELU ◦ Linear(384 7→ 128)[Ki |Kj |Ei,j ] 2 0 ϵi,j = Linear(128 7→ 1) ◦ GELU ◦ Linear(384 7→ 128)[Kj |Ki |Ej,i ] # j, i order to preserve symmetry ϵθ,i,j = ϵ1i,j + ϵ2i,j EDP-GNN hyperoptimization. EDP-GNN sampling. The learning rate was hyperoptimized with a distribution ∼ Log Uniform [10−4 , 10−2 ]. We used the DDPM sampling algorithm (Ho et al., 2020). B.6. Likelihood Evaluation To compute the likelihood of test set graphs as an evaluation metric, we use the following strategies. For GraphRNN, we use the auto-regressive log-likelihood. For Graphite, we use the variational ELBO as a lower bound. For EDP-GNN, we use the ELBO for diffusion models, as shown in (Ho et al., 2020). C. Datasets All datasets were filtered so that there was one connected component per example. 16 Bandwidth-Restricted Graph Generation C.1. Example Datasets for Table 1 All datasets, except Peptides-func, are available through the TUDataset collection (Morris et al., 2020). Peptides-func is available in the Long Range Graph Benchmark (Dwivedi et al., 2022). C.2. Generic Datasets Community2. For each graph the number of nodes N was sampled uniformly between 60 and 160. Each community was then generated using an Erdos-Renyi model with edge probability 0.3. Then edges between the two communities were sampled with probability 0.05. Finally, the largest connected component of the resultant graph was selected. Planar. For each graph 64 2D node coordinates were sampled uniformly between zero and one. A Delaunay triangulation was performed on these coordinates. Two nodes were considered adjacency if they shared a vertex in the triangulation. Grid2d. All unique pairs of side lengths between 10 and 20 were enumerated. For each side length pair, a 2D grid graph was generated. Since this yielded only 66 graphs, each graph in the training and validation sets were included five times with different random BFS and C-M orders each time. DD. The DD graphs were filtered so that each had between 100 and 500 nodes as in (You et al., 2018), going from 1178 to 918 graphs. Enzymes. The enzymes graphs were filtered so that each had 10 ≤ N ≤ 125 going from 600 to 556 graphs. Proteins. The proteins graphs were filtered so that each had 10 ≤ N ≤ 125 going from 1113 to 904 graphs. C.3. Molecular Datasets ZINC250k. No filtering was required. Peptides-func. Removing graphs with more than one connected component filtered 15535 graphs down to 15375. D. Additional Experimental Results D.1. MMD Metrics We include the individual MMDs results (summarized by the mean MMD in the main text) in Table 5. D.2. Computational Metrics We include computational metrics for all datasets in Table 6. D.3. Impact of Output Space Reduction to Performance Improvements We study whether the theoretical improvement in space/time complexity given by BwR translates well into an empirical improvement in generation quality and computational complexity. To do this, we analyze the relationship between the savings factor, which summarizes the space reduction given by the C-M bandwidth reparameterization (Table 1), and the performance improvement, measured as the ratio between standard models’ metrics and their +BwR extensions. Interestingly, we observe a high correlation (Spearman-r of 0.90 for Graphite, 0.89 for EDP-GNN, and 0.62 for GraphRNN) between the savings ratio and the log-likelihood improvement across the 8 datasets. Additionally, we observe a high correlation (Spearman-r of 0.97 for Graphite and 0.96 for EDP-GNN) between the savings ratio and the improvement in GPU memory usage (Figure 7) across the 8 datasets. This analysis suggests that (1) the theoretical improvement correlates with the empirical advantage, and (2) we can get an estimate of the expected empirical improvement for a specific dataset in advance, without actually training a model. 17 Bandwidth-Restricted Graph Generation E. Why Molecular Graphs Have Low Bandwidth? In this section, we discuss the bandwidth of molecular graphs, i.e. graphs whose nodes and edges describe atoms and bonds, respectively. As shown in Table 1, all the considered molecular datasets have low average empirical bandwidth (in particular, the first section of Table 1 and Peptides-func correspond to molecular graphs). Additionally, the average bandwidth increases slowly as the average number of nodes N in the datasets increases (for the considered datasets, the average bandwidth φ̂ varies between 2.8 ± 0.7 and 5.7 ± 2.6, while the average number of nodes N varies between 10.1 ± 0.7 and 150.9 ± 84.5). Furthermore, given that the empirical bandwidth is estimated through a heuristic algorithm, part of the increase in φ̂ for higher N can be explained by a slightly reduced algorithm efficacy for larger graphs. All these empirical observations motivate more theoretical research on why all molecular graphs seem to have restricted bandwidth. To investigate this question, we have derived several upper bounds on the bandwidth of molecular graphs. These bounds show that, indeed, the inherent properties of molecular graphs confer low bandwidth. These results further strengthen the universal validity of BwR, beyond the datasets considered in this paper. Several bounds to the graph bandwidth for molecular graphs are discussed in the following: Molecules have planar graphs and bounded max degree. As highlighted in the cheminformatics literature, molecules with non-planar graphs are extremely rare (Wester et al., 2008). In practice, all molecules included in drug-like libraries (e.g., ZINC250k dataset) have planar graphs. Additionally, because of chemical bonding rules, all molecular graphs have a bounded maximum degree (4-6, depending on the atom types in the molecule). of these  The combination  N two properties (bounded degree and planarity) guarantees sub-linear bandwidth, in O log (N ) , with ∆ being the ∆ maximum degree, as proved by Böttcher et al. (2010). Molecules as combination of motifs. A less rigorous and more intuitive explanation comes from the fact that synthesizable molecules tend to consist of small components connected in a few (typically, one to three) places. Under the (simplified) assumption that a molecular graph is a string of connected components, the graph’s bandwidth is upper-bounded by the size of the largest component. This can be seen by considering the resulting block-diagonal adjacency matrix: each component corresponds to a block in the adjacency matrix, and the bandwidth of the graph corresponds to the bandwidth of the largest block/component. For example, in a linear molecule (like a long alkane chain), the bandwidth is one (independent of the length of the chain) because each carbon (individual node) is only directly connected to its neighbors. Crystal structure as molecular upper bound. Another chemistry-inspired upper bound can be derived by considering regular chemical graphs with degree four. If we allow all 4-regular graphs, thisis insufficient to constrain the bandwidth  N since random 4-regular graphs are expander graphs and have bandwidth in O log(N ) . Instead, we consider the most densely packed form of carbon, diamond, which is formed by a 3D lattice. The graph of diamond’s structure is a subgraph of the 3D grid graph (Eppstein, 2009). Interestingly, the 3D grid graph has bandwidth that scales with the √  square root of the number of nodes (Otachi & Suda, 2011), providing a tighter upper bound in O N . Further research on theoretical upper bounds on the bandwidth of graphs occurring in common domains, such as molecules and other biological objects, will be the subject of future work. 18 Bandwidth-Restricted Graph Generation Table 5. Graph generation results with individual MMD values. Bold indicates best results compared to the other model of the same type and dataset. Significance was determined by Welch’s t-test with five replicates per model. Models are considered comparable when p ≥ 0.05. Graph statistics (degree, cluster, orbit, spectra) are reported as MMD2 . Mean computed across individual statistics for each model/dataset. OOM denotes out-of-memory issues. Hyphen (–) denotes not applicable metric/model. COMMUNITY 2 GRID 2 D PLANAR ↓ Deg. ↓ Cluster ↓ Orbit ↓ Spectra ↓ Deg. ↓ Cluster ↓ Orbit ↓ Spectra ↓ Deg. ↓ Cluster ↓ Orbit ↓ Spectra GraphRNN Standard BwR [ours] 0.016 0.034 0.046 0.041 0.024 0.017 0.024 0.006 0.056 0.060 0.309 0.311 0.617 0.481 0.080 0.079 0.403 0.037 0.000 0.797 0.714 0.066 0.174 0.061 Graphite Standard BwR [ours] 0.146 0.114 0.047 0.047 0.015 0.015 0.011 0.013 0.289 0.311 0.304 0.323 0.749 1.110 0.104 0.128 0.500 0.069 1.300 1.970 0.601 0.038 0.198 0.035 EDP-GNN Standard BwR [ours] 0.037 0.066 0.056 0.048 0.020 0.032 0.008 0.014 0.229 0.179 0.400 0.377 1.120 1.250 0.086 0.092 0.428 0.415 1.380 1.450 0.661 0.392 0.113 0.101 ↓ Deg. ↓ Cluster ↓ Orbit ↓ Spectra ↓ Deg. ↓ Cluster ↓ Orbit ↓ Spectra ↓ Deg. ↓ Cluster ↓ Orbit ↓ Spectra GraphRNN Standard BwR [ours] 0.066 0.092 0.155 0.229 0.410 0.489 0.065 0.125 0.011 0.003 0.045 0.039 0.021 0.010 0.018 0.014 0.004 0.012 0.040 0.045 0.015 0.011 0.010 0.013 Graphite Standard BwR [ours] 0.316 0.239 0.316 0.245 0.656 0.492 0.186 0.118 0.204 0.042 0.046 0.039 0.099 0.052 0.078 0.018 0.293 0.037 0.096 0.043 0.111 0.052 0.114 0.016 EDP-GNN Standard BwR [ours] OOM 0.184 OOM 0.208 OOM 0.738 OOM 0.065 0.098 0.027 0.069 0.033 0.159 0.036 0.042 0.013 0.119 0.027 0.064 0.038 0.082 0.021 0.045 0.012 ↓ Deg. ↓ Cluster ↓ Orbit ↓ Spectra ↓ Deg. ↓ Cluster ↓ Orbit ↓ Spectra GraphRNN Standard BwR [ours] 0.025 0.011 0.045 0.044 0.012 0.005 0.071 0.057 0.009 0.008 0.004 0.001 0.000 0.001 0.108 0.123 Graphite Standard BwR [ours] 0.049 0.009 0.516 0.307 0.005 0.002 0.044 0.019 0.169 0.056 0.235 0.216 0.030 0.011 0.293 0.198 EDP-GNN Standard BwR [ours] 0.174 0.015 0.055 0.528 0.024 0.004 0.170 0.023 0.159 0.050 0.041 0.371 0.047 0.007 0.213 0.144 DD ENZYMES ZINC 250 K PROTEINS PEPTIDES - FUNC Table 6. Computational cost results. Bold indicates best results compared to the other model of the same type and dataset. Significance was determined by Welch’s t-test with five replicates per model. Models are considered comparable when p ≥ 0.05. COMMUNITY 2 GRID 2 D PLANAR DD ↓ sample (s) ↓ mem. (GB) ↓ batch (s) ↓ sample (s) ↓ mem. (GB) ↓ batch (s) ↓ sample (s) ↓ mem. (GB) ↓ batch (s) ↓ sample (s) ↓ mem. (GB) ↓ batch (s) GraphRNN Standard BwR [ours] 0.673 0.538 0.059 0.061 0.008 0.009 0.095 0.094 0.036 0.036 0.007 0.007 0.490 0.520 0.115 0.112 0.016 0.018 1.470 1.500 0.142 0.149 0.015 0.016 Graphite Standard BwR [ours] 1.460 1.050 0.805 0.553 0.012 0.010 0.513 0.242 0.266 0.135 0.010 0.009 6.400 0.842 3.270 0.447 0.022 0.012 10.90 2.380 5.870 1.510 0.028 0.013 EDP-GNN Standard BwR [ours] 12.50 8.410 1.600 1.080 0.006 0.005 4.300 2.260 0.526 0.255 0.004 0.003 50.80 6.550 6.730 0.837 0.016 0.004 OOM 19.90 OOM 2.620 OOM 0.007 ENZYMES PROTEINS ZINC 250 K PEPTIDES - FUNC ↓ sample (s) ↓ mem. (GB) ↓ batch (s) ↓ sample (s) ↓ mem. (GB) ↓ batch (s) ↓ sample (s) ↓ mem. (GB) ↓ batch (s) ↓ sample (s) ↓ mem. (GB) ↓ batch (s) GraphRNN Standard BwR [ours] 0.129 0.239 0.019 0.019 0.011 0.014 0.333 0.533 0.023 0.021 0.007 0.009 0.080 0.080 0.014 0.014 0.004 0.004 7.690 7.680 0.082 0.080 0.014 0.013 Graphite Standard BwR [ours] 0.450 0.050 0.085 0.035 0.014 0.010 0.227 0.056 0.105 0.031 0.008 0.010 0.071 0.083 0.040 0.016 0.008 0.010 3.520 0.268 1.740 0.170 0.015 0.008 EDP-GNN Standard BwR [ours] 1.590 0.981 0.178 0.066 0.004 0.003 2.040 1.040 0.211 0.056 0.003 0.003 1.620 0.809 0.145 0.050 0.003 0.004 69.80 4.880 6.280 0.547 0.024 0.004 19 Bandwidth-Restricted Graph Generation Molecule Graph RDKit canonical ordering BFS ordering DFS ordering C-M ordering = 21 =8 = 23 =3 = 19 =6 = 20 =3 = 25 =7 = 28 =4 = 23 =7 = 29 =4 = 21 =8 = 27 =5 = 13 =7 = 16 =4 = 15 =5 = 19 =3 = 20 =6 = 21 =3 = 25 =7 = 25 =4 = 14 =6 = 16 =3 Figure 4. Adjacency matrix and bandwidth φ for different orderings (canonical RDKit, BFS, DFS and Cuthill-McKee) for a random set of 10 molecules from ZINC250k. 20 Count Bandwidth-Restricted Graph Generation 125000 100000 75000 50000 25000 0 SMILES canonical order C-M order 0 5 10 15 20 25 30 35 Figure 5. Distribution of bandwidth of ZINC250k adjacency matrices using the C-M algorithm and the canonical SMILES order. t=0 t = 20 t = 40 t = 60 t = 80 t = 100 t = 120 t = 140 t = 160 t = 180 Figure 6. Visualization of cosine variance schedule forward diffusion with 200 steps on a planar graph with restricted bandwidth. Savings factor vs max_mem improvement model Graphite EDP-GNN max_mem improvement 10 8 6 4 2 2 4 6 8 Savings factor 10 12 14 Figure 7. Savings factor versus improvement in memory usage with and without BwR for different datasets, for Graphite and EDP-GNN models. 21