A D VA N C E D A L G O R I T H M S notes for cmu 15-850 (fall 2020) lecturer: anupam gupta About this document This document contains the course notes for 15-850: Advanced Algorithms, a graduate-level course taught by Anupam Gupta at Carnegie Mellon University in Fall 2020. Parts of these notes were written by the students of previous versions of the class (based on the lectures) and then edited by the professor. The names of the student scribes will appear soon, as will missing details and bug fixes, more chapters, exercises, and (better) figures. The notes have not been thoroughly checked for accuracy, especially attributions of results. They are intended to serve as study resources and not as a substitute for professionally prepared publications. We apologize for any inadvertent inaccuracies or misrepresentations. More information about the course, including problem sets and references, can be found on the course website: https://www.cs.cmu.edu/~15850/ The style files (as well as the text on this page!) are mildly adapted from the ones developed by Yufei Zhao (MIT), for his notes on Graph Theory and Additive Combinatorics. As some of you may guess, the LATEX template used for these notes is called tufte-book. Contents I Classical Algorithms 7 1 Minimum Spanning Trees 1.1 Minimum Spanning Trees: History . . . . . . . . . . . . . 1.2 The Classical Algorithms . . . . . . . . . . . . . . . . . . 1.3 Fredman and Tarjan’s O(m log∗ n)-time Algorithm . . . 1.4 A Linear-Time Randomized Algorithm . . . . . . . . . . 1.5 Optional: MST Verification . . . . . . . . . . . . . . . . . 1.6 The Ackermann Function . . . . . . . . . . . . . . . . . . 1.7 Matroids . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 9 12 14 17 20 23 24 2 Arborescences: Directed Spanning Trees 2.1 Arborescences . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 The Chu-Liu/Edmonds/Bock Algorithm . . . . . . . . . 2.3 Linear Programming Methods . . . . . . . . . . . . . . . 2.4 Matroid Intersection . . . . . . . . . . . . . . . . . . . . . 25 25 26 28 33 3 Shortest Paths in Graphs 3.1 Single-Source Shortest Path Algorithms . . . . . . . . . . 3.2 The All-Pairs Shortest Paths Problem (APSP) . . . . . . . 3.3 Min-Sum Products and APSPs . . . . . . . . . . . . . . . 3.4 Undirected APSP Using Fast Matrix Multiplication . . . 3.5 Optional: Fredman’s Decision-Tree Complexity Bound . 35 35 38 42 44 47 4 Low-Stretch Spanning Trees 4.1 Towards a Definition . . . . . . . . . . . . . . . . . . . . . 4.2 Low-Stretch Spanning Tree Construction . . . . . . . . . 4.3 Bartal’s Construction . . . . . . . . . . . . . . . . . . . . . 4.4 Metric Embeddings: a.k.a. Simplifying Metrics . . . . . . 49 49 52 53 58 5 A Near-Linear Time Algorithm for SSSP 61 6 Blank 63 4 7 Graph Matchings I: Combinatorial Algorithms 7.1 Notation and Definitions . . . . . . . . . . . . . . . . . . . 7.2 Bipartite Graphs . . . . . . . . . . . . . . . . . . . . . . . . 7.3 General Graphs: The Tutte-Berge Theorem . . . . . . . . 7.4 The Blossom Algorithm . . . . . . . . . . . . . . . . . . . 7.5 Subsequent Work . . . . . . . . . . . . . . . . . . . . . . . 65 65 67 70 71 76 8 Graph Matchings II: Algebraic Algorithms 79 8.1 Preliminaries: roots of low degree polynomials . . . . . . 79 8.2 Detecting Perfect Matchings by Computing a Determinant 81 8.3 From Detecting to Finding Perfect Matchings . . . . . . . 84 8.4 Red-Blue Perfect Matchings . . . . . . . . . . . . . . . . . 87 8.5 Matchings in Parallel, and the Isolation Lemma . . . . . 
88 8.6 The Permanent Connection . . . . . . . . . . . . . . . . . 90 8.7 A Matrix Scaling Approach . . . . . . . . . . . . . . . . . 90 9 Graph Matchings III: Weighted Matchings 9.1 Linear Programming . . . . . . . . . . . . . . . . . . . . . 9.2 Weighted Matchings in Bipartite Graphs . . . . . . . . . 9.3 Another Perspective: Buyers and sellers . . . . . . . . . . 9.4 A Third Algorithm: Shortest Augmenting Paths . . . . . 9.5 Perfect Matchings in General Graphs . . . . . . . . . . . 9.6 Integrality of Polyhedra . . . . . . . . . . . . . . . . . . . 93 93 96 100 105 107 110 II Interlude: Dimension Reduction 115 10 Concentration of Measure 117 10.1 Asymptotic Analysis . . . . . . . . . . . . . . . . . . . . . 118 10.2 Non-Asymptotic Convergence Bounds . . . . . . . . . . . 119 10.3 Chernoff bounds, and Hoeffding’s inequality . . . . . . . 122 10.4 Other concentration bounds . . . . . . . . . . . . . . . . . 127 10.5 Application #1: Oblivious Routing on the Hypercube . . 130 10.6 Application #2: Graph Sparsification . . . . . . . . . . . . 134 10.7 Application #3: The Power of Two Choices . . . . . . . . 134 11 Dimension Reduction and the JL Lemma 135 11.1 The Johnson Lindenstrauss lemma . . . . . . . . . . . . . 135 11.2 The Construction . . . . . . . . . . . . . . . . . . . . . . . 136 11.3 Intuition for the Distributional JL Lemma . . . . . . . . . 137 11.4 A Direct Proof of Lemma 11.2 . . . . . . . . . . . . . . . . 138 11.5 Introducing Subgaussian Random Variables . . . . . . . 140 11.6 A Proof of Lemma 11.2 using Subgaussian r.v.s . . . . . . 141 11.7 Optional: Compressive Sensing . . . . . . . . . . . . . . . 143 11.8 Some Facts about Balls in High-Dimensional Spaces . . . 147 5 12 Streaming Algorithms 12.1 Streams as Vectors, and Additions/Deletions . . . . . . . 12.2 Computing Moments . . . . . . . . . . . . . . . . . . . . . 12.3 A Matrix View of our Estimator . . . . . . . . . . . . . . . 12.4 Application: Approximate Matrix Multiplication . . . . . 12.5 Optional: Computing the Number of Distinct Elements . 149 150 151 154 155 156 13 Dimension Reduction: Singular Value Decompositions 13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 13.2 Best fit subspaces of dimension k and the SVD . . . . . . 13.3 Useful facts, and rank-k-approximation . . . . . . . . . . 13.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 13.5 Symmetric Matrices . . . . . . . . . . . . . . . . . . . . . . 161 161 162 165 166 168 III 169 “Modern” Algorithms 14 Online Learning: Experts and Bandits 14.1 The Mistake-Bound Model . . . . . . . . . . . . . . . . . . 14.2 The Weighted Majority Algorithm . . . . . . . . . . . . . 14.3 Randomized Weighted Majority . . . . . . . . . . . . . . 14.4 The Hedge Algorithm, and a Change in Perspective . . . 14.5 Optional: The Bandit Setting . . . . . . . . . . . . . . . . 171 171 172 174 176 178 15 Solving Linear Programs using Experts 181 15.1 (Two-Player) Zero-Sum Games . . . . . . . . . . . . . . . 181 15.2 Solving LPs Approximately . . . . . . . . . . . . . . . . . 184 16 Approximate Max-Flows using Experts 16.1 The Maximum Flow Problem . . . . . . . . . . . . . . . . 16.2 A First Algorithm using the MW Framework . . . . . . . 16.3 Finding Max-Flows using Electrical Flows . . . . . . . . e (m3/2 )-time Algorithm . . . . . . . . . . . . . . . . . 16.4 An O e (m4/3 )-time Algorithm . . . . . . . . . . . 16.5 Optional: An O 17 The Gradient Descent Framework 17.1 Convex Sets and Functions . . . . . . . . . . . . . . . . . 
17.2 Unconstrained Convex Minimization . . . . . . . . . . . 17.3 Constrained Convex Minimization . . . . . . . . . . . . . 17.4 Online Gradient Descent, and Relationship with MW . . 17.5 Stronger Assumptions . . . . . . . . . . . . . . . . . . . . 17.6 Extensions and Loose Ends . . . . . . . . . . . . . . . . . 189 189 190 192 197 198 205 205 207 211 213 214 217 18 Mirror Descent 219 18.1 Mirror Descent: the Proximal Point View . . . . . . . . . 219 18.2 Mirror Descent: The Mirror Map View . . . . . . . . . . . 222 6 18.3 The Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 225 18.4 Alternative Views of Mirror Descent . . . . . . . . . . . . 227 19 The Centroid and Ellipsoid Algorithms 19.1 The Centroid Algorithm . . . . . . . . . . . . . . . . . . . 19.2 Multi-Dimensional Binary Search . . . . . . . . . . . . . . 19.3 Ellipsoid for Convex Optimization . . . . . . . . . . . . . 19.4 The Ellipsoid Algorithm to Solve LPs . . . . . . . . . . . 19.5 Getting the New Ellipsoid . . . . . . . . . . . . . . . . . . 19.6 Algorithms for Solving LPs . . . . . . . . . . . . . . . . . 229 229 232 234 235 237 239 20 Interior-Point Methods 20.1 Barrier Functions . . . . . . . . . . . . . . . . . . . . . . . 20.2 The Update Step . . . . . . . . . . . . . . . . . . . . . . . 20.3 The Newton-Raphson Method . . . . . . . . . . . . . . . 20.4 Self-Concordance . . . . . . . . . . . . . . . . . . . . . . . 241 242 244 248 250 24 Approximation Algorithms 24.1 A Rough Classification into Hardness Classes . . . . . . 24.2 The Surrogate . . . . . . . . . . . . . . . . . . . . . . . . . 24.3 The Set Cover Problem . . . . . . . . . . . . . . . . . . . . 24.4 A Relax-and-Round Algorithm for Set Cover . . . . . . . 24.5 The Bin Packing Problem . . . . . . . . . . . . . . . . . . 24.6 The Linear Grouping Algorithm for Bin Packing . . . . . 24.7 Subsequent Results and Open Problems . . . . . . . . . . 251 251 253 253 255 256 257 260 25 Approximation Algorithms via SDPs 25.1 Positive Semidefinite Matrices . . . . . . . . . . . . . . . . 25.2 Semidefinite Programs . . . . . . . . . . . . . . . . . . . . 25.3 SDPs in Approximation Algorithms . . . . . . . . . . . . 25.4 The MaxCut Problem and Hyperplane Rounding . . . 25.5 Coloring 3-Colorable Graphs . . . . . . . . . . . . . . . . 261 261 262 264 264 267 26 Online Algorithms 273 26.1 The Competitive Analysis Framework . . . . . . . . . . . 273 26.2 The Ski Rental Problem: Rent or Buy? . . . . . . . . . . . 275 26.3 The Paging Problem . . . . . . . . . . . . . . . . . . . . . 279 26.4 Generalizing Paging: The k-Server Problem . . . . . . . . 281 Part I Classical Algorithms 1 Minimum Spanning Trees We start our exploration of algorithms this semester with a classic problem in algorithmic graph theory: given a graph with edge weights, finding a spanning tree of minimum total weight. Why this problem? 1. It is a rich problem with lots of structure, which is easy to discover, and which we can use to get good algorithms. 2. These algorithms allows us to showcase the interplay between data structures and algorithms—while we will have too much time to spend exploring data structures in this course, they can be crucial in getting improved running times. 3. And finally, some of it is for nostalgia: algorithms for this problem have been known for almost a hundred years now! 1.1 Minimum Spanning Trees: History In minimum spanning tree problem, the input is an undirected connected graph G = (V, E) with n nodes and m edges, where the edges have weights w(e) ∈ R. 
The goal is to find a spanning tree of the graph with the minimum total edge-weight. If the graph G is disconnected, we get a spanning forest As a classic (and important) problem, it’s been tackled many times. Here’s a brief, not-quitecomprehensive history of its optimization, all without making any assumptions on the edge weights other that they can be compared in constant time: • Otakar Borůvka gave the first known MST algorithm in 1926; it was subsequently rediscovered by Gustave Choquet (1938), Georges Sollin (1965), and several others. Vojtečh Jarník gave his algorithm in 1930, and it was independently discovered by Robert Prim (’57) and Edsger Dijkstra (’59), among others. Joseph Kruskal A spanning tree/forest is defined to be an acyclic subgraph T that is inclusionwise maximal, i.e., adding any edge in G \ T would create a cycle. Otakar Borůvka (1926) Vojtečh Jarník (1930) J.B. Kruskal, Jr. (1956) 10 minimum spanning trees: history gave his algorithm in ’56; this was rediscovered by Loberman and Weinberger in 1957. All these can easily be implemented in O(m log n) time; we will discuss these in this lecture. • In 1975, Andy Yao achieved a runtime of O(m log log n). His algorithm builds on Borůvka’s algorithm (which he attributes to Sollin), and uses as a subroutine the linear-time algorithm for median-finding, which had only recently been invented in 1974. We will work through Yao’s algorithm in HW#1. • In 1984, Michael Fredman and Bob Tarjan gave an O(m log∗ n) time algorithm, based on their Fibonacci heaps data structure. Here log∗ is the iterated logarithm function, and denotes the number of times we must take logarithms before the argument becomes smaller than 1. The actual runtime is a bit more nuanced, which we will not bother with today. This result was soon improved by Gabow, Galil, Spencer, and Tarjan (’86) to get an O(m log log∗ n) runtime—note the logarithm applied to the iterated logarithm. Both Prim and Kruskal refer to Borůvka’s paper, but say it is “unnecesarily elaborate”. However, while Borůvka’s paper is written in a complicated fashion, but his essential ideas are very clean. Andrew Chi-Chih Yao (1975) Michael L. Fredman and Robert E. Tarjan (1987) Gabow, Galil, Spencer, and Tarjan (1986) • In 1995, David Karger, Phil Klein and Bob Tarjan finally got the holy grail of O(m) time! . . . but it was a randomized algorithm, so the search for a deterministic linear-time algorithm continued. Karger, Klein, and Tarjan (1995) • In 1997, Bernard Chazelle gave an O(mα(n))-time deterministic algorithm. Here α(n) is the inverse Ackermann function (defined in §1.6). This function grows extremely slowly, even slower than the iterated logarithm function. However, it still goes to infinity as n → ∞, so we still don’t have a deterministic linear-time MST algorithm. Chazelle (1997) • In 1998, Seth Pettie and Vijaya Ramachandran gave an optimal algorithm for computing minimum spanning trees—however, we don’t know its runtime! More formally, they show that if there exists an algorithm which uses MST ∗ (m, n) comparisons to find MSTs on all graphs with m edges and n nodes, the PettieRamachandran algorithm will run in time O( MST ∗ (m, n)).) Pettie and Ramachandran (1998) In this chapter, we’ll go through the three classics: Jarnik/Prim’s, Kruskal’s, and Borůvka’s algorithms. Then we will discuss Fredman and Tarjan’s algorithm, and finally present Karger, Klein, and Tarjan’s randomized algorithm. 
This will lead us to discuss another intriguing question: how do we verify whether a given tree is an MST? This was part of Seth's Ph.D. thesis, and Vijaya was his advisor.

1.1.1 Two Assumptions

For the rest of this chapter, assume that the edge weights are distinct. This does not change things in any essential way, but it simplifies some of the statements, because distinct edge weights imply that the MST is unique. (Exercise: prove this!) Also assume the graph is simple, and hence m ≤ (n choose 2); you can delete all self-loops and remove all-but-the-lightest from any collection of parallel edges, all by preprocessing the graph in linear time.

1.1.2 The Cut and Cycle Rules

Most of these algorithms rely on two rules: the cut rule (known in Bob Tarjan's monograph as the blue rule) and the cycle rule (or the red rule). Recall that a cut in the graph is a partition of the vertices into two non-empty sets (S, S̄ = V \ S), and an edge crosses this cut if its two endpoints lie in different sets. Tarjan (1983)

Theorem 1.1 (Cut Rule). For any cut of the graph, the minimum-weight edge that crosses the cut must be in the MST. This rule helps us determine what to add to our MST.

Proof. Let S ⊊ V be any nonempty proper subset of vertices, let e = {u, v} be the minimum-weight edge that crosses the cut defined by (S, S̄) (w.l.o.g. u ∈ S, v ∉ S), and let T be a spanning tree not containing e. Then T ∪ {e} contains a unique cycle C. Since any cycle crosses any cut an even number of times, and C crosses this cut at e, it must cross it at some other edge e′ as well. But w(e′) > w(e), so T′ = (T − {e′}) ∪ {e} is a lower-weight tree than T, so T is not the MST. Since T was an arbitrary spanning tree not containing e, the MST must contain e.

Theorem 1.2 (Cycle Rule). For any cycle in G, the heaviest edge on that cycle cannot be in the MST. This helps us determine what we can remove in constructing the MST.

Proof. Let C be any cycle, and let e be the heaviest edge in C. For a contradiction, let T be an MST that contains e. Dropping e from T gives two components. Now there must be some edge e′ in C \ {e} that crosses between these two components, and hence T′ := (T − {e}) ∪ {e′} is a spanning tree. (Make sure you see why.) By the choice of e we have w(e′) < w(e), so T′ is a lower-weight spanning tree than T, a contradiction.

To find a minimum spanning tree, we repeatedly apply whichever of these rules we like. E.g., we choose some cut, use the cut rule to designate the lightest edge in it as belonging to the MST by coloring it blue (hence the name). Or we choose a cycle which contains no red edge, and use the cycle rule to mark the heaviest edge as not being in the MST, coloring it red. (Again, this edge cannot already be blue for similar reasons.) And if either of the rules is no longer applicable, we are done. Indeed, if we cannot apply the blue rule, the blue edges cross every cut, and hence form a spanning tree, which must be the MST. Similarly, once the non-red edges do not contain a cycle, they form a spanning tree, which must be the MST. All known algorithms differ only in their choice of cut/cycle, and how they find these fast.

(This blue edge cannot have previously been colored red: this follows from the above theorems. Or more directly, any cycle crosses any cut an even number of times, so a cycle containing e also contains another edge f in the cut, which is heavier.)
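To make the two rules concrete, here is a minimal Python sketch (mine, not part of the notes) of the primitive operations they rely on: finding the lightest edge crossing a given cut, and the heaviest edge on a given cycle. The edge-list representation is just an illustrative choice.

```python
def lightest_crossing_edge(edges, S):
    """Cut rule primitive: the minimum-weight edge with exactly one endpoint in S.

    edges: iterable of (u, v, w) triples; S: a set of vertices.
    Returns None if no edge crosses the cut (S, V \\ S).
    """
    crossing = [(w, u, v) for (u, v, w) in edges if (u in S) != (v in S)]
    return min(crossing, default=None)

def heaviest_cycle_edge(cycle_edges):
    """Cycle rule primitive: the maximum-weight edge on the given cycle."""
    return max((w, u, v) for (u, v, w) in cycle_edges)

# Example: on the triangle a-b-c, the cut around {a} forces the lighter of the two
# edges at a into the MST, and the cycle rule excludes the heaviest edge overall.
edges = [("a", "b", 3), ("b", "c", 1), ("a", "c", 2)]
print(lightest_crossing_edge(edges, {"a"}))   # (2, 'a', 'c')
print(heaviest_cycle_edge(edges))             # (3, 'a', 'b')
```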
Indeed, all the deterministic algorithms we discuss today will just use the cut rule, whereas the randomized algorithm will use the cycle rule as well. 1.2 The Classical Algorithms 1.2.1 Kruskal’s Algorithm For Kruskal’s Algorithm, first sort all the edges such that w(e1 ) < w(e2 ) < · · · < w(em ). This takes O(m log m) = O(m log n) time. Start with all edges being uncolored, and iterate through the edges in the sorted order, coloring an edge blue if and only if it connects two vertices which are not currently in the same blue component. Figure 1.1 gives an example of how edges are added. To keep track of which vertex is in which component, use a disjoint set union-find data structure. This data structure has three operations: • makeset(elem), which takes an element elem and creates a new singleton set for it, • find(elem), which finds the canonical representative for the set containing the element elem, and • union(elem1 , elem2 ), which merges the two sets that elem1 and elem2 are in. There is an implementation of this data structure which allows us to do m operations in O(m α(m)) amortized time, where α(·) is the inverse Ackermann function mentioned above. Note that the naïve implementation of Kruskal’s algorithm spends O(m log m) = O(m log n) time to sort the edges, and then performs n makesets, m finds, and n − 1 union operations, the total runtime is O(m log n + m α(m)), which is dominated by the O(m log n) term. 1.2.2 The Jarnik/Prim Algorithm For the Jarnik/Prim algorithm, first take an arbitrary root vertex r to start our MST T. At each iteration, take the cheapest edge connecting 3 1 10 5 4 2 Figure 1.1: Dashed lines are not yet in the MST. Note that 5 will be analyzed next, but will not be added. 10 will be added. Colors designate connected components. minimum spanning trees of our current tree T of blue edges to some vertex not yet in T, and color it blue—thereby adding this edge to T and increasing its size by one. Figure 1.2 below shows an example of how we edges are added. We’ll use a priority queue data structure which keeps track of the lightest edge connecting T to each vertex not yet in T. A priority queue data structure is equipped with (at least) three operations: • insert(elem, key) inserts the given (element, key) pair into the queue, 13 3 1 10 5 4 2 Figure 1.2: Dashed lines are not yet in the MST. We started at the red node, and the blue nodes are also part of T right now. • decreasekey(elem, newkey) changes the key of the element elem from its current key to min(originalkey, newkey), and • extractmin() removes the element with the minimum key from the priority queue, and returns the (elem, key) pair. Note that by using the standard binary heap data structure we can get O(log n) worst-case time for each priority queue operation above. To implement the Jarnik/Prim algorithm, we initially insert each vertex in V \ {r } into the priority queue with key ∞, and the root r with key 0. The key of an node v denotes the weight of the least-weight edge from a node in T to v; it is zero if v ∈ T, and ∞ if there are no edges yet from nodes in T to v. At each step, use extractmin to find the vertex u with smallest key, and add u to the tree using this edge. Then for each neighbor of u, say v, do decreasekey(v, w({u, v})). Overall we do m decreasekey operations, n inserts, and n extractmins, with the decreasekeys supplying the dominating O(m log n) term. We can optimize slightly by inserting a vertex into the priority queue only when it has an edge to the current tree T. 
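To make the preceding descriptions concrete, here is a short Python sketch (mine, not from the notes) of Kruskal's algorithm using a union-find structure with path compression and union by rank; as discussed above, the O(m log n) cost is dominated by the initial sort.

```python
class UnionFind:
    def __init__(self, elems):
        self.parent = {x: x for x in elems}
        self.rank = {x: 0 for x in elems}

    def find(self, x):                       # find with path compression (halving)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):                   # union by rank; False if already joined
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True

def kruskal(vertices, edges):
    """Return the MST (as a list of (u, v, w) edges) of a connected graph."""
    uf = UnionFind(vertices)
    tree = []
    for u, v, w in sorted(edges, key=lambda e: e[2]):   # O(m log m) sort
        if uf.union(u, v):                              # cut rule: joins two blue components
            tree.append((u, v, w))
    return tree

# Example on a small weighted graph (weights made up).
print(kruskal({1, 2, 3, 4}, [(1, 2, 3), (2, 3, 1), (3, 4, 10), (1, 3, 5), (2, 4, 4)]))
```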
This does not seem particularly useful right now, but will be crucial in the Fredman-Tarjan proof. 1.2.3 Borůvka’s Algorithm Unlike Kruskal’s and Jarnik/Prim’s algorithms, Borůvka’s algorithm adds many edges in parallel, and can be implemented without any non-trivial data structures. In a “round”, simply take the lightest edge out of each vertex and color it blue; these edges are guaranteed to form a forest if edge-weights are distinct. (Exercise: why?) Now contract the blue edges and recurse on the resulting graph. At the end, when the resulting graph is a single vertex, uncontract all the edges to get the MST. Each round can be implemented in O(m) work: we will work out the details of this in HW #1. Moreover, we’re guaranteed to shrink away at least half of the nodes (as each node at least pairs up with one other node), and maybe many more if we are lucky. So we have at most ⌈log2 n⌉ rounds of computation, leaving us with O(m log n) total work. 3 1 10 10 5 4 2 Figure 1.3: The red edges will be chosen and contracted in a single step, yielding the graph on the right, which we recurse on. Colors designate components. fredman and tarjan’s O ( m log ∗ n ) -time algorithm 14 1.2.4 A Slight Improvement on Jarnik/Prim We can actually easily improve the performance of Jarnik/Prim’s algorithm by using a more sophisticated data structure, namely by using Fibonacci heaps instead of binary heaps to implement the priority queue. Fibonacci heaps (invented by Fredman and Tarjan) implement the insert and decreasekey operations in constant amortized time, and extractmin in amortized O(log H ) time, where H is the maximum number of elements in the heap during the execution. Since we do n extractmins, and O(m + n) of the other two operations, and the maximum size of the heap is at most n, this gives us a total cost of O(m + n log n). Note that this is linear time on graphs with m = Ω(n log n) edges; however, we’d like to get linear-time on all graphs. So the remaining cases are the graphs with m = o (n log n) edges. 1.3 Fredman and Tarjan’s O(m log∗ n)-time Algorithm Fredman and Tarjan’s algorithm builds on Jarnik/Prim’s algorithm: the crucial observation uses the following crucial facts. The amortized cost of extractmin operations in Fibonacci heaps is O(log H ), where H is the maximum size of the heap. Moreover, in Jarnik/Prim’s algorithm, the size of the heap is just the number of nodes that are adjacent to the current tree T. So if the current tree always has a “small boundary”, the extractmin cost will be low. How can we maintain the boundary to be smaller than some threshold K? Simple: Once the boundary exceeds K, stop growing the Prim tree, and begin Jarnik/Prim’s algorithm anew from a different vertex. Do this until we have a forest where all vertices lie in some tree; then contract these trees (much like Borůvka), and recurse on the smaller graph. Before we formally define the algorithm, here’s an example. Formally, in each round of the algorithm, all vertices start as unmarked. 1. Pick an arbitrary unmarked vertex and start Jarnik/Prim’s algorithm from it, creating a tree T. Keep track of the lightest edge from T to each vertex in the neighborhood N ( T ) of T, where N ( T ) := {v ∈ V − T | ∃u ∈ T s.t. {u, v} ∈ E}. Note that N ( T ) may contain vertices that are marked. 2. If at any time | N ( T )| ≥ K, or if T has just added an edge to some vertex that was previously marked, stop and mark all vertices in the current T, and go to step 1. 
Figure 1.4: We begin at vertices A, H, R, and D (in that order) with K = 6. Although D begins as its own component, it stops when it joins with tree A. Dashed edges are not chosen in this step (though they may be chosen in the next recursive call), and colors denote trees.

3. Terminate when each node belongs to some tree.

Let's first note that the runtime of one round of the algorithm is O(m + n log K). Each edge is considered at most twice, once from each endpoint, giving us the O(m) term. Each time we grow the current tree in step 1, the number of connected components decreases by 1, so there are at most n such steps. Each step calls extractmin on a heap of size at most K, which takes O(log K) time. Hence, at the end of this round, we've successfully identified a forest, each edge of which is part of the final MST, in O(m + n log K) time.

Let d_v be the degree of the vertex v in the graph we consider in this round. We claim that every marked vertex u belongs to a component C such that ∑_{v∈C} d_v ≥ K. Indeed, if u became marked because the neighborhood of its component had size at least K, then this is true. Otherwise, u became marked because it entered a component C of marked vertices. Since the vertices of C were marked, ∑_{v∈C} d_v ≥ K before u joined, and this sum only increased when u (and other vertices) joined. Thus, if C_1, ..., C_l are the components at the end of this routine, we have

  2m = ∑_v d_v = ∑_{i=1}^{l} ∑_{v∈C_i} d_v ≥ ∑_{i=1}^{l} K = K·l.

Thus l ≤ 2m/K, i.e., this routine produces at most 2m/K trees.

The choice of K will change over the course of the algorithm. How should we set the thresholds K_i? Say we start round i with n_i nodes and m_i ≤ m edges. One clean way is to set

  K_i := 2^{2m/n_i},

which ensures that

  O(m_i + n_i log K_i) = O(m_i + n_i · (2m/n_i)) = O(m).

In turn, this means the number of trees, and hence the number of nodes n_{i+1} in the next round, is at most 2m_i/K_i ≤ 2m/K_i. The number of edges is m_{i+1} ≤ m_i ≤ m. Rewriting, this gives

  K_i ≤ 2m/n_{i+1} = lg K_{i+1}  =⇒  K_{i+1} ≥ 2^{K_i}.

Hence the threshold value exponentiates in each step; it increases "tetrationally". So after log∗ n rounds, the value of K would be at least n, and we would just run Jarnik/Prim's algorithm to completion, ending with a single tree. This means we have at most log∗ n rounds, and a total of O(m log∗ n) work.

In retrospect, I don't know whether to consider the Fredman-Tarjan algorithm as being trivial (once we have Fibonacci heaps) or being devilishly clever. I think it is the latter (and that is the beauty of the best algorithms). Indeed, there's a lovely idea—of keeping the neighborhoods small at the beginning when there's a lot of work to do, but allowing them to grow quickly as the graph collapses. It is quite non-obvious at the start, and obvious in hindsight. And once you see it, you cannot un-see it!
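As a quick sanity check on this analysis, here is a tiny Python sketch (mine, not from the notes) that simulates the bookkeeping of the threshold schedule K_i = 2^{2m/n_i}, n_{i+1} ≤ 2m/K_i, for arbitrary illustrative values of n and m; it only tracks the analysis, not the algorithm itself.

```python
import math

def fredman_tarjan_rounds(n, m):
    """Count rounds of the schedule K_i = 2^(2m/n_i), n_{i+1} = 2m/K_i."""
    rounds, n_i = 0, n
    while n_i > 1:
        K_i = 2.0 ** min(2 * m / n_i, 64)      # cap the exponent to avoid overflow
        rounds += 1
        if K_i >= n_i:                          # this round finishes the whole graph
            break
        n_i = max(1, math.floor(2 * m / K_i))
    return rounds

# Even for a sparse graph on a million vertices, only a handful of rounds are needed.
print(fredman_tarjan_rounds(n=10**6, m=2 * 10**6))   # 3
```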
1.4 A Linear-Time Randomized Algorithm

Another algorithm that is extremely clever but almost obvious in hindsight is the Karger-Klein-Tarjan randomized MST algorithm, which runs in O(m + n) expected time. The new idea here is to compute a "rough approximation" to the MST, use that to throw away many edges using the cycle rule, and then recurse on the rest of the graph. Karger, Klein, and Tarjan (1995)

A version of this algorithm was proposed by Karger in 1992, but he only obtained an O(m + n log n) runtime. The enhancement to linear time was given by Klein and Tarjan at the STOC 1994 conference; the combined paper is cited above.

1.4.1 Heavy & light edges

The crucial definition is that of edges being heavy and light with respect to some forest F.

Definition 1.3. Let F be a forest that is a subgraph of G. An edge e ∈ E(G) is F-heavy if e creates a cycle when added to F, and moreover it is the heaviest edge in this cycle. Otherwise, we say edge e is F-light.

The next facts follow from the definition:

Fact 1.4. Edge e is F-light ⇐⇒ e ∈ MST(F ∪ {e}).

Fact 1.5 (Completeness). If T is an MST of G then edge e ∈ E(G) is T-light if and only if e ∈ T.

Fact 1.6 (Soundness). For any forest F, the F-light edges contain the MST of the underlying graph G. In other words, any F-heavy edge is also heavy with respect to the MST of the entire graph.

This suggests a clear strategy: pick a forest F from the current edges, and discard all the F-heavy edges. Hopefully the number of edges remaining is small. By Fact 1.6 these edges contain the MST of G, so repeat the process on them. To make this idea work, we want a forest F with many F-heavy edges. The catch is that a forest has many heavy edges only if it has small weight: that is, if many off-forest edges form cycles in which they are the heaviest edge. Indeed, one such forest is the MST T∗ of G: Fact 1.5 shows there are m − (n − 1) many T∗-heavy edges, the maximum possible. How do we find some similarly good tree/forest, but in linear time?

Figure 1.5: Fix this figure, make it interesting. Every edge in F is F-light, as are the edges on the left, and also those going between the components. The edge on the right is F-heavy.

A second issue is to classify edges as light/heavy, given a forest F. It is easy to classify a single edge e in linear time, but the following remarkable theorem is also true:

Theorem 1.7 (MST Verification). Given a forest F ⊆ G, we can output the set of all F-light edges in G in time O(m + n).

This MST verification algorithm itself uses several interesting ideas; we discuss some of them in Section 1.5. But for now, let us use it to give the randomized linear-time MST algorithm.

1.4.2 The Randomized MST Algorithm

The idea is simple and elegant: randomly choose half of the edges and find the minimum-weight spanning forest F on this "half-of-a-graph". This forest F should have many F-heavy edges; we discard these and recursively find the MST on the remaining graph. Since both the recursive calls are on smaller graphs, hopefully the runtime will be linear. The actual algorithm below has just one extra step: we first run a few rounds of Borůvka's algorithm to force a reduction in the number of vertices, and then do the steps above.

Algorithm 1: KKT(G)
1.1 Run 3 rounds of Borůvka's Algorithm on G, contracting the chosen edges to get a graph G′ = (V′, E′) with n′ ≤ n/8 vertices and m′ ≤ m edges.
1.2 If G′ has a single vertex, return any chosen edges.
1.3 E1 ← random sample of E′, each edge picked indep. w.p. 1/2.
1.4 F1 ← KKT(G1 = (V′, E1)).
1.5 E2 ← all the F1-light edges in E′.
1.6 F2 ← KKT(G2 = (V′, E2)).
1.7 return F2 (combined with Borůvka edges chosen in Step 1).
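Apart from the recursive calls, the step above that needs the most care is Step 1.5, which Theorem 1.7 performs in linear time. As a stand-in, here is a naive Python sketch (mine, and far from linear time) that classifies edges as F-light or F-heavy directly from Definition 1.3 and Fact 1.4, by comparing each edge to the heaviest edge on the forest path between its endpoints.

```python
from collections import defaultdict

def light_edges(edges, forest):
    """Return the F-light edges among `edges`, for the forest F = `forest`.

    Naive O(mn) stand-in for the linear-time procedure of Theorem 1.7.
    Edges are (u, v, w) triples; weights are assumed distinct.
    """
    adj = defaultdict(list)
    for u, v, w in forest:
        adj[u].append((v, w))
        adj[v].append((u, w))

    def path_max(u, v):
        # Max edge weight on the u-v path in the forest, or None if disconnected.
        stack, seen = [(u, float("-inf"))], {u}
        while stack:
            x, best = stack.pop()
            if x == v:
                return best
            for y, w in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append((y, max(best, w)))
        return None

    # e is F-light iff it joins two components of F, or it is not the heaviest
    # edge on the cycle it closes (Fact 1.4); forest edges are always F-light.
    return [(u, v, w) for u, v, w in edges
            if (m := path_max(u, v)) is None or w <= m]
```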
Theorem 1.8. The KKT algorithm returns MST(G).

Proof. This follows from Fact 1.6, that discarding heavy edges of any forest F in a graph does not change the MST. Indeed, the MST on G2 is the same as the MST on G′, since the discarded F1-heavy edges cannot be in MST(G′) because of Fact 1.6. Adding back the edges picked by Borůvka's algorithm in Step 1 gives the MST on G, by the cut rule.

Now we need to bound the running time. The following two claims formalize the intuition that we recurse on "smaller" subgraphs:

Claim 1.9. E[#E1] = (1/2) m′.

Claim 1.10. E[#E2] ≤ 2n′.

The first claim is easy to prove, using linearity of expectations, and that each edge is picked with probability 1/2. The proof of Claim 1.10 is also short, but before we prove it, let us complete the proof of the linear running time. (The random subgraph G1 may not be connected, so its minimum spanning forest is obtained by finding the MST of each of its connected components.)

Theorem 1.11. The KKT algorithm, run on a graph with m edges and n vertices, terminates in expected time O(m + n).

Proof. Let TG be the expected running time on graph G, and let T_{m,n} := max{ TG : G = (V, E), |V| = n, |E| = m }. In the KKT algorithm, Steps 1, 2, 3, 5, and 7 can each be done in linear time: indeed, the only non-trivial one among them is Step 5, for which we use Theorem 1.7. Let the total time for these steps be at most c(m + n). Steps 4 and 6 require time TG1 and TG2 respectively. Then we have

  TG ≤ c(m + n) + E[TG1 + TG2] ≤ c(m + n) + E[T_{m1,n′} + T_{m2,n′}],

where m1 = #E1 and m2 = #E2 are both random variables. Inductively assume that T_{m,n} ≤ 2c(m + n); then

  TG ≤ c(m + n) + E[2c(m1 + n′)] + E[2c(m2 + n′)]
     ≤ c(m + n) + c(m′ + 2n′) + 2c(2n′ + n′)
     = c(m + m′ + n + 8n′)
     ≤ 2c(m + n).

The second inequality holds because E[m1] ≤ (1/2) m′ and E[m2] ≤ 2n′. The last inequality holds because n′ ≤ n/8 and m′ ≤ m. Indeed, we shrunk the graph using Borůvka's algorithm in the first step just to ensure n′ ≤ n/8 and hence give us some breathing room.

Now we prove Claim 1.10. Recall that we randomly subsample the edges of G′ to get G1, compute its minimum spanning forest F1, and now we want to bound the expected number of edges in G′ that are F1-light. The key to the proof is to do all these steps together, deferring the random decisions to when we really need them. This makes it apparent which edges are light, making them easy to count. This idea of deferring looking at the random choices of the algorithm is often called the principle of deferred decisions.

Proof of Claim 1.10. For the sake of the proof, we can use any correct algorithm to compute F1, so let us use Kruskal's algorithm. Moreover, let's run a lazy version as follows: first sort all the edges in E′, and not just those in E1 ⊆ E′, and consider them in increasing order of weights. Now if the currently considered edge ei connects two different trees in the current blue forest, call ei useful and flip an independent unbiased coin: if the coin comes up "heads", color ei blue and add it to F1, else color ei red. The crucial observation is that this process produces a forest from the same distribution as first choosing G1 and then computing F1 by running Kruskal's algorithm on it.

Figure 1.6: Illustration of another order of coin tossing.

Now, let us consider the lazy process again: which edges are F1-light? We claim that these are precisely the useful edges. Indeed, any non-useful edge ej forms a cycle with the previously chosen blue edges in F1, and it is the heaviest edge on that cycle. Hence ej does not belong to MST(F1 ∪ {ej}), so it is F1-heavy by Fact 1.4.
And a useful edge ei would belong to MST(F1 ∪ {ei}), since running Kruskal's algorithm on F1 ∪ {ei} would see that ei connects two different blue components and hence would pick it.

Finally, how many useful edges are there, in expectation? Let's abstract away the details: we're running a process that periodically asks us to flip an independent unbiased coin. Since each time we see heads we add an edge to the forest, we definitely stop when we see n′ − 1 heads. (We may stop earlier, in case the process runs out of edges, but then we can pad the random sequence to flip some more coins.) Since the coins are independent and unbiased, the expected number of flips until we see n′ − 1 heads is exactly 2(n′ − 1). This proves Claim 1.10.

That's it. The algorithm and proof are both short and slick and beautiful: this result is a real gem. I think it's an algorithm from The Book. (Paul Erdős claimed that God has "The Book" which contains the most elegant proof of each mathematical theorem.) The one slight annoyance with the algorithm is the relative complexity of the MST verification algorithm, which we use to find the F1-light edges in linear time. Nonetheless, these verification algorithms also contain many nice ideas, which we now discuss. (The current verification algorithms are deterministic; can we use randomness to simplify these as well?)

1.5 Optional: MST Verification

We now come back to the implementation of the MST verification procedure. Here we consider only trees (not forests), since we can run this algorithm separately on each tree in the forest and incur only a linear extra cost. Let us refine Theorem 1.7 as follows.

Theorem 1.12 (MST Verification). Given a tree T = (V, E) where |V| = n, and m pairs of vertices (yi, zi) in T, we can find the heaviest edge on the unique yi-to-zi path in T for all i, in O(m + n) time.

Since the edge {yi, zi} is T-heavy precisely if it is heavier than the heaviest edge on the corresponding tree path, this also proves Theorem 1.7. Observe that the query pairs are given up-front: there is an inverse-Ackermann-type lower bound for the problem where the queries arrive online. Pettie (2006)

How do we get such a linear-time algorithm? A priori, it is not easy to even show a query-complexity upper bound: that there exists a procedure that performs a linear number of edge-weight comparisons to solve the MST verification problem. This problem was solved by János Komlós. Komlós (1985) His result was subsequently made algorithmic ("how do you find (in linear time) which linear number of queries to make?") by Brendan Dixon, Monika Rauch (now Monika Henzinger) and Bob Tarjan. This algorithm was further simplified by Valerie King, and by Thomas Hagerup. We will just discuss Komlós's query-complexity bound.

1.5.1 A Simpler Case

To start developing the algorithm, it helps to consider special cases: e.g., what if the tree is a complete binary tree? Let's assume something slightly less restrictive than a complete binary tree: suppose tree T is rooted at some node r, all internal nodes have at least 2 children, and all its leaves are at the same level. Moreover, all queries {yi, zi} are for pairs where yi is a leaf and zi its ancestor.

Now for an edge (u, v) of the tree, where v is the parent and u the child, consider all queries starting within subtree Tu and ending at vertex v or higher. Say these queries go from some leaves inside Tu up to w1, w2, ..., wk, where w1 is closest to the root.
Define the "query string" Q_e := (w1, w2, ..., wk). We want to calculate the "answer string" A_e := (a1, a2, ..., ak), where ai is the largest weight among the edges between wi and u. (A node v is an ancestor of u if v lies on the unique path from u to the root; then u is a descendant of v.)

Now given the answer string A(b,a), we can get the answer string for a child edge. In the example, say the query string for edge (c, b) is Q(c,b) = (w1, w4, b). We have lost some queries that were in Q(b,a) (e.g., for w3), but we now have a query ending at b. To get A(c,b) we can drop the lost queries, add in the entry for b, and also take the component-wise maximum with the weight of (c, b) itself. E.g., if (c, b) has weight t = 5, then A(c,b) = (max{a1, t}, max{a4, t}, t) = (max{6, 5}, max{4, 5}, 5) = (6, 5, 5).

Figure 1.7: Query string Q(b,a) = (w1, w3, w4) means there are three queries starting from vertices in Tb and ending at w1, w3, w4. The answer string is A(b,a) = (a1, a3, a4) = (6, 4, 4).

Naïvely this would require us to compare the weight w(c,b) with all the entries in the answer string, incurring |Ae′| comparisons. The crucial observation is this: since the nodes in the query string are sorted from top to bottom, the answers must be non-increasing, i.e., a1 ≥ a2 ≥ ··· ≥ ak. Therefore we can do binary search to reduce the number of comparisons between edge-weights. Indeed, given the answer string for some edge e, we can compute the answers Ae′ for a child edge e′ using at most ⌈log(|Ae′| + 1)⌉ comparisons. This will be enough to prove the result.

Claim 1.13. The total number of comparisons for all queries is at most

  ∑_e log(|Q_e| + 1) ≤ O(n + n log((m + n)/n)) = O(m + n).

Proof. Let the number of edges at height i be n_i, where height 1 corresponds to edges incident to the leaves. Then

  ∑_{e ∈ height i} log₂(1 + |Q_e|) = n_i · avg_{e ∈ height i}(log₂(1 + |Q_e|))
    ≤ n_i log₂(1 + avg_{e ∈ height i}(|Q_e|))
    ≤ n_i log₂(1 + m/n_i)
    ≤ n_i log₂((m + n)/(4n)) + n_i log₂(4n/n_i).

The first inequality uses concavity of the function log₂(1 + x), and Jensen's inequality. (Jensen's inequality says that for any convex function f and any random variable X, E[f(X)] ≥ f(E[X]); concavity requires flipping the sign, of course.) The second holds because each of the m queries can appear on at most one edge at each height, so the average "load" is at most m/n_i.

Summing the first term over all heights, and using ∑_i n_i ≤ 2n, gives at most 2n log₂((m + n)/(4n)) = O(m + n). To bound the second term (summed over all heights), recall that each node has at least two children, so the number of edges at least doubles each time the height decreases. Hence n_i ≤ n/2^{i−1}, and

  ∑_{i≥1} n_i log₂(4n/n_i) ≤ ∑_{i≥1} (n/2^{i−1}) · log₂(4n/(n/2^{i−1})) = n · ∑_{i≥1} O(i)/2^i = O(n).

The inequality above uses that x log(4n/x) is increasing for x ≤ n. (To sum the series, let S = ∑_{i≥0} i/2^i; then 2S = ∑_{i≥0} i/2^{i−1} = ∑_{i≥0} (i+1)/2^i, so S = 2S − S = ∑_{i≥0} ((i+1) − i)/2^i = ∑_{i≥0} 1/2^i = 2.)

Converting this into an algorithm that runs in O(m + n) time requires quite a bit more work. The essential idea is to store each query string Q(u,v) as a bit vector of length log₂ n, indicating which nodes on the path from v to the root belong to Q(u,v). Now the answers A(u,v) can be stored by encoding the locations of the successive maxima. And answers for a child edge can be computed from those of the parent edge using some tricky bit operations (e.g., by precomputing solutions on bit-strings of length, say, (log₂ n)/3, of which there are only n^{1/3}, and hence only n^{1/3} × n^{1/3} = n^{2/3} pairs). If you are interested, check out these lecture slides by Uri Zwick.
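Here is a small Python sketch (mine, not from the notes) of the child-edge update described earlier in this subsection: given a parent edge's answer string (which is non-increasing), the indices of its queries that survive at a child edge, and the child edge's weight, it produces the child's answer string using a single binary search rather than comparing against every entry.

```python
import bisect

def child_answers(parent_answers, surviving_indices, child_weight):
    """Compute a child edge's answer string from the parent's.

    parent_answers: non-increasing list a_1 >= a_2 >= ... >= a_k.
    surviving_indices: indices (into parent_answers) of the queries that continue
        below the child edge, in the same top-to-bottom order.
    child_weight: weight of the child edge; the new query ending at the top of
        this edge gets this value directly.
    """
    kept = [parent_answers[i] for i in surviving_indices]   # still non-increasing
    # Count how many kept answers are >= child_weight; the rest become child_weight.
    rev = kept[::-1]                                         # non-decreasing copy
    cut = len(kept) - bisect.bisect_left(rev, child_weight)
    answers = kept[:cut] + [child_weight] * (len(kept) - cut)
    answers.append(child_weight)        # the new query ending at this edge's top
    return answers

# The running example: A_(b,a) = (6, 4, 4) for queries (w1, w3, w4); at the child
# edge (c, b) of weight 5, only w1 and w4 survive and a query ending at b is added.
print(child_answers([6, 4, 4], [0, 2], 5))   # [6, 5, 5]
```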
1.5.2 Solving the General Case

Finally, we reduce a general instance of MST verification to the special instances considered in §1.5.1. First we reduce to a "branching" tree with the special properties we asked for, then we alter the queries to become leaf-ancestor queries.

Figure 1.8: Illustration of balancing a tree. Here maxwtT(v1, v7) is 7, which is the weight of edge (v4, v6).

To achieve this reduction, run Borůvka's algorithm on the tree T. After the ith round of edge selection and contraction, let Vi be the remaining vertices, so that V0 = V is the original set of nodes. Define a new tree T′ whose vertex set V′ is the disjoint union V0 ⊎ V1 ⊎ ···. A node u ∈ Vi has an edge in T′ to v ∈ Vi+1 if the component containing u was contracted into the new vertex v; the weight of this edge in T′ is the weight of the minimum-weight edge chosen by u in this round. Moreover, if r is the single vertex corresponding to the entire tree T at the end of the run of Borůvka's algorithm, then root tree T′ at r.

Exercise 1.14. Show that each node in T′ has at least two children, and all leaves belong to the same level. There are n leaves (corresponding to the nodes in T), and at most 2n − 1 nodes in T′. Also show how to construct T′ in linear time.

Exercise 1.15. For nodes u, v in a tree T, let maxwtT(u, v) be the maximum weight of an edge on the (unique) path between u, v in the tree T. Show that for all u, v ∈ V, maxwtT(u, v) = maxwtT′(u, v).

This exercise means arbitrary queries (yi, zi) in the original tree T can be reduced to leaf-leaf queries in T′. To make these leaf-ancestor queries, we simply find the least-common ancestor ℓi := lca(yi, zi) for each pair, and replace the original query by the maximum of the two queries (yi, ℓi), (zi, ℓi). To show that we can find the least-common ancestors in linear time, we defer to a theorem of David Harel and Bob Tarjan: Harel and Tarjan (1984)

Theorem 1.16. Given a tree T, we can preprocess it in O(n) time, so that all subsequent least-common ancestor queries for T can be answered in O(1) time.

Interestingly, this algorithm also proceeds by solving the least-common ancestor problem for complete balanced binary trees, and then extending the solution to general trees. For a survey of algorithms for this problem, see the paper of Alstrup et al. (2004).

This completes Komlós' proof that the MST verification problem can be solved using O(m + n) comparisons. An outstanding open problem is to get a really simple linear-time algorithm for this problem. (An algorithm that runs in time O(mα(n)) can be given using the disjoint set union-find data structure.)

1.6 The Ackermann Function

Wilhelm Ackermann defined a fast-growing function that is totally computable but not primitive recursive. Ackermann (1928) (A similar function was defined by Gabriel Sudan, a Romanian mathematician, in 1927.) Today, we use the term Ackermann function A(m, n) to refer to one of many variants that are rapidly-growing and have similar properties. It seems to arise often in algorithm analysis, so let's briefly discuss it here. For illustrative purposes, it is cleanest to define A(m, n) : N × N → N recursively as

  A(m, n) = 2n                        if m = 1,
  A(m, n) = 2                         if n = 1 and m ≥ 2,
  A(m, n) = A(m − 1, A(m, n − 1))     if m, n ≥ 2.

Here are the values of A(m, n) for m, n ≤ 4 (rows indexed by n, columns by m); a small computational sketch follows the table.

  n \ m |  1    2     3                          4
    1   |  2    2     2                          2
    2   |  4    4     4                          4
    3   |  6    8     2^{2^2} = 16               2^{2^{2^2}} = 65536
    4   |  8    16    2^{2^{2^2}} = 65536        a tower 2^{2^{···^2}} of height 65536 (huge!)
    n   |  2n   2^n   a tower of 2s of height n  ...
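Here is a small Python sketch (mine, not from the notes) of the recursive definition above; it reproduces the finite corner of the table. The values grow so quickly that only tiny arguments are computable, which is exactly the point.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def A(m, n):
    """The variant of the Ackermann function defined above (m, n >= 1)."""
    if m == 1:
        return 2 * n
    if n == 1:
        return 2
    return A(m - 1, A(m, n - 1))

# The small corner of the table: rows n = 1..4, columns m = 1..3.
for n in range(1, 5):
    print([A(m, n) for m in range(1, 4)])
# [2, 2, 2]
# [4, 4, 4]
# [6, 8, 16]
# [8, 16, 65536]
```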
We can define the inverse Ackermann function α(·) to be a functional inverse of the diagonal A(n, n); by construction, α(·) grows extremely slowly. For example, α(m) ≤ 4 for all m ≤ 2^{2^{···^2}}, where the tower has height 65536.

1.7 Matroids

To come here. See HW1 as well.

2 Arborescences: Directed Spanning Trees

Greedy algorithms worked very well for the minimum-weight spanning tree problem, as we saw in Chapter 1. In this chapter, we define arborescences, which are a notion of spanning trees for rooted directed graphs. We will see that a naïve greedy approach no longer works, but that only a slightly more sophisticated algorithm is needed to find them efficiently. We give two proofs of correctness for this algorithm. The first is a direct inductive proof; the second makes use of linear programming duality, and highlights its use in analyzing the performance of algorithms. This will be a theme that we return to multiple times in this course.

2.1 Arborescences

Consider a graph G = (V, A, w): here V is a set of vertices, and A a set of directed edges, also known as arcs. (We will use "arcs" instead of "edges" to emphasize the directedness of the graph.) The function w : A → R gives a weight to every arc. Let |V| = n and |A| = m. Once we root G at a node r ∈ V, we can define a "directed spanning tree" with r being the sink/root.

Definition 2.1. An r-arborescence is a subgraph T = (V, A′) with A′ ⊆ A such that

1. Every vertex has a directed path in T to the root r, and
2. Each vertex except r has one outgoing arc; r has none.

(A branching is the directed analog of a forest; it drops the first reachability requirement, and asks only for all non-root vertices to have an outgoing arc.)

Remark 2.2. Observe that T forms a spanning tree in the undirected sense. This property (along with either property 1 or property 2) can alternatively be used to define an arborescence.

Remark 2.3. It's easy to check if an r-arborescence exists: we can reverse the arcs and run a depth-first search from the root. If all vertices are reached, we have produced an r-arborescence.

The focus of this chapter is to find the minimum-weight r-arborescence. We can simplify things slightly by assuming that all of the weights are non-negative. (If there are negative arc weights, add a large positive constant M to every weight. This increases the total weight of each arborescence by M(n − 1), and hence the identity of the minimum-weight one remains unchanged.) Because no outgoing arcs from r will be part of any arborescence, we can assume no such arcs exist in G either. For brevity, we fix r and simply say arborescence when we mean r-arborescence.

2.1.1 The Limitations of Greedy Algorithms

It's natural to ask if greedy algorithms like those in Chapter 1 work for the directed case. E.g., we can try picking the lightest incoming arc into the component containing r, as in Prim's algorithm, but this fails, for example in Figure 2.1. Or we could emulate Kruskal's algorithm and consider arcs in increasing order of weight, adding them if they don't close a directed cycle. (Exercise: give an example where it fails.) The problem is that greedy algorithms (that consider the arcs in some linear order and irrevocably add them in) don't seem to work. However, the algorithm we eventually get will feel like Borůvka's algorithm, but one where we are allowed to revoke some of our past decisions.
2.2 The Chu-Liu/Edmonds/Bock Algorithm

The algorithm we present was discovered independently by Yoeng-Jin Chu and Tseng-Hong Liu, Jack Edmonds, and F. Bock. We will follow Karp's presentation of Edmonds' algorithm. Y.-J. Chu and T.-H. Liu (1965); J. Edmonds (1967); F. Bock (1971); R.M. Karp (1971)

Figure 2.1: A Prim-like algorithm will select the arcs with weights 2 and 3, whereas the optimal choices are the arcs with weights 3 and 1.

Definition 2.4. For a vertex v ∈ V or subset of vertices S ⊆ V, let ∂+v and ∂+S denote the set of arcs leaving the node v and the set S, respectively.

Definition 2.5. For a vertex v ∈ V in graph G, define MG(v) := min_{a∈∂+v} w(a) to be the minimum weight among arcs leaving v in G.

The first step is to create a new graph G′ by subtracting some weight from each outgoing arc of a vertex, such that there is at least one arc of weight 0 leaving it. That is, set w′(a) ← w(a) − MG(v) for all a ∈ ∂+v and each v ∈ V.

Claim 2.6. T is a min-weight arborescence in G ⇐⇒ T is a min-weight arborescence in G′.

Proof. Each arborescence has exactly one arc leaving each vertex. Decreasing the weight of every arc exiting v by MG(v) decreases the weight of every possible arborescence by MG(v) as well. Thus, the set of min-weight arborescences remains unchanged.

Now each vertex has at least one 0-weight arc leaving it. Now, for each vertex, pick an arbitrary 0-weight arc out of it. If this choice is
Tarjan presents an implementation of the above algorithm using priority queues in O(min(m log n, n2 )) time, and Gabow, Galil, Spencer and Tarjan give an algorithm to solve the min-weight arborescence problem in O(n log n + m) time. The best runtime currently known is O(m log log n) due to Mendelson et al.. Open problem 2.9. Is there a linear-time (randomized or deterministic) algorithm to find a min-weight arborescence in a digraph G? 27 Figure 2.2: An example of a possible component after running the first step of the algorithm Figure 2.3: The white node is expanded into a 4-cycle, and the dashed arrow is the arc that is removed after expanding. Figure 2.4: Contracting the two white nodes down to a cycle, and removing arc b. R.E. Tarjan (1971) H.N. Gabow, Z. Galil, T. Spencer and R.E. Tarjan (1986) R. Mendelson, R.E. Tarjan, M. Thorup, and U. Zwick (2006) 28 linear programming methods 2.3 Linear Programming Methods Let us now see an alternate proof of correctness of the algorithm above, this time using linear programming duality. This is how Edmonds originally proved his algorithm to be optimal. If you have access to the Chu-Liu or Bock papers, I would love to see them. 2.3.1 Linear Programming Review Before we actually represent the arborescence problem as a linear program, we first review some standard definitions and results from linear programming. Definition 2.10. For some number of variables (a.k.a. dimension) n ∈ N, number of constraints m ∈ N, objective vector c ∈ Rn , constraint matrix A ∈ Rn×m , and right-hand side b ∈ Rm , a (minimization) linear program (LP) is minimize c⊺ x This form of the LP is called the standard form. More here. subject to Ax ≥ b and x ≥ 0 Note that c⊺ x is the inner product ∑in=1 ci xi . The constraints of a linear program form a polyhedron, which is the convex body formed by the intersection of a finite number of half spaces. Here we have m + n half spaces. There are m of them corresponding to the constraints { a⊺i x ≥ bi }im=1 , where ai ∈ Rn is the vector corresponding to the ith row of the matrix A. Moreover, we have n non-negativity constraints { x j ≥ 0}nj=1 . If the polyhedron is bounded, we call it a polytope. Definition 2.11. A vector x ∈ Rn is called feasible if it satisfies the constraints: i.e., Ax ≥ b and x ≥ 0. Definition 2.12. Given a linear program min{c⊺ x | Ax ≤ b, x ≥ 0}, the dual linear program is maximize b⊺ y subject to A⊺ y ≤ c and y ≥ 0 The dual linear program has a single variable yi for each constraint in the original (primal) linear program. This variable can be thought of as giving an importance weight to the constraint, so that taking a linear combination of constraints with these weights shows that the primal cannot possibly surpass a certain value for c⊺ x. This purpose is exemplified by the following theorem. Theorem 2.13 (Weak Duality). If x and y are feasible solutions to the linear program min{c⊺ x | Ax ≤ b, x ≥ 0} and its dual, respectively, then c⊺ x ≥ b⊺ y. Proof. c⊺ x ≥ ( A⊺ y)⊺ x = y⊺ Ax ≥ y⊺ b = b⊺ y. Whenever we write a vector, we imagine it to be a column vector. arborescences: directed spanning trees This principle of weak duality tells us that if we have feasible solutions x, y where c⊺ x = b⊺ y, then we know that both x and y are optimal solutions. 
Our approach will be to give a linear program that models min-weight arborescences, use the algorithm above to write a feasible solution to the primal, and then to exhibit a feasible solution to the dual such that the primal and dual values are the same—hence both must be optimal! 2.3.2 Arborescences via Linear Programs To analyze the algorithm, we first need to come up with a linear program that “captures” the min-weight arborescence problem. Since we want to find a set of arcs forming an arborescence T, we have one variable x a for each arc a ∈ A. Ideally, each variable will be an indicator for the arc being in the arborescence: i.e., it will binary values: x a ∈ {0, 1}, with x a = 1 if and only if a ∈ T. This choice of variables allows us to express our objective to minimize the total weight: w⊺ x := ∑ a∈ A w( a) x a . Next, we need to come up with a way to express the constraint that T is a valid arborescence. Let S ⊆ V − {r } be a set of vertices not containing the root, and some vertex v ∈ S. Every vertex must be able to reach the root by a directed path. If ∂+ S ∩ T = ∅, there is no arc in T leaving the set S, and hence we have no path from v to r. We conclude that, at a minimum, ∂+ S ∩ T ̸= ∅. We represent this constraint by ensuring that the number of arcs out of S is non-zero, i.e., ∑ xa ≥ 1. a∈∂+ S We write an integer linear programming (ILP) formulation for minweight arborescences as follows: minimize ∑ w( a) x a a∈ A subject to ∑ xa ≥ 1 ∀ S ⊆ V − {r } ∑ ∀v ̸= r a∈∂+ S xa = 1 a∈∂+ v x a ∈ {0, 1} (2.1) ∀ a ∈ A. The following lemma is easy to verify: Lemma 2.14. T is an arborescence of G with x a = 1 a∈T if and only if x is feasible for the integer LP (2.2). Hence the optimal solution to the ILP (2.1) is exactly the min-weight arborescence. 29 See the strong duality theorem in add reference for a converse to this theorem. For now, weak duality will suffice. 30 linear programming methods Relaxing the Boolean integrality constraints gives us the linear programming relaxation: minimize ∑ w( a) x a a∈ A subject to ∑ xa ≥ 1 ∀ S ⊆ V − {r } ∑ xa = 1 ∀v ̸= r a∈∂+ S a∈∂+ v xa ≥ 0 (2.2) ∀ a ∈ A. Since we have relaxed the constraints, the optimal solution to the (fractional) LP (2.2) can only have less value than the ILP (2.1), and hence the optimal value of the LP is at most OPT( G ). In the following, we show that it is in fact equal to OPT( G )! Exercise 2.15. Suppose all the arc weights are non-negative. Show that the optimal solution to the linear program remains unchanged even if drop the constraints ∑ a∈∂+ v x a = 1. 2.3.3 Showing Optimality The output T of the Chu-Liu/Edmonds/Bock algorithm is an arborescence, and hence the associated solution x (as defined in Lemma 2.14) is feasible for ILP (2.1) and hence for LP (2.2). To show that x is optimal, we now exhibit a vector y feasible for the dual linear program with objective equal to w⊺ x. Now weak duality implies that both x and y must be optimal primal and dual solutions. The dual linear program for (2.2) is maximize ∑ yS ∑ yS ≤ w( a) S⊆V −{r } subject to S:a∈∂+ S yS ≥ 0 ∀a ∈ A (2.3) ∀S ⊆ V − {r }, |S| > 1. Observe that yS is unconstrained when |S| = 1, i.e., S corresponds to a singleton non-root vertex. We think of yS as payments raised by vertices inside set S so that we can buy an arc leaving S. In order to buy an arc a, we need to raise w( a) dollars. We’re trying to raise as much money as possible, while not overpaying for any single arc a. Lemma 2.16. 
If arc weights are non-negative, there exists a solution for the dual LP (2.3) such that w⊺ x = 1⊺ y, where all ye values are non-negative. Proof. The proof is by induction over the execution of the algorithm. arborescences: directed spanning trees 31 • The base case is when the chosen zero-weight arcs out of each node form an arborescence. In this case we can set yS = 0 for all S; since all arc weights are non-negative, this is a feasible dual solution. Moreover, both the primal and dual values are zero. • Suppose we subtract M := MG (v) from all arcs leaving vertex v in graph G so that v has at least one zero-weight arc leaving it. Let G ′ be the graph with the new weights, and let T ′ be the optimal solution on G ′ . By induction on G ′ , let y′ be a non-negative solution such that ∑ a∈T ′ we′ = ∑S y′S . Define yv := y′v + M and yS = y′S for all other subsets; this is the desired feasible dual solution for the same tree T = T ′ on the original graph G. Indeed, for one of the arcs a = (v, u) out of the node v, we have ∑ ∑ yS = S:a∈∂+ S ∑ yS + S:a∈∂+ S,|S|=1 = (y′{u} + M) + yS S:a∈∂+ S,|S|≥2 ∑ S:a∈∂+ S,|S|≥2 r 1 y′S Moreover, the value of the dual increases by M, the same as the increase in the weight of the arborescence. 1 2 3 2 3 ∑ S:a∈∂+ S yS = y′{v } + C ∑ S′ :a′ ∈∂+ S′ ,S′ ̸={v C} y′S ≤ w( a′ ) ≤ w( a). This completes the inductive proof Notice that the sets with non-zero weights correspond to singleton nodes, or to the various cycles contracted during the algorithm. Hence these sets form a laminar family; i.e., any two sets S, S′ with non-zero value in y are either disjoint, or one is contained within the other. By Lemma 2.16 and weak duality, we conclude that the solution x and the associated arborescence T is optimal. It is easy to extend the argument to potentially negative arc weights. 2 2 6 • Else, suppose the chosen zero-weight arcs contain a cycle C, which we contract down to a node vC . Using induction for this new graph G ′ , let y′ be the feasible dual solution. For any subset S′ of nodes in G ′ that contains the new node vC , let S = (S′ \ {vC }) ∪ C, and define yS = y′S′ . For all other subsets S in G ′ not containing vC , define yS = y′S . Moreover, for all nodes v ∈ C, define y{v} = 0. The dual value remains unchanged, as does the weight of the solution T obtained by lifting T ′ . The dual constraint changes only arcs of the form a = (v, u), where v ∈ C and u ̸∈ C. But such an arc is replaced by an arc a′ = (vC , u), whose weight is at most w( a). Hence 5 2 3 ≤ M + w ′ ( a ) = M + ( w ( a ) − M ) = w ( a ). 7 7 Figure 2.5: An optimal dual solution: vertex sets are labeled with dual values, and arcs with costs. 32 linear programming methods Corollary 2.17. There exists a solution for the dual LP (2.3) such that w⊺ x = 1⊺ y. Hence the algorithm produces an optimal arborescence even for negative arc weights. Proof. If some arc weights are negative, add M to all arc weights to get the new graph G ′ where all arc weights are positive. Let y′ be the optimal dual for G ′ from Lemma 2.16; define yS = y′S for all sets of size at least two, and y{v} = y′{v} − M for singletons. Note that the weight of the optimal solution on G is precisely M(n − 1) smaller than on G ′ ; the same is true for the total dual value. Moreover, for arc e = (u, v), we have ∑ S:a∈∂+ S yS = ∑ S:a∈∂+ S,|S|≥2 y′S + (y′{u} − M) ≤ (we + M) − M = we . The inequality above uses that y′ is a feasible LP solution for the graph G ′ with inflated arc weights. 
Finally, since the only nonnegative values in the dual solution are for singleton sets, all constraints in (2.2) are satisfied for the dual solution y, this completes the proof. 2.3.4 Integrality of the Polytope The result of Corollary 2.17 is quite exciting: it says that no matter what the objective function of the linear program (i.e., the arc weights w( a)), there is an optimal integral solution to the linear program, which our combinatorial algorithm finds. In other words, the optimal solutions to the LP (2.2) and the ILP (2.1) are the same. We will formally discuss this later in the course, but let us start playing with these kinds of ideas. A good start is to visualize this geometrically: let A ⊆ R| A| be the set of all solutions to the ILP (which correspond to the characteristic vectors of all valid r-arborescences). This is a finite set of points, and let Karb be the convex hull of these points. (It can be shown that Karb is a polytope, though we don’t do it here.) If we optimize a linear function given by some weight vector w over this polytope, we get the optimal arborescence for this weight. This is the solution to ILP (2.1). Moreover, let K ⊆ R| A| be the polytope defined by the constraints in the LP relaxation (2.2). Note that each point in A is contained within K, therefore so is their convex hull K. I.e., Karb ⊆ K. In general, the two polytopes are not equal. But in this case, Corollary 2.17 implies that for this particular setting, the two are indeed equal. Indeed, a geometric hand-wavy argument is easy to make — if K were strictly bigger than Karb , there would be some direction arborescences: directed spanning trees in which K extends beyond Karb . But each direction corresponds to a weight-vector, and hence for that weight vector the optimal solution within K (which is the solution to the LP) would differ from the optimal solution within Karb (which is the solution to the ILP). This contradicts Corollary 2.17. 2.4 Matroid Intersection More to come here, maybe just a forward pointer to a later lecture. 33 3 Shortest Paths in Graphs In this chapter, we look at another basic algorithmic construct: given a graph where edges have weights, find the shortest path between two specified vertices in it. Here the weight of a path is the sum of the weights of the edges in it, and a shortest path is the path with least weight. Or given a source vertex, find shortest paths to all other vertices. Or find shortest paths between all pairs of vertices in the graph. Of course, each harder problem can be solved by multiple calls of the easier ones, but can we do better? Let us give some notation. The input is a graph G = (V, E), with each edge e = uv having a weight/length wuv ∈ R. For most of this chapter, the graphs will be directed: in this case we use the terms edges and arcs interchangeably, and an edge uv is imagined as being directed from u to v (i.e., from left to right). 1. Given a source vertex s, the single-source shortest paths (SSSP) asks for the distances (and the corresponding shortest paths) from s to all vertices in V. 2. The all-pairs shortest paths (APSP) problem asks for the distances between each pair of vertices in V. We will consider both these variants, and give multiple algorithms for both. There is another potential source of complexity: whether the edgeweights are all non-negative, or if they are allowed to take on negative values. In the latter case, we disallow cycles of negative weight, else the shortest-path may not be well-defined. 
This is because a negative cycle allows for ever-smaller shortest paths: we can just run around the cycle to reduce the total weight arbitrarily. 3.1 Single-Source Shortest Path Algorithms The single-source shortest path problem (SSSP) is to find a shortest path from a single source vertex s to every other vertex in the graph. The Given the graph G and edge-weights w, the minimum weight of any path from u to v is often called the distance dw (u, v). We do not consider the s-t-shortestpath problem, since algorithms for that problem also tend to solve the SSSP (on worst-case instances). We could ask for a shortest simple path. However, this problem is NP-hard in general, via a reduction from Hamilton path. 36 single-source shortest path algorithms output of this algorithm can either be the n − 1 numbers giving the weights of the n − 1 shortest paths, or (some compact representation of) these paths. We first consider Dijkstra’s algorithm for the case of non-negative edge-weights, and give the Bellman-Ford algorithm that handles negative weights as well. 3.1.1 Dijkstra’s Algorithm for Non-negative Weights Dijkstra’s algorithm keeps an estimate dist of the distance from s to every other vertex. Initially the estimate of the distance from s to itself is set to 0 (which is correct), and is set to ∞ for all other vertices (which is typically an over-estimate). All vertices are unmarked. Then repeatedly, the algorithm finds an umarked vertex u with the smallest current estimate, marks this vertex (thereby indicating that this estimate is correct), and then updates the estimates for all vertices v reachable by arcs uv thus: dist(v) ← min{dist(v), dist(u) + wuv } We keep all the vertices that are not marked and their estimated distances in a priority queue, and extract the minimum in each iteration. Algorithm 2: Dijkstra’s Algorithm Input: Digraph G = (V, E) with edge-weights we ≥ 0 and source vertex s ∈ G Output: The shortest-path distances from s to each vertex 2.1 add s to heap with key 0 2.2 for v ∈ V \ { s } do 2.3 add v to heap with key ∞ 2.4 while heap not empty do 2.5 u ← deletemin 2.6 for v a neighbor of u do 2.7 key(v) ← min{ key(v), key(u) + wuv } // relax uv To prove the correctness of the algorithm, it suffices to show that each time we extract a vertex u with the minimum estimated distance from the priority queue, the estimate for that vertex u is indeed the distance from s to u. This can be proved by induction on the number of marked vertices, and left as an exercise. Also left as an exercise are the modifications to return the shortest-path tree from node s. The time complexity of the algorithm depends on the priority queue data structure. E.g., if we use binary heap, which incurs O(log n) for decrease-key as well as extract-min operations, we incur a running time of O(m log n). But just like for spanning trees, we can do better with Fibonacci heaps, which implement the decrease-key operation in constant amortized time, and extract-min in O(log n) This update step is often said to relax the edges out of u, which has a nice physical interpretation. Indeed, any edge uv for which the dist(v) is strictly bigger than dist(u) + wuv can be imagined to be over-stretched, which this update fixes. shortest paths in graphs time. Since Dijkstra’s algorithm uses n inserts, n delete-mins, and m decrease-keys, this improves the running time to O(m + n log n). There have been many other improvements since Dijkstra’s original work. If the edge-weights are integers in {0, . . . 
, C }, a clever priority queue data structure of Peter van Emde Boas can be used instead; this implements all operations in time O(log log C ). Carefully p using it can give us runtimes of O(m log log C ) and O(m + n log C ) (see Ahuja et al.). Later, showed a faster implementation for the case that the weights are integer, which has the running time of O(m + n log log(n)) time. Currently, latest results to come here. 37 Dijkstra (1959) Dijkstra’s paper also gives his version of the Járnik/Prim MST algorithm. The two algorithms are not that different, since the MST algorithm merely changes the update rule to dist(v) ← min{dist(v), wuv }. P. van Emde Boas (1975) Ahuja et al. (1990) M. Thorup (2004) 3.1.2 The Bellman-Ford Algorithm Dijkstra’s algorithm does not work on instances with negative edge weights; see the example on the right. For such instances, we want that a correct SSSP algorithm to either return the distances from s to all other vertices, or else find a negative-weight cycle in the graph. The most well-known algorithm for this case is the ShimbelBellman-Ford algorithm. Just like Dijkstra’s algorithm, this algorithm also starts with an overestimate of the shortest path to each vertex. However, instead of relaxing the out-arcs from each vertex once (in a careful order), this algorithm relaxes the out-arcs of all the vertices n − 1 times, in round-robin fashion. Formally, the algorithm is the following. (A visualization can be found at visualgo.net.) Algorithm 3: The Bellman-Ford Algorithm Input: A digraph G = (V, E) with edge weights we ∈ R, and source vertex s ∈ V Output: The shortest-path distances from s to each vertex, or report that a negative-weight cycle exists 3.1 dist ( s ) = 0 // the source has distance 0 3.2 for v ∈ V do 3.3 dist(v) ← ∞ 3.4 for |V | iterations do 3.5 for edge e = (u, v) ∈ E do 3.6 dist(v) ← min{ dist(v), dist(u) + weight(e)} 3.7 If any distances changed in the last (nth ) iteration, output “G has a negative weight cycle”. The proof relies on the following lemma, which is easily proved by induction on i. Lemma 3.1. After i iterations of the algorithm, dist(v) equals the weight of the shortest-path from s to v containing at most i edges. (This is defined to be ∞ if there are no such paths.) If there is no negative-weight cycle, then the shortest-paths are a 3 s 1 5 −3 t b Figure 3.1: Example with negative edge-weights: Dijkstra’s algorithm gives a label of 4 for t, whereas the correct answer is 3. This algorithm also has a complicated history. The algorithm was first stated by Shimbel in 1954, then Moore in ’57, Woodbury and Dantzig in ’57, and finally by Bellman in ’58. Since it used Ford’s idea of relaxing edges, the algorithm “naturally” came to be known as Bellman-Ford. 38 the all-pairs shortest paths problem (apsp) well-defined and simple, so a shortest-path contains at most n − 1 edges. Now the algorithm is guaranteed to be correct after n − 1 iterations by Lemma 3.4; moreover, none of the distances will change in the nth iteration. However, suppose the graph contains a negative cycle that is reachable from the source. Then the labels dist(u) for vertices on this cycle continue to decrease in each subsequent iteration, because we may reach to any point on this cycle and by moving in that cycle we can accumulate negative distance; therefore, the distance will get smaller and smaller in each iteration. Specifically, they will decrease in the nth iteration, and this decrease signals the existence of a negative-weight cycle reachable from s. 
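To make this detection step concrete, here is a minimal sketch of Algorithm 3 with the extra n-th pass (a sketch, not the notes' code); it assumes vertices are numbered 0, ..., n-1 and the graph is given as a list of arcs (u, v, w).

```python
INF = float("inf")

def bellman_ford(n, arcs, s):
    """Relax all arcs n-1 times; any improvement in an n-th pass certifies a
    negative-weight cycle reachable from the source s."""
    dist = [INF] * n
    dist[s] = 0
    for _ in range(n - 1):
        for (u, v, w) in arcs:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    for (u, v, w) in arcs:          # the n-th pass
        if dist[u] + w < dist[v]:
            raise ValueError("negative-weight cycle reachable from s")
    return dist
```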
(Note that if none of the negative-weight cycles C are reachable from s, the algorithm outputs a correct solution despite C’s existence, and it will produce the distance of ∞ for all the vertices in that cycle.) The runtime is O(mn), since each iteration of Bellman-Ford looks at each edge once, and there are n iterations. This is still the fastest algorithm known for SSSP with general edge-weights, even though faster algorithms are known for some special cases (e.g., when the graph is planar, or has some special structure, or when the edge weights are “well-behaved”). E.g., for the case where all edge weights are integers in the range [−C, ∞), we can compute SSSP in time √ O(m n log C ), using an idea we may discuss in Homework #1. And very recently, ideas using low-diameter decompositions, which we will see in the very next lecture, have been used to give near-linear time algorithms; their runtime is O(m log C poly log n). 3.2 The All-Pairs Shortest Paths Problem (APSP) The obvious way to do this is to run an algorithm for SSSP n times, each time with a different vertex being the source. This gives an O(mn + n2 log n) runtime for non-negative edge weights (using n runs of Dijkstra), and O(mn2 ) for general edge weights (using n runs of Bellman-Ford). Fortunately, there is a clever trick to bypass this extra loss, and still get a runtime of O(mn + n2 log n) with general edge weights. This is known as Johnson’s algorithm, which we discuss next. d 4 3.2.1 Johnson’s Algorithm and Feasible Potentials The idea behind this algorithm is to (a) re-weight the edges so that they are nonnegative yet preserve shortest paths, and then (b) run n instances of Dijkstra’s algorithm to get all the shortest-path distances. A simple-minded hope (based on our idea for MSTs) would be to add s 1 1 −1 1 a b Figure 3.2: A graph with negative edges in which adding positive constant to all the edges will change the shortest paths c shortest paths in graphs a positive number to all the weights to make them positive. Although this preserves MSTs, it doesn’t preserve shortest paths. For instance, the example on the right has a single negative-weight edge. Adding 1 to all edge weights makes them all have non-negative weights, but the shortest path from s to d is changed. Don Johnson gave a algorithm that does the edge re-weighting in a slightly cleverer way, using the idea of feasible potentials. Loosely, it runs the Bellman-Ford algorithm once, then uses the information gathered to do the re-weighting. At first glance, the concept of a feasible potential does not seem very useful. It is just an assignment of weights ϕv to each vertex v of the graph, with some conditions: 39 D.B. Johnson (1977) Lex Schrijver attributes the idea of using potentials to T. Gallai (1958). Definition 3.2. For a weighted digraph G = (V, A), a function ϕ : V → R is a feasible potential if for all edges e = uv ∈ A, ϕ(u) + wuv − ϕ(v) ≥ 0. Given a feasible potential, we can transform the edge-weights of the graph from wuv to buv := wuv + ϕ(u) − ϕ(v). w Observe the following facts: b are all positive. This comes from the definition 1. The new weights w of the feasible potential. 2. Let Pab be a path from a to b. Let ℓ( Pab ) be the length of Pab when we use the weights w, and ℓ̂( Pab ) be its length when we use the b Then weights w. bℓ( Pab ) = ℓ( Pab ) + ϕa − ϕb . The change in the path length is ϕa − ϕb , which is independent of b preserve the shortest a-to-b paths, the path. So the new weights w only changing the length by ϕa − ϕb . 
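To spell out why the change in path length is independent of the path, write $P_{ab}$ as $a = v_0, v_1, \ldots, v_k = b$; the potentials telescope:
\[
\hat{\ell}(P_{ab}) \;=\; \sum_{i=0}^{k-1} \Big( w_{v_i v_{i+1}} + \phi(v_i) - \phi(v_{i+1}) \Big)
\;=\; \sum_{i=0}^{k-1} w_{v_i v_{i+1}} \;+\; \phi(v_0) - \phi(v_k)
\;=\; \ell(P_{ab}) + \phi_a - \phi_b .
\]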
This means that if we find a feasible potential, we can compute the b and then run Dijkstra’s algorithm on the remaining new weights w, graph. But how can we find feasible potentials? Here’s the short answer: Bellman-Ford. Indeed, suppose there some source vertex s ∈ V such that every vertex in V is reachable from s. Then, set ϕ(v) = dist(s, v). Lemma 3.3. Given a digraph G = (V, A) with vertex s such that all vertices are reachable from s, ϕ(v) = dist(s, v) is a feasible potential for G. Proof. Since every vertex is reachable from s, dist(s, v) and therefore ϕ(v) is well-defined. For an edge e = uv ∈ A, taking the shortest It is cleaner (and algorithmically simpler) to just add a new vertex s and add zero-weight edges from it to all the original vertices. This does not change any of the original distances, or create any new cycles. 40 the all-pairs shortest paths problem (apsp) path from s to u, and adding on the arc uv gives a path from s to v, whose length is ϕ(u) + wuv . This length is at least ϕ(v), the length of the shortest path from s to v, and the lemma follows. In summary, the algorithm is the following: Algorithm 4: Johnson’s Algorithm Input: A weighted digraph G = (V, A) Output: A list of the all-pairs shortest paths for G ′ 4.1 V ← V ∪ { s } // add a new source vertex ′ 4.2 A ← E ∪ {( s, v, 0) | v ∈ V } ′ ′ 4.3 dist ← BellmanFord((V , A )) // set feasible potentials for e = (u, v) ∈ A do 4.5 weight(e)+ = dist(u) − dist(v) 4.6 L = [] 4.7 for v ∈ V do 4.8 L+ = Dijkstra(G, v) 4.9 return L 4.4 // the result We now bound the running time. Running Bellman-Ford once b requires takes O(mn) time, computing the “reduced” weights w O(m) time, and the n Dijkstra calls take O(n(m + n log n)), if we use Fibonacci heaps. Therefore, the overall running time is O(mn + n2 log n)—almost the same as one SSSP computation, except on very sparse graphs with m = o (n log n). 3.2.2 More on Feasible Potentials How did we decide to use the shortest-path distances from s as our feasible potentials? Here’s some more observations, which give us a better sense of these potentials, and which lead us to the solution. 1. If all edge-weights are non-negative, then ϕ(v) = 0 is a feasible potential. 2. Adding a constant to a feasible potential gives another feasible potential. 3. If there is a negative cycle in the graph, there can be no feasible potential. Indeed, the sum of the new weights along the cycle is the same as the sum of the original weights, due to the telescoping sum. But since the new weights are non-negative, so the old weight of the cycle must be, too. 4. If we set ϕ(s) = 0 for some vertex s, then ϕ(v) for any other vertex v is an underestimate of the s-to-v distance. This is because for all shortest paths in graphs 41 the paths from s to v we have 0 ≤ bℓ( Psv ) = ℓ( Psv ) − ϕv + ϕs = ℓ( Psv ) − ϕv , giving ℓ( Psv ) ≥ ϕv . Now if we try to set ϕ(s) to zero and try to maximize summation of ϕ(v) for other vertices subject to the feasible potential constraints we will get an LP that is the dual of the shortest path LP. Maximize ∑ ϕx x ∈V Subject to ϕs = 0 wvu + ϕv − ϕu ≥ 0 3.2.3 ∀(v, u) ∈ E The Floyd-Warshall Algorithm The Floyd-Warshall algorithm is perhaps best introduced via its strikingly simple pseudocode. It first puts down estimates dist(u, v) for the distances thus:    wij , i, j ∈ E distij = ∞ i, j ∈ / E, i ̸= j .    0, i=j The naming of this algorithm does not disappoint: it was discovered by Bernard Roy, Stephen Warshall, Bob Floyd, and others. 
The name tells only a small part of the story. Then it runs the following series of updates. Algorithm 5: The Floyd-Warshall Algorithm Input: A weighted digraph D = (V, A) Output: A list of the all-pairs shortest paths for D 5.1 set d ( x, y ) ← w xy if ( x, y ) ∈ E, else d ( x, y ) ← ∞ 5.2 for z ∈ V do 5.3 for x, y ∈ V do 5.4 d( x, y) ← min{d( x, y), d( x, z) + d(z, y)} Importantly, we run over the “inner” index z in the outermost loop. The proof of correctness is similar to, yet not that same as that of Algorithm 3, and is again left as a simple exercise in induction. Lemma 3.4. After we have considered vertices Vk = {z1 , . . . , zk } in the outer loop, dist(u, v) equals the weight of the shortest x-y path that uses only the vertices from Vk as internal vertices. (This is ∞ if there are no such paths.) The running time of Floyd-Warshall is clearly O(n3 )—no better than Johnson’s algorithm. But it does have a few advantages: it is simple, and it is quick to implement with minimal errors. (The most common error is nesting the for-loops in reverse.) Another advantage is that Floyd-Warshall is also parellelizable, and very cache efficient. Actually, this paper of Hide, Kumabe, and Maehara (2019) shows that even if you get the loops wrong, but you run the algorithm a few more times, it all works out in the end. But that proof requires a bit more work. 42 min-sum products and apsps 3.3 Min-Sum Products and APSPs A conceptually different way to get shortest-path algorithms is via matrix products. These may not seem relevant, a priori, but they lead to deep insights about the APSP problem. Recall the classic definition of matrix multiplication, for two realvalued matrices A, B ∈ Rn×n n ( AB)ij = ∑ ( Aik ∗ Bkj ). k =0 Hence, each entry of the product AB is a sum of products, both being the familar operations over the field (R, +, ∗). But now, what if we change the constituent operations, to replace the sum with the min operation, and the product with a sum? We get the Min-Sum Product(MSP): given matrices A, B ∈ Rn×n , the new product is ( A ⊚ B)ij = min{ Aik + Bkj }. k This is the usual matrix multiplication, but over the semiring (R, min, +). It turns out that computing Min-Sum Products is precisely the operation needed for the APSP problem. Indeed, initialize a matrix D exactly as in the Floyd-Warshall algorithm:    wij , i, j ∈ E Dij =    ∞ 0, i, j ∈ / E, i ̸= j . i=j Now ( D ⊚ D )ij represents the cheapest i-j path using at most 2 hops! (It’s as though we made the outer-most loop of Floyd-Warshall into the inner-most loop.) Similarly, we can compute D ⊚k : = D ⊚ D ⊚ D · · · ⊚ D , | {z } k −1 MSPs whose entries give the shortest i-j paths using at most k hops (or at most k − 1 intermediate nodes). Since the shortest paths would have at most n − 1 hops, we can compute D⊚n−1 . How much time would this take? The very definition of MSP shows how to implement it in O(n3 ) time. But performing it n − 1 times would be O(n) worse than all other approaches! But here’s a classical trick, which probably goes back to the Babylonians: for any integer k, D⊚2k = D⊚k ⊚ D⊚k . (Here we use that the underlying operations are associative.) Now it is a simple exercise to compute D⊚n−1 using at most 2 log2 n MSPs. A semiring has a notion of addition and one of multiplication. However, neither the addition nor the multiplication operations are required to have inverses. 
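As a concrete (if naive) rendering of the doubling idea (a sketch, not the notes' code), assume D is given as a list of lists with float('inf') for non-edges and 0 on the diagonal:

```python
INF = float("inf")

def min_sum_product(A, B):
    """(A ⊚ B)[i][j] = min_k (A[i][k] + B[k][j]), the product over (min, +)."""
    n = len(A)
    C = [[INF] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            if A[i][k] == INF:
                continue
            for j in range(n):
                if A[i][k] + B[k][j] < C[i][j]:
                    C[i][j] = A[i][k] + B[k][j]
    return C

def apsp_by_squaring(D):
    """Compute D^{⊚(n-1)} using O(log n) min-sum products."""
    n, power = len(D), 1
    while power < n - 1:
        D = min_sum_product(D, D)   # D^{⊚ 2k} = D^{⊚ k} ⊚ D^{⊚ k}
        power *= 2
    return D
```

Because the diagonal entries are 0, D^{⊚k} equals D^{⊚(n-1)} for every k ≥ n - 1 (assuming no negative cycles), so overshooting the exponent in the loop is harmless.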
This gives a runtime of O(MSP(n) log n), where MSP(n) is the time it takes to compute the min-sum product of two n × n matrices. Now using the naive implementation of MSP gives a total runtime of O(n^3 log n), which is almost in the right ballpark! The natural question is: can we implement MSPs faster? (In fact, with some more work, we can implement APSP in time O(MSP(n)); you will probably see this in a homework.)

3.3.1 Faster Algorithms for Matrix Multiplication

Can we get algorithms for MSP that run in time O(n^{3-ε}) for some constant ε > 0? To answer this question, we first consider the more common setting of matrix multiplication over the reals (or over some other field). Here the answer is yes, and it has been known for over fifty years. In 1969, Volker Strassen showed that one can multiply n × n matrices over any field F using O(n^{log₂ 7}) = O(n^{2.81}) additions and multiplications. (One can allow divisions as well, but Strassen showed that divisions do not help asymptotically.) V. Strassen, Gaussian elimination is not optimal, Numer. Math. 13 (1969). Mike Paterson has a beautiful but still mysterious geometric interpretation of the sub-problems Strassen comes up with, and how they relate to Karatsuba's algorithm for multiplying numbers.

If we define the exponent of matrix multiplication ω > 0 to be the smallest real such that two n × n matrices over any field F can be multiplied in time O(n^ω), then Strassen's result can be phrased as saying ω ≤ log₂ 7. (Technically ω is an infimum; details and references to be added.) This value, and Strassen's idea, has been refined over the years, to its current value of 2.3728 due to François Le Gall (2014); the big improvements along the way were due to Arnold Schönhage (1981) and Don Coppersmith and Shmuel Winograd (1990), with recent refinements by Andrew Stothers, CMU alumna Virginia Vassilevska Williams, and François Le Gall (2014). (See this survey by Virginia for a discussion of algorithmic progress until 2013.) There has been a flurry of work on lower bounds as well, e.g., by Josh Alman and Virginia Vassilevska Williams, showing limitations for all known approaches.

But how about MSP(n)? Sadly, progress here has been less impressive. Despite much effort, we do not even know whether it can be done in O(n^{3-ε}) time. In fact, most of the recent work gives evidence that truly sub-cubic algorithms for MSP and APSP may not be possible: there is an interesting theory of hardness within P developed around this problem and related ones. For instance, it is now known that several problems are equivalent to APSP, and a truly sub-cubic algorithm for any one of them would lead to sub-cubic algorithms for all. Yet there is some interesting progress on the positive side, albeit quantitatively small. As far back as 1976, Fredman showed how to compute MSP in O(n^3 · (log log n)/(log n)) time (M.L. Fredman, 1976). He used the fact that the decision-tree complexity of APSP is sub-cubic (a result we will discuss in §3.5) to speed up computations over nearly-logarithmic-sized sub-instances; this gives the improvement above. More recently, another CMU alumnus, Ryan Williams, improved on this idea quite substantially to O(n^3 / 2^{√(log n)}), using very interesting ideas from circuit complexity (R.R. Williams, 2018). We will discuss this result in a later section, if we get a chance.
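For completeness, here is a sketch of the recursion behind Strassen's bound (not from the notes): split each matrix into four half-size blocks and combine seven recursive products instead of the obvious eight. It assumes square numpy matrices whose side is a power of two (pad with zeros otherwise).

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's recursion: 7 half-size multiplications per level."""
    n = A.shape[0]
    if n <= cutoff:                 # fall back to the naive product
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)

    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The recurrence T(n) = 7 T(n/2) + O(n^2) solves to O(n^{log₂ 7}) ≈ O(n^{2.81}).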
3.4 Undirected APSP Using Fast Matrix Multiplication One case where we know truly sub-cubic APSP algorithms is that of graphs with small integer edge-weights. Our focus here will be on the case of unit-weighted undirected graphs: we present an algorithm of Raimund Seidel that runs in time O(nω log n), assuming that ω > 2. This elegant algorithm showcases the smart use of matrix multiplication in graph problems. 3.4.1 The Square of G As always, the adjacency matrix A for the simple graph G is the symmetric matrix  1 ij ∈ E Aij = . 0 ij ∈ /E Now consider the graph G2 , the square of G, which has the same vertex set as G but where an edge in G2 corresponds to being at most two hops away in G—that is, uv ∈ E( G2 ) ⇐⇒ dG (u, v) ≤ 2. To construct the adjacency matrix for G2 from that of A, we can use the following idea: 1. Consider B := AG × AG ; this matrix product takes nω time. 2. Since Bij = ∑k Aik Akj counts the number of two-hop paths in A, we can define ( AG2 )ij := ( Bij > 0) ∨ ( Aij > 0). This transformation takes an additional O(n2 ) time. 3.4.2 Relating Shortest Paths in G and G2 Suppose we recursively compute APSP on G2 : how can we translate this result back to G? The next lemma shows that the shortest-path distances in G2 are nicely related to those in G. Lemma 3.5. If d xy and Dxy are the shortest-path distances between x, y in G and G2 respectively, then Dxy =   d xy . 2 R. Seidel (1995) shortest paths in graphs Proof. Any u-v path in G can be written as u, a1 , b1 , a2 , b2 , . . . , ak , bk , v if the path has odd length; an even-length path can be written as u, a1 , b1 , a2 , b2 , . . . , ak , bk , ak+1 , v. In either case, G2 has edges ub1 , b1 b2 , . . . , bk−1 bk , bk v, and thus a u-v d d path of length ⌈ 2xy ⌉ in G2 . Therefore Dxy ≤ ⌈ 2xy ⌉. d To show equality, suppose there is a u-v path of length ℓ < ⌈ 2xy ⌉ in G2 . Each of these ℓ edges corresponds to either an edge or a 2-edge path in G, so we can find a u-v path of length at most 2ℓ < d xy in G, a contradiction. Lemma 3.5 implies that duv ∈ {2Duv , 2Duv − 1}. But which one? The following lemmas give us simple rule to decide. Let NG (v) denote the set of neighbors of v in G. Lemma 3.6. If duv = 2Duv , then for all w ∈ NG (v) we have Duw ≥ Duv . Proof. Assume not, and let w ∈ NG (v) be such that Duw < Duv . Since both of them are integers, we have 2Duw < 2Duv − 1. Then the shortest u-w path in G along with the edge wv forms a u-v-path in G of length at most 2Duw + 1 < 2Duv = duv , which is in contradiction with the assumption that duv is the shortest path in G. Lemma 3.7. If duv = 2Duv − 1, then Duw ≤ Duv for all w ∈ NG (v); moreover, there exists z ∈ NG (v) such that Duz < Duv . Proof. For any w ∈ NG (v), considering the shortest u-v path in G along with the edge vw implies that duw ≤ duv + 1 = (2Duv − 1) + 1, so Lemma 3.5 gives that Duw = ⌈duw /2⌉ = Duv . For the second claim, consider a vertex z ∈ NG (v) on a shortest path from u to v. Then duz = duv − 1, and Lemma 3.5 gives Duz < Duv . These lemmas can be summarized thus: Corollary 3.8. If deg( j) = | NG ( j)| is the degree of j, then duv = 2Duv ⇐⇒ ∑w∈ N (v) Duw ≥ Duv , deg(v) (3.1) Where did we use that G was undirected? In Lemma 3.6 we used that w ∈ NG (v) =⇒ wv ∈ E. And in Lemma 3.7 we used that w ∈ NG (v) =⇒ vw ∈ E. 
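Both matrix steps above translate directly into operations on the adjacency matrix. The sketch below (not the notes' code) assumes A is a 0/1 numpy array for a connected graph (so every degree is positive) and D is the integer distance matrix of G².

```python
import numpy as np

def square_graph(A):
    """Adjacency matrix of G^2: u, v adjacent iff d_G(u, v) <= 2 and u != v."""
    B = A @ A                              # B[u][v] counts two-hop walks u -> v
    A2 = ((B > 0) | (A > 0)).astype(int)
    np.fill_diagonal(A2, 0)                # keep the graph simple: no self-loops
    return A2

def distances_from_squared(D, A):
    """Recover d_G from the distances D in G^2, using Corollary 3.8."""
    deg = A.sum(axis=1)
    S = D @ (A / deg)                      # S[u][v] = average of D[u][w] over w in N(v)
    return 2 * D - (S < D).astype(int)     # d_uv = 2 D_uv - 1 exactly when the average dips below D_uv
```

The first function is the construction of §3.4.1; the second applies the parity rule of Corollary 3.8, which is exactly the step Seidel's algorithm performs after its recursive call.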
45 46 undirected apsp using fast matrix multiplication 3.4.3 Using Matrix Multiplication One More Time Given D, the criterion on the right can be checked for each uv in time deg(v) by just computing the average, but that could be too slow— how can we do better? Define the normalized adjacency matrix of G to b with be A 1 bwv = 1wv∈E · A . deg(v) Now if D is the distance matrix of G2 , then b)uv = ∑ Duw A bwv = (D A w ∈V ∑w∈ NG (v) Duw , deg(v) which is conveniently the expression in (3.1). Let 1( D Ab< D) be a matrix b)uv < Duv , and zero otherwise. Then with the uv-entry being 1 if ( D A the distance matrix for G is 2D − 1( D Ab< D) . This completes the algorithm, which we now summarize: Algorithm 6: Seidel’s Algorithm Input: Unweighted undirected graph G = (V, E) with adjacency matrix A Output: The distance matrix for G 6.1 if A = J then 6.2 return A // If A is all-ones matrix, done! 6.3 else 6.4 A′ ← A ∗ A + A // Boolean operations 6.5 D ← Seidel(A′ ) 6.6 return 2D − 1( D Ab< D) Each call to the procedure above performs one Boolean matrix multiplication in step (6.4), one matrix multiplication with rational entries in step (6.6), plus O(n2 ) extra work. The diameter of the graph halves in each recursive call (by Lemma 3.5), and the algorithm hits the base case when the diameter is 1. Hence, the overall running time is O(nω log n). Ideas similar to these can be used to find shortest paths graphs with small integer weights on the edges: if the weights are integers in the interval [0, W ], Avi Shoshan and Uri Zwick give an Õ(Wnω )-time algorithm. In fact, Zwick also extends the graphs,  ideas to directed  1 1 and gives an algorithm with runtime Õ W 4−ω n2+ 4−ω . 3.4.4 Finding the Shortest Paths How do we find the shortest paths themselves, and not just their lengths? For the previous algorithms, modifying the algorithms to Shoshan and U. Zwick (1999) U. Zwick (2000) shortest paths in graphs output the paths is fairly simple. But for Seidel’s algorithm, things get tricky. Indeed, since the runtime of Seidel’s algorithm is strictly sub-cubic, how can we write down the shortest paths in nω time, since the total length of all these paths may be Ω(n3 )? We don’t: we just write down the successor pointers. Indeed, for each pair u, v, define Sv (u) to be the second node on a shortest u-v path (the first node being u, and the last being v). Then to get the entire u-v shortest path, we just follow these pointers: u, Sv (u), Sv (Sv (u)), . . . , v. So there is a representation of all shortest paths that uses at most O(n2 log n) bits. The main idea for computing the successor matrix for Seidel’s algorithm is to solve the Boolean Product Matrix Witness problem: given n × n Boolean matrices A, B, compute an n × n matrix W such that Wuv = k if Aik = Bkj = 1, and Wij = 0 if no such k exists. We will hopefully see (and solve) this problem in a homework. 3.5 Optional: Fredman’s Decision-Tree Complexity Bound Given the algorithmic advances, one may wonder about lower bounds for the APSP problem. There is the obvious Ω(n2 ) lower bound from the time required to write down the answer. Maybe even the decision-tree complexity of the problem is Ω(n3 )? Then no algorithm can do any faster, and we’d have shown the Floyd-Warshall and the Matrix-Multiplication methods are optimal. However, thanks to a result of Michael Fredman, we know this is not the case. If we just care about the decision-tree complexity, we can get much better. Specifically, Fredman shows Theorem 3.9. 
The Min-Sum Product of two n × n matrices A, B can be deduced in O(n2.5 ) additions and comparisons. Proof. The proof idea is to split A and B into rectangular sub-matrices, and compute the MSP on the sub-matrices. Since these sub-matrices are rectangular, we can substantially reduce the number of comparisons needed for each one. Once we have these sub-MSPs, we can simply compute an element-wise minimum for find the final MSP. Fix a parameter W which we determine later. Then divide A into n/W n × W matrices A1 , . . . , An/W , and divide B into n/W W × n submatrices B1 , . . . , Bn/W . We will compute each Ai ⊚ Bi . Now T) consider ( A ⊚ B)ij = mink∈[W ] ( Aik + Bkj ) = mink∈[W ] ( Aik + Bjk ∗ and let k be the minimizer of this expression. Then we have the M.L. Fredman (1976) 47 48 optional: fredman’s decision-tree complexity bound following: T T Aik∗ − Bjk ∗ ≤ Aik − B jk ∀ k T T Aik∗ − Aik ≤ −( Bjk ∗ − B jk ) ∀ k (3.2) (3.3) Now for every pair of columns, p, q from Ai , BiT , and sort the following 2n numbers A1p − Aiq , A2p − A2q , . . . , Anp − Anq , −( B1p − B1q ), . . . , −( Bnp − Bnq ) We claim that by sorting W 2 lists of numbers we can compute Ai ⊚ Bi . To see this, consider a particular entry ( A ⊚ B)ij and find a k∗ T − BT ) such that for every k ∈ [W ], Aik∗ − Aik precedes every −( Bjk ∗ jk in their sorted list. By (3.3), such a k∗ is a minimizer. Then we can set ( A ⊚ B)ij = Aik∗ + Bk∗ j . This computes the MSP for Ai , Bi , but it is possible that another A j ⊚ Bj produces the actual minimum. So, we must take the element-wise minimum across all the ( Ai ⊚ Bi ). This produces the MSP of A, B. Now for the number of comparisons. We have n/W smaller products to compute. Each sub-product has W 2 arrays to sort, each of which can be sorted in 2n log n comparisons. Finding the minimizer requires W 2 n comparisons.So, computing the sub-products requires n/W ∗ 2W 2 n log n = 2n2 W log n comparisons. Then, reconstructing the final MSP requires n2 element-wise minimums between n/W − 1 elements, which requires n3 /W comparisons. Summing these bounds gives us n3 /W + 2n2 W log n comparisons. Optimizing over W gives p us O(n2 n log n) comparisons. This result does not give us a fast algorithm, since it just counts the number of comparisons, and not the actual time to figure out which comparisons to make. Regardless, many of the algorithms that achieve n3 / poly log n time for APSP use Fredman’s result on tiny instances (say of size O(poly log n), so that we can find the best decision-tree using brute-force) to achieve their results. 4 Low-Stretch Spanning Trees Given that shortest paths from a single source node s can be represented by a single shortest-path tree, can we get an analog for allpairs shortest paths? Given a graph can we find a tree T that gives us the shortest-path distances between every pair of nodes? Does such a tree even exist? Sadly, the answer is negative—and it remains negative even if we allow this tree to stretch distances by a small factor, as we will soon see. However, we show that allowing randomization will allow us to circumvent the problems, and get low-stretch spanning trees in general graphs. In this chapter, we consider undirected graphs G = (V, E), where each edge e has a non-negative weight/length we . For all u, v in V, let dG (u, v) be the distance between u, v, i.e., the length of a shortest path in G from u to v. Observe that the set V along with the distance function dG forms a metric space. 
4.1 Towards a Definition The study of low-stretch spanning trees is guided by two high level hopes: 1. Graphs have spanning trees that preserve their distances. That is, given G there exists a subtree T = (V, ET ) with ET ⊆ E such that dG (u, v) ≈ d T (u, v) for all u, v ∈ V. 2. Many NP-hard problems are much easier to solve on trees. Supposing these are true, we have a natural recipe for designing algorithms to solve problems that depend only on distances in G: (1) find a spanning tree T preserving distances in G, (2) solve the problem on T, and then (3) return the solution (or some close cousin) with the hope that it is a good solution for the original graph. A metric space is a set V with a distance function d satisfying symmetry (i.e., d( x, y) = d(y, x ) for all x, y ∈ V) and the triangle inequality (d( x, y) ≤ d( x, z) + d(z, y) for all x, y, z ∈ V). Typically, the definition also asks for x = y ⇐⇒ d( x, y) = 0, but we will merely assume d( x, x ) = 0 for all x. We assume that the weights of edges in ET are the same as those in G. 50 towards a definition 4.1.1 An All-Pairs Shortest Path Tree? The boldest hope would be to find an all-pairs shortest path tree T, i.e., one that ensures d T (u, v) = dG (u, v) for all u, v in V. However, such a tree may not exist: consider Kn , the clique of n nodes, with unit edge lengths. The distance dG satisfies dG ( x, y) = 1 for all x ̸= y, and zero otherwise. But any subtree T contains only n − 1 edges, so most pairs of vertices x, y ∈ V lack an edge between them in T. Any such pair has a shortest-path distance d T ( x, y) ≥ 2, whereas dG ( x, y) = 1. 4.1.2 A First Relaxation: Low-Stretch Spanning Trees To remedy the snag above, let us not require distances in T be equal to those in G, but instead be within a small multiplicative factor α ≥ 1 of those in G. Definition 4.1. Let T be a spanning tree of G, and let α ≥ 1. We call T a (deterministic) α-stretch spanning tree of G if dG (u, v) ≤ d T (u, v) ≤ α dG (u, v). holds for all u, v ∈ V. Supposing we had such a low-stretch spanning tree, we could try our meta-algorithm out on the traveling salesperson problem (TSP): given a graph, find a closed tour that visits all the vertices, and has the smallest total length. This problem is NP-hard in general, but let us see how an α-stretch spanning tree of G gives us an an α-approximate TSP solution for G. The algorithm is simple: Algorithm 7: TSP via Low-Stretch Spanning Trees Find an α-stretch spanning tree T of G. Solve TSP on T to get an ordering π T on the vertices. 7.3 return the ordering π T . 7.1 7.2 Solving the TSP problem on a tree T is trivial: just take an Euler tour of T, and let π T be the order in which the vertices are visited. Let us bound the quality of this solution. Claim 4.2. π T is an α-approximate solution to the TSP problem on G. Proof. Suppose that the permutation πG minimizes the length of the TSP tour for G. The length of the resulting tour is OPTG := ∑ dG (πG (i ), πG (i + 1)). i ∈[n] Exercise: show that if T is any subtree of G with the same edge weights, then dG ( x, y) ≤ d T ( x, y). low-stretch spanning trees 51 Since distances in the tree T are stretched by only a factor of α, ∑ dT (πG (i), πG (i + 1)) ≤ α · ∑ dG (πG (i), πG (i + 1)). i ∈[n] (4.1) i ∈[n] Now, since π T is the optimal ordering for the tree T, and πG is some other ordering, ∑ dT (πT (i), πT (i + 1)) ≤ ∑ dT (πG (i), πG (i + 1)). 
i ∈[n] | {z OPTT } (4.2) i ∈[n] Finally, since distances were only stretched in going from G to T, ∑ dG (πT (i), πT (i + 1)) ≤ ∑ dT (πT (i), πT (i + 1)). i ∈[n] (4.3) i ∈[n] Putting it all together, the length of the tour given by π T is ∑ dG (πT (i), πT (i + 1)) ≤ α · ∑ dG (πG (i), πG (i + 1)), i ∈[n] i ∈[n] which is α · OPTG . Hence, if we had low-stretch spanning trees T with α ≤ 1.49, we would get the best approximation algorithm for the TSP problem. (Assuming we can find T, but we defer this for now.) However, you may have already noticed that the Kn example above shows that α < 2 is impossible. But can we achieve α = 2? Indeed, is there any “small” value for α such that for any graph G we can find an α-stretch spanning tree of G? Sadly, things are terrible: take the cycle Cn , again with unit edge weights. Now any subtree T is missing one edge from Cn , say uv. The endpoints of this edge are at distance 1 in Cn , but d T (u, v) = n − 1, since we have to go all the way around the cycle. Hence, getting α < (n − 1) is impossible in general. Exercise: show how to find, for any graph G, a spanning tree T with stretch α ≤ n − 1. 4.1.3 A Second Relaxation: Randomization to the Rescue Since we cannot get trees with small stretch deterministically, let us try to get trees with small stretch “on average”. We amend our definition as follows: Definition 4.3. A (randomized) low-stretch spanning tree of stretch α for a graph G = (V, E) is a probability distribution D over spanning trees of G such that for all u, v ∈ V, we have dG (u, v) ≤ d T (u, v) ET ∼D [d T (u, v)] ≤ α dG (u, v) for all T in the support of D , and (4.4) Henceforth, all references to low-stretch trees will only refer to this randomized version, unless otherwise specified. 52 low-stretch spanning tree construction Observe that the first property must hold with probability 1 (i.e., it holds for all trees in the support of the distribution), whereas the second property holds only on average. Is this definition any good for our TSP example above? If we change the algorithm to sample a tree T from the distribution and then return the optimal tour for T, we get a randomized algorithm that is good in expectation. Indeed, (4.1) becomes ∑ E[dT (πG (i), πG (i + 1))] ≤ α · ∑ dG (πG (i), πG (i + 1)), i ∈[n] (4.5) i ∈[n] because the stretch guarantees hold in expectation (and linearity of expectation). The rest of the inequalities hold unchanged, including (4.3)—which requires the probability 1 guarantee of Definition 4.6 (Do you see why?). Hence, we get ∑ E[dG (πT (i), πT (i + 1))] ≤ α · ∑ dG (πG (i), πG (i + 1)) . i ∈[n] | {z expected algorithm’s tour length } i ∈[n] | {z OPTG (4.6) } Even a randomized better-than-1.49 approximation for TSP would still be amazing! And the algorithmic template here works not just for TSP: any NP-hard problem whose objective is a linear function of distances (e.g., many other vehicle routing problems, or the kmedian clustering problem) can be solved in this way. Indeed, the first approximation algorithms for many such problems came via low-stretch spanning trees. Moreover, (randomized) low-stretch spanning trees arise in many different contexts, some of which are not obvious at all. E.g., they can be used to more efficiently solve “Laplacian” linear systems of the form A⃗x = ⃗b, where A is the Laplacian matrix of some graph G. To do this, we let P be the Laplacian matrix of a low-stretch spanning tree of G, and then we solve the system P−1 A⃗x = P−1⃗x instead. This is called preconditioning with P. 
It turns out that this preconditioning allows certain algorithms for solving linear systems to converge faster to a solution. Time permitting, we will discuss this application later in the course. 4.2 Low-Stretch Spanning Tree Construction But first, given a graph G, how can we find a randomized low-stretch spanning tree for G with a small value of α (and efficiently)? As a sanity check, let us check what we can do on the two examples from before: 1. For the complete graph Kn , choose a star graph centered at a uniformly random vertex of G. For any pair of vertices u, v, they are A natural first attempt (at least for unweighted graphs) would be to try a uniformly random spanning tree. This does not work very well (which I think is not that surprising), even for the complete graph Kn (which I think is somewhat surprising). A result of Moon and Moser shows that for any pair of vertices u, v ∈ V (Kn ), if we choose T to be one of the nn−2 spanning trees uniformly at random, √ the expected distance is d T (u, v) = Θ( n). low-stretch spanning trees at distance 1 in this star if either u or v is the center, else they are 2 2 at distance 2. Hence the expected distance is n2 · 1 + n− n · 2 = 2 − n. 2. For the cycle Cn , choose a tree by dropping a single edge uniformly at random. For any edge uv in the cycle, there is only a 1 in n chance of deleting the edge from u to v. But when it is deleted, u and v are at distance n − 1 in the tree. So E[d T (u, v)] = n−1 1 2 · 1 + · ( n − 1) = 2 − . n n n And what about an arbitrary pair of nodes u, v in Cn ? We can use the exercise on the right to show that the stretch on other pairs is no worse! While we will not manage to get α < 1.49 for general graphs (or even for the above examples, for which the bounds of 2 − n2 are the best possible), we show that α ≈ O(log n) can indeed be achieved. The following theorem is the current best result, due to Ittai Abraham and Ofer Neiman: Theorem 4.4. For any graph G, there exists a distribution D over spanning trees of G with stretch α = O(log n log log n). Moreover, the construction is efficient: we can sample trees from this distribution D in O(m log n log log n) time. Moreover, the stretch bound of this theorem is almost optimal, up to the O(log log n) factor, as the following lower bound due to Alon, Peleg, Karp, and West shows. Theorem 4.5. For infinitely many n, there exist graphs G on n vertices such that any α-stretch spanning tree distribution D on G must have α = Ω(log n). In fact, G can be taken to be the n-vertex square grid, the nvertex hypercube, or any n-vertex constant-degree expander. 4.3 Bartal’s Construction The algorithm underlying Theorem 4.4 is quite involved, but we can give the entire construction of low-stretch trees for finite metric spaces. Definition 4.6. A (randomized) low-stretch tree with stretch α for a metric space M = (V, d) is a probability distribution D over trees over the vertex set V such that for all u, v ∈ V, we have d(u, v) ≤ d T (u, v) ET ∼D [d T (u, v)] ≤ α d(u, v). for all T in the support of D , and (4.7) Exercise: Given a graph G, suppose the stretch on all edges is at most α. Show that the stretch on all pairs of nodes is at most α. (Hint: linearity of expectation.) 53 54 bartal’s construction The difference of this definition from Definition 4.6 is slight: we now have a metric space instead of a graph, and we are allowed to output any tree on the vertex set V (since the concept of subtrees doesn’t make sense now). 
Note that given a graph G, we can compute its shortest-path metric (V, dG ) and then find a distribution over (non-spanning) trees that approximate the distance in G. So if we don’t really need the spanning aspect in our low-stretch trees—e.g., as in the TSP example—we can use results for this definition. We need one more piece of notation: for a metric space M = (V, d), define its aspect ratio ∆ to be ∆ M := maxu̸=v∈V d(u, v) . minu̸=v∈V d(u, v) We will show the following theorem, due to Yair Bartal: Theorem 4.7. For any metric space M = (V, d), there exists an efficiently sampleable α B -stretch spanning tree distribution D B , where α B = O(log n log ∆ M ). The proof works in two parts: we first show a good low-diameter decomposition. This will be a procedure that takes a metric space and a diameter bound D, and randomly partitions the metric space into clusters of diameter ≤ D, in such a way that close-by points are unlikely to be separated. Then we show how such a low-diameter decomposition can be used recursively to constuct a low-stretch tree. 4.3.1 Low-Diameter Decompositions The notion of a low-diameter decomposition has become ubiquitous in algorithm design, popping up in approximation and online algorithms, and also in distributed and parallel algorithms. It’s something worth understanding well. Definition 4.8 (Low-Diameter Decomposition). A low-diameter decomposition scheme (or LDD scheme) with parameter β for a metric M = (V, d) is a randomized algorithm that, given a bound D > 0, partitions the point set V into “clusters” C1 , . . . , Ct such that (i) for all i ∈ {1, . . . , t}, the diameter of Ci is at most D, and (ii) for all x, y ∈ V such that x ̸= y, we have Pr[ x, y in different clusters] ≤ β · d( x, y) . D Let’s see a few examples, to get a better sense for the definition: 1. Consider a set of points on the real line. One way to partition the line into pieces of diameter D is simple: imagine making notches The diameter of a set S is maxu,v∈S d(u, v), i.e., the maximum distance between any two points in it. low-stretch spanning trees on the line at distance D from each other, and then randomly shifting them. Formally, pick a random value R ∈ [0, D ] uniformly at random, and partition the line into intervals of the form [ Di + R, D (i + 1) + R), for i ∈ Z. A little thought shows that points x, y d( x,y) are separated with probability exactly D . 2. The infinite 2-dimensional square grid with unit edge-lengths. One way to divide this up is to draw horizontal and vertical lines which are D/2 apart, and randomly shift as above. A pair x, y is d( x,y) separated with probability exactly D/2 in this case. Indeed, this approach works for k-dimensional hypergrids (and k-dimensional x,y) ℓ1 -space) with probability k · d(D — in this case the β parameter is at most the dimension of the space. 3. What about lower bounds? One can show that for the k-dimensional hypergrid, we cannot get β = o (k). Or for a constant-degree nvertex expander, we cannot get β = o (log n). Details to come soon. Since the aspect ratio of the metric space is invariant to scaling all the edge lengths by the same factor, it will be convenient to assume that the smallest non-zero distance in d is 1, so the largest distance is ∆. The basic algorithm is then quite simple: Algorithm 8: LDD( M = (V, d), D ) 4 log n p ← min(1, D ). 8.2 while there exist unmarked point do 8.3 v ← any unmarked point. 8.4 sample Rv ∼ Geometric( p). 8.5 cluster Cv ← {unmarked u | d(v, u) < Rv }. 8.6 mark points in Cv . 
8.1 8.7 return the resulting set of clusters. Lemma 4.9. The algorithm above ensures that 1. the diameter of every cluster is at most D with probability at least 1 − 1/n, and 2. any pair x, y ∈ V is separated with probability at most 2p d( x, y). Proof. To show the diameter bound, it suffices to show that Rv ≤ D/2 for each cluster Cv , because then the triangle inequality shows that for any x, y ∈ Cv , d( x, y) ≤ d( x, v) + d(v, y) < D/2 + D/2 = D. Now the probability that Rv > D/2 for one particular cluster is We use that 1 − z ≤ ez for all z ∈ R. 55 56 bartal’s construction 1 . n2 By a union bound, there exists a cluster with diameter > D with probability Pr[ Rv > D/2] = (1 − p) D/2 ≤ e− pD/2 ≤ e−2 log n = 1 n = 1− . n n2 To bound the probability of some pair u, v being separated, we use the fact that sampling from the geometric distribution with parameter p means repeatedly flipping a coin with bias p and counting the number of flips until we see the first heads. Recall this process is memoryless, meaning that even if we have already performed k flips without having seen a heads, the time until the first heads is still geometrically distributed. Hence, the steps of drawing Rv and then forming the cluster can be viewed as starting from v, where the cluster is a unit-radius ball around v. Each time we flip a coin of bias p: it is comes up heads we set the radius Rv to the current value, form the cluster Cv (and mark its vertices) and then pick a new unmarked point v; on seeing tails, we just increment the radius of v’s cluster by one and flip again. The process ends when all vertices lie in some cluster. For x, y, consider the first time when one of these vertices lies inside the current ball centered at some point, say, v. (This must happen at some point, since all vertices are eventually marked.) Without loss of generality, let the point inside the current ball be x. At this point, we have performed d(v, x ) flips without having seen a heads. Now we will separate x, y if we see a heads within the next ⌈d(v, y) − d(v, x )⌉ ≤ ⌈d( x, y)⌉ flips—beyond that, both x, y will have been contained in v’s cluster and hence cannot be separated. But the probability of getting a heads among these flips is at most (by a union bound) 1 − Pr[∃v ∈ V, Rv > D/2] ≥ 1 − d( x, y) . D (Here we used that the minimum distance is 1, so rounding up distances at most doubles things.) This proves the claimed probability of separation. ⌈d( x, y)⌉ p ≤ 2d( x, y) p ≤ 8 log n Recall that we wanted the diameter bound with probability 1, whereas Lemma 4.9 only ensures it with high probability. Here’s a quick fix to this problem: repeat the above process until the returned partition has clusters of diameter at most D. The probability of any pair u, v being separated by this last run of Algorithm 8 is at most the probability of u, v being separated by any of the runs, which is at most p d(u, v) times the expected number of runs, p d(u, v) · (1/(1 − 1/n)) ≤ 2p d(u, v) = O(log n) d(u, v) . D Cv v Rv ≤ D2 d(v, x ) d(v, y) x y d( x, y) Figure 4.1: A cluster forming around v in the LDD process, separating x and y. To reduce clutter, only some of the distances are shown. low-stretch spanning trees Lemma 4.10. The low-diameter decomposition scheme above achieves parameter β = O(log n) for any metric M on n points. 4.3.2 Low-Stretch Trees Using LDDs Now we can use the low-diameter decomposition scheme to get a low-stretch tree (LST). 
Here’s the high-level idea: given a metric with diameter ∆, use an LDD to decompose it into clusters with diameter D ≤ ∆/2. Build a tree recursively for each of these clusters, and then combine these trees into one tree for the entire metric. Recall we assumed that the metric had minimum distance 1 and maximum distance ∆. Formally, we invoke the procedure LST below with the parameters LST(metric M, ⌈log2 ∆⌉). Algorithm 9: LST(metric M = (V,d), D = 2δ ) Input: Invariant: diameter( M) ≤ 2δ 9.1 if |V | = 1 then 9.2 return tree containing the single point in V. δ −1 ). 9.3 C1 , . . . , Ct ← LDD( M, D = 2 9.4 for j in {1, . . . , t } do 9.5 M j ← metric M restricted to the points in Cj . 9.6 Tj ← LST( M j , δ − 1). Add edges of length 2δ from root r1 for tree T1 to the roots of T2 , . . . , Tt . 9.8 return resulting tree rooted at r1 . 9.7 We are ready to prove Theorem 4.7; we will show that the tree has expected stretch O( β log ∆), and that it does not shrink any distances. In fact, we show a slightly stronger guarantee. Lemma 4.11. If the random tree T returned by some call LDD( M′ , δ) has root r, then (a) every vertex x in T has distance d( x, r ) ≤ 2δ+1 , and (b) the expected distance between any x, y ∈ T has E[d T ( x, y)] ≤ 8δβ d( x, y). Proof. The proof is by induction on δ. For the base case, the tree has a single vertex, so the claims are trivial. Else, let x lie in cluster Cj , so inductively the distance to the root of the tree Ti is d( x, ri ) ≤ 2(δ−1)+1 . Now the distance to the new root r is at most 2δ more, which gives 2δ + 2δ = 2δ+1 as claimed. Moreover, any pair x, y is separated by the LDD with probability d( x,y) β 2δ−1 , in which case their distance is at most d( x, r ) + d(r, y) ≤ 2δ+1 + 2δ+1 = 4 · 2δ . Else they lie in the same cluster, and inductively have expected dis- 57 58 metric embeddings: a.k.a. simplifying metrics tance at most 8(δ − 1) β d( x, y). Hence the expected distance is E[d( x, y)] ≤ Pr[ x, y separated] · 4 · 2δ + Pr[ x, y not separated] · 8(δ − 1) β d( x, y) d( x, y) · 4 2δ + 8(δ − 1) β d( x, y) 2δ −1 = 8 δ β d( x, y). ≤β This proves Theorem 4.7 because β = O(log n), and the iniitial call on the entire metric defines δ = O(log ∆). In fact, if we have a better LDD (with smaller β), we immediately get a better low-stretch tree. For example, shortest-path metrics of planar graphs admit an LDD with parameter β = O(1); this shows that planar metrics admit (randomized) low-stretch trees with stretch O(log ∆). It turns out this factor of O(log n log ∆) can be improved to O(log n)— this was done by Fakcharoenphol, Rao, and Talwar. Moreover, the bound of O(log n) is tight: the lower bounds of Theorem 4.5 continue to hold even for low-stretch non-spanning trees. 4.4 Metric Embeddings: a.k.a. Simplifying Metrics We just how to approximate a finite metric space with a simpler metric space, defined over a tree. (Loosely, “every metric space is within O(log n) of some tree metric”.) And since trees are simpler metrics, both conceptually and algorithmically, such an embedding can help design algorithms for problems on metric spaces. This idea of approximating metric spaces by simpler ones has been extensively studied in various forms. For example, another famous result of Jean Bourgain (with an extension by Jirka Matoušek) shows that any finite metric space on n points can be embedded into ℓ p -space with O((log n)/p) distortion 1 . 
Moreover, the JohnsonLindenstrauss Lemma, which we will see in a future chapter, shows that any n point-submetric of Euclidean space can be embedded into a (low-dimensional) Euclidean space of dimension at most O(log n/ϵ2 ), such that distances between points are distorted by a factor of at most 1 ± ϵ 2 . Since geometric spaces, and particularly, low-dimensional Euclidean spaces, are easier to work with and reason about, these can be used for algorithm design as well. 1 2 4.4.1 Historical Notes To be cleaned up. Elkin et al. 3 gave the first polylog-stretch spanning trees, which took eight years following Bartal’s√ construction. (The first low-stretch spanning trees had stretch 2O( log n log log n) by Alon et al. 4 , which is smaller than nϵ for any ϵ > 0 but larger than 3 4 low-stretch spanning trees polylogarithmic, i.e., (log n)C for any C > 0.) 59 5 A Near-Linear Time Algorithm for SSSP TO be added in 6 Blank TO be added in 7 Graph Matchings I: Combinatorial Algorithms Another fundamental graph problem is to find matchings: these are subsets of edges that do not share endpoints. Matchings arise in various contexts: matching tasks to workers, or advertisements to slots, or roommates to each other. Moreover, matchings have a rich combinatorial structure. The classical results can be found in Matching Theory by Laci Lovász and Michael Plummer, though Lex Schrijver’s three-volume opus Combinatorial Optimization: Polyhedra and Efficiency might be easier to find, and contains more recent developments as well. Several different and interesting algorithmic techniques can be used to find large matchings in graphs; we will discuss them over the next few chapters. This chapter discusses the simplest combinatorial algorithms, explaining the underlying concepts without optimizing the runtimes. 7.1 Notation and Definitions Consider an undirected (simple and connected) graph G = (V, E) with |V | = n and | E| = m as usual. The graph is unweighted; we will consider weighted versions of matching problems in later chapters. When considering bipartite graphs, where the vertex set has parts V = L ⊎ R (the “left” and “right”, and the edges E ⊆ L × R, we may denote the graph as G = ( L, R, E). Definition 7.1 (Matching). A matching in graph G is a subset of the edges M ⊆ E which have no endpoints in common. Equivalently, the edges in M are disjoint, and hence every vertex in (V, M ) has maximum degree 1. Given a matching M in G, a vertex v is open or exposed or free if no edge in the matching is incident to v, else the vertex is closed or covered or matched. Observe: the empty set of edges is a matching. Moreover, any matching can have at most |V |/2 edges, since each L. Lovász and M.D. Plummer A. Schrijver (2003) 66 notation and definitions edge covers two vertices, and each vertex can be covered by at most one edge. Definition 7.2 (Perfect Matching). A perfect matching M is a matching such that | M| = |V |/2. Equivalently, every vertex is matched in the matching M. Definition 7.3 (Maximum Matching). A maximum cardinality matching (or simply maximum matching) in G is a matching with largest possible cardinality. The size of the maximum matching in graph G is denoted MM ( G ). Definition 7.4 (Maximal Matching). A maximal matching on a graph is a matching that is inclusion-wise maximal; that is, no additional edges can be added to M while maintaining the matching property. Hence, M ∪ {e} is not a matching for all edges e ̸∈ M. 
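To see the difference between maximal and maximum matchings concretely, here is a tiny Python sketch of the obvious greedy procedure; the edge-list representation and the function name are ours.

    def greedy_maximal_matching(edges):
        # Scan the edges once, keeping any edge whose endpoints are both uncovered.
        # The result is maximal: every skipped edge shares an endpoint with a kept one.
        matched, M = set(), []
        for (u, v) in edges:
            if u not in matched and v not in matched:
                M.append((u, v))
                matched.update((u, v))
        return M

    # On a path 1-2-3-4, scanning the middle edge first returns [(2, 3)],
    # a maximal matching of size 1, while the maximum matching {(1,2), (3,4)} has size 2.
    print(greedy_maximal_matching([(2, 3), (1, 2), (3, 4)]))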
The last definition is given to mention something we will not be focusing on; our interest is in perfect and maximum matchings. That being said, it is a useful exercise to show that any maximal matching in G has at least MM( G )/2 edges. 7.1.1 Augmenting Paths for Matchings Since we want to find a maximum matching, a question we may ask is: given a matching M, can we (efficiently) decide if it is a maximum matching? One answer to this was suggested by Berge, who gave a characterization of maximum matchings in terms of “augmenting” paths. Definition 7.5 (Alternating Path). For matching M, an M-alternating path is a path in which edges in M alternate with those not in M. Definition 7.6 (Augmenting Path). For matching M, an M-augmenting path is an M-alternating path with both endpoints open. Given sets S, T, their symmetric difference is denoted Figure 7.1: An alternating path P (dashed edges are not in P, solid edges are in P) Figure 7.2: An augmenting path S △ T : = ( S \ T ) ∪ ( T \ S ). The following theorem explains the name for augmenting paths. Theorem 7.7 (Berge’s Optimality Criterion). A matching M is a maximum matching in graph G if and only if there are no M-augmenting paths in G. Proof. If there is an M-augmenting path P, then M′ := M△ P is a larger matching than M. (Think of getting M′ by toggling the dashed Berge (1957) graph matchings i: combinatorial algorithms 67 edges in the path to solid, and vice versa). Hence if M is maximum matching, there cannot exist an M-augmenting path. Conversely, suppose M is not a maximum matching, and matching M′ has | M′ | > | M|. Consider their symmetric difference S := M△ M′ . Every vertex is incident to at most 2 edges in S (at most one each from M and M′ ), so S consists of only paths and cycles, all of them having edges in M alternating with edges in M′ . Any cycle with this alternating structure must be of even length, and any path has at most one more edge from one matching than form the other. Since | M′ | > | M|, there must exists a path in S with one more edge from M′ than from M. But this is an M-augmenting path. If we could efficiently find an M-augmenting path (if one exists), we could repeatedly augment the current matching until we have a maximum matching. However, Berge’s theorem does not immediately give an efficient algorithm: finding an M-augmenting path could naively take exponential time. We now give algorithms to efficiently find augmenting paths, first in bipartite graphs, and then in general graphs. 7.2 Bipartite Graphs Finding an M-augmenting path (if one exists) in bipartite graphs is an easier task, though it still requires cleverness. A first step is to consider a “dual” object, which is called a vertex cover. Definition 7.8 (Vertex Cover). A vertex cover in G is a set of vertices C such that every edge in the graph has at least one endpoint in C. Note that the entire set V is trivially a vertex cover, and the challenge is to find small vertex covers. We denote the size of the smallest cardinality vertex cover of graph G as VC ( G ). Our motivation for calling it a “dual” object comes from the following fundamental theorem from the early 20th century: Theorem 7.9 (König’s Minimax Theorem). In a bipartite graph, the size of the largest possible matching equals the cardinality of the smallest vertex cover: Dénes König (1916) MM( G ) = VC( G ). This theorem is a special case of the max-flow min-cut theorem, which you may have seen before. 
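As a small illustration of this duality, here is a Python sketch that checks such a certificate of optimality: a matching and a vertex cover of equal size certify one another. The representations and names are ours; this is only a sanity-check utility, not an algorithm from these notes.

    def is_matching(M):
        # No two edges of M share an endpoint.
        seen = set()
        for (u, v) in M:
            if u in seen or v in seen:
                return False
            seen.update((u, v))
        return True

    def certifies_optimality(M, C, edges):
        # Weak duality: distinct matching edges need distinct cover vertices, so
        # |C| >= |M| always. Hence if |M| = |C|, the matching is maximum and the
        # cover is minimum (Konig says such a pair exists in bipartite graphs).
        is_cover = all(u in C or v in C for (u, v) in edges)
        return is_matching(M) and is_cover and len(M) == len(C)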
It is first of many min-max relationships, many of which lead to efficient algorithms. Indeed, the algorithm for finding augmenting paths will come out of the proof of this theorem. Exercise: Use König’s theorem to prove P. Hall’s theorem: A bipartite graph has a matching that matches all vertices of L if and only for every subset S ⊆ L of vertices, | N (S)| ≥ |S|. Here N (S) denotes the “neighborhood” of S, i.e., those vertices with a neighbor inside S. 68 bipartite graphs Proof. In many such proofs, there is one easy direction. Here, it is proving that MM ( G ) ≤ VC ( G ). Indeed, the edges of any matching share no endpoints, so covering a matching of size MM ( G ) requires at least as many vertices. The minimum vertex cover size is therefore at least MM( G ). Next, we prove that MM( G ) ≥ VC ( G ). To do this, we give a linear-time algorithm that takes as input an arbitrary matching M, and either returns an M-augmenting path (if such a path exists), or else returns a vertex cover of size | M|. Since a maximum matching M admits no M-augmenting path by Berge’s theorem, we would get back a vertex cover of size MM( G ), thereby showing VC ( G ) ≤ MM( G ). The proof is an “alternating” breadth-first search: it starts with all open nodes among the left vertex set L, and places them at level 0. Then it finds all the (new) neighbors of these nodes reachable using non-matching edges, and then all (new) neighbors of those nodes using matching edges, and so on. Formally, the algorithm is as follows, where we use X≤ j to denote X0 ∪ . . . ∪ X j . X0 ← all open vertices in L for i = 0, 1, 2, . . . do 9.3 X2i+1 ← {v | exists u ∈ X2i s.t. uv ̸∈ M, and v ̸∈ X≤2i } 9.4 X2i+2 ← {v | exists u ∈ X2i+1 s.t. uv ∈ M, and v ̸∈ X≤2i+1 } 9.1 9.2 Let us make a few observations about the procedure. First, since the graph is bipartite, Xi is a subset of L for even levels i, and of R for odd levels i. Next, all vertices in X2 ∪ X4 ∪ . . . are matched vertices, since they are reached from the previous level using an edge in the matching. Moreover, if some odd level X2i+1 contains an open node v, we have found an M-alternating path from an open node in X0 to v, and hence we can stop and return this augmenting path. Hence, suppose we do not find an open node in an even level, and stop when some X j is empty. Let X = ∪ j X j be all nodes added to any of the sets X j ; we call these marked nodes. Define the set C to be the vertices on the left which are not marked, plus the vertices on the right which are marked. That is, D X A Y B Z C Layer 0 1 2 3 Matched edge 4 Unmatched edge Open vertex Figure 7.3: Illustration of the process to find augmenting paths in a bipartite graph. Mistakes here, to be fixed! C := ( L \ X ) ∪ ( R ∩ X ) We claim that C is a vertex cover of size | M |. Claim 7.10. C is a vertex cover. Proof. G is a bipartite graph, and C hits all edges that touch R ∩ X and L \ X. Hence we must show there are no edges between L ∩ X and R \ X, i.e., between the top-left and bottom-right of the figure. Figure 7.4: X = set of marked vertices, O = marked open vertices, C = claimed vertex cover of G. To be changed. graph matchings i: combinatorial algorithms 1. There can be no unmatched edge from the open vertices in L ∩ X to R \ X, else that vertex would be reachable from X0 and so belong to X1 . Moreover, an open vertex has no unmatched edges, by definition. Hence, any “offending edges” out of L ∩ X must come from a covered vertex. 2. 
There can be no non-matching edge from a covered vertex in L ∩ X to some node u in R \ X, else this node u would have been added to some level X2i+1 . 3. Finally, there can be no matching edge between a covered vertex in L ∩ X and some vertex in R \ X. Indeed, every covered node in L ∩ X (i.e., those in X2 , X4 , . . . ) was reached via a matching edge from some node in R ∩ X. There cannot be another matching edge from some node in R \ X incident to it. This shows that C is a vertex cover. Claim 7.11. |C | ≤ | M|. We use a simple counting argument: • Every vertex in R ∩ X has a matching edge incident to it; else it would be open, giving an augmenting path. • Every vertex in L \ X has an incident edge in the matching, since no vertices in L \ X ⊆ L \ X0 are open. • There are no matching edges between L \ X and R ∩ X, else they would have been explored and added to X. Hence, every vertex in C = ( L \ X ) ∪ ( R ∩ X ) corresponds to a unique edge in the matching, and |C | ≤ | M |. Observe that the proof of König’s theorem is algorithmic, and can be implemented to run in O(m) time. Now, starting from some trivial matching, we can use this linear-time algorithm to repeatedly augment until we have a maximum matching. This means that maximum matching on bipartite graphs has an O(mn)-time algorithm. Observe: this algorithm also gives a “proof of optimality” of the maximum matching M, in the form of a vertex cover of size | M |. By the easy direction of König’s theorem, this is a vertex cover of minimum cardinality. Therefore, while finding the smallest vertex cover is NP-hard for general graphs, we have just solved the minimum vertex cover problem on bipartite graphs. One other connection: if you have seen the Ford-Fulkerson algorithm for computing maximum flows, the above algorithm may seem familiar. Indeed, modeling the maximum matching problem in bipartite graphs as that of finding a maximum integer s-t flow, and Figure 7.5: Use Ford-Fulkerson algorithm to find a matching 69 70 general graphs: the tutte-berge theorem running the Ford-Fulkerson “augmenting paths” algorithm results in the same result. Moreover, the minimum s-t cut corresponds to a vertex cover, and the max-flow min-cut theorem proves König’s theorem. The figure to the right illustrates this on an example. Figure needs fixing. 7.2.1 Other algorithms There are faster algorithms to find maximum matchings in bipartite graphs. For a long time, the fastest one was an algorithm by John √ Hopcroft and Dick Karp, which ran in time O(m n). It finds many augmenting paths at once, and then combines them in a clever way. There is also a related algorithm of Shimon Even and Bob Tarjan, √ which runs in time O(min(m m, mn2/3 )); in fact, they compute maximum flows on unit-capacity graphs in this running time. There was remarkably little progress on the maximum matching problem until 2016, when Aleksander Madry gave an algorithm that runs in time Õ(m10/7 ) time—in fact the algorithm also solves the unit-capacity maximum-flow problem in that time. It takes an interior-point algorithm for solving general linear programs, and specializes it to the case of maximum matchings. We may discuss this max-flow algorithm in a later chapter. Then, following an intermediate improvement to m4/3+o(1) time, a remarkable paper presented at the FOCS 2022 conference gave an algorithm for both the maximum flow problem, and the min-cost flow problem in m1+o(1) time. (The result assumes polynomially bounded capacities and costs, and integer demands.) 7.3 J. 
Hopcroft and R.M. Karp (1973) S. Even and R.E. Tarjan (1975) A. Madry (2016) L. Chen, R. Kyng, Y.P. Liu, R. Peng, M. Probst Gutenberg, and S. Sachdeva (2022). General Graphs: The Tutte-Berge Theorem The matching problem on general (non-bipartite) graphs gets more involved, since the structure of matchings is richer. For example, the flow-based approaches do not work any more. And while Berge’s theorem (Theorem 7.7) still holds in this case, König’s theorem (Theorem 7.9) is no longer true. Indeed, the 3-cycle C3 has a maximum matching of size 1, but the smallest vertex cover is of size 2. However, we can still give a min-max relationship, via the Tutte-Berge theorem. To state it, let us give a definition: for a subset U ⊆ V, suppose deleting the nodes of U and their incident edges from G gives connected components {K1 , K2 , . . . , Kt }. The quantity odd( G \ U ) is the number of such pieces with an odd number of vertices. Theorem 7.12 (The Tutte-Berge Max-Min Theorem). Given a graph G, Tutte (1947), Berge (1958) Tutte showed that the graph has a perfect matching precisely if for every U ⊆ V, odd( G \ U ) ≤ |U |. Berge gave the generalization to maximum matchings. graph matchings i: combinatorial algorithms the size of the maximum matching is described by the following equation. MM ( G ) = min U ⊆V n + |U | − odd( G \ U ) . 2 The expression on the right can seem a bit confusing, so let’s consider some cases. • If U = ∅, we get that if |V | is even then MM ( G ) ≤ n/2, and if |V | is odd, the maximum matching cannot be bigger than (n − 1)/2. (Or if G is disconnected with k odd-sized components, this gives n/2 − k/2.) • Another special case is when U is any vertex cover with size c. Then the Ki ’s must be isolated vertices, so odd( G \ U ) = n − c. c+n−(n−c) This gives us MM ≤ = c, i.e., the size of the maximum 2 matching is at most the size of any vertex cover. • Give example where G is even, connected, but MM < VC. Trying special cases is a good way to understand the Proof of the ≤ direction of Theorem 7.12. The easy direction is to show that MM ( G ) is at most the quantity on the right. Indeed, consider a maximum matching M. At most |U | of the edges in M can be hit by nodes in U; the other edges must lie completely within some connected component of G \ U. The maximum size of a matching within Ki is ⌊Ki /2⌋, and it are these losses from the odd components that gives the expression on the right. Indeed, we get  t  | Ki | | M | ≤ |U | + ∑ 2 i =1 n − |U | odd( G \ U ) − 2 2 |U | + n − odd( G \ U ) = . 2 = |U | + We can prove the “hard” direction using induction (see the webpage for several such proofs). However, we defer it for now, and derive it later from the proof of the Blossom algorithm. 7.4 The Blossom Algorithm The Blossom algorithm for finding the maximum matching in a general graph is by Jack Edmonds. Recall: the algorithm for minimumweight arborescences in §?? was also due to him, and you may see some similarities in these two algorithms. Theorem 7.13. Given a graph G, the Blossom algorithm finds a maximum matching M in time O(mn2 ). J. Edmonds (1965) 71 72 the blossom algorithm The rest of this section defines the algorithm, and proves this theorem. The essential idea of the algorithm is simple, and similar to the one for the bipartite case: if we have a matching M, Berge’s characterization from Theorem 7.7 says that if M is not optimal, there exists an M-augmenting path. So the natural idea would be to find such an augmenting path. 
However, it is not clear how to do this directly. The clever idea in the Blossom algorithm is to either find an M-augmenting path, or else find a structure called a “blossom”. The good thing about blossoms is that we can use them to contract the graph in a certain way, and make progress. Let us now give some definitions, and details. A flower is a subgraph of G that looks like the the object to the right: it has a open vertex at the base, then a stem with an even number of edges (alternating between matched and unmatched edges ), and then a cycle with an odd number of edges (again alternating, though naturally having two unmatched edges adjacent to the stem). The cycle itself is called the blossom. 7.4.1 The Main Procedure The algorithm depends on a subroutine called FindAugPath, which has the following guarantee. Lemma 7.14. Given graph G and matching M, the subroutine FindAugPath, runs in O(m) time. If G has an M-augmenting path, then it returns either (a) a flower F, or (b) an M-augmenting path. Note that we have not said what happens if there is no M-augmenting path. Indeed, we cannot find an augmenting path, but we show that the FindAugPath returns either a flower, or says “no M-augmenting path, and returns a Tutte-Berge set U achieving equality in Theorem 7.12 with respect to M. We can now use this FindAugPath subroutine within our algorithm as follows. 1. Says “no M-augmenting path” and a set U of nodes. In this case, M is the maximum matching. 2. Finds augmenting path P. We can now augment along P, by setting M ← M△ P. 3. Finds a flower F. In this case, we don’t yet know if M is a maximum matching or not. But we can shrink the blossom down to get a smaller graph G ′ (and a matching M′ in it), and recurse. Either we will find a proof of maximality of M′ in G ′ , or an M′ augmenting path. This we can extend to the matching M in G. That’s the whole algorithm! Stem Blossom (a) A (b) Matched edge Unmatched edge Open vertex Figure 7.6: An example of blossom and the toggling of the stem. graph matchings i: combinatorial algorithms Let’s give some more details for the last step. Suppose we find a flower F, with stem S and blossom B. First, toggle the stem (by setting M ← M△S): this moves the open node to the blossom, without changing the size of the matching M. (It makes the following arguments easier, with one less case to consider.) (Change figure.) Next, contract the blossom down into a single vertex v B , which is now open. Denote the new graph G ′ ← G/B, and M′ ← M/B. Since all the nodes in blossom B, apart from perhaps the base, were matched by edges within the blossom, M′ is also a matching in G ′ . Next, we recurse on this smaller graph G ′ with matching M′ . Finally, if we get back an M′ -augmenting path, we “lift” it to get an M-augmenting path (as we see soon). Else if we find that M′ is a maximum matching in G ′ , we declare that M is maximum in G. To show correctness, it suffices to prove the following theorem. 73 Figure 7.7: The shrinking of a blossom. Image found at http://en.wikipedia. org/wiki/Blossom_algorithm. Given a graph and a subset C ⊆ V, recall that G/C denotes the contraction of C in G. Lemma 7.15. Given graph G and matching M, suppose we shrink a blossom to get G ′ and M′ . Then there exists an M-augmenting path in G if and only if there exists an M′ -augmenting path in G ′ . Moreover, given an M′ -augmenting path in G ′ , we can lift it back to an M-augmenting path P in G in O(m) time. Proof. 
Since we toggled the stem, the vertex v at the base of the blossom B is open, and so is the vertex v B created in G ′ by contracting B. Moreover, all other nodes in the blossom are matched by edges within itself, so all edges leaving B are non-matching edges. The picture essentially gives the proof, and can be used to follow along. (⇒) Consider an M-augmenting path in G, denoted by P. If P does not go through the blossom B, the path still exists in G ′ . Else if P goes through the blossom, we can assume that one of its endpoints is the base of the blossom (which is the only open node on the blossom)—indeed, any other M-augmenting path P can be rerouted to the base. (Figure!) So suppose this path P starts at the base and ends at some v′ not in B. Because v B is open in G ′ , the path from v B to v′ is an M′ -augmenting path in G ′ . (⇐) Again, an M′ -augmenting path P′ in G ′ that does not go through v B still exists in G. Else, the M′ -augmenting path P′ passes through v B , and because v B is open in G ′ , the path starts at v B and ends at some node t. Let the first edge on P′ be e′ = v B y for some node y, and let it correspond to edge e = xy in G, where x ∈ B. Now, if v is the open vertex at the base of the blossom, following one of the two paths (either clockwise or counter-clockwise) along the blossom from v to x, using the edge xy and then following the rest of the path P′ from y to t gives an M-augmenting path in G. (This Figure 7.8: The translation of augmenting paths from G \ B to G and back. 74 the blossom algorithm is where we use the fact that the cycle is odd, and is alternating except for the two edges incident to v.) The process to get from P′ in G ′ to the M-augmenting path in G be done algorithmically in O(m) time, completing the proof. We can now analyze the runtime, and prove Theorem 7.13: Proof of Theorem 7.13. We first call FindAugPath, which takes O(m) time. We are either done (because M is a maximum matching, or else we have an augmenting path), or else we contract down in another O(m) time to get a graph G ′ with at most n − 3 vertices and at most m edges. Inductively, the time taken in the recursive call on G ′ is O(m(n − 3)). Now lifting an augmenting path takes O(m) time more. So the total runtime to find an augmenting path in G (if one exists) is O(mn). Finally, we start with an empty matching, so its size can be augmented at most n/2 times, giving us a total runtime of O(mn2 ). 7.4.2 The FindAugPath Subroutine The subroutine FindAugPath is very similar to the analogous procedure in the bipartite case, but since there is no notion of left and right vertices, we start with level X0 containing all vertices that are unmatched in M0 , and try to grow M-alternating paths from them, in the hope of finding an M-augmenting path. X0 ← all open vertices in V 9.2 for i = 0, 1, 2, . . . do 9.3 X2i+1 ← {v | exists u ∈ X2i s.t. uv ̸∈ M, and v ̸∈ X≤2i } 9.4 X2i+2 ← {v | exists u ∈ X2i+1 s.t. uv ∈ M, and v ̸∈ X≤2i+1 } 9.1 if exists a “cross” edge between nodes of same level then return augmenting path or flower 9.7 else 9.8 say “no M-augmenting path” 9.5 9.6 To argue correctness, let us look at the steps above in more detail. In line 9.2, for each vertex u ∈ X2i , we consider the possible cases for each non-matching edge uv incident to it: 1. If v is not in X≤2i+1 already (i.e., not marked already) then we add it to X2i+1 . Note that v ∈ X2i+1 now has an M-alternating path to some node in X0 , that hits each layer exactly once. 2. 
If v ∈ X2i , then uv is an unmatched edge linking two vertices in the same level. This gives an augmenting path or a blossom! Indeed, by construction, there are M-alternating paths P and As before, let X≤ j denote X0 ∪ . . . ∪ X j , and let nodes added to some level X j be called marked. graph matchings i: combinatorial algorithms Q from u and v to open vertices in X0 . If P and Q do not intersect, then concatenating path P, edge uv, and path Q gives an M-augmenting path. If P and Q intersect, they must first intersect some vertex w ∈ X2j for some j ≤ i, and the cycle containing u, v, w gives us the blossom, with the stem being a path from w back to an open vertex in X0 . 3. If v ∈ X2j for j < i, then u would have been added to the odd level X2j+1 , which is impossible. 4. Finally, v may belong to some previous odd level, which is fine. Observe that this “backward” non-matching edge uv is also an even-to-odd edge, like the “forward” edge in the first case. Now for the edges out of the odd layers considered in line 9.3. Given u ∈ X2i+1 and matching edge uv ∈ M, the cases are: 1. If v is not in X≤2i+1 then add it to X2i+2 . Notice that v cannot be in X2i+2 already, since nodes in even layers are matched to nodes in the preceding odd layer, and there cannot be two matching edges incident to v. Again, observe inductively that v has a path to some vertex in X0 that hits each intermediate layer once. 2. If v is in X2i+1 , there is an matching edge linking two vertices in the same odd level. This gives an augmenting path or a blossom, as in case 2 above. (Success!) 3. The node v cannot be in a previous level, because all those vertices are either open, or are matched using other edges. Observe that if the algorithm does not succeed, all the matching edges we explored are odd-to-even, whereas all the non-matching edges are even-to-odd. Now we can prove Lemma 7.14. Proof of Lemma 7.14. Let P be an M-augmenting path in G. For a contradiction, suppose we do not succeed in finding an augmenting path or blossom. Starting from one of the endpoints of P (which is in X0 , an even level), trace the path in the leveled graph created above. The next vertex should be in an odd level, the next in an even level, and so forth. Since the path P is alternating, FindAugPath ensures that all its edges will be explored. (Make sure you see this!) Now P has an odd number of edges (i.e., even number of vertices), so the last vertex has an opposite parity from the starting vertex. But the last vertex is open, and hence in X0 , an even level. This is a contradiction. 75 76 subsequent work 7.4.3 Finding a Tutte-Berge Set⋆ If FindAugPath did not succeed, all the edges we explored form a bipartite graph. This does not mean that the entire graph is bipartite, of course—there can be non-matching edges incident to nodes in odd levels that lead to nodes that remain unmarked. But these components have no open vertices (which are all in X0 and marked). Now define U = Xodd := X1 ∪ X3 ∪ . . . be the vertices in odd levels. Since there are no cross edges, each of these nodes has a distinct matching edge leading to the next level. Now G \ U has two kinds of components: (a) the marked vertices in the even levels, Xeven which are all singletons since there are no cross edges, and (b) the unmarked components, which have no open vertices, and hence have even size. Hence n + | Xodd | − | Xeven | n + |U | − odd( G \ U ) = 2 2 2| Xodd | + (n − | X |) = 2 (n − | X |) = | Xodd | + = | M |. 
2 The last equality uses that all nodes in V \ X are perfectly matched among themselves, and all nodes in Xodd are matched using unique edges. The last piece is to show that a Tutte-Berge set U ′ for a contracted graph G ′ = G/B with respect to M′ = M/B can be lifted to one for G with respect to M. We leave it as an exercise to show that adding the entire blossom B to U ′ gives such an U. 7.5 Subsequent Work The best runtime of combinatorial algorithms for maximum matching √ in general graphs is O(m n) by an algorithm of Silvio Micali and Vijay Vazirani. The algorithm is based on finding augmenting paths much faster than the naïve approach above. It is quite involved; I recommend an algorithm due to Hal Gabow and Bob Tarjan that has the same running time, and also extends to the min-cost version of the problem. In a later chapter, we will see a very different “algebraic” algorithm based on fast matrix multiplication. This algorithm due to Marcin Mucha and Piotr Sankowski gives a runtime of O(nω ), where ω ≈ 2.376. Coming up next, however, is a discussion of weighted versions of matching, where edges have weights and the goal is to S. Micali and V.V. Vazirani (1984) H.N. Gabow and R.E. Tarjan M. Mucha and P. Sankowski (2006) graph matchings i: combinatorial algorithms find the matching of maximum weight, or perfect matchings with minimum weight. 77 8 Graph Matchings II: Algebraic Algorithms We now introduce some algebraic methods to find perfect matchings in general graphs. We use so-called the “polynomial method”, based on the elementary fact that low-degree polynomials have few zeroes. This is a powerful and versatile idea, using a combination of basic algebra and randomness, that can be used to solve many related problems as well. For instance, we will use it to get parallel (randomized) algorithms for perfect matchings, and also to find red-Blue perfect matchings, an algorithm for which we know no deterministic algorithms. But before we digress to these problems, let us discuss some of the algebraic results for perfect matchings. We focus on perfect matchings here; it is an exercise to reduce finding maximum matchings to perfect matchings. • The first result along these lines is that of Laci Lovász, who introduced the general idea, and gave a randomized algorithm to detect the presence of perfect matchings in time O(nω ), and to find it in time O(mnω ). We will present all the details of this elegant idea soon. Lovász (1979) • Dick Karp, Eli Upfal, and Avi Wigderson, and then Ketan Mulmuley, Umesh Vazirani, and Vijay Vazirani showed how to find such a matching in parallel. The question of getting a deterministic parallel algorithm remains an outstanding open problem, despite recent progress (which discuss at the end of the chapter). Karp, Upfal, and Wigderson (1986) • Michael Rabin and Vijay Vazirani sped up the sequential algorithm to run in O(n · nω ). This was substantially improved by the work of Marcin Mucha and Piotr Sankowski to get a runtime of O(nω ). Rabin and Vazirani (1989) 8.1 Mulmuley, Vazirani, and Vazirani (1987) Mucha and Sankowski (2006) Preliminaries: roots of low degree polynomials For the rest of this lecture, we fix a field F, and consider (univariate and multivariate) polynomials over this field. We assume that we can perform basic arithmetic operations in constant time, though sometimes it will be important to look more closely at this assumption. 
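For concreteness, when F is a prime field F_p these operations look as follows in Python; the specific prime below is an arbitrary choice of ours, and three-argument pow performs modular exponentiation.

    P = (1 << 61) - 1                 # an arbitrary large prime (2^61 - 1 is a Mersenne prime)

    def add(a, b): return (a + b) % P
    def sub(a, b): return (a - b) % P
    def mul(a, b): return (a * b) % P
    def inv(a):    return pow(a, P - 2, P)   # Fermat's little theorem: a^(P-2) = a^(-1) mod P
    def div(a, b): return mul(a, inv(b))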
For finite fields Fq (where q is a prime power), we can perform arithmetic operations (addition, multiplication, division) in time poly log q. 80 preliminaries: roots of low degree polynomials Given p( x ), a root/zero of this polynomial is some value z such that p(z) evaluates to zero. The critical idea for today’s lecture is simple: low-degree polynomials have “few” roots. In this section, we will see this for both univariate and multivariate polynomials, for the right notion of “few”. The following theorem well-knwon is for univariate polynomials. (The proof is essentially by induction on the degree; will add a reference.) Theorem 8.1 (Univariate Few-Roots Theorem). A univariate polynomial p( x ) of degree at most d over any field F has at most d roots, unless p( x ) is zero polynomial. Now, for multivariate polynomials, the trivial extension of this theorem is not true. For example, p( x, y) := xy has degree two, and the solutions to p( x, y) = 0 over the reals are exactly the points in {( x, y) ∈ R2 : x = 0 or y = 0}, which is infinite. However, the roots are still “few”, in the sense that the set of roots is very sparse in R2 . To formalize this observation, let us write a trivial corollary of Theorem 8.1: Corollary 8.2. Given a non-zero univariate polynomial p( x ) over a field F, such that p has degree at most d. Suppose we choose R uniformly at random from a subset S ⊆ F. Then Pr [ p( R) = 0] ≤ d . |S| This statement holds for multivariate polynomials as well, as we see next. The result is called the Schwartz-Zippel lemma, and it appears in papers by Richard DeMillo and Richard Lipton, by Richard Zippel, and by Jacob Schwartz. Theorem 8.3. Let p( x1 , . . . , xn ) be a non-zero polynomial over a field F, such that p has degree at most d. Suppose we choose values R1 , . . . , Rn independently and uniformly at random from a subset S ⊆ F. Then Pr [ p( R1 , . . . , Rn ) = 0] ≤ d . |S| Hence, the number of roots of p inside Sn is at most d|S|n−1 . Proof. We argue by induction on n. The base case of n = 1 considers univariate polynomials, so the claim follows from Theorem 8.1. Now for the inductive step for n variables. Let k be the highest power of xn that appears in p, and let q be the quotient and r be the remainder when dividing p by xnk . That is, let q( x1 , . . . , xn−1 ) and r ( x1 , . . . , xn ) be the (unique) polynomials such that p( x1 , . . . , xn ) = xnk q( x1 , . . . , xn−1 ) + r ( x1 , . . . , xn ), R.A. DeMillo and R.J. Lipton (1978) Zippel (1979) Schwartz (1980) Like many powerful ideas, the provenance of this result gets complicated. A version of this for finite fields was apparently already proved in 1922 by Øystein Ore. Anyone have a copy of that paper? A monomial is a product a collection of variables. The degree of a monomial is the sum of degrees of the variables in it. The degree of a polynomial is the maximum degree of any monomial in it. graph matchings ii: algebraic algorithms where the highest power of xn in r is less than k. Now letting E be the event that q( R1 , . . . , Rn−1 ) is zero, we find Pr [ p( R1 , . . . , Rn ) = 0] = Pr [ p( R1 , . . . , Rn ) = 0 | E] Pr [E ]     + Pr p( R1 , . . . , Rn ) = 0 | E Pr E   ≤ Pr [E ] + Pr p( R1 , . . . , Rn ) = 0 | E By the inductive assumption, and noting that q has degree at most d − k, we know Pr [E ] = Pr [q( R1 , . . . , Rn−1 ) = 0] ≤ (d − k)/|S|. Similarly, fixing the values of R1 , . . . , Rn−1 and viewing p as a polynomial only in variable xn (with degree k), we know Thus we get   Pr p( R1 , . . . 
, Rn ) = 0 | E ≤ k/|S|. Pr [ p( R1 , . . . , Rn ) = 0] ≤ d−k k d + = . |S| |S| |S| Remark 8.4. Finding the set S ⊆ F such that |S| ≥ dn2 , guarantees that if p is a non-zero polynomial, Pr [ p( R1 , . . . , Rn ) = 0] ≤ 1 . n2 Naturally, if p is zero polynomial, then the probability equals 1. 8.2 Detecting Perfect Matchings by Computing a Determinant Let us solve the easier problem of detecting a perfect matching in a graph, first for bipartite graphs, and then for general graphs. We define the Edmonds matrix of a bipartite graph G. Definition 8.5. For a bipartite graph G = ( L, R, E) with | L| = | R| = n, its Edmonds matrix E( G ) is the following n × n matrix of indeterminates/variables.  0 if (i, j) ̸∈ E and i ∈ L, j ∈ R Ei,j =  xi,j if (i, j) ∈ E and i ∈ L, j ∈ R. Example 8.6. The Edmonds matrix of the graph to the right is " # x11 x12 E= , 0 x22 which has determinant x11 x22 . 1 1 2 2 Figure 8.1: Bipartite graph 81 82 detecting perfect matchings by computing a determinant Recall the Leibniz formula to compute the determinant, and apply it to the Edmonds matrix: n det (E( G )) = ∑ (−1)sign(σ) ∏ Ei,σ(i) σ ∈Sn i =1 There is a natural correspondence between potential perfect matchings in G and permutations σ ∈ Sn , where we match each vertex i ∈ L to vertex σ (i ) ∈ R. Moreover, the term in the above expansion corresponding to a permutation σ gives a non-zero monomial (a product of xij variables) if and only if all the edges coresponding to that permutation exist in G. Moreover, all the monomials are distinct, by construction. This proves the following simple claim. Proposition 8.7. Let E( G ) denote the Edmonds matrix of a bipartite graph G. Then det (E( G )) is a non-zero polynomial (over any field F) if and only if G contains a perfect matching. However, writing out this determinant of indeterminates could take exponential time—it would correspond to a brute-forece check of all possible perfect matchings. Lovász’s idea was to use the randomized algorithm implicit in Theorem 8.3 to test whether G contains a perfect matching. Algorithm 10: PM-tester(bipartite graph G, S ⊆ F) E ← Edmonds matrix for graph G For each non-zero entry Eij , sample Rij ∈ S independently and uniformly at random e ← E({ Rij }i,j ) be matrix with sampled values substituted 10.3 E e ) = 0 then 10.4 if det( E 10.5 return G does not have a perfect matching (No) 10.6 else 10.7 return G contains a perfect matching (Yes) 10.1 10.2 Lemma 8.8. For |S| ≥ n3 , Algorithm 10 always returns No if G has no perfect matching, while it says Yes with probability at least 1 − n12 otherwise. Moreover, the algorithm can be implemented in time O(nω ), where ω is the exponent of matrix multiplication. Proof. The success probability follows from Remark 8.4, and the fact that the determinant is a polynomial of degree n. Assuming that arithmetic operations can be done in constant time, we can compute e in time O(n3 ), using Gaussian elimination. (Try the determinant of E it yourself, or see Wikipedia.) Hence Algorithm 10 easuly runs in time O(n3 ). In fact, Bunch and Hopcroft proved that both computing matrix inverses and determinants can be done in asymptotically the same time as matrix multiplication. Thus, we can make Algorithm 10 run in time O(nω ). If we work over finite fields, the size of the numbers is not an issue. However, Gaussian Elimination over the rationals could cause some of the numbers to get unreasonably large. 
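One common way to sidestep this in practice is to run the whole computation over a prime field. Here is a sketch of Algorithm 10 in that setting; the graph representation, function names, and the particular prime (large enough that |S| ≥ n³ for any reasonable n) are our own choices, and the determinant is computed by straightforward Gaussian elimination mod p.

    import random

    def det_mod_p(A, p):
        # Determinant of a square matrix over F_p via Gaussian elimination.
        A = [row[:] for row in A]
        n, det = len(A), 1
        for i in range(n):
            pivot = next((r for r in range(i, n) if A[r][i] % p != 0), None)
            if pivot is None:
                return 0
            if pivot != i:
                A[i], A[pivot] = A[pivot], A[i]
                det = -det                        # a row swap flips the sign
            det = det * A[i][i] % p
            inv_piv = pow(A[i][i], p - 2, p)
            for r in range(i + 1, n):
                factor = A[r][i] * inv_piv % p
                for c in range(i, n):
                    A[r][c] = (A[r][c] - factor * A[i][c]) % p
        return det % p

    def pm_tester(adj, p=(1 << 31) - 1):
        # Sketch of Algorithm 10 for a bipartite graph with parts L = R = {0,...,n-1}:
        # adj[i] is the set of right-neighbors of left vertex i. Substitute random
        # field elements for the variables of the Edmonds matrix and test the determinant.
        n = len(adj)
        E = [[random.randrange(1, p) if j in adj[i] else 0 for j in range(n)]
             for i in range(n)]
        return det_mod_p(E, p) != 0   # "Yes" exactly when the random determinant is nonzero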
Ensuring that the numbers remain polynomially bounded requires care; see Edmonds’ paper, or a book on numerical methods. J.R. Bunch and J.E. Hopcroft (1974) graph matchings ii: algebraic algorithms 8.2.1 Non-bipartite matching The extension to the non-bipartite case requires a very small change: instead of using the Edmonds matrix, which is designed for bipartite graphs, we use the analogous object for general graphs. This object was introduced by Bill Tutte in his 1947 paper with the Tutte-Berge theorem. Definition 8.9. For a general graph G = (V, E) with |V | = n, the Tutte matrix T( G ) of G is the n × n skew-symmetric matrix given by   if (i, j) ̸∈ E or i = j  0 Ti,j = Tutte (1947) A matrix A is skew-symmetric if A⊺ = − A. if (i, j) ∈ E and i < j xi,j    − x j,i if (i, j) ∈ E and i > j. Observe that now each variable is of the form xi,j for i < j; it occurs twice in the matrix, with the variables below the diagonal being the negations of those above. Example 8.10. For the graph to the right, the Tutte matrix is  0  −x  1,2   0 − x1,4 x1,2 0 − x2,3 − x2,4 And its determinant is 0 x2,3 0 − x3,4  x1,4 x2,4    x3,4  0 2 2 2 2 + 2x1,2 x1,4 x2,3 x3,4 + x1,2 x3,4 . x1,4 x2,3 We claim the same property for this matrix as we did for the Edmonds matrix: Theorem 8.11. For any graph G, the determinant of the Tutte matrix T( G ) is a non-zero polynomial over any field F if and only if there exists a perfect matching in G. Proof. As before, the determinant is n det(T( G )) = ∑(−1)sign(σ) ∏ Ti,σ(i) . σ i =1 One direction of the theorem is easy: if G has a perfect matching M, consider the permutation σ mapping each vertex to the other endpoint of the matching edge containing it. The corresponding monomial above is ± ∏e∈ M xe2 , which cannot be cancelled by any other permutation, and makes the determinant non-zero over any field. 1 2 4 3 Figure 8.2: Non-bipartite graph 83 84 from detecting to finding perfect matchings To prove the converse, suppose the determinant is non-zero. In the monomial corresponding to permutation σ, either each (i, σ (i )) in the product corresponds to an edge of G, or else the monomial is zero. This means each non-zero monomial of det(T( G )) chooses an edge incident to i, for each vertex i ∈ [n], giving us a cycle cover of G. If the cycle cover for σ has an odd-length cycle, take the permutation σ′ obtained by reversing the order of, say, the first odd cycle in it. This does not change the sign of the permutation, which depends on how many odd and even cycles there are, but the skew symmetry and the odd cycle length means the product of matrix entries flips the sign. Hence the monomials for σ and σ′ cancel each other. Formally, we get a bijection between permutations with odd-length cycles that cancel out. The remaining monomials corresponding to cycle covers with even cycles. Choosing either the odd edges or even edges on each such even cycle gives a perfect matching. Now given Theorem 8.11, the Tutte matrix can simply be substituted instead of the Edmonds matrix to extend the results to general graphs. 8.3 From Detecting to Finding Perfect Matchings We can convert the above perfect matching tester (which solves the decision version of the perfect matching problem) into an algorithm for the search version: one that outputs a perfect matching in a graph (if one exists), using the simple but brilliant idea of self-reducibility. Suppose that graph G has a perfect matching. 
Then we can pick any edge e = uv and check if G [ E − e], the subgraph of G obtained by dropping just the edge e, contains a perfect matching. If not, then edge e must be part of every perfect matching in G, and hence we can find a perfect matching on the induced subgraph G [V \ {u, v}]. The following algorithm is based on this observation. Algorithm 11: Find-PM(bipartite graph G, S ⊆ F) Assume: G has a perfect matching let e = uv be an edge in G if PM-tester(G [ E − e], S) == Yes then 11.2 return Find-PM(G [ E − e], S) 11.3 else 11.4 M′ ← Find-PM(G [V − {u, v}], S) 11.5 return M′ ∪ {e} 11.1 Theorem 8.12. Let |S| ≥ n3 . Given a bipartite graph G that contains some perfect matching, Algorithm 11 finds a perfect matching with probability at least 12 , and runs in time O(m · nω ). We are reducing the problem to smaller instances of itself, hence the term self-reducibility. graph matchings ii: algebraic algorithms Proof. At each step, we call the tester once, and then recurse after either deleting an edge or two vertices. Thus, the number of total recursive steps inside Algorithm 11 is at most max{m, n/2} = m, if the graph is connected. This gives a runtime of O(m · nω ). Moreover, at each step, the probability that the tester returns a wrong answer is at most n12 , so the PM-tester makes a mistake with probability at most m ≤ 1/2, by a union bound. n2 Observe that the algorithm assumes that G contains a perfect matching. We could simply run the PM-tester once on G at the beginning to check for a perfect matching, and then proceed as above. Or indeed, we could just run the algorithm regardless; if G has no perfect matching, there is no danger of this algorithm erronenously returning one. Moreover, there are at least two ways reducing the error probability to 1/nc : we could increase the size of S to ns+3 , or we could repeat the above algorithm c log2 n times. For the latter approach, the probability of not getting a perfect matching in all iterations is at most (1/2)c log2 n = n1c . Hence we get correctness with high probability. Corollary 8.13. Given a bipartite graph G containing a perfect matching, there is an O(m · nω log n)-time algorithm which finds a perfect matching with high probability. Exercise 8.14. Reduce the time-complexity of the basic version of Algorithm 11 from O(m · nω ) to O(n log n · nω ). 8.3.1 The Algorithm of Rabin and Vazirani Rewrite this section. How can we speed up the algorithm further? The improvement we give here is small, it only removes a logarithmic term from the algorithm you get from Exercise 8.14, but it has a nice idea that we want to emphasize. Again, we only focus on the bipartite case, and leave the general case for the reader. Also, we identify the nodes of both L and R with [n], so that we can index the rows and columns of the Edmonds matrix using vertices in L and R respectively. We can view Algorithm 11 as searching for a permutation π such that M := {iπ (i ) | i ∈ [n]} is a perfect matching in G. Hence it picks a vertex, say 1 ∈ L, and searches for a vertex j ∈ R such that there is an edge 1j covering vertex 1, and also the remaining graph has a perfect matching. Interestingly, the remaining graph has an Edmonds matrix which is simple: it is simply the matrix E−1,− j , which is our notation from dropping row 1 and column j from E. 85 86 from detecting to finding perfect matchings Therefore, our task is simple: find a value j such that 1j is an edge in G, and also det(E−1,− j ) is a non-zero polynomial. 
Doing this naively would require n determinant computations, one for each j, and we’d be back to square one. But the smart observation is to recall Cramer’s rule for the inverse for any matrix A: A −1  (−1)i+ j det( A− j,−i ) . = i,j det( A) Some jargon that may be useful: A −1 = (8.1) e obtained by substituting random values into the Take the matrix E Edmonds matrix E, and assume the set S and field F are of size at least n10 , say, so that all error probabilities are tiny. Compute its e −1,− j ): inverse in time O(nω ), and use (8.1) to get all the values det(E adjugate( A) , det( A) where adjugate( A) is the transpose of the cofactor matrix of A, given by cofactor( A) p,q := (−1) p+q det( A− p,−q ), where det( A− p,−q ) is also called a minor of A.  e −1,− j ) = E e −1 e) det(E × (−1) j+1 × det(E j,1 using just n scalar multiplications. In fact, it suffices to find a non e −1 , which indicates that det(E e −1,− j ) is non-zero, and hence zero E j,1 the corresponding det(E−1,− j ) (without the tilde) is a non-zero polynomial, so that G [V \ {1, j}] has a perfect matching. In summary, by computing one matrix inverse and n scalar multiplications, we can figure out one edge of the matching. Hence the runtime can be made O(n · nω ). Extending this to general graphs requires a bit more work; we refer to the Rabin and Vazirani paper for details. Also, Marcin Mucha’s thesis has a very well-written introduction which discusses these details, and also gives the details of his improvement (with Sankowski) to O(nω ) time. 8.3.2 The Polynomial Identity Testing (PIT) Problem In Polynomial Identity Testing (PIT) we are given a polynomial P( x1 , x2 , . . . , xn ) over some field F, and we want to test if it is identically zero or not. If P were written out explicitly as a list of monomials and their coefficients, this would not be a problem, since we could just check that all the coefficients are zero. But if P is represented implicitly, say as a determinant, then things get more tricky. A big question is whether polynomial identity testing can be derandomized. We don’t know deterministic algorithms that given P can decide whether P is identically zero or not, in poly-time. How is P given, for this question? Even if P is given as an arithmetic circuit (a circuit whose gates are addition, subtraction and multiplication, and inputs are the variables and constants), it turns out that derandomizing PIT will result in surprising circuit lower bounds—for example, via a result of Kabanets and Impagliazzo. Derandomizing special cases V. Kabanets and R. Impagliazzo (2004) of PIT can be done. For example, the PIT instances that come from graph matchings ii: algebraic algorithms matchings can be derandomized, as was shown in work by Jim Geelen and Nick Harvey, among others; however, but the runtime seems to get worse. 8.4 Red-Blue Perfect Matchings To illustrate the power of the algebraic approach, let us now consider the red-blue matching problem. We solve this problem using the algebraic approach and randomization. Interestingly, no determistic polynomial-time algorithm is currently known for this problem! Again, we consider bipartite graphs for simplicity. Given a graph G where each edge is colored either red or blue, and an integer k, a k-red matching is a perfect matching in G which has exactly k red edges. The goal of the red-blue matching problem is to decide whether G has a k-red matching. 
(We focus on the decision version of the problem; the self-reducibility ideas used above can solve the search version. To begin, let’s solve the case when G has a unique red-blue matching with k red edges. Define the following n × n matrix: Mi,j =    0    1 y if (i, j) ̸∈ E, if (i, j) ∈ E and colored blue, if (i, j) ∈ E and colored red. Claim 8.15. Let G have at most one perfect matching with k red edges. The determinant det(M) has a term of the form ck yk if and only if G has a k-red matching. Proof. Consider p(y) := det(M) as a univariate polynomial in the variable y. Again, using the Leibniz formula, the only way to get a non-zero term of the form ck yk is if the graph has a k-red matching. And since we assumed that G has at most one such perfect matching, such a term cannot be cancelled out by other such matchings. The polynomial p(y) has degree at most n, and hence we can recover it by Lagrangian interpolation. Indeed, we can choose n + 1 distinct numbers a0 , . . . , an , and evaluate p( a0 ), . . . , p( an ) by computing the determinant det(M) at y = ai , for each i. These n + 1 values are enough to determine the polynomial as follows: n +1 p ( y ) = ∑ p ( ai ) ∏ i =1 j ̸ =i   x − aj . ai − a j (E.g., see 451 lecture notes or Ryan’s lecture notes.) Note this is a completely deterministic algorithm, so far. J.F. Geelen (2000) N.J.A. Harvey (2009) 87 88 matchings in parallel, and the isolation lemma 8.4.1 Getting Rid of the Uniqueness Assumption To extend to the case where G could have many k-red matchings, we can redefine the matrix as the following: Mi,j =    0 xij    yxij if (i, j) ̸∈ E, if (i, j) ∈ E and colored blue, if (i, j) ∈ E and colored red. The determinant det(M) is now a polynomial in m + 1 variables and degree at most 2n. Writing n P(x, y) = ∑ yi Qi (x), i =0 where Qi is a multilinear degree-n polynomial that corresponds to all the i-red matchings. If we set the x variables randomly (say, to values xij = aij ) from a large enough set S, we get a polynomial R(y) = P(a, y) whose only variable is y. The coefficient of yk in this polynomial is Qk (a), which is non-zero with high probability, by the Schwartz-Zippel lemma. Now we can again use interpolation to find out this coefficient, and decide the red-blue matching problem based on whether it is non-zero. 8.5 Multilinear just means that the degree of each variable in each monomial is at most one. Matchings in Parallel, and the Isolation Lemma One of the “killer applications” of the algebraic method for finding a perfect matching is that the approach extends to getting a (randomized) parallel algorithm as well. The basic idea is simple when there is a unique perfect matching. Indeed, computing the determinant can be done in parallel with poly-logarithmic depth and polynomial work. Hence, for each edge e we can run the PM-tester algorithm on G, and also on G [ E − e] to see if e belongs to this unique perfect matching; we output e if it does. However, this approach fails when the graph has multiple perfect matchings. A fix for this problem was given by Mulmuley, Vazirani, and Vazirani by adding further randomness! The approach is first extend the approach from §8.2 to find a minimum-weight perfect matching using the Tutte matrix and Schwartz-Zippel lemma, as long as the weights are polynomially bounded. (Exercise: solve this!) The trickier part is to show that assigning random polynomially-bounded weights to the edges of G causes it to have a unique minimum-weight perfect matching with high probability. 
Then this unique matching can also be found in parallel, as we outlined in the previous paragraph. K. Mulmuley, U.V. Vazirani and V.V. Vazirani (1987) graph matchings ii: algebraic algorithms The proof showing that random weights result in a unique minimumweight perfect matching is via a beautiful result called the Isolation Lemma. Let us give its simple elegant proof. Theorem 8.16. Consider a collection F = { M1 , M2 , . . . , } of sets over a universe E of size m. Assign a random weight to each elements of E, where the weights are drawn independently and uniformly from {1, . . . , 2m}. Then there exists a unique minimum-weight set with probability at least 12 . Proof. Call an element e ∈ E “confused” if the weight of a minimumweight set containing e is the same as the weight of a minimumweight set not containing e. We claim that any specific element e is confused with probability at most 1/2m. Observe is that there exists a confused element if and only if there are two minimum-weight sets, so using the claim and taking a union bound over all elements proves the theorem. To prove the claim, make the random choices for all elements except e. Now the identity (and weight) of the minimum-weight set not containing e is determined; let its weight be W − . Also, the identity (but not the weight) of the minimum-weight set containing e is determined. Its weight is not determined because the random choice for e’s weight has not been made, so denote its weight by W + + we , where we is the random variable denoting the weight of e. Now e will be confused precisely if W − = W + + we , i.e., if we = W − − W + . But since we is chosen uniformly at random from a set of size 2m, this probability is at most 1/2m, as claimed. We are using the principle of deferred decisions again. It is remarkable the result does not depend on number of sets in F , but only on the size of the universe. We also emphasize that the weights being drawn from a polynomially-sized set is what gives the claim its power: it is trivial to obtain a unique minimum-weight set if the weights are allowed to be in {1, . . . , 2m }. (Exercise: how?) Finally, the proof strategy for the Isolation Lemma is versatile and worth remembering. 8.5.1 Towards Deterministic Algorithms The question of finding perfect matchings deterministically in polylogarithmic depth and polynomial work still remains open. Some recent work of Fenner, Gurjar, and Thierauf, and of Svensson and Tarnawski has shown how to obtain poly-logarithmic depth and quasi-polynomial work; we may see some of these ideas in an upcoming HW. S. Fenner, R. Gurjar, and T. Thierauf (2016) O. Svensson and J. Tarnawski (2017) 89 90 the permanent connection 8.6 The Permanent Connection The randomized approach above defines determinants of matrices full of variables to decide whether a graph has a perfect matching or not. Why didn’t we just use the determinant of the adjacency matrix, instead of replacing each 1 by a suitable variable? The reason is simple: the matrix Jn×n (which is the bipartite adjacency matrix for the complete bipartite graph Kn,n ) has determinant zero, even though it has n! perfect matchings. Of course, this is because the determinant is defined as det( A) = ∑ (−1)sign σ ∏ Ai,σi . σ ∈Sn And in this case, the signs of the permutations cancel each other out. What if we defined a new quantity which does not have these pesky negative signs. This function is called the permanent, defined as: perm( A) = ∑ ∏ Ai,σi . 
σ ∈Sn Given this definition, we immediately get the following fact: Fact 8.17. Given the (bipartite) adjacency matrix A for a bipartite graph G, perm( A) is the number of perfect matchings in G. Hence The term comes from Cauchy’s use of “fonctions symétriques permanentes” for a related class of functions. The term determinant comes from Gauss, but apparently Cauchy was the one to use determinant to mean precisely the same object as we do. perm( A) > 0 ⇐⇒ G has a perfect matching. This sounds like great news, since we no longer have to rely on the above randomization ideas. However, we seem to have gone from a minor annoyance to a major one: how do we compute the permanent efficiently? This was a source of theoretical and practical annoyance for some time, and attempts to transform permanent computations into determinant computations had been fruitless. Finally, in 1979, Les Valiant proved the following surprising theorem: Theorem 8.18. It is NP-hard to compute the permanent of square {0, 1}matrices. In fact, computing the number of perfect matchings of a bipartite graph is as hard as counting the number of satisfying assignments to a 3SAT formula. This is truly a remarkable theorem. Finding a satisfying assignment to a 3SAT formula is NP-hard, whereas finding a perfect matching is in polynomial time. But counting the number of these two objects has the same complexity! 8.7 A Matrix Scaling Approach While our attempt to use the permanent instead of the determinant to compute matchings turned to be a dead end, the permanent does L.R. Valiant. The complexity of computing the permanent. (1979) The class of problems reducible to counting the number of satisfying assignments to a 3SAT formula is called #P; this contains all problems in NP, naturally, but also seems to contain much more. Valiant’s theorem says: counting the number of perfect matchings is as hard as all the problems in #P, which blows my mind. graph matchings ii: algebraic algorithms 91 arise in the analysis of a very different matrix-based approach to finding (fractional) perfect matchings in bipartite graphs. Let us now explore this elegant approach, called matrix scaling. Given a non-negative matrix A, and two non-negative diagonal matrices R and C, consider the scaled matrix B := RAC. In other words, taking the matrix A, and scaling each row i by Rii and each column j by Cjj , gives the matrix B. Matrix scaling gives us yet another characterization of bipartite graphs that have perfect matchings: Theorem 8.19. A bipartite graph G admits a perfect matching if and only if for each ε > 0 there exist non-negative matrices R, C such that RAG C is ε-approximate doubly-stochastic. Proof. To come soon Given the adjacency matrix A ∈ {0, 1}n for the bipartite graph G, we now try to find scaling matrices R and C. Since we want the rowand column-sums to be close to 1, one “greedy” idea is to start with A and repeatedly do the following two steps: 1. Scale each row to make the row sums equal to 1; this may put the column sums out of whack. 2. Scale each column to make the row sums equal to 1; this may now mess up the row sums. We show that if we ever reach a matrix where both row and column sums are very close to 1, then Theorem 8.19 tells us that the graph has a perfect matching. And if we don’t manage to get close to 1 in a “reasonable” time (which depends on n and ε), interestingly we can conclude it has no perfect matching! To make this precise, let’s define two diagonal matrices R( A) := diag( A1) and C ( A) = diag( A⊺ 1). 
Then the algorithm becomes: Algorithm 12: Sinkhorn Scaling( A) for i = 1, 2, . . . , T do A ← R ( A ) −1 A √ 12.3 if ∥ I − C ( A)∥2 ≤ 1/ n then return true 12.4 A ← A C ( A ) −1 √ 12.5 if ∥ R( A) − I ∥2 ≤ 1/ n then return true 12.1 12.2 12.6 return false A matrix A is doubly-stochastic if it has unit row- and column-sums. In other words, A1 = A⊺ 1 = 1. The ε-doublystochasticity requires that A1, A⊺ 1 both have entries in (1 − ε, 1 + ε). 92 a matrix scaling approach Theorem 8.20. If T ≥ poly(n) then Algorithm 12 outputs true if and only if the bipartite graph G has a perfect matching. Proof. We will use the permanent of A as a potential function: n perm( A) := ∑ ∏ Ai,σ(i) . σ ∈Sn i =1 Let A(t) be the matrix obtained after t rescalings. Here are the three crucial facts: 1. perm( A(t) ) ≤ 1. To show this, observe that for any non-negative matrix M, n perm( M ) ≤ ∏( Mi1 + . . . + Min ), i =1 because every term of the sum in the permanent also appears in the expression on the right, and the additional terms are all nonnegative. We now use this inequality for some A(t) that has unit row sums (the case of unit column sums is identical): then the RHS above equals 1, which proves the claim. 2. perm( A(1) ) ≥ n−n . This follows from the fact that each matrix A(t) has normalized rows or columns. Suppose the rows are normalized While this looks very similar to the Leibniz formula for the determinant—it just lacks the (−1)sign(σ) term inside the summation—the small difference completely changes the complexity of the two problems. While we can compute determinants in polynomial time, the computation of permanents is #P-complete. 9 Graph Matchings III: Weighted Matchings In this chapter, we study the matching problem from the perspective of linear programs, and also learn results about linear programming using the matching problem as our running example. In fact, we see how linear programs capture the structure of many problems we have been studying: MSTs, min-weight arborescences, and graph matchings. 9.1 Linear Programming We start with some basic definitions and results in Linear Programming. We will use these results while designing our linear program solutions for min-cost perfect matchings, min-weight arborescences and MSTs. This will be a sufficient jumping-off point for the contents of this lecture; a more thorough introduction to the subject can be found in the introductory text by Matoušek and Gärtner. 9.1.1 Polytopes and Polyhedra Definition 9.2 (Polyhedron). A polyhedron in Rn is the intersection of a finite number of half spaces. A polyhedron is a convex region which is defined by finitely many linear constraints. A polyhedron in n dimensions with m constraints is often written compactly as K = { Ax ≤ b}, 4 x2 Definition 9.1. Let ⃗a ∈ Rn be a vector and let b ∈ R a scalar. Then a half-space in Rn is a region defined by the set {⃗x ∈ Rn | ⃗a · ⃗x ≥ b}. " # 1 · ⃗x ≥ 3} in R2 is shown As an example, the half space S = {⃗x | 1 on the right. (Note that we implicitly restrict ourselves to closed halfspaces.) 6 2 0 0 1 2 3 x1 4 5 Figure 9.1: The half-space in R2 given by the set S 6 94 linear programming where A is an m × n constraint matrix, x is an n × 1 vector of variables, and b is an m × 1 vector of constants. Definition 9.3 (Polytope). A polytope K ∈ Rn is a bounded polyhedron. In other words, a polytope is polyhedron such that there exists some radius R > 0 such that K ⊆ B(0, R) = { x | ∥ x ∥2 ≤ R}. 
A simple example of a polytope (where the bounded region of the polytope is highlighted by ) appears on the right. We can now define a linear program (often abbeviated as LP) in terms of a polyhedron. Figure 9.2: The polytope in R2 given by the constraints − x1 − x2 ≤ 1, x1 ≤ 0, and x2 ≤ 0. Definition 9.4 (Linear Program). For some integer n, a polyhedron K = { x | Ax ≤ b}, and an n by 1 vector c, a linear program in n dimensions is the linear optimization problem min{c · x | x ∈ K } = min{c · x | Ax ≤ b}. The set K is called the feasible region of the linear program. Although all linear programs can be put into this canonical form, in practice they may have many different forms. These presentations can be shown to be equivalent to one another by adding new variables and constraints, negating the entries of A and c, etc. For example, the following are all linear programs: max{c · x : Ax ≤ b} min{c · x : Ax = b} min{c · x : Ax ≥ b} min{c · x : Ax ≤ b, x ≥ 0}. x x x x The polyhedron K need not be bounded for the linear program to have a (finite) optimal solution. For example, the following linear program has a finite optimal solution even though the polyhedron is unbounded: min{ x1 + x2 | x1 + x2 ≥ 3}. (9.1) x 9.1.2 Vertices, Extreme Points, and BFSs We now introduce three different classifications of some special points associated with polyhedra. (Several of these definitions extend to convex bodies.) Definition 9.5 (Extreme Point). Given a polyhedron K ∈ Rn , a point x ∈ K is an extreme point of K if there do not exist distinct x1 , x2 ∈ K, and λ ∈ [0, 1] such that x = λx1 + (1 − λ) x2 . In other words, x is an extreme point of K if it cannot be written as the convex combination of two other points in K. See Figure 9.3 for an example. Here’s another kind of point in K. Figure 9.3: Here y is an extreme point, but x is not. In this course, we will use the notation c · x, c⊺ x, and ⟨c, x ⟩ to denote the innerproduct between vectors c and x. graph matchings iii: weighted matchings 95 Definition 9.6 (Vertex). Given a polyhedron K ⊆ Rn , a point x ∈ K is a vertex of K if there exists an vector c ∈ Rn such that c · x < c · y for all y ∈ K y ̸= x. In other words, a vertex is the unique optimizer of some linear objective function. Equivalently, the hyperplane {y ∈ Rn | c · y = c · x } intersects K at the single point x. Note that there may be a polyhedron that does not have any vertices: e.g., one given by a single constraint, or two parallel constraints. Finally, here’s a third kind of special point in K: Definition 9.7 (Basic Feasible Solution). Given a polyhedron K ∈ Rn , a point x ∈ K is a basic feasible solution (bfs) for K if there exist n linearly independent defining constraints for K which x satisfies at equality. In other words, let K := { x ∈ Rn | Ax ≤ b}, where the m constraints corresponding to the m rows of A are denoted by ai · x ≤ bi . Then x ∗ ∈ Rn is a basic feasible solution if there exist n linearly independent constraints for which ai · x ∗ = bi , and moreover ai · x ∗ ≤ bi for all other constraints (because x ∗ must belong to K, and hence satisfy all other constraints as well). Note there are only (m n ) basic feasible solutions for K, where m is the total number of constraints and n is the dimension. As you may have guessed by now, these three definitions are all related. In fact, they are all equivalent. Fact 9.8. Given a polyhedron K and a point x ∈ K, the following are equivalent: • x is a basic feasible solution, • x is an extreme point, and • x is a vertex. 
While we do not prove it here, you could try to prove it yourself, or consult a textbook. For now, we proceed directly to the main fact we need for this section. Fact 9.9. For a polytope K and a linear program LP := min{c · x | x ∈ K }, there exists an optimal solution x ∗ ∈ K such that x ∗ is an extreme point/vertex/bfs of K. This fact suggests an algorithm for LPs when K is a polytope: simply find all of the (at most (m n ) basic feasible solutions and pick the one that gives the minimum solution value. Of course, there are more efficient algorithm to solve linear programs; we will talk about them in a later chapter. However, let us state a theorem—a very restricted form of the general result—about LP solving that will suffice for now: Observe that we claimed Fact 9.9 for LPs whose feasible region is a polytope, since that suffices for today, but it can be proven with weaker conditions. However it is not true for all LPs: e.g., the LP in (9.1) has an infinite number of optimal solutions, none of which are at vertices. 96 weighted matchings in bipartite graphs Theorem 9.10. There exist algorithms that take any LP min{c · x | Ax = b, x ≥ 0, x ∈ Rn }, where both the constraint matrix A and the RHS b have entries in {0, 1} and poly(n) rows, and output a basic feasible solution to this LP in poly(n) time. We will see a sketch of the proof in a later chapter. Discuss the dependence on the number of bits to represent c? Or make this an informal theorem? 9.1.3 Convex Hulls and an Alternate Representation The next definition allows us to give another representation of polytopes: Definition 9.11 (Convex Hull). Given x1 , x2 , . . . , x N ∈ Rn , the convex hull of x1 , . . . , x N is the set of all convex combinations of these points. In other words, CH( x1 , . . . , x N ) is defined as ( ) x ∈ Rn N N i =1 i =1 ∃λ1 , . . . , λ N ≥ 0 s.t. ∑ λi = 1 and x = ∑ λi xi . (9.2) Put yet another way, the convex hull of x1 , . . . , x N is the intersection of all convex sets that contain x1 , . . . , x N . It follows from the definition that the convex hull of finitely many points is a polytope. (Check!) We also know the following fact: Fact 9.12. Given a polytope K with extreme points ext(K ), K = CH(ext(K )). The important insight that polytopes may be represented in terms of their extreme points, or their bounding half-planes. One representation may be easier to work with than the other, depending on the situation. The rest of this chapter will involve moving between these two methods of representing polytopes. 9.2 Weighted Matchings in Bipartite Graphs While the previous chapters focused on finding maximum matchings in graphs, let us now consider the problem of finding a minimumweight perfect matching in a graph with edge-weights. As before, we start with bipartite graphs, and extend our techniques to general graphs. We are given a bipartite graph G = ( L, R, E) with edge-weights we . We want to use linear programs to solve the problem, so it is natural to have a variable xe for each edge e of the graph. We want our solution to set xe = 1 if the edge is in the minimum-weight graph matchings iii: weighted matchings 97 perfect matching, and xe = 0 otherwise. Compactly, this collection of variables gives us a | E|-dimensional vector x ∈ R| E| , that happens to contain only zeros and ones. A bit of notation: for any subset S ⊆ E, let χS ∈ {0, 1}| E| denote the characteristic vector of this subset S, where χS has ones in coordinates that correspond to elements in S, and zeros in the other coordinates. 
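This encoding is easy to compute explicitly for small graphs. The following sketch (names ours) lists the characteristic vector χ_M of every perfect matching M by trying all left-to-right bijections; these vectors are exactly the points whose convex hull is taken in the next subsection.

import itertools

def perfect_matching_vectors(n, edges):
    """Characteristic vectors chi_M, indexed by the given edge list, of all perfect
    matchings of a bipartite graph with vertex sets {0, ..., n-1} on both sides."""
    index = {e: i for i, e in enumerate(edges)}
    vectors = []
    for perm in itertools.permutations(range(n)):     # candidate matching: l matched to perm[l]
        matching = [(l, perm[l]) for l in range(n)]
        if all(e in index for e in matching):          # is every candidate edge present in G?
            chi = [0] * len(edges)
            for e in matching:
                chi[index[e]] = 1
            vectors.append(chi)
    return vectors

# Example: the 4-cycle with edges [(0, 0), (0, 1), (1, 0), (1, 1)] has exactly two perfect
# matchings, with characteristic vectors (1, 0, 0, 1) and (0, 1, 1, 0).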
9.2.1 Goal: the Bipartite Perfect Matching Polytope It is conceptually easy to define an | E|-dimensional polytope whose vertices are precisely the perfect matchings of G: we simply define CPM ( G ) = CH ({χ M | M is a perfect matching in G }). Figure 9.4: This graph has one perfect matching M: it contains edges 1, 4, 5, and 6, represented by the vector χ M = (1, 0, 0, 1, 1, 1). (9.3) And now we get a linear program that finds the minimum-weight perfect matching in a bipartite graph. min{w · x | x ∈ CPM ( G )}. By Fact 9.9, there is an optimal solution at a vertex of CPM ( G ), which by construction represents a perfect matching in G. The good part of this linear program is that its feasible region has (a) only integer extreme points, (b) which are in bijection with the objects we want to optimize over. So optimizing over this LP will immediately solve our problem. (We can assume that there are linear program solvers which always return an optimal vertex solution, if one exists.) Moreover, the LP solver runs in time polynomial in the size of the LP. The catch, of course, is that we have no control over the size of the LP, as we have written it. Our graph G may have an exponential number of matchings, and hence the definition of CPM ( G ) given in (9.3) is too unwieldly to work with. Of course, the fact that there are an exponential number of vertices does not mean that there cannot be a smaller representation using half-spaces. Can we find a compact way to describe CPM ( G )? 9.2.2 A Compact Linear Program The beauty of the bipartite matching problem is that the “right” linear program is perhaps the very first one you may write. Here is the definition of the polytope using linear constraints: The unit cube K = { x ∈ Rn | 0 ≤ x i ≤ 1 ∀ i } is a polytope with 2n constraints but 2n vertices. 98 weighted matchings in bipartite graphs          ∑ xlr = 1    r ∈ N (l )   K PM ( G ) = x ∈ R| E| s.t. ∑ xlr = 1     l ∈ N (r )         xe ≥ 0 ∀l ∈ L ∀r ∈ R ∀e ∈ E              The constraints of this polytope merely enforce that each coordinate is non-negative (which gives us | E| constraints), and that the xe values of the edges leaving each vertex sum to 1 (which gives us | L| + | R| more constraints). All these constraints are satisfied by each χ M corresponding to a matching M, which is promising. But it still always comes as a surprise to realize that his first attempt is actually successful: Theorem 9.13. For any bipartite graph G, K PM ( G ) = CPM ( G ). Proof. For brevity, let us refer to the polytopes as K and C. The easy direction is to show that C ⊆ K. Indeed, the characteristic vector χ M for each perfect matching M satisfies the constraints for K. Moreover K is convex, so if it contains all the vertices of the convex set C, it contains all their convex combinations, and hence contains all of C. Now to show K ⊆ C, we again show that the vertices of K are contained in C, and then use Fact 9.12 to infer it for the rest of K. Consider an arbitrary vertex x ∗ of K. In this proof, we use the equivalent view of a vertex as an extreme point of K. (A proof using the “basic feasible solution” perspective appears in §9.2.3, and a proof using the “vertex” perspective appears in §9.3.) Let supp( x ∗ ) = {e | xe∗ > 0} be the support of this solution. We claim that supp( x ∗ ) is acyclic. Indeed, suppose not, and cycle C = e1 , e2 , . . . , ek is contained within the support supp( x ∗ ). Since the graph is bipartite, this is an even-length cycle. 
Define ε := min e∈supp( x ∗ ) xe∗ . Observe that for all ei ∈ C, xe∗i + xe∗i+1 ≤ 1, so xe∗i ≤ 1 − ε. And of course xe∗i ≥ ε, merely by the definition of ε. Now consider two solutions x + and x − , where xe+i = xe∗i + (−1)i ε and xe+i = xe∗i − (−1)i ε. I.e., the two solutions add and subtract ε on alternate edges; this ensures that both the solutions stay within K. But then x ∗ = 12 x + + 1 − ∗ 2 x , contradicting our that x is an extreme point. Figure 9.5: There cannot be a cycle in supp( x ∗ ), because this violates the assumption that x ∗ is an extreme point. graph matchings iii: weighted matchings Therefore there are no cycles in supp( x ∗ ); this means the support is a forest. Consider a leaf vertex v in the support. Then, by the equality constraint at v, the single edge e ∈ supp( x ∗ ) leaving v must have xe∗ = 1. But this edge e = uv goes to another vertex u; because x ∗ is in K, this vertex u cannot have other edges in supp( x ∗ ) without violating its equality constraint. So u and v are a matched pair in x ∗ . Now remove u and v from consideration. We have introduced no cycles into the remainder of supp( x ∗ ), so we may perform the same step inductively to show that x ∗ is the indicator of a perfect matching, and hence x ∗ ∈ C. This means all vertices of K are in C, which proves C ⊆ K, and completes the proof. This completes the proof that the polytope K PM ( G ) exactly captures precisely the perfect matchings in G, despite having such a simple description. Now, using the fact that the linear program min{w · x | x ∈ K PM ( G )} can be solved in polynomial time, we get an efficient algorithm for finding minimum-weight perfect matching in graphs. 9.2.3 A Proof via Basic Feasible Solutions Here is how to prove Theorem 9.13 using the notion of basic feasible solutions (bfs). Suppose x ∗ ∈ R| E| is a bfs: we now show that xe∗ ∈ {0, 1} for all edges. By the definition of a bfs, there is a collection of | E| tight linearly independent constraints that define x ∗ . These constraints are of two kinds: the degree constraints ∑e∈∂(v) xe = 1 for some subset S of vertices, and the non-negativity constraints xe ≥ 0 for some subset E′ ⊆ E be edges. (Hence we have | E′ | + |S| = | E|.) By reordering columns and rows, we can put the degree constraints at the top, and put all the edges in E′ at the end, to get that x ∗ is defined by: # " # " C C′ 1 s x∗ = 0 I 0m−s where C ∈ {0, 1}s×s , C ′ ∈ {0, 1}(m−s)×s , and m = | E| and s = |S|. The edges in E′ have xe∗ = 0, so consider edges in E \ E′ . By the linear independence of the constraints, we have C being non-singular, so x ∗ | E\ E′ = C −1 (1 − C ′ x ∗ | E′ ) = C −1 1. By Cramer’s rule, xe∗ = det(C [1]i ) . det(C ) The numerator is an integer (since the entries of C are integers), so showing det(C ) ∈ {±1} means that xe∗ is an integer. 99 100 another perspective: buyers and sellers Claim 9.14. Any k × k-submatrix of C has determinant in {−1, 0, 1}. Proof. The proof is by induction on k; the base case is trivial. If the submatrix D has a column with a single 1, we can expand using that entry, and use the inductive hypothesis. Else each column of D has two non-zeros. Recall that the columns of D correspond to some edges ED in E \ E′ , and the rows correspond to vertices SD in S—two non-zeros in each column means each edge in ED has both endpoints in SD . Now if we sum rows for vertices in SD ∩ L would give the all ones vector, as will summing up rows for vertices in SD ∩ R. (Here is the only place we’re using bipartiteness.) 
In this case det( D ) = 0. Using the claim and using the fact C is non-singular and hence det(C ) cannot be zero, we get that the entries of xe∗ are integers. By the structure of the LP, the only integers possible in a feasible solution are {0, 1} and the vector x ∗ corresponds to a matching. 9.2.4 Minimum-Weight Matchings How can we we find a minimum-weight (possibly non-perfect) matching in a bipartite graph G? If the edge weights are all nonnegative, the empty matching would be the solution—but what if some edge weights are negative? (In fact, that’s how we would find a maximum-weight matching–by negating all the weights.) As before, we can define the matching polytope for G as C Match ( G ) = CH ({χ M | M is a matching in G }). To write a compact LP that describes this polytope, we slightly modify our linear constraints as follows:   ∑ xij ≤ 1    j∈ R   K Match ( G ) = x ∈ R| E| s.t. ∑ x ji ≤ 1     j∈ L         xi,j ≥ 0        ∀i ∈ L ∀i ∈ R ∀i, j              We leave it as an exercise to apply the techniques used in Theorem 9.13 to show that the vertices of K Match are matchings of G, and hence the following theorem: Theorem 9.15. For any bipartite graph G, K Match ( G ) = CH Match ( G ). 9.3 Another Perspective: Buyers and sellers The results of the previous section show that the bipartite perfect matching polytope is integral, and hence the max-weight perfect graph matchings iii: weighted matchings matching problem on bipartite graphs can be be solved by “simply” solving the LP and getting a vertex solution. But do we need a generic linear program solver? Can we solve this problem faster? In this section, we develop (a variant of) the Hungarian algorithm that finds an optimal solutions using a “combinatorial” algorithm. This proof also shows that any vertex of the polytope K PM ( G ) is integral, and hence gives another proof of Theorem 9.13. 9.3.1 The Setting: Buyers and Items Consider the setting with a set B with n buyers and another set I with n items, where buyer b has value vbi for item i. The goal is to find a max-value perfect matching, that matches each buyer to a distinct item and maximizes the sum of the values obtained by this matching. Our algorithm will maintain a set of prices for items: each item i will have price pi . Given a price vector p := ( p1 , . . . , pn ), define the utility of item i to buyer b to be ubi ( p) := vbi − pi . Intuitively, the utility measures how favorable it is for buyer b to buy item i, since it factors in both the value and the price of the item. We say that buyer b prefers item i if item i gives the highest utility to buyer b, among all items. Formally, buyer b ∈ B prefers item i at prices p if i ∈ arg maxi′ ∈ I ubi′ ( p). The utility of buyer b at prices p is the utility of this preferred item: ub ( p) := max ubi ( p) = max(vbi − pi ). i∈ I i∈ I (9.4) A buyer has at least one preferred item, and can have multiple preferred items, since there can be ties. Given prices p, we build a preference graph H = H ( p), where the vertices are buyers B on the left, items I on the right, and where bi is an edge if buyer b prefers item i at prices p. The two examples show preference graphs, where the second graph results from an increase in price of item 1. Flip the figure. Theorem 9.16. For any price vector p∗ , if the preference graph H ( p∗ ) contains a perfect matching M, then M is a max-value perfect matching. Proof. This proof uses weak linear programming duality. 
Indeed, recall the linear program we wrote for the bipartite perfect matching problem: we allow fractional matchings by assigning each edge bi a 101 102 another perspective: buyers and sellers fractional value xbi ∈ [0, 1]. maximize ∑ vbi xbi bi n subject to ∑ xbi = 1 ∀i i =1 ∑ xbi = 1 ∀b xbi ≥ 0 ∀(b, i ) b =1 n The perfect matching M is clearly feasible for this LP, so it remains to show that it achieves the optimum. Indeed, we show this by exhibiting a feasible dual solution with value ∑bi∈ M vbi , the value of the primal solution. Then by weak duality, both these solutions must be optimal. The dual linear program is the following: n n i =1 b =1 minimize ∑ pi + ∑ u b subject to pi + ub ≥ vbi ∀bi (Observe that u and p are unconstrained variables.) In fact, given any settings of the pi variables, the ub variables that minimize the objective function, while still satisfying the linear constraints, are given by ub := maxi∈ I (vbi − pi ), exactly matching (9.4). Hence, the dual program can then be rewritten as the following (non-linear, convex) program with no constraints: min p=( p1 ,...,pn ) ∑ p i + ∑ u b ( p ). i∈ I b∈ B Consider the dual solution given by the price vector p∗ . Recall that M is a perfect matching in the preference graph H ( p∗ ), and let M(i ) be the buyer matched to item i by it. Since u M(i) ( p) = v M(i)i − pi , the dual objective is ∑ pi∗ + ∑ ub ( p∗ ) = ∑ pi∗ + ∑(v M(i)i − pi ) = ∑ vbi . i∈ I b∈ B i∈ I i∈ I bi ∈ M Since the primal and dual values are equal, the primal matching M must be optimal. Prices p = ( p1 , . . . , pn ) are said to be market-clearing if each item can be assigned to some person who has maximum utility for it at these prices, subject to the constraints of the problem. In our setting, having such prices are equivalent to having a perfect matching in the preference graph. Hence, Theorem 9.16 shows that market-clearing prices give us an optimal matching, so our goal will be to find such prices. graph matchings iii: weighted matchings 103 9.3.2 The Hungarian algorithm The “Hungarian” algorithm uses the buyers-and-sellers viewpoint from the previous section. The idea of the algorithm is to iteratively change item prices as long as they are not market-clearing, and the key is to show that this procedure terminates. To make our proofs easier, we assume for now that all the values vbi are integers. The price-changing algorithm proceeds as follows: 1. Initially, all items have price pi = 0. 2. In each iteration, build the current preference graph H ( p). If it contains a perfect matching M, return it. Theorem 9.16 ensures that M is an optimal matching. 3. Otherwise, by Hall’s theorem, there exists a set S of buyers such that if N (S) := {i ∈ I | ∃b ∈ S, bi ∈ E( H ( p))} is the set of items preferred by at least one buyer in S, then | N (S)| < |S|. (N (S) is the neighborhood of S in the preference graph.) Intuitively, we have many buyers trying to buy few items, so logically, the sellers of those items should raise their prices! The algorithm increases the price of every item in N (S) by 1, and starts a new iteration by going back to step 2. That’s it. Running the algorithm on our running example gives the prices on the right. The only way the algorithm can stop is to produce an optimal matching. So we must show it does stop, for which we use a “semiinvariant” argument. We keep track of the “potential” Φ ( p ) : = ∑ p i + ∑ u b ( p ), i b where pi are the current prices and ub ( p) = maxi (vbi − pi ) as above. 
This is just the dual value, and hence is is lower-bounded by the optimal value of the dual program. (We assume the optimal value of the LP is finite, e.g., if all the input values are finite.) Then, it suffices to prove the following: Lemma 9.17. Every time we increase the prices in N (S) by 1, the value of ∑i pi + ∑b ub decreases by at least 1. Proof. The value of ∑i pi increases by | N (S)|, because we increase the price of each item i ∈ N (S) by 1. For each buyer b ∈ S, the value ub must decrease by 1, since all their preferred items had their prices increased by 1, and all other items previously had utilities at least one lower than the original ub ( p). (Here, we used the fact H.W. Kuhn (1955) The algorithm was named the Hungarian algorithm by Harold Kuhn who based his ideas on the works of Jenö Egervary and Dénes König. Munkres subsequently showed that the algorithm was in fact implementable in O(n3 ). Later, the algorithm was found to have been proposed even earlier by Carl Gustav Jacobi, before 1851. 104 another perspective: buyers and sellers that all values were integral.) Therefore, the value of the potential ∑i pi + ∑b ub changes by | N ( B)| − | B| ≤ −1. Hence the algorithm stops in finite time, and produces a maximumvalue perfect matching. By the arguments above ?? we get yet another proof of integrality of the LP ?? for the bipartite pefect matching problem. A few other remarks about the algorithm: • In fact, one can get rid of the integrality assumption by raising the prices by the maximum amount possible for the above proof to still go through, namely  min ub ( p) − max (vib − pi ) . b∈S i ̸∈ N (S) It can be shown that this update rule makes the algorithm stop in only O(n3 ) iterations. • If all the values are non-negative, and we don’t like the utilities to be negative, then we can do one of the following things: (a) when all the prices become non-zero, subtract the same amount from all of them to make the lowest price hit zero, or (b) choose S to be a minimal “consticted” set and raise the prices for N (S). This way, we can ensure that each buyer still has at least one item which gives it nonngegative utility. (Exercise!) • Suppose there are n buyers and a single item, with all non-negative values. (Imagine there are n − 1 dummy items, with buyers having zero values for them.) The above algorithm behaves like the usual ascending-price English or Vickery auction, where prices are raised until only one bidder remains. Indeed, the final price for the “real” item will be such that the second-highest bidder is indifferent between it and a dummy item. This is a more general phenomenon: indeed, even in the setting with multiple items, the final prices are those produced by the Vickery-Clarke-Groves truthful mechanism, at least if we use the version of the algorithm that raises prices on minimal constricted sets. The truthfulness of the mechanism means there is no incentive for buyers to unilaterally lie about their values for items. See, e.g., 1 for the rich connection of matching algorithms to auction theory and (algorithmic) mechanism design. Check about negative values, they don’t seem to matter at all, as long as everything is finite. What about max-weight maximum matching: we can always convert the graph, but does the algorithm work out of the box? 1 graph matchings iii: weighted matchings This proof shows that for any setting of values, there is an optimal integer solution to the linear program max{v · x | x ∈ K LP(G) }. 
This implies that every vertex x ∗ of the polytope K LP(G) is integral— indeed, the definition of vertex means x ∗ is the unique solution to the linear program for some values v, and our proof just produced an integral matching that is the optimal solution. Hence, we get another proof of Theorem 9.13, this time using the notion of vertices instead of extreme points. 9.4 A Third Algorithm: Shortest Augmenting Paths Let us now see yet another algorithm for solving weighted matching problems in bipartite graphs. For now, we switch from maximumweight matchings to minimum-weight matchings, because they are conceptually cleaner to explain here. Of course, the two problems are equivalent, since we can always negate edge weights. In fact, we solve a min-cost max-flow problem here: given an flow network with terminals s and t, edge capacities ue , and also edge costs/weights we , find an s-t flow with maximum flow value, and whose total cost/weight is the least among all such flows. (Moreover, if the capacities are integers, the flow we find will also have integer flow values on all edges.) Casting the maximum-cardinality bipartite matching problem as a integer max-flow problem, as in §blah gives us a minimum-weight bipartite matching. This algorithm uses an augmenting path subroutine, much like the algorithm of Ford and Fulkerson. The subroutine, which takes in a matching M and returns one of size | M | + 1, is presented below. Then, we can start with the empty matching and call this subroutine until we get a maximum matching. Let the original bipartite graph be G. Construct the directed graph G M as follows: For each edge e ∈ M, insert that edge directed from right to left, with weight −we . For each edge e ∈ G \ M, insert that edge directed from left to right, with weight we . Then, compute the shortest path P that starts from the left and ends on the right, and return M △ P. It is easy to see that M △ P is a matching of size | M | + 1, and has total weight equal to the sum of the weights of M and P. Call a matching M an extreme matching if M has minimum weight among all matchings of size | M |. The main idea is to show that the above subroutine preserves extremity, so that the final matching must be extreme and therefore optimal. Theorem 9.18. If M is an extreme matching, then so is M △ P. 105 106 a third algorithm: shortest augmenting paths Proof. Suppose that M is extreme. We will show that there exists an augmenting path P such that M △ P is extreme. Then, since the algorithm finds the shortest augmenting path, it will find a path that is no longer than P, so the returned matching must also be extreme. Consider an extreme matching M′ of size | M| + 1. Then, the edges in M △ M′ are composed of disjoint paths and cycles. Since M △ M′ has more edges in M′ than edges in M, there is some path P ⊂ M △ M′ with one more edge in M′ than in M. This path necessarily starts and ends on opposite sides, so we can direct it to start from the left and end on the right. We know that | M′ ∩ P| = | M ∩ P| + 1, which means that M \ P and M′ \ P must have equal size. The total weight of M\ P and M′ \ P must be the same, since otherwise, we can swap the two matchings and improve one of M and M′ . Therefore, M △ P = ( M′ ∩ P) ∪ ( M\ P) has the same weight as M′ and is extreme. Note that the formulation of G M is exactly the graph constructed if we represent the minimum matching problem as a min-cost flow. 
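Putting the pieces together, here is a self-contained sketch of the resulting algorithm for a complete bipartite graph given by an n × n weight matrix (all names below are ours, and the code is unoptimized). Bellman–Ford is used for the shortest-path computations because the matched edges have negative weights; this is safe because, by Theorem 9.18, every intermediate matching is extreme, and an extreme matching admits no negative alternating cycle in G_M.

import itertools
import math

def min_weight_perfect_matching(w):
    """Min-weight perfect matching in the complete bipartite graph with n x n weight
    matrix w, by repeatedly augmenting along a shortest augmenting path (Section 9.4)."""
    n = len(w)
    mate_l = [None] * n        # mate_l[l] = r if edge (l, r) is in the current matching M
    mate_r = [None] * n
    for _ in range(n):
        # Shortest paths in G_M: unmatched edges go l -> r with weight w[l][r], matched
        # edges go r -> l with weight -w[l][r]; every free left vertex is a valid start.
        dl = [0.0 if mate_l[l] is None else math.inf for l in range(n)]
        dr = [math.inf] * n
        par = [None] * n       # par[r] = left endpoint of the last edge on the path to r
        # (the only edge into a matched left vertex l comes from its mate, so no par[l] is needed)
        for _ in range(2 * n + 1):     # Bellman-Ford; no negative cycles since M is extreme
            for l, r in itertools.product(range(n), repeat=2):
                if mate_l[l] != r and dl[l] + w[l][r] < dr[r]:
                    dr[r], par[r] = dl[l] + w[l][r], l
                if mate_l[l] == r and dr[r] - w[l][r] < dl[l]:
                    dl[l] = dr[r] - w[l][r]
        # Augment along the shortest path to the cheapest free right vertex: M <- M Δ P.
        r = min((j for j in range(n) if mate_r[j] is None), key=lambda j: dr[j])
        while r is not None:
            l = par[r]
            r_prev = mate_l[l]         # right vertex matched to l before this augmentation
            mate_l[l], mate_r[r] = r, l
            r = r_prev
    return [(l, mate_l[l]) for l in range(n)], sum(w[l][mate_l[l]] for l in range(n))

On small instances, the output can be checked against brute force over all n! perfect matchings.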
Indeed, the previous theorem can be generalized to a very similar statement for the augmenting path algorithm for min-cost flows. graph matchings iii: weighted matchings 9.5 107 Perfect Matchings in General Graphs When the graph is not bipartite, there are no “left” and “right” sets of vertices, so we can simply define Kdeg ( G ) := { x ∈ R| E| | x (∂v) = 1 ∀v ∈ V, x ≥ 0}. Recall that ∂v is the set of edges incident on vertex v. This matches with the definition (9.3) when the graph is bipartite. Interestingly, CPM ( G ) ⊊ Kdeg ( G ) for non-bipartite graphs. Indeed, consider graph K3 which consists of a single 3-cycle: this graph has no perfect matching, but setting xe = 1/2 for each edge satisfies all the constraints. Or in the graph K6 (which does have perfect matchings), the solution where we set xe = 1/2 on two disjoint 3-cycles is an extreme point. This suggests that the linear constraints defining Kdeg ( G ) are not enough, and we need to add more constraints to capture the convex hull of perfect matchings in general graphs. In situations like this, it is instructive to look at the counterexample, to see what constraints must be satisfied by any integer solution, but are violated by this fractional solution. For a set of vertices S ⊆ V, let ∂S denote the edges leaving S. Here is one such set of constraints:   . ∑ xe ≥ 1 ∀S ⊆ V such that |S| is odd, Can you find a cost vector for which this half-integral solution is the unique optimum? e∈∂S These constraints say: the vertices belonging to a set S ⊆ V of odd size cannot be perfectly matched within themselves, and at least one edge from any perfect matching must leave S. Indeed, this constraint would be violated by the set of vertices on the odd cycle in the examples above. Adding these odd-set constraints to the previous degree constraints gives us the following polytope, which was originally proposed by Edmonds:           ∑ xvu = 1    u∈∂(v) KgenPM ( G ) = x ∈ R| E| s.t. ∑ xe ≥ 1      e∈∂(S)       xe ≥ 0 ∀v ∈ V ∀S s.t. |S| odd ∀e ∈ E Theorem 9.19. For an undirected graph G, we have KgenPM ( G ) = CgenPM ( G ), where CgenPM ( G ) the convex hull of all perfect matchings of G.              J. Edmonds. Maximum matching and a polyhedron with 0,1-vertices (1965) 108 perfect matchings in general graphs One proof is a more sophisticated version of the one in §9.2.3, where we may now have tight odd-set constraints; you can try it as a slightly challenging exercise, or see the proof in §9.5.1. Note that we seem to have exchanged one complication for another: while the polytope CgenPM ( G ) was unwieldy because it was defined as the convex hull of an exponential number of points, the new LP KgenPM ( G ) contains a exponential number of constraints (in contrast with the linear number of constraints needed for the bipartite case). In fact, a powerful recent theorem by Thomas Rothvoß (building on pioneering work by Mihalis Yannakakis) shows that any polytope whose vertices are the perfect matchings of the complete graph on n vertices must contain an exponential number of constraints. Elaborate? Given this negative result, this LP seems useless: we cannot even write it down explicitly in polynomial time! It turns out that despite this large size, it is possible to solve this LP in polynomial time. In a later lecture, we will see the Ellipsoid method to solve linear programs. 
This method can solve such a large LP, provided we give it a helper procedure called a “separation oracle”, which, given a point x ∈ R| E| , outputs Yes if x lies is within the desired polytope, and otherwise it outputs No and returns a violated constraint of this LP. It is easy to check if x satisfies the degree constraints, so the challenge here is to find an odd set S with ∑e∈∂(S) xe < 1, if there exists one. Such an algorithm can be indeed obtained using a sequence of mincut computations in the graph, as was shown by. Hopefully we will see this in a HW problem later in the course. T. Rothvoß (2014,2017) M. Yannakakis (1991) M.W. Padberg and M.R. Rao (1982) 9.5.1 Integrality of the Perfect Matching Polytope To prove integrality of KgenPM ( G ), we use a combination of the ideas we’ve aready used to argue about bipartite graphs. Fact 9.20. If x ∗ is an extreme point for the polytope Kdeg(G) , then any component in the support supp( x ∗ ) is either an edge, or has odd number of vertices. Proof. Consider the support supp( x ∗ ). Note that any vertex with unit degree is incident to an edge xe∗ = 1, and the other endpoint must also have unit degree. So all edges with value 1 are isolated edges. Also, we can ignore edges with value 0. So the rest of the edges are fractional and have value in the open interval (0, 1). The rest of the vertices of V—those not incident to the matching edges—have degree at least 2 in the support, and hence each other connected component of the support must have a cycle. Consider one such component H with vertices VH and edges EH (each of which has xe∗ ∈ (0, 1)). If |VH | is odd, we are done, so say The classical proof of this fact shows more, that the extreme points of Kdeg ( G ) have {0, 1/2, 1}-valued edges, and hence contain edges and oddlength cycles. That proof creates a bipartite graph by making two copies of each vertex, “lifts” the solution x ∗ to that graph, and then uses the integrality of the perfect matchings polytope for that bipartite graph. Our proof is more hands-on, and closer to what we saw earlier in Theorem 9.13. graph matchings iii: weighted matchings H has even size. If each vertex in H has an even degree, we can find an Euler tour of these edges (which uses these edges exactly once), and then apply the idea from Theorem 9.13 to this Euler tour to show that x ∗ is the convex combination of two other solutions x + /x − . The argument we used did not rely on the cycle being simple—just that it was of even length, which holds because H has even size, and that the solutions we get are different from x ∗ ! The rest of the proof just handles the case of non-Eulerian components of even size. Such a non-Eulerian graph H must contain vertices with an odd degree; but there must be an even number of them. Pair them up in any way you like; pick the pairs one by one, pick a path between them in H, and “duplicate” it—make one copy of each edge on the path. This increases the degree of each endpoint by 1 (thereby changing the parity), but does not change the parity of each other node. One important comment: pick some edge e′ on some cycle, and ensure that the duplicated path does not use this edge e′ by going the “other way around the cycle”. At the end this fixes the degrees of all vertices to be even (at the cost of duplicating edges in H). Again find an Euler tour and do the +ε/−ε trick on these. 
If an edge is duplicated, it may be used as an “odd” edge some pe number of times and an “even” edge the remaining ne times, so will get an offset of ε( pe − ne ) in one solution and −ε( pe − ne ) in the other. This again shows that x ∗ is not an extreme point. Now we use the basic feasible solution perspective of x ∗ : this means we have a “basis” containing some | E| linearly independent constraints of the linear program defining the polytope that are tight at x ∗ . Fix any such basis, and suppose S is the collection of sets for which the odd-set constraints are tight in this basis. There are two cases: 1. If S = ∅, then x is a basic feasible solution of Kdeg ( G ) as well. So by Fact 9.20 each non-edge component C of its support must be of odd size. But any such component C violates the corresponding odd-set constraint (because there is no edge in supp( x ∗ ) ∩ ∂C), giving a contradiction. So there are no such odd components, and x ∗ is just a perfect matching. 2. Else S ̸= ∅, and let S be a tight set in S . Now define two graphs: G1 := G/S obtained by contracting S to a vertex vS , and G2 := G/(V \ S) obtained by contracting V \ S to a vertex vS . Let x (1) , x (2) be obtained by restricting x ∗ to the edges of G1 , G2 respectively. We can check that each x (i) is a solution to KgenPM ( Gi ). 109 Why? The Handshake Lemma says the sum of degrees is twice the number of edges and hence is even. Since the edge e′ is used only once in the Euler tour, it is definitely increased or decreased in the two solutions, ensuring that x + /x − are not equal to x∗ . 110 integrality of polyhedra Moreover, both these polytopes are for smaller graphs, and have integer vertices by induction. Hence, any point within them can be written as a convex combination of their vertices. In particular, x (1) = ∑ α M ′ χ M ′ M′ where the sum is over perfect matchings in G1 ; similarly x (2) = ∑ β M′′ χ M′′ , M′′ where M′′ ranges over perfect matchings in G2 . Note that each of the matchings in the above sums contains exactly one of the edges in ∂S. Moreover, each edge e ∈ ∂S is contained in exactly xe∗ fraction of both these sums. So we can pair these matchings M′ , M′′ up and take their union to get a convex combination of matchings M in G. This shows that x ∗ can be written as a convex combination of perfect matchings in M, and hence x ∗ ∈ CgenPM ( G ), proving the claim. There are other proofs to show integrality of KgenPM ( G ), e.g., one using just the basic feasible solutions perspective; I will put links to them soon. 9.6 Integrality of Polyhedra We just saw several proofs that the bipartite perfect matching polytope has a compact linear program. Moreover, we claimed that the pefect matching polytope on general graphs has an explicit linear program that, while exponential-sized, can be solved in polynomial time. Such results allow us to solve the weighted bipartite matching problems using generic linear programming solvers (as long as they return vertex solutions). Having many different ways to view a problem gives us a deeper insight, and thereby come up with faster and better ways to solve it. Moreover, these different perspectives give us a handle into solving extensions of these problems. E.g., if we have a matching problem with two different kinds weights w1 and w2 on the edges: we want to find a matching x ∈ K PM ( G ) minimizing w1 · x, now subject to the additional constraint w2 · x ≤ B. 
While the problem is now NP-hard, this linear constraint can easily be added to the linear program to get a fractional optimal solution. Then we can reason about how to “round” this solution to get a near-optimal matching. We now show how two problems we considered earlier, namely minimum-cost arborescence and spanning trees, can be exactly modeled using linear programs. We then conclude with a pointer to a general theory of integral polyhedra. graph matchings iii: weighted matchings 9.6.1 Arborescences We already saw a linear program for the min-weight r-arborescence polytope in §2.3.2: since each node that is not the root r must have a path in the arborescence to the root, it is natural to say that for any subset of vertices S ⊆ V that does not contain the root, there must be an edge leaving it. Specifically, given the digraph G = (V, A), the polytope can be written as         ∑ x a ≥ 1 ∀S ⊂ Vs.t.r ̸∈ S  + (S) | A| a ∈ ∂ K Arb ( G ) = x ∈ R s.t. .     x ≥ 0  ∀a ∈ A a Here ∂+ (S) is the set of arcs that leave set S. The proof in §2.3.2 already showed that for each weight vector w ∈ R| A| , we can find an optimal solution to the linear program min{w · x | x ∈ K Arb ( G )}. 9.6.2 Minimum Spanning Trees One way to write an LP for minimum spanning trees is to reduce it to minimum-weight r-arborescences: indeed, replace each edge by two arcs in opposite directions, each having the same cost. Pick any node as the root r. Observe the natural bijection between rarborescence in this digraph and spanning trees in the original graph, having the same weight. But why go via arborescences? Why not directly model the fact that any tree has at least one undirected edge crossing each cut (S, V \ S), perhaps as follows:          ∑ xe ≥ 1 ∀S ⊂ V, S ̸= ∅, V | E| e ∈ ∂ ( S ) . KSTtry = x ∈ R s.t.      x ≥ 0 ∀e ∈ E e (The first constraint excludes the case where S is either empty or the entire vertex set.) Sadly, this does not precisely capture the spanning tree polytope: e.g., for the familiar cycle graph having three vertices, setting xe = 1/2 for all three edges satisfies all the constraints. If all edge weights are 1, this solution get a value of ∑e xe = 3/2, whereas any spanning tree on 3 vertices must have 2 edges. One can indeed write a different linear program that captures the spanning tree polytope, but it is a bit non-trivial:          ∑ xij ≤ |S| − 1    ij ∈ E:i,j ∈S   | E| KST ( G ) = x ∈ R s.t. ∑ xij = |V | − 1     ij∈ E         xij ≥ 0 ∀S ⊆ V, S ̸= ∅ ∀ij ∈ E              111 112 integrality of polyhedra Define the convex hull of all minimum spanning trees of G to be CH MST . Then, somewhat predictably, we will again find that CH MST = K MST . Both the polytopes for arborescences and spanning trees had exponentially many constraints. Again, we can solve these LPs if we are given separation oracles for them, i.e., procedures that take x and check if it is indeed feasible for the polytope. If it is not feasible, the oracle should output a violated linear inequality. We leave it as an exercise to construct separation oracles for the polytopes above. A different approach is to represent such a polytope K compactly ′ via an extended formulation: i.e., to define a polytope K ′ ∈ Rn+n using a polynomial number of linear contraints (on the original variables ′ x ∈ Rn and perhaps some new variables y ∈ Rn ) such that projecting K ′ down onto the original n-dimensions gives us K. I.e., we want that ′ K = { x ∈ Rn | ∃y ∈ Rn s.t. 
( x, y) ∈ K ′ }. The homework exercises will ask you to write such a compact extended formulation for the arborescence problem. 9.6.3 Integrality of Polyhedra and Total Unimodularity This section still needs work. We have seen that LPs are a powerful way of formulating problems like min-cost matchings, min-weight r-aborescences, and MSTs. We reasoned about the structure of the polytopes that underly the LPs, and we were able to show that these LPs do indeed solve their combinatorial problems. But notice that simply forming the LP is not sufficient–significant effort was expended to show that these polytopes do indeed have integer solutions at the vertices. Without this guarantee, we could get fractional solutions to these LPs that do not actually give us solutions to our problem. There is a substantial field of study concerned with proving the integrality of various LPs. We will briefly introduce a matrix property that implies the integrality of corresponding LPs. Recall that an LP can be written as [ A]m×n · ⃗x ≤ ⃗b, where A is a m × n matrix with each row corresponding to a constraint, ⃗x is a vector of n variables, and ⃗b ∈ Rm is a vector corresponding to the m scalars bi ∈ R in the constraint A(i) · ⃗x ≤ bi . Definition 9.21. A matrix [ A]m×n is called totally unimodular if every square submatrix B of A has the property that det( B) ∈ {0, ±1} We then have the following neat theorem, due to Alan Hoffman and Joe Kruskal: When we say “integer solutions”, we mean the solution vector is integer valued graph matchings iii: weighted matchings Theorem 9.22 (Hoffman and Kruskal Theorem). If the constraint matrix [ A]m×n is totally unimodular and the vector ⃗b is integral, i.e., ⃗b ∈ Zm , then the vertices of the polytope induced by the LP are integer valued. Moreover, if for some matrix A the polytope has integer vertices for all integer vectors b, then the matrix A is totally unimodular. Proof. (Sketch) This proof uses that solutions to linear systems can be obtained using Cramer’s rule. Thus, to show that the vertices are indeed integer valued, one need not go through producing combinatorial proofs, as we have. Instead, one could just check that the constraint matrix A is totally unimodular. Here’s a nice presentation by Marc Uetz about the relation between total unimodularity and graph matchings. 113 A.J. Hoffman and J.B. Kruskal (1956) Part II Interlude: Dimension Reduction 10 Concentration of Measure Consider the following questions: 1. You distribute n tasks among n machines, by sending each task to a machine uniformly and independently at random: while any machine has unit expected load, what is the maximum load (i.e., the maximum number of tasks assigned to any machine)? 2. You want to estimate the bias p of a coin by repeatedly flipping it and then taking the sample mean. How many samples suffice to be within ±ε of the correct answer p with confidence 1 − δ? 3. How many unit vectors can you choose in Rn that are almost orthonormal? I.e., they must satisfy | vi , v j | ≤ ε for all i ̸= j? 4. A n-dimensional hyercube has N = 2n nodes. Each node i ∈ [ N ] contains a packet pi , which is destined for node πi , where π is a permutation. The routing happens in rounds. At each round, each packet traverses at most one edge, and each edge can transmit at most one packet. Find a routing policy where each packet reaches its destination in O(n) rounds, regardless of the permutation π. 
All these questions can be answered by the same basic tool, which goes by the name of Chernoff bounds or concentration inequalities or tail inequalities or concentration of measure, or tens of other names. The basic question is simple: if we have a real-valued function f ( X1 , X2 , . . . , Xm ) of several independent random variables Xi , such that it is “not too sensitive to each coordinate”, how often does it deviate far from its mean? To make it more concrete, consider this— Given n independent random variables X1 , . . . , Xn , each bounded in the interval [0, 1], let Sn = ∑in=1 Xi . What is h i Pr Sn ̸∈ (1 ± ε) ESn ? This question will turn out to have relations to convex geometry, to online learning, to many other areas. But of greatest interest to 118 asymptotic analysis us, this question will solve many problems in algorithm analysis, including the above four. Let us see some basic results, and then give the answers to the four questions. 10.1 Asymptotic Analysis We will be concerned with non-asymptotic analysis, i.e., the qualitative behavior of sums (and other Lipschitz functions) of finite number of (bounded) independent random variables. Before we begin that, a few words about the asymptotic analysis, which concerns the convergence of averages of infinite sequences of random variables. Given a sequence of random variables { Xn } and another random variable Y, the following two notions of convergence can be defined. Definition 10.1 (Convergence in Probability). { Xn } converges in probability to Y if for every ε > 0 we have lim P(| Xn − Y | > ε) = 0 n→∞ (10.1) p This is denoted by Xn − → Y. Definition 10.2 (Convergence in Distribution). Let FX (.) denote the CDF of a random variable X. { Xn } converges in distribution to Y if lim FXn (t) = FY (t) n→∞ (10.2) for all points t where the distribution function FY is continuous. This d is denoted by Xn − → Y. There are many results known here, and we only mention the two well-known results below. The weak law of large numbers states that the average of independent and identically distributed (i.i.d.) random variables converges in probability to their mean. Theorem 10.3 (Weak law of large numbers). Let Sn denote the sum of n i.i.d. random variables, each with mean µ and variance σ2 < ∞, then p Sn/n − → µ. (10.3) The central limit theorem tells us about the distribution of the sum of a large collection of i.i.d. random variables. Let N (0, 1) denote the standard normal variable with mean 0 and variance 1, whose 2 probability density function is f ( x ) = √1 exp(− x2 ). 2π Theorem 10.4 (Central limit theorem). Let Sn denote the sum of n i.i.d. random variables, each with mean µ and variance σ2 < ∞, then Sn − nµ d √ − → N (0, 1). nσ (10.4) There are many powerful asymptotic results in the literature; see need to give references here. concentration of measure 10.2 Non-Asymptotic Convergence Bounds Our focus will be on the behavior of finite sequences of random variables. The central question here will be: what is the chance of deviating far from the mean? Given an r.v. X with mean µ, and some deviation λ > 0, the quantity Pr[ X ≥ µ + λ] is called the upper tail, and the analogous quantity Pr[ X ≤ µ − λ] is the lower tail. We are interested in bounding these tails for various values of λ. 10.2.1 Markov’s inequality Most of our results will stem from the most basic of all results: Markov’s inequality. 
This inequality qualitatively generalizes that idea that a random variable cannot always be above its mean, and gives a bound on the upper tail. Theorem 10.5 (Markov’s Inequality). Let X be a non negative random variable and λ > 0, then P( X ≥ λ ) ≤ E( X ) λ (10.5) With this in hand, we can start substituting various non-negative functions of random variables X to deduce interesting bounds. For instance, the next inequality looks at both the mean µ := EX and the variance σ2 := E[( X − µ)2 ] of a random variable, and bounds both the upper and lower tails. 10.2.2 Chebychev’s Inequality Theorem 10.6 (Chebychev’s inequality). For any random variable X with mean µ and variance σ2 , we have Pr[| X − µ| ≥ λ] ≤ σ2 . λ2 Proof. Using Markov’s inequality on the non-negative r.v. Y = ( X − µ)2 , we get Pr[Y ≥ λ2 ] ≤ E [Y ] . λ2 The proof follows from Pr[Y ≥ λ2 ] = Pr[| X − µ| ≥ λ]. 119 120 non-asymptotic convergence bounds 10.2.3 Examples: The First Bounds Example 1 (Coin Flips): Let X1 , X2 , . . . , Xn be i.i.d. Bernoulli random variables with Pr[ Xi = 0] = 1 − p and Pr[ Xi = 1] = p. (Im other words, these are the outcomes of independently flipping n coins, each with bias p.) Let Sn := ∑in Xi be the number of heads. Then Sn is distributed as a binomial random variable Bin(n, p), with E[Sn ] = np Var[Sn ] = np(1 − p). and Recall that linearity of expectations for r.v.s X, Y means E[ X + Y ] = E[ X ] + E[Y ]. For independent we have Var[ X + Y ] = Var[ X ] + Var[Y ]. Applying Markov’s inequality for the upper tail gives Pr[Sn − pn ≥ βn] ≤ 1 pn = . pn + βn 1 + ( β/p) So, for p = 1/2, this is 1+12β ≈ 1 − O( β) for small values of β > 0. However, Chebychev’s inequality gives a much tighter bound: Pr[|Sn − pn| ≥ βn] ≤ np(1 − p) p < 2 . β2 n2 β n In particular, this already says that the sample mean Sn /n lies in the p interval p ± β with probability at least 1 − β2 n . Equivalently, to get p p confidence 1 − δ, we just need to set δ ≥ β2 n , i.e., take n ≥ β2 δ . (We will see a better bound soon.) Concretely, to get within an additive 1% error of the correct bias p with probability 99.9%, set β = 0.01 and δ = 0.001, so taking n ≥ 107 · p samples suffices. Example 2 (Balls and Bins): Throw n balls uniformly at random and independently into n bins. Then for a fixed bin i, let Li denote the number of balls in it. Observe that Li is distributed as a Bin(n, 1/n) random variable. Markov’s inequality gives a bound on the probability that Li deviates from its mean 1 by λ ≫ 1 as   Pr Li ≥ 1 + λ ≤ 1 1 ≈ . 1+λ λ However, Chebychev’s inequality gives a much tighter bound as h i 1 (1 − 1/n) Pr | Li − 1| ≥ λ ≤ ≈ 2. λ2 λ √ So setting λ = 2 n says that the probability of any fixed bin having √ (1−1/n) more than 2 n + 1 balls is at most 4n . Now a union bound over (1−1/n) all bins i means that, with probability at least n · 4n ≤ 1/4, the √ load on every bin is at most 1 + 2 n. Example 3 (Random Walk): Suppose we start at the origin and at each step move a unit distance either left or right uniformly randomly and independently. We can then ask about the behaviour of the final position after n steps. Each step (Xi ) can be modelled as a Rademacher random variable with the following distribution. Doing this argument with Markov’s inequality would give a trivial upper bound of 1 + 2n on the load. This is useless, since there are at most n balls, so the load can never be more than n. A random sign is also called a Rademacher random variable, the name Bernoulli being already taken for a random bit in {0, 1}. 
concentration of measure Xi =  1  −1 w.p. 12 w.p. 12 The position after n steps is given by Sn = ∑in=1 Xi , with mean and variance being µ = 0 and σ2 = n respectively. Applying Chebyshev’s √ inequality on Sn with deviation λ = tσ = t n, we get h √ i 1 Pr Sn > t n ≤ 2 . (10.6) t We will soon see how to get a tighter tail bound. 10.2.4 Higher-Order Moment Inequalities All the bounds in the examples above can be improved by using higher-order moments of the random variables. The idea is to use the same recipe as in Chebychev’s inequality. Theorem 10.7 (2kth -Order Moment inequalities). Let k ∈ Z≥0 . For any random variable X having mean µ, and finite moments upto order 2k, we have E(( X − µ)2k ) . λ2k Proof. The proof is exactly the same: using Markov’s inequality on the non-negative r.v. Y := ( X − µ)2k , Pr[| X − µ| ≥ λ] ≤ E [Y ] . λ2k We can get stronger tail bounds for large values of k, however it becomes increasingly tedious to compute E(( X − µ)2k ) for the random variables of interest. Pr[| X − µ| ≥ λ] = Pr[Y ≥ λ2k ] ≤ Example 3 (Random Walk, continued): If we consider the fourth moment of Sn : i h n   E ( S n ) 4 = E ∑ Xi h i =1 = E ∑ Xi4 + 4 ∑ Xi3 X j + 6 ∑ Xi2 X 2j + 12 ∑ Xi2 X j Xk + 24 i   n = n+6 , 2 i< j i< j i < j 0 to be chosen carefully. Since this map is monotone, Pr[Sn ≥ µ + λ)] = Pr[etSn ≥ et(µ+λ) ] E[etSn ] (using Markov’s inequality) et(µ+λ) ∏ E[etXi ] = it(µ+λ) (using independence) e ≤ (10.11) Bernoulli random variables: Assume that all the Xi ∈ {0, 1}; we will remove this assumption later. Let the mean be µi = E[ Xi ], so the moment generating function can be explicitly computed as E[etXi ] = 1 + µi (et − 1) ≤ exp(µi (et − 1)). Substituting, we get ∏i E[etXi ] (10.12) et(µ+λ) ∏ exp(µi (et − 1)) ≤ i (10.13) et(µ+λ) exp(µ(et − 1)) ≤ (since µ = ∑ µi ) et(µ+λ) i Pr[Sn ≥ µ + λ] ≤ = exp(µ(et − 1) − t(µ + λ)). (10.14) Since this calculation holds for all positive t, and we want the tightest upper bound, we should minimize the expression (10.14). Setting the derivative w.r.t. t to zero gives t = ln(1 + λ/µ) which is non-negative for λ ≥ µ. Pr[Sn ≥ µ + λ] ≤ eλ . (1 + λ/µ)µ+λ (10.15) This bound on the upper tail is also one to be kept in mind; it often is useful when we are interested in large deviations where λ ≫ µ. One such example will be the load-balancing application with jobs and machines. 124 chernoff bounds, and hoeffding’s inequality If we define β := λ/µ as the deviation in multiples of the mean, this quantity is Pr[Sn ≥ µ + λ] ≤  eβ (1 + β )1+ β µ , (10.16) which is an expression that may be easy to deal with/memorize. And we can simplify even further: since β ≤ ln(1 + β) 1 + β/2 (10.17) for all β ≥ 0, so we get (10.17) (10.16) ≤ exp  − β2 µ 2+β  = exp  − λ2 2µ + λ  , where the last expression follows by algebraic manipulation. This proves the upper tail bound (10.8); a similar proof gives us the lower tail as well. Removing the assumption that Xi ∈ {0, 1}: If the r.v.s are not Bernoullis, then we define new Bernoulli r.v.s Yi ∼ Bernoulli(µi ), which take value 0 with probability 1 − µi , and value 1 with probability µi , so that E[ Xi ] = E[Yi ]. Note that f ( x ) = etx is convex for every value of t ≥ 0; hence the function ℓ( x ) = (1 − x ) · f (0) + x · f (1) satisfies f ( x ) ≤ ℓ( x ) for all x ∈ [0, 1]. Hence E[ f ( Xi )] ≤ E[ℓ( Xi )]; moreover ℓ( x ) is a linear function so E[ℓ( Xi )] = ℓ(E[ Xi ]) = E[ℓ(Yi )], since Xi and Yi have the same mean. Finally, ℓ(y) = f (y) for y ∈ {0, 1}. 
Putting all this together, E[etXi ] ≤ E[etYi ] = 1 + µi (et − 1) ≤ exp(µi (et − 1)), so the step from (10.12) to (10.13) goes through again. This completes the proof of Theorem 10.8. Since the proof has a few steps, let’s take stock of what we did: i. Apply Markov’s inequality on the function etX , ii. Use independence and linearity of expectations to break into etXi , iii. Reduce to the Bernoulli case Xi ∈ {0, 1}, iv. Compute the MGF (moment generating function) E[etXi ], v. Choose t to minimize the resulting bound, and vi. Use convexity to argue that Bernoullis are the “worst case”. You can get tail bounds for other functions of random variables by varying this template around; e.g., we will see an application for sums of independent normal (a.k.a. Gaussian) random variables in the next chapter. Do make sure you see why the bounds of Theorem 10.8 are impossible in general if we do not assume some kind of boundedness and independence. concentration of measure 10.3.2 125 The Generic Chernoff Bound Let’s consider the case where the r.v.s Xi are identically distributed. Suppose we start off the same, and get to (10.11). Now define the log-MGF of the underlying r.v. X to be ψ(t) := E[etX ]. (10.18) The expression (10.11) can be then written as  exp n ψ(t) − t(µ + λ) = exp(−n(t(µ+λ)/n − ψ(t))). The tightest upper bound is obtained when the expression tλ/n − ψ(t) is the largest. The Legendre-Fenchel dual of the function ψ(t) is defined as ψ∗ (λ) := sup {tλ − ψ(t)}, t ≥0 This is also called the convex conjugate. Since it is the max of a collection of linear functions, one for each t, the dual function ψ∗ is always convex, even if the original function ψ is not. so we get the following concise statement, which we call the generic Chernoff bound: Exercise: if ψ1 (t) ≥ ψ2 (t) for all t ≥ 0, then ψ1∗ (λ) ≤ ψ2∗ (λ) for all λ. Theorem 10.10 (Generic Chernoff Bound). Suppose Sn is the sum of n i.i.d. random variables, each having log-MGF ψ(t). Let µ := E[Sn ]. Then    ∗ µ+λ Pr[Sn ≥ µ + λ] ≤ exp − n · ψ . (10.19) n For the rest of the proof of the Chernoff bound, we can just focus on computing the dual ψ∗ (λ) of the log-MGF ψ(t). Let’s see some examples: 1. The first example is when X ∼ N (0, σ2 ), then E[etX ) ] = √ =e 1 2πσ t2 σ2 /2 Z x ∈R ·√ 1 2πσ 2 − x2 dx Z e etx e 2σ 2 2 x ∈R − ( x−tσ2 ) 2 2 dx = et σ /2 . 2σ (10.20) Hence, for X ∼ N (0, σ2 ) r.v.s, we have ψ(t) = t2 σ 2 2 ψ∗ (λ) = and λ2 , 2σ2 the latter by basic calculus. Now the generic Chernoff bound (10.19) for the sum of n normal N (0, σ2 ) variables says: 2 Pr[Sn ≥ λ] ≤ e − λ 2 2n σ . (10.21) This is even interesting when n = 1, in which case we get that for a N (0, σ2 ) random variable G, 2 Pr[ G ≥ λ] ≤ e − λ2 2σ . (10.22) In fact, you may have noticed that for Gaussians, the two statements (10.21) and (10.22) are equivalent, using the fact that the sum of n independent N (0, σ2 ) r.v.s is itself a N (0, nσ2 ) r.v.. 126 chernoff bounds, and hoeffding’s inequality 2. How about a Rademacher {−1, +1}-valued r.v. X? The MGF is E[etX ] = 2 et + e−t t2 t4 = cosh t = 1 + + + · · · ≤ et /2 , 2 2! 4! so ψ(t) = t2 2 and ψ∗ (λ) = λ2 . 2 Note that ∗ ψRademacher (t) ≤ ψN (0,1) (t) =⇒ ψRademacher (λ) ≥ ψ∗N (0,1) (λ). This means the upper tail bound for a single Rademacher is at least as strong as that for the standard normal. 3. And what about a centered Bernoulli with bias p? The log-MGF is ψ(t) := log E[etX ] = log((1 − p) + pet ), and a little calculus shows that the dual is ψ∗ (λ) = λ log λ 1−λ + (1 − λ) log . 
p 1− p Interestingly this function has a name: it is Kullback-Leibler divergence DKL (λ∥ p) between two Bernoulli distributions, one with bias λ and the other with bias p. In summary, if we write µ + λ = qn for some q > p, we have Pr[Sn ≥ qn] ≤ e−nDKL (q∥ p) . We can also extend the generic Chernoff bound to sums of nonidentical distributions using the AM-GM inequality: details here. 10.3.3 The Examples Again: New and Improved Bounds Example 1 (Coin Flips): Since each r.v. is a Bernoulli( p), the sum Sn = ∑i Xi has mean µ = np, and hence    Pr |Sn − np| ≥ βn ≤ exp −  β2 n  β2 n  ≤ exp − . 2p + β 2 (For the second inequality, we use that the interesting settings have 2 ln(1/δ) p + β ≤ 1.) Hence, if n ≥ , the empirical average Sn /n is β2 within an additve β of the bias p with probability at least 1 − δ. This has an exponentially better dependence on 1/δ than the bound we obtained from Chebychev’s inequality. This is asymptotically the correct answer: consider the problem where we have n coins, n − 1 of them having bias 1/2, and one having bias 1/2 + 2β. We want to find the higher-bias coin. One way is to es1 , using timate the bias of each coin to within β with confidence 1 − 2n The KL divergence DKL (q∥ p), also called the relative entropy, is a distance measure between two distributions. It is not symmetric, so be careful with the order of the arguments! We will see more of it when we discuss online learning and mirror descent. concentration of measure 127 the procedure above—which takes O(log n/ε2 ) flips per coin—and then take a union bound. It turns out any algorithm needs flips, so this the bound we have is tight. . Ω(n log n) ε2 Example 2 (Load Balancing): Since the load Li on any bin i behaves like Bin(n, 1/n), the expected load is 1. Now (10.8) says: Pr[ Li ≥ 1 + λ] ≤ exp   λ2 − . 2+λ If we set λ = Θ(log n), the probability of the load Li being larger than 1 + λ is at most 1/n2 . Now taking a union bound over all bins, the probability that any bin receives at least 1 + λ balls is at most n1 . I.e., the maximum load is O(log n) balls with high probability. In fact, the correct answer is that the maximum load is (1 + o (1)) lnlnlnnn with high probability. For example, the proofs in cite show this. Getting this precise bound requires a bit more work, but we can get an asymptotically correct bound by using (10.15) instead, with a C ln n setting of λ = ln ln n with a large constant C. Moreover, this shows that the asymmetry in the bounds (10.8) and (10.9) is essential. A first reaction would have been to believe our proof to be weak, and to hope for a better proof to get Pr[Sn ≥ (1 + β)µ] ≤ exp(− β2 µ/c) for some constant c > 0, for all values of β. This is not possible, p however, because it would imply a max-load of Θ( log n) with high probability. Example 3 (Random Walk): In this case, the variables are [−1, 1] valued, and hence we cannot apply the bounds from Theorem 10.8 directly. But define Yi = 1+2Xi to get Bernoulli(1/2) variables, and define Tn = ∑in=1 Yi . Since Tn = Sn /2 + n/2, √  √    Pr |Sn | ≥ t n = Pr | Tn − n/2| ≥ (t/2) n   (t2 /n) · (n/2) √ ≤ 2 exp − 2 + t/n using (10.8) ≤ 2 exp(−t2 /6). Recall from §10.2.5 that the tail bound of ≈ exp(−t2 /O(1)) is indeed in the right ballpark. 
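These gaps are easy to see numerically. The following sketch (assuming Python with NumPy; the parameters are arbitrary) simulates Example 1 and prints the empirical two-sided tail probability next to the Chebychev bound p(1 − p)/(β^2 n) from Theorem 10.6 and the bound exp(−λ^2/(2µ + λ)) from (10.8), doubled to cover both tails:

import numpy as np

rng = np.random.default_rng(0)
n, p, beta = 10_000, 0.5, 0.05
mu, lam = n * p, beta * n

# 100,000 independent runs of n coin flips each
S = rng.binomial(n, p, size=100_000)
empirical = np.mean(np.abs(S - mu) >= lam)

chebychev = p * (1 - p) / (beta**2 * n)            # Theorem 10.6
chernoff = 2 * np.exp(-lam**2 / (2 * mu + lam))    # (10.8), with a factor 2 for the two tails
print(f"empirical {empirical:.1e}   chebychev {chebychev:.1e}   chernoff {chernoff:.1e}")

With these numbers the Chebychev bound is 10^-2 and the Chernoff-style bound is about 10^-10, while the true tail is so small (the deviation is ten standard deviations) that none of the 10^5 simulated runs exceeds it.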
10.4 Other concentration bounds Many of the extensions address the various assumptions of Theorem 10.8: that the variables are bounded, that they are independent, The situation where λ ≤ µ is often called the Gaussian regime, since the bound on the upper tail behaves like exp(−λ2 /µ) = exp(− β2 µ), with β = λ/µ. In other cases, the upper tail bound behaves like exp(−λ), and is said to be the Poisson regime. In general, if Xi takes values in [ a, b], X −a we can define Yi := bi−a and then use Theorem 10.8. 128 other concentration bounds and that the function Sn is the sum of these r.v.s. Add details and refs to this section. But before we move on, let us give the bound that Sergei Bernstein gave in the 1920s: it uses knowledge about the variance of the random variable to get a potentially sharper bound than Theorem 10.8. Theorem 10.11 (Bernstein’s inequality). Consider n independent random variables X1 , . . . , Xn with | Xi − E[ Xi ]| ≤ 1 for each i. Let Sn := ∑i Xi have mean µ and variance σ2 . Then for any λ ≥ 0 we have   λ2 Pr[|Sn − µ| ≥ λ] ≤ 2 exp − 2 . 2σ + 2λ/3 10.4.1 Mildly Correlated Variables The only place we used independence in the proof of Theorem 10.8 was in (10.11). So if we have some set of r.v.s where this inequality holds even without independence, the proof can proceed unchanged. Indeed, one such case is when the r.v.s are negatively correlated. Loosely speaking, this means that if some variables are “high” then it makes more likely for the other variables to be “low”. Formally, X1 , . . . , Xn are negatively associated if for all disjoint sets A, B and for all monotone increasing functions f , g, we have E[ f ( Xi : i ∈ A) · g( X j : j ∈ B)] ≤ E[ f ( Xi : i ∈ A)] · E[ g( X j : j ∈ B)]. We can use this in the step (10.11), since the function etx is monotone increasing for t > 0. Negative association arises in many settings: say we want to choose a subset S of k items out of a universe of size n, and let Xi = 1i∈S be the indicator for whether the ith item is selected. The variables X1 , . . . , Xn are clearly not independent, but they are negatively associated. 10.4.2 Martingales A different and powerful set of results can be obtained when we stop considering random variables are not independent, but allow variables X j to take on values that depend on the past choices X1 , X2 , . . . , X j−1 but in a controlled way. One powerful formalization is the notion of a martingale. A martingale difference sequence is a sequence of r.v.s Y1 , Y2 , . . . , Yn , such that E[Yi | Y1 , . . . , Yi−1 ] = 0 for each i. (This is true for mean-zero independent r.v.s, but may be true in other settings too.) Theorem 10.12 (Hoeffding-Azuma inequality). Let Y1 , Y2 , . . . , Yn be a martingale difference sequence with |Yi | ≤ ci for each i, for constants ci . concentration of measure Then for any t ≥ 0, Pr " n # ∑ Yi ≥ λ ≤ 2 exp i =1 − λ2 2 ∑in=1 c2i ! . For instance, applying the Azuma-Hoeffding bounds to the random walk in Example 3, where each Yi is a Rademacher r.v. gives √ 2 Pr[|Sn | ≥ t n] ≤ 2e−t /8 , which is very similar to the bounds we derived above. But we can also consider, e.g., a “bounded” random walk that starts at the origin, say, and stops whenever it reaches either −ℓ or +r. In this case, the step size Yi = 0 with unit probability 1 if ∑ij− =1 Yj ∈ {−ℓ, r }, else it is {±1} independently and uniformly at random. 10.4.3 Going Beyond Sums of Random Variables The Azuma-Hoeffding inequality can be used to bound functions of X1 , . . . 
, Xn other than their sum—and there are many other bounds for more general classes of functions. In all these cases we want any single variable to affect the function only in a limited way—i.e., the function should be Lipschitz. One popular packaging was given by Colin McDiarmid: Theorem 10.13 (McDiarmid’s inequality). Consider n independent r.v.s X1 , . . . , Xn , with Xi taking values in a set Ai for each i, and a function f : ∏ Ai → R satisfying | f ( x ) − f ( x ′ )| ≤ ci whenever x and x ′ differ only in the ith coordinate. Let µ := E[ f ( X1 , . . . , Xn )] be the expected value of the random variable f ( X ). Then for any non-negative β, Upper tail : 2µ2 β2 Pr[ f ( X ) ≥ µ(1 + β)] ≤ exp − ∑i c2i Lower tail : 2µ2 β2 Pr[ f ( X ) ≤ µ(1 − β)] ≤ exp − ∑i c2i ! ! This inequality does not assume very much about the function, except it being ci -Lipschitz in the ith coordinate; hence we can also use this to the truncated random walk example above, or for many other applications. 10.4.4 Moment Bounds vs. Chernoff-style Bounds One may ask how moment bounds relate to Chernoff-Hoeffding bounds: Philips and Nelson showed that Chernoff-style bounds obtained using the approach of bounding the moment-generating function are never stronger than moment bounds: T.K. Philips and R. Nelson (1995) 129 130 application #1: oblivious routing on the hypercube Theorem 10.14. Consider n independent random variables X1 , . . . , Xn , each with mean 0. Let Sn = ∑ Xi . Then E[ X k ] E[etX ] ≤ inf . t≥0 etλ k ≥0 λk Pr[Sn ≥ λ] ≤ min 10.4.5 Matrix-Valued Random Variables Finally, an important line of research considers concentration for vector-valued and matrix valued functions of independent (and mildly dependent) r.v.s. One object that we will see in a homework, and also in later applications, is the matrix-valued case: here the notation A ⪰ 0 means the matrix is positive-semidefinite (i.e., all its eigenvalues are non-negative), and A ⪰ B means A − B ⪰ 0. See, e.g., the lecture notes by Joel Tropp! Theorem 10.15 (Matrix Chernoff bounds). Consider n independent symmetric matrices X1 , . . . , Xn of dimension d. Moreover, I ⪰ Xi ⪰ 0 for each i, i.e., the eigenvalues of each matrix are between 0 and 1. If µmax := λmax (∑ E[ Xi ]) is the largest eigenvalue of their expected sum, then  Pr λmax    γ2 ∑ Xi ≥ µmax + γ ≤ d exp − 2µmax + γ  . As an example, if we are throwing n balls into n bins, then we can let matrix Xi have a single 1 at position ( j, j) if the ith ball falls into bin j, and zeros elsewhere. Now the sum of these matrices has the loads of the bins on the diagonal, and the maximum eigenvalue is precisely the highest load. This bound therefore gives that the 2 probability of a bin with load 1 + γ is at most n · eγ /(2+γ) —again implying a maximum load of O(log n) with high probability. But we can use this for a lot more than just diagonal matrices (which can be reasoned about using the scalar-valued Chernoff bounds, plus the naïve union bound). Indeed, we can sample edges of a graph at random, and then talk about the eigenvalues of the resulting adjacency matrix (or more interestingly, of the resulting Laplacian matrix) using these bounds. We will discuss this in a later chapter. 10.5 Application #1: Oblivious Routing on the Hypercube Now we return to fourth application mentioned at the beginning of the chapter. (The first two applications have already been considered above, the third will be covered as a homework problem.) 
The setting is the following: we are given the d-dimensional hypercube Qd , with n = 2d vertices. We have n = 2d vertices, each labeled concentration of measure 131 with a d-bit vector. Each vertex i has a single packet (which we also call packet i), destined for vertex π (i ), where π is a permutation on the nodes [n]. Packets move in synchronous rounds. Each edge is bi-directed, and at most one packet can cross each directed edge in each round. Moreover, each packet can cross at most one edge per round. So if uv ∈ E( Qd ), one packet can cross from u to v, and one from v to u, in a round. Each edge e has an associated waiting queue We ; so each node has d queues, one for each edge leaving it. If several packets want to cross an edge e in the same round, only one can cross; the rest wait in the queue We and try again the next round. We assume the queues are allowed to grow to arbitrary size (though one can also show queue length bounds in the algorithm below). The goal is to get a simple routing scheme that delivers the packets in O(d) rounds, no matter what permutation π needs to be routed. One natural proposal is the bit-fixing routing scheme: each packet i looks at its current position u, finds the first bit position where u differs from π (i ), and flips the bit (which corresponds to traversing an edge out of u). For example: 0001010 → 1001010 → 1101010 → 1100010 → 1100011. However, this proposal can create “congestion hotspots” in the network, and therefore delay some packets by 2Ω(d) . In fact, it turns out any deterministic oblivious strategy (that does not depend p on the actual sources and destinations) must have a delay of Ω( 2d /d) rounds. 10.5.1 A Randomized Algorithm. . . Here’s a lovely randomized strategy, due to Les Valiant, and to Valiant and Brebner. It requires no centralized control, and is optimal in the sense of requiring O(d) rounds (with high probability) on any permutation π. Each node i picks a randomized midpoint Ri independently and uniformly from [n]: it sends its packet to Ri . Then after 5d rounds have elapsed, the packets proceed to their final destinations π (i ). All routing is done using bit-fixing. 10.5.2 . . . and its Analysis Theorem 10.16. The random midpoint algorithm above succeeds in delivering the packets in at most 10d rounds, with probability at least 1 − n2 . Proof. We only prove that all packets reach their midpoints by time 5d, with high probability. The argument for the second phase is then Suppose we choose a permutation π such that π (w00 ) = 0 w, where w, 0 ∈ {0, 1}d/2 . All these 2d/2 packets have to pass through the allzeros node in the bit-fixing routing scheme; since this node can send out at most d packets at each timestep, need at least 2d/2 /d rounds. Valiant (1982) 132 application #1: oblivious routing on the hypercube identical. Let Pi be the bit-fixing path from i to the midpoint Ri , and define Si := { j ̸= i | path Pj shares an edge with Pi }. Claim 10.17. Any two paths Pi and Pj intersect in one contiguous segment. Proof. (Exercise.) This is where using a consistent routing strategy like bit-fixing helps. Claim 10.18. Packet i reaches midpoint Ri by time at most | Pi | + |Si |. Proof. Consider the path Pi = ⟨e1 , e2 , . . . , eℓ ⟩ taken by packet i. If Si were empty, clearly packet i would reach its destination in time | Pi |; we now show how to charge each timestep that packet i is delayed to a distinct packet in Si . For that, we first define the notion of lag. 
For any edge ek ∈ Pi , we say every packet in Wek at the beginning of timestep t has lag t − k. Note that all packets in the same queue at the same time have the same lag. Now: 1. Each packet j in Si ∪ {i } either reaches its destination on Pi or it leaves Pi (forever, by Claim 10.17) after traversing some last edge ek ∈ Pi . Call this traversal of ek the final traversal for packet j, and call its lag value just before this final traversal its final lag. 2. Suppose packet i traverses the last edge eℓ on its path and reaches its destination at timestep T. Since it has lag T − ℓ = T − | Pi | just before it traverses the edge, it reaches the destination at time | Pi | plus its final lag. So it suffices to show that i’s final lag is at most | Si | . 3. The initial lag (at time t = 1) of this packet i is (1 − 1) = 0, since it belongs to queue We1 at the very beginning. The lag of this packet never decreases over time as it makes its way along the path. Indeed, if it is in Wek at the beginning of some timestep t, and it traverses the edge, it now belongs to wek+1 at the start of timestep t + 1, and its new lag is (t + 1) − (k + 1) = t − k and therefore unchanged. 4. Else suppose packet i’s lag increases from some value L to L + 1 at some timestep. This is because i ∈ Wek for some k at the beginning of time t = L + k, but some other packet j ∈ Si from queue Wek was sent across the edge ek at this timestep. In this case, imagine packet i gives packet j a token numbered L. So there is a single token generated for each increase in i’s lag, each with a different number. Observe: the lags are defined for packets in Si according to the numbering of edges in Pi , not the numbering of their own paths. concentration of measure 5. We show (in the next bullet point) how to maintain the invariant that at the beginning of each time, any token numbered L still on the path Pi is carried by some packet in Si with current lag L. This implies that when a packet in Si makes its final traversal and it has some final lag L′ , it is either carrying a single token numbered L′ at that time or no token at all. Since each token is carried by some packet, this means there can be at most |Si | tokens overall, and hence i’s final lag value is at most |Si |. 6. To ensure the invariant, note that when j got the token numbered L from i, packet j had lag value L. Now as long as j does not get delayed as it proceeds along the path, its lag remains L (and it keeps the token). If it does get delayed, say while waiting in queue Wek′ while some other packet j′ (having the same lag value L, because they were sharing the same queue) traverses the edge ek′ , packet j gives its token numbered L to this j′ . This maintains the invariant. Finally, we bound the size of Si by a concentration bound. Since Ri is chosen uniformly at random from {0, 1}d , the labels of i and Ri differ in d/2 bits in expectation. Hence Pi has expected length d/2. There are d2d = dn (directed) edges, and all n = 2d paths behave symmetrically, so the expected number of paths Pj using any edge e d/2 is n·dn = 1/2. Claim 10.19. Pr[|Si | ≥ 4d] ≤ e−2d . Proof. If Xij is the indicator of the event that Pi and Pj intersect, then |Si | = ∑ j̸=i Xij , i.e., it is a sum of a collection of independent {0, 1}-valued random variables. Now conditioned on any choice of Pi (which is of length at most d), the expected number of paths using each edge in it is at most 1/2, so the conditional expectation of Si is at most d/2. 
Since this holds for any choice of Pi , the unconditional expectation µ = E[Si ] is also at most d/2. Now apply the Chernoff bound to Si with λ = 4d − µ and µ ≤ d/2 to get   (4d − µ)2 Pr[|Si | ≥ 4d] ≤ exp − ≤ e−2d . 2µ + (4d − µ) Note that we could apply the bound even though the variables Xij were not i.i.d., and moreover we did not need estimates for E[ Xij ], just an upper bound for their expected sum. Now applying a union bound over all n = 2d packets i means that all n packets reach their midpoints within d + 4d steps with probability 1 − 2d · e−2d ≥ 1 − e−d ≥ 1 − 1/n. Similarly, the second 133 134 application #2: graph sparsification phase has a probability at most 1/n of failing to complete in 5d steps, completing the proof. A different strategy would be to let each packet pick a random permutation and fix the bits according to that permutation. Sadly, this approach gives delay 2Ω(d) . This is true even if each node picks its permutation independently. One bad example appears in Valiant’s original paper (see Section 5 “The Necessity for Phase A”) and shows that you can fix a permutation that “gangs up” on some node, even if the bit-fixing order is random. 10.6 Application #2: Graph Sparsification 10.7 Application #3: The Power of Two Choices 11 Dimension Reduction and the JL Lemma For a set of n points { x1 , x2 , . . . , xn } in RD , can we map them into some lower dimensional space Rk and still maintain the Euclidean distances between them? We can always take k ≤ n − 1, since any set of n points lies on a n − 1-dimensional subspace. And this is (existentially) tight, e.g., if x2 − x1 , x3 − x1 , . . . , xn − x1 are all orthogonal vectors. But what if we were fine with distances being approximately preserved? There can only be k orthogonal unit vectors in Rk , but there are as many as exp(cε2 k) unit vectors which are ε-orthogonal—i.e., whose mutual inner products all lie in [−ε, ε]. Near-orthogonality allows us to pack exponentially more vectors! (Indeed, we will see this in a homework exercise.) This near-orthogonality of the unit vectors means that distances are also approximately preserved. Indeed, for any two a, b ∈ Rk , ∥ a − b∥22 = ⟨ a − b, a − b⟩ = ⟨ a, a⟩ + ⟨b, b⟩ − 2⟨ a, b⟩ = ∥ a∥22 + ∥b∥22 − 2⟨ a, b⟩, so the squared Euclidean distance between any pair of the points defined by these ε-orthogonal vectors falls in the range 2(1 ± ε). So, if we wanted n points at exactly the same (Euclidean) distance from each other, we would need n − 1 dimensions. (Think of a triangle in 2-dims.) But if we wanted to pack in n points which were at distance (1 ± ε) from each other, we could pack them into   log n k = O ε2 dimensions. 11.1 The Johnson Lindenstrauss lemma The Johnson Lindenstrauss “flattening” lemma says that such a claim is true not just for equidistant points, but for any set of n points in Euclidean space: Having n ≥ exp(cε2 k ) vectors in d dimensions means the dimension is k = O(log n/ε2 ). 136 the construction Lemma 11.1. Let ε ∈ (0, 1/2). Given any set of points  X=  { x1 , x2 , . . . , x n } in RD , there exists a map A : RD → Rk with k = O 1−ε ≤ log n ε2 such that ∥ A( xi ) − A( x j )∥22 ≤ 1 + ε. ∥ xi − x j ∥22 Moreover, such a map can be computed in expected poly(n, D, 1/ε) time. Note that the target dimension k is independent of the original dimension D, and depends only on the number of points n and the accuracy parameter ε. It is not difficult to show that we need at least Ω(log n) dimensions in such a result, using a packing argument. 
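To get a rough sense of scale: for n = 10^6 points and ε = 0.1, the construction below gives (via the calculation in §11.4, whose constant is not optimized) k = (8/ε^2) ln(2/δ) with δ = 1/n^2, i.e., roughly 800 · ln(2 · 10^12) ≈ 23,000 target dimensions — the same figure whether the original dimension D is a thousand or a billion.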
Noga Alon showed log n  a lower bound of Ω ε2 log 1/ε , and then Kasper Green Larson and log n  Jelani Nelson showed a tight and matching lower bound of Ω ε2 dimensions for any dimensionality reduction scheme from n dimensions that preserves pairwise distances. The JL Lemma was first considered in the area of metric embeddings, for applications like fast near-neighbor searching; today we use it to speed up algorithms for problems like spectral sparsification of graphs, and solving linear programs fast. 11.2 The Construction The JL lemma is pretty surprising, but the construction of the map is perhaps even more surprising: it is a super-simple randomized construction. Let M be a k × D matrix, such that every entry of M is filled with an i.i.d. draw from a standard normal N (0, 1) distribution (a.k.a. the “Gaussian” distribution). For x ∈ RD , define 1 A( x ) = √ Mx. k That’s it. You hit the vector x with a Gaussian matrix M, and scale it √ down by k. That’s the map A. Since A( x ) is a linear map and satisfies αA( x ) + βA(y) = A(αx + βy), it is enough to show the following lemma: Lemma 11.2. [Distributional Johnson-Lindenstrauss] Let ε ∈ (0, 1/2). If A is constructed as above with k = cε−2 log δ−1 , and x ∈ RD is a unit vector, then Pr[ ∥ A( x )∥22 ∈ 1 ± ε] ≥ 1 − δ. To prove Lemma 11.1, set δ = 1/n2 , and hence k = O(ε−2 log n). Now for each xi , x j ∈ X, use linearity of A(·) to infer A ( xi ) − A ( x j ) xi − x j 2 2 = A ( xi − x j ) xi − x j 2 2 = A(vij ) 2 ∈ (1 ± ε ) Given n points with Euclidean distances ε in (1 ± ε), the balls of radius 1− 2 around these points must be mutually disjoint, by the minimum distance, and they are contained within a ball ε of radius (1 + ε) + 1− 2 around x0 . Since volumes of balls in Rk of radius r behave like ck r k , we have  1 − ε k  3 + ε k n · ck ≤ ck 2 2 or k ≥ Ω(log n) for ε ≤ 1/2. Alon (2003) Larson and Nelson (2017) dimension reduction and the jl lemma 137 with probability at least 1 − 1/n2 , where vij is the unit vector in the direction of xi − x j . By a union bound, all (n2 ) pairs of distances in ( X2 ) are maintained with probability at least 1 − (n2 ) n12 ≥ 1/2. A few comments about this construction: • The above proof shows not only the existence of a good map, we also get that a random map as above works with constant probability! In other words, a Monte-Carlo randomized algorithm for dimension reduction. (Since we can efficiently check that the distances are preserved to within the prescribed bounds, we can convert this into a Las Vegas algorithm.) Or we can also get deterministic algorithms: see here. • The algorithm (at least the Monte Carlo version) is data-oblivious: it does not even look at the set of points X: it works for any set X with high probability. Hence, we can pick this map A before the points in X arrive. 11.3 Intuition for the Distributional JL Lemma Let us recall some basic facts about Gaussian distributions. The probability density function for the Gaussian N (µ, σ2 ) is ( x − µ )2 f ( x ) = √ 1 e 2σ2 . 2πσ We also use the following; the proof just needs some elbow grease. Proposition 11.3. If G1 ∼ N (µ1 , σ12 ) and G2 ∼ N (µ2 , σ22 ) are independent, then for c ∈ R, c G1 ∼ N (cµ1 , c2 σ12 ) (11.1) G1 + G2 ∼ N (µ1 + µ2 , σ12 + σ22 ). (11.2) Now, here’s the main idea in the proof of Lemma 11.2. Imagine that the vector x is the elementary unit vector e1 = (1, 0, . . . , 0). Then M e1 is just the first column of M, which is a vector with independent and identical Gaussian values. 
     G1,D 1 G1,1     G2,D  0  G2,1      ..   ..  =  ..  . .  .  .  · · · Gk,D 0 Gk,1 √ A( x ) is a scaling-down of this vector by k: every entry in this random vector A( x ) = A(e1 ) is distributed as G1,1   G2,1 M e1 =   ..  . Gk,1 G1,2 G2,2 .. . Gk,2 ··· ··· .. . √ 1/ k · N (0, 1) = N (0, 1/k ) (by (11.1)). The fact that the means and the variances take on the claimed values should not be surprising; this is true for all r.v.s. The surprising part is that the resulting variables are also Gaussians. 138 a direct proof of Lemma 11.2 Thus, the expected squared length of A( x ) = A(e1 ) is " # h i h i k k k 1 2 2 E ∥ A( x )∥ = E ∑ A( x )i = ∑ E A( x )2i = ∑ = 1. k i =1 i =1 i =1 So the expectation of ∥ A( x )∥2 is 1; the heart is in the right place! Now to show that ∥ A( x )∥2 does not deviate too much from the mean—i.e., to show a concentration result. Indeed, ∥ A( x )∥2 is a sum of independent N (0, 1/k )2 random variables, so if these N (0, 1/k)2 variables were bounded, we would be done by the Chernoff bounds of the previous chapter. Sadly, they are not. However, their tails are fairly “thin”, so if we squint hard enough, these random variables can be viewed as “pretty much bounded”, and the Chernoff bounds can be used. Of course this is very vague and imprecise. Indeed, the Laplace distribution with density function f ( x ) ∝ e−λ| x| for x ∈ R also has pretty thin tails—“exponential tails”. But using a matrix with Laplace entries does not work the same, no matter how hard we squint. It turns out you need the entries of M, the matrix used to define A( x ), to have “sub-Gaussian tails”. The Gaussian entries have precisely this property. We now make all this precise, and also remove the assumption that the vector x = e1 . In fact, we do this in two ways. 1. First we give a proof via a direct calculation: it has several steps, but each step is elementary, and you are mostly following your nose. 2. The second proof uses the notion of sub-Gaussian random variables from , and builds some general machinery for concentration bounds. 11.4 A Direct Proof of Lemma 11.2 Recall that we want to argue about the squared length of A( x ) ∈ Rk , where A( x ) = √1 Mx, and x is a unit vector. First, let’s understand k what the expected length of A( x ) is, and then we will show concentration about the mean. Lemma 11.4. Suppose the entries of M are independent random variables, with mean zero and unit variance. Then for unit vector x ∈ RD , E[∥ A( x )∥2 ] = ∥ x ∥2 . Proof. Each entry of the vector Mx is the inner product of x with a vector with independent zero mean and unit variance random If G has mean µ and variance σ2 , then E[ G2 ] = Var[ G ] + E[ G ]2 = σ2 + µ2 . dimension reduction and the jl lemma 139 variables, and so is itself a random variable with zero mean and variance ∑i xi2 = 1. This means that for any entry i ∈ [k], E[( Mx )2i ] = Var( Mx ) + E[( Mx )i ]2 = 1. Now E[∥ A( x )∥2 ] = 1k ∑ik=1 E[( Mx )2i ] = 1 = ∥ x ∥2 . Observe that did not use the fact that the matrix entries were Gaussians. We will use it for the concentration bound, which we show next. 11.4.1 Concentration about the Mean Using that each entry of M is an independent N (0, 1) r.v., we can use Proposition 11.3 to infer that ( Mx )i ∼ N (0, x12 + x22 + . . . + x2D ) = N (0, 1). So, each of the k coordinates of Mx behaves just like an independent Gaussian! 
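Lemma 11.2 is also easy to check numerically before we prove it. The sketch below (assuming Python with NumPy; the choice k = (8/ε^2) ln(2/δ) matches the constant that comes out of the calculation in this section, and is not optimized) applies the map A(x) = Mx/√k of §11.2 to a set of random points and checks that all pairwise squared distances are preserved up to 1 ± ε:

import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, D, eps = 100, 5_000, 0.25
delta = 1.0 / n**2
k = int(np.ceil(8 / eps**2 * np.log(2 / delta)))   # about 1300 here

X = rng.standard_normal((n, D))                    # n arbitrary points in R^D
M = rng.standard_normal((k, D))                    # the matrix M with i.i.d. N(0,1) entries
AX = X @ M.T / np.sqrt(k)                          # A(x) = Mx / sqrt(k), applied to every point

ratios = [np.sum((AX[i] - AX[j]) ** 2) / np.sum((X[i] - X[j]) ** 2)
          for i, j in combinations(range(n), 2)]
print(min(ratios), max(ratios))                    # typically both inside [1 - eps, 1 + eps]

By the union bound in the proof of Lemma 11.1 this only succeeds with probability at least 1/2, so an occasional run may have a pair slightly outside the window; checking and redrawing M (the Las Vegas version mentioned above) fixes this.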
For brevity, define k 1 ( Mx )2i , k i =1 Z := ∥ A(z)∥2 = ∑ so Z is the average of the squares of a collection of k independent N (0, 1) r.v.s. Next we show that Z does not deviate too much from 1. Since Z is the sum of a bunch of independent and identical random variables, let’s start down the usual path for a Chernoff bound, for the upper tail, say: Pr[ Z ≥ 1 + ε] ≤ Pr[etkZ ≥ etk(1+ε) ] ≤ E[etkZ ]/etk(1+ε)   2 = ∏ E[etG ]/et(1+ε) (11.3) (11.4) i 2 for every t > 0, where G ∼ N (0, 1). Now E[etG ], the momentgenerating function for G2 is easy to calculate for t < 1/2: 1 √ 2π Z 2 2 1 etg e− g /2 dg = √ g ∈R 2π Z 2 z ∈R e−z /2 √ dz 1 = √ . 1 − 2t 1 − 2t (11.5) Plugging back into (11.4), the bound on the upper tail shows that for all t ∈ (0, 1/2), Pr[ Z ≥ (1 + ε)] ≤  1 √ et(1+ε) 1 − 2t k . The easy way out is to observe that the squares of Gaussians are chisquared r.v.s, the sum of k of them is χ2 with k degrees of freedom, and the internet conveniently has tail bounds for these things. But even if you don’t recall these facts, and don’t have internet connectivity and cannot check Wikipedia, it is not that difficult to prove from scratch. 140 introducing subgaussian random variables Let’s just focus on part of this expression:     1 1 √ = exp −t − log(1 − 2t)) 2 et 1 − 2t   2 = exp (2t) /4 + (2t)3 /6 + · · ·   ≤ exp t2 (1 + 2t + 2t2 + · · · ) (11.6) (11.7) (11.8) = exp(t2 /(1 − 2t)). Plugging this back, we get Pr[ Z ≥ (1 + ε)] ≤  1 √ et(1+ε) 1 − 2t k 2 ≤ exp(kt2 /(1 − 2t) − ktε) ≤ e−kε /8 , if we set t = ε/4 and use the fact that 1 − 2t ≥ 1/2 for ε ≤ 1/2. (Note: this setting of t also satisfies t ∈ (0, 1/2), which we needed from our previous calculations.) Almost done: let’s take stock of the situation. We observed that ∥ A( x )∥22 was distributed like an average of squares of Gaussians, and by a Chernoff-like calculation we proved that Pr[∥ A( x )∥22 > 1 + ε] ≤ exp(−kε2 /8) ≤ δ/2 for k = ε82 ln 2δ . A similar calculation bounds the lower tail, and finishes the proof of Lemma 11.2. The JL Lemma was first proved by Bill Johnson and Joram Lindenstrauss. There have been several proofs after theirs, usually trying to tighten their results, or simplify the algorithm/proof (see citations in some of the newer papers): the proof above is some combinations of those by Piotr Indyk and Rajeev Motwani, and Sanjoy Dasgupta and myself. 11.5 Introducing Subgaussian Random Variables It turns out that the proof of Lemma 11.2 is a bit cleaner (with fewer calculations) if we use the abstraction provided by the generic Chernoff bound from last lecture, and the notion of subGaussian random variables which we introduce next. This abstraction will also allow us to extend the result to JL matrices having i.i.d. entries from other distributions, e.g., where each Mij ∈ R {−1, +1}. 11.5.1 Subgaussian Random Variables Recall the definitions of the log-MGF ψ(t) and its Legendre-Fenchel dual ψ∗ (λ) from §10.3.2. Johnson and Lindenstrauss (1982) Indyk and Motwani (1998) Dasgupta and Gupta (2004) dimension reduction and the jl lemma Definition 11.5. A random variable V with mean 0 is subgaussian with parameter σ if its log-MGF ψ(t) satisfies ψ(t) ≤ σ 2 t2 . 2 for all t ≥ 0. It is subgaussian with parameter σ up to t0 if the above inequality holds for all |t| ≤ t0 . In other words, the log-MGF of a subgaussian r.v. is bounded above by that of a Gaussian! At this point, it’s useful to recall a fact we asked as an exercise in §10.3.2: Fact 11.6. If ψ1 (t) ≥ ψ2 (t) for all t ≥ 0, then ψ1∗ (λ) ≤ ψ2∗ (λ) for all λ. 
Using this, the dual function of a subgaussian random variable with parameter σ is bounded below by that of a Gaussian N (0, σ2 ), which means we have a tighter upper tail bound! Indeed, combining with (10.22), we immediately get: Theorem 11.7 (Subgaussian Tail Bounds). If V is zero-mean and subgaussian with parameter σ, then 2 2 Pr[V ≥ λ] ≤ e−λ /(2σ ) . Most tail bounds you will prove using the subgaussian perspective will come down to showing that some random variable is subgaussian with parameter σ, whereupon you can use Theorem 11.7. Given that you will often reason about sums of subgaussians, you may use the next fact, which is an analog of Proposition 11.3. Lemma 11.8. If V1 , V2 , . . . are independent, q zero-mean and σi -subgaussian, and x1 , x2 , . . . are reals, then V = ∑i xi Vi is ∑i xi2 σi2 -subgaussian. Proof. Using independence and the definition of subgaussian-ness: E[etV ] = E[et ∑i xi Vi ] = ∏ E[etxi Vi ] ≤ ∏ e(txi ) σi /2 . 2 2 i i Finally taking logarithms, ψV (t) = ∑i ψVi (txi ) ≤ ∑i 11.6 t2 xi2 σi2 2 . A Proof of Lemma 11.2 using Subgaussian r.v.s Suppose we choose each Mij to be an independent copy of a subgaussian r.v. with zero mean and unit variance, and let A( x ) = √1 Mx k again? We want to show that Z := ∥ A( x )∥2 = 1 k ( Mx )2i k i∑ =1 (11.9) 141 142 a proof of Lemma 11.2 using subgaussian r.v.s has mean ∥ x ∥2 , and is concentrated sharply around that value. Conveniently, we had only used the mean and variance of the entries of M in proving Lemma 11.4, so we can still infer that E[ Z ] = E[∥ A( x )∥2 = ∥ x ∥ = 1. It just remains to show the concentration. 11.6.1 Sums of Squares of Subgaussians To add in. Until then see the explanaton in Matousek’s paper “On Variants of the Johnson-Lindenstrauss Lemma”. 11.6.2 Relating Subgaussian to Gaussians If you have done the proof for the Gaussian case, and just want to extend the JL Lemma to other subgaussian random variables, you need not do all the work in §11.6.1. Instead you can relate subgaussian concentration to good old Gaussian concentration. Indeed, the direct proof from §11.4 showed the ( Mx )i s were themselves Gaussian with variance ∥ x ∥2 . Since the Rademachers are 1subgaussian, Lemma 11.8 shows that ( Mx )i is subgaussian with parameter ∥ x ∥2 . Next, we need to consider Z, which is the average of squares of k independent ( Mx )i s. The following lemma shows that the MGF of squares of symmetric σ-subgaussians are bounded above by the corresponding Gaussians with variance σ2 . Lemma 11.9. If V is symmetric mean-zero σ-subgaussian r.v., and W ∼ 2 2 N (0, σ2 ), then E[etV ] ≤ E[etW ] for t > 0. Proof. Using the calculation in (10.20) in the “backwards” direction √ 2 EV [etV ] = EV,W [e 2t(V/σ) W ]. (Note that we’ve just introduced W into the mix, without any provocation!) Hence, rewriting √ √ EV,W [e 2t(V/σ) W ] = EW [EV [e( 2tW/σ)V ]], we can use the σ-subgaussian behavior of V in the inner expectation to get an upper bound of 2 √ 2 2 EW [eσ ( 2t|W |/σ) /2 ] = EW [etW ]. Excellent. Now the bound on the upper tail for sums of squares of symmetric mean-zero σ-subgaussians follows from that of Gaus2 sians. The lower tail (which requires us to bound E[etV ] for t < 0) needs one more idea: suppose V is a mean-zero σ-subgaussian with An r.v. X is symmetric if it is distributed the same as R| X |, where R is an independent Rademacher. dimension reduction and the jl lemma parameter σ2 = 1, and suppose |t| < 1. A Taylor expansion shows that 2 E[etV ] ≤ 1 + tE[V 2 ] + t2 ∑ E[V 2i /i!]. 
i ≥2 2 Since E[V 2 ] = 1 and |t| < 1, this is at most 1 + t + t2 E[eV ]. Now 2 2 2 use the above bound E[eV ] ≤ E[eW ] to get that E[etV ] ≤ 1 + t + √ t2 / 1 − 2t, and the proof proceeds as for the Gaussian case. In summary, we get the same tail bounds as in §11.4.1, and hence that the Rademacher matrix also has the distributional JL property, while using far fewer random bits! In general one can use other σ-subgaussian distributions to fill the matrix M—using σ different than 1 may require us to rework the proof from §11.4.1 since the linear terms in (11.6) don’t cancel any more, see works by Indyk and Naor or Matousek for details. Indyk and Naor (2008) Matoušek (2008) 11.6.3 The Fast JL Transform A different direction to consider is getting fast algorithms for the JL Lemma: Do we really need to plug in non-zero values into every entry of the matrix A? What if most of A is filled with zeroes? The first problem is that if x is a very sparse vector, then Ax might be zero with high probability? Achlioptas showed that having a random two-thirds of the entries of A being zero still works fine: Nir Ailon and Bernard Chazelle showed that if you first hit x with a suitable matrix P which caused Px to be “well-spread-out” whp, and then ∥ APx ∥ ≈ ∥ x ∥ would still hold for a much sparser A. Moreover, this P requires much less randomness, and furthermore, the computations can be done faster too! There has been much work on fast and sparse versions of JL: see, e.g., this paper from SOSA 2018 by Michael Cohen, T.S. Jayram, and Jelani Nelson. Jelani Nelson also has some notes on the Fast JL Transform. 11.7 Ailon and Chazelle Cohen, Jayram, and Nelson (2018) Optional: Compressive Sensing To rewrite. In an attempt to build a better machine to take MRI scans, we decrease the number of sensors. Then, instead of the signal x we intended to obtain from the machine, we only have a small number of measurements of this signal. Can we hope to recover x from the measurements we made if we make sparsity assumptions on x? We use the term s-sparse signal for a vector with at most s nonzero entries, i.e., with | supp( x )| ≤ s. Formally, x is a n-dimensional s-sparse vector, and a measurement of x with respect to a vector a is a real number given by ⟨ a, x ⟩. If we ask k questions, this gives us a k × n sensing matrix A (whose rows It is common to use the notation ∥ x ∥0 := | supp( x )|, even though this is not a norm. 143 144 optional: compressive sensing are the measurements), and a k-dimensional vector b of results. We want to reconstruct x with s nonzero entries satisfying Ax = b. This is often written as n o min ∥ x ∥0 | Ax = b . (11.10) 11.7.1 Sparse Recovery: A First Attempt What properties would we like from our sensing matrix A? The first would be some form of consistency: that the problem should be solvable. Definition 11.10 (Kruskal Rank). An m × n matrix A has Kruskal rank r if every subset of r of its columns are linearly independent. Lemma 11.11 (Unique Decoding). If A has Kruskal rank ≥ 2s, then for any b we have Ax = b for at most one s-sparse x. Proof. Suppose Ax = Ax ′ for two s-sparse vectors x, x ′ . Then A( x − x ′ ) = 0 for the 2s-sparse vector x − x ′ . The Kruskal rank being 2s means this vector x − x ′ = 0, and hence x = x ′ . So we can just find some sensing matrix with large Kruskal rank Give examples here and ensure our results will be unique. The next question is: how fast can we find x? (We should also be worried about noise in the measurements.) 
A generic construction of matrices with large Kruskal rank may not give us efficient solutions to (11.10). Indeed, it turns out that the problem as formulated is NP-hard, assuming A and b are contrived by an adversary. Of course, asking to solve (11.10) for general A, b is a more difficult problem than we need to solve. In our setting, we can choose A as we like and then are given b = Ax, so we can ask whether there are matrices A for which this decoding process is indeed efficient. This is precisely what we do next. 11.7.2 The Basis Pursuit Algorithm Consider the following similar looking problem called the basis pursuit (BP) problem: n o min ∥ x ∥1 | Ax = b . (11.11) This problem can be formulated as a linear program as follows, and hence can be efficiently solved. Introduce n new variables y1 , y2 , . . . , yn under the constraints o n min ∑ yi | Ax = b, −yi ≤ xi ≤ yi . i dimension reduction and the jl lemma 145 Definition 11.12. We call a matrix A as BP-exact for sparsity s if for all vectors b such that the non-convex program (11.10) has a unique solution x ∗ , this vector x ∗ is also the unique optimal solution to the basis pursuit LP (11.11). In other words, we want a matrix A for which the two programs return the same optimal solution. But do BP-exact matrices exist? If so, how do we efficiently construct them? Our next ingredient will be crucial to show their existence and construction. Definition 11.13 (Restricted Isometry Property (RIP)). A matrix A is (t, ε)-RIP if for all unit vectors x with ∥ x ∥0 ≤ t, we have ∥ Ax ∥22 ∈ [1 ± ε]. Lemma 11.14 (RIP =⇒ BP-exact). If a matrix A is (3s, ε)-RIP for some ε ≤ 1/9, then A is BP-exact for sparsity s. Proof. Suppose x ∗ is the unique solution to (11.10) and x the solution to (11.11), so that ∥ x ∥1 ≤ ∥ x ∗ ∥1 . (11.12) Suppose x − x ∗ = ∆ ̸= 0; hence A∆ = A( x − x ∗ ) = 0. If we could somehow show that supp(∆) ≤ 3s, then using the RIP property for A, we would get 0 = ∥ A∆∥2 ≥ (1 − ε)∥∆∥2 > 0, a contradiction. But of course, ∆ could have large support, so we need to work harder. The actual proof breaks up ∆ into small pieces (so that the RIP matrix A maintains their length), and argues that there is one large piece that the other pieces cannot cancel out. Let S := supp( x ∗ ) be the support of x ∗ , and S be the remaining coordinates. Let’s sort these coordinates in decreasing order of their absolute value, and group them into buckets of 2s consecutive coordinates. Call these buckets B1 , B2 , . . .. √ Claim 11.15. ∑ j≥2 ∥∆ Bj ∥2 ≤ ∥∆S ∥2 / 2. Before we prove the claim, let’s see how to use it. The claim says that total Euclidean length of the vectors {v Bj } j≥2 is a constant factor smaller than that of vS∪ B1 . So even after the near-isometric mapping A, the lengths of the former would not be able to cancel the length of the latter. Formally: 0 = ∥ A∆∥2 ≥ ∥ A∆S∪ B1 ∥2 − ∑ ∥ A∆ Bj ∥2 j ≥2 ≥ (1 − ε) ∥∆S∪ B1 ∥2 − (1 + ε) ∑ ∥∆ Bj ∥2 j ≥2 1+ε ≥ (1 − ε ) ∥ ∆ S ∥2 − √ ∥ ∆ S ∥2 , 2 For vector v ∈ Rn and subset T ⊆ [n], define vector v T ∈ Rn which agrees with v on the coordinates in S, and which has zeroes elsewhere. 146 optional: compressive sensing where the first step uses the triangle inequality for norms, the second uses that each ∆S∪ B1 and ∆ Bj are 3s-sparse, and the last step uses ∥∆S∪ B1 ∥2 ≥ ∥∆S ∥2 and also Claim 11.15. Finally, since ε ≤ 1/9, we have 1 − ε > 1√+ε , so the only remaining possibility is that ∆S = 0. 2 The next claim implies that ∆S = 0 implies that ∆ = 0, giving a contradiction and hence the proof of Lemma 11.14. 
Claim 11.16. ∥∆S ∥1 ≥ ∥∆S ∥1 . Proof. We finally use that x = x ∗ + ∆ is the optimizer for the LP, which means ∥ x ∗ ∥1 > ∥ x ∗ + ∆∥1 = ∥ xS∗ + ∆S ∥1 + ∥∆S ∥1 ≥ ∥ xS∗ ∥1 − ∥∆S ∥1 + ∥∆S ∥1 . (The last step uses the triangle inequality.) Since ∥ x ∗ ∥1 = ∥ xS∗ ∥1 , we get Claim 11.16. The final piece of the argument is to prove Claim 11.15: Proof of Claim 11.15. Take any bucket Bj for j ≥ 2. Each entry of ∆ in this bucket is smaller than the smallest entry of Bj−1 , and hence smaller than the average entry of Bj−1 . And there are 2s entries in this bucket Bj , so the Euclidean length of the bucket is ∥ ∆ B j ∥2 ≤ √ 2s · ∥ ∆ B j −1 ∥ 1 2s = ∥ ∆ B j −1 ∥ 1 √ . 2s Summing this over all j ≥ 2, we get ∥ ∆ B j −1 ∥ 1 ∥ ∆ ∥1 √ = √S . 2s 2s j ≥2 ∑ ∥ ∆ B j ∥2 ≤ ∑ j ≥2 Now ∥∆S ∥1 ≤ ∥∆S ∥1 by Claim 11.16. And finally, since the sup√ port of ∆S is of size s, we can bound its ℓ1 length by s times its ℓ2 √ length, finishing the claim. (Since we wanted that factor of 2 in the denominator, we made the buckets slightly larger than the size of S.) This completes the proof for Lemma 11.14. Finally, how do we construct RIP matrices? Call a distribution D over k × n matrices a distributional JL family if Lemma 11.2 is true when A is drawn from D . The following theorem was proved by David Donoho, and by Emanuel Candes and Terry Tao, and by Mark Rudelson and Roman Vershynin. (The connection of their constuction to the distributional JL was made explicit by Baraniuk et al.) Theorem 11.17 (JL =⇒ RIP). If we pick A ∈ Rk×n from a distributional JL family with k ≥ Ω(s log n/s), then with high probability A is BP-exact. Exercise:: forpany vector v ∈ Rd , show that ∥v∥1 ≤ supp(v) · ∥v∥2 . dimension reduction and the jl lemma 147 Proof. The proof is simple, but uses some fairly general ideas worth emphasizing. First, focus on some s-dimensional subspace of Rn (obtained by restricting to some subset of coordinates). For notational simplicity, we just identify this subspace with Rs . 1. For δ = ε/3, pick an δ-net N of the sphere Ss−1 (under Euclidean distances). This can be done by a greedy algorithm: if some point x does not satisfy the covering property at any time, it can be added to the net. We claim the size of the net is | N | := (4/δ)s . Indeed, define balls of radius δ/2 around the points in N; these are disjoint by the packing property of nets, and are all contained in a ball of radius 1 + δ around the origin. Since the volume of balls of radius r scales as r s , we have  s 1+δ |N| ≤ = (4/δ)s . δ/2 2. If A is an δ-isometry on the δ-net N ⊆ Ss−1 , we claim it is a 3δisometry on all of Ss−1 . Indeed, consider the point x that achives the maximum stretch arg max{∥ Ax ∥2 | x ∈ Ss−1 }, and let this stretch be M. Let y be the closest point in N to x; by the packing property ∥ x − y∥ ≤ δ. Then M = ∥ Ax ∥ ≤ ∥ Ay∥ + ∥ A( x − y)∥ ≤ δ (1 + δ) + Mδ. Rearranging, M ≤ 11+ −δ ≤ (1 + 3δ ) for δ ≤ 1/3, say. For the contraction, consider any x ∈ Ss−1 , with closest net point y. Then ∥ Ax ∥ ≥ ∥ Ay∥ − ∥ A( x − y)∥ ≥ 1 − δ − (1 + 3δ)δ ≥ 1 − 3δ, again as long as δ ≤ 1/3. 3. By Lemma 11.2, the random matrix A with m rows is an δ-isometry on each point in the net N, except with probability exp(−cδ2 m) for some constant c. 4. Now apply the above argument to each of the (ns) subspaces obtained by restricting to some subset S of coordinates. 
By a union bound over all subsets S, and over all points in the net for that subspace, the matrix A is an 3δ-isometry on all points with support in S except with probability   n · (4/δ)s · exp(−cδ2 m) ≤ exp(−Θ(m)), s as long as m is Ω(s log n/s). Since ε = 3δ, we have the proof. This presentation is based on notes by Jirka Matoušek. Also see Chapter 4 of Ankur Moitra’s book for more on compressed sensing, sparse recovery and basis pursuit. Given a metric space ( X, d), a δ-net is a subset N ⊆ X such that (i) d( x, y) ≥ δ for all x, y ∈ N, and (ii) for each x ∈ X there exists y ∈ N such that d( x, y) ≤ δ. The former is call the packing property and the latter the covering property of nets. 148 some facts about balls in high-dimensional spaces 11.8 Some Facts about Balls in High-Dimensional Spaces Consider the unit ball Bd := { x ∈ Rd | ∥ x ∥2 ≤ 1}. Here are two facts, whose proofs we sketch. These sketches can be made formal (since the approximations are almost the truth), but perhaps the style of arguments are more illuminating. Theorem 11.18 (Heavy Shells). At least 1 − ε of the mass of the unit ball log 1/ε in Rd lies within a Θ( d )-width shell next to the surface. Proof. (Sketch) The volume of a radius-r ball in Rd goes as r d , so the fraction of the volume not in the shell of width w is (1 − w)d ≈ e−wd , log 1/ε which is ε when w ≈ d . Given any hyperplane H = { x ∈ Rd | a · x = b} where ∥ a∥ = 1, the width-w slab around it is K = { x ∈ Rd | b − w ≤ a · x ≤ b + w}. Theorem 11.19 (Heavy Slabs). At least (1 − ε) of the mass of the unit ball √ in Rd lies within Θ(1/ d) slab around any hyperplane that passes through the origin. Proof. (Sketch) By spherical symmetry we can consider the hyperplane { x1 = 0}. The volume of the ball within {−w ≤ x1 ≤ w} is at least Z w q Z w 2 d −1 d −1 2 ( 1 − y ) dy ≈ e−y · 2 dy. y =0 y =0 1 If we define σ2 = d− 1 , this is Z w y =0 e − y2 2σ2 dy ≈ Pr[ G ≤ w], 2 2 where G ∼ N (0, σ2 ). But we know that Pr[ G ≥ w] ≤ e−w /2σ by our generic Chernoff bound for Gaussians (10.21). So setting that tail probability to be ε gives r q  log(1/ε)  2 . w ≈ 2σ log(1/ε) = O d This may seem quite counter-intuitive: that 99% of the volume of the sphere is within O(1/d) of the surface, yet 99% is within √ O(1/ d) of any central slab! This challenges our notion of the ball “looking like” the smooth circular object, and more like a very spiky sea-urchin. Finally, a last observation: Corollary 11.20 (Near-orthogonality). Two random vectors from the surface of the unit ball in Rd (i.e., from the sphere Sd−1 ) are nearly orthogonal qwith high probability. In particular, their dot-product is smaller than O( log(1/ε) ) with probability 1 − ε. d Figure 11.1: Sea Urchin (from uncommoncaribbean.com) dimension reduction and the jl lemma Proof. Fix one of the vectors u. Then for dot-product |u · v| to be at most ε, the other vector v must fall in the slab of width ε around the hyperplane { x · u = 0}. Now Theorem 11.19 completes the argument. This means that if we pick n random vectors in Rd , and q set ε = log n 1/n2 , a union bound gives that all have dot-product O( d ). Set2 ting this dot-product to ε gives us n = exp(ε d) unit vectors with mutual dot-products at most ε, exactly as in the calculation at the beginning of the chapter. 149 12 Streaming Algorithms We now consider a slightly different computational model called the data streaming model. In this model we see elements going past in a “stream”, and we have very little space to store things. 
For example, we might be running a program on an Internet router with limited space, and the elements might be IP Addresses. We certainly don’t have space to store all the elements in the stream. The question is: which functions of the input stream can we compute with what amount of time and space? While we focus on space, similar questions can be asked for update times. We denote the stream elements by a1 , a2 , a3 , . . . , a t , . . . We assume each stream element is from alphabet U, and takes b = | log2 U | bits to represent. For example, the elements might be 32-bit integers IP addresses. We imagine we are given some function, and we want to compute it continually, on every prefix of the stream. Let us denote a[1:t] = ⟨ a1 , a2 , . . . , at ⟩. For example, if we have seen the integers: 3, 1, 17, 4, −9, 32, 101, 3, −722, 3, 900, 4, 32, . . . (12.1) 1. Can we compute the sum of all the integers seen so far? I.e., F ( a[1:t] ) = ∑it=1 ai . We want the outputs to be 3, 4, 21, 25, 16, 48, 149, 152, −570, −567, 333, 337, 369, . . . If we have seen T numbers so far, the sum is at most T2b and hence needs at most O(b + log T ) space. So we can just keep a counter, and when a new element comes in, we add it to the counter. 2. How about the maximum of the elements so far? F ( a[1:t] ) = maxit=1 ai . Even easier. The outputs are: 3, 1, 17, 17, 17, 32, 101, 101, 101, 101, 900, 900, 900 152 streams as vectors, and additions/deletions We just need to store b bits. 3. The median? The outputs on the various prefixes of (12.1) now are 3, 1, 3, 3, 3, 3, 4, 3, . . . And doing this will small space is a lot more tricky. 4. (“distinct elements”) Or the number of distinct numbers seen so far? We’d want to output: 1, 2, 3, 4, 5, 6, 7, 7, 8, 8, 9, 9, 9 . . . 5. (“heavy hitters”) Or the elements that have appeared most often so far? Hmm... We can imagine the applications of the data-stream model. An Internet router might see a lot of packets whiz by, and may want to figure out which data connections are using the most space? Or how many different connections have been initiated since midnight? Or the median (or the 90th percentile) of the file sizes that have been transferred. Which IP connections are “elephants” (say the ones that have used more than 0.01% of our bandwidth)? Even if we are not working at “line speed”, but just looking over the server logs, we may not want to spend too much time to find out the answers, we may just want to read over the file in one quick pass and come up with an answer. Such an algorithm might also be cache-friendly. But how to do this? Two of the recurring themes will be: 1. Approximate solutions: in several cases, it will be impossible to compute the function exactly using small space. Hence we’ll explore the trade-offs between approximation and space. 2. Hashing: this will be a very powerful technique. 12.1 Streams as Vectors, and Additions/Deletions An important abstraction will be to view the stream as a vector (in high dimensional space). Since each element in the stream is an element of the universe U, we can imagine the stream at time t as a vector xt ∈ Z|U | . Here xt = ( x1t , x2t , . . . , x|tU | ) and xit is the number of times the ith element in U has been seen until time t. (Hence, xi0 = 0 for all i ∈ U.) When the next element comes in and it is element j, we increment x j by 1. Such a router might see tens of millions of packets per second. 
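As a (naive) baseline, the vector abstraction is easy to maintain explicitly; the catch is that it needs one counter per distinct element — up to |U| of them — whereas the "easy" functions above need only a constant number of words. A minimal Python sketch, run on the example stream (12.1), purely for illustration:

from collections import defaultdict

x = defaultdict(int)                    # the frequency vector x^t, stored explicitly
running_sum, running_max = 0, float("-inf")

for a in [3, 1, 17, 4, -9, 32, 101, 3, -722, 3, 900, 4, 32]:
    x[a] += 1                           # the vector view: bump one coordinate
    running_sum += a                    # the sum needs O(b + log T) bits of state
    running_max = max(running_max, a)   # the max needs b bits

distinct = sum(1 for v in x.values() if v > 0)
print(running_sum, running_max, distinct)   # 369, 900, 9 -- matching the outputs listed above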
This brings us to an extension of the model: each element of the stream is either a new element arriving, or an old element departing. Formally, each update a_t looks like (add, e) or (del, e). We usually assume that for each element, the number of deletes we see for it is at most the number of adds we see — the running count of each element stays non-negative. As an example, suppose the stream looks like:

(add, A), (add, B), (add, A), (del, B), (del, A), (add, C), . . .

and if A is the first element of U, then the first coordinate x1 of the vector x would be 1, 1, 2, 2, 1, 1, . . ..

This vector notation allows us to formulate some of the problems more easily:

1. The total number of elements currently in the system is just ∥x∥1 := ∑_{i=1}^{|U|} x_i. (This is easy.)

2. We might want to estimate the norms ∥x∥2, ∥x∥p of the vector x.

3. The number of distinct elements is the number of non-zero entries in x, denoted by ∥x∥0.

Let's consider the (non-trivial) problems one by one.

12.2 Computing Moments

Recall that x^t was the vector of frequencies of elements seen so far. Several interesting problems can be posed as computing various norms of x^t: in particular the Euclidean or 2-norm

∥x^t∥2 = √( ∑_{i=1}^{|U|} (x_i^t)^2 ),

and the 0-norm (which is not really a norm)

∥x^t∥0 := number of non-zeroes in x^t.

Henceforth, we use the notation that F0 := ∥x^t∥0 is the number of non-zero entries in x. For p ≥ 1, we consider the p-moment, that is, the pth power of the p-norm:

Fp := ∑_{i=1}^{|U|} (x_i^t)^p.    (12.2)

We'll develop an algorithm to compute F2, and to compute F0; we may see extensions from F2 to Fp in the homeworks.

In data stream jargon, the addition-only model is called the cash-register model, whereas the model with both additions and deletions is called the turnstile model. I will not use this jargon.

12.2.1 Computing the Second Moment F2

The "second moment" F2 of the stream is often called the "surprise number" (since it captures how uneven the data is). This is also the size of the self-join. Clearly we can store the entire vector x and compute F2, but that requires storing |U| counts. Here's an algorithm that uses much less space:

Pick a random hash function h : U → {−1, +1} from family H.
Maintain a counter C, which starts off at zero.
On update (add, i), for i ∈ U, increment the counter: C → C + h(i).
On update (delete, i), decrement the counter: C → C − h(i).
On a query about the value of F2, reply with C^2.

This estimator is often called the "tug-of-war" estimator: the hash function randomly partitions the elements into two parties (those mapping to 1, and those to −1), and the counter keeps the difference between the sizes of the two parties.

This estimator was given by Noga Alon, Yossi Matias, and Mario Szegedy, in their Gödel-prize-winning paper on streaming computation.

12.2.2 Properties of the Hash Family

The choice of the hash family will be crucial: we want a small family so that we require only a small amount of space to store the hash function, but we want it to be rich enough for the subsequent analysis to go through.

Definition 12.1 (k-universal hash family). H is k-universal (also called uniform and k-wise independent) mapping universe U to some range R if for all distinct elements i1, . . . , ik ∈ U and for all values α1, . . . , αk ∈ R,

Pr_{h←H}[ h(i1) = α1, . . . , h(ik) = αk ] = 1/|R|^k.    (12.3)
| R|k (12.3) In our application, we want the hash family to be 4-universal from U to the two-element range R = {−1, 1}. This means that for any element i, 1 Pr [h(i ) = 1] = Pr [h(i ) = −1] = . 2 h← H h← H Moreover, for four distinct elements i, j, k, l, their maps behave independently of each other, and hence E[h(i ) · h( j) · h(k) · h(l )] = E[h(i )] · E[h( j)] · E[h(k)] · E[h(l )]. E[h(i ) · h( j)] = E[h(i )] · E[h( j)]. We will discuss constructions of such hash families soon, but let us use them to analyze the tug-of-war estimator. Alon, Matias, Szegedy (2000) streaming algorithms 12.2.3 A Direct Analysis Hence, having seen the stream that results in the frequency vector |U | x ∈ Z≥0 , the counter will have the value C : = ∑ x i h ( i ). i ∈U Remember, the resulting estimate is C2 : so we need to show that E[C2 ] = F2 , and variance that is small enough that Chebyshev’s inequality ensures we are correct with reasonable probability.  E[C2 ] = E[∑ h(i ) xi · h( j) x j ] = ∑ xi x j E[(h(i ) · h( j))] i,j i,j =∑ i xi2 E[h(i ) · h(i )] + = ∑ xi2 = F2 . ∑ ∑ xi x j E[h(i)] · E[h( j)] i ̸= j i,j i So in expectation we are correct! Next, recall that the variance is defined as Var(C2 ) = E[(C2 )2 ] − E[C2 ]2 : E[(C2 )2 ] = E[ ∑ h( p)h(q)h(r )h(s) x p xq xr xs ] = p,q,r,s = ∑ x4p E[h( p)4 ] + 6 ∑ x2p xq2 E[h( p)2 h(q)2 ] + other terms p εE[C2 ]] ≤ Var(C2 ) 2 ≤ 2. (εE[C2 ])2 ε This is pretty pathetic: since ε is usually less than 1, the RHS usually more than 1. 12.2.4 Reduce the Variance by Repetition The idea is the simplest one: if we have an estimator with mean µ and variance σ2 , then taking the average of k independent copies of 155 156 a matrix view of our estimator this estimator has mean µ and variance σ2 /k. (Why? Summing the independent copies sums the variances and so increases it by k, but dividing by k reduces it by k2 .) So if we k such independent counters C1 , C2 , . . . , Ck , and return their average C = 1k ∑i Ci , we get 2 2 2 Var(C ) 2 Pr[|C − E[C ]| > εE[C ]] ≤ 2 (εE[C ])2 ≤ 2 . kε2 Taking k = ε22δ independent counters gives a probability δ of error on any query. Each counter uses a 4-universal hash function, which requires O(log U ) random bits to store. 12.2.5 Estimating the p-Moments To fix, please skip. A bunch of students (Jason, Anshu, Aram) proposed that for the pth -moment calculation we should use 2p-wise independent hash functions from U to R, where R = {1, ω, ω 2 , . . . , ω p−1 }, the p primitive roots of unity. Again, we set C := ∑i∈U xi h(i ), and return the real part of C p as our estimate. This approach has been explored by Ganguly in this paper. Some calculations (and elbowgrease) show that E[C p ] = Fp , but it seems that naively Var(C p ) p p tends to grow like F2 instead of Fk ; this leads to pretty bad bounds. Ganguly’s paper gives some ways of controlling the variance. BTW, there is a lower bound saying that any algorithm that outputs a 2-approximation for Fk requires at least |U |1−2/k bits of storage. Hence, while we just saw that for k = 2, we can get away with just O(log |U |) bits to get a O(1)-estimate, for k > 2 things are much worse. 12.3 A Matrix View of our Estimator Here’s a equivalent way of looking at this estimator, that also relates it to the previous chapter and the JL Theorem. Recall that the stream can be viewed as representing a vector x of size |U |, and F2 = ∥x∥2 . Take a matrix M of dimensions k × D, where D = |U |: again, M is a “fat and short” matrix, since k = O(ε−2 δ−1 ) is small and D = |U | is huge. 
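Concretely, the tug-of-war estimator of §12.2.1, with the averaging of §12.2.4, can be sketched as follows; each counter below is exactly one row of the matrix M being described here. For simplicity this sketch draws fully independent random signs and caches them per element, rather than using a 4-universal hash family: the estimate is computed identically, but the small-space guarantee on the randomness is lost.

```python
import random

class TugOfWarF2:
    """AMS 'tug-of-war' sketch for F2 (the second moment) with k averaged counters.

    Simplification: each counter uses fully independent +/-1 signs, drawn lazily
    and cached per element, instead of a 4-wise independent hash family.  The
    estimator is the same; only the space/randomness guarantee is weaker."""

    def __init__(self, k, seed=0):
        self.rng = random.Random(seed)
        self.signs = [dict() for _ in range(k)]   # per-counter sign tables
        self.counters = [0] * k

    def _sign(self, j, item):
        table = self.signs[j]
        if item not in table:
            table[item] = self.rng.choice((-1, +1))
        return table[item]

    def update(self, item, delta=+1):
        """delta = +1 for (add, item), -1 for (del, item)."""
        for j in range(len(self.counters)):
            self.counters[j] += delta * self._sign(j, item)

    def estimate(self):
        """Average of C_j^2 over the k counters."""
        return sum(c * c for c in self.counters) / len(self.counters)

sketch = TugOfWarF2(k=50)
for item in ["A", "B", "A", "C", "A", "B"]:     # frequencies: A=3, B=2, C=1
    sketch.update(item)
print(sketch.estimate())                        # should be near 3^2 + 2^2 + 1^2 = 14
```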
Pick k independent hash functions h1 , h2 , . . . , hk from the 4-universal hash family, and use each one to fill a row of M: Mij := hi ( j). The k counters C1 , C2 , . . . , Ck are now nothing other than the entries of the matrix-vector product M x. streaming algorithms 2 The estimate C =  1 k k ∑i =1 Ci 2 157 is nothing but 1 ∥ Mx∥22 . k This is completely analogous to the construction for JL: we’ve got a slightly taller matrix with k = O(ε−2 δ−1 ) rows instead of k = O(ε−2 log δ−1 ) rows. However, the matrix entries are not fully independent (as in JL), just 4-wise independent. I.e., we need to store only O(k log D ) bits and can generate any entry of M quickly, whereas the construction for JL stored all kD bits. Let us record two properties of this construction: Henceforth, we use S = √1 M to denote k the “sketch” matrix. Theorem 12.2 (Tug-of-War Sketch). Take a k × D matrix S whose −1 }k -valued r.v.s. Then for x, y ∈ columns are 4-wise independent { √1 , √ k RD , k 1. E[ ⟨Sx, Sy⟩ ] = ⟨x, y⟩. 2. Var( ⟨Sx, Sy⟩ ) = 2k · ∥x∥22 ∥y∥22 . The proofs is similar to that in §12.2.3; using y = x gives us exactly the results from that section. Moreover, an analogous theorem can also be given in the JL construction, with fewer rows but with completely independent entries. 12.4 Application: Approximate Matrix Multiplication Suppose we want to multiply square matrices A, B ∈ Rn×n , but want to solve the problem faster, at the expense of getting only an approximate solution C ≈ AB. How should we measure the error? Requiring that the answer be close entry-wise to the actual answer is a hard problem. Let’s aim for something weaker: we want the “aggregate error” to be small. Formally, the Frobenius norm of matrix M is ∥ M∥ F := s It’s as though we think of the matrix as just a vector and look at its Euclidean length. ∑ Mij2 . i,j Our guarantee for approximate matrix multiplication will be ∥C − AB∥2F ≤ small. Here’s the idea: we want to do the matrix multiplication: C = AB A B = C 158 optional: computing the number of distinct elements This usually takes O(n3 ) time. Indeed, the ijth entry of the product C is the dot-product of the ith row Ai⋆ of A with the jth column B⋆ j of B, and the dot-product takes O(n) time. Suppose instead we use a “fat and short” k × n matrix S (for k ≪ n), and calculate e = AS⊺ SB. C By associativity of matrix multiplication, we could first compute ( AS⊺ ) and (SB) in times O(n2 k), and then multiply the results in time O(nk2 ). Moreover, the matrix S from the previous section works pretty well, where we set D = n. e satisfy Indeed, entries of the error matrix Y = C − C The intuition is that S⊺ S is an almostidentity matrix, it has 1 on the diagonals and at most ε everywhere else. And hence it gives only a small error. Of course, we don’t multiply out S⊺ S, but instead compute AS⊺ and SB, and then multiply the smaller matrices. A S> S B ≈ E[Yij ] = 0 and E[Yij2 ] = Var(Yij ) + E[Yij ]2 = Var(Yij ) ≤ 2k ∥ Ai⋆ ∥22 ∥ B⋆ j ∥22 . So E[∥ AB − AS⊺ SB∥2F ] = E[∑ Yij2 ] = ∑ E[Yij2 ] = 2k ∑ij ∥ Ai⋆ ∥22 ∥ B⋆ j ∥22 ij The squared Frobenius norm of a matrix is the sum of squared Euclidean lengths of the columns, or of the rows. ij = 2k ∥ A∥2F ∥ B∥2F . Finally, setting k = ε22δ and using Markov’s inequality, we can say that for any fixed ε > 0, we can compute an approximate matrix product C := AS⊺ SB such that   Pr ∥ AB − C ∥ F ≤ ε · ∥ A∥ F ∥ B∥ F ≥ 1 − δ, 2 in time O( εn2 δ ). 
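Here is a minimal NumPy sketch of this approximate multiplication. For simplicity the sketch matrix S below has fully independent ±1/√k entries instead of the 4-wise independent rows described above; the expectation and variance calculations go through unchanged, and the point is only to show the (AS⊺)(SB) computation.

```python
import numpy as np

def approx_matmul(A, B, k, seed=0):
    """Approximate A @ B by (A S^T)(S B) for a k x n sketch matrix S.

    S has independent +/- 1/sqrt(k) entries (a simplification of the
    4-wise independent construction), so E[S^T S] = I."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    S = rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(k)
    return (A @ S.T) @ (S @ B)        # O(n^2 k + n k^2) time instead of O(n^3)

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 1000))
B = rng.standard_normal((1000, 1000))
C_approx = approx_matmul(A, B, k=400)
rel_err = (np.linalg.norm(A @ B - C_approx, "fro")
           / (np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro")))
print(rel_err)                         # typically around sqrt(2/k), here ~0.07
```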
(If we want to make δ very small, at the expense of picking more independent random bits in the sketching matrix S, we can use the JL matrices instead. Details will appear in a homework.) Finally, if the matrices A, B are sparse and contains only ≪ n2 entries, the time can be made to depend on nnz( A, B). The approximate matrix product question has been considered often, e.g., by Edith Cohen and David Lewis using a random-walks approach. The algorithm we present is due to Tamás Sarlós; his paper gives better results, as well as extensions to computing SVDs faster. Better bounds have subsequently been given by Clarkson and Woodruff. More recent refs too. 12.5 Optional: Computing the Number of Distinct Elements Our last example today will be to compute F0 , the number of distinct elements seen in the data stream, but in the addition-only model, with no deletions. (We’ll see another approach in a HW.) Cohen and Lewis (1999) d C streaming algorithms 12.5.1 A Simple Lower Bound Of course, if we store x explicitly (using |U | space), we can trivially solve this problem exactly. Or we could store the (at most) t elements seen so far, again we could give an exact answer. And indeed, we cannot do much better if we want no errors. Here’s a proof sketch for deterministic algorithms (one can extend this to randomized algorithms with some more work). Lemma 12.3 (A Lower Bound). Suppose a deterministic algorithm correctly reports the number of distinct elements for each sequence of length at most N. Suppose N ≤ 2|U |. Then it must use at least Ω( N ) bits of space. Proof. Consider the situation where first we send in some subset S of N − 1 elements distinct elements of U. Look at the information stored by the algorithm. We claim that we should be able to use this information to identify exactly which of the ( N|U−|1) subsets of U we have seen so far. This would require    |U | log2 ≥ ( N − 1) log2 |U | − log2 ( N − 1) = Ω( N ) N−1 bits of memory. OK, so why should we be able to uniquely identify the set of elements until time N − 1? For a contradiction, suppose we could not tell whether we’d seen S1 or S2 after N − 1 elements had come in. Pick any element e ∈ S1 \ S2 . Now if we gave the algorithm e as the N th element, the number of distinct elements seen would be N if we’d already seen S2 , and N − 1 if we’d seen S1 . But the algorithm could not distinguish between the two cases, and would return the same answer. It would be incorrect in one of the two cases. This contradicts the claim that the algorithm always correctly reports the number of distinct elements on streams of length N. OK, so we need an approximation if we want to use little space. Let’s use some hashing magic. 12.5.2 The Intuition Suppose there are d = ∥x∥0 distinct elements. If we randomly map d distinct elements onto the line [0, 1], we expect to see the smallest mapped value at location ≈ d1 . (I am assuming that we map these elements consistently, so that multiple copies of an element go to the same place.) So if the smallest value is δ, one estimator for the number of elements is 1/δ. This is the essential idea. To make this work (and analyze it), we change it slightly: The variance of the above estimator is large. By the We used the approximation that     m m k ≥ , k k and hence   m log2 ≥ k(log2 m − log2 k). k 159 160 optional: computing the number of distinct elements same argument, for any integer s we expect the sth smallest mapped value at ds . We use a larger value of s to reduce the variance. 
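The next subsection makes this idea precise; as a preview, here is a minimal sketch of the resulting estimator. For integer items it uses the standard pairwise-independent hash h(x) = (ax + b) mod p, with the prime p playing the role of the range size M.

```python
import random

class DistinctElements:
    """Estimate F0 (number of distinct elements) from the s-th smallest hash value."""

    def __init__(self, s=32, p=(1 << 61) - 1, seed=0):
        rng = random.Random(seed)
        self.s, self.p = s, p
        self.a, self.b = rng.randrange(1, p), rng.randrange(p)
        self.smallest = set()            # at most s smallest distinct hash values

    def update(self, item):
        h = (self.a * item + self.b) % self.p
        if h in self.smallest:
            return                       # copies of an element hash consistently
        self.smallest.add(h)
        if len(self.smallest) > self.s:
            self.smallest.remove(max(self.smallest))

    def estimate(self):
        if len(self.smallest) < self.s:  # fewer than s distinct items seen:
            return len(self.smallest)    # exact (ignoring hash collisions)
        return self.s * self.p // max(self.smallest)   # D_t = M*s / L_t

est = DistinctElements(s=32)
for item in (v % 500 for v in range(10_000)):   # 500 distinct elements, 20 copies each
    est.update(item)
print(est.estimate())                            # in the right ballpark of 500
```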
12.5.3 The Algorithm Assume we have a hash family H with hash functions h : U → [ M ]. (We’ll soon figure out the precise properties we’ll want from this hash family.) We will later fix the value of the parameter s to be some large constant. Here’s the algorithm: Pick a hash function h randomly from H . If query comes in at time t Consider the hash values h( a1 ), h( a2 ), . . . , h( at ) seen so far. Let Lt be sth smallest distinct hash value h( ai ) in this set. ·s Output the estimate Dt = M L . t The crucial observation is: it does not matter if we see an element e once or multiple times — the algorithm will behave the same, since the output depends on what distinct elements we’ve seen so far. Also, maintaining the sth smallest element can be done by remembering at most s elements. (So we want to make s small.) How does this help? As a thought experiment, if we had d distinct darts and threw them in the continuous interval [0, M ], we would expect the location of the sth smallest dart to be about s·dM . So if the sth smallest dart was at location ℓ in the interval [0, M ], we would be tempted to equate ℓ = s·dM and hence guessing d = s·ℓM would be a good move. Which is precisely why we used the estimate Dt = M·s . Lt Of course, all this is in expectation—the following theorem argues that this estimate is good with reasonable probability. Theorem 12.4. Consider some time t. If H is a uniform 2-universal hash family mapping U → [ M ], and M is large enough, then both the following guarantees hold: 3 , and s ∥ x t ∥0 3 ]≤ . Pr[ Dt < 2 s Pr[ Dt > 2 ∥xt ∥0 ] ≤ (12.4) (12.5) We will prove this in the next section. First, some observations. Firstly, we now use the stronger assumption that that the hash family 2-universal; recall the definition from Section 12.2.2. Next, setting ∥xt ∥ s = 8 means that the estimate Dt lies within [ 2 0 , 2∥xt ∥0 ] with probability at least 1 − (1/4 + 1/4) = 1/2. (And we can boost the streaming algorithms success probability by repetitions.) Secondly, we will see that the estimation error of a factor of 2 can be made (1 + ε) by changing the parameters s and k. 12.5.4 Proof of Theorem 12.4 Now for the proof of the theorem. We’ll prove bound (12.5), the other bound (12.4) is proved identically. Some shorter notation may help. Let d := ∥xt ∥0 . Let these d distinct elements be T = {e1 , e2 , . . . , ed } ⊆ U. The random variable Lt is the sth smallest distinct hash value seen until time t. Our estimate is sM Lt , and we want this to be at least d/2. 2sM So we want Lt to be at most d . In other words, Pr[ estimate too low ] = Pr[ Dt < d/2] = Pr[ Lt > 2sM ]. d Recall T is the set of all d (= ∥xt ∥0 ) distinct elements in U that have appeared so far. How many of these elements in T hashed to values greater than 2sM/d? The event that Lt > 2sM/d (which is what we want to bound the probability of) is the same as saying that fewer than s of the elements in T hashed to values smaller than 2sM/d. For each i = 1, 2, . . . , d, define the indicator  1 if h(ei ) ≤ 2sM/d Xi = (12.6) 0 otherwise Then X = ∑id=1 Xi is the number of elements seen that hash to values below 2sM/d. By the discussion above, we get that   2sM Pr Lt < ≤ Pr[ X < s]. d We will now estimate the RHS. Next, what is the chance that Xi = 1? The hash h(ei ) takes on each of the M integer values with equal probability, so Pr[ Xi = 1] = By linearity of expectations, " # E[ X ] = E ⌊sM/2d⌋ s 1 ≥ − . 
M 2d M d d d i =1 i =1 i =1 ∑ Xi = ∑ E [Xi ] = ∑ Pr [Xi = 1] ≥ d ·  (12.7) 1 s − 2d M  =  s d − 2 M s Let’s imagine we set M large enough so that d/M is, say, at most 100 . Which means s s  49 s E[ X ] ≥ − = . 2 100 100  . 161 162 optional: computing the number of distinct elements So by Markov’s inequality,     100 49 Pr X > s = Pr X > E[ X ] ≤ . 49 100 Good? Well, not so good. We wanted a probability of failure to be smaller than 2/s, we got it to be slightly less than 1/2. Good try, but no cigar. 12.5.5 Enter Chebyshev Recall that Var(∑i Zi ) = ∑i Var( Zi ) for pairwise-independent random variables Zi . (Why?) Also, if Zi is a {0, 1} random variable, Var( Zi ) ≤ E[ Zi ]. (Why?) Applying these to our random variables X = ∑i Xi , we get Var( X ) = ∑ Var( Xi ) ≤ ∑ E[ Xi ] = E( X ). i i (The first inequality used that the Xi were pairwise independent, since the hash function was 2-universal.) Is this variance “low” enough? Plugging into Chebyshev’s inequality, we get: 50 100 µ X ] ≤ Pr[| X − µ X | > µX ] 49 49 σX2 1 3 ≤ ≤ ≤ . s (50/49)2 µ X (50/49)2 µ2X Pr[ X > s] = Pr[ X > Which is precisely what we want for the bound (12.4). The proof for the bound (12.5) is similar and left as an exercise. 12.5.6 Final Bookkeeping Excellent. We have a hashing-based data structure that answers “number of distinct elements seen so far” queries, such that each answer is within a multiplicative factor of 2 of the actual value ∥xt ∥0 , with small error probability. Let’s see how much space we actually used. Recall that for failure probability 1/2, we could set s = 12, say. And the space to store the s smallest hash values seen so far is O(s lg M ) bits. For the hash functions themselves, the standard constructions use O((lg M) + (lg U )) bits per hash function. So the total space used for the entire data structure is O(log M ) + (lg U ) bits. What is M? Recall we needed to M large enough so that d/M ≤ s/100. Since d ≤ |U |, the total number of elements in the universe, set M = Θ(U ). Now the total number of bits stored is O(log U ). And the probability of our estimate Dt being within a factor of 2 of the correct answer ∥xt ∥0 is at least 1/2. If we want the estimate to be at most ∥ x t ∥0 , then we would want to bound (1+ ε ) E[ X ] Pr[ X < (1+ε) ]. Similar calculations should give this to be at most ε23s , as long as M was large enough. In that case we would set s = O(1/ε2 ) to get some non-trivial guarantees. 13 Dimension Reduction: Singular Value Decompositions 13.1 Introduction In this lecture, we see a very popular und useful dimension reduction technique that is based on the singular value decomposition (SVD) of a matrix. In contrast to the dimension reduction obtained by the Johnson-Lindenstrauss Lemma, SVD based dimension reductions are not distance preserving. That means that we allow that the distances between pairs of points in our input change. Instead, we want to keep the shape of the point set by fitting it to a subspace according to a least squares error. This preserves most of the ‘energy’ of the points. More precisely, the problem that we want to solve is the following. We are given a matrix A ∈ Rn×d . The points are the rows of A, which we also name a1 , . . . , an ∈ Rd . Let the rank of A be r, so r ≤ min{n, d}. Given an integer k, we want to find a subspace V of dimension k that minimizes the sum of the squared distances of all points in A to V. 
Thus, for each point in A, we square the distance between the point and its projection to V and add these squared errors, and this term should be minimized by our choice of V. This task can be solved by computing the SVD of A, a decomposition of A into matrices with nice properties. We will see that we can write A as   a1   ..  A= .   an     0 v1 | |  σ1     .. ..  = UDV ⊺ ,  =  u1 · · · u n   . .    | | 0 σr vd where U ∈ Rn×r and V ∈ Rr×d are matrices with orthonormal columns and D ∈ Rr×r is a diagonal matrix. Notice that the columns 164 best fit subspaces of dimension k and the svd of V are the d-dimensional points v1 , . . . , vd which appear in the rows of the above matrix since it is V ⊺ . Figure 13.1: A visualization of AV = UD for r = 2. σ2 u2 v2 A σ1 u1 v1 Notice that the SVD can give us an intuition of how A acts as a mapping. We have that AV = UDV ⊺ V = UD because V consists of orthonormal columns. Imagine the r-dimensional sphere that is spanned by v1 , . . . , vr . The linear mapping defined by A maps this sphere to an ellipsoid with σ1 u1 , . . . , σr ur as the axes, like shown in Figure 13.1. The singular value decomposition was developed by different mathematicians around the beginning of the 19th century. The survey by Stewart 1 gives an historical overview on its origins. In the following, we see how to obtain the SVD and why it solves our best fit problem. The lecture is partly based on 2 . 13.2 1 2 Best fit subspaces of dimension k and the SVD a3 Figure 13.2: Finding the best fit subspace of dimension one. ai ∗ βi a1 a4 αi a2 dimension reduction: singular value decompositions 165 We start with the case that k = 1. Thus, we look for the line through the origin that minimizes the sum of the squared errors. See Figure 13.2. It depicts a one-dimensional subspace V in blue. We look at a point ai , its distance β i to V, and the length of its projection to V which is named αi in the picture. Notice that the length of ai is α2i + β2i . Thus, for our fixed ai , minimizing β i is equivalent to maximizing αi . If we represent V by a unit vector v that spans V (depicted in orange in the picture), then we can compute the projection of ai to V by the dot product ⟨ ai , v⟩. We have just argued that we can find the best fit subspace of dimension one by solving n max ∑ ⟨ ai , v ⟩2 = v ∈ R d , ∥ v ∥= 1 i = 1 n min ∑ dist( a i , span( v )) 2 v ∈ R d , ∥ v ∥= 1 i = 1 where we denote the distance between a point a i and the line spanned by v by dist ( a i , span ( v )) 2 . Now because Av = (⟨ a 1 , v ⟩ , ⟨ a 2 , v ⟩ , . . . , ⟨ a n v ⟩) ⊺ , we can rewrite ∑ id= 1 ⟨ a i , v ⟩ 2 as ∥ Av ∥ 2 . We define the first right singular vector to be a unit vector that maximizes ∥ Av ∥ . We thus know There may be many vectors that achieve the maximum: indeed, for every v that that the subspaces spanned by it is the best fit subspace of dimension achieves the maximum, −v also has one. the same maximum. Let us break ties arbitrarily. Now we want to generalize this concept to more than one dimension. It turns out that to do so, we can iteratively pick orthogonal unit vectors that span more and more dimensions. Among all unit vectors that are orthogonal to those chosen so far, we pick a vector that maximizes ∥ Av∥. This is formalized in the following definition. Definition 13.1. Let A ∈ Rn×d be a matrix. We define v1 = arg max ∥ Av∥, σ1 ( A) := ∥ Av1 ∥ ∥v∥=1 v2 = arg .. . 
vr = arg max ∥v∥=1,⟨v,v1 ⟩=0 ∥ Av∥, max σ2 ( A) := ∥ Av2 ∥ ∥v∥=1,⟨v,vi ⟩=0 ∀i =1,...,r −1 ∥ Av∥, σr ( A) := ∥ Avr ∥ and say that v1 , . . . , vr are right singular vectors of A and that σ1 := σ1 ( A), . . . , σr := σr ( A) are the singular values of A. Then we define the left singular vectors by setting ui : = Avi ∥ Avi ∥ for all i = 1, . . . , r. One worry is that this greedy process picked v2 after fixing v1 , and hence the span of v1 , v2 may not be the best two-dimensional subspace. The following claim says that Definition 13.1 indeed gives us the the best fit subspaces. 166 best fit subspaces of dimension k and the svd Claim 13.2. For any k, the subspace Vk , which is the span of v1 , . . . , vk , minimizes the sum of the squared distances of all points among all subspaces of dimension k. Proof. Let V2 be the subspace spanned by v1 and v2 . Let W be any other 2-dimensional subspace and let w1 , w2 be an orthonormal basis of W. Recall that the squared length of the projection of a point ai to V decomposes into the squared lengths of the projections to the lines spanned by v1 and v2 and the same is true for W, w1 and w2 . Since we chose v1 to maximize ∥ Av∥, we know that ∥ Aw1 ∥ ≤ ∥ Av1 ∥. Similarly, it holds that ∥ Aw2 ∥ ≤ ∥ Av2 ∥, which means that ∥ Aw1 ∥2 + ∥ Aw2 ∥2 ≤ ∥ Av1 ∥2 + ∥ Av2 ∥2 . We can extend this argument by induction to show that the space spanned by v1 , . . . , vk is the best fit subspace of dimension k. We review some properties of the singular values and vectors. Notice that as long as i < r, there is always a vector in the row space of A that is linearly independent to v1 , . . . , vi , which ensures that max ∥ Av∥ is nonzero. For i = r, the vectors v1 , . . . , vr span the row space of A. Thus, any vector that is orthogonal to them lies in the kernel of A, meaning that arg max∥v∥=1,⟨v,vi ⟩=0 ∀i=1,...,i−1 ∥ Av∥ = 0, so we end the process at this point. By construction, we know that the singular values are not increasing. We also see that the right singular vectors form a orthonormal basis of the row space of A. This is true for the left singular vectors and the column space as well (homework). The following fact summarizes the important properties. Fact 13.3. The sets {u1 , . . . , ur } and {v1 , . . . , vr } as defined in 13.1 are both orthonormal sets and span the column and row space, respectively. The singular values satisfy σ1 ≥ σ2 ≥ . . . ≥ σr > 0. So far, we defined the vi purely based on the goal to find the best fit subspace. Now we claim that in doing so, we have actually found the decomposition we wanted, i.e. that     0 v1 | |  σ1     .. ..  = A.  UDV ⊺ := u1 · · · un   . .    | | 0 σr vd (13.1) Claim 13.4. For any matrix A ∈ Rn×d and U, V, D as in (13.1), it holds that A = UDV ⊺ . Proof. We prove the claim by using the fact that two matrices A, B ∈ Rn×d are identical iff for all vectors v, the images are equal, i.e. Av = dimension reduction: singular value decompositions Bv. Notice that it is sufficient to check this for a basis, so it is true if the following subclaim holds (which we do not prove): Subclaim: Two matrices A, B ∈ R n × d are identical iff Av = Bv for all v in a basis of R d . We use the subclaim for B = U DV ⊺ . Notice that we can extend v 1 , . . . , v r to a basis of R d by adding orthonormal vectors from the kernel of A. These additional vectors are orthogonal to all vectors in the rows of V ⊺ , so V ⊺ v is the zero vector for all of them. 
Since they − → − → − → are in the kernel of A, it holds 0 = Av = Bv = U D 0 = 0 for the additional basis vectors. For i = 1, . . . , r, we notice that ( U DV ⊺ ) v i = U De i = u i σi = Av i · ∥ Av i ∥ = Av i ∥ Av i ∥ which completes the proof. 13.3 Useful facts, and rank-k-approximation Singular values are a generalization of the concept of eigenvalues for square matrices. Recall that a square symmetric matrix M can be written as M = ∑ri=1 λi vi v⊺i where λi and vi are eigenvalues and eigenvectors, respectively. This decomposition can be used to define the singular vectors in a different way. In fact, the right singular vectors of A correspond to the eigenvectors of A⊺ A (notice that this matrix is square and symmetric), and the left singular vectors correspond to the eigenvectors of AA⊺ . This fact can also be used to compute the SVD. Computing the SVD or eigenvalues and -vectors in a numerically stable way is the topic of a large research area, and there are different ways to obtain algorithms that converge under the assumption of a finite precision. Fact 13.5. The SVD can be found (up to arbritrary precision) in time O(min(nd2 , n2 d)) or even in time O(min(ndω −1 , dnω −1 )) where ω is the matrix multiplication constant. (Here the big-O term hides the dependence on the precision.) The SVD is unique in the sense that for any i ∈ [r ], the subspace spanned by unit vectors v that maximize ∥ Av∥ is unique. Aside from the different choices of an orthonormal basis of these subspaces, the singular vectors are uniquely defined. For example, if all singular values are distinct, then the subspace of unit vectors that maximize ∥ Av∥ is one-dimensional and the singular vector is unique (up to sign changes, i.e., up to multiplication by −1). Sometimes, it is helpful to observe that the matrix product UDV ⊺ can also be written as the sum of outer products of the singular vectors. This formulation has the advantage that we can write the pro- 167 168 applications jection of A to the best fit subspaces of dimension k as the sum of the first k terms. Remark 13.6. The SVD can equivalently be written as r A = ∑ σi ui v⊺i i =1 where ui v⊺i is the outer product. For k ≤ r, the projection of A to Vk is k Ak := ∑ σi ui v⊺i . i =1 Recall that the Frobenius norm of a matrix A is the square qroot of the sum of its squared entries, i.e. it is defined by ∥ A∥ F := ∑i,j a2ij . This means that ∥ A − B∥2F is equal to the sum of the squared distances between each row in A and the corresponding row in B for matrices of equal dimensions. Imagine that B is a rank k matrix. Then its points lie within a k-dimensional subspace, and ∥ A − B∥2F cannot be smaller than the distance between A and this subspace. Since Ak is the projection to the best fit subspace of dimension k, Ak minimizes ∥ A − B∥ F (notice that Ak has rank at most k). It is therefore also called the best rank k-approximation of A. Theorem 13.7. Let A ∈ Rn×d be a matrix of rank r and let k ≤ r be given. It holds that ∥ A − Ak ∥ F ≤ ∥ A − B∥ F for any matrix B ∈ Rn×d of rank at most k. The theorem is also true if the Frobenius norm is replaced by the spectral norm. For a matrix A, the spectral norm is equal to the maximum singular value, i.e. ∥ A∥2 := maxv∈Rd ,∥v∥=1 ∥ Av∥ = σ1 . 13.4 Applications Topic modeling. Replacing A by Ak is a great compression idea. For example, for topic modeling, we imagine A to be a matrix that stores the number of times that any of d words appears in any of n documents. Then we assume that the rank rof A corresponds to r topics. 
Recall that   A=  a1 .. . an   |    =  u1  | ···  |  σ1  un    | 0 0 .. . σr     Assume that the entries in U and V are positive. Since the column vectors are unit vectors, they define a convex combination of the v1 .. . vd In fact, this theorem holds for any unitarily invariant matrix norm; a matrix norm ∥ · ∥ is unitarily invariant if ∥ A∥ = ∥U AV ∥ for any unitary matrices U, V. Other examples of unitarily invariant norms are the Schatten norms, and the Ky Fan norms. J. von Neumann characterized all unitarily invariant matrix norms as those obtained by taking a “symmetric” (vector) norm of the vector of singular values — here symmetric means ∥ x ∥ = ∥y∥ when y is obtained by flipping the signs of some entries of x and then permuting them around. See Theorem 7.4.24 in the text by Horn  and Johnson.  .  dimension reduction: singular value decompositions r topics. We can thus imagine U to contain information on how much each of the documents consists of each topic. Then, D assigns a weight to each of the topics. Finally, we V ⊺ gives information on how much each topic consists of each of the words. The combination of the three matrices generates the actual documents. By using the SVD, we can represent a set of documents based on fewer topics, thus obtaining an easier model of how they are generated. Notice that this interpretation of the SVD needs that the entries are non negative, and that obtaining such a decomposition is an NP-hard problem. 13.4.1 Pseudoinverse and least squares regression a4 x b4 b2 a3 x a2 x a1 b3 a2 a3 a4 a1 x b1 For any diagonal matrix M = diag(d1 , . . . , dℓ ), define M+ := diag(1/d1 , . . . , 1/dℓ ). We notice that for the matrices from the SVD, it holds that VD + U ⊺ UDV = diag(1, . . . , 1, 0, . . . , 0). | {z } r times If A is an n × n-matrix of rank n, then r = n and the result of this product is I. Thus, A+ := VD + U ⊺ is then the inverse of A. In general, A+ is the (Moore Penrose) pseudoinverse of A. It satisfies that A( A+ b) = b ∀b in the image of A The pseudoinverse helps to find the solution to another popular minimization problem, least squares regression. Given an overconstrained system of equations Ax = b, least squares regression asks for a point x that minimizes the squared error ∥ Ax − b∥22 . I.e., we want x ∗ := arg min ∥ Ax − b∥22 . 169 170 symmetric matrices Notice that if there is an x ′ with Ax ′ = b, then it also minimizes ∥ Ax ′ − b∥22 , and if A had full rank this x ′ would be obtained by computing A−1 b. If A does not have full rank, an optimal solution is obtained by using the pseudoinverse: x ∗ = A+ b (This is often used as another definition for the pseudoinverse.) Here’s a proof: for any choice of x ∈ Rd , Ax is some point in the column span of A. So x ∗ , the minimizer, must be the projection of b onto colspan( A). One orthonormal basis for colspan( A) is the columns of U. Hence the projection Πb of b onto colspan( A) is given by UU ⊺ b. (Why? Extend U to a basis for all of Rd , write b in this basis, and consider what it’s projection must be.) Hence we want Ax ∗ = UU ⊺ b. For this, it suffices to set x ∗ = VD + U ⊺ b = A+ b. 13.5 Symmetric Matrices For a (square) symmetric matrix A, the (normalized) eigenvectors vi and the eigenvalues λi satisfy the following properties: the vi s form an orthonormal basis, and A = VΛV ⊺ , where the columns of V are the vi vectors, and Λ is a diagonal matrix with λi s on the diagonal. It is no longer the case that the eigenvalues are all non-negative. 
(In fact, we can match up the eigenvalues and singular values such that they differ only in sign.) Given a function f : R → R, we can extend this to a function on symmetric matrices as follows: f ( A) = V diag( f (λ1 ), . . . , f (λn )) V ⊺ . For instance, you can check that Ak or e A defined this way indeed correspond to what you think they might mean. (The other way to k define e A would be ∑k≥0 Ak! .) Part III “Modern” Algorithms 14 Online Learning: Experts and Bandits In this set of chapters, we consider a basic problem in online algorithms and online learning: how to dynamically choose from among a set of “experts” in a way that compares favorably to any fixed expert. Both this abstract problem, and the techniques behind the solution, are important parts of the algorithm designer’s toolkit. 14.1 The Mistake-Bound Model Suppose there are N experts who make predictions about a certain event every day—for example, whether it rains today or not, or whether the stock market goes up or not. Let U be the set of possible choices. The process in the experts setting goes as follows: 1. At the beginning of each time step t, each expert makes a prediction. Let E t ∈ U N be the vector of predictions. 2. The algorithm makes a prediction at , and simultaneously, the actual outcome o t is revealed. The goal is to minimize the number of mistakes, i.e., the number of times our prediction at differs from the outcome o t . Fact 14.1. There exists an algorithm that makes at most ⌈log2 N ⌉ mistakes, if there is a perfect expert. Proof. The algorithm just considers all the experts who have made no mistakes so far, and predicts what the majority of them predict. Note that every time we make a mistake, the number of experts who have not been wrong yet reduces by a factor of 2 or more. (And when we do not make a mistake, this number does not increase.) Since there is at least one perfect expert, we can make at most ⌈log2 N ⌉ mistakes. Show that any algorithm must make at least ⌈log2 N ⌉ mistakes in the worst case. The term expert just refers to a person who has an opinion, and does not reflect whether they are good or bad at the prediction task at hand. Note the order of events: the experts predictions come first, then the algorithm chooses an expert at the same time as the reality being revealed. Suppose we have 8 experts, and E t = (0, 1, 0, 0, 0, 1, 1, 0). If we follow the third expert and predict at = 0, but the actual outcome is o t = 1, we make a mistake; if we would have picked the second expert, we would have been correct. 174 the weighted majority algorithm Fact 14.2. There is an algorithm that, on any sequence, makes at most M ≤ m∗ (⌈log2 N ⌉ + 1) + ⌈log2 N ⌉ mistakes, where m∗ is the number of mistakes made by the best of these experts on this sequence. Proof. Think of time as being divided into “epochs”. In each epoch, we proceed as in the perfect expert scenario as in Fact 14.1: we keep track of all experts who have not yet made a mistake in that epoch, and predict the majority opinion. The set of experts halves (at least) with every mistake the algorithm makes. When the set becomes empty, we end the epoch, and start a new epoch with all the N experts. Note that in each epoch, every expert makes at least one mistake. Therefore the number of completed epochs is at most m∗ . Moreover, we make at most ⌈log2 N ⌉ + 1 mistakes in each completed epoch, and at most ⌈log2 N ⌉ mistakes the last epoch, giving the result. However, this algorithm is very harsh and very myopic. 
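For concreteness, here is a minimal sketch of this epoch-based scheme for binary predictions (the input format is made up for illustration):

```python
def epoch_majority(expert_preds, outcomes):
    """Mistake-bound scheme of Fact 14.2: follow the majority of the experts that
    are still unbeaten in the current epoch; restart the epoch when none remain.

    expert_preds: list of T vectors, each holding N binary predictions (0/1).
    outcomes:     list of T binary outcomes.
    Returns the number of mistakes made."""
    N = len(expert_preds[0])
    alive = set(range(N))                 # experts with no mistakes this epoch
    mistakes = 0
    for preds, outcome in zip(expert_preds, outcomes):
        votes = sum(preds[i] for i in alive)
        guess = 1 if 2 * votes >= len(alive) else 0   # majority vote, ties -> 1
        if guess != outcome:
            mistakes += 1
        alive = {i for i in alive if preds[i] == outcome}
        if not alive:                      # epoch over: wipe the slate clean
            alive = set(range(N))
    return mistakes
```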
Firstly, it penalizes even a single mistake by immediately discarding the expert. But then, at the end of an epoch, it wipes the slate clean and forgets the past performance of the experts. Maybe we should be gentler, but have a better memory? 14.2 The Weighted Majority Algorithm This algorithm, due to Littlestone and Warmuth, is remarkable for (t) its simplicity. We assign a weight wi to each expert i ∈ [ N ]. Let wi denote the weight of expert i at the beginning of round t. Initially, all (1) weights are 1, i.e., wi = 1. 1. In round t, predict according to the weighted majority of experts. In other words, choose the outcome that maximizes the sum of weights of experts that predicted it. I.e., at ← arg max u ∈U ∑ i:expert i predicts u 2. Upon seeing the outcome, set  1 ( t +1) (t) wi = wi · 1 2 (t) wi . We break ties arbitrarily, say, by picking the first of the options that achieve the maximum. if i was correct if i was incorrect . Theorem 14.3. For any sequence of predictions, the number of mistakes made by the weighted majority algorithm (WM) is at most 2.41(mi + log2 N ), where mi is the number of mistakes made by expert i. online learning: experts and bandits 175 Proof. The proof uses a potential-function argument. Let Φt := (t) ∑ wi . i ∈[ N ] Note that 1. Φ1 = N, since the weights start off at 1, 2. Φt+1 ≤ Φt for all t, and 3. if Algorithm WM makes a mistake in round t, the sum of weights of the wrong experts is higher than the sum of the weights of the correct experts, so Φ t +1 = ( t +1) ∑ wi i wrong = ∑ i correct ( t +1) wi 1 (t) (t) ∑ wi + ∑ wi 2 i wrong i correct = Φt − ≤ + 1 (t) ∑ wi 2 i wrong 3 t Φ 4 If after T rounds, expert i has made mi mistakes and WM has made M mistakes, then  M   mi  M 3 1 ( T +1) T +1 1 3 =N . = wi ≤Φ ≤Φ 2 4 4 Now taking logs, and rearranging, M≤ mi + log2 N log2 43 ≤ 2.41(mi + log2 N ). In other words, if the best of the N experts on this sequence was wrong m∗ times, we would be wrong at most 2.41(m∗ + log2 n) times. Note that we are much better on the multiplier in front of the m∗ term than Fact 14.2 was, at the expense of being slightly worse on the multiplier in front of the log2 N term. 14.2.1 A Gentler Penalization Instead of penalizing each wrong expert by a factor of 1/2, we could penalize the experts by a factor of (1 − ε). This allows us to trade off the multipliers on the m∗ term and the logarithmic term. Theorem 14.4. For ε ∈ (0, 1/2), penalizing each incorrect expert by a factor of (1 − ε) guarantees that the number of mistakes made by MW is at most   log N 2(1 + ε ) m i + O . ε We cannot hope to compare ourselves to the best way of dynamically choosing experts to follow. This result says that at least we do not much worse to the best static policy of choosing an expert—in fact, choosing the best expert in hindsight—and sticking with them. We’ll improve our performance soon, but all our results will still compare to the best static policy for now. 176 randomized weighted majority Proof. Using an analysis identical to Theorem 14.3, we get that Φt+1 ≤ (1 − 2ε )Φt and therefore    ε M ε M (1 − ε ) m i ≤ Φ T +1 ≤ Φ 1 1 − = N 1− ≤ N exp − εM/2 . 2 2 Now taking logs, and simplifying, −mi log(1 − ε) + ln N ε/2   m ( ε + ε2 ) log N ≤2 i +O , ε ε M≤ 2 3 because − ln(1 − ε) = ε + ε2 + ε3 + · · · ≤ ε + ε2 for ε ∈ [0, 1]. This shows that we can make our mistakes bound as close to 2m∗ as we want, but this approach seems to have this inherent loss of a factor of 2. In fact, no deterministic strategy can do better than a factor of 2, as we show next. 
Proposition 14.5. No deterministic algorithm A can do better than a factor of 2, compared to the best expert. Proof. Note that if the algorithm is deterministic, its predictions are completely determined by the sequence seen thus far (and hence can also be computed by the adversary). Consider a scenario with two experts A,B, the first always predicts 1 and the second always predicts 0. Since A is deterministic, an adversary can fix the outcomes such that A’s predictions are always wrong. Hence at least one of A and B will have an error rate of ≤ 1/2, while A’s error rate will be 1. 14.3 Randomized Weighted Majority Consider the proof of Proposition 14.5, but applied to the WM algorithm: the algorithm alternates between predicting 0 and 1, whereas the actual outcome is the opposite. The weights of the two experts remain approximately the same, but because we are deterministic, we choose the wrong one. What if we interpret the weights being equal as a signal that we should choose one of the two options with equal probability? This is the idea behind the Randomized Weighted Majority algorithm (RMW) of Littlestone and Warmuth: the weights evolve in exactly the same way as in Theorem 14.4, but now the prediction at each time is drawn randomly proportional to the current weights of the experts. I.e., instead of Step 1 in that algorithm, we do the following: (t) Pr[action u is picked] = ∑i:expert i predicts u wi (t) ∑ i wi . online learning: experts and bandits 177 Note that the update of the weights proceeds exactly the same as previously. Theorem 14.6. Fix ε ≤ 1/2. For any fixed sequence of predictions, the expected number of mistakes made by randomized weighted majority (RWM) is at most   log N E[ M ] ≤ (1 + ε ) m i + O ε Proof. The proof is an analysis of the weight evolution that is more (t) careful than in Theorem 14.4. Again, the potential is Φt = ∑i wi . Define (t) ∑i incorrect wi Ft := (t) ∑ i wi to be the fraction of weight on incorrect experts at time t. Note that E[ M ] = ∑ Ft . t∈[ T ] Indeed, we make a mistake at step t precisely with the probability Ft , since the adversary does not see our random choice when deciding on the actual outcome at . By our re-weighting rules, Φt+1 = Φt ((1 − Ft ) + Ft (1 − ε)) = Φt (1 − εFt ) Bounding the size of the potential after T steps, T (1 − ε)mi ≤ Φ T +1 = Φ1 ∏ (1 − εFt ) ≤ Ne−ε ∑ Ft = Ne−εE[ M] t =1 Now taking logs, we get mi ln(1 − ε) ≤ ln N − εE[ M], using the approximation − log(1 − ε) ≤ ε + ε2 gives us E[ M ] ≤ m i (1 + ε ) + 14.3.1 ln N . ε Classifying Adversaries for Randomized Algorithms In the above analysis, it was important that the actual random outcome was independent of the prediction of the algorithm. Let us formalize the power of the adversary: Oblivious Adversary. Constructs entire sequence E 1 , o1 , E 2 , o2 , · · · upfront. Adaptive Adversary. Sees the previous choices of the algorithm, but must choose o t independently of our actual prediction at in round t. Hence, o t can be a function of E 1 , o1 , . . . , E t−1 , o t−1 , E t , as well as of a1 , . . . , at−1 , but not of at . log N  The quantity εmi + O ε gap between the algorithm’s performance and that of the best expert is called the regret with respect to expert i. 178 the hedge algorithm, and a change in perspective The adversaries are equivalent on deterministic algorithms, because such an algorithm always outputs the same prediction and the oblivious adversary could have calculated at in advance when creating E t+1 . They may be different for randomized algorithms. 
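Here is a minimal sketch of RWM for binary predictions; the weight updates are exactly those of Theorem 14.4, and the prediction in each round is sampled in proportion to the weights. Note that the weight updates depend only on the experts' predictions and the outcome, not on the algorithm's own random choice, which is why RWM behaves the same against both kinds of adversaries (as discussed next).

```python
import random

def rwm(expert_preds, outcomes, eps=0.1, seed=0):
    """Randomized Weighted Majority (sketch): sample an expert with probability
    proportional to its weight, then multiply wrong experts' weights by (1 - eps).

    expert_preds: list of T vectors of N binary predictions; outcomes: T outcomes.
    Returns the number of mistakes made on this run (a random variable whose
    expectation is bounded by Theorem 14.6)."""
    rng = random.Random(seed)
    N = len(expert_preds[0])
    w = [1.0] * N
    mistakes = 0
    for preds, outcome in zip(expert_preds, outcomes):
        i = rng.choices(range(N), weights=w)[0]       # follow expert i this round
        if preds[i] != outcome:
            mistakes += 1
        for j in range(N):                            # penalize the wrong experts
            if preds[j] != outcome:
                w[j] *= (1 - eps)
    return mistakes
```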
However, it turns out that RWM works in both models, because our predictions do not affect the weight updates and hence the future. 14.4 The Hedge Algorithm, and a Change in Perspective Let’s broaden the setting slightly, and consider the following dotproduct game. In each round, 1. The algorithm produces a vector of probabilities p t Define the probability simplex as  ∆ N := x ∈ [0, 1] N | ∑ xi = 1 . i = ( p1t , p2t , · · · , ptN ) ∈ ∆ N . 2. The adversary produces ℓt = (ℓ1t , ℓ2t , · · · , ℓtN ) ∈ [−1, 1] N . 3. The loss of the algorithm in this round is ℓt , pt . We can move between this “fractional” model where we play a point in the probability simplex ∆ N , and the randomized model of the previous section (with an adaptive adversary), where we must play a single expert (which is a vertex of the simplex ∆ N . Indeed, setting ℓt to be a vector of 0s and 1s can capture whether an expert is correct or not, and we can set pit = Pr[algorithm plays expert i at time t] to deduce that Pr[mistake at time t] = ℓt , pt . 14.4.1 The Hedge Algorithm The Hedge algorithm starts with weights wi1 = 1 for all experts i. In each round t, it defines pt ∈ ∆ N using: pit ← wit , ∑ j wtj (14.1) and updates weights as follows: wit+1 ← wit · exp(−εℓit ). (14.2) Theorem 14.7. Consider a fixed ε ≤ 1/2. For any sequences of loss vectors in [−1, 1] N and for all indices i ∈ [ N ], the Hedge algorithm guarantees: T T i =1 t =1 ∑ ⟨ pt , ℓt ⟩ ≤ ∑ ℓit + εT + ln N ε This equivalence between randomized and fractional algorithms is a common theme in algorithm design, especially in approximation and online algorithms. online learning: experts and bandits Proof. As in previous proofs, let Φt = ∑ j wtj , so that Φ1 = N, and t Φt+1 = ∑ wit+1 = ∑ wit e−εℓi i ≤ ∑ wit 1 − εℓit + ε2 ℓit )2 i  ≤ ∑ wit (1 + ε2 ) − ε ∑ wit ℓit = (1 + ε2 )Φt − εΦt ⟨ pt , ℓt ⟩  = Φ t 1 + ε2 − ε ⟨ p t , ℓ t ⟩ t 2 t ≤ Φ t e ε − ε ⟨ p ,ℓ ⟩ (using e x ≤ 1 + x + x2 ∀ x ∈ [−1, 1]) (because |ℓit | ≤ 1) (because wit = pit · Φt ) (using 1 + x ≤ e x ) Again, comparing to the final weight of the ith coordinate, t ( t +1) e − ε ∑ ℓi = wi 2 t t ≤ Φ T +1 ≤ Φ 1 e ε T − ε ∑ ⟨ p ,ℓi ⟩ ; now using Φ1 = N and taking logs proves the claim. q √ Moreover, choosing ε = lnTN gives εT + lnεN = 2 T ln N, and the regret term is concave and sublinear in time T. This suggests that the further we run the algorithm, the quicker the average regret goes to zero, which suggests the algorithm is in some sense “learning". One final observation: instead of just bounding the (ℓit )2 terms by 1, we could just keep them around: we would get the slightly more nuanced bound: Theorem 14.8. Consider a fixed ε ≤ 1/2. For any sequences of loss vectors in [−1, 1] N , the Hedge algorithm guarantees that for any index i ∈ [ N ]: D E ln N T t t t t 2 t ⟨ p , ℓ ⟩ ≤ ℓ + ε (ℓ ) , p + ∑ ∑ i ε t =1 i =1 T In §14.5.3 we will see a situation where the bound from Theorem 14.8 is more useful than the one from Theorem 14.7. 14.4.2 Two Useful Corollaries The following corollary will be useful in many contexts: it just flips Theorem 14.7 on its head, and shows that the average regret is small after sufficiently many steps. Corollary 14.9. For T ≥ 4 log N , the average loss of the Hedge algorithm is ε2 1 1 ⟨ pt , ℓt ⟩ ≤ min ∑ ℓit + ε ∑ T t T t i = min ∗ p ∈∆ N 1 ℓt , p∗ + ε. 
T∑ t 179 180 optional: the bandit setting The viewpoint of the last expression is useful, since it indicates that the dynamic strategy given by Hedge for the dot-product game is comparable (in the sense of having tiny regret) against any fixed strategy p∗ in the probability simplex. Finally, we state a further corollary that is useful in future lectures. It can be proved by running Corollary 14.9 with losses ℓt = − gt /ρ. Corollary 14.10 (Average Gain). Let ρ ≥ 1 and ε ∈ (0, 1/2). For any 4ρ2 ln N sequence of gain vectors g1 , . . . , g T ∈ [−ρ, ρ] N with T ≥ ε2 , the gains version of the Hedge algorithm produces probability vectors pt ∈ ∆ N such that 1 T 1 T t t g , p ≥ max ∑ gt , ei − ε. T t∑ T i ∈[ N ] =1 t =1 In passing we mention that if the gains or losses lie in the range N . [−γ, ρ], then we can get an asymmetric guarantee of T ≥ 4γρεln 2 14.5 Optional: The Bandit Setting The model of experts or the dot-product problem is often called the full-information model, because the algorithm gets to see the entire loss vector ℓt at each step. (Recall that we view the entries of the probability vector pt played by the algorithm as the probability of playing each of the actions, and hence ℓt , pt is just the expected loss incurred by the algorithm. Now we consider a different model, where the algorithm only gets to see the loss of the action it plays. Specifically, in each round, 1. The algorithm again produces a vector of probabilities pt = ( p1t , p2t , · · · , ptN ) ∈ ∆ N . It then chooses an action at ∈ [ N ] with these marginal probabilities. 2. In parallel, the adversary produces ℓt = (ℓ1t , ℓ2t , · · · , ℓtN ) ∈ [−1, 1] N . However, now the algorithm only gets to see the loss ℓtat corresponding to the action chosen by the algorithm, and not the entire loss vector. This limited-information setting is called the bandit setting. 14.5.1 The Exp3 Algorithm Surprisingly, we can obtain algorithms for the bandit setting from algorithms for the experts setting, by simply “hallucinating” the The name comes from the analysis of slot machines, which are affectionately known as “one-armed bandits”. online learning: experts and bandits cost vector, using an idea called importance sampling. This causes the parameters to degrade, however. Indeed, consider the following algorithm: we run an instance A of the RWM algorithm, which is in the full information model. So at each timestep, 1. A produces a probability vector pt ∈ ∆ N . 2. We choose an expert I t ∈ [ N ], where Pr[ I t = i ] = qit := γ · 1 + (1 − γ) · pit . N I.e., with probability γ we pick a uniformly random expert, else we follow the suggestion given by pt . 3. We get back the loss value ℓtI t for this chosen expert. 4. We construct an “estimated loss” ℓ̂t ∈ [0, 1] N by setting  t   ℓ j if j = I t t t . ℓ̃ j = q j  0 if j ̸= I t We now feed ℓ̃t to the RWM instance A, and go back to Step 1. We now show this algorithm achieves low regret. The first observation is that the estimated loss vector is an unbiased estimate of the actual loss, just because of the way we reweighted the answer by the inverse of the probability of picking it. Indeed, E[ℓ̃it ] = ℓit t · q + 0 · (1 − qit ) = ℓit . qit i (14.3) Since each true loss value lies in [−1, 1], and each probability value is at least γ/N, the absolute value of each entry in the ℓ̃ vectors is at most N/γ. Now, since we run RWM on these estimated loss vectors belonging to [0, N/γ] N , we know that   N log N t t t (14.4) ∑ p , ℓ̃ ≤ ∑ ℓ̃i + γ εT + ε . 
t t Taking expectations over both sides, and using (14.3),   N log N t t t ∑ p , ℓ ≤ ∑ ℓi + γ εT + ε . t t (14.5) However, the LHS is not our real loss, since we chose I t according to qt and not pt . This means our expected total loss is really γ ∑ qt , ℓt = (1 − γ) ∑ pt , ℓt + N ∑ 1, ℓt t t t   N log N ≤ ∑ ℓit + εT + + γT. γ ε t 181 182 optional: the bandit setting Now choosing ε = q √ log N N T and γ =  log N T 1/4 gives us a regret of ≈ N 1/2 T 3/4 . The interesting fact here is that the regret is again sub-linear in T, the number of timesteps: this means that as T → ∞, the per-step regret tends to zero. The dependence on N, the number of experts/options, is now polynomial, instead of being logarithmic as in the full-information √ case. This is necessary: there is a lower bound of Ω( NT ) in the bandit setting. And indeed, the Exp3 algorithm itself achieves a nearp optimal regret bound of O( NT log N ); we can show this by using a finer analysis of Hedge that makes more careful approximations. We defer these improvements to §14.5.3, and instead give an application of this bandit setting to a problem in item pricing. 14.5.2 Item Pricing via Bandits To be added in. 14.5.3 Getting a Tight “Square-Root” Regret We would need non-negative losses. And then use that the bound exp(− x ) ≤ 1 − x + x2 for all x ≥ −1. Etc. Suppose we use the bound from Theorem 14.8 to get   log N t t t (14.6) ∑ p , ℓ̃ ≤ ∑ ℓ̃i + ε ∑ ε . + t t 15 Solving Linear Programs using Experts We can now use the low-regret algorithms for the experts problem to show how to approximately solve linear programs (LPs). As a warmup, we use it to solve two-player zero-sum games, which are a special case of LPs. 15.1 (Two-Player) Zero-Sum Games There are two players in such a game, traditionally called the “row player" and the “column player". Each of them has some set of actions: the row player with m actions (associated with the set [m]), and the column player with the n actions in [n]. Finally, we have a payoff matrix M ∈ Rm×n . In a play of the game, the row player chooses a row i ∈ [m], and simultaneously, the column player chooses a column j ∈ [n]. If this happens, the row player gets Mi,j , and the column player loses Mi,j . The winnings of the two players sum to zero, and so we imagine that the payoff is from the row player to the column player. 15.1.1 In fact, zero-sum games are equivalent to linear programming, see this work of Ilan Adler. Is there an earlier reference? Strategies, and Best-Response Each player is allowed to have a randomized strategy. Given strategies p ∈ ∆m for the row player, and q ∈ ∆n for the column player, the expected payoff (to the row player) is E[payoff to row] = p⊺ Mq = ∑ pi q j Mi,j . i,j The row player wants to maximize this value, while the column player wants to minimize it. Suppose the row player fixes a strategy p ∈ ∆m . Knowing p, the column player can choose an action to minimize the expected payoff: C ( p) := min p⊺ Mq = min p⊺ Me j . q∈∆n j∈[n] Henceforth, when we talk about payoffs, these will always refer to payoffs to the row player from the column player. This payoff may be negative, which would capture situations where the column player does better. 184 (two-player) zero-sum games The equality holds because the expected payoff is linear, and hence the column player’s best strategy is to choose a column that minimizes the expected payoff. The column player is said to be playing their best response. 
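As a tiny worked example, the following snippet evaluates C(p) = min_j p⊺Me_j and finds a best-response column for a concrete 2 × 2 payoff matrix (both the matrix and the row strategy p are made up for illustration).

```python
import numpy as np

M = np.array([[ 1.0, -1.0],
              [-1.0,  1.0]])       # a small example payoff matrix (to the row player)
p = np.array([0.7, 0.3])           # some fixed row strategy
payoffs = p @ M                    # expected payoff against each pure column j
j_star = int(np.argmin(payoffs))   # the column player's best response
print(j_star, payoffs[j_star])     # column 1, C(p) = -0.4: the uneven p gets punished
```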
Analogously, if the column player fixes a strategy q ∈ ∆n , the row player can maximize the expected payoff by playing their own best response: R(q) := max p⊺ Mq = max ei⊺ Mq. p∈∆m i ∈[m] Now, the row player would love to play the strategy p such that even if the column player plays best-response, the payoff is as large as possible: i.e., it wants to achieve max C ( p). p∈∆m Similarly, the column player wants to choose q to minimize the payoff against a best-response row player, i.e., to achieve min R(q). q∈∆n Lemma 15.1. For any p ∈ ∆m , q ∈ ∆n , we have C ( p) ≤ R(q) (15.1) Proof. Intuitively, since the column player commits to a strategy q, it hence gives more power to the row player. Formally, the row player could always play strategy p in response to q, and hence could always get value C ( p). But R(q) is the best response, which could be even higher. Interestingly, there always exist strategies p ∈ ∆m , q ∈ ∆n which achieve equality. This is formalized by the following theorem: Theorem 15.2 (Von Neumann’s Minimax Theorem). For any finite zero-sum game M ∈ Rm×n , max C ( p) = min R(q). p∈∆m q∈∆n This common value V is called the value of the game M. Proof. We assume for the sake of contradiction that ∃ M ∈ [−1, 1]m×n such that max p∈∆m C ( p) ≤ minq∈∆n R(q) − δ for some δ > 0. (The assumption that Mij ∈ [−1, 1] follows by scaling.) Now we use the fact that the average regret of the Hedge algorithm tends to zero to construct strategies pb and qb that have R(qb) − C ( pb) < δ, thereby giving us a contradiction. We consider an instance of the experts problem, with m experts, one for each row of M. At each time step t, the row player produces solving linear programs using experts 185   pt ∈ ∆m . Initially p1 = m1 , . . . , m1 , which represents that the row player chooses each row with equal probability, when they have no information to work with. At each time t, the column player plays the best-response to pt , i.e., jt := arg max ( pt )⊺ Me j . j∈[n] This defines a gain vector for the row player: gt := Me jt , which is the jth column of M. The row player uses this to update the weights and get pt+1 , etc. Define pb := 1 T t p T t∑ =1 and qb := 1 T e jt T t∑ =1 to be the average long-term plays of the row player, and of the best responses of the column player to those plays. We know that C ( pb) ≤ R(qb) m by (15.1). But by Corollary 14.10, after T ≥ 4 ln steps, ε2 1 1 ⟨ pt , gt ⟩ ≥ max ∑ ei , gt − ε ∑ T t T t i D 1 E = max ei , ∑ gt − ε T t i D 1 E = max ei , M e jt −ε ∑ T t i (by Hedge) (by definition of gt ) = max⟨ei , Mb q⟩ − ε i = R(qb) − ε. Since pt is the row player’s strategy, and C is concave (i.e., the payoff on the average strategy pb is no more than the average of the payoffs: 1  1 1 ⟨ pt , gt ⟩ = ∑ C ( pt ) ≤ C pt = C ( pb). ∑ ∑ T T T Putting it all together: R(qb) − ε ≤ C ( pb) ≤ R(qb). Now for any δ > 0 we can choose ε < δ to get the contradiction. Observe that the proof gives us an explicit algorithm to find strategies pb, qb that have a small gap. The minimax theorem is also implied by strong duality of linear programs: indeed, we can write minq∈∆n R(q) as a linear program, take its dual and observe that it computes min p∈∆m C ( p). The natural question is: we can solve linear programs using low-regret algorithms. We now show how to do this. We should get a clean proof of strong duality this way? To see this, recall that C ( p) := min p⊺ Mq. q Let q∗ be the optimal value of q that minimizes C ( p). 
Then for any a, b ∈ ∆m , we have that C ( a + b) = ( a + b)⊺ Mq∗ = a⊺ Mq∗ + b⊺ Mq∗ ≥ min a⊺ Mq + min b⊺ Mq = C ( a) + C (b) q q 186 solving lps approximately 15.2 Solving LPs Approximately Consider an LP with constraint matrix A ∈ Rm×n : max ⟨c, x ⟩ (15.2) Ax ≤ b x≥0 Suppose x ∗ is an optimal solution, with OPT := ⟨c, x ∗ ⟩. Let K ⊆ Rn be the polyhedron defined by the “easy” constraints, i.e., K := { x ∈ Rn | ⟨c, x ⟩ = OPT, x ≥ 0}, where OPT is found by binary search over possible objective values. Binary search over the reals is typically not a good idea, since it may never reach the answer. (E.g., searching for 1/3 by binary search over [0, 1].) However, we defer this issue for now, and imagine we know the value of OPT. We now use low-regret algorithms to find xb ∈ K such that ⟨ ai , x ⟩ ≤ bi + ε for all i ∈ [m]. 15.2.1 The Oracle The one assumption we make is that we can solve the feasibility problem obtained by intersecting the “easy” constraints K with a single linear constraint. Suppose α ∈ Rn , β ∈ R, then we want to solve the problem: Oracle: find a point x ∈ K ∩ { x | ⟨α, x ⟩ ≤ β}. (15.3) Proposition 15.3. There is an O(n)-time algorithm to solve (15.3), when K = { x ≥ 0 | ⟨c, x ⟩ = OPT }. Proof. We give the proof only for the case where ci > 0 for all i; the general case is left as an exercise. Let j∗ := arg min j α j/c j , and define x = (OPT/c j∗ )e j∗ . Say “infeasible” if x does not satisfy ⟨α, x ⟩ ≤ β, else return x. Of course, this problem can be solved in time linear in the number of variables (as Proposition 15.3 above shows), but the situation can be more interesting when the number of variables is large. For instance, when we solve flow LPs, the number of variables will be exponential in the size of the graph, yet the oracle will be implementable in time poly(n). 15.2.2 The Algorithm The key idea to solving general LPs is then similar to that for zerosum games. We have m experts, one corresponding to each constraint. In each round, we combine the multiple constraints using a The fix to the “binary search over reals” problem is this: the optimal value of a linear program in n dimensions where all numbers integers using at most b bits is a rational p/q, whereboth p, q use at most poly(nb) bits. So once we the granularity of the search is fine enough, there is a unique rational close the query point, and we can snap to it. See, e.g., the problem on finding negative cycles in the homeworks. solving linear programs using experts weighted sum, we call the above oracle on this single-constraint LP to get a solution, we construct a gain vector from this solution and feed this to Hedge, which then updates the weights that we use for the next round. The gain of an expert in a round is based based on how badly the constraint was violated by the current solution. The intuition is simple: greater violation means more gain, and hence more weight in the next iteration, which forces us to not violate the constraint as much. An upper bound on the maximum possible violation is the width ρ of the LP, defined by ρ := max {| ⟨ ai , x ⟩ − bi |}. x ∈K,i ∈[m] (15.4) We assume that ρ ≥ 1. Algorithm 13: LP-Solver p1 ← (1/m, . . . , 1/m). T ← Θ(ρ2 ln m/ε2 ) 13.2 for t = 1 to T do 13.3 Define αt := ∑im=1 pit ai ∈ Rn and βt = ∑im=1 pit bi ∈ R. 13.4 Use Oracle to find x ∈ K ∩ { αt , x t ≤ βt }. 13.5 if oracle says infeasible then 13.6 return infeasible 13.1 else git ← ⟨ ai , x t ⟩ − bi for all i. 13.9 feed gt to Hedge(ε) to get pt+1 . b ← ( x1 + · · · + x T )/T. 
13.10 return x 13.7 13.8 15.2.3 The Analysis Theorem 15.4. Fix 0 ≤ ε ≤ 1/4. Then Algorithm 13 calls the oracle O(ρ2 ln m/ε2 ) times, and either correctly returns “infeasible”, or returns xb ∈ K such that Ab x ≤ b − ε1. Proof. Observe that if x ∗ is feasible for the original LP (15.2) then it is feasible for any of the calls to the oracle, since it satisfies any positive linear combination of the constraints. Hence, we are correct if we ever return “infeasible”. Moreover, x t ∈ K in each iteration, and xb is an average of x t ’s, so it also lies in K by convexity. So it remains to show that xb approximately satisfies the other linear constraints. Recall the guarantee from Corollary 14.10: 1 1 ⟨ pt , gt ⟩ ≥ max ∑ ei , gt − ε, ∑ T t T t i (15.5) for precisely the choice of T in Algorithm 13, since the definition of width in (15.4) ensures that gt ∈ [−ρ, ρ]m . 187 188 solving lps approximately Let i ∈ [m], and recall the definitions of αt = ∑im=1 pit ai , βt m ∑i=1 pit bi , and gt = Ax t − b from the algorithm. Then = ⟨ pt , gt ⟩ = ⟨ pt , Ax t − b⟩ = ⟨ pt , Ax t ⟩ − ⟨ pt , b⟩ = ⟨αt , x t ⟩ − βt ≤ 0, the last inequality because x t satisfies the single linear constraint αt , x ≤ βt . Averaging over all times, the left hand side of (15.5) is 1 T t t ⟨ p , g ⟩ ≤ 0. T t∑ =1 However, the average on the RHS in (15.5) for constraint/expert i is: D 1 T E 1 T ei , g t = ei , ∑ g t ∑ T t =1 T t =1  T  1 = ∑ ai , xbt − bi T t =1 = ⟨ ai , xb⟩ − bi . Substituting into (15.5) we have 0≥  1 T t t ⟨ p , g ⟩ ≥ max ⟨ ai , xb⟩ − bi − ε. T t∑ i =1 This shows that Ab x ≤ b + ε1. 15.2.4 A Small Extension: Approximate Oracles Recall the definition of the problem width from (15.4). A few comments: • In the above analysis, we do not care about the maximum value of | a⊺i x − bi | over all points x ∈ K, but only about the largest this expression gets over points that are potentially returned by the oracle. This seems a pedantic point, but if there are many solutions to (15.3), we can return one with small width. But we can do more, as the next point outlines. • We can also relax the oracle to satisfy ⟨α, x ⟩ ≤ β + δ for some small δ > 0 instead. Define the width of the LP with respect such a relaxed oracle to be ρrlx := max i ∈[m],x returned by relaxed oracle {| a⊺i x − bi |}. (15.6) solving linear programs using experts Now running the algorithm with a relaxed oracle gives us a slightly worse guarantee that Ab x ≤ b + (ε + δ)1, but now the number of calls to the relaxed oracle can be even smaller, namely O(ρ2rlx ln m/ε2 ). • Of course, if we violations can be bounded in some better way, e.g., if we can ensure that violations are always positive or negative, then we can give stronger bounds on the regret, and hence reduce the number of calls even further. Details to come. All these improvements will be crucial in the upcoming applications. 189 16 Approximate Max-Flows using Experts We now use low-regret multiplicative-weight algorithms to give approximate solutions to the s-t-maximum-flow problem. In the previous chapter, we already saw how to get approximate solutions to general linear programs. We now show how a closer look at those algorithms give us improvements in the running time (albeit in the setting of undirected graphs), which go beyond those known via usual “combinatorial” techniques. The first set of results we give will hold for directed graphs as well, but the improved results will only hold for undirected graphs. 
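Before specializing to flows, here is a minimal Python sketch of the multiplicative-weights LP solver we will repeatedly invoke (Algorithm 13 and Theorem 15.4 from the previous chapter). The function names, the constant 4 in the horizon, and the particular Hedge rate are illustrative choices on our part; the oracle of (15.3) is assumed to be supplied by the caller.

import numpy as np

def mw_lp_feasibility(A, b, oracle, rho, eps):
    """Sketch of Algorithm 13: find x in K with Ax <= b + eps*1, or report infeasibility.

    A, b   : the m constraints <a_i, x> <= b_i that we handle via experts.
    oracle : given (alpha, beta), returns some x in K with <alpha, x> <= beta,
             or None if K intersected with that halfspace is empty.
    rho    : an upper bound on the width max |<a_i, x> - b_i| over oracle answers.
    """
    m, _ = A.shape
    T = int(np.ceil(4 * rho**2 * np.log(m) / eps**2))   # Theta(rho^2 ln m / eps^2) calls
    eta = eps / (2 * rho**2)        # an illustrative Hedge rate; see Corollary 14.10
    w = np.ones(m)                  # one expert (weight) per constraint
    xs = []
    for _ in range(T):
        p = w / w.sum()
        x = oracle(A.T @ p, p @ b)  # the single averaged constraint <alpha_t, x> <= beta_t
        if x is None:
            return None             # a convex combination of constraints excludes all of K
        g = A @ x - b               # gain of expert i = violation of constraint i
        w *= np.exp(eta * np.clip(g, -rho, rho))   # Hedge update: violated constraints gain weight
        xs.append(x)
    return np.mean(xs, axis=0)      # the average x_hat lies in K

Note that the width ρ enters only through the number of iterations T, which is why the rest of this chapter is about designing flow oracles with small width.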
16.1 The Maximum Flow Problem In the s-t maximum flow problem, we are given a graph G = (V, E), and distinguished vertices s and t. Each edge has a capacities ue ≥ 0; we will mostly focus on the unit-capacity case of ue = 1 in this chapter. The graph may be directed or undirected; an undirected edge can be modeled by two oppositely directed edges having the same capacity. Recall that an s-t flow is an assignment f : E → R+ such that (a) f (e) ∈ [0, ue ], i.e., capacity-respecting on all edges, and (b) ∑e=(u,v)∈E f (e) = ∑e=(v,w)∈E f (e), i.e., flow-conservation at all non-{s, t}-nodes. The value of flow f is ∑e=(s,w)∈E f (e) − ∑e=(u,s)∈E f (e), the net amount of flow leaving the source node s. The goal is to find an s-t flow in the network, that satisfies the edge capacities, and has maximum value. Algorithms by Edmonds and Karp, by Yefim Dinitz, and many others can solve the s-t max-flow problem exactly in polynomial time. For the special case of (directed) graphs with unit capacities, Shimon Even and Bob Tarjan, and independently, Alexander Karzanov showed in 1975 that the Ford-Fulkerson algorithm finds 192 a first algorithm using the mw framework the maximum flow in time O(m · min(m1/2 , n2/3 )). This runtime was eventually matched for general capacities (up to some polylogarithmic factors) by an algorithm of Andrew Goldberg and Satish Rao in 1998. For the special case of m = O(n), these results gave a runtime of O(m1.5 ), but nothing better was known even for approximate max-flows, even for unit-capacity undirected graphs—until a breakthrough in 2010, which we will see at the end of this chapter. 16.1.1 A Linear Program for Maximum Flow We formulate the max-flow problem as a linear program. There are many ways to do this, and we choose to write an enormous LP for it. Let P be the set of all s-t paths in G. Define a variable f P denoting the amount of flow going on path P ∈ P . We can now write: max ∑ fP P∈P (16.1) ∑ f P ≤ ue ∀e ∈ E fP ≥ 0 ∀P ∈ P P:e∈ P The first set of constraints says that for each edge e, the contribution of all possible flows is no greater than the capacity ue of that edge. The second set of constraints say that the contribution from each path must be non-negative. This is a gigantic linear program: there could be an exponential number of s-t paths. As we see, this will not be a hurdle. 16.2 A First Algorithm using the MW Framework To using the framework from the previous section, we just need to implement the Oracle: i.e., we solve a problem with a single “average” constraint, as in (15.3). Specifically, suppose we want a flow value of F, then the “easy” constraints are: K := { f | ∑ f p = F, f ≥ 0}. P∈P Moreover, the constraint ⟨α, f ⟩ ≤ β is not an arbitrary constraint—it is one obtained by combining the original constraints. Specifically, given a vector pt ∈ ∆m , the average constraint is obtained by the convex combination of these constraints:   ∑ pte ∑ f P ≤ ue , e∈ E P:e∈ P (16.2) approximate max-flows using experts 193 where f e represents the net flow over edge e. By swapping order of summations, and using the unit capacity assumption, we obtain   ∑ f P ∑ pte ≤ ∑ pte ue = 1. P∈P e∈ P e Now, the inner summation is the path length of P with respect to edge weights pte , which we denote by lent ( P) := ∑e∈ P pte . The constraint now becomes: ∑ f P lent ( P) ≤ 1, (16.3) P∈P and we want a point f ∈ K satisfying it. 
The best way to satisfy it is to place all F units of flow on the shortest path P, and zero everywhere else; we output “infeasible” if the shortest-path has a length more than 1. This step can be done by a single call to Dijkstra’s algorithm, which takes O(m + n log n) time.   Now Theorem 15.4 says that running this algorithm for Θ ρ2 log m ε2 iterations gives a solution f ∈ K, that violates the constraints by an additive ε. Hence, the scaled-down flow f /(1 + ε) would satisfy all the capacity constraints, and have flow value F/(1 + ε), which is what we wanted. To complete the runtime analysis, it remains to bound the value of ρ, the maximum amount by which any constraint gets violated by a solution from the oracle. Since we send all the F units of flow on a single edge, the maximum violation is F − 1. Hence the total runtime is at most O(m + n log n) · F2 log m . ε2 Moreover, the maximum flow F is m, by the unit capacity assumption, which gives us an upper bound of O(m3 poly(log m/ε)). 16.2.1 A Better Bound, via an Asymmetric Guarantee for Hedge Let us state (without proof, for now) a refined version of the Hedge algorithm for the case of asymmetric gains, where the gains lie in the range [−γ, ρ]. Theorem 16.1 (Asymmetric Hedge). Let ε ∈ (0, 1/2), and γ, ρ ≥ 1. Θ(γρ ln N ) Moreover, let T ≥ . There exists an algorithm for the experts ε2 problem such that for every sequence g1 , . . . , g T of gains with g ∈ [−γ, ρ] N , produces probability vectors { pt ∈ ∆ N }t∈[T ] online such that for each i: 1 T 1 T gt , pt ≥ ∑ gt , ei − ε. ∑ T t =1 T t =1 The proof is a careful (though not difficult) reworking of the standard proof for Hedge. (We will add it soon; a hand-written We already argued in Theorem 15.4 that if there exists a feasible flow of value F in the graph, we never output “infeasible”. Here is a direct proof. If there is a flow of value F, there are F disjoint s-t paths. The vector pt ∈ ∆m , so its values sum to 1. Hence, one of the F s-t paths P∗ has ∑e∈ P pte ≤ 1/F. Setting f P = F for that path satisfies the constraint. 194 finding max-flows using electrical flows proof is on the webpage.) Moreover, we can use this statement to Θ(γρ ln m) prove that the approximate LP solver can stop after calls ε2 to an oracle, as long as each of the oracle’s answer x guarantee that ( Ax )i − bi ∈ [−γ, ρ]. Since a solution f found by our shortest-path oracle sends all F flow on a single path, and all capacities are 1, we have γ = 1 and ρ = F − 1 ≤ F. The runtime now becomes O(m + n log n) · 1 · ( F − 1) log m . ε2 Again, using the naïve bound of F ≤ m, we have a runtime of O(m2 poly(log m/ε)) to find a (1 + ε)-approximate max-flow, even in directed graphs. 16.2.2 An Intuitive Explanation and an Example Observe that the algorithm repeats the following natural process: 1. it finds a shortest path in the graph, 2. it pushes F units of flow on it, and then 3. it increases the length of each edge on this path multiplicatively. This length-increase makes congested edges (those with a lot of flow) be much longer, and hence become very undesirable when searching for short paths. Note that the process is repeated some number of times, and then we average all the flows we find. So unlike usual network flow algorithms based on residual networks, these algorithms are truly greedy and cannot “undo” past actions (which is what pushing flow in residual flow networks does, when we use an arc backwards). This means these MW-based algorithms must ensure that very little flow goes on edges that are “wasteful”. 
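The following Python sketch (ours, using networkx for Dijkstra) mirrors this three-step process: maintain a length for every edge, repeatedly route all F units on the current shortest path, and multiplicatively lengthen the edges just used. The number of iterations follows the asymmetric bound with γ = 1 and ρ ≤ F; the notes point out that, after rescaling the gains, the update factor works out to (1 + ε/F).

import numpy as np
import networkx as nx

def mw_flow_shortest_paths(G, s, t, F, eps):
    """Sketch of the MW flow algorithm with the shortest-path oracle.

    G is an undirected unit-capacity networkx.Graph; we try to route F units.
    Each edge is an expert: pushing flow on an edge increases its length, so
    later shortest-path calls avoid congested edges.  Returns the averaged
    flow (per edge), or None if no flow of value F exists.
    """
    m = G.number_of_edges()
    T = int(np.ceil(F * np.log(m) / eps**2))        # gamma = 1, rho <= F
    length = {frozenset(e): 1.0 for e in G.edges()}
    flow = {frozenset(e): 0.0 for e in G.edges()}
    for _ in range(T):
        total = sum(length.values())
        for u, v in G.edges():                      # normalized lengths p_e sum to 1
            G[u][v]["p"] = length[frozenset((u, v))] / total
        path = nx.shortest_path(G, s, t, weight="p")            # Dijkstra
        if sum(G[u][v]["p"] for u, v in zip(path, path[1:])) > 1.0 / F:
            return None          # some F-flow would satisfy the averaged constraint otherwise
        for e in (frozenset(pair) for pair in zip(path, path[1:])):
            flow[e] += F                             # push all F units on this path
            length[e] *= 1 + eps / F                 # multiplicative length increase
    # scaling the result down by 1 + O(eps) makes it capacity-respecting
    return {e: f / T for e, f in flow.items()}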
To illustrate this point, consider an example commonly used to show that the greedy algorithm does not work for max-flow: Change the figure to make it more instructive. 16.3 The factor happens to be (1 + ε/F ), because of how we rescale the gains, but that does not matter for this intuition. Finding Max-Flows using Electrical Flows The approach of the previous sections suggests a way to get faster algorithms for max-flow: reduce the width of the oracle. The approach of the above section was to push all F flow along a single path, which is why we have a width of Ω( F ). Can we implement the oracle in a way that spreads the flow over several paths, and hence has smaller width? Of course, one such solution is to use the max-flow as the oracle response, but that would defeat the purpose of the MW approach. Indeed, we want a fast way of implementing the oracle. e ( f (n)) to hide We use the notation O factors that are poly-logarithmic in e (n), and f (n). E.g., O(n log2 n) lies in O e (log n), etc. O(log n log log n) lies in O approximate max-flows using experts For undirected graphs, one good solution turns out to be to use electrical flows: to model the graph as an electrical network, set a voltage difference between s and t, and compute how electrical current would flow between them. We now show how this approach gives us e (m1.5 /εO(1) )-time algorithm quite easily; then with some more an O e (m4/3 /εO(1) ). While we work, we improved this to get a runtime of O focus only on unit-capacity graphs, the algorithm can be extended to all undirected graphs with a further loss of poly-logarithmic factors in the maximum capacity, and moreover to get a runtime of e (mn1/3 / poly(ε)). O At the time this result was announced (by Christiano et al.), it was the fastest algorithm for the approximate maximum s-t-problem in undirected graphs. Since then, works by Jonah Sherman, and by Kelner et al. gave O(m1+o(1) /εO(1) )-time algorithms for the problem. The current best runtime is O(m poly log m/εO(1) )-time, due to Richard Peng. 16.3.1 Electrical Flows Given a connected undirected graph with general edge-capacities, we can view it as an electrical circuit, where each edge e of the original graph represents a resistor with resistance re = 1/ue , and we connect (say, a 1-volt) battery between s to t. This causes electrical current to flow from s (the node with higher potential) to t. Recall the following laws about electrical flows. Theorem 16.2 (Kirchoff’s Voltage Law). The directed potential changes along any cycle sum to 0. This means we can assign each node v a potential ϕv . Now the actual amount of current on any edge is given by Ohm’s law, and is related to the potential drop across the edge. Theorem 16.3 (Ohm’s Law). The electrical flow f uv on the edge e = uv is the ratio between the difference in potential ϕ (or voltage) between u, v and the resistance re of the edge: f uv = ϕu − ϕv . ruv Finally, we have flow conservation, much like in traditional network flows: Theorem 16.4 (Kirchoff’s Current Law). If we set s and t to some voltages, the electrical current ensures flow-conservation at all nodes except s, t: the total current entering any non-terminal node equals the current leaving it. 195 Christiano, Kelner, Madry, Spielman, and Teng (2010) Sherman (2013) Kelner, Lee, Orecchia, and Sidford (2013) Peng (2014) Interestingly, Shang-Hua Teng, Jonah Sherman, and Richard Peng are all CMU graduates. 
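As a quick sanity check of these three laws, consider the path s–u–t with two unit-resistance edges and potentials ϕs = 1, ϕt = 0. Ohm's law forces the currents f_su = 1 − ϕu and f_ut = ϕu − 0; Kirchoff's current law at u makes them equal, so ϕu = 1/2 and half a unit of current flows along the path.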
𝜑 𝑡 =0 𝜑 𝑠 =1 t s + - Figure 16.1: The currents on the wires would produce an electric flow (where all the wires within the graph have resistance 1). 196 finding max-flows using electrical flows These laws give us a set of linear constraints that allow us to go between the voltages and currents. In order to show this, we define the Laplacian matrix of a graph. 16.3.2 The Laplacian Matrix Given an undirected graph on n nodes and m edges, with nonnegative conductances cuv for each edge e = uv, we define the Laplacian matrix to be a n × n matrix LG , with entries    ∑w:uw∈E cuw ( LG )uv = −cuv   0 if u = v if (u, v) ∈ E otherwise The conductance of an edge is the reciprocal of the resistance of the edge: ce = 1/re . . For example, if we take the 6-node graph in Figure 16.1 and assume that all edges have unit conductance, then its Laplacian LG matrix is: s t s 2  t  0  u  −1 LG = v   −1  w 0 x 0 0 2 0 0 −1 −1  u v w x  −1 −1 0 0  0 0 −1 −1   3 0 −1 −1  . 0 2 0 −1    −1 0 2 0  −1 −1 0 3 Equivalently, we can define the Laplacian matrix Luv for the graph consisting of a single edge uv as Luv := cuv (eu − ev )⊺ (eu − ev ). Now for a general graph G, we define the Laplacian to be: LG = ∑ Luv . Note that LG is not a full-rank matrix since, e.g., the columns sum to zero. However, if the graph G is corrected, then the vector 1 is the only vector in the kernel of LG , so its rank is n − 1. (proof?) This Laplacian for the single edge uv has 1s on the diagonal at locations (u, u), (v, v), and −1s at locations (u, v), (v, u). Draw figure. uv∈ E In other words, LG is the sum of little ‘per-edge’ Laplacians Luv . (Since each of those Laplacians is clearly positive semidefinite (PSD), it follows that LG is PSD too.) For yet another definition for the Laplacian, first consider the edge-vertex incidence matrix B ∈ {−1, 0, 1}m×n , where the rows are indexed by edges and the columns by vertices. The row corresponding to edge e = uv has zeros in all columns other than u, v, it has an entry +1 in one of those columns (say u) and an entry −1 in the A symmetric matrix A ∈ Rn×n is called PSD if x⊺ Ax ≥ 0 for all x ∈ Rn , or equivalently, if all its eigenvalues are non-negative. approximate max-flows using experts other (say v). su sv uw ux vx wt s 1  t  0  u  −1 B= v   0  w 0 x 0 1 0 0 −1 0 0 0 0 1 0 −1 0 0 0 1 0 0 −1 0 0 0 1 0 −1 0 −1 0 0 1 0  xt  0  −1   0  . 0    0  1 The Laplacian matrix is now defined as LG := B⊺ CB, where C ∈ Rm×m is a diagonal matrix with entry Cuv containing the conductance for edge uv. E.g., for the example above, here’s the edge-vertex incidence matrix, and since all conductances are 1, we have LG = BB⊺ . 16.3.3 Solving for Electrical Flows: Lx = b Given the Laplacian matrix for the electrical network, we can figure out how the current flows by solving a linear system, i.e., a system of linear equations. Indeed, by Theorem 16.4, all the current flows from s to t. Suppose k units of current flows from s to t. By Theorem 16.3, the net current flow into a node v is precisely ∑ u:uv∈ E f uv = ϕu − ϕv . ruv u:uv∈ E ∑ A little algebra shows this to be the vth entry of the vector Lϕ. Finally, by 16.4, this net current into v must be zero, unless v is either s or t, in which case it is either −k or k respectively. Summarizing, if ϕ are the voltages at the nodes, they satisfy the linear system: Lϕ = k(es − et ). (Recall that k is the amount of current flowing from s to t, and es , et are elementary basis vectors.) 
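Here is a small numpy sketch (ours, not from the lecture) that assembles the Laplacian from the edge-vertex incidence matrix and solves the system Lϕ = k(es − et); the pseudoinverse below stands in for the fast Laplacian solvers discussed shortly, and handles the fact that L has rank n − 1.

import numpy as np

def electrical_flow(n, edges, resistances, s, t, k=1.0):
    """Send k units of electrical s-t current through an undirected graph.

    edges       : list of (u, v) pairs over vertices 0..n-1
    resistances : r_e for each edge (conductance c_e = 1/r_e)
    Returns (potentials phi, currents f_e oriented from u to v).
    """
    m = len(edges)
    B = np.zeros((m, n))                    # edge-vertex incidence matrix
    for i, (u, v) in enumerate(edges):
        B[i, u], B[i, v] = 1.0, -1.0
    C = np.diag(1.0 / np.asarray(resistances, dtype=float))
    L = B.T @ C @ B                         # Laplacian: L = B^T C B
    b = np.zeros(n)
    b[s], b[t] = k, -k
    phi = np.linalg.pinv(L) @ b             # solve L phi = k (e_s - e_t), up to shifts
    f = C @ (B @ phi)                       # Ohm's law: f_uv = (phi_u - phi_v)/r_uv
    return phi, f

For instance, with the vertices ordered (s, t, u, v, w, x) and the seven unit-resistance edges su, sv, uw, ux, vx, wt, xt of Figure 16.1, the matrix L computed here reproduces the 6 × 6 Laplacian displayed above.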
It turns out the solutions ϕ to this linear system are unique up to translation, as long as the graph is connected: if ϕ is a solution, then {ϕ + a | a ∈ R} is the set of all solutions. Great: we have n + 1 unknowns so far: the potentials at all the nodes, and the current value k. The above discussion gives us potentials at all the nodes in terms of the current value k. Now we can set unit potential at s, and ground t (i.e., set its potential to zero), and solve the linear system (with n − 1 linearly independent constraints) for the remaining n − 1 variables. The resulting value of k gives us the s-t current flow. Moreover, the potential settings at all the other nodes can now be read off from the ϕ vector. Then we can use Ohm’s law to also read off the current on each edge, if we want. 197 198 finding max-flows using electrical flows How do we solve the linear system Lϕ = b (subject to these boundary conditions)? We can use Gaussian elimination, of course, but the best implementations can take nω time in the worst-case. Thankfully, there are faster (approximate) methods, which we discuss in §16.3.5. 16.3.4 Electrical Flows Minimize Energy Burn Here’s another useful way of characterizing this current flow of k units from s and t: the current flow is one minimizing the total energy dissipated. Indeed, for a flow f , the energy burn on edge e is given by 2 ( f uv )2 ruv = (ϕur−uvϕv ) , and the total energy burn is E ( f ) := ∑ f e2 re = e∈ E (ϕu − ϕv )2 = ϕ⊺ Lϕ. r uv (u,v)∈ E ∑ The electrical flow f produced happens to be arg min f is an s-t flow of value k {E ( f )}. We often use this characterization when arguing about electrical flows. 16.3.5 Solving Linear Systems We can solve a linear system Lx = b fast? If L is a Laplacian matrix and we are fine with approximate solutions, we can do things much faster than Gaussian elimination. A line of work starting with Dan Spielman and Shang-Hua Teng, and then refined by Ioannis Koutis, Gary Miller, and Richard Peng shows how to (approximately) solve a Laplacian linear system in the time essentially near-linear in the number of non-zeros of the matrix L. Theorem 16.5 (Laplacian Solver). There exists an algorithm that given a linear system Lx = b with L being a Laplacian matrix (and having solution x̄), find a vector x̂ such that the error vector z := L x̂ − b satisfies z⊺ Lz ≤ ε( x̄⊺ L x̄ ). The algorithm is randomized? and runs in time O(m log2 n log 1/ε). Moreover, Theorem 16.5 can be converted to what we need; details appear in the Christiano et al. paper. Corollary 16.6 (Laplacian Solver II). There is an algorithm given a linear system Lx = b corresponding to an electrical system as above, outputs an electrical flow f that satisfies E ( f ) ≤ (1 + δ)E ( fe), Spielman and Teng (200?) Koutis, Miller, and Peng (2010) Given a positive semidefinite matrix A, the A-norm is defined as ∥ x ∥ A := √ x⊺ Ax. Hence the guarantee here says ∥ L x̂ − b∥ L ≤ ε ∥ x̄ ∥ L . approximate max-flows using experts 199 m log R e( where fe is the min-energy flow. The algorithm runs in O ) time, δ where R is the ratio between the largest and smallest resistances in the network. For the rest of this lecture we assume we can compute the corree (m). The arguments sponding minimum-energy flow exactly in time O can easily be extended to incorporate the errors. 16.4 e (m3/2 )-time Algorithm An O Recall the setup from §16.2: given the polytope K = { f | ∑ f P = F, f ≥ 0}, P∈P and some edge weights pe , we wanted a vector in K that satisfies ∑ pe f e ≤ 1. 
(16.4) e where f e := ∑ P:e∈ P f P . Previously, we set f P∗ = F for P∗ being the shortest s-t path according to edge weights pe , but that resulted in the width—the maximum capacity violation—being too as large as Ω( F ). So we want to spread the flow over more paths. Our solution will now be to have the oracle return a flow with √ width O( m/ε), and which satisfies the following weaker version of the length bound (16.4) above: ∑ pe f e ≤ (1 + ε) ∑ pe + ε = 1 + 2ε. e∈ E e∈ E It is a simple exercise to check that this weaker oracle changes the analysis of Theorem 15.4 only slightly, still showing that the multiplicativeweights-based process finds an s-t-flow of value F, but now the edgecapacities are violated by 1 + O(ε) instead of just 1 + ε. Indeed, we replace the shortest-path implementation of the oracle by the following electrical-flow implementation: we construct a weighted electrical network, where the resistance for each edge e is defined to be This idea of setting the edge length ε to be pe plus a small constant term is a ret := pte + . general technique useful in controlling m the width in other settings, as we will We now compute currents f et by solving the linear system Lt ϕ = see in a HW problem. F (es − et ) and return the resulting flow. It remains to show that this flow spreads its mass around, and yet achieves a small “length” on average. Theorem 16.7. If f ∗ is a flow with value F and f is the minimum-energy flow returned by the oracle, then 1. (length) ∑e∈E pe f e ≤ (1 + ε) ∑e∈E pe , 200 e ( m 4/3 ) -time algorithm optional: an O √ 2. (width) maxe f e ≤ O( m/ε). Proof. Since the flow f ∗ satisfies all the constraints, it burns energy E ( f ∗ ) = ∑( f e∗ )2 re ≤ ∑ re = ∑( pe + e e e ε ) = 1 + ε. m Here we use that ∑e pe = 1. But since f is the flow K that minimizes the energy, E ( f ) ≤ E ( f ∗ ) ≤ 1 + ε. Now, using Cauchy-Schwarz, r √ √ √ √ ∑ re f e = ∑ ( re f e · re ) ≤ (∑ re f e2 )(∑ re ) ≤ 1 + ε 1 + ε = 1 + ε. e e e e This proves the first part of the theorem. For the second part, we may use the bound on energy burnt to obtain  ε ε ∑ f e2 m ≤ ∑ f e2 pe + m = ∑ f e2 re = E ( f ) ≤ 1 + ε. e e e Since each term in the leftmost summation is non-negative, r r m (1 + ε ) 2m 2 ε fe ≤ 1 + ε =⇒ f e ≤ ≤ m ε ε for each edge e. Using this oracle within the MW framework means the width is √ ρ log m  e (m) time by ρ = O( m), and each of the O ε2 iterations takes O e (m3/2 ). Corollary 16.6, giving a runtime of O In fact, this bound on the width is tight: consider the example network on the right. The effective resistance of the entire collection of black edges is 1, which matches the effective resistance of the red edge, so half the current goes on the top red edge. If we set F = k + 1 √ (which is the max-flow), this means a current of Θ( m) goes on the top edge. Sadly, while the idea of using electrical flows is very cool, the runtime of O(m3/2 ) is not that impressive. The algorithms of Karzanov, and of Even and Tarjan, for exact flow on directed unit-capacity graphs in time O(m min(m1/2 , n2/3 )) were known even back in the 1970s. (Algorithms with similar runtime are known for capacitated cases too.) Thankfully, this is not the end of the story: we can take the idea of electrical flows further to get a better algorithm, as we show in the next section. 
16.5 e (m4/3 )-time Algorithm Optional: An O The idea to get an improved bound on the width is to use a crude but effective trick: if we have an edge with electrical flow of more than √ Figure 16.2: There are k = Θ( m) black paths of length k each. All edges have unit capacities. approximate max-flows using experts 201 ρ ≈ m1/3 in some iteration, we delete it for that iteration (and for the rest of the process), and find a new flow. Clearly, no edge now carries a flow more than ρ. The main thrust of the proof is to show that we do not end up butchering the graph, and that the maximum flow value reduces by only a small amount due to these edge deletions. Formally, we set: ρ= m 1/3 log m . ε (16.5) and show that at most εF edges are ever deleted by the process. The crucial ingredient in this proof is this observation: every time we delete an edge, the effective resistance between s and t increases by a lot. Since we need to argue about how many edges are deleted in the entire algorithm (and not just in one call to the oracle), we explicitly maintain edge-weights wet , instead of using the results from the previous sections as a black-box. 16.5.1 The Effective Resistance Loosely speaking, the effective resistance between nodes u and v is the resistance offered by the network to electrical flows between u and v. There are many ways of formalizing this: the most useful one in this context is the following. Definition 16.8 (Effective Resistance). The effective resistance between s and t, denoted by Reff (st), is the energy burned if we send one unit of electrical current from s to t. Since we only consider the effective resistance between s and t in this lecture, we simply write Reff . The following results relate the effective resistances before and after we change the resistances of some edges. Lemma 16.9. Consider an electrical network with edge resistances re . 1. (Rayleigh Monotonicity) If we increase the resistances to re′ ≥ re for all e, the resulting effective resistance is ′ Reff ≥ Reff . 2. Suppose f is an s-t electrical flow, suppose e is an edge with energy burn f e2 re ≥ βE ( f ). If we set re′ ← ∞, then the new effective resistance ′ Reff ≥( Reff ). 1−β We assume that a flow value of F is feasible; moreover, F ≥ ρ, else FordFulkerson can be implemented in time e (m4/3 ). O(mF ) ≤ O 202 e ( m 4/3 ) -time algorithm optional: an O Proof. Recall that if we send electrical flow from s to t, the resulting flow f minimizes the total energy burned E ( f ) = ∑e f e2 re . To prove the first statement: for each flow, the energy burned with the new resistances is at least that with the old resistances. Need to add in second part. 16.5.2 A Modified Algorithm Let’s give our algorithm that explicitly maintains the edge weights: We start off with weights w1e = 1 for all e ∈ E. At step t of the algorithm: 1. Find the min-energy flow f t of value F in the remaining graph with respect to edge resistances ret := wet + mε W t . 2. If there is an edge e with f et > ρ, delete e (for the rest of the algorithm), and go back to Item 1. 3. Update the edge weights wet+1 ← wet (1 + ρε f et ). This division by ρ accounts for the edge-capacity violations being as large as ρ. Stop after T := 16.5.3 ρ log m iterations, and output fb = T1 ∑t f t . ε2 The Analysis Let us first comment on the runtime: each time we find an electrical flow, we either delete an edge, or we push flow and increment t. The latter happens for T steps by construction; the next lemma shows that we only delete edges in a few iterations. 
Lemma 16.10. We delete at most m1/3 ≤ εF edges over the run of the algorithm. We defer the proof to later, and observe that the total number of electrical flows computed is therefore O( T ). Each such computation e (m/ε) by Corollary 16.6, so the overall runtime of our algotakes O rithm is O(m4/3 / poly(ε)). Next, we show that the flow fb is an (1 + O(ε)-approximate maximum s-t flow. We start with an analog of Theorem 16.7 that accounts for edge deletions. Lemma 16.11. Suppose ε ≤ 1/10. If we delete at most εF edges from G: 1. the flow f t at step t burns energy E ( f t ) ≤ (1 + 3ε)W t , 2. ∑e wet f et ≤ (1 + 3ε)W t ≤ 2W t , and 3. if fb ∈ K is the flow eventually returned, then fbe ≤ (1 + O(ε)). approximate max-flows using experts Proof. We assumed there exists a flow f ∗ of value F that respects all capacities. Deleting εF edges can only hit εF of these flow paths, so there exists a capacity-respecting flow of value at least (1 − ε) F. 1 Scaling up by (1− , there exists a flow f ′ of value F using each edge ε) 1 to extent (1− . The energy of this flow according to resistances ret is ε) at most E ( f ′ ) = ∑ r et ( f e′ ) 2 ≤ e 1 Wt t r ≤ ≤ ( 1 + 3ε ) W t , ∑ e (1 − ε )2 e (1 − ε )2 for ε small enough. Since we find the minimum energy flow, E ( f t ) ≤ E ( f ′ ) ≤ W t ( 1 + 3ε ) . For the second part, we again use the CauchySchwarz inequality: q r r ∑ w et f et ≤ ∑ w et ∑ w et ( f et ) 2 ≤ W t · W t ( 1 + 3ε ) ≤ ( 1 + 3ε ) W t ≤ 2W t . e e e The last step is very loose, but it will suffice for our purposes. To calculate the congestion of the final flow, observe that even though the algorithm above explicitly maintains weights, we can just wt appeal directly to the guarantees . Indeed, define p te : = Wet for each time t; the previous part implies that the flow f t satisfies ∑ p te f et ≤ 1 + 3ε e for precisely the p t values that the Hedge-based LP solver would return if we gave it the flows f 0 , f 1 , . . . , f t − 1 . Using the guarantees of that LP solver, the average flow bf uses any edge e to at most ( 1 + 3ε ) + ε. Finally, it remains to prove Lemma 16.10. Proof of Lemma 16.10. We track two quantities: the total weight W t and the s-t-effective resistance R eff . First, the weight starts at W 0 = m, and when we do an update,   ε ε W t + 1 = ∑ w et 1 + f et = Wt + ∑ w et f et ρ ρ e e ε t t ≤ W + ( 2W ) (From Claim 16.11) ρ Hence we get that for T = T 0 W ≤W ·  2ε 1+ ρ T ρ ln m , ε2 ≤ m · exp  2ε · T ρ  = m · exp  2 ln m ε  Therefore, the total weight is at most m 1 + 2/ε . Next, we consider the s-t-effective resistance R eff . . 203 204 e ( m 4/3 ) -time algorithm optional: an O 1. At the beginning, all edges have resistance 1 + ε. When we send F flow, some edge has at least F/m flow on it, so the energy burn is at least ( F/m ) 2 . This means R eff at the beginning is at least ( F/m ) 2 ≥ 1/m 2 . 2. The weights increase each time we do an update, so R eff does not decrease. (This is one place it is more convenience to argue about weights w et explicitly, and not just the probabilities p te .) 3. Each deleted edge e has flow at least ρ, and hence energy burn at least ( ρ 2 ) w et ≥ ( ρ 2 ) mε W t . Since the total energy burn is at most 2W t from Lemma 16.11, the deleted edge e was burning at least ρ2 ε β : = 2m fraction of the total energy. Hence new R eff ≥ ol d R eff 2 ( 1 − ρ2mε ) ol d ≥ R eff · exp  ρ2 ε 2m  if we use 1 −1 x ≥ e x/2 when x ∈ [ 0, 1/4 ] . 4. 
For the final effective resistance, note that we send F flow with total energy burn 2W T ; since the energy depends on the square of T f inal the flow, we have R eff ≤ 2W ≤ 2W T . F2 (All these calculations hold as long as we have not deleted more than ε F edges.) Now, to show that this invariant is maintained, suppose D edges are deleted over the course of the T steps. Then     ρ2 ε 2 ln m f inal 0 T R eff exp D · ≤ R eff ≤ 2W ≤ 2m · exp . 2m ε Taking logs and simplifying, we get that ερ 2 D 2 ln m ≤ ln ( 2m 3 ) + 2m ε   2m ( ln m )( 1 + O ( ε )) =⇒ D ≤ 2 ≪ m 1/3 ≤ εF. ε ερ This bounds the number of deleted edges D as desired. 16.5.4 Tightness of the Analysis This analysis of the algorithm is tight. Indeed, the algorithm needs Ω ( m 1/3 ) iterations, and deletes Ω ( m 1/3 ) edges for the example on the right. In this example, m = Θ ( n ) . Each black gadget has a unit effective resistance, and if we do the calculations, the effective resistance between s and t tends to the golden ratio. If we set F = n 1/3 (which is almost the max-flow), a constant fraction of the current (about Θ ( n 1/3 ) ) uses the edge e 1 . Once that edge is deleted, the next red edge e 2 carries a lot of current, etc., until all red edges get deleted. Figure 16.3: Again, all edges have unit capacities. approximate max-flows using experts 16.5.5 Subsequent Work A couple years after this work, Sherman, and independently, Kelner et al. gave O ( m 1 + o ( 1 ) /ε O ( 1 ) ) -time algorithms for approximate maxflow problem on undirected graphs. This was improved, using some more ideas, to a runtime of O ( m poly log m/ε O ( 1 ) ) -time by Richard Peng. These are based on the ideas of oblivious routings, and nonEuclidean gradient descent, and we hope to cover this in an upcoming lecture. There has also been work on faster directed flows: work by Madry, and thereafter by more refs here, have improved the current best ree ( m 4/3 ) , matchsult for max-flow in unweighted directed graphs to O ing the above result. Sherman (2013) Kelner, Lee, Orecchia, and Sidford (2013) Peng (2014) 205 17 The Gradient Descent Framework Consider the problem of finding the minimum-energy s-t electrical unit flow: we wanted to minimize the total energy burn E ( f ) = ∑ f e2 r e e for flow values f that represent a unit flow from s to t (these form a polytope). We alluded to algorithms that solve this problem, but one can also observe that E ( f ) is a convex function, and we want to find a minimizer within some polytope K. Equivalently, we wanted to solve the linear system Lϕ = ( e s − e t ) , which can be cast as finding a minimizer of the convex function ∥ Lϕ − ( e s − e t )∥ 2 . How can we minimize these functions efficiently? In this lecture, we will study the gradient descent framework for the general problem of minimizing functions, and give concrete performance guarantees for the case of convex optimization. 17.1 Convex Sets and Functions First, recall the following definitions: Definition 17.1 (Convex Set). A set K ⊆ R n is called convex if for all x, y ∈ K, λx + ( 1 − λ ) y ∈ K, (17.1) for all values of λ ∈ [ 0, 1 ] . Geometrically, this means that for any two points in K, the line connecting them is contained in K. Definition 17.2 (Convex Function). A function f : K → R defined on a convex set K is called convex if for all x, y ∈ K, f ( λx + ( 1 − λ ) y ) ≤ λ f ( x ) + ( 1 − λ ) f ( y ) , (17.2) y 208 convex sets and functions f [λx + (1 − λ)y] λ f ( x ) + (1 − λ ) f ( y ) for all values of λ ∈ [ 0, 1 ] . 
f (x) There are two kinds of problems that we will study. The most basic question is that of unconstrained convex minimization (UCM): given a convex function f , we want to find x x λx + (1 − λ)y y min f ( x ). x ∈Rn In some cases we will be concerned with the constrained convex minimization (CCM) problem: given a convex function f and a convex set K, we want to find min f ( x ). x ∈K Note that setting K = Rn gives us the unconstrained case. 17.1.1 Gradient For most of the following discussion, we assume that the function f is differentiable. In that case, we can give an equivalent characterization, based on the notion of the gradient ∇ f : Rn → Rn . Fact 17.3 (First-order condition). A function f : K → R is convex if and only if f (y) ≥ f ( x ) + ⟨∇ f ( x ), y − x ⟩ , (17.3) The directional derivative of f at x (in the direction y) is defined as f ′ ( x; y) := lim ε →0 f ( x + εy) − f ( x ) . ε If there exists a vector g such that ⟨ g, y⟩ = f ′ ( x; y) for all y, then f is called for all x, y ∈ K. differentiable at x, and g is called the gradient. It follows that the gradient Geometrically, Fact 17.3 states that the function always lies above must be of the form   its tangent plane, for all points in K. If the function f is twice-differentiable, ∂f ∂f ∂f ( x ), ∇ f (x) = ( x ), · · · , (x) . and if H f ( x ) is its Hessian matrix, i.e. its matrix of second derivatives ∂x1 ∂x2 ∂xn at x ∈ K: ( H f )i,j ( x ) := ∂2 f ( x ), ∂xi ∂x j (17.4) then we get yet another characterization of convex functions. Fact 17.4 (Second-order condition). A twice-differentiable function f is convex if and only if H f ( x ) is positive semidefinite for all x ∈ K. 17.1.2 Lipschitz Functions We will need a notion of “niceness” for functions: Definition 17.5 (Lipschitz continuity). For a convex set K ⊆ Rn , a function f : K → R is called G-Lipschitz (or G-Lipschitz continuous) with respect to the norm ∥ · ∥ if | f ( x ) − f (y)| ≤ G ∥ x − y∥ , for all x, y ∈ K. Figure 17.1: The blue line denotes the function and the red line is the tangent line at x. (Figure from Nisheeth Vishnoi.) the gradient descent framework In this chapter we focus on the Euclidean or ℓ2 -norm, denoted by ∥ · ∥2 . General norms arise in the next chapter, when we talk about mirror descent. Again, assuming that the function is differentiable allows us to give an alternative characterization of Lipschitzness. Fact 17.6. A differentiable function f : K → Rn is G-Lipschitz with respect to ∥ · ∥2 if and only if ∥∇ f ( x )∥2 ≤ G, (17.5) for all x ∈ K. 17.2 Unconstrained Convex Minimization If the function f is convex, any stationary point (i.e., a point x ∗ where ∇ f ( x ∗ ) = 0) is also a global minimum: just use Fact 17.3 to infer that f (y) ≥ f ( x ∗ ) for all y. Now given a convex function, we can just solve the equation ∇ f (x) = 0 to compute the global minima exactly. This is often easier said than done: for instance, if the function f we want to minimize may not be given explicitly. Instead we may only have a gradient oracle that given x, returns ∇ f ( x ). Even when f is explicit, it may be expensive to solve the equation ∇ f ( x ) = 0, and gradient descent may be a faster way. One example arises when solving linear systems: given a quadratic function f ( x ) = 1 ⊺ 2 x Ax − bx for a symmetric matrix A (say having full rank), a simple calculation shows that ∇ f ( x ) = 0 ⇐⇒ Ax = b ⇐⇒ x = A−1 b. 
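As a quick numerical illustration (ours, not from the notes) of the first-order condition in Fact 17.3 and of Definition 17.5, the snippet below tests the convex function f(x) = log ∑_i e^{x_i}, whose gradient is the softmax vector and whose ℓ2-Lipschitz constant is at most 1.

import numpy as np

def f(x):
    return np.log(np.sum(np.exp(x)))

def grad_f(x):                          # softmax: the gradient of log-sum-exp
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    # Fact 17.3: a convex function lies above all of its tangent planes.
    assert f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-9

# Lipschitzness via gradient norms: softmax vectors have Euclidean norm <= 1.
G = max(np.linalg.norm(grad_f(rng.normal(size=5))) for _ in range(1000))
print("empirical Lipschitz bound:", G)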
This can be solved in O(nω ) (i.e., matrix-multiplication) time using Gaussian elimination—but for “nice” matrices A we are often able to approximate a solution much faster using the gradient-based methods we will soon see. 17.2.1 The Basic Gradient Descent Method Gradient descent is an iterative algorithm to approximate the optimal solution x ∗ . The main idea is simple: since the gradient tells us the direction of steepest increase, we’d like to move opposite to the direction of the gradient to decrease the fastest. So by selecting an initial position x0 and a step size ηt at each time t, we can repeatedly perform the update: x t +1 ← x t − η t · ∇ f ( x t ). (17.6) 209 210 unconstrained convex minimization There are many choices to be made: where should we start? What are the step sizes? When do we stop? While each of these decisions depend on the properties of the particular instance at hand, we can show fairly general results for general convex functions. 17.2.2 An Algorithm for General Convex Functions The algorithm fixes a step size for all times t, performs the update (17.6) for some number of steps T, and then returns the average of all the points seen during the process. Algorithm 14: Gradient Descent x1 ← starting point for t ← 1 to T do 14.3 x t +1 ← x t − η · ∇ f ( x t ) 14.1 14.2 14.4 return xb := 1 T xi . T t∑ =1 This is easy to visualize in two dimensions: draw the level sets of the function f , and the gradient at a point is a scaled version of normal to the tangent line at that point. Now the algorithm’s path is often a zig-zagging walk towards the optimum (see Fig 17.2). Interestingly, we can give rigorous bounds on the convergence of this algorithm to the optimum, based on the distance of the starting point from the optimum, and bounds on the Lipschitzness of the function. If both these are assumed to be constant, then our error is smaller than ε in only O(1/ε2 ) steps. Proposition 17.7. Let f : Rn → R be convex, differentiable and GLipschitz. Let x ∗ be any point in Rd . If we define T := η := G 2 ∥ x0 − x ∗ ∥2 and ε2 ∗ ∥ x0 − √x ∥ , then the solution x b returned by gradient descent satisfies G T f ( xb) ≤ f ( x ∗ ) + ε. (17.7) In particular, this holds when x ∗ is a minimizer of f . The core of this proposition lies in the following theorem Theorem 17.8. Let f : Rn → R be convex, differentiable and G-Lipschitz. Then the gradient descent algorithm ensures that T T 1 1 f ( x ) ≤ ∑ t ∑ f (x∗ ) + 2 ηTG2 + 2η ∥ x0 − x∗ ∥2 . t =1 t =1 (17.8) We will prove Theorem 17.8 in the next section, but let’s first use it to prove Proposition 17.7, our guarantee on the offline convergence of vanilla gradient descent. Figure 17.2: The yellow lines denote the level sets of the function f and the red walk denotes the steps of gradient descent. (Figure from Wikipedia.) the gradient descent framework Proof of Proposition 17.7. By definition of xb and the convexity of f , By Theorem 17.8, f ( xb) = f 1 T  T 1 x t ≤ ∑ f ( x t ). T ∑ T t =1 t =1 1 1 1 T f ( xt ) ≤ f ( x ∗ ) + ηG2 + ∥ x0 − x ∗ ∥2 . T t∑ 2 2ηT =1 | {z } error The error terms balance when η = ∗ ∥ x0 − √x ∥ f ( xb) ≤ f ( x ∗ ) + G T , giving ∥ x0 − x ∗ ∥ G √ . T Finally, we set T = ε12 G2 ∥ x0 − x ∗ ∥2 to obtain f ( xb) ≤ f ( x ∗ ) + ε. Observe: we do not (and cannot) show that the point xb is close in distance to x ∗ ; we just show that the function value f ( xb) ≈ f ( x ∗ ). 
Indeed, if the function is very flat close to x ∗ and we start off at some remote point, we make tiny steps as we get close to x ∗ , and we cannot hope to get close to it. The 1/ε2 dependence of the number of oracle calls was shown to be tight for gradient-based methods by Yurii Nesterov, if we allow f to be any G-Lipschitz function. However, if we assume that the function is “well-behaved”, we can indeed improve on the 1/ε2 dependence. Moreover, if the function is strongly convex, we can show that x ∗ and xb are close to each other as well: see §17.5 for such results. The convergence guarantee in Proposition 17.7 is for the timeaveraged point xb. Indeed, using a fixed step size means that our iterates may get stuck in a situation where xt+2 = xt after some point and hence we never improve, even though xb is at the minimizer. One can also show that f ( x T ) ≤ f ( x ∗ ) + ε if we use a time-varying √ step size ηt = O(1/ t), and increase the time horizon slightly to O(1/ε2 log 1/ε). We refer to the work of Shamir and Zhang. 17.2.3 Proof of Theorem 17.8 Like in the proof of the multiplicative weights algorithm, we will use a potential function. Define Φt := ∥ x t − x ∗ ∥2 . 2η (17.9) We start the proof of Theorem 17.8 by understanding the one-step change in the potential: 211 212 unconstrained convex minimization Lemma 17.9 (Change in Potential). Φt+1 − Φt ≤ ⟨∇ f ( xt ), x ∗ − xt ⟩ + η ∥∇ f ( xt )∥2 . 2 Proof. Using the identity ∥ a + b∥2 = ∥ a∥2 + 2 ⟨ a, b⟩ + ∥b∥2 , with a + b = xt+1 − x ∗ and a = xt − x ∗ , we get  1 ∥ x t +1 − x ∗ ∥ 2 − ∥ x t − x ∗ ∥ 2 (17.10) 2η  1 = 2 ⟨x − x t , x t − x ∗ ⟩ + ∥ x t +1 − x t ∥ 2 ; {z } 2η | t+1 {z } | Φ t +1 − Φ t = ∥ b ∥2 ⟨b,a⟩ now using xt+1 − xt = −η ∇ f ( xt ) from gradient descent, =  1 2 ⟨−η ∇ f ( xt ), xt − x ∗ ⟩ + ∥η ∇ f ( xt )∥2 . 2η Now rearranging terms proves the lemma. Now that we understand how our potential changes over time, proving the theorem is straightforward. Proof of Theorem 17.8. We start with the inequality we proved above, and use that since f is G-Lipschitz, ∥∇ f ( x )∥ ≤ G for all x. Thus, f ( xt ) + (Φt+1 − Φt ) = f ( xt ) + ⟨∇ f ( xt ), x ∗ − xt ⟩ + η 2 G . 2 Since f is convex, we know that f ( xt ) + ⟨∇ f ( xt ), x ∗ − xt ⟩ ≤ f ( x ∗ ). Thus, we conclude that f ( x t ) + ( Φ t +1 − Φ t ) ≤ f ( x ∗ ) + η 2 G . 2 Summing over t = 1, . . . , T, T T T t =1 t =1 t =1 η ∑ f ( x t ) + ∑ ( Φ t +1 − Φ t ) ≤ ∑ f ( x ∗ ) + 2 G 2 T The sum of potentials on the left telescopes to give: T T t =1 t =1 η ∑ f ( x t ) + Φ T +1 − Φ 1 ≤ ∑ f ( x ∗ ) + 2 G 2 T Since the potentials are nonnegative, we can drop the Φ T term: T T t =1 t =1 η ∑ f ( x t ) − Φ1 ≤ ∑ f ( x ∗ ) + 2 G 2 T Substituting in the definition of Φ1 and moving it over to the right hand side completes the proof. the gradient descent framework 17.2.4 213 Some Remarks on the Algorithm We assume a gradient oracle for the function: given a point x, it returns the gradient ∇ f ( x ) at that point. If the function f is not given explicitly, we may have to estimate the gradient using, e.g., random sampling. One particularly sample-efficient solution is to pick a uniformly random point u ∼ Sn−1 from the sphere in Rn , and return h f ( x + δu) i d u δ for some tiny δ > 0. It is slightly mysterious, so perhaps it is useful to consider its expectation in the case of a univariate function: Eu∼{−1,+1} As δ → 0, the expectation of this expression tends to ∇ f ( x ), using Stokes’ theorem. h f ( x + δu) i f ( x + δ) − f ( x − δ) u = ≈ f ′ ( x ). 
δ 2δ In general, randomized strategies form the basis of stochastic gradient descent, where we use an unbiased estimator of the gradient, instead of computing the gradient itself (because it is slow to compute, or because enough information is not available). The challenge is now to control the variance of this estimator. Another concern is that the step-size η and the number of steps T both require knowledge of the distance ∥ x1 − x ∗ ∥ as well as the bound on the gradient. More here. As an exercise, show that using ∥ x −x∗ ∥ the time-varying step-size ηt := 0 √ also gives a very similar G t convergence rate. Finally, the guarantee is for f ( xb), where xb is the time-average of the iterates. What about returning the final iterate? It turns out this has comparable guarantees, but the proof is slightly more involved. Add references. 17.3 Constrained Convex Minimization Unlike the unconstrained case, the gradient at the minimizer may not be zero in the constrained case—it may be at the boundary. In this case, the condition for a convex function f : K → R to be minimized at x ∗ ∈ K is now ⟨∇ f ( x ∗ ), y − x ∗ ⟩ ≥ 0 for all y ∈ K. (17.11) In other words, all vectors y − x ∗ pointing within K are “positively correlated” with the gradient. 17.3.1 This is the analog of the minimizer of a single variable function being achieved either at a point where the derivative is zero, or at the boundary. Projected Gradient Descent While the gradient descent algorithm still makes sense: moving in the direction opposite to the gradient still moves us towards lower When x ∗ is in the interior of K, the condition (17.11) is equivalent to ∇ f ( x ∗ ) = 0. 214 constrained convex minimization function values. But we must change our algorithm to ensure that the new point xt+1 lies within K. To ensure this, we simply project the new iterate xt+1 back onto K. Let projK : Rn → K be defined as projK (y) = arg minx∈K ∥ x − y∥2 . The modified algorithm is given below in Algorithm 15, with the changes highlighted in blue. xt Algorithm 15: Projected Gradient Descent For CCM x1 ← starting point for t ← 1 to T do 15.3 xt′ +1 ← xt − η · ∇ f ( xt ) 15.4 xt+1 ← projK ( xt′ +1 ) 15.1 x t +1 15.2 T 15.5 return xb := T1 ∑ xt t =1 We will show below that a result almost identical to that of Theorem 17.8, and hence that of Proposition 17.7 holds. Proposition 17.10. Let K be a closed convex set, and f : K → R be convex, differentiable and G-Lipschitz. Let x ∗ ∈ K, and define T := ∥x −x∗ ∥ G 2 ∥ x0 − x ∗ ∥2 and ε2 η := 0 √ . Then the solution xb returned by projected gradient descent G T satisfies f ( xb) ≤ f ( x ∗ ) + ε. (17.12) In particular, this holds when x ∗ is a minimizer of f . Proof. We can reduce to an analogous constrained version of Theorem 17.8. Let us start the proof as before: Φ t +1 − Φ t =  1 ∥ x t +1 − x ∗ ∥ 2 − ∥ x t − x ∗ ∥ 2 2η (17.13)  1 ∥ xt′ +1 − x ∗ ∥2 − ∥ xt − x ∗ ∥2 . 2η (17.14) But xt+1 is the projection of xt′ +1 onto K, which is difficult to reason about. Also, we know that −η ∇ f ( xt ) = xt′ +1 − x ∗ , not xt+1 − x ∗ , so we would like to move to the point xt′ +1 . Indeed, we claim that xt′ +1 − x ∗ ≥ ∥ xt+1 − x ∗ ∥, and hence we get Φ t +1 − Φ t = Now the rest of the proof of Theorem 17.8 goes through unchanged. Why is the claim xt′ +1 − x ∗ ≥ ∥ xt+1 − x ∗ ∥ true? Since K is convex, projecting onto it gets us closer to every point in K, in particular to x ∗ ∈ K. To formally prove this fact about projections, consider the angle x ∗ → xt+1 → xt′ +1 . 
This is a non-acute angle, since the orthogonal projection means K likes to one side of the hyperplane defined by the vector xt′ +1 − xt+1 , as in the figure on the right. Figure 17.3: Projection onto a convex body xt′ +1 the gradient descent framework 215 Note that restricting the play to K can be helpful in two ways: we can upper-bound the distance ∥ x ∗ − x1 ∥ by the diameter of K, and moreover we need only consider the Lipschitzness of f for points within K. 17.4 Online Gradient Descent, and Relationship with MW We considered gradient descent for the offline convex minimization problem, but one can use it even when the function changes over time. Indeed, consider the online convex optimization (OCO) problem: at each time step t, the algorithm proposes a point xt ∈ K and an adversary gives a function f t : K → R with ∥∇ f t ∥ ≤ G. The cost of each time step is f t ( xt ) and your objective is to minimize regret = ∑ f t ( xt ) − min ∑ f t ( x ∗ ). ∗ x ∈K t t For instance if K = ∆n , and f t ( x ) := ⟨ℓt , x ⟩ for some loss vector ℓt ∈ [−1, 1]n , then we are back in the experts setting of the previous chapters. Of course, the OCO problem is far more general, allowing arbitrary convex functions. Surprisingly, we can use the almost same algorithm to solve the OCO problem, with one natural modification: the update rule is now taken with respect to gradient of the current function f t : x t +1 ← x t − η · ∇ f t ( x t ). Looking back at the proof in §17.2, the proof of Lemma 17.9 immediately extends to give us 1 f t ( xt ) + Φt+1 − Φt ≤ f t ( x ∗ ) + ηG2 . 2 Now summing this over all times t gives T  T  η ∑ f t (xt ) − f t (x∗ ) ≤ ∑ Φt − Φt+1 + 2 TG2 t =1 t =1 1 ≤ Φ1 + ηTG2 , 2 since Φ T +1 ≥ 0. The proof is now unchanged: setting T ≥ and η = ∥ x1 − x ∗ ∥2 G 2 ε2 ∗ ∥ x1 − √x ∥ , and doing some elementary algebra as above, G T  ∥ x − x∗ ∥G 1 T f t ( xt ) − f t ( x ∗ ) ≤ 1 √ ≤ ε. ∑ T t =0 T 17.4.1 Comparison to the MW/Hedge Algorithms One advantage of the gradient descent approach (and analysis) over the multiplicative weight-based ones is that the guarantees here hold This was first observed by Martin Zinkevich in 2002, when he was a Ph.D. student here at CMU. 216 stronger assumptions for all convex bodies K and all convex functions, as opposed to being just for the unit simplex ∆n and linear losses f t ( x ) = ⟨ℓt , x ⟩, say for ℓt ∈ [−1, 1]n . However, in order to make a fair comparison, suppose we restrict ourselves to ∆n and linear losses, and consider the number of rounds T before we get an average regret of ε. • If we consider ∥ x1 − x ∗ ∥ (which, in the worst case, is the diameter of K), and G (which is an upper bound on ∥∇ f t ( x )∥ over points in K) as constants, then the T = Θ( ε12 ) dependence is the same. √ • For a more quantitative comparison, note that ∥ x1 − x ∗ ∥ ≤ 2 for √ x1 , x ∗ ∈ ∆n , and ∥∇ f t ( x )∥ = ∥ℓt ∥ √ ≤ n for ℓt ∈ [−1, 1]n . Hence,  log n  Proposition 17.10 gives us T = Θ ε2n , as opposed to T = Θ ε2 for multiplicative weights. The problem, at a high level, is that we are “choosing the wrong norm”: when dealing with probabilities, the “right” norm is the ℓ1 norm and not the Euclidean ℓ2 norm. In the next lecture we will formalize what this means, and how this dependence on n be improved via the Mirror Descent framework. 17.5 Stronger Assumptions If the function f is “well-behaved”, we can improve the guarantees for gradient descent in two ways: we can reduce the dependence on ε, and we can weaken (or remove) the dependence on the parameters G and ∥ x1 − x ∗ ∥. 
There are two standard assumptions to make on the convex function: that it is “not too flat” (captured by the idea of strong convexity), and it is not “not too curved” (i.e., it is smooth). We now use these assumptions to improve the guarantees. 17.5.1 Strongly-Convex Functions Definition 17.11 (Strong Convexity). A function f : K → R is αstrongly convex if for all x, y ∈ K, any of the following holds: 1. (Zeroth order) f (λx + (1 − λ)y) ≤ λ f ( x ) + (1 − λ) f (y) − α2 λ(1 − λ)∥ x − y∥2 for all λ ∈ [0, 1]. 2. (First order) If f is differentiable, then f (y) ≥ f ( x ) + ⟨∇ f ( x ), y − x ⟩ + α ∥ x − y ∥2 . 2 (17.15) 3. (Second order) If f is twice-differentiable, then all eigenvalues of H f ( x ) are at least α at every point x ∈ K. the gradient descent framework We will work with the first-order definition, and show that the  1 gradient descent algorithm with (time-varying) step size ηt = O αt 2 converges to a value at most f ( x ∗ ) + ε in time T = Θ( Gαε ). Note there is no more dependence on the diameter of the polytope. Before we give this proof, let us give the other relevant definitions. 17.5.2 Smooth Functions Definition 17.12 (Lipschitz Smoothness). A function f : K → R is β-(Lipschitz)-smooth if for all x, y ∈ K, any of the following holds: β 1. (Zeroth order) f (λx + (1 − λ)y) ≥ λ f ( x ) + (1 − λ) f (y) − 2 λ(1 − λ)∥ x − y∥2 for all λ ∈ [0, 1]. 2. (First order) If f is differentiable, then f (y) ≤ f ( x ) + ⟨∇ f ( x ), y − x ⟩ + β ∥ x − y ∥2 . 2 (17.16) 3. (Second order) If f is twice-differentiable, then all eigenvalues of H f ( x ) are at most β at every point x ∈ K. In this case, the gradient descent algorithm with fixed step size  ηt = η = O β1 yields an xb which satisfies f ( xb) − f ( x ∗ ) ≤ ε when β ∥ x1 − x ∗ ∥  T = Θ . In this case, note we have no dependence on the ε Lipschitzness G any more; we only depend on the diameter of the polytope. Again, we defer the proof for the moment. 17.5.3 Well-conditioned Functions Functions that are both β-smooth and α-strongly convex are called well-conditioned functions. From the facts above, the eigenvalues of their Hessian H f must lie in the interval [α, β] at all points x ∈ K. In this case, we get a much stronger convergence—we can achieve ε-closeness in time T = Θ(log 1ε ), where the constant depends on the condition number κ = β/α. Theorem 17.13. For a function f which is β-smooth and α-strongly convex, let x ∗ be the solution to the unconstrained convex minimization problem arg minx∈Rn f ( x ). Then running gradient descent with ηt = 1/β gives f ( xt ) − f ( x ∗ ) ≤  −t  β exp ∥ x1 − x ∗ ∥2 . 2 κ Proof. For β-smooth f , we can use Definition 17.12 to get β f ( xt+1 ) ≤ f ( xt ) − η ∥∇ f ( xt )∥2 + η 2 ∥∇ f ( xt )∥2 . 2 217 218 stronger assumptions The right hand side is minimized by setting η = β1 , when we get f ( x t +1 ) − f ( x t ) ≤ − 1 ∥∇ f ( xt )∥2 . 2β (17.17) For α-strongly-convex f , we can use Definition 17.11 to get: α ∥ x t − x ∗ ∥2 , 2 α ≤ ∥∇ f ( xt )∥ ∥ xt − x ∗ ∥ − ∥ xt − x ∗ ∥2 , 2 1 2 ≤ (17.18) ∥∇ f ( xt )∥ , 2α f ( xt ) − f ( x ∗ ) ≤ ⟨∇ f ( xt ), xt − x ∗ ⟩ − where we use that the right hand side is maximized when ∥ xt − x ∗ ∥ = ∥∇ f ( xt )∥ /α. Now combining with (17.17) we have that α f ( x t +1 ) − f ( x t ) ≤ − β  ∗  f ( xt ) − f ( x ) , (17.19) or setting ∆t = f ( xt ) − f ( x ∗ ) and rearranging, we get ∆ t +1 ≤  1− α β  ∆t ≤  1− 1 κ t   t ∆1 ≤ exp − · ∆1 . 
κ We can control the value of ∆1 by using (17.16) in x = x ∗ , y = x1 ; β since ∇ f ( x ∗ ) = 0, get ∆1 = f ( x1 ) − f ( x ∗ ) ≤ 2 ∥ x1 − x ∗ ∥2 . Strongly-convex (and hence well-conditioned) functions have the nice property that if f ( x ) is close to f ( x ∗ ) then x is close to x ∗ : intuitively, since the function is curving at least quadratically, the function values at points far from the minimizer must be significant. Formally, use (17.15) with x = x ∗ , y = xt and the fact that ∇ f ( x ∗ ) = 0 to get ∥ x t − x ∗ ∥2 ≤ 2 ( f ( xt ) − f ( x ∗ )). α We leave it as an exercise to show the claimed convergence bounds using just strong convexity, or just smoothness. (Hint: use the statements proved in (17.17) and (17.18). Before we end, a comment on the strong O(log 1/ε) convergence result for well-conditioned functions. Suppose the function values lies in [0, 1]. The Θ(log 1/ε) error bound means that we are correct up to b bits of precision—i.e., have error smaller than ε = 2−b —after Θ(b) steps. In other words, the number of bits of precision is linear in the number of iterations. The optimization literature refers to this as linear convergence, which can be confusing when you first see it. the gradient descent framework 17.6 Extensions and Loose Ends 17.6.1 Subgradients What if the convex function f is not differentiable? Staring at the proofs above, all we need is the following: Definition 17.14 (Subgradient). A vector z x is called a subgradient at point x if f (y) ≥ f ( x ) + ⟨z x , y − x ⟩ for all y ∈ Rn . Now we can use subgradients at the point x wherever we used ∇ f ( x ), and the entire proof goes through. In some cases, an approximate subgradient may also suffice. 17.6.2 Stochastic Gradients, and Coordinate Descent 17.6.3 Acceleration 17.6.4 Reducing to the Well-conditioned Case 219 18 Mirror Descent The gradient descent algorithm of the previous chapter is general and powerful: it allows us to (approximately) minimize convex functions over convex bodies. Moreover, it also works in the model of online convex optimization, where the convex function can vary over time, and we want to find a low-regret strategy—one which performs well against every fixed point x ∗ . This power and broad applicability means the algorithm is not always the best for specific classes of functions and bodies: for instance, for minimizing linear functions over the probability simplex ∆n , we saw in §17.4.1 that the generic gradient descent algorithm does significantly worse than the specialized Hedge algorithm. This suggests asking: can we somehow change gradient descent to adapt to the “geometry” of the problem? The mirror descent framework of this section allows us to do precisely this. There are many different (and essentially equivalent) ways to explain this framework, each with its positives. We present two of them here: the proximal point view, and the mirror map view, and only mention the others (the preconditioned or quasi-Newton gradient flow view, and the follow the regularized leader view) in passing. 
18.1 Mirror Descent: the Proximal Point View Here is a different way to arrive at the gradient descent algorithm from the last lecture: Indeed, we can get an expression for xt+1 by Algorithm 16: Proximal Gradient Descent Algorithm x1 ← starting point 16.2 for t ← 1 to T do 16.3 xt+1 ← arg minx {η ⟨∇ f t ( xt ), x ⟩ + 21 ∥ x − xt ∥2 } 16.1 setting the gradient of the function to zero; this gives us the expres- 222 mirror descent: the proximal point view sion η · ∇ f t ( x t ) + ( x t +1 − x t ) = 0 =⇒ x t +1 = x t − η · ∇ f t ( x t ), which matches the normal gradient descent algorithm. Moreover, the intuition for this algorithm also makes sense: if we want to minimize the function f t ( x ), we could try to minimize its linear approximation f t ( xt ) + ⟨∇ f t ( xt ), x − xt ⟩ instead. But we should be careful not to “over-fit”: this linear approximation is good only close to the point xt , so we could add in a penalty function (a “regularizer”) to prevent us from straying too far from the point xt . This means we should minimize xt+1 ← arg min{ f t ( xt ) + ⟨∇ f t ( xt ), x − xt ⟩ + x 1 ∥ x − x t ∥2 } 2 or dropping the terms that don’t depend on x, xt+1 ← arg min{⟨∇ f t ( xt ), x ⟩ + x 1 ∥ x − x t ∥2 } 2 (18.1) If we have a constrained problem, we can change the update step to: xt+1 ← arg min{η ⟨∇ f t ( xt ), x ⟩ + x ∈K 1 ∥ x − x t ∥2 } 2 (18.2) The optimality conditions are a bit more complicated now, but they again can show this algorithm is equivalent to projected gradient descent from the previous chapter. Given this perspective, we can now replace the squared Euclidean norm by other distances to get different algorithms. A particularly useful class of distance functions are Bregman divergences, which we now define and use. 18.1.1 Bregman Divergences Given a strictly convex function h, we can define a distance based on how the function differs from its linear approximation: Definition 18.1. The Bregman divergence from x to y with respect to function h is Dh (y∥ x ) := h(y) − h( x ) − ⟨∇h( x ), y − x ⟩. The figure on the right illustrates this definition geometrically for a univariate function h : R → R. Here are a few examples: 1. For the function h( x ) = 21 ∥ x ∥2 from Rn to R, the associated Bregman divergence is Dh (y∥ x ) = 21 ∥y − x ∥2 , the squared Euclidean distance. Figure 18.1: Dh (y∥ x ) for function h : R → R. mirror descent 223 2. For the (un-normalized) negative entropy function h( x ) = ∑in=1 ( xi ln xi − x i ),  y Dh (y∥ x ) = ∑i yi ln xi − yi + xi . i Using that ∑i yi = ∑i xi = 1 for y, x ∈ ∆n gives us Dh (y∥ x ) = y ∑i yi ln xii for x, y ∈ ∆n : this is the Kullback-Leibler (KL) divergence between probability distributions. Many other interesting Bregman divergences can be defined. 18.1.2 Changing the Distance Function Since the distance function 21 ∥ x − y∥2 in (18.1) is a Bregman divergence, what if we replace it by a generic Bregman divergence: what algorithm do we get in that case? Again, let us first consider the unconstrained problem, with the update: xt+1 ← arg min{η ⟨∇ f t ( xt ), x ⟩ + Dh ( x ∥ xt )}. x Again, setting the gradient at xt+1 to zero (i.e., the optimality condition for xt+1 ) now gives: η ∇ f t ( xt ) + ∇h( xt+1 ) − ∇h( xt ) = 0, or, rephrasing ∇ h ( x t +1 ) = ∇ h ( x t ) − η ∇ f t ( x t ) =⇒ xt+1 = ∇h−1 ∇h( xt ) − η ∇ f t ( xt ) Let’s consider this for our two running examples:  (18.3) (18.4) 1. When h( x ) = 12 ∥ x ∥2 , the gradient ∇h( x ) = x. So we get x t +1 = x t − η ∇ f t ( x t ), the standard gradient descent update. 2. 
When h( x ) = ∑i ( xi ln xi − xi ), then ∇h( x ) = (ln x1 , . . . , ln xn ), so ( xt+1 )i = eln(xt )i −η ∇ f t (xt ) = ( xt )i e−η ∇ f t (xt ) . Now if f t ( x ) = ⟨ℓt , x ⟩, its gradient is just the vector ℓt , and we get back precisely the weights maintained by the Hedge algorithm! The same ideas also hold for constrained convex minimization: we now have to search for the minimizer within the set K. In this case the algorithm using negative entropy results in the same Hedgelike update, followed by scaling the point down to get a probability vector, thereby giving the probability values in Hedge. To summarize: this algorithm that tries to minimize the linear ap- What is the “right” choice of h to minimize the function f ? A little thought shows that h should equal f , because adding D f ( x ∥ xt ) to the linear approximation of f at xt gives us back exactly f . Of course, the update now requires us to minimize f ( x ), which is the original problem. So we should choose an h that is somehow “similar” to f , and yet such that the update step is tractable. 224 mirror descent: the mirror map view Algorithm 17: Proximal Gradient Descent Algorithm x1 ← starting point 17.2 for t ← 1 to T do 17.3 xt+1 ← arg minx∈K {η ⟨∇ f t ( xt ), x ⟩ + Dh ( x ∥ xt )} 17.1 proximation of the function, regularized by a Bregman distance Dh , gives us vanilla gradient descent for one choice of h (which is good for quadratic-like functions over Euclidean space), and Hedge for another choice of h (which is good for linear functions over the space of probability distributions). Indeed, depending on how we choose the function, we can get different properties from this algorithm—this is the mirror descent framework. 18.2 Mirror Descent: The Mirror Map View A different view of the mirror descent framework is the one originally presented by Nemirovski and Yudin. They observe that in gradient descent, at each step we set xt+1 = xt − η ∇ f t ( xt ). However, the gradient was actually defined as a linear functional on Rn and hence naturally belongs to the dual space of Rn . The fact that we represent this functional (i.e., this covector) as a vector is a matter of convenience, and we should exercise care. In the vanilla gradient descent method, we were working in Rn endowed with ℓ2 -norm, and this normed space is self-dual, so it is perhaps reasonable to combine points in the primal space (the iterates xt of our algorithm) with objects in the dual space (the gradients). But when working with other normed spaces, adding a covector ∇ f t ( xt ) to a vector xt might not be the right thing to do. Instead, Nemirovski and Yudin propose the following: A linear functional on vector space X is a linear map from X into its underlying field F. 1. we map our current point xt to a point θt in the dual space using a mirror map. 2. Next, we take the gradient step θ t +1 ← θ t − η ∇ f t ( x t ). (18.5) 3. We map θt+1 back to a point in the primal space xt′ +1 using the inverse of the mirror map from Step 1. 4. If we are in the constrained case, this point xt′ +1 might not be in the convex feasible region K, so we to project xt′ +1 back to a “closeby” xt+1 in K. Figure 18.2: The four basic steps in each iteration of the mirror descent algorithm mirror descent 225 The name of the process comes from thinking of the dual space as being a mirror image of the primal space. But how do we choose these mirror maps? 
Again, this comes down to understanding the geometry of the problem, the kinds of functions and the set K we care about, and the kinds of guarantees we want. In order to discuss these, let us discuss the notion of norms in some more detail. 18.2.1 Norms and their Duals Definition 18.2 (Norm). A function ∥ · ∥ : Rn → R is a norm if • If ∥ x ∥ = 0 for x ∈ Rn , then x = 0; • for α ∈ R and x ∈ Rn we have ∥αx ∥ = |α|∥ x ∥; and • for x, y ∈ Rn we have ∥ x + y∥ ≤ ∥ x ∥ + ∥y∥. The well-known ℓ p -norms for p ≥ 1 are defined by n ∥ x ∥ p := ( ∑ | xi | p )1/p i =1 for x ∈ Rn . The ℓ ∞ -norm is given by n ∥ x ∥∞ := max | xi | i =1 for x ∈ Rn . Definition 18.3 (Dual Norm). Let ∥ · ∥ be a norm. The dual norm of ∥ · ∥ is a function ∥ · ∥∗ defined as ∥y∥∗ := sup{⟨ x, y⟩ : ∥ x ∥ ≤ 1}. The dual norm of the ℓ2 -norm is again the ℓ2 -norm; the Euclidean norm is self-dual. The dual for the ℓ p -norm is the ℓq -norm, where 1/p + 1/q = 1. Corollary 18.4 (Cauchy-Schwarz for General Norms). For x, y ∈ Rn , we have ⟨ x, y⟩ ≤ ∥ x ∥ ∥y∥∗ . Proof. Assume ∥ x ∥ ̸= 0, otherwise both sides are 0. Since ∥ x/∥ x ∥∥ = 1, we have ⟨ x/∥ x ∥, y⟩ ≤ ∥y∥∗ . Theorem 18.5. For a finite-dimensional space with norm ∥ · ∥, we have (∥ · ∥∗ )∗ = ∥ · ∥. Using the notion of dual norms, we can give an alternative characterization of Lipschitz continuity for a norm ∥ · ∥, much like Fact 17.6 for Euclidean norms: Fact 18.6. For f be a differentiable function. Then f is G-Lipschitz with respect to norm ∥ · ∥ if and only if for all x ∈ R, ∥∇ f ( x )∥∗ ≤ G. Figure 18.3: The unit ball in ℓ1 -norm (Green), ℓ2 -norm (Blue), and ℓ∞ -norm (Red). 226 mirror descent: the mirror map view 18.2.2 Defining the Mirror Maps To define a mirror map, we first fix a norm ∥ · ∥, and then choose a differentiable convex function h : Rn → R that is α-strongly-convex with respect to this norm. Recall from §17.5.1 that such a function must satisfy h(y) ≥ h( x ) + ⟨∇h( x ), y − x ⟩ + α ∥ y − x ∥2 . 2 We use two familiar examples: 1. h( x ) = 12 ∥ x ∥22 is 1-strongly convex with respect to ∥ · ∥2 , and 2. h( x ) := ∑in=1 xi (log xi − 1) is 1-strongly convex with respect to ∥ · ∥1 ; the proof of this is called Pinsker’s inequality. Having fixed ∥ · ∥ and h, the mirror map is Check out the two proofs pointed to by Aryeh Kontorovich, or this proof (part 1, part 2) by Madhur Tulsiani. ∇(h) : Rn → Rn . Since h is differentiable and strongly-convex, we can define the inverse map as well. This defines the mappings that we use in the Nemirovski-Yudin process: we set θt = ∇ h( xt ) and xt′ +1 = (∇h)−1 (θt+1 ). For our first running example of h( x ) = 21 ∥ x ∥2 , the gradient (and hence its inverse) is the identity map. For the (un-normalized) negative entropy example, (∇h( x ))i = ln xi , and hence (∇h)−1 (θ )i = eθi . 18.2.3 The Algorithm (Again) Let us formally state the algorithm again, before we state and prove a theorem about it. Suppose we want to minimize a convex function f over a convex body K ⊆ Rn . We first fix a norm ∥ · ∥ on Rn and choose a distance-generating function h : Rn → R, which gives the mirror map ∇h : Rn → Rn . In each iteration of the algorithm, we do the following: (i) Map to the dual space θt ← ∇h( xt ). (ii) Take a gradient step in the dual space: θt+1 ← θt − ηt · ∇ f t ( xt ). (iii) Map θt+1 back to the primal space xt′ +1 ← (∇h)−1 (θt+1 ). (iv) Project xt′ +1 back into the feasible region K by using the Bregman divergence: xt+1 ← minx∈K Dh ( x ∥ xt′ +1 ). In case xt′ +1 ∈ K, e.g., in the unconstrained case, we get xt+1 = xt′ +1 . 
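To make the four steps concrete, here is a small Python sketch of one mirror-descent iteration for our two running examples. The step size and the loss vectors in the driver are made up purely for illustration, and the Bregman projection is only implemented for the negative-entropy/simplex case, where it reduces to a rescaling.

```python
import numpy as np

def md_step_euclidean(x, grad, eta):
    # h(x) = 1/2 ||x||^2: the mirror map (grad h) is the identity, so steps
    # (i)-(iii) collapse to the usual gradient-descent update.
    return x - eta * grad

def md_step_neg_entropy(x, grad, eta):
    # h(x) = sum_i x_i (ln x_i - 1): here (grad h)(x) = ln x coordinate-wise,
    # and (grad h)^{-1}(theta) = e^theta.
    theta = np.log(x)                  # (i)   map to the dual space
    theta = theta - eta * grad         # (ii)  gradient step in the dual space
    x_new = np.exp(theta)              # (iii) map back to the primal space
    return x_new / x_new.sum()         # (iv)  Bregman (KL) projection onto the
                                       #       simplex, which is just rescaling

if __name__ == "__main__":
    n, T, eta = 5, 100, 0.1            # placeholder sizes and step size
    rng = np.random.default_rng(0)
    x = np.full(n, 1.0 / n)            # start at the uniform distribution
    for _ in range(T):
        ell = rng.uniform(-1.0, 1.0, size=n)   # linear loss f_t(x) = <ell, x>
        x = md_step_neg_entropy(x, ell, eta)   # exactly the Hedge update
    print(x.round(3), x.sum())
```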
Note that the choice of h affects almost every step of this algorithm. The function h used in this way is often called a distance-generating function. mirror descent 18.3 The Analysis We prove the following guarantee for mirror descent, which captures the guarantees for both Hedge and gradient descent, and for other variants that you may use. Theorem 18.7 (Mirror Descent Regret Bound). Let ∥ · ∥ be a norm on Rn , and h be an α-strongly convex function with respect to ∥ · ∥. Given f 1 , . . . , f T be convex, differentiable functions such that ∥∇ f t ∥∗ ≤ G, the mirror descent algorithm starting with x0 and taking constant step size η in every iteration produces x1 , . . . , x T such that for any x ∗ ∈ Rn , T T t =1 t =1 ∑ f t ( xt ) ≤ ∑ f t ( x ∗ ) + Dh ( x ∗ ∥ x1 ) η ∑tT=1 ∥∇ f t ( xt )∥2∗ + . η 2α | {z } (18.6) regret Before proving Theorem 18.7, observe that when ∥ · ∥ is the ℓ2 norm and h = 21 ∥ · ∥2 , the regret term is ∥ x ∗ − x1 ∥22 η ∑tT=1 ∥∇ f t ( xt )∥22 + , 2η 2 which is what Theorem 17.8 guarantees. Similarly, if ∥ · ∥ is the ℓ1 norm and h is the negative entropy, the regret versus any point x ∗ ∈ ∆n is x∗ η ∑tT=1 ∥∇ f t ( xt )∥2∞ 1 n ∗ xi ln i + . ∑ η i =1 ( x1 )i 2/ ln 2 For linear functions f t ( x ) = ⟨ℓt , x ⟩ with ℓt ∈ [−1, 1]n , and x1 = n1 · 1, the regret is KL( x ∗ ∥ x1 ) ηT + η 2/ ln 2 ≤ ln n + ηT. η The last inequality uses that the KL divergence to the uniform distribution on n items is at most ln n. (Exercise!) In fact, if we start with a distribution x1 that is closer to x ∗ , the first term of the regret gets smaller. 18.3.1 227 The Proof of Theorem 18.7 The proof here is very similar in spirit to that of Theorem 17.8: we give a potential function Φt = Dh ( x ∗ ∥ x t ) η and bound the amortized cost at time t as follows: f t ( xt ) − f t ( x ∗ ) + (Φt+1 − Φt ) ≤ f t ( x ∗ ) + blaht . (18.7) The theorem is stated for the unconstrained version, but extending it to the constrained version is an easy exercise. 228 the analysis Summing over all times, T T t =1 t =1 T ∑ f t (xt ) − ∑ f t (x∗ ) ≤ Φ1 − ΦT+1 + ∑ blaht t =1 T ≤ Φ1 + ∑ blaht = t =1 T Dh ( x ∗ ∥ x1 ) + ∑ blaht . η t =1 The last inequality above uses that the Bregman divergence is always non-negative for convex functions. To complete the proof, it remains η to show that blaht in inequality (18.7) can be made 2α ∥∇ f t ( xt )∥2∗ . Let us focus on the unconstrained case where xt+1 = xt′ +1 , and prove an analog of Lemma 17.9 for our generalized setting: Lemma 18.8 (Potential Change). Φt+1 − Φt ≤ ⟨∇ f t ( xt ), x ∗ − xt ⟩ + η ∥∇ f t ( xt )∥2∗ . 2α Note that we use the dual norm ∥ · ∥∗ for the gradient. Moreover, restricting Lemma 18.8 to the case of h( x ) = ∥ x ∥2 and using the fact that the Euclidean norm is self-dual gives us back Lemma 17.9 bounding the potential change for standard gradient descent. The proof of Lemma 18.8 can be skipped at the first reading; the calculations are simple, but they rely crucially on the strong-convexity of h, and the mirror-descent update rule. Proof of Lemma 18.8. The change in potential is Φ t +1 − Φ t =  1 D ( x ∗ ∥ x t +1 ) − Dh ( x ∗ ∥ x t ) ; {z } η |h (⋆) now using the definition of the divergence, (⋆) = h( x ∗ ) − h( xt+1 ) − ⟨∇h( xt+1 ), x ∗ − xt+1 ⟩ − h( x ∗ ) + h( xt ) + ⟨∇h( xt ), x ∗ − xt ⟩ | {z } | {z } θ t +1 ∗ θt ∗ = h( xt ) − h( xt+1 ) − ⟨θt+1 , x − xt+1 ⟩ + ⟨θt , ( x − xt+1 ) + ( xt+1 − xt )⟩. (18.8) Now we can use the α-strong convexity of h wrt to ∥ · ∥ to claim h ( x t +1 ) ≥ h ( x t ) + ⟨ θ t , x t +1 − x t ⟩ + α ∥ x − x t ∥2 . 
2 t +1 Substituting into (18.8), α (⋆) ≤ − ∥ xt+1 − xt ∥2 + ⟨θt − θt+1 , ( xt − xt+1 ) + ( x ∗ − xt )⟩ 2 α ≤ − ∥ xt+1 − xt ∥2 + ∥η ∇ f t ( xt )∥∗ ∥ xt − xt+1 ∥ +η ⟨∇ f t ( xt ), x ∗ − xt ⟩ , | 2 {z } (†) mirror descent where the latter inequality used the update rule (18.5) for mirror descent, and the Cauchy-Schwarz inequality Corollary 18.4 for general norms. Now using the AM-GM inequality shows that (†) ≤ 1 ∥η ∇ f t ( xt )∥2∗ . 2α Finally, remembering that the change in potential is given by η1 (⋆) finishes the proof of Lemma 18.8. The rest of the proof of Theorem 18.7 follows now-familiar lines. Using Lemma 18.8, and then the convexity of f on the first two terms: f t ( xt ) + (Φt+1 − Φt ) ≤ f t ( xt ) + ⟨∇ f t ( xt ), x ∗ − xt ⟩ + ≤ f t (x∗ ) + η ∥∇ f t ( xt )∥2∗ 2α η ∥∇ f t ( xt )∥2∗ . 2α Hence blaht in (18.7) is at most 2α ∥∇ f t ( xt )∥2∗ , as claimed, completing the proof of Theorem 18.7. In order to extend this to the constrained case, we need to show that if xt′ +1 ∈ / K, and xt+1 = arg minx∈K Dh ( x ∥ xt′ +1 ), then η Dh ( x ∗ ∥ xt+1 ) ≤ Dh ( x ∗ ∥ xt′ +1 ) for any x ∗ ∈ K. This is a Generalized Pythagorean Theorem for Bregman divergences, and is left as an exercise. 18.4 Alternative Views of Mirror Descent To complete and flesh out. In this lecture, we reviewed mirror descent algorithm as a gradient descent scheme where we do the gradient step in the dual space. We now provide some alternative views of mirror descent. 18.4.1 Preconditioned Gradient Descent For any given space which we use a descent method on, we can linearly transform the space with some map Q to make the geometry more regular. This technique is known as preconditioning, and improves the speed of the descent. Using the linear transformation Q, our descent rule becomes x t + 1 = x t − η Hh ( x t ) − 1 ∇ f ( x t ) . Some of you may have seen Newton’s method for minimizing convex functions, which has the following update rule: x t +1 = x t − η H f ( x t ) −1 ∇ f ( x t ). 229 230 alternative views of mirror descent This means mirror descent replaces the Hessian of the function itself by the Hessian of a strongly convex function h. Newton’s method has very strong convergence properties (it gets error ε in O(log log 1/ε) iterations!) but is not “robust”—it is only guaranteed to converge when the starting point is “close” to the minimizer. We can view mirror descent as trading off the convergence time for robustness. Fill in more on this view. 18.4.2 As Follow the Regularized Leader 19 The Centroid and Ellipsoid Algorithms Our focus in this chapter is on the constrainted optimization problem: Given a convex function f , a convex set K, and a parameter ε > 0, find a point x̂ ∈ K such that f ( x̂ ) ≤ min f ( x ) + ε. x ∈K In previous sections, we saw gradient descent and mirror descent gave us algorithms whose dependence on ε was like poly(1/ε). Moreover, we have examples that show algorithms based only on local gradient information need time at least polynomial in 1/ε. Where? So can we do better? In this chapter, we show how to use global information to get algorithms for convex programming that have O(log 1/ε)-type convergence guarantees (under suitable assumptions). Specifically, we will examine the Centroid and Ellipsoid algorithms in depth. In turn, these will give us polynomial-time algorithms for Linear Programming problems. 19.1 The Centroid Algorithm In this section, we discuss the Centroid Algorithm in the context of constrained convex minimization. 
Besides being interesting in its own right, it is a good lead-in to Ellipsoid, since it gives some intuition about high-dimensional bodies and their volumes. Given a convex body K ⊆ Rn and a convex function f : K → R, we want to approximately minimize f ( x ) over x ∈ K. As in previous sections, we assume a gradient oracle for f , one that returns the value ∇ f ( x ) for any query point x ∈ K. We also assume that we can perform exact arithmetic over the reals; however, we will soon begin discussing issues that arise from using only finite-precision arithmetic. The algorithms also had some dependence on f and K; e.g., if gradients ∥∇ f ( x )∥2 ≤ G for x ∈ K, and the diameter of K was at most D, then projected gradient descent ran in O(( GD/ε)2 ) time. 232 the centroid algorithm As the name suggests, the algorithm is based on the notion of centroid for compact convex sets. The centroid of a set K is the point c ∈ Rn such that R R x dx x ∈K x dx c := = Rx∈K , vol(K ) x ∈K dx where vol(K ) is the volume of the set K. Since c is the “average” of points in some convex set K, it also lies within K. The following result captures the crucial fact about the centroid that we use in our algorithm. This is the analog of the centroid of a discrete set S = { x1 , x2 , . . . , x N }: centroid(S) := 1 xi . |S| ∑ i Other names for the centroid are the center of gravity, and the barycenter. B. Grünbaum (1960) Theorem 19.1 (Grünbaum’s Theorem). For any compact convex set K ∈ Rn with a centroid c ∈ Rn , and any halfspace H = { x | a⊺ ( x − c) ≥ 0} whose supporing hyperplane passes through c,   1 vol(K ∩ H ) 1 ≤ ≤ 1− . e vol(K ) e This bound of 1/e in Grünbaum’s Theorem is the best possible: e.g., consider the simplex K = { x ∈ [0, 1]n | ∥ x ∥1 ≤ 1} with centroid 1 n+1 1. Defining the halfspace H = { x1 ≥ c }, we get that K ∩ H is a scaled-down copy of K, with volume  1− 1 n+1 n → 1/e as n → ∞. 19.1.1 The Algorithm In 1965, A. Ju. Levin and Donald Newman independently (and on opposite sides of the iron curtain) proposed the following algorithm. Algorithm 18: Centroid(K, f, T) K1 ← K 18.2 for t = 1, . . . T do 18.3 at step t, let ct ← centroid of Kt 18.4 Kt+1 ← Kt ∩ { x | ⟨∇ f (ct ), x − ct ⟩ ≤ 0} b ← arg mint∈{1,...,T } f (ct ) 18.5 return x 18.1 The figure to the right shows a sample execution of the algorithm, where K is initially a ball. (Ignore the body K ε for now.) We find the centroid c1 and compute the gradient ∇ f (c1 ). Instead of moving in the direction opposite to the gradient, we consider the halfspace H1 of vectors negatively correlated with the gradient, restrict our search to K ← K ∩ H1 , and continue. We repeat this step some number of times, and then return the smallest of the function value at all the centroids seen by the algorithm. Note that the algorithm assumes: A.Ju. Levin (1965) D.J. Newman (1965) For most of this chapter, we assume that we can perform exact arithmetic on real numbers. This assumption could be very restrictive loss in generality, since some of our algorithm take squareroots (e.g., when computing ellipsoids). Rounding numbers create all sorts of numerical problems, and a large part of the complication in the actual algorithms comes from these numerical issues. the centroid and ellipsoid algorithms 1. Access to both a gradient oracle and a value oracle for the function f , and 2. access to a procedure that computes the centroid for any compact convex set K. Theorem 19.2. 
Consider a convex set K ⊆ (0, R) ⊆ Rn , and a convex function f : K → R such that let ∥∇ f ( x )∥ ≤ G for all x ∈ K. If xb is the result of the algorithm, and x ∗ = arg minx∈K f ( x ), then f ( xb) − f ( x ∗ ) ≤ 4GR · exp(− T/3n). 233 K ∇ f ( c2 ) • c2 Kε • c1 • ∇ f ( c0 ) c0 ∇ f ( c1 ) Hence, for any ε ≤ 1, as long as T ≥ 3n ln 4GR ε , f ( xb) − f ( x ∗ ) ≤ ε. Proof. For some δ ≤ 1, define the body K δ := {(1 − δ) x ∗ + δx | x ∈ K } as a scaled-down version of K centered at x ∗ . The following facts are immediate: 1. vol(K δ ) = δn · vol(K ). 2. For any points x, y ∈ K, integrating along the path from x to y and using the fact that the gradients are bounded by G gives f ( x ) − f (y) = ≤ Z 1 t =0 ⟨ f (y + t( x − y)), x − y⟩ dt t =0 ∥ f (y + t( x − y))∥∥ x − y∥dt ≤ G ∥ x − y∥ ≤ G · (2R). Z 1 3. The value of f on any point y = (1 − δ) x ∗ + δx ∈ K δ is f (y) = f ((1 − δ) x ∗ + δx ) ≤ (1 − δ) f ( x ∗ ) + δ f ( x ) ≤ f ( x ∗ ) + δ( f ( x ) − f ( x ∗ )) ≤ f ( x ∗ ) + 2δGR. Using Grünbaum’s lemma, the volume falls by a constant factor in each iteration, so vol(Kt ) ≤ vol(K ) · (1 − 1e )t . If we define δ := 2(1 − 1/e) T/n , then after T steps the volume of KT is smaller than that of K δ , so some point of K δ must have been cut off. Consider such a step t such that K δ ⊆ Kt but K δ ̸⊆ Kt+1 . Let y ∈ K δ ∩ (Kt \ Kt+1 ) be a point that is “cut off”. By convexity we have f (y) ≥ f (ct ) + ⟨∇ f (ct ), y − ct ⟩ ; moreover, ⟨∇ f (ct ), y − ct ⟩ > 0 since the cut-off point y ∈ Kt \ Kt+1 . Hence the corresponding centroid has value f (ct ) < f (y) ≤ f ( x ∗ ) + 2δGR. Since xb is the centroid with the smallest function value, we get  T/n f ( xb) − f ( x ∗ ) ≤ 2GR · 2 1 − 1/e ≤ 4GR exp(− T/3n). The second claim follows by substituting T ≥ 3n ln 4GR ε into the first claim, and simplifying. Figure 19.1: Sample execution of first three steps of the Centroid Algorithm. 234 multi-dimensional binary search 19.1.2 Comments on the Runtime The number of iterations T needed by the Centroid algorithm to get an error of ε is O(n log( GR/ε)); compare this linear convergence to gradient descent requiring O(( GR/ε)2 ) steps in the same setting. One downside with this approach is that the number of iterations explicitly depends on the number of dimensions n, whereas gradient descent does not. Another all-important question is: how do we compute the centroid? This is a difficult problem—it is #P-hard to do exactly, which means it is at least as hard as counting the number of satisfying assignments to a SAT instance. (You will see this in a homework problem.) In 2002, Dimitris Bertsimas and Santosh Vempala suggested a way to find approximate centroids by sampling random points from convex bodies (which in turn is done via random walks). Combined with a robust version of Grünbaum’s theorem gives us a polynomial-time version of the algorithm. 19.2 D. Bertsimas and S. Vempala (2006) Multi-Dimensional Binary Search Let us put the Centroid algorithm in a broader context. Given a convex body, one of the canonical ways of specifying it will be via a separation oracle. Definition 19.3 (Strong Separation Oracle). For a convex set K ⊆ Rn , a strong separation oracle for K is an algorithm that takes a point z ∈ Rn and correctly outputs one of: (i) Yes (i.e., z ∈ K), or (ii) No (i.e., z ̸∈ K), as well as a separating hyperplane given by a ∈ Rn , b ∈ R such that K ⊆ { x ∈ Rn | ⟨ a, x ⟩ ≤ b} but ⟨ a, z⟩ > b. The Hahn-Banach separation theorem ensures that exactly one of the two cases can hold for any x and K. 
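For intuition, here is what a strong separation oracle might look like when K is an explicitly given polytope, written here as K = {x : Ax ≤ b} so that any violated row is already a separating hyperplane in the form required by Definition 19.3. The tolerance and the test points below are placeholders.

```python
import numpy as np

def separation_oracle_polytope(A, b, z, tol=1e-9):
    """Strong separation oracle for K = {x : Ax <= b}.

    Returns (True, None, None) if z is in K; otherwise (False, a, beta),
    where K is contained in {x : <a, x> <= beta} but <a, z> > beta.
    """
    slack = A @ z - b
    i = int(np.argmax(slack))          # most violated constraint, if any
    if slack[i] <= tol:
        return True, None, None
    return False, A[i], b[i]           # the violated row separates z from K

if __name__ == "__main__":
    # K is the unit square [0,1]^2, written as Ax <= b (made-up example).
    A = np.array([[1., 0.], [-1., 0.], [0., 1.], [0., -1.]])
    b = np.array([1., 0., 1., 0.])
    print(separation_oracle_polytope(A, b, np.array([0.5, 0.5])))   # inside
    print(separation_oracle_polytope(A, b, np.array([2.0, 0.5])))   # outside
```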
Our goal now is to solve the following feasibility problem: Given access to a strong separation oracle for a convex body K, as well as positive values R, r such that (a) K ⊆ B(0, R) ⊆ Rn , and (b) the body K is either empty, or else there is some unknown (full-dimensional) r-ball B(c, r ) ⊆ K. If K ̸= ∅, output a point x ∈ K, else say K = ∅. 19.2.1 Recall that linear convergence refers to a rate where we the number of bits of precision increases linearly: i.e., the number of iterations is logarithmic in 1/ε. Feasibility using Centroids The ideas behind the Centroid algorithm also solves the feasibility problem: An ε-weak separation oracle is one where we are just ensured that ⟨ a, x ⟩ > ⟨ a, y⟩ − ε for all y ∈ K. Specifying K via a weak separation oracle makes all our tasks much more challenging; in this course we restrict our discussions to strong separation, and defer the generalization to the GLS book. the centroid and ellipsoid algorithms Algorithm 19: CentroidFeasibility(K, R, r) E0 ← B(0, R) 19.2 for t = 0, 1, . . . T : = 3n ln( R/r ) do 19.3 query strong separation oracle on ct , the centroid of Et 19.4 if ct ̸∈ K then 19.5 at ← direction from strong separation oracle 19.6 Et+1 ← Et ∩ { x | ⟨∇ at , x − ct ⟩ ≤ 0} 19.7 else 19.8 output “ct ∈ K” and stop 19.9 output “K = ∅” 19.1 The argument is nearly identical to the one we saw above: 1. By Grünbaum’s theorem, vol(Et ) ≤ (1 − 1/e)t vol(E0 ) = (1 − 1/e)t vol( B(0, R)), which gives an upper bound on Et ’s volume. 2. Suppose K ̸= ∅. If none of the centers ct′ for t′ < t belong to K, then K ⊆ Et and hence vol(Et ) ≥ vol( B(c, r )). This gives a lower bound on the volume of Et , as long as none of the centroids fall within our target convex body K. vol( B(0,R)) 3. Putting the above statements togehter, (1 − 1/e)t ≤ vol( B(c,r)) = ( R/r )n . This means that if we do not find a point in K within 3n log( R/r ) steps, K must be empty! This approach is very flexible: we just need to (efficiently) maintain a sequence of bodies {Et }t such that for each step t: (a) vol(Et+1 ) ≤ vol(Et ) · (1 − δ) for some δ > 0, (b) if each of the “test points” c1 , c2 , . . . , ct did not belong to K, then K ⊆ E t +1 . Following the same outline with these properties gives an iteration  complexity of O nδ log Rr . Maybe this more abstract view allows us to get an efficient algorithm (since computing centroids is #P-hard)? 19.2.2 The Ellipsoid Algorithm Going from the Centroid to the Ellipsoid algorithm requires a remarkably small change. If our test point ct = centroid(Et ) does not belong to K, the separation oracle returns a half-space H := { x | ⟨ at , x − ct ⟩ ≤ 0} that contains K. Now we don’t just define E t +1 ← E t ∩ H 235 236 ellipsoid for convex optimization but instead define: Et+1 ← minimum volume ellipsoid containing Et ∩ H. How can we compute this minimum-volume ellipsoid? And does the volume go down by a constant factor? Since we start off with E0 being a ball (which is trivially an ellipsoid), it suffices to show how to compute the minimum-volume ellipsoid Et+1 of half an ellipsoid (the intersection of an ellipsoid Et with a half-space passing through its center). We show how to do this in §19.5, and show that the ellipsoid Et+1 ⊇ Et ∩ Ht has volume The minimum-volume ellipsoid containing a convex body K is often called the John ellipsoid for K, after Fritz John who proved several properties for it in 1948. Why not balls? Clearly, the smallest volume ball that contains half a ball is the ball itself. 
Interestingly, the same is true for boxes: the volume of the new box may not decrease. Thankfully, ellipsoids—and in fact, simplices—do have the volume-reduction property.  vol(Et+1 ) − 1 ≤ e 2(n+1) ≈ 1 − O(1/n) . vol(Et ) Therefore, after 2(n + 1) iterations, the ratio of the volumes falls by at least a factor of 1e . Hence, if after O(n2 ln( R/r )) steps, none of the ellipsoid centers have been inside K, we know that K must be empty. 19.3 Ellipsoid for Convex Optimization Let’s go back to convex minimization: we want to solve min{ f ( x ) | x ∈ K }. Again, assume that K is given by a strong separation oracle, and we have numbers R, r such that K ⊆ Ball(0, R), and K is either empty or contains a ball of radius r. The general structure is a one familiar by now, and combines ideas from both the previous sections. 1. Let the starting point x1 ← 0, the starting ellipsoid be E1 ← Ball(0, R), and the starting convex set K1 ← K. 2. At time t, ask the separation oracle: “Is the center ct of ellipsoid Et in the convex body Kt ?” Yes: Define half-space Ht := { x | ⟨∇ f (ct ), x − ct ⟩ ≤ 0}. Observe that Kt ∩ Ht contains all points in Kt with value at most f (ct ). No: In this case the separation oracle also gives us a separating hyperplane. This defines a half-space Ht such that ct ̸∈ Ht , but Kt ⊆ Ht . In both cases, set Kt+1 ← Kt ∩ Ht , and Et+1 to an ellipsoid containing Et ∩ Ht . Since we knew that Kt ⊆ Et , we maintain that K t +1 ⊆ E t +1 . 3. Finally, after T = 2n(n + 1) ln( R/r ) rounds either we have not seen any point in K—in which case we say “K is empty”—or else we output xb ← arg min{ f (ct ) | ct ∈ Kt , t ∈ 1 . . . T }. This volume reduction is weaker by a factor of Θ(n) than that of the Centroid algorithm. the centroid and ellipsoid algorithms 237 One subtle issue: we make queries to a separation oracle for Kt , but we are promised only a separation oracle for K1 = K. However, we can build separation oracles for Ht inductively: indeed, given strong separation oracle for Kt−1 , we build one for Kt = Kt−1 ∩ Ht−1 as follows: Given z ∈ Rn , query the oracle for Kt−1 at z. If z ̸∈ Kt−1 , the separating hyperplane for Kt−1 also works for Kt . Else, if z ∈ Kt−1 , check if z ∈ Ht−1 . If so, z ∈ Kt = Kt−1 ∩ Ht−1 . Otherwise, the defining hyperplane for halfspace Ht−1 is a separating hyperplane between z and Kt . Now adapting the analysis from the previous sections gives us the following result (assuming exact arithmetic again): Theorem 19.4 (Idealized Convex Minimization using Ellipsoid). Given K, r, R as above (and a strong separation oracle K), and a function f with gradients bounded by G, the Ellipsoid algorithm run for T steps either correctly reports that K = ∅, or else produces a point xb such that f ( xb) − f ( x ∗ ) ≤ n o O( GR) T exp − . r 2n(n + 1) Note the similarity to Theorem 19.2, as well as the differences: the exponential term is slower by a factor of 2(n + 1). This is because the volume of the successive ellipsoids shrinks much slower than in Grünbaum’s lemma. Also, we lose a factor of R/r because K is potentially smaller than the starting body by precisely this factor. (Again, this presentation ignores precision issues, and assumes we can do exact real arithmetic.) 19.4 The Ellipsoid Algorithm to Solve LPs The Ellipsoid algorithm is usually attributed to Naum Shor; the fact that this algorithm gives a polynomial-time algorithm for linear programming was a breakthrough result due to Khachiyan, and was front page news at the time. 
A great source of information about this algorithm is the Grötschel-Lovász-Schrijver book. A historical perspective appears in this this survey by Bland, Goldfarb, and Todd. Let us mention some theorem statements about the Ellipsoid algorithm that are most useful in designing algorithms. The second-most important theorem is the following. Recall the notion of an extreme point or basic feasible solution (bfs) from §9.1.2. Let ⟨ A⟩, ⟨b⟩, ⟨c⟩ denote the number of bits required to represent of A, b, c respectively. Theorem 19.5 (Linear Programming in Polynomial Time). Given a linear program min{c⊺ x | Ax ≥ b}, the Ellipsoid algorithm produces an optimal vertex solution for the LP, in time polynomial in ⟨ A⟩, ⟨b⟩, ⟨c⟩. N. Z. Šor and N. G. Žurbenko (1971) L.G. Khachiyan (1979) M. Grötschel, L. Lovász, and A. Schrijver (1988) 238 the ellipsoid algorithm to solve lps One may ask: does the runtime depend on the bit-complexity of the input because doing basic arithmetic on these numbers may require large amounts of time. Unfortunately, that is not the case. Even if we count the number of arithmetic operations we need to perform, the Ellipsoid algorithm performs poly(⟨ A⟩ + ⟨b⟩ + ⟨c⟩) operations. A stronger guarantee would have been for the number of arithmetic operations to be poly(m, n), where the matrix A ∈ Qm×n : such an algorithm would be called a strongly polynomial-time algorithm. Obtaining such an algorithm remains a major open question. 19.4.1 Finding Vertex Solutions for LPs There are several issues that we need to handle when solving LPs using this approach. For instance, the polytope may not be fulldimensional, and hence we do not have any non-trivial ball within K. Our separation oracles may only be approximate. Moreover, all the numerical calculations may only be approximate. Even after we take care of these issues, we are working over the rationals so binary search-type techniques may not be able to get us to a vertex solution. So finally, when we have a solution xt that is “close enough” to x ∗ , we need to “round” it and get a vertex solution. In a single dimension we can do the following (and this idea already appeared in a homework problem): we know that the optimal solution x ∗ is a rational whose denominator (when written in reduced terms) uses at most some b bits. So we find a solution within distance to x ∗ is smaller than some δ. Moreover δ is chosen to be small enough such that there is a unique rational with denominator smaller than 2b in the δ-ball around xt . This rational can only be x ∗ , so we can “round” xt to it. In higher dimensions, the analog of this is a technique (due to Lovász) called simultaneous Diophantine equations. 19.4.2 Separation Implies Optimization The most important theorem about Ellipsoid is the following: Theorem 19.6 (Separation implies Optimization). Given an LP min{c⊺ x | x ∈ K } for a polytope K = { x | Ax ≥ b} ⊆ Rn , and given access to a strong separation oracle for K, the Ellipsoid algorithm produces a vertex solution for the LP in time poly(n, maxi ⟨ ai ⟩, maxi ⟨bi ⟩, ⟨c⟩). There is no dependence on the number of constraints in the LP; we can get a basic solution to any finite LP as long as each constraint has Consider the case where we perform binary-search over the interval [0, 1] and want to find the point 1/3: no number of steps will get us exactly to the answer. the centroid and ellipsoid algorithms a reasonable bit complexity, and we can define a separation oracle for the polytope. 
This is often summarized by saying: “separation implies optimization”. Let us give two examples of exponential-sized LPs, for which we can give a separation oracles, and hence optimize over them. 19.5 Getting the New Ellipsoid This brings us to the final missing piece: given a current ellipsoid E and a half-space H that does not contain its center, we want an ellipsoid E ′ that contains E ∩ H, and as small as possible. To start off, let us recall some basic facts about ellipsoids. The simplest ellipses in R2 are axis aligned, say with principal semi-axes having length a and b, and written as: x2 y2 + 2 ≤ 1. 2 a b Or in matrix notation we could also say " #⊺ " #" # 1/a2 x 0 x ≤1 2 1 y 0 /b y More generally, any ellipsoid E is perhaps best thought of as a invertible linear transformation L applied to the unit ball B(0, 1), and then it being shifted to the correct center c. The linear transformation yields: L(Ball(0, 1)) = { Lx : x⊺ x ≤ 1} = { y : ( L −1 y )⊺ ( L −1 y ) ≤ 1 } = {y : y⊺ ( LL⊺ )−1 y ≤ 1} = { y : y ⊺ Q −1 y ≤ 1 } , where Q−1 := LL⊺ is a positive semidefinite matrix. For an ellipsoid centered at c we simply write { y + 1 : y ⊺ Q −1 y ≤ 1 } = { y : ( y − c )⊺ Q −1 ( y − c ) ≤ 1 } . It is helpful to note that for any ball A, vol( L( A)) = vol( A) · | det( L)| = vol( A) q det( Q) In the above problems, we are given an ellipsoid Et and a halfspace Ht that does not contain the center of Et . We want to find a matrix Qt+1 and a center ct+1 such that the resulting ellipsoid Et+1 contains Et ∩ Ht , and satisfies vol(Et+1 ) ≤ e−1/2(n+1) . vol(Et ) 239 240 getting the new ellipsoid Given the above discussion, it suffices to do this when Et is a unit ball: indeed, when Et is a general ellipsoid, we apply the inverse linear transformation to convert it to a ball, find the smaller ellipsoid for it, and then apply the transformation to get the final smaller ellipsoid. (The volume changes due to the two transformations cancel each other out.) We give the construction for the unit ball below, but first let us record the claim for general ellipsoids: Theorem 19.7. Given an ellipsoid Et given by (ct , Qt ) and a separating hyperplane a⊺t ( x − ct ) ≤ 0 through its center, the new ellipsoid Et+1 with center ct+1 and psd matrix Qt+1 ) is found by taking c t +1 : = c t − and where h = q n2 Q t +1 = 2 n −1  1 h n+1 2 hh⊺ Qk − n+1  a⊺t Qt at . Note that the construction requires us to take square-roots: this may result in irrational numbers which we then have to either truncate, or represent implicitly. In either case, we face numerical issues; ensuring that these issues are not real problems lies at the heart of the formal analysis. We refer to the GLS book, or other textbooks for details and references. 19.5.1 Halving a Ball Before we end, we show that the problem of finding a smaller ellipsoid that contains half a ball is, in fact, completely straight-forward. By rotational symmetry, we might as well find a small ellipsoid that contains K = Ball(0, 1) ∩ { x | x1 ≥ 0}. By symmetry, it makes sense that the center of this new ellipsoid E should be of the form c = (c1 , 0, . . . , 0). Again by symmetry, the ellipsoid can be axis-aligned, with semi-axes of length a along e1 , and b > a along all the other coordinate axes. Moreover, for E to contain the unit ball, it should contain the points (1, 0) and (0, 1), say. So (1 − c1 )2 ≤ 1 and a2 c21 1 + 2 ≤ 1. 
a2 b the centroid and ellipsoid algorithms Suppose these two inequalities are tight, then we get s s (1 − c1 )2 (1 − c1 )2 a = 1 − c1 , b= , = (1 − 2c1 (1 − c1 )2 − c21 and moreover the ratio of volume of the ellipsoid to that of the ball is abn−1 = (1 − c1 ) ·  (1 − c )2 (n−1)/2 1 1 − 2c1 . 1 This is minimized by setting c1 = n+ 1 gives us vol(E ) − 1 = · · · ≤ e 2( n +1) . vol(Ball(0, 1)) For a more detailed description and proof of this process, see these notes from our LP/SDP course for details. In fact, we can view the question of finding the minimum-volume ellipsoid that contains the half-ball K: this is a convex program, and looking at the optimality conditions for this gives us the same construction above (without having to make the assumptions of symmetry). 19.6 Algorithms for Solving LPs We have now seen two different classes of algorithms to solve linear programs: the first approach using multiplicative weights gave us solutions which violate the constraints by ε and take O(1/ε2 ) steps (ignoring terms that depend on the other input parameters for now). Next we saw the Centroid and Ellipsoid algorithms for convex programming which require only O(log 1/ε) steps. However, they are typically not used to solve LPs in practice. There are several other algorithms: let us mention them in passing. Let K := { x | Ax ≥ b} ⊆ Rn , and we want to minimize {c⊺ x | x ∈ K }. Simplex: This is perhaps the first algorithm for solving LPs that most of us see. It was also the first general-purpose linear program solver known, having been developed by George Dantzig in 1947. This is a local-search algorithm: it maintains a vertex of the polyhedron K, and at each step it moves to a neighboring vertex without decreasing the objective function value, until it reaches an optimal vertex. (The convexity of K ensures that such a sequence of steps is possible.) The strategy to choose the next vertex is called the pivot rule. Unfortunately, for most known pivot rules, there are examples on which the following the pivot rule takes exponential (or at least a super-polynomial) number of steps. Despite that, it is often used in practice: e.g., the Excel software contains an implementation of simplex. G.B. Dantzig (1990) 241 242 algorithms for solving lps Interior Point: A different approach to get algorithms for LPs is via interior-point algorithms: these happen to be good both in theory and in practice. The first polynomial-time interior-point algorithm was proposed by Karmarkar in 1984. We discuss this in the next chapter. Geometric Algorithms for LPs: These approaches are geared towards solving LPs fast when the number of dimensions n is small. If m is the number of constraints, these algorithms often allow a poor runtime in n, at the expense of getting a good dependence on m. As an example, a randomized algorithm of Raimund Seidel’s has a runtime of O(m · n!) = O(m · nn/2 ); a different algorithm of Ken Clarkson (based on the multiplicative weights approach!) has a runtime of O(n2 m) + nO(n) O(log m)O(log n) . One of the fastest such algorithm is by Jiri Matoušek, Micha Sharir, and Emo Welzl, and has a runtime of √ O(n2 m) + eO( n log n) . For details and references, see this survey by Martin Dyer, Nimrod Megiddo, and Emo Welzl. Naturally, there are other approaches to solve linear programs as well: write more here. 20 Interior-Point Methods In this chapter, we continue our discussion of polynomial-time algorithms for linear programming, and cover the high-level details of an interior-point algorithm. 
The runtime for these linear programs has recently been improved both qualitatively and quantitatively, so this is an active area of research that you may be interested in. Moreover, these algorithms contain sophisticated general ideas (duality and the method of Lagrange multipliers, and the use of barrier functions) that are important even beyond this context. Another advantage of delving into the details of these methods is that we can work on getting better algorithms for special kinds of linear programs of interest to us. For instance, the line of work on faster max-flow algorithms for directed graphs, starting with the work of Madry, and currently resulting in the O(m4/3+ε )-time algorithms of Kathuria, and of Liu and Sidford, are based on a better understanding of interior-point methods. We will consider the following LP with equality constraints: min c⊺ x Ax = b x≥0 where A ∈ Rm×n , b ∈ Rm and c, x ∈ Rn . Let K := { x | Ax = b, x ≥ 0} be the polyhedron, and x ∗ = arg min{c⊺ x | x ∈ K } an optimal solution. To get the main ideas across, we make some simplifying assumptions and skip over some portions of the algorithm. For more details, please refer to the book by Jiri Matoušek and Bernd Gärtner (which has more details), or the one by Steve Wright (which has most details). Figure 20.1: The feasible region for an LP in equational form (from the Matoušek and Gärtner book). 244 barrier functions 20.1 Barrier Functions The first step in solving the LP using an interior-point method will be to introduce a parameter η > 0 and exchange our constrained linear optimization problem for an unconstrained but nonlinear one: f η ( x ) := c⊺ x + η  n 1 ∑ log xi . i =1 Let xη∗ := arg min{ f η ( x ) | Ax = b} be the minimizer of this function over the subspace given by the equality constraints. Note that we’ve added in η times a barrier function n B( x ) := ∑ log i =1 1 . xi The intuition is that when x approaches the boundary x ≥ 0 of the feasible region, the barrier function B( x ) will approach +∞. The parameter η lets us control the influence of this barrier function. If η is sufficiently large, the contribution of the barrier function dominates in f η ( x ), and the minimizer xη∗ will be close to the “center” of the feasible region. However, as η gets close to 0, the effect of B( x ) will diminish and the term c⊺ x will now dominate, causing that xη∗ to approach x ∗ . Now consider the trajectory of the minimizer xη∗ as we lower η continuously, starting at some large value and tending to zero: this path is called the central path. The idea of our path-following algorithm will be to approximately follow this path. In essence, such algorithms conceptually perform the following steps (although we will only approximate these steps in practice): If we had inequality constraints Ax ≥ b as well, we would have added ∑im=1 log a⊺ x1−b to the barrier function. i i 1. Pick a sufficiently large η0 and a starting point x (0) that is the minimizer of f η0 ( x ). (We will ignore this step in our discussion, for now.) 2. At step t, move to the corresponding minimizer x (t+1) for f ηt+1 , where ηt+1 := ηt · (1 − ϵ). Since ηt is close to ηt+1 , we hope that the previous minimizer x (t) is close enough to the current goal x (t+1) for us to find it efficiently. 3. Repeat until η is small enough that xη∗ is very close to an optimal solution x ∗ . At this point, round it to get a vertex solution, like in §19.4.1. 
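As a quick sanity check of this picture, the snippet below traces the central path for a made-up two-variable LP, min{c⊺x : x1 + x2 = 1, x ≥ 0}: as η decreases, the minimizer x*_η of f_η drifts from near the "center" of the feasible segment towards the optimal vertex x* = (1, 0). A brute-force grid search over the segment stands in for the Newton-style update step described below.

```python
import numpy as np

def f_eta(X, c, eta):
    # f_eta(x) = c^T x + eta * sum_i log(1/x_i), for x > 0 (rows of X).
    return X @ c + eta * np.sum(np.log(1.0 / X), axis=1)

if __name__ == "__main__":
    c = np.array([1.0, 2.0])                    # toy costs, so x* = (1, 0)
    t = np.linspace(1e-4, 1 - 1e-4, 200001)
    X = np.stack([t, 1.0 - t], axis=1)          # the feasible segment x1 + x2 = 1
    for eta in [10.0, 1.0, 0.1, 0.01, 0.001]:
        x_eta = X[np.argmin(f_eta(X, c, eta))]  # grid-search stand-in for x*_eta
        print(f"eta = {eta:7.3f}   x*_eta ~ {x_eta.round(4)}")
```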
We will only sketch the high-level idea behind Step 1 (finding the starting solution), and will skip Step 2 (the rounding); our focus will Figure 20.2: A visualization of a pathfollowing algorithm. interior-point methods 245 be on the update step. To understand this step, let us look at the structure of the minimizers for f η ( x ). 20.1.1 The Primal and Dual LPs, and the Duality Gap Recall the primal linear program: ( P) min c⊺ x Ax = b x ≥ 0, and its dual: ( D ) max b⊺ y A⊺ y ≤ c. We can rewrite the dual using non-negative slack variables s: ( D ′ ) max b⊺ y A⊺ y + s = c s ≥ 0. We assume that both the primal ( P) and dual ( D ) are strictly feasible: i.e., they have solutions even if we replace the inequalities with strict ones). Then we can prove the following result, which relates the optimizer for f η to feasible primal and dual solutions: Lemma 20.1 (Optimality Conditions). The point x ∈ Rn≥0 is a minimizer of f η ( x ) if and only if there exist y ∈ Rm and s ∈ Rn≥0 such that: Ax − b = 0 (20.1) ⊺ A y+s = c (20.2) ∀i ∈ [ n ] : si xi = η (20.3) The conditions (20.1) and (20.2) show that x and (y, s) are feasible for the primal ( P) and dual ( D ′ ) respectively. The condition (20.3) is an analog of the usual complementary slackness result that arises when η = 0. To prove this lemma, we use the method of Lagrange multipliers. Theorem 20.2 (The Method of Lagrange Multipliers). Let functions f and g1 , · · · , gm be continuously differentiable, and defined on some open subset of Rn . If x ∗ is a local optimum of the following optimization problem min f ( x ) s.t. ∀i ∈ [m] : gi ( x ) = 0 then there exists y∗ ∈ Rm such that ∇ f ( x ∗ ) = ∑im=1 yi∗ · ∇ gi ( x ∗ ). Observe: we get that if there exists a maximum x ∗ , then x ∗ satisfies these conditions. 246 the update step Proof Sketch of Lemma 20.1. We need to show three things: 1. The function f η ( x ) achieves its maximum x ∗ in the feasible region. 2. The point x ∗ satisfies the conditions (20.1)–(20.3). 3. And that no other x satisfies these conditions. The first step uses that if there are strictly feasible primal and dual solutions ( x̂, ŷ, ŝ), then the region { x | Ax = b, f µ ( x ) ≤ f µ x̂ } is bounded (and clearly closed) and hence the continuous function f µ ( x ) achieves its minimum at some point x ∗ inside this region, by the Extreme Value theorem. (See Lemma 7.2.1 of Matoušek and Gärtner, say.) For the second step, we use the functions f µ ( x ), and gi ( x ) = a⊺i x − bi in Theorem 20.2 to get the existence of y∗ ∈ Rm such that: m f η ( x ∗ ) = ∑ yi∗ · ∇( a⊺i x ∗ − bi ) i =1 ⇐⇒ c − η · 1/x1∗ , · · · , 1/xn∗ ⊺ m = ∑ yi∗ ai . i =1 Define a vector s∗ with si∗ = η/xi∗ . The above condition is now equivalent to setting A⊺ y∗ + s∗ = c and si∗ xi∗ = η for all i. Finally, for the third step of the proof, the function f η ( x ) is strictly convex and has a unique local/global optimum. Finish this proof. By weak duality, the optimal value of the linear program lies between the values of any feasible primal and dual solution, so the duality gap c⊺ x − b⊺ y bounds the suboptimality c⊺ x − OPT of our current solution. Lemma 20.1 allows us to relate the duality gap to η as follows. c⊺ x − b⊺ y = c⊺ x − ( Ax )⊺ y = x⊺ c − x⊺ (c − s) = x⊺ s = n · η. If the representation size of the original LP is L := ⟨ A⟩ + ⟨b⟩ + ⟨c⟩, then making η ≤ 2− poly( L) means we have primal and dual solutions whose values are close enough to optimal, and can be rounded (using the usual simultaneous Diophantine equations approach used for Ellipsoid). 
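To see Lemma 20.1 and the gap computation in action, we can solve the optimality conditions by hand on the same kind of two-variable toy instance as before (A = [1 1], b = 1, c = (1, 2), chosen purely for illustration): the conditions reduce to a quadratic in x1, and the resulting triple (x, y, s) has duality gap exactly n·η.

```python
import numpy as np

# Toy equality-form LP: min c^T x  s.t.  x1 + x2 = 1, x >= 0.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 2.0])
eta = 0.01

# Conditions (20.1)-(20.3): x1 + x2 = 1, s_i = eta / x_i, A^T y + s = c.
# Since A^T y = (y, y), eliminating y gives c1 - eta/x1 = c2 - eta/x2 with
# x2 = 1 - x1, i.e. (c1 - c2) x1^2 - (c1 - c2 + 2 eta) x1 + eta = 0.
quad = [c[0] - c[1], -(c[0] - c[1]) - 2 * eta, eta]
x1 = next(r.real for r in np.roots(quad) if 0 < r.real < 1)
x = np.array([x1, 1 - x1])
s = eta / x
y = np.array([c[0] - s[0]])

print("Ax - b        :", A @ x - b)            # (20.1), ~ 0
print("A^T y + s - c :", A.T @ y + s - c)      # (20.2), ~ 0
print("x_i * s_i     :", x * s)                # (20.3), each equals eta
print("gap c.x - b.y :", c @ x - b @ y, "  vs  n*eta =", len(x) * eta)
```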
20.2 The Update Step Let us now return to the question of obtaining x (t+1) from x (t) at step t? Recall, we want x (t+1) to satisfy the optimality conditions (20.1)–(20.3) for f ηt+1 . The hurdles to finding this point directly are: (a) the non-negativity of the x, s variables means this is not just a linear system, there are inequalities to contend with. And more worryingly, (b) it is not a linear system at all: we have non-linearity in the constraints (20.3) because of multiplying xi with si . interior-point methods To get around this, we use a “local-search” method. We start with the solution x (t) “close to” the optimal solution xη∗t for f ηt , and take a small step, so that we remain non-negative, and also get “close to” the optimal solution xη∗t+1 for f ηt+1 . Then we lower η and repeat this process. Let us make these precise. First, to avoid all the superscripts, we use ( x, y, s) and η to denote ( x (t) , y(t) , s(t) ) and ηt . Similarly, ( x ′ , y′ , s′ ) and η ′ denote the corresponding values at time t + 1. Now we assume we have ( x, y, s) with x, s > 0, and also: Ax = b (20.4) A y+s = c (20.5) 2 (20.6) ⊺ n ∑ s i x i − ηt i =1 ≤ (ηt/4)2 . The first two are again feasibility conditions for ( P) and ( D ′ ). The third condition is new, and is an approximate version of (20.3). Suppose that 1  η ′ := η ′ · 1 − √ . 4 n Our goal is a new solution x ′ = x + ∆x, y′ = y + ∆y, s′ = s + ∆s, which satisfies non-negativity, and ideally also satisfies the original optimality conditions (20.1)–(20.3) for the new η ′ . (Of course, we will fail and only satisfy the weaker condition (20.6) instead of (20.3), but we should aim high.) Let us write the goal explicitly, by substituting ( x ′ , y′ , s′ ) into (20.4)–(20.6) and using the feasibility of ( x, y, s). This means the increments ∆x, ∆y, ∆s satisfy A (∆x ) = 0 ⊺ A (∆y) + (∆s) = 0 si (∆xi ) + (∆si ) xi + (∆si )(∆xi ) = η ′ − xi si . Note the quadratic term in blue. Since we are aiming for an approximation anyways, and these increments are meant to be tiny, we drop the quadratic term to get a system of linear equations in these increments: A (∆x ) = 0 ⊺ A (∆y) + (∆s) = 0 si (∆xi ) + (∆si ) xi = η ′ − xi si . This is often written in the following matrix notation (which I am 247 248 the update step putting down just so that you recognize it the next time you see it):      A 0 0 ∆x 0      A⊺ I   ∆y  =  0  0 . diag( x ) 0 diag(s) ∆s η′1 − x ◦ s Here x ◦ s stands for the component-wise product of the vectors x and s. The bottom line: this is a linear system and we can solve it, say using Gaussian elimination. Now we can set x ′ = x + ∆x, etc., to get the new point ( x ′ , y′ , s′ ). It remains to check the non-negativity and also the weakened conditions (20.4)–(20.6) with respect to η ′ . 20.2.1 Properties of the New Solution While discarding the quadratic terms means we do not satisfy xi si = η for each coordinate i, we can show that we satisfy it on average, allowing us to bound the duality gap. Lemma 20.3. The new duality gap is ⟨ x ′ , s′ ⟩ = n η ′ . Proof. The last set of equalities in the linear system ensure that si xi + si (∆xi ) + (∆si ) xi = η ′ , (20.7) so we get x ′ , s′ = ⟨ x + ∆x, s + ∆s⟩  = ∑ si xi + si (∆xi ) + (∆si ) xi + ⟨∆x, ∆s⟩ i = nη ′ + ⟨∆x, − A⊺ (∆y)⟩ = n · η ′ − ⟨ A(∆x ), ∆y⟩ = n · η′, using the first two equality constraints of the linear system. We explicitly maintained the invariants given by (20.4), (20.5), so it remains to check (20.6). 
This requires just a bit of algebra (that also √ causes the n to pop out). 2 Lemma 20.4. ∑in=1 si′ xi′ − η ′ ≤ (η ′/4)2 . Proof. As in the proof of Lemma 20.3, we get that si′ xi′ − η ′ = (∆si )(∆xi ), so it suffices to show that s n ∑ (∆si )2 (∆xi )2 ≤ η′/4. i =1 We can use the inequality r 1 1 ∑ a2i bi2 ≤ 4 ∑(ai + bi )2 = 4 ∑(a2i + bi2 + 2ai bi ), i i i The goal of many modern algorithms is to get faster ways to solve this linear system. E.g., if it were a Laplacian system we could (approximately) solve it in near-linear time. interior-point methods where we set a2i = s n ∑ i =1 (∆si ∆xi )2 ≤ xi (∆si )2 s (∆x )2 and bi2 = i x i . Hence si i 1 n 4 i∑ =1  xi s · (∆si )2 + i · (∆xi )2 + 2(∆si )(∆xi ) si xi = 1 n ( xi ∆si )2 + (si ∆xi )2 4 i∑ si xi =1 ≤ 1 ∑in=1 ( xi ∆si + si ∆xi )2 4 mini∈[n] si xi =  [since (∆s)⊺ ∆x = 0 by Claim 20.3] 1 ∑in=1 (η ′ − si xi )2 . 4 mini∈[n] si xi (20.8) We now bound the numerator and denominator separately. This claim and proof are incorrect. However, we can prove that mini si xi ≥ 3η/4. This weaker claim suffices for the rest of the proof. The details will come soon, sorry for the mistake!  Claim 20.5 (Denominator). mini si xi ≥ η 1 − 4√1 n . Proof. By the inductive hypothesis, ∑i (si xi − η )2 ≤ (η/4)2 . This η means that maxi |si xi − η | ≤ 4√n , which proves the claim. Claim 20.6 (Numerator). ∑in=1 (η ′ − si xi )2 ≤ η 2 /8. Proof. Let δ = 4√1 n . Then, n n i =1 i =1 n n n i =1 i =1 i =1 ∑ (η ′ − si xi )2 = ∑ ((1 − δ)η − si xi )2 = ∑ (η − si xi )2 + ∑ (δη )2 + 2δη ∑ (η − si xi ). The first term is at most (η/4)2 , by the induction hypothesis. On the other hand, by Claim 20.3 we have n n i =1 i =1 ∑ (η − si xi ) = nη − ∑ si xi = 0. Thus n 1 ∑ (η ′ − si xi )2 ≤ (η/4)2 + n (4√n)2 η 2 = η 2 /8. i =1 Substituting these results into (20.8), we get s n η 2 /8 η′ 1 1 2 ≤ ( ∆s ∆x ) = . ∑ i i 4 (1 − √1 )η 32 (1 − √1 )2 i =1 4 n 4 n This expression is smaller than η ′ /4, which completes the proof. Lemma 20.7. The new values x ′ , s′ are non-negative. 249 250 the newton-raphson method Proof. By induction, we assume the previous point has xi > 0 and si > 0. (For the base case we need to ensure that the starting solution ( x (0) , s(0) ) also satisfies this property.) Now for a scalar α ∈ [0, 1] we define x ′′ := x + α∆x, s′′ := s + α∆s, and η ′′ := (1 − α)η + αη ′ , to linearly interpolate between the old values and the new ones. Then we can show ⟨ x ′′ , s′′ ⟩ = n η ′′ , and also ∑(si′′ xi′′ − η ′′ )2 ≤ (η ′′ /4)2 , (20.9) i which are analogs of Lemmas 20.3 and 20.4 respectively. The latter inequality means that |si′′ xi′′ − η ′′ | ≤ η ′′ /4 for each coordinate i, else that coordinate itself would violate inequality (20.9). Specifically, this means that neither xi′′ nor si′′ ever becomes zero for any value of α ∈ [0, 1]. Now since ( xi′′ , si′′ ) is a linear interpolation between ( xi , si ) and ( xi′ , si′ ), and the former were strictly positive, the latter cannot be non-positive. Theorem 20.8. Given an LP min{c⊺ x | Ax = b, x ≥ 0} with an initial feasible ( x (0) , η0 ) pair, the interior-point algorithm produces a primal-dual √ nη pair with duality gap at most ε in O( n log ε 0 ) iterations, each involving solving one linear system. The proof of the above theorem follows immediately from the fact that the duality gap at the beginning is nη0 , and the value of η drops by (1 − 4√1 n ) in each iteration. If the LP has representation size L := ⟨ A⟩ + ⟨b⟩ + ⟨c⟩, we can stop when ε = exp(− poly( L)), and then round this solution to an vertex solution of the LP. 
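Putting the pieces together, here is a compact numpy sketch of the short-step method just analyzed. It assumes we are handed a strictly feasible, well-centered starting tuple (x, y, s, η) satisfying (20.4)–(20.6), it solves the linearized system by plain Gaussian elimination rather than anything faster, and it does not attempt the final rounding to a vertex.

```python
import numpy as np

def update_step(A, x, y, s, eta_new):
    """Solve the linearized system
         A dx = 0,   A^T dy + ds = 0,   s_i dx_i + x_i ds_i = eta_new - x_i s_i,
       and return (x + dx, y + dy, s + ds)."""
    m, n = A.shape
    M = np.zeros((2 * n + m, 2 * n + m))
    rhs = np.zeros(2 * n + m)
    M[:m, :n] = A                        # rows 1..m:      A dx = 0
    M[m:m + n, n:n + m] = A.T            # rows m+1..m+n:  A^T dy + ds = 0
    M[m:m + n, n + m:] = np.eye(n)
    M[m + n:, :n] = np.diag(s)           # last n rows:    s o dx + x o ds = eta' 1 - x o s
    M[m + n:, n + m:] = np.diag(x)
    rhs[m + n:] = eta_new - x * s
    d = np.linalg.solve(M, rhs)
    return x + d[:n], y + d[n:n + m], s + d[n + m:]

def short_step_ipm(A, x, y, s, eta, eps=1e-8):
    """Shrink eta by (1 - 1/(4 sqrt(n))) per iteration, as in Theorem 20.8,
    until the duality gap n * eta drops below eps.  Primal and dual
    feasibility are preserved automatically by the update step."""
    n = len(x)
    while n * eta > eps:
        eta *= 1.0 - 1.0 / (4.0 * np.sqrt(n))
        x, y, s = update_step(A, x, y, s, eta)
    return x, y, s, eta
```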
The one missing piece is finding the initial ( x (0) , η0 ) pair: this is a somewhat non-trivial step. One possible approach is to run the interior-point algorithm “in reverse”. The idea is that we can start with some vertex of the feasible region, and then to successively increase η through a similar mechanism as the one above, until the value of η is sufficiently large to begin the algorithm. 20.3 The Newton-Raphson Method A more “modern” way of viewing interior-point methods is via the notion of self-concordance. To do this, let us revisit the classic NewtonRaphson method for finding roots. 20.3.1 Finding Zeros of Functions The basic Newton-Raphson method for finding a zero of a univariate function is the following: given a function g, we start with a point x1 , interior-point methods 251 and at each time t, set x t +1 ← x t − g( xt ) . g′ ( xt ) (20.10) We now show that if f is “nice enough” and we are “close enough” to a zero x ∗ , then this process converges very rapidly to x ∗ . Theorem 20.9. Suppose g has continuous second-derivatives, then if x ∗ is a zero of g, then if we start at x1 “close enough” to X ∗ , the error goes to ε in O(log log 1/ε) steps. Make this formal! Since we take O(log log 1/ε) iterations to get to error ε, the number of bits of accuracy squares each time. This is called quadratic convergence in the optimization literature. Proof. By Taylor’s theorem, the existence of continuous second derivatives means we can approximate f around xt as: f ( x ∗ ) = f ( xt ) + f ′ ( xt )( x ∗ − xt ) + 1/2 f ′′ (ξ t )( x ∗ − xt )2 , where ξ t is some point in the interval [ x ∗ , xt ]. However, x ∗ is a zero of f , so f ( x ∗ ) = 0. Moreover, using (20.10) to replace xt f ′ ( xt ) − f ( xt ) by xt+1 f ′ ( xt ), and rearranging, we get − f ′′ (ξ t ) x ∗ − x t +1 = · ( x ∗ − x t )2 . | {z } 2 f ′ ( xt ) | {z } =:δt+1 =:δt2 Above, we use δt to denote the error x ∗ − xt . Taking absolute values |δt+1 | = f ′′ (ξ t ) · δt2 . 2 f ′ ( xt ) f ′′ (ξ ) Hence, if we can ensure that | 2 f ′ ( x) | ≤ M for each x and each ξ that lies bewteen x ∗ and x, then once we have δ0 small enough, then each subsequent error drops quadratically. This means the number of significant bits of accuracy double each step. More careful analysis. 20.3.2 An Example Given an n-bit integer a ∈ Z, suppose we want to compute its reciprocal 1/a without using divisions. This reciprocal is a zero of the expression g( x ) = 1/x − a. Hence, the Newton-Raphson method says, we can start with x1 = 1, say, and then use (20.10) to get x t +1 ← x t − (1/xt − a) = xt + xt (1 − a xt ) = 2xt − a xt2 . (−1/xt2 ) If we define ε t := 1 − a xt , then ε t+1 = 1 − a xt+1 = 1 − (2a xt − a2 xt2 ) = (1 − a xt )2 = ε2t . Hence, if ε 1 ≤ 1/2, say, the number of bits of accuracy double at each step. Moreover, if we are careful, we can store xt using integers (by instead keeping track of 2k xt for suitably chosen values k ≈ 2t ). This method for computing reciprocals appears in the classic book of Aho, Hopcroft, Ullman, without any elaboration—it always mystified me until I realized the connection to the Newton-Raphson method. I guess they expected their readers to be familiar with these connections, since computer science used to have closer connections to numerical analysis in the 1970s. 252 self-concordance 20.3.3 Minimizing Convex Functions To find the minimum of a function f (especially a convex function) we can focus on finding a stationary point, i.e., a point x such that f ′ ( x ) = 0. 
Setting g = f′, the update rule just changes to

    x_{t+1} ← x_t − f′(x_t) / f″(x_t).    (20.11)

20.3.4 On To Higher Dimensions

For general functions f : R^n → R, the rule remains the same, with the natural changes:

    x_{t+1} ← x_t − [H_f(x_t)]^{−1} · ∇f(x_t),    (20.12)

where H_f denotes the Hessian of f.

20.4 Self-Concordance

(To do: Analogy between self-concordance and the convergence conditions for the 1-d case? Present the view using the "modern view" of self-concordance. Mention that the current bound is really O(m)-self-concordant. The universal barrier is O(n)-self-concordant, but not efficient. Vaidya's volumetric barrier? The entropic barrier? The Lee-Sidford barrier, based on leverage scores. What's the cleanest way, without getting lost in the algebra?)

24 Approximation Algorithms

In this chapter, we turn to the problem of combating intractability: many combinatorial optimization problems are NP-hard, and hence are unlikely to have polynomial-time algorithms. Hence we consider approximation algorithms: algorithms that run in polynomial time, but output solutions whose quality is close to the optimal solution's quality. We illustrate some of the basic ideas in the context of two NP-hard problems: Set Cover and Bin Packing. Both have been studied since the 1970s.

Let us start with some definitions: having fixed an optimization problem, let I denote an instance of the problem, and Alg denote an algorithm. Then Alg(I) is the output/solution produced by the algorithm, and c(Alg(I)) its cost. Similarly, let Opt(I) denote the optimal output for input I, and let c(Opt(I)) denote its cost. For minimization problems, the approximation ratio of the algorithm A is defined to be the worst-case ratio between the costs of the algorithm's solution and the optimum:

    ρ = ρ_A := max_I  c(Alg(I)) / c(Opt(I)).

In this case, we say that Alg is a ρ-approximation algorithm. For maximization problems, we define ρ to be

    ρ = ρ_A := min_I  c(Alg(I)) / c(Opt(I)),

which is therefore a number in [0, 1].

24.1 A Rough Classification into Hardness Classes

In the late 1990s, there was an attempt to classify combinatorial optimization problems into a small number of hardness classes: while this ultimately failed, a rough classification of NP-hard problems is still useful. (Typically, if the instance is clear from context, we use the notation Alg ≤ r · Opt to denote that c(Alg(I)) ≤ r · c(Opt(I)).)

• Fully Poly-Time Approximation Scheme (FPTAS): For problems in this category, there exist approximation algorithms that take in a parameter ε, and output a solution with approximation ratio 1 + ε in time poly(⟨I⟩, 1/ε). E.g., one such problem is Knapsack, where given a collection of n items, with each item i having size si ∈ Q+ and value vi ∈ Q+, we want to find the subset of items with maximum value that fits into a knapsack of unit size.

• Poly-Time Approximation Scheme (PTAS): For a problem in this category, for any ε > 0, there exists an approximation algorithm with approximation ratio 1 + ε that runs in time O(n^{f(ε)}) for some function f(·). For instance, the Traveling Salesman Problem in d-dimensional Euclidean space has an algorithm due to Sanjeev Arora (1996) that computes a (1 + ε)-approximation in time O(n^{f(ε)}), where f(ε) = exp{(1/ε)^d}. Moreover, it is known that this dependence on ε, with the doubly-exponential dependence on d, is unavoidable.

• Constant-Factor Approximation: Examples in this class include the Traveling Salesman Problem on general metrics.
In the late 1970s, Nicos Christofides and Anatoliy Serdyukov discovered the same 1.5-approximation algorithm for metric TSP, using the Blossom algorithm to connect up the odd-degree vertices of an MST of the metric space to get an Eulerian spanning subgraph, and hence a TSP tour. This was improved only in 2020, when Anna Karlin, Nathan Klein, and Shayan Oveis-Gharan gave an (1.5 − ε)-approximation, which we hope to briefly outline in a later chapter. Meanwhile, it has been shown that metric TSP can’t be approximated with a ratio better than 123 122 under the assumption of P ̸= NP, by Karpinski, Lampis, Schmied. • Logarithmic Approximation: An example of this is Set Cover, which we will discuss in some detail. • Polynomial Approximation: One example is the Independent Set problem, for which any algorithm with an approximation ratio n1−ε for some constant ε > 0 implies that P = NP. The best approximation algorithm for Independent Set known has an approximation ratio of O(n/ log3 n). However, there are problems that do not fall into any of these clean categories, such as Asymmetric k-Center, for which there exists a O ( log ∗ n ) -approximation algorithm, and this is best possible unless P = NP. Or Group Steiner Tree, where the approximation ratio is O ( log 2 n ) on trees, and this is also best possible. As always, let ⟨ I ⟩ denote the bit complexity of the input I. The runtime has been improved to O(n log n + n exp{(1/ε)d }). Christofides’ result only ever appeared as a CMU GSIA technical report in 1976. Serdyukov’s result only came to be known a couple years back. Karlin, Klein, and Oveis Gharan (2020) approximation algorithms 24.2 255 The Surrogate Given that it is difficult to find an optimal solution, how can we argue that the output of some algorithm has cost comparable to that of Opt ( I ) . An important idea in proving the approximation guarantee involves the use of a surrogate, or a lower bound, as follows: Given an algorithm Alg and an instance I, if we want to calculate the approximation ratio of Alg, we first find a surrogate map S from instances to the reals. To bound the approximation ratio, we typically do the following: 1. We show that S( I ) ≤ Opt( I ) for all I, and 2. then show that Alg( I ) ≤ αS( I ) for all I. This shows that Alg( I ) ≤ α Opt( I ). Which leaves us with the question of how to construct the surrogate. Sometimes we use the combinatorial properties of the problem to get a surrogate, and at other times we use a linear programming relaxation. 24.3 The Set Cover Problem In the Set Cover problem, we are given a universe U with n elements, and a family S = {S1 , . . . , Sm } of m subsets of U, such that U = ∪S∈S S. We want to find a subset S ′ ⊆ S , such that U = ∪S∈S S while minimizing the size |S ′ |. In the weighted version of Set Cover, we have a cost cS for each set S ∈ S , and want to minimize c(S ′ ) = ∑S∈S ′ cS . We will focus on the unweighted version for now, and indicate the changes to the algorithm and analysis to extend the results to the weighted case. The Set Cover problem is NP-complete, even for the unweighted version. Several approximation algorithms are known: the greedy algorithm is a ln n-approximation algorithm, with different analyses given by Vašek Chvátal, David Johnson, Laci Lovász, Stein, and others. Since then, the same approximation guarantee was given based on the relax-and-round paradigm. 
This was complemented by a hardness result in 1998 by Uri Feige (building on previous work of Carsten Lund and Mihalis Yannakakis), who showed that a (1 − ε) ln n-approximation algorithm for any constant ε > 0 would imply that NP has algorithms that run in time O(nlog log n ). This was improved by Irit Dinur and David Steurer, who tightened the result to show that such an approximation algorithm would in fact imply that NP has polynomial-time algorithm (i.e., that P = NP). Figure 24.1: The cost diagram on instance I (costs increase from left to right). 256 the set cover problem 24.3.1 The Greedy Algorithm for Set Cover The greedy algorithm is simple: Repeatedly pick the set S ∈ S that covers the most uncovered elements, until all elements of U are covered. Theorem 24.1. The greedy algorithm is a ln n-approximation. The greedy algorithm does not achieve a better ratio than Ω(log n): one example is given by the figure to the right. The optimal sets are the two rows, whereas the greedy algorithm may break ties poorly and pick the set covering the left half, and then half the remainder, etc. A more sophisticated example can show a matching gap of ln n. Proof of Theorem 24.1. Suppose Opt picks k sets from S . Let ni be the number of elements yet uncovered when the algorithm has picked i sets. Then n0 = n = |U |. Since the k sets in Opt cover all the elements of U, they also cover the uncovered elements in ni . By averaging, there must exist a set in S that covers ni /k of the yet-uncovered elements. Hence, ni+1 ≤ ni − ni /k = ni (1 − 1/k). Iterating, we get nt ≤ n0 (1 − 1/k)t < n · e−t/k . So setting T = k ln n, we get n T < 1. Since n T must be an integer, it is zero, so we have covered all elements using T = k ln n sets. 24.3.2 Extending to the Weighted Case number of yet-uncovered elements in S . cS One can give an analysis somewhat like the one above for this weighted case as well: let k now be the total cost of sets in the optimal set cover. After i sets have been picked, the remaining ni elements can still be covered using a collection of cost k, so there must be a set whose cost-to-fresh-coverage ratio is at most k/ni . If it covers ni+1 − ni previously uncovered elements, then we know that its cost most be at most (ni+1 − ni ) · k/ni . So if the algorithm picks ℓ sets, the total cost is  ∑ (ni+1 − ni ) · k/ni ≤ k 1/n + 1/(n−1) + . . . + 1/2 + 1 = k · Hn , i =1 As always, we use 1 + x ≤ e x , and here we can use that the inequality is strict whenever x ̸= 0. If the sets are of size at most B, we can show that the greedy algorithm is an HB -approximation, where HB = 1 + 1/2 + 1/3 + . . . + 1/B Moreover, for the weighted case, the greedy algorithm changes to picking the set S in that maximizes: ℓ Figure 24.2: A Tight Example for the Greedy Algorithm where we used that n0 = n, since all elements are initially uncovered. is the Bth Harmonic number. approximation algorithms 24.4 A Relax-and-Round Algorithm for Set Cover The second algorithm for Set Cover uses the popular relax-andround framework. The steps of this process are as follows: 1. Write an integer linear program for the problem. This will also be NP-hard to solve, naturally. 2. Relax the integrality constraints to get a linear program. Since this is a minimization problem, relaxing the constraints causes the optimal LP value to be no larger than the optimal IP value (which is just Opt). This optimal value LP value is the surrogate. 3. 
Now solve the linear program, and round the fractional variables to integer values, while ensuring that the cost of this integer solution is not much higher than the LP value. Let’s see this in action: here is the integer linear program (ILP) that precisely models Set Cover: min ∑ cS xS S∈S s.t. ∑ xS ≥ 1 S:e∈S xS ∈ {0, 1} (ILP-SC) ∀e ∈ U ∀S ∈ S . The LP relaxation just drops the integrality constraints: min ∑ cS xS S∈S s.t. (LP-SC) ∑ xS ≥ 1 ∀e ∈ U xS ≥ 0 ∀S ∈ S . S:e∈S If LP( I ) is the optimal value for the linear program, then we get: LP( I ) ≤ Opt( I ). Finally, how do we round? Suppose x ∗ is the fractional solution obtained by solving the LP optimally. We do the following two phases: 1. Phase 1: Repeat t = ln n times: for each set S, pick S with probability xS∗ independently. 2. Phase 2: For each element e yet uncovered, pick any set covering it. Clearly the solution produced by the algorithm is feasible; it just remains to bound the number of sets picked by it. Theorem 24.2. The expected number of sets picked by this algorithm is (ln n) LP( I ) + 1. 257 258 the bin packing problem Proof. Clearly, the expected number of sets covered in each round in phase 1 is ∑S xS∗ = LP( I ), and hence the expected number of sets in phase 1 is at most ln n times as much. For the second phase, the number of sets not picked is precisely the the expected number of elements not covered in Phase 1. To calculate this, consider an arbitrary element e. Pr[e not covered in phase 1] = (ΠS:e∈S (1 − xS∗ ))t ≤ (e− ∑S:e∈S xS )t ≤ ( e −1 ) t 1 = , n since t = ln n. By linearity of expectations, the expected number of uncovered elements in Phase 2 should be 1, so in expectation we’ll pick 1 set in Phase 2. This completes the proof. In a homework problem, we will show that if the sizes of the sets are bounded by B, then we can get a (1 + ln B)-approximation as well. And that the analysis can extend to the weighted case, where sets have costs. 24.5 The Bin Packing Problem Bin Packing is another classic NP-hard optimization problem. We are givenn items, each item i having some size si ∈ [0, 1]. We want to find the minimum number of bins, each with capacity 1, such that we can pack all n items into them. Formally, we want to find the partition of [n] into S1 ∪ S2 ∪ . . . ∪ Sk such that ∑i∈S j si ≤ 1 for each set S j , and moreover, the number of parts k is minimized. The Bin Packing is NP-hard, and this can be shown via a reduction from the Partition problem (where we are given n positive integers s1 , s2 , . . . , sn and an integer K such that ∑in=1 si = 2K, and we want to decide whether we can find a partition of these integers into two disjoint sets A, B such that ∑i∈ A si = ∑ j∈ B s j = K). Since this partition instance corresponds gives us Bin Packing instances where the optimum is either 2 or at least 3, the reduction shows that getting an approximation factor of smaller than 3/2 for Bin Packing is also NP-hard. We show two algorithms for this problem. The first algorithm First-Fit uses at most 2 Opt bins, whereas the second algorithm uses at most (1 + ε) Opt +O(1/ε2 ) bins. These are not the best results possible: e.g., a recent result by Rebecca Hoberg and Thomas Rothvoß gives a solution using at most Opt +O(log Opt) bins, and it is conceivable that we can get an algorithm that uses Opt +O(1) bins. One can derandomize this algorithm to get a deterministic algorithm with the same guarantee. We may see this in an exercise. Also, Neal Young has a way to solve this problem without solving the LP at all! 
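Here is a sketch of the relax-and-round algorithm of Section 24.4 for the unweighted problem. The function name and the data representation (elements as hashables, sets as Python sets) are mine, and the use of scipy's LP solver is an assumption rather than anything prescribed by the notes.

import math, random
import numpy as np
from scipy.optimize import linprog

def lp_round_set_cover(universe, sets):
    # Solve (LP-SC): min sum_S x_S  s.t.  sum_{S: e in S} x_S >= 1 for all e, x_S >= 0.
    n, m = len(universe), len(sets)
    A_ub = np.array([[-1.0 if e in S else 0.0 for S in sets] for e in universe])
    x = linprog(c=np.ones(m), A_ub=A_ub, b_ub=-np.ones(n), bounds=[(0, 1)] * m).x

    picked, covered = set(), set()
    for _ in range(math.ceil(math.log(n))):          # Phase 1: ln n independent rounds
        for j in range(m):
            if j not in picked and random.random() < x[j]:
                picked.add(j)
                covered |= sets[j]
    for e in universe:                               # Phase 2: patch up uncovered elements
        if e not in covered:
            j = next(j for j in range(m) if e in sets[j])
            picked.add(j)
            covered |= sets[j]
    return picked

By Theorem 24.2, the expected number of sets returned is at most (ln n) · LP(I) + 1; for the weighted version only the objective vector changes.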
approximation algorithms 24.5.1 259 A Class of Greedy Algorithms: X-Fit We can define a collection of greedy algorithms that consider the items in some arbitrary order: for each item they try to fit it into some “open” bins; if the item does not fit into any of these bins, then they open a new bin and put it there. Here are some of these algorithms: 1. First-Fit: add the item to the earliest opened bin where it fits. 2. Next-Fit: add the item to the single most-recently opened bin. 3. Best-Fit: consider the bins in increasing order of free space, and add the item to the first one that can take it. 4. Worst-Fit: consider the open bins in decreasing(!) order of free space, and add the item to the first one that can take it. The idea is to ensure that no bin has small amounts of free space remaining, which is likely to then get wasted. All these algorithms are 2-approximations. Let us give a proof for First-Fit, the others have similar proofs. Theorem 24.3. AlgFF ( I ) ≤ 2 · Opt( I ). Proof. The surrogate in this case is the total volume V ( I ) = ∑i si of items. Clearly, ⌈V ( I )⌉ ≤ OPT ( I ). Now consider the bins in the order they were opened. For any pair of consecutive bins 2j − 1, 2j, the first item in bin 2j could not have fit into bin 2j − 1 (else we would not have opened the new bin). So the total size of items in these two consecutive bins is strictly more than 1. Hence, if we open K bins, the total volume of items in these bins is strictly more than ⌊K/2⌋. Hence, Another way to say this: at most one bin is at-most-half-full, because if there were two, the later of these bins would not have been opened. ⌊K/2⌋ < V ( I ) =⇒ K ≤ 2 ⌈V ( I )⌉ ≤ 2 Opt( I ). Exercise: if all the items were of size at most ε, then each bin (except the last one) would have at least 1 − ε total size, thereby giving an approximation of 1 Opt( I ) + 1 ≈ (1 + ε) Opt( I ) + 1. 1−ε 24.6 The Linear Grouping Algorithm for Bin Packing The next algorithm was given by Wenceslas Fernandez de la Vega and G.S. Luecker, and it uses a clever linear programming idea to get an “almost-PTAS” for Bin Packing. Observe that we cannot hope to get a PTAS, because of the hardness result we showed above. But Recall, a PTAS (polynomial-time approximation scheme) is an algorithm that for any ε > 0 outputs a (1 + ε)approximation in time n f (ε) . Hence, we can get the approximation factor to any constant above 1 as we want, and still get polynomial-time—just the degree of the polynomial in the runtime gets larger. 260 the linear grouping algorithm for bin packing we will show that if we allow ourselves a small additive term, the hardness goes away. The main ideas here will be the following: 1. We can discretize the item sizes down to a constant number of values by losing at most ε Opt (where the constant depends on ε). 2. The problem for a constant number of sizes can be solved almost exactly (up to an additive constant) in polynomial-time. 3. Items of size at most ε can be added to any instance while maintaining an approximation factor of (1 + ε). 24.6.1 The Linear Grouping Procedure Lemma 24.4 (Linear Grouping). Given an instance I = (s1 , s2 , . . . , sn ) of Bin Packing, and a parameter D ∈ N, we can efficiently produce another instance I ′ = (s1 ′ , s2 ′ , . . . , sn ′ ) with increased sizes si ′ > si and at most D distinct item sizes, such that Opt( I ′ ) ≤ Opt( I ) + ⌈n/D⌉ . Proof. The instance I ′ is constructed as follows: • Sort the sizes si in non-increasing order to get s1 ≥ s2 ≥ . . . ≥ sn . 
• Group items into D groups of ⌈n/D⌉ consecutive items, with the last group being potentially slightly smaller. • Define the new size si ′ for each item i to be the size of the largest element in i’s group. There are D distinct item sizes, and all sizes are only increased, so it remains to show a packing for the items in I ′ that uses at most Opt( I ) + ⌈n/D ⌉ bins. Indeed, suppose Opt( I ) assigns item i to some bin b. Then we assign item (i + ⌈n/D⌉) to bin b. Since the sizes of the items only get smaller, this allocates all the items except items in first group, without violating the sizes. Now we assign each item in the first group into a new bin, thereby opening up ⌈n/D⌉ more bins. 24.6.2 An Algorithm for a Constant Number of Item Sizes Suppose we have an instance with at most D distinct item sizes: let the sizes be s1 < s2 < . . . < SD , with δ > 0 being the smallest size. The instance is then defined by the number of items for each size. Define a configuration to be a collection of items that fits into a bin: there can be at most 1/s1 items in any bin, and each item has one of D sizes (or it can be the “null” item), so there are at most N := ( D + 1)1/s1 different configurations. Note that if D and s1 are both constants, this is (large) constant. (In the next section, we use approximation algorithms 261 this result for the case where s1 ≥ ε.) Let C be the collection of all configurations. We now use an integer LP due to Paul Gilmore and Ralph Gomory (from the 1950s). It has one variable xC for every configuration C ∈ C that denotes the number of bins with configuration C in the solution. The LP is: min ∑ xC , C ∈C s.t. ∑ ACs xC ≥ ns , C ∀ sizes s xC ∈ N. Here ACs is the number of items of type s being placed in the configuration C, and ns is the total number items of size s in the instance. This is an exact formulation, and relaxing the integrality constraint to xC ≥ 0 gives us an LP that we can solve in time poly( N, n). This is polynomial time when N is a constant. We use the optimal value of this LP as our surrogate. How do we round the optimal solution for this LP? There are only D non-trivial constraints in the LP, and N non-negativity constraints. So if we pick an optimal vertex solution, it must have some N of these constraints at equality. This means at least N − D of these tight constraints come from the latter set, and therefore N − D variables are set to zero. In other words, at most D of the variables are nonzero. Rounding these variables up to the closest integer, we get a solution that uses at most LP( I ) + D ≤ Opt( I ) + D bins. Since D is a constant, we have approximated the solution up to a constant. 24.6.3 The Final Bin Packing Algorithm Combining the two ideas, we get a solution that uses Opt( I ) + ⌈n/D ⌉ + D bins. Now if we could ensure that n/D were at most ε Opt( I ), when D was f (ε), we would be done. Indeed, if all the items have size at least ε, the total volume (and therefore Opt( I )) is at least εn. If we   now set D = 1/ε2 , then n/D ≤ ε2 n ≤ ε Opt( I ), and the number of bins is at most l m (1 + ε) Opt( I ) + 1/ε2 . What if some of the items are smaller than ε? We now use the observation that First-Fit behaves very well when the item sizes are small. Indeed, we first hold back all the items smaller than ε, and solve the remaining instance as above. 
Then we add in the small items using First-Fit: if it does not open any new bins, we are In fact, we show in a homework problem that the LP can be solved in time polynomial in n even when N is not a constant. 262 subsequent results and open problems fine. And if adding these small items results in opening some new bin, then each of the existing bins—and all the newly opened bins (except the last one)—must have at least (1 − ε) total size in them. The number of bins is then at most 1 Opt( I ) + 1 ≈ (1 + O(ε)) Opt( I ) + 1, 1−ε as long as ε ≤ 1/2. 24.7 Subsequent Results and Open Problems 25 Approximation Algorithms via SDPs Just like the use of linear programming was a major advance in the design of approximation algorithms, specifically in the use of linear programs in the relax-and-round framework, another significant advantage was the use of semidefinite programs in the same framework. For instance, the approximation guaranteee for the Max-Cut problem was improved from 1/2 to 0.878 using this technique. Moreover, subsequent results have shown that any improvements to this approximation guarantee in polynomial-time would disprove the Unique Games Conjecture. 25.1 Positive Semidefinite Matrices The main objects of interest in semidefinite programming, not surprisingly, are positive semidefinite matrices. Definition 25.1 (Positive Semidefinite Matrices). Let A ∈ Rn×n be a real-valued symmetric matrix and let r = rank( A). We say that A is positive semidefinite (PSD) if any of the following equivalent conditions hold: a. x⊺ Ax ≥ 0 for all x ∈ Rn . b. All of A’s eigenvalues are nonnegative (with r of them being strictly positive), and hence A = ∑ri=1 λi vi v⊺i for λ1 , . . . , λr > 0, and vi ’s being orthonormal. c. There exists a matrix B ∈ Rn×r such that A = BB⊺ . d. There exist vectors v1 , . . . , vn ∈ Rr such that Ai,j = vi , v j for all i, j. e. There exist jointly distributed (real-valued) random variables X1 , . . . , Xn such that Ai,j = E[ Xi X j ]. f. All principal minors have nonnegative determinants. A principal minor is a submatrix of A obtained by taking the columns and rows indexed by some subset I ⊆ [n]. 264 semidefinite programs The different definitions may be useful in different contexts. As an example, we see that the condition in Definition 25.1(f) gives a short proof of the following claim. Lemma 25.2. Let A ⪰ 0. If Ai,i = 0 then A j,i = Ai,j = 0 for all j. Proof. Let j ̸= i. The determinant of the submatrix indexed by {i, j} is Ai,i A j,j − Ai,j A j,i We will write A ⪰ 0 to denote that A is PSD; more generally, we write A ⪰ B if A − B is PSD: this partial order on symmetric matrices is called the Löwner order. is nonnegative, by assumption. Since Ai,j = A j,i by symmetry, and Ai,i = 0, we get A2i,j = A2j,i ≤ 0 and we conclude Ai,j = A j,i = 0. Definition 25.3 (Frobenius Product). Let A, B ∈ Rn×n . The Frobenius inner product A • B, also written as ⟨ A, B⟩ is defined as ⟨ A, B⟩ := A • B := ∑ Ai,j Bi,j = Tr( A⊺ B). i,j We can think of this as being the usual vector inner product treating A and B as vectors of length n × n. Note that by the cyclic property of the trace, A • xx⊺ = Tr( Axx⊺ ) = Tr( x⊺ Ax ) = x⊺ Ax; we will use this fact to derive yet another of PSD matrices. Lemma 25.4. A is PSD if and only if A • X ≥ 0 for all X ⪰ 0. Proof. Suppose A ⪰ 0. Consider the spectral decomposition X = ⊺ ∑i λi xi xi where λi ≥ 0 by Definition 25.1(b). Then A • X = ∑ λi ( A • xi xi⊺ ) = ∑ λi xi⊺ Axi ≥ 0. 
i i On the other hand, if A ⪰̸ 0, there exists v such that v⊺ Av < 0, by 25.1(a). Let X = vv⊺ ⪰ 0. Then A • X = v⊺ Av < 0. Finally, let us mention a useful fact (which can be proved, e.g., using the x⊺ Ax ≥ 0 characterization of PSD matrices): Fact 25.5 (PSD cone). Given two matrices A, B ⪰ 0, and scalars α, β > 0 then αA + βB ⪰ 0. Hence the set of PSD matrices forms a convex cone in Rn(n+1)/2 . 25.2 Semidefinite Programs Loosely, a semidefinite program (SDP) is the problem of optimizing a linear function over the intersection of a convex polyhedron K (given by finitely many linear constraints, say Ax ≥ b) with the PSD cone K. Let us give two useful packagings for semidefinite programs. Here n(n + 1)/2 is the number of entries on or above the diagonal in an n × n matrix, and completely specifies a symmetric matrix. approximation algorithms via sdps 25.2.1 As Linear Programs with a PSD Constraint Consider a linear program where the variables are indexed by pairs i, j ∈ [n], i.e., a typical variable is xi,j . Let X be the n × n dimensional matrix whose (i, j)th entry is xi,j . As the objective and constraints are linear, we can write them as C • X and Ak • X ≤ bk for some (not necessarily PSD) matrices C, A1 , . . . , Am and scalars b1 , . . . , bm . An SDP is an LP of this form with the additional constraint X ⪰ 0: maximize C•X subject to A k • X ≤ bk , X ∈Rn × n X ⪰ 0. 25.2.2 ∀k ∈ [m] where x denotes the diagonal of the PSD matrix X. As Vector Programs maximize n ∑ cij vi , v j subject to ∑ aij v1 ,...,vn ∈R i,j (k) i,j v i , v j ≤ bk , ∀ k ∈ [ m ]. In particular, we optimize over vectors in n-dimensional space; we cannot restrict the dimension of these vectors, much like we cannot restrict the rank of the matrices X in the previous representation. Examples of SDPs Let A a symmetric n × n real matrix. Here is an SDP to compute the maximum eigenvalue of A: maximize A•X subject to I•X =1 X ∈Rn × n Observe that if each of the matrices Ai and C are diagonal matrices, say with diagonals ai and c, this SDP becomes the linear program max{c⊺ x | a⊺k x ≤ bk , x ≥ 0}, We can use Definition 25.1(d) to rewrite the above program as a “vector program”: where the linear objective and the linear constraints are on inner products of vector variables: 25.2.3 265 (25.1) X⪰0 Lemma 25.6. SDP (25.1) computes the maximum eigenvalue of A. Proof. Let X maximize SDP (25.1) (this exists as the objective is continuous and the feasible set is compact). Consider the spectral decomposition X = ∑in=1 λi xi xi⊺ where λi ≥ 0 and ∥ xi ∥2 = 1. The trace constraint I • X = 1 implies ∑i λi = 1. Thus the objective value A • X = ∑i λi xi⊺ Axi is a convex combination of xi⊺ Axi . Hence without loss of generality, we can put all the weight into one of these terms, in which case X = yy⊺ is a rank-one matrix with ∥y∥2 = 1. By the Courant-Fischer theorem, OPT ≤ max∥y∥2 =1 y⊺ Ay = λmax . 266 sdps in approximation algorithms On the other hand, letting v be a unit eigenvector of A corresponding to λmax , we have that OPT ≥ A • vv⊺ = v⊺ Av = λmax . Here is another SDP for the same problem: minimize t subject to tI − A ⪰ 0. t (25.2) Lemma 25.7. SDP (25.2) computes the maximum eigenvalue of A. Proof. The matrix tI − A has eigenvalues t − λi . And hence the constraint tI − A ⪰ 0 is equivalent to the constraint t − λ ≥ 0 for all its eigenvalues λ. In other words, t ≥ λmax , and thus OPT = λmax . 25.3 SDPs in Approximation Algorithms We now consider designing approximation algorithms using SDPs. 
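Before doing so, here is a quick numerical check of Lemma 25.6, written as a sketch against the cvxpy modeling package; cvxpy is an assumption of this snippet and is not otherwise used in these notes.

import numpy as np
import cvxpy as cp

def max_eigenvalue_sdp(A):
    # SDP (25.1): maximize A . X subject to I . X = 1 and X PSD.
    n = A.shape[0]
    X = cp.Variable((n, n), symmetric=True)
    prob = cp.Problem(cp.Maximize(cp.trace(A @ X)),
                      [cp.trace(X) == 1, X >> 0])
    prob.solve()
    return prob.value

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                          # a random symmetric test matrix
print(max_eigenvalue_sdp(A))               # matches, up to solver tolerance, ...
print(np.linalg.eigvalsh(A).max())         # ... the largest eigenvalue of A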
Recall that given a matrix A, we can check if it is PSD in (strongly) polynomial time, by performing its eigendecomposition. Moreover, if A is not PSD, we can return a hyperplane separating A from the PSD cone. Thus using the ellipsoid method, we can approximate SDPs when OPT is appropriately bounded. Informally, Theorem 25.8 (Informal Theorem). Assuming that the radius of the feasible set is at most exp(poly(⟨SDP⟩)), the ellipsoid algorithm can weakly solve SDP in time poly(⟨SDP⟩, log(1/ε)) up to an additive error of ε. For a formal statement, see Theorem 2.6.1 of Matoušek and Gärtner. However, we will ignore these technical issues in the remainder of this chapter, and instead suppose that we can solve our SDPs exactly. 25.4 The MaxCut Problem and Hyperplane Rounding Given a graph G = (V, E), the MaxCut problem asks us to find a partition of the vertices (S, V \ S) maximizing the number of edges crossing the partition. This problem is NP-complete. In fact assuming P ̸= NP, a result of Johan Håstad shows that we cannot approximate MaxCut better than 17/16 − ε for any ε > 0. 25.4.1 In fact, it turns out that this SDP is dual to the one in (25.1). Weak duality still holds for this case, but strong duality does not hold in general for SDPs. Indeed, there could be a duality gap for some cases, where both the primal and dual are finite, but the optimal solutions are not equal to each other. However, under some mild regularity conditions (e.g., the Slater conditions) we can show strong duality. More about SDP duality here. Greedy and Randomized Algorithms We begin by considering a greedy algorithm: process the vertices v1 , . . . , vn in some order, and place each vertex vi in the part of the bipartition that maximizes the number of edges cut so far (breaking ties arbitrarily). We know that there is an optimal LP solution where the numbers are singly exponential, and hence can be written using a polynomial number of bits. But this is not true in SDPs, in fact, OPT in an SDP may be as large (or small) as doubly exponential in the size of the SDP. (See Section 2.6 of the Matoušek and Gärtner.) approximation algorithms via sdps 267 Lemma 25.9. The greedy algorithm cuts at least | E|/2-many edges. Proof. Let δi be the number of edges from vertex i to vertices j < i: then the greedy algorithm cuts at least ∑i δi /2 = | E|/2 edges. This result shows two things: (a) every graph has a bipartition that cuts half the edges of the graph, so Opt ≥ | E|/2. Moreover, (b) that since Opt ≤ | E| on any graph, this means that Alg ≥ | E|/2 ≥ Opt /2. Here’s a simple randomized algorithm: place each vertex in either S or in S̄ independently and uniformly at random. Since each edge is cut with probability 1/2, the expected number of cut edges is | E|/2. Moreover, by the probabilistic method Opt ≥ | E|/2. 25.4.2 We cannot hope to prove a better result than Lemma 25.9 in terms of | E|, since the complete graph Kn has (n2 ) ≈ n2 /2 edges and any partition can cut at most n2 /4 of them. Relax-and-Round using LPs A natural direction would be to write an ILP formulation for MaxCut and to relax it: this approach does not give us anything beyond a factor of 1/2, say. 25.4.3 A Semidefinite Relaxation We now see a well-known example of an SDP-based approximation algorithm due to Michel Goemans and David Williamson. Again, we will use the relax-and-round framework from the previous chapter. The difference is that we write a quadratic program to model the problem exactly, and then relax it to get an SDP. 
Indeed, observe that the MaxCut problem can be written as the following quadratic program. maximize x1 ,...,xn ∈R ( x i − x j )2 ∑ 4 (i,j)∈ E subject to xi2 = 1 (25.3) ∀i. Since each xi is real-valued, and xi2 = 1, each variable must be assigned one of two labels {−1, +1}. Since each term in the objective contributes 1 for an edge connecting two vertices in different partitions, and 0 otherwise, this IP precisely captures MaxCut. We now relax this program by replacing the variables xi with vector variables vi ∈ Rn , where ∥vi ∥2 = 1. maximize n v1 ,...,vn ∈R subject to ∥ v i − v j ∥2 ∑ 4 (i,j)∈ E 2 ∥ vi ∥ = 1 (25.4) ∀i. Noting that ∥vi − v j ∥2 = ∥vi ∥2 + ∥v j ∥2 − 2 vi , v j = 2 − 2 vi , v j , we rewrite this vector program as The SDP relaxation for the MaxCut problem was first introduced by Svata Poljak and Franz Rendl. 268 the maxcut problem and hyperplane rounding maximize n 1 − vi , v j 2 (i,j)∈ E subject to ⟨ vi , vi ⟩ = 1 v1 ,...,vn ∈R ∑ (25.5) ∀i. This is a relaxation of the original quadratic program, because we can model any {−1, +1}-valued solution using vectors, say by a corresponding {−e1 , +e1 }-valued solution. Since this is a maximization problem, the SDP value is now at least the optimal value of the quadratic program. 25.4.4 The Hyperplane Rounding Technique In order to round this vector solution {vi } to the MaxCut SDP into an integer scalar solution to MaxCut, we use the remarkably simple method of hyperplane rounding. The idea is this: a term in the SDP objective incurs a tiny cost close to zero when vi , v j are very close to each other, and almost unit cost when vi , v j point in nearly opposite directions. So we would like to map close vectors to the same value. To do this, we randomly sample a hyperplane through the origin and partition the vectors according to the side on which they land. Formally, this corresponds to picking a vector g ∈ Rn according to the standard n-dimensional Gaussian distribution, and setting S : = { i | ⟨ v i , g ⟩ ≥ 0}. v1 g We now argue that this procedure gives us a good cut in expectation; this procedure can be repeated to get an algorithm that succeeds with high probability. Theorem 25.10. The partition produced by the hyperplane rounding algorithm cuts at least αGW · SDP edges in expectation, where αGW := 0.87856. Proof. By linearity of expectation, it suffices to bound the probability of an edge (i, j) being cut. Let θij := cos−1 ( vi , v j ) be the angle between the unit vectors vi and v j . Now consider the 2-dimensional plane P containing vi , v j and the origin, and let ge be the projection of the Gaussian vector g onto this plane. Observe that the edge (i, j) is cut precisely when the hyperplane defined by g separates vi , v j . This is precisely when the vector perpendicular to ge in the plane P lands between vi and v j . As the projection onto a subspace of the standard Gaussian is again a standard Guassian (by spherical symmetry), Pr[(i, j) cut] = 2θij θij = . 2π π v3 v4 Figure 25.1: A geometric picture of Goemans-Williamson randomized rounding v2 θij approximation algorithms via sdps 269 g̃ Since the SDP gets a contribution of 1 − vi , v j 1 − cos(θi,j ) = 2 2 for this edge, it suffices to show that θ 1 − cos θ ≥α . π 2 Indeed, we can show (either by plotting, or analytically) that α = 0.87856 . . . suffices for the above inequality, and hence E[# edges cut] = 1 − cos(θij ) = α SDP . 2 (i,j)∈ E ∑ θij /π ≥ α ∑ (i,j)∈ E This proves the theorem. Corollary 25.11. 
For any ε > 0, repeating the hyperplane rounding algorithm O(1/ε log 1/δ) times and returning the best solution ensures that we output a cut of value at least (.87856 − ε) Opt with probability 1 − δ. We leave this proof as an exerise in using Markov’s inequality: note that we want to show that the algorithm returns something not too far below the expecation, which seems to go the wrong way, and hence requires a moment’s thought. The above algorithm is randomized and the result only holds in expectation. However, it is possible to derandomize this result to obtain a polynomial-time deterministic algorithm with the same approximation ratio. 25.4.5 Subsequent Work and Connections Can we get a better approximation factor, perhaps using a more sophisticated SDP? An influential result of Subhash Khot, Guy Kindler, Elchanan Mossel, and Ryan O’Donnell says that a constant-betterthan-αGW -approximation would refute the Unique Games Conjecture. Also, one can ask if similar rounding procedures exist for an linear-programming relaxation as opposed to the SDP relaxation here. Unfortunately the answer is again no: a result of Siu-On Chan, James Lee, Prasad Raghavendra, and David Steurer shows that no polynomial-sized LP relaxation of MaxCut can obtain a non-trivial approximation factor, that is, any polynomial sized LP of MaxCut has an integrality gap of 1/2. 25.5 Coloring 3-Colorable Graphs Suppose we are given a graph G = (V, E) and a promise that there is some 3-coloring of G. What is the minimum k such that we can Figure 25.2: Angle between two vectors. We cut edge (i, j) when the vector perpendicular to ge lands in the grey area. 270 coloring 3-colorable graphs find a k-coloring of G in polynomial time? It is well-known that 2coloring a graph can be done in linear time, but 3-coloring a graph is NP-complete. Hence, even given a 3-colorable graph, it is NP-hard to color it using 3 colors. (In fact, a result of Venkat Guruswami and Sanjeev Khanna shows that it is NP-hard to color it using even 4 colors.) But what if we ask to color a 3-colorable graph using 5 colors? O(log n) colors? O(nα ) colors, for some fixed constant α? We will see √ an easy algorithm to achieve an O( n)-coloring, and then will use e (nlog6 (2) ) colorsemidefinite programming to improve this to an O ing. Before we describe these, let us recall the easy part of Brooks’ theorem. Lemma 25.12. Let ∆ be the maximum degree of a graph G, then we can find a (∆ + 1)-coloring of G in linear time. Proof. Pick any vertex v, recursively color the remaining graph, and then assign v a color not among the colors of its ∆ neighbors. We will now describe an algorithm that colors a 3-colorable graph √ G with O( n) colors, originally due to Avi Wigderson: while there √ exists a vertex with degree at least n, color it using a fresh color. Moreover, its neighborhood must be 2-colorable, so use two fresh √ colors to do so. This takes care of n vertices using 3 colors. Remove these, and repeat. Finally, use Lemma 25.12 to color the remaining √ vertices using n colors. This proves the following result. Lemma 25.13. There is an algorithm to color a 3-colorable graph with √ O( n) colors. 25.5.1 An Algorithm using SDPs Let’s consider an algorithm that uses SDPs to color a 3-colorable e (∆log3 2 ) ≈ O e (∆0.63 ) colors. graph with maximum degree ∆ using O In general ∆ could be as large as n, so this could be worse than the algorithm in Lemma 25.13, but we will be able to combine the ideas together to get a better result. 
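Since the coloring algorithm below reuses the random-hyperplane idea, it may help to first see the Goemans-Williamson rounding step of Section 25.4.4 as code. This is a sketch: the function name is mine, and it assumes the SDP has already been solved and its solution factored so that the unit vectors v_i are available as the rows of a matrix V.

import numpy as np

def hyperplane_round(V, edges, trials=50, rng=None):
    # V is n x d with row i the unit vector v_i from SDP (25.5); edges is a list
    # of pairs. Sample a Gaussian g, set S = { i : <v_i, g> >= 0 }, and keep the
    # best of a few trials (cf. Corollary 25.11).
    rng = rng or np.random.default_rng()
    n, d = V.shape
    best_S, best_cut = set(), -1
    for _ in range(trials):
        g = rng.standard_normal(d)
        side = V @ g >= 0
        cut = sum(1 for i, j in edges if side[i] != side[j])
        if cut > best_cut:
            best_S, best_cut = {i for i in range(n) if side[i]}, cut
    return best_S, best_cut

Each individual trial cuts at least α_GW · SDP edges in expectation, by Theorem 25.10.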
For some parameter λ ∈ R, consider the following feasibility SDP (where we are not optimizing any objective): find subject to v 1 , . . . , v n ∈ Rn vi , v j ≤ λ ⟨ vi , vi ⟩ = 1 ∀(i, j) ∈ E (25.6) ∀i ∈ V. Why is this SDP relevant to our problem? The goal is to have vectors clustered together in groups, such that each cluster represents a color. Intuitively, we want to have vectors of adjacent vertices to be far apart, so we want their inner product to be close to −1 (recall we are The harder part is to show that in fact ∆ colors suffice unless the graph is either a complete graph, or an odd-length cycle. approximation algorithms via sdps 271 dealing with unit vectors, due to the last constraint) and vectors of the same color to be close together. Lemma 25.14. For 3-colorable graphs, SDP (25.6) is feasible with λ = −1/2. Proof. Consider the vector placement shown in the figure to the right. If the graph is 3-colorable, we can assign all vertices with color 1 the red vector, all vertices with color 2 the blue vector and all vertices with color 3 the green vector. Now for every edge (i, j) ∈ E, we have that   2π vi , v j = cos = −1/2. 3 At first sight, it may seem like we are done: if we solve the above SDP with λ = −1/2, don’t all three vectors look like the figure above? No, that would only hold if all of them were to be co-planar. And in n-dimensions we can have an exponential number of cones of angle 2π 3 , like in the next figure, so we cannot cluster vectors as easily as in the above example. To solve this issue, we apply a hyperplane rounding technique similar to that from the MaxCut algorithm. Indeed, for some parameter t we will pick later, pick t random hyperplanes. Formally, we pick gi ∈ Rn from a standard n-dimensional Gaussian distribution, for i ∈ [t]. Each of these defines a normal hyperplane, and these split the Rn unit sphere into 2t regions (except if two of them point in the same direction, which has zero probability). Now, each vectors {vi } that lie in the same region can be considered “close” to each other, and we can try to assign them a unique color. Formally, this means that if vi and v j are such that sign(⟨vi , gk ⟩) = sign( v j , gk ) for all k ∈ [t], then i and j are given the same color. Each region is given a different color, of course. However, this may color some neighbors with the same color, so we use the method of alterations: while there exists an edge between vertices of the same color, we uncolor both endpoints. When this uncoloring stops, we remove the still-colored vertices from the graph, and then repeat the same procedure on the remaining graph, until we color every vertex. Note that since we use t hyperplanes, we add at most 2t new colors per iteration. The goal is to now show that (a) the number of interations is small, and (b) the value of 2t is also small. Lemma 25.15. If half of the vertices are colored in a single iteration in expectation, then the expected number of iterations to color the whole graph is O(log n). 120◦ 120◦ 120◦ Figure 25.3: Optimal distribution of vectors for 3-coloring graph Figure 25.4: Dimensionality problem of 2π/3 far vectors 272 coloring 3-colorable graphs Proof. Since the expected number of uncolored vertices is at most half, Markov’s inequality says that more than 3/4 of the vertices are uncolored in a single iteration, with probability at most 2/3. In other words, at least 1/4 of the vertices are colored with probability 1/3. 
Hence, the number of iterations to color the whole graph is dominated by the number of flips of a coin of bias 1/3 to get log4 n heads. This is 4 log4 n, which proves the result. Lemma 25.16. The expected number of vertices that remain uncolored after a single iteration is at most n∆ (1/3)t . Proof. Fix an edge ij: for a single random hyperplane, the probability that vi , v j are not separated by it is π − θij 1 ≤ , π 3 using that θij ≥ 2π 3 which follows from the constraint in the SDP. Now if i is uncolored because of j, then vi , v j have the same color, which happens when all t hyperplanes fail to separate the two. By independence, this happens with probability at most (1/3)t . Finally, E[remaining] = ∑ Pr[i uncolored] i ∈V ≤ ∑ ∑ Pr[i uncolored because of j]. (25.7) i ∈V (i,j)∈ E There are n vertices, and each vertex has degree at most ∆, which proves the result. Lemma 25.17. There is an algorithm that colors a 3-colorable graph with maximum degree ∆ with O(∆log3 2 · log n) colors in expectation. Proof. Setting t = log3 (2∆) in Lemma 25.16, the expected number of uncolored vertices in any iteration is n · ∆ · (1/3)t ≤ n/2. (25.8) Now Lemma 25.15 says we perform O(log n) iterations in expectation. Since we use most 2log3 (2∆) = (2∆)log3 2 colors in each iteration, we get the result. 25.5.2 Improving the Algorithms Further The expected number of colors used by the above algorithm is √ e (nlog3 2 ) ≈ O e (n0.63 ), which is worse than our initial O( n) algoO rithm. However we can combine the ideas together to get a better result: approximation algorithms via sdps Theorem 25.18. There is an algorithm that colors a 3-colorable graph with e (nlog6 (2) ) colors. O Proof. For some value σ, repeatedly remove vertices with degree greater than σ and color them and their neighbors with 3 new colors, as in Lemma 25.13. This requires at most 3n/σ colors overall, and leaves us with a graph having maximum degree σ. Now use Lemma 25.17 to color the remaining graph with O(σlog3 2 · log n) colors. Picking σ to be nlog6 3 to balance these terms, we get a procedure e (nlog6 2 ) ≈ O e (n0.38 ) colors. that uses O 25.5.3 Final notes on coloring 3-colorable graphs This result us due to David Karger, Rajeev Motwani, and Madhu Sudan. They gave a better rounding algorithm that uses spherical caps e (n1/4 ) colors. This result was instead of hyperplanes to achieve O then improved over a sequence of papers: the current best result by Ken-Ichi Kawarabayashi and Mikkel Thorup uses O(n0.199 ) colors. It remains an outstanding open problem to either get a better algorithm, or to show hardness results, even under stronger complexitytheoretic hypotheses. 273 26 Online Algorithms In this chapter we introduce online algorithms and study two classic online problems: the rent-or-buy problem, and the paging problem. While the models we consider are reminiscent of those in regret minimization in online learning and online convex optimization, there are some important differences, which lend a different flavor to the results that are obtained. 26.1 The Competitive Analysis Framework In the standard setting of online algorithms there is a sequence of requests σ = (σ1 , σ2 , . . . , σt , . . .) received online. An online algorithm does not know the input sequence up-front, but sees these requests one by one. It must serve request σi before it is allowed to see σi+1 . Serving this request σi involves some choice of actions, and incurs some cost. 
We will measure the performance of an algorithm by the ratio of the total cost it incurs on σ to the optimal cost of serving σ in hindsight. To make all this formal, let us see an example of an online problem.

26.1.1 Example: The Paging Problem

The paging problem arises in computer memory systems. Often, a memory system consists of a large but slow main memory, as well as a small but fast memory called a cache. The CPU typically communicates directly with the cache, so in order to access an item that is not contained in the cache, the memory system has to load the item from the slow memory into the cache. Moreover, if the cache is full, then some item contained in the cache has to be evicted to create space for the requested item. We say that a cache miss occurs whenever there is a request to an item that is not currently in the cache. The goal is to come up with an eviction strategy that minimizes the number of cache misses. Typically we do not know the future requests that the CPU will make, so it is sensible to model this as an online problem. We let U be a universe of n items or pages. The cache is a memory containing at most k pages. The requests are pages σi ∈ U, and the online algorithm is an eviction policy.

Figure 26.1: Illustration of the Paging Problem.

Now we return to defining the performance of an online algorithm.

26.1.2 The Competitive Ratio

As we said before, the online algorithm incurs some cost as it serves each request. If the complete request sequence is σ, then we let Alg(σ) be the total cost incurred by the online algorithm in serving σ. Similarly, we let Opt(σ) be the optimal cost in hindsight of serving σ. Note that Opt(σ) represents the cost of an optimal offline algorithm that knows the full sequence of requests. We define the competitive ratio of an algorithm to be

    max_σ  Alg(σ) / Opt(σ).

In some sense this is an "apples to oranges" comparison, since the online algorithm does not know the full sequence of requests, whereas the optimal cost is aware of the full sequence and hence is an "offline" quantity. Note two differences from regret minimization: there we made a prediction x_t before (or concurrently with) seeing the function f_t, whereas we now see the request σ_t before we produce our response at time t. In this sense, our problem is easier. However, the benchmark is different: we now have to compare with the best dynamic sequence of actions for the input sequence σ, whereas regret is typically measured with respect to a static response, i.e., to the cost of playing the same fixed action in each of the t steps. In this sense, we are now solving a harder problem. There is a smaller, syntactic difference as well: regret is an additive guarantee, whereas the competitive ratio is a multiplicative guarantee. But this is more a reflection of the kind of results that are possible, rather than a fundamental difference between the two models.

Exercise: if the entire sequence of requests is known, show that Belády's rule is optimal: evict the page in cache that is next requested furthest in the future.

26.1.3 What About Randomized Algorithms?

The above definitions generally hold for deterministic algorithms; how should we characterize randomized algorithms? For the deterministic case we generally think about some adversary choosing the worst possible request sequence for our algorithm. For randomized algorithms we could consider either oblivious or adaptive adversaries.
Oblivious adversaries fix the input sequence up front and then Figure 26.1: Illustration of the Paging Problem online algorithms let the randomized algorithm process it. An adaptive adversary is allowed to see the results of the coin flips the online algorithm makes and thus adapt its request sequence. We focus on oblivious adversaries in these notes. To define the performance of a randomized online algorithm we just consider the expected cost of the algorithm. Against an oblivious adversary, we say that a randomized online algorithm is αcompetitive if for all request sequences σ, E[Alg(σ)] ≤ α · Opt(σ ). 26.2 The Ski Rental Problem: Rent or Buy? Now that we have a concrete analytical framework, let’s apply it to a simple problem. Suppose you are on a ski trip with your friends. On each day you can choose to either rent or buy skis. Renting skis costs $1, whereas buying skis costs $B for B > 1. However, the benefit of buying skis is that on subsequent days you do not need to rent or buy again, just use the skis you already bought. The problem that arises is that for some mysterious reason we do not know how long the ski trip will last. On each morning we are simply told whether or not the trip will continue that day. The goal of the problem is to find a rent/buy strategy that is competitive with regards to minimizing the cost of the trip. In the notation that we developed above, the request for the i’th day, σi , is either “Y” or “N” indicating whether or not the ski trip continues that day. We also now that once we see a “N” request that the request sequence has ended. For example a possible sequence might be σ = (Y, Y, Y, N ). This allows us to characterize all instances of the problem as follows. Let Ij be the sequence where the ski trip ends on day j. Suppose we knew ahead of time what instance we received, then we have that Opt( Ij ) = min{ j, B} since we can choose to either buy skis on day 1 or rent skis every day depending on which is better. 26.2.1 Deterministic Rent or Buy We can classify and analyze all possible deterministic algorithms since an algorithm for this problem is simply a rule deciding when to buy skis. Let Algi be the algorithm that rents skis until day i, then buys on day i if the trip lasts that long. The cost on instance Ij is then Algi ( Ij ) = (i − 1 + B) · 1{i≤ j} + j · 1{i> j} . What is the best deterministic algorithm from the point of view of competitive analysis? The following claims answer this question. 277 278 the ski rental problem: rent or buy? Lemma 26.1. The competitive ratio of algorithm AlgB is 2 − 1/B and this is the best possible ratio for any deterministic algorithm. Proof. There are two cases to consider j < B and j ≥ B. For the first case, AlgB ( Ij ) = j and Opt( Ij ) = j, so AlgB ( Ij )/ Opt( Ij ) = 1. In the second case, AlgB ( Ij ) = 2B − 1 and Opt( Ij ) = B, so AlgB ( Ij )/ Opt( Ij ) = 2 − 1/B. Thus the competitive ratio of AlgB is AlgB ( Ij ) max = 2 − 1/B Opt( Ij ) Ij Now to show that this is the best possible competitive ratio for any deterministic algorithm. Consider algorithm Algi . We find an instance Ij such that Algi ( Ij )/ Opt( Ij ) ≥ 2 − 1/B. If i ≥ B then we take j = B so that Algi ( Ij ) = (i − 1 + B) and Opt( Ij ) = B so that Algi ( Ij ) i−1+B i 1 1 = = +1− ≥ 2− Opt( Ij ) B B B B Now if i = 1, we take j = 1 so that Algi ( Ij ) B = ≥2 Opt( Ij ) 1 Since B is an integer > 1 by assumption. Now for 1 < i < B, we take j = ⌊(i − 1 + B)/(2 − 1/B)⌋ ≥ 1 so that Algi ( Ij ) 1 ≥ 2− . 
Opt( Ij ) B 26.2.2 Randomized Rent or Buy Can randomization improve over deterministic algorithms in terms of expected cost? We will show that this is in fact the case. So how do we design a randomized algorithm for this problem? We use the following general insight about randomized algorithms, notably that a randomized algorithm is a distribution over deterministic algorithms. To keep things simple let’s consider the case when B = 4. We construct the following table of payoffs where the rows correspond to deterministic algorithms Algi and the columns correspond to instances Ij . Alg1 Alg2 Alg3 Alg4 I1 4/1 1/1 1/1 1/1 I2 4/2 5/2 2/2 2/2 I3 4/3 5/3 6/3 3/3 I∞ 4/4 5/4 6/4 7/4 online algorithms 279 While the real game is infinite along with coordinates, we do not need to put columns for I4 , I5 , . . . because these strategies for the adversary are dominated by the column I∞ . Now given that the adversary would rather give us only these inputs, we do not need to put rows after B = 4 since buying after day B is worse than buying on day B for these inputs. (Please check these for yourself!) This means we can think of the above table as a 2-player zero-sum game with 4 strategies each. The row player chooses an algorithm and the column player chooses an instance, then the number in the corresponding entry indicates the loss of the row player. Thinking along the lines of the Von Neumann minimax theorem, we can consider mixed strategies for the row player to construct a randomized algorithm for the ski rental problem. Let pi be the probability of our randomized algorithm choosing row i. What is the expected cost of this algorithm? Suppose that the competitive ratio was at most c in expectation. The expected competitive ratio of our algorithm against each instance should be at most c, so this yields the following linear constraints. 4p1 + p2 + p3 + p4 ≤ c 4p2 + 5p2 + 2p3 + 2p4 ≤c 2 4p1 + 5p3 + 6p3 + 3p4 ≤c 3 4p1 + 5p2 + 6p3 + 7p4 ≤c 4 We would like to minimize c subject to p1 + p2 + p3 + p4 = 1 and pi ≥ 0. It turns out that one can do this by solving the following system of equations: p1 + p2 + p3 + p4 = 1 4p1 + p2 + p3 + p4 = c 4p1 + 5p2 + 2p3 + 2p4 = 2c 4p1 + 5p2 + 6p3 + 3p4 = 3c 4p1 + 5p2 + 6p3 + 7p4 = 4c Subtracting each line from the previous one gives us p1 + p2 + p3 + p4 = 1 3p1 = c − 1 4p2 + p3 + p4 = c 4p3 + p4 = c 4p4 = c. Why is it OK to set the inequalities to equalities? Simply because it works out: in essence, we are guessing that making these four constraints tight gives a basic feasible solution—and our guess turns out to right. It does not show optimality, but we can do that by giving a matching dual solution. 280 the ski rental problem: rent or buy? This gives us p4 = c/4, p3 = (3/4)(c/4), etc., and indeed that c= 1 1 − (1 − 1/4)4 and pi = (3/4)i−1 (c/4) for i = 1, 2, 3, 4. For general B, we get c = cB = 1 e . ≤ e−1 1 − (1 − 1/B) B Moreover, this value of c B is indeed the best possible competitive ratio for anyth randomized algorithm for the ski rental problem. How might one prove such a result? We instead consider playing a random instance against a deterministic algorithm. By Von Neumann’s minimax theorem the value of this should be the same as what we considered above. We leave it as an exercise to verify this for the case when B = 4. 26.2.3 (Optional) A Continuous Approach This section is quite informal right now, needs to be made formal. For simplicity, assume B = 1 by scaling, and that both the algorithm and the length of the season can be any real in [0, 1]. 
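As a sanity check of the discrete calculation before carrying out this continuous one, here is a short sketch that recomputes the distribution for general B; the function name and the use of exact fractions are mine. The equations force p_B = c/B and, going up the lines, p_i = (1 − 1/B)^{B−i} · c/B, which is the indexing used below.

from fractions import Fraction

def rent_or_buy_distribution(B):
    # Buy on day i with probability p_i = (1 - 1/B)^(B - i) * c_B / B, where
    # c_B = 1 / (1 - (1 - 1/B)^B); check that every instance I_j sees expected
    # cost exactly c_B * Opt(I_j).
    q = 1 - Fraction(1, B)
    c = 1 / (1 - q ** B)
    p = [q ** (B - i) * c / B for i in range(1, B + 1)]
    assert sum(p) == 1
    for j in list(range(1, B + 1)) + [10 * B]:       # I_1, ..., I_B and a long season
        opt = min(j, B)
        exp_cost = sum(p[i - 1] * ((i - 1 + B) if i <= j else j)
                       for i in range(1, B + 1))
        assert exp_cost == c * opt
    return c, p

print(rent_or_buy_distribution(4)[0])   # 256/175, i.e. about 1.463 < 2 - 1/4

For large B the value tends to e/(e − 1) ≈ 1.58, matching the bound above.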
Now our randomized algorithm chooses a threshold t ∈ [0, 1] from a probability distribution with density function f(t); let's say f is continuous. Then for any season length ℓ, we require

    ∫_{t=0}^{ℓ} (1 + t) f(t) dt + ∫_{t=ℓ}^{1} ℓ f(t) dt = c · ℓ.

(Again, we are setting an equality there without justification, except that it works out.) Now we can differentiate with respect to ℓ to get

    (1 + ℓ) f(ℓ) + ∫_{t=ℓ}^{1} f(t) dt − ℓ f(ℓ) = c.

(This differentiation plays the role of taking differences of successive lines, as we did above.) Simplifying,

    f(ℓ) + ∫_{t=ℓ}^{1} f(t) dt = c.    (26.1)

Taking derivatives again:

    f′(ℓ) − f(ℓ) = 0.

This solves to f(ℓ) = C·e^ℓ for some constant C. Since f is a probability density function, ∫_{ℓ=0}^{1} f(ℓ) dℓ = 1, we get C = 1/(e − 1). Substituting into (26.1), we get that the competitive ratio is c = e/(e − 1), as desired.

26.3 The Paging Problem

Now we return to the paging problem that was introduced earlier, and start by presenting a disappointing fact.

Lemma 26.2. No deterministic algorithm can be better than k-competitive for the paging problem.

Proof. Consider a universe with k + 1 pages in all. In each step the adversary requests a page not currently in the algorithm's cache (there is always at least one such page), so the algorithm's cost over n requests is n. The optimal offline algorithm can cut its losses by always evicting the item that is next requested furthest in the future; it then suffers a cache miss at most once every k steps, so its cost is at most n/k. Thus the competitive ratio of any deterministic algorithm is at least n/(n/k) = k.

It is also known that many popular eviction strategies, such as Least Recently Used (LRU) and First-In First-Out (FIFO), are k-competitive. We will show that a 1-bit variant of LRU is k-competitive, and also that a randomized version of it gives an O(log k)-competitive randomized algorithm for paging.

26.3.1 The 1-bit LRU/Marking Algorithm

The 1-bit LRU/marking algorithm works in phases. The algorithm maintains a single bit for each page in the universe; we say that a page is marked/unmarked if its bit is set to 1/0. At the beginning of each phase, all pages are unmarked. When a page is requested, it is marked; if the requested page is not in the cache, we evict an arbitrary unmarked page and put the requested page in the cache. If there are no unmarked pages to evict, we first unmark all pages and start a new phase.

Lemma 26.3. The marking algorithm is k-competitive for the paging problem.

Proof. Consider the i-th phase of the algorithm. By definition of the algorithm, Alg incurs a cost of at most k during the phase, since we can mark at most k different pages and hence have at most k cache misses in this time. Now consider the first request after the i-th phase ends. We claim that Opt has incurred at least one cache miss by the time of this request: this follows since we have by now seen k + 1 different pages. Summing over all phases, we see that Alg ≤ k · Opt.

Now suppose that instead of evicting an arbitrary unmarked page, we evict an unmarked page chosen uniformly at random. For this randomized marking algorithm we can prove a much better result.

Lemma 26.4. The randomized marking algorithm is O(log k)-competitive.

Proof. We break up the proof into an upper bound on Alg's cost and a lower bound on Opt's cost. Before doing this, we set up some notation. For the i-th phase, let S_i be the set of pages in the algorithm's cache at the beginning of the phase. Now define ∆_i = |S_{i+1} \ S_i|.
Now suppose that instead of evicting an arbitrary unmarked page, we evict an unmarked page chosen uniformly at random. For this randomized marking algorithm we can prove a much better bound.

Lemma 26.4. The randomized marking algorithm is O(log k)-competitive.

Proof. We break the proof into an upper bound on Alg's cost and a lower bound on Opt's cost. First, some notation: for the i-th phase, let S_i be the set of pages in the algorithm's cache at the beginning of the phase, and define ∆_i = |S_{i+1} \ S_i|.

We claim that the expected number of cache misses made by the algorithm in phase i is at most ∆_i (H_k + 1), where H_k is the k-th harmonic number. Summing over all phases then gives E[Alg] ≤ ∑_i ∆_i (H_k + 1). To prove the claim, let R_i be the set of distinct requests in phase i. A request in R_i is called clean if the requested page is not in S_i, and stale otherwise. Every cache miss in the i-th phase is caused by either a clean request or a stale request.

1. The number of cache misses due to clean requests is at most ∆_i, since there can be at most ∆_i clean requests in phase i: each clean request brings a page not belonging to S_i into the cache and marks it, so that page will be in S_{i+1}.

2. To bound the cache misses due to stale requests, suppose there have been c clean requests and s stale requests so far, and consider the (s+1)-st stale request. The probability that this request causes a cache miss is at most c/(k − s), since the clean requests so far have displaced at most c pages chosen uniformly at random from among the k − s stale pages not yet requested in this phase. (This is like the airline seat problem, where we can imagine that c confused passengers get on at the beginning.) Since c ≤ ∆_i, the expected number of misses due to stale requests is at most

  ∑_{s=0}^{k−1} c/(k − s) ≤ ∆_i ∑_{s=0}^{k−1} 1/(k − s) = ∆_i H_k.

Hence the expected total cost in phase i is at most ∆_i H_k + ∆_i = ∆_i (H_k + 1), proving the claim.

Next we claim that Opt ≥ (1/2) ∑_i ∆_i. Let S_i^* be the set of pages in Opt's cache at the beginning of phase i, and let ϕ_i = |S_i \ S_i^*| be the number of pages in S_i missing from Opt's cache at that time. Let Opt_i be the cost that Opt incurs in phase i. Then Opt_i ≥ ∆_i − ϕ_i, since at least this many of the "clean" requests of phase i are misses for Opt as well. Moreover, consider the end of phase i: Alg has the k most recently requested distinct pages in its cache, but Opt is missing ϕ_{i+1} of them, by the definition of ϕ_{i+1}. Hence Opt_i ≥ ϕ_{i+1}. Averaging the two bounds,

  Opt_i ≥ max{ϕ_{i+1}, ∆_i − ϕ_i} ≥ (1/2)(ϕ_{i+1} + ∆_i − ϕ_i).

Summing over all phases, the ϕ terms telescope:

  Opt ≥ (1/2)(∑_i ∆_i + ϕ_final − ϕ_initial) ≥ (1/2) ∑_i ∆_i,

since ϕ_final ≥ 0 and ϕ_initial = 0. Combining the upper and lower bounds yields

  E[Alg] ≤ 2(H_k + 1) Opt = O(log k) Opt.

It can also be shown that no randomized algorithm can be better than Ω(log k)-competitive for the paging problem. For some intuition as to why this might be true, consider the coupon collector problem: if you repeatedly sample a uniformly random number from {1, . . . , k + 1} with replacement, show that the expected number of samples needed to see all k + 1 coupons is (k + 1) H_{k+1}.
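As a quick sanity check on this coupon-collector fact, here is a tiny Python simulation (not from the notes) that estimates the expected number of uniform samples from {1, . . . , k + 1} needed to see every value and compares it with (k + 1) · H_{k+1}.

```python
import random

def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))

def coupon_collector_trial(n):
    """Number of uniform samples from {1, ..., n} needed to see every value."""
    seen, samples = set(), 0
    while len(seen) < n:
        seen.add(random.randint(1, n))
        samples += 1
    return samples

if __name__ == "__main__":
    k = 50
    n = k + 1
    trials = 2000
    avg = sum(coupon_collector_trial(n) for _ in range(trials)) / trials
    print(f"empirical mean: {avg:.1f}, (k+1) * H_(k+1) = {n * harmonic(n):.1f}")
```

The two printed numbers should agree up to sampling noise, which is the fact behind the Ω(log k) lower bound intuition.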
26.4 Generalizing Paging: The k-Server Problem

Another famous problem in online algorithms is the k-server problem. Consider a metric space M = (V, d) with point set V and distance function d : V × V → R+ satisfying the triangle inequality. In the k-server problem there are k servers located at various points of M. At each timestep t we receive a request σ_t ∈ V. If there is already a server at point σ_t, we can serve that request for free; otherwise we move some server from its current point x to σ_t and pay a cost of d(x, σ_t). The goal is to serve all requests while minimizing the total cost of serving them.

The paging problem can be modeled as a k-server problem as follows: we let the pages of the universe U be the points of the metric space and take d(x, y) = 1 for all pages x ̸= y, so the servers' positions correspond to the pages in the cache and each cache miss costs 1. This special case already shows that no deterministic algorithm can be better than k-competitive, and no randomized algorithm better than Ω(log k)-competitive, by the discussion in the previous section. (A small code sketch of this correspondence appears at the end of the section.)

It is conjectured that there is a k-competitive deterministic algorithm for every metric space; the best known result is the (2k − 1)-competitive algorithm of Elias Koutsoupias and Christos Papadimitriou. For randomized algorithms, a poly-logarithmic competitive algorithm was given by Nikhil Bansal, Niv Buchbinder, Aleksander Madry, and Seffi Naor. This was recently improved via an approach based on mirror descent by Sebastien Bubeck, Michael Cohen, Yin Tat Lee, James Lee, and Aleksander Madry; see the paper of Niv Buchbinder, Marco Molinaro, Seffi Naor, and myself for a discretization of that approach.
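To make the correspondence with paging concrete, here is a minimal Python sketch (not from the notes; the class and names are invented) of the k-server cost model, instantiated with the uniform metric: every move costs 1, so the total cost of serving a request sequence equals the number of cache misses.

```python
import random

class KServerInstance:
    """Bare-bones cost accounting for the k-server problem on a metric (V, d)."""

    def __init__(self, k, dist, start_points):
        assert len(start_points) == k
        self.dist = dist                    # dist(x, y): the metric on points
        self.servers = list(start_points)   # current positions of the k servers
        self.cost = 0.0

    def serve(self, request, choose_server):
        """Serve one request; choose_server(servers, request) picks who moves."""
        if request in self.servers:         # some server is already there: free
            return
        i = choose_server(self.servers, request)
        self.cost += self.dist(self.servers[i], request)
        self.servers[i] = request

# Paging as the special case on the uniform metric: every move costs exactly 1.
def uniform_metric(x, y):
    return 0.0 if x == y else 1.0

if __name__ == "__main__":
    k, pages = 3, list(range(4))            # cache size k, universe of k + 1 pages
    inst = KServerInstance(k, uniform_metric, pages[:k])
    # naive policy for illustration: move a uniformly random server on a miss
    # (i.e., evict a uniformly random page; this is not the marking algorithm)
    policy = lambda servers, req: random.randrange(len(servers))
    for _ in range(1000):
        inst.serve(random.choice(pages), policy)
    print("total movement cost (= cache misses):", inst.cost)
```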