Locally Regularized Neural Differential Equations: Some Black Boxes Were Meant to Remain Closed!

Avik Pal¹ ²   Alan Edelman¹ ³   Christopher Rackauckas¹

¹CSAIL MIT   ²Department of Electrical Engineering and Computer Science MIT   ³Department of Mathematics MIT. Correspondence to: Avik Pal.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

Implicit layer deep learning techniques, like Neural Differential Equations, have become an important modeling framework due to their ability to adapt to new problems automatically. Training a neural differential equation is effectively a search over a space of plausible dynamical systems. However, controlling the computational cost for these models is difficult since it relies on the number of steps the adaptive solver takes. Most prior works either use higher-order methods that reduce prediction timings while greatly increasing training time, or reduce both training and prediction timings by relying on specific training algorithms that are harder to use as a drop-in replacement due to strict requirements on automatic differentiation. In this manuscript, we use internal cost heuristics of adaptive differential equation solvers at stochastic time points to guide the training towards learning a dynamical system that is easier to integrate. We "close the blackbox" and allow the use of our method with any adjoint technique for gradient calculations of the differential equation solution. We perform experimental studies comparing our method to global regularization and show that we attain similar performance numbers without compromising the flexibility of implementation on ordinary differential equations (ODEs) and stochastic differential equations (SDEs). We develop two sampling strategies to trade off between performance and training time. Our method reduces the number of function evaluations to 0.556×–0.733× and accelerates predictions by 1.3×–2×.

1. Introduction

Implicit models, such as Neural Ordinary Differential Equations (Chen et al., 2018) and Deep Equilibrium Models (Bai et al., 2019; Pal et al., 2022), have emerged as a promising technique to determine the depth of neural networks automatically. To maximize performance on a dataset, explicit models are tuned to the "hardest" training sample, which hurts the inference timings for the "easier" and more abundant samples. Using adaptive differential equation solvers allows these implicit models to choose the number of steps they need effectively. This idea of representing neural networks as ODEs has since been generalized to Stochastic Differential Equations (Liu et al., 2019; Rackauckas et al., 2020) and other architectures to improve robustness.

Despite the rapid progress in these methods, the core problem of the scalability of these models persists. Several solutions have been proposed:

• Kelly et al. (2020); Finlay et al. (2020) use higher-order derivatives for regularization.

• Poli et al. (2020) learn neural solvers to solve Neural ODEs faster.

• Pal et al. (2021) proposed a "zero-cost" global regularization scheme.

• Behl et al. (2020) randomize the integration stop time to "smoothen" the dynamics.

All these methods have definite tradeoffs (see Section 5 for more details). This paper presents a generally applicable method that forces the neural differential equation training process to choose the least expensive option. We build upon the global regularization scheme proposed in Pal et al. (2021) and "close" the blackbox, allowing our method to work across various sensitivity algorithms. Our main contributions include the following¹:

¹ Our code is publicly available at https://github.com/avik-pal/LocalRegNeuralDE.jl
1. We show that our local regularization method, building upon the primitives proposed in Pal et al. (2021), performs at par with global regularization (see Section 4).

2. We present two sampling methods that trade off small computational costs for consistently better performance (see Section 3).

3. Using local regularization allows our models to leverage optimize-then-discretize in the backward pass (in addition to discretize-then-optimize methods). Our method works around several engineering limitations of automatic differentiation (AD) systems (Rackauckas, 2022) that are needed to make global regularization work.

4. We empirically show that regularizing solver heuristics with biased sampling stabilizes the training of larger neural ODEs (see Section 4.3).

Sensitivity Algorithm | Memory Requirement | Memory Requirement with Local Regularization
Backsolve Adjoint (Chen et al., 2018) | O(s) | O(s × (1 + stages))
Backsolve Adjoint with Checkpointing (Chen et al., 2018) | O(s × c) | O(s × (c + stages))
Interpolating Adjoint (Hindmarsh et al., 2005) | O(s × t) | O(s × (t + stages))
Interpolating Adjoint with Checkpointing (Hindmarsh et al., 2005) | O(s × c) | O(s × (c + stages))
Quadrature Adjoint (Kim et al., 2021) | O((s + p) × t) | O((s + p) × t + s × stages)
Direct Reverse Mode Differentiation | O(s × t × stages) | O(s × (t + 1) × stages)

Table 1: Memory Requirements for various Sensitivity Algorithms for ODEs with Local Regularization: Local Regularization requires O(s × (c + stages)) memory in the best case, while Global Regularization (Pal et al., 2021) always requires O(s × t × stages) memory.

Figure 1: Locally Regularized NDE leads to faster predictions and faster training compared to vanilla NDE. (Bar charts of NFE (lower is better), prediction timing (higher is better), and training time (higher is better), shown as values relative to the vanilla NDE for the Local Unbiased and Local Biased variants on MNIST NODE, MNIST NSDE, Physionet, and CIFAR10.)

2. Background

2.1. Neural Ordinary Differential Equations

Neural ODEs use an explicit neural network to parameterize the dynamical system. These involve finding the state at a later time t_1, given the value z_0 at time t_0. The final state

    z(t_1) = z_0 + \int_{t_0}^{t_1} f_\theta(z(t), t) \, dt

generally cannot be computed analytically and requires numerical solvers. Prior works had observed the similarity between fixed time-step discretizations of ODEs and Residual Neural Networks (Lu et al., 2018), which was later extended into the Neural ODE framework by Chen et al. (2018):

    \frac{dz(t)}{dt} = f_\theta(z(t), t)

Using adaptive time stepping allows the model to operate at a variable continuous depth depending on the inputs. Removal of the fixed depth constraint of Residual Networks provides a more expressive framework and offers several advantages in problems like density estimation (Grathwohl et al., 2018), irregularly spaced time series problems (Rubanova et al., 2019), etc.
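To make this forward pass concrete, the sketch below integrates a small fθ with an adaptive solver from OrdinaryDiffEq.jl (part of the DifferentialEquations.jl stack used later in this paper). The two-layer network, the tolerances, and the unit time span are illustrative choices and not the architectures used in the experiments.

```julia
# Minimal sketch of a Neural ODE forward pass with an adaptive solver.
# The weights are random placeholders; in practice fθ is a trained network.
using OrdinaryDiffEq

W1, b1 = 0.01 .* randn(50, 10), zeros(50)
W2, b2 = 0.01 .* randn(10, 50), zeros(10)
fθ(z, p, t) = W2 * tanh.(W1 * z .+ b1) .+ b2   # dz/dt = fθ(z, t)

z0 = randn(10)                                  # input features as the initial state
prob = ODEProblem(fθ, z0, (0.0, 1.0))
sol = solve(prob, Tsit5(); abstol = 1e-6, reltol = 1e-3)

z1 = sol.u[end]              # z(t1), passed on to downstream layers / a classifier
nsteps = length(sol.t) - 1   # "depth" chosen by the solver for this particular input
```

The number of accepted steps, and hence the number of function evaluations, is chosen by the solver separately for every input; this per-input cost is exactly what the regularization studied in this paper targets.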
2.2. Neural Stochastic Differential Equations

Stochastic Differential Equations (SDEs) couple the effect of noise to a deterministic system of equations. SDE noise can have different variations; however, we exclusively focus on diagonal multiplicative noise for this paper. Liu et al. (2019) modified Neural ODEs by stochastic noise injection in the form of Neural SDEs, and empirically showed that stochastic noise injection improves the robustness and generalization performance of neural ODEs. In Neural SDEs, we simultaneously train two explicit neural networks, a drift fθ(z(t), t) and a diffusion gϕ(z(t), t), such that:

    dz(t) = f_\theta(z(t), t) \, dt + g_\phi(z(t), t) \, dW

2.3. Adaptive Time Stepping for Differential Equations

Runge-Kutta (RK) methods (Runge, 1895; Kutta, 1901) are widely used to approximate the solutions of ordinary differential equations numerically. Given a tableau of coefficients {A, c, b}, these methods combine s stages to obtain the estimate at t + dt:

    k_s = f\left( t + c_s \cdot dt, \; z(t) + \sum_{i=1}^{s-1} a_{si} \cdot k_i \right)

    z(t + dt) = z(t) + dt \cdot \sum_{i=1}^{s} b_i \cdot k_i

Adaptive solvers need to maximize the step size (dt) while keeping the error estimate below the user-specified tolerances, i.e., they need to satisfy

    E_{Est} \leq atol + \max\left( |z(t)|, |z(t + dt)| \right) \cdot rtol

where E_{Est} is the local error estimate. Adaptive solvers utilize an additional linear combiner b̃_i to get an alternate solution, typically with one order less convergence (Wanner & Hairer, 1996; Fehlberg, 1968; Dormand & Prince, 1980; Tsitouras, 2011):

    \tilde{z}(t + dt) = z(t) + dt \cdot \sum_{i=1}^{s} \tilde{b}_i \cdot k_i

A classic result from Richardson extrapolation shows that E_{Est} = \| \tilde{z}(t + dt) - z(t + dt) \| is an estimate of the local truncation error (Hairer et al., 1993; Ascher & Petzold, 1998). The new step size is determined using the error ratio

    q = \frac{E_{Est}}{atol + \max\left( |z(t)|, |z(t + dt)| \right) \cdot rtol}

• If q < 1, dt is accepted.

• Otherwise, it is rejected and reduced.

A standard PI controller proposes the new step size to be

    dt_{new} = \eta \cdot q_{n-1}^{\alpha} \cdot q_n^{\beta} \cdot dt

where η is the safety factor, q_{n-1} is the error proportion of the previous step, and (α, β) are tunable PI-gain hyperparameters (Wanner & Hairer, 1996). We defer the discussion of error estimation schemes for stochastic RK integrators to Rackauckas & Nie (2017; 2020).

2.4. Global Regularization using Local Error Estimates

In Section 2.3, we observe how larger local error estimates E_{Est} contribute to reduced step sizes and, in turn, higher training and prediction times for Neural Differential Equations. Pal et al. (2021) proposed a regularization scheme to minimize the total local error in order to learn Neural ODEs with as large step sizes as possible. They compute the regularization term (R_E)_g as

    (R_E)_g = \sum_{j} (E_{Est})_j \cdot |dt_j|

where the summation is over the j time steps of the solution. All the local error estimates E_{Est} are precomputed by adaptive differential equation solvers, which makes the forward pass effectively have zero additional overhead. Continuous adjoint methods (Chen et al., 2018) define the derivatives in terms of the ODE quantities. The terms (E_{Est})_j = \sum_{i=1}^{s} (b_i - \tilde{b}_i) \cdot k_i cannot be constructed directly from the trajectory of the ODE solution, since the k_i terms are defined by the solver method (and not by the continuous ODE solution). As a result, Pal et al. (2021) perform discrete sensitivity analysis to compute the derivatives w.r.t. the regularization terms.

Discrete sensitivity analysis is more stable than continuous adjoints (Zhang & Sandu, 2014) and stabilizes the training process of Neural ODEs (Onken & Ruthotto, 2020). However, these methods are more memory intensive than continuous adjoints. Another shortcoming is the advanced tooling needed to differentiate through differential equation solvers (Rackauckas, 2022), making this method hard to adopt. Since adaptive ODE solvers change the number of steps based on the values in the differential equation, discrete sensitivity analysis requires dynamic AD tooling, which is generally less optimized than tooling built for static graph cases. We note that there are checkpointing methods to reduce the memory overhead (Dauvergne & Hascoët, 2006), though these have generally focused on static graph AD systems.
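The self-contained toy code below mirrors the quantities defined above: one embedded RK step (using the Bogacki-Shampine 3(2) tableau purely for illustration), an accept/reject loop with a simple I-type step-size controller standing in for the PI controller, and the global term (R_E)_g accumulated from the recorded per-step error estimates. It is a sketch of the idea only, not the Tsit5/SOSRI2-based implementation used in the experiments.

```julia
# One embedded RK step: higher-order solution plus a tolerance-scaled error
# estimate (accept the step iff the estimate is below 1).
function embedded_rk_step(f, z, t, dt; atol = 1e-6, rtol = 1e-3)
    k1 = f(z, t)
    k2 = f(z .+ dt .* 0.5 .* k1, t + 0.5 * dt)
    k3 = f(z .+ dt .* 0.75 .* k2, t + 0.75 * dt)
    znew = z .+ dt .* (2/9 .* k1 .+ 1/3 .* k2 .+ 4/9 .* k3)                   # weights b
    k4 = f(znew, t + dt)
    zalt = z .+ dt .* (7/24 .* k1 .+ 1/4 .* k2 .+ 1/3 .* k3 .+ 1/8 .* k4)     # weights b̃
    tol = atol .+ max.(abs.(z), abs.(znew)) .* rtol
    EEst = maximum(abs.(znew .- zalt) ./ tol)    # scaled ‖z̃ - z‖; accept iff < 1
    return znew, EEst
end

# Adaptive solve over tspan; records the trajectory and (E_Est, dt) per accepted step.
function solve_adaptive(f, z0, tspan; dt = 1e-2, η = 0.9, γ = 1/3)
    t, t1 = tspan
    z = z0
    ts, zs, dts, Es = [t], [z0], Float64[], Float64[]
    while t < t1
        dt = min(dt, t1 - t)
        znew, EEst = embedded_rk_step(f, z, t, dt)
        if EEst < 1                                # accept the step
            z, t = znew, t + dt
            push!(ts, t); push!(zs, z); push!(dts, dt); push!(Es, EEst)
        end
        dt *= clamp(η * EEst^(-γ), 0.2, 5.0)       # simple I-controller update
    end
    return ts, zs, dts, Es
end

# Global regularization (Pal et al., 2021): sum over all accepted steps. Note that
# the estimates recorded here are tolerance-scaled, whereas the paper uses the raw
# estimate; the structure of the sum is the same.
global_reg(dts, Es) = sum(Es .* abs.(dts))

ts, zs, dts, Es = solve_adaptive((z, t) -> -z, [1.0], (0.0, 1.0))
RE_global = global_reg(dts, Es)
```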
2.5. Multiscale Neural ODE with Input Injection

Multiscale modeling (Burt & Adelson, 1987) has been the central theme for several deep computer vision applications (Farabet et al., 2012; Yu & Koltun, 2015; Chen et al., 2016; 2017). Standard Neural ODEs set the initial condition as the input x from the previous layer. Multiscale Neural ODE with Input Injection is based on the Multiscale Deep Equilibrium Models (DEQs) (Bai et al., 2019; 2020; Pal et al., 2022). If our model operates at n scales, the initial condition {(z_0)_i}_{i \in [n-1]} is defined as (z_0)_0 = x and (z_0)_{i \in \{1, \ldots, n-1\}} = 0. Every feature set is upsampled or downsampled and fed into the other scales using a neural network. In line with Multiscale DEQs, we also repeatedly inject the original input x at every step of the dynamical system.

Algorithm 1 Unbiased Sampling: Training Phase
  Data: x, fθ, tspan
  Result: sol, r
  Define du/dt = fθ(u, t)
  t_0, t_1 ← tspan
  t_reg ∼ U[t_0, t_1]
  sol ← solve(du/dt, DE Solver, tspan)
  u_{t_reg} ← sol(t_reg)
  Run single step for the solver with time-span (t_reg, t_1)
  r ← Local Error Estimate @ t = t_reg

Algorithm 2 Biased Sampling: Training Phase
  Data: x, fθ, tspan
  Result: sol, r
  Define du/dt = fθ(u, t)
  t_0, t_1 ← tspan
  sol ← solve(du/dt, DE Solver, tspan)
  t_reg ∼ U(sol.t)
  u_{t_reg} ← sol(t_reg)
  Run single step for the solver with time-span (t_reg, t_1)
  r ← Local Error Estimate @ t = t_reg
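A rough Julia rendering of Algorithms 1 and 2 is given below. It reuses the toy `embedded_rk_step` and `solve_adaptive` helpers from the sketch in Section 2.4, so the interpolation and the error estimate stand in for the corresponding solver internals; it is not the LocalRegNeuralDE.jl implementation.

```julia
# Unbiased: t_reg ~ U[t0, t1]. Biased: t_reg drawn from the solver's own step times.
function local_reg(f, z0, tspan; biased = false, dt0 = 1e-2)
    ts, zs, _, _ = solve_adaptive(f, z0, tspan)
    t0, t1 = tspan
    treg = biased ? rand(ts[1:end-1]) : t0 + rand() * (t1 - t0)
    # State at t_reg via linear interpolation between the two neighbouring steps
    # (a real solver would use its own dense interpolation here).
    i = clamp(searchsortedlast(ts, treg), 1, length(ts) - 1)
    w = (treg - ts[i]) / (ts[i+1] - ts[i])
    u_treg = (1 - w) .* zs[i] .+ w .* zs[i+1]
    # Single solver step from (u_treg, t_reg): its error estimate gives the
    # regularization term r = (E_Est)_treg * |dt_treg|.
    dt = min(dt0, t1 - treg)
    _, EEst = embedded_rk_step(f, u_treg, treg, dt)
    return EEst * abs(dt)
end

r_unbiased = local_reg((z, t) -> -z, [1.0], (0.0, 1.0))
r_biased   = local_reg((z, t) -> -z, [1.0], (0.0, 1.0); biased = true)
```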
3. Methods

In Section 2.4, we discussed the downsides of using global regularization with local error estimates. To summarize:

1. Global regularization relies on discrete sensitivity analysis, which is more memory intensive.

2. Global regularization depends on AD tooling that supports dynamic compute graphs in an efficient way, making it hard to incorporate into existing codebases.

To get around these limitations, we developed a new technique using local sampling of error estimates at specific time points, rather than globally over the full interval. We deal with sampling the "appropriate" time point for regularization by two strategies:

• Algorithm 1 Unbiased Sampling: We sample the time point uniformly at random in the integration time span. Intuitively, since we will perform the training for "a large number of steps," the learned dynamical system would end up being faster to solve "everywhere" over the time span.

• Algorithm 2 Biased Sampling: Adaptive time-stepping differential equation solvers naturally take more steps around the areas that are harder to integrate. We can bias the regularization to operate around the parts of the dynamical system which are "harder" by sampling a time point from the solution time points.

3.1. Unbiased Sampling

When training a Neural ODE, the integration timespan is fixed. Training any deep learning model involves several thousand steps. We compute the total local error estimate over the entire timespan when performing global regularization. For unbiased sampling, we hypothesize that if we regularize at random uniformly sampled time points in the timespan, the learned dynamical system will demonstrate similar properties in terms of NFE compared to global regularization. Our new regularization term (R_E)_{unbiased} becomes

    (R_E)_{unbiased} = (E_{Est})_{t_{reg}} \cdot |dt_{t_{reg}}|, \qquad t_{reg} \sim U[t_{span}]

3.2. Biased Sampling

Consider a simple scenario where the learned dynamics of the DE is harder to solve in [0.25, 0.35], and we are solving the DE from t_0 = 0 to t_1 = 1. Our primary aim is to modify the learned system s.t. it becomes simpler to solve in [0.25, 0.35]. If we use unbiased sampling, the probability that we regularize at t_reg ∈ [0.25, 0.35] is 0.1 (which is low). The problem gets even more severe if the range is narrowed. An extreme version of this problem is observed for stiff systems like Robertson's equations (see Figure 2), where 75% of the time is spent in solving 5% of the problem. We note that these extreme scenarios rarely occur for traditional deep learning tasks, since Pal et al. (2021) observed minor speedups using stiffness regularization. However, the problem that some parts of the dynamical system are harder to integrate persists, and designing a regularization scheme targeting those parts is highly desirable.

Figure 2: Robertson Stiff ODE System: Solving stiff systems like Robertson (Robertson, 1966) (using Rodas5 (Piché, 1995)) involves spending around 75% of the time in t < 5000 (i.e. 5% of the timespan). The vertical lines denote the time points at which the ODE system was solved. (Plot of the components u_1(t), u_2(t), u_3(t) over t ∈ [0, 10^5].)

We considered a simple scenario where the learned dynamical system was fixed. However, while training NDEs, this system evolves with training, and predicting the more difficult portions to integrate a priori is not feasible. Adaptive solvers take more frequent steps in the parts of the DE where it is harder to integrate. Anantharaman et al. (2020) leveraged this property of adaptive solvers to learn surrogates for stiff systems. Since these solvers adapt to concentrate around the most numerically difficult time points, we automatically obtain the time points at which we want to regularize the model. Hence, for our biased sampling regularizer, we uniformly sample the regularization time point t_reg from the time points at which the solver solved the differential equation.
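The step clustering that motivates biased sampling is easy to reproduce. The snippet below solves Robertson's equations with Rodas5 through OrdinaryDiffEq.jl and reports the fraction of accepted steps that fall in the first 5% of the time span (the text above quotes roughly 75% of the solve time being spent there); the tolerances are illustrative.

```julia
# Robertson's stiff system from Figure 2, solved with Rodas5: the accepted time
# points cluster heavily at small t, which is what biased sampling exploits.
using OrdinaryDiffEq

function robertson!(du, u, p, t)
    du[1] = -0.04u[1] + 1e4 * u[2] * u[3]
    du[2] =  0.04u[1] - 1e4 * u[2] * u[3] - 3e7 * u[2]^2
    du[3] =  3e7 * u[2]^2
end

prob = ODEProblem(robertson!, [1.0, 0.0, 0.0], (0.0, 1e5))
sol = solve(prob, Rodas5(); abstol = 1e-8, reltol = 1e-8)

# Fraction of accepted solver steps taken in the first 5% of the time span.
frac_early = count(t -> t < 5_000, sol.t) / length(sol.t)
```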
4. Experiments

In this section, we compare the effectiveness of unbiased and biased local regularization on the training and prediction timings of NDEs. We choose image classification and time series prediction problems in line with prior works on accelerating NDEs. We consider the following baselines:

1. Vanilla Neural ODE with Continuous Interpolating Adjoint.

2. Vanilla Neural SDE with discrete sensitivities.

3. Global Regularization of Neural Differential Equations using discrete sensitivity analysis (Pal et al., 2021).

4. TayNODE (Kelly et al., 2020) and STEER (Behl et al., 2020) for models reported in Pal et al. (2021).

We use the DifferentialEquations.jl (Rackauckas et al., 2019) and Lux.jl (Pal, 2022) software stack written in the Julia Programming Language (Bezanson et al., 2017) for all our experiments. Some details about the data presented in the tables:

• All experimental results in the tables marked with ¶ were taken directly from Pal et al. (2021).

• We have tried to match the hardware details presented in the paper and the corresponding GitHub repository for Pal et al. (2021), but we note that differences in wall clock timings can be partially attributed to hardware.

• TayNODE (Kelly et al., 2020) uses a different ODE integrator. Hence the NFEs are not directly comparable.

4.1. MNIST Image Classification

We train a neural differential equation classifier to map flattened MNIST (LeCun et al., 1998) images to their corresponding labels.

4.1.1. Neural Ordinary Differential Equation

Training Details: We use the same model architecture as described in Kelly et al. (2020). Our model comprises a single-hidden-layer explicit model fθ modeling the ODE dynamics, followed by a linear classifier gϕ. The hidden layer is 100-dimensional. We train with a batch size of 512 for a total of 7500 steps. We use Adam (Kingma & Ba, 2014) with a constant learning rate of 0.001. For error estimate regularization, we exponentially decrease the regularization coefficient from 2.5 to 1.0. We use Tsit5 (Tsitouras, 2011) as the ODE integrator with an absolute and relative tolerance of 10⁻⁸.²

² We note that this is not a realistic tolerance at which image classification models are trained. We use this tolerance to allow a direct comparison to prior works.

Baselines: We consider a Vanilla Neural ODE trained with the exact aforementioned specifications. All other baselines are taken directly from Pal et al. (2021).

Results: We summarize the results in Table 2 and Figure 3. Using local regularization speeds up prediction in all cases, while it leads to a minor slowdown during training for biased sampling.

Figure 3: MNIST Classification using Neural ODE. (Training accuracy (%) and number of function evaluations vs. training iterations for the Biased, Unbiased, and None regularization variants.)
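As a sketch of how the schedule described in the training details above could look, the snippet below combines a task loss with the local regularization term r under a coefficient decayed exponentially from 2.5 to 1.0 over 7500 steps. The loss function and the exact decay form are illustrative assumptions, not the authors' implementation.

```julia
# Exponential decay of the regularization coefficient from λ0 = 2.5 to λ1 = 1.0
# over nsteps training steps, and the combined training objective.
reg_coefficient(step; λ0 = 2.5, λ1 = 1.0, nsteps = 7500) =
    λ0 * (λ1 / λ0)^(min(step, nsteps) / nsteps)

# Stand-in task loss: softmax cross-entropy from logits ŷ and one-hot labels y.
logitcrossentropy(ŷ, y) =
    sum(y .* (log.(sum(exp.(ŷ); dims = 1)) .- ŷ)) / size(y, 2)

# r is the local regularization term from Section 3 (a scalar per batch).
total_loss(ŷ, y, r, step) = logitcrossentropy(ŷ, y) + reg_coefficient(step) * r
```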
Method | Train Accuracy (%) | Test Accuracy (%) | Training Time (hr) | Prediction Time (s / batch) | Testing NFE
Vanilla NODE | 99.898 ± 0.066 | 97.612 ± 0.163 | 0.54 ± 0.001 | 0.088 ± 0.020 | 303.559 ± 3.194
STEER ¶ | 100.00 ± 0.000 | 97.94 ± 0.03 | 1.31 ± 0.07 | 0.092 ± 0.002 | 265.0 ± 3.46
TayNODE ¶ | 98.98 ± 0.06 | 97.89 ± 0.00 | 1.19 ± 0.07 | 0.079 ± 0.007 | 80.3 ± 0.43
ERNODE ¶ | 99.71 ± 0.28 | 97.32 ± 0.06 | 0.82 ± 0.02 | 0.060 ± 0.001 | 177.0 ± 0.00
SRNODE ¶ | 100.00 ± 0.000 | 98.08 ± 0.15 | 1.24 ± 0.06 | 0.094 ± 0.003 | 259.0 ± 3.46
Local Unbiased ERNODE | 99.447 ± 0.039 | 97.526 ± 0.131 | 0.49 ± 0.002 | 0.046 ± 0.002 | 187.961 ± 1.812
Local Biased ERNODE | 99.477 ± 0.051 | 97.488 ± 0.016 | 1.12 ± 0.065 | 0.044 ± 0.002 | 182.849 ± 1.578
Vanilla NSDE | 98.27 ± 0.11 | 96.66 ± 0.16 | 2.70 ± 0.00 | 0.51 ± 0.07 | 313.86 ± 2.94
ERNSDE ¶ | 98.16 ± 0.11 | 96.27 ± 0.35 | 4.19 ± 0.04 | 7.23 ± 0.14 | 184.67 ± 2.31
SRNSDE ¶ | 98.79 ± 0.12 | 96.80 ± 0.07 | 8.54 ± 0.37 | 14.50 ± 0.40 | 382.00 ± 4.00
Local Unbiased ERNSDE | 98.05 ± 0.09 | 96.57 ± 0.13 | 2.10 ± 0.01 | 0.39 ± 0.10 | 228.93 ± 1.77
Local Biased ERNSDE | 98.02 ± 0.07 | 96.44 ± 0.16 | 1.90 ± 0.00 | 0.36 ± 0.03 | 230.10 ± 0.71

Table 2: MNIST Image Classification using Neural DE: Using local unbiased regularization on the neural ODE speeds up training by 1.1× and predictions by 1.9× while reducing the total NFEs to 0.619×. Local biased regularization tends to slow down training for smaller models on GPU while it further reduces the NFEs to 0.602×. For Neural SDE, we observe a similar reduction of NFEs by 0.729×–0.733× and a training time improvement of 1.28×–1.42×. The best global regularization method gets lower NFEs but overall takes more wall-clock time than the best performing local regularization method.

4.1.2. Neural Stochastic Differential Equation

Training Details: We downsample the flattened images to a 32-dimensional vector before feeding it into the Neural SDE, which uses a diffusion model (fθ) having a 64-dimensional hidden layer and a linear drift model (gϕ). Finally, a linear classifier (hγ) predicts the label. We train our models on CPU with a batch size of 512 for a total of 4000 steps. We optimize the weights using Adam (Kingma & Ba, 2014) with a constant learning rate of 0.01. We use the SOSRI2 SDE solver (Rackauckas & Nie, 2017) with a tolerance of 0.14. We fix our regularization coefficient to be 10³. For this experiment, we rely on discrete sensitivity analysis.

Baselines: ERNSDE and SRNSDE results were taken from Pal et al. (2021). These were trained for 40 epochs, nearly equivalent to training for 4000 iterations.

Results: We summarize the results in Table 2 and Figure 4. Local regularization improves training and prediction performance while keeping the test accuracy nearly constant.
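For completeness, a minimal sketch of the drift/diffusion pairing behind such a Neural SDE is shown below, using StochasticDiffEq.jl with diagonal multiplicative noise. The fixed random weights, the layer sizes, the noise scale, and the choice of SOSRI here are illustrative assumptions and do not reproduce the exact architecture or training setup described above.

```julia
# Toy Neural SDE: an MLP "network" part and a multiplicative diagonal noise part,
# integrated by an adaptive SDE solver.
using StochasticDiffEq

W1, b1 = 0.01 .* randn(64, 32), zeros(64)
W2, b2 = 0.01 .* randn(32, 64), zeros(32)

drift(z, p, t)     = W2 * tanh.(W1 * z .+ b1) .+ b2   # deterministic part
diffusion(z, p, t) = 0.1 .* z                         # diagonal, multiplicative noise

z0 = rand(32)                          # e.g. a downsampled input vector
prob = SDEProblem(drift, diffusion, z0, (0.0, 1.0))
sol = solve(prob, SOSRI(); abstol = 1.4e-1, reltol = 1.4e-1)
z1 = sol.u[end]                        # final state, fed to a linear classifier
```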
Figure 4: MNIST Classification using Neural SDE. (Training accuracy (%) and number of function evaluations vs. training iterations for the Biased, Unbiased, and None regularization variants.)

4.2. Physionet Time Series Interpolation

Training Details: We use the experimental setup for the Physionet 2012 Challenge Dataset (Citi & Barbieri, 2012) from Kelly et al. (2020). We use a Latent Neural ODE (Rubanova et al., 2019) to perform time series interpolation on the dataset. We use the preprocessed dataset from Kelly et al. (2020) to ensure a fair comparison, and independent runs are performed using an 80:20 split of the dataset. For specific model architecture details, we refer the readers to Pal et al. (2021). We train the model for a total of 3000 iterations using Adamax (Kingma & Ba, 2014) with a learning rate of 0.01 with 10⁻⁵ inverse decay per step. We use a batch size of 512. We diverge from Pal et al. (2021) in using the regularization term (E_{Est})_{t_{reg}} \cdot |dt_{t_{reg}}| instead of the squared regularization term \sum_j (E_{Est})_j^2. Additionally, we decay the regularization coefficient exponentially from 100 to 10 over the 3000 training iterations.

Baselines: Vanilla NODE was trained with the exact aforementioned configuration. All the other baselines were trained using discrete sensitivity analysis, and the exact details are present in Pal et al. (2021).

Results: We summarize the results in Figure 5 and Table 3.

Figure 5: Physionet Time Series Interpolation using Latent ODE. (Test loss and number of function evaluations vs. training iterations for the Biased, Unbiased, and None regularization variants.)

Method | Test Loss (×10⁻³) | Training Time (hr) | Prediction Time (s / batch) | Testing NFE
Vanilla NODE | 3.41 ± 0.10 | 2.48 ± 0.22 | 0.16 ± 0.01 | 758.0 ± 25.87
STEER ¶ | 3.48 ± 0.01 | 1.62 ± 0.26 | 0.54 ± 0.06 | 699.0 ± 141.1
TayNODE ¶ | 4.21 ± 0.01 | 12.3 ± 0.32 | 0.22 ± 0.02 | 167.3 ± 11.93
ERNODE ¶ | 3.57 ± 0.00 | 0.94 ± 0.13 | 0.21 ± 0.02 | 287.0 ± 17.32
SRNODE ¶ | 3.58 ± 0.05 | 0.87 ± 0.09 | 0.20 ± 0.01 | 273.0 ± 0.000
Local Unbiased ERNODE | 3.64 ± 0.07 | 2.31 ± 0.02 | 0.09 ± 0.00 | 422.0 ± 4.580
Local Biased ERNODE | 3.63 ± 0.08 | 2.12 ± 0.24 | 0.10 ± 0.01 | 463.0 ± 63.02

Table 3: Physionet Time Series Interpolation: Local regularization reduces NFEs to 0.556×–0.610×, reducing the prediction timings by 1.6×–1.78×. Our methods additionally improve training timings by 1.073×–1.167×. We note that the difference in training time compared to the (E/S)RNODE methods is due to the change in the sensitivity algorithm.

4.3. CIFAR10 Image Classification

4.3.1. Neural Ordinary Differential Equation

Training Details: We use the CNN architecture for CIFAR10 as described in Poli et al. (2020). We train the models for 31250 steps with Adam (Kingma & Ba, 2014) using a cosine-annealing learning rate scheduler from 0.003 to 0.0001. We train the models with a batch size of 32 and keep the regularization coefficient fixed at 2.5. We use Tsit5 (Tsitouras, 2011) with a tolerance of 10⁻⁴.

Results: We summarize the results in Figure 6 and Table 4.

Figure 6: CIFAR10 Image Classification using Standard Neural ODE. (Training accuracy (%), test accuracy (%), and number of function evaluations vs. training iterations for the Biased, Unbiased, and None regularization variants.)

Configuration | Method | Train Accuracy (%) | Test Accuracy (%) | Training Time (s / batch) | Prediction Time (s / batch) | Testing NFE
Standard | Vanilla | 83.683 ± 1.450 | 67.394 ± 0.849 | 0.457 ± 0.018 | 0.130 ± 0.013 | 115.315 ± 12.136
Standard | Local Unbiased ER | 83.665 ± 0.805 | 67.678 ± 0.874 | 0.399 ± 0.014 | 0.096 ± 0.007 | 89.048 ± 7.335
Standard | Local Biased ER | 83.958 ± 1.032 | 67.745 ± 0.824 | 0.555 ± 0.008 | 0.088 ± 0.003 | 81.301 ± 1.255
Multi-Scale | Vanilla | 92.807 ± 12.458 | 80.048 ± 6.740 | 0.572 ± 0.012 | 0.170 ± 0.005 | 27.616 ± 0.905
Multi-Scale | Local Unbiased ER | 94.159 ± 9.694 | 80.432 ± 5.548 | 0.641 ± 0.025 | 0.175 ± 0.019 | 27.760 ± 0.177
Multi-Scale | Local Biased ER | 99.987 ± 0.023 | 83.460 ± 0.727 | 0.774 ± 0.293 | 0.163 ± 0.015 | 26.334 ± 0.992

Table 4: CIFAR10 Image Classification using Neural DE: For the standard Neural ODE, local regularization reduces the NFE to 0.705×–0.772×, thereby improving prediction timings by 1.35×–1.477×. However, unregularized model training takes 0.823× the time of the biased model. For multi-scale models, the NFE and prediction time improvements are marginal and come at the cost of higher training time.

Figure 7: CIFAR10 Image Classification using Multi-Scale Neural ODE. (Training accuracy (%), test accuracy (%), and number of function evaluations vs. training iterations for the Biased, Unbiased, and None regularization variants.)
4.3.2. Multiscale Neural ODE

Training Details: We modify the Tiny Multiscale DEQ architecture for CIFAR10 from Bai et al. (2020) into a Multiscale Neural ODE with Input Injection (see Section 2.5). To stabilize the training for larger models, we exponentially increase the regularization coefficient from 0.1 to 5.0. We train with a batch size of 128 using VCAB3 (Wanner & Hairer, 1996) with a tolerance of 0.05.

Results: We summarize the results in Figure 7 and Table 4. The benefits from regularization for NFEs and prediction timings seem marginal. However, regularization using biased sampling makes the training dynamics stable, as observed in Figure 7.

5. Related Works

Neural Differential Equations (and by extension most Implicit Neural Networks) are mostly constrained by expensive training and prediction timings (Dupont et al., 2019; Kelly et al., 2020; Finlay et al., 2020; Behl et al., 2020; Pal et al., 2021). Traditionally, accelerating these models has relied on using higher-order regularization terms to constrain the space of learnable dynamics (Finlay et al., 2020; Kelly et al., 2020). These models speed up predictions, but their benefits are often overshadowed by a massive training slowdown (Pal et al., 2021).

More recently, quite a few first-order schemes have been proposed. Behl et al. (2020) randomized the endpoint of Neural ODEs to incentivize simpler dynamics. However, Pal et al. (2021) did not find significant benefits of using STEER in their experiments. Pal et al. (2021) used internal solver heuristics (local error and stiffness estimates) to control the learned dynamics in a way that decreased both prediction and training time. We discuss their work in detail in Section 2.4. Xia et al. (2021) rewrite Neural ODEs as heavy ball ODEs to accelerate both forward and backward passes. Djeumou et al. (2022) replace ODE solvers in the forward pass with a Taylor-Lagrange expansion and report significantly better training and prediction times.

6. Discussion

In this manuscript, we have shown that we can obtain similar properties to global regularization by regularizing dynamical systems at randomly sampled time points. Additionally, this comes with the benefit of not being forced into a specific sensitivity analysis method. We have taken every experiment in Pal et al. (2021) and empirically showed that our local regularization works at par with global regularization. However, our experiments using the stiffness estimate for local regularization did not yield positive results and were not presented in this manuscript. Thus, we have demonstrated that we can "close the blackbox" and still leverage all the benefits of internal solver heuristics to improve training and predictions of neural differential equations.

6.1. Limitations

We note the following limitations of our work:

• Similar to Pal et al. (2021), if the objective is to learn the actual dynamical system, our method will not yield proper results. Our method is applicable only when the final state is relevant, i.e., in most classical deep learning tasks.
• Regularization introduces a new regularization coefficient hyperparameter which, if not tuned correctly, can lead to unstable dynamics or might negate the scheme's usefulness.

7. Acknowledgement

The authors acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing HPC resources that have contributed to the research results reported within this paper. This material is based upon work supported by the National Science Foundation under grant no. OAC-1835443, grant no. SII-2029670, grant no. ECCS-2029670, grant no. OAC-2103804, and grant no. PHY-2021825. We also gratefully acknowledge the U.S. Agency for International Development through Penn State for grant no. S002283-USAID. The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0001211 and DE-AR0001222. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof. This material was supported by The Research Council of Norway and Equinor ASA through Research Council project "308817 - Digital wells for optimal production and drainage". Research was sponsored by the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

Anantharaman, R., Ma, Y., Gowda, S., Laughman, C., Shah, V., Edelman, A., and Rackauckas, C. Accelerating simulation of stiff nonlinear systems using continuous-time echo state networks. arXiv preprint arXiv:2010.04004, 2020.

Ascher, U. M. and Petzold, L. R. Computer methods for ordinary differential equations and differential-algebraic equations, volume 61. SIAM, 1998.

Bai, S., Kolter, J. Z., and Koltun, V. Deep equilibrium models. arXiv preprint arXiv:1909.01377, 2019.

Bai, S., Koltun, V., and Kolter, J. Z. Multiscale deep equilibrium models. arXiv:2006.08656 [cs, stat], November 2020. URL http://arxiv.org/abs/2006.08656.

Behl, H., Ghosh, A., Dupont, E., Torr, P., and Namboodiri, V. Steer: simple temporal regularization for neural odes. pp. 1–13. Neural Information Processing Systems Foundation, Inc., 2020.

Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017. doi: 10.1137/141000671.

Burt, P. J. and Adelson, E. H. The laplacian pyramid as a compact image code. In Readings in Computer Vision, pp. 671–679. Elsevier, 1987.

Chen, L.-C., Yang, Y., Wang, J., Xu, W., and Yuille, A. L. Attention to scale: Scale-aware semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3640–3649, 2016.

Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.

Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583, 2018.

Citi, L. and Barbieri, R. Physionet 2012 challenge: Predicting mortality of ICU patients using a cascaded SVM-GLM paradigm. In 2012 Computing in Cardiology, pp. 257–260. IEEE, 2012.

Dauvergne, B. and Hascoët, L. The data-flow equations of checkpointing in reverse automatic differentiation. In International Conference on Computational Science, pp. 566–573. Springer, 2006.
Djeumou, F., Neary, C., Goubault, E., Putot, S., and Topcu, U. Taylor-Lagrange neural ordinary differential equations: Toward fast training and evaluation of neural ODEs, 2022.

Dormand, J. R. and Prince, P. J. A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 6(1):19–26, 1980.

Dupont, E., Doucet, A., and Teh, Y. W. Augmented neural ODEs. Advances in Neural Information Processing Systems, 32, 2019.

Farabet, C., Couprie, C., Najman, L., and LeCun, Y. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915–1929, 2012.

Fehlberg, E. Classical fifth-, sixth-, seventh-, and eighth-order Runge-Kutta formulas with stepsize control. National Aeronautics and Space Administration, 1968.

Finlay, C., Jacobsen, J.-H., Nurbekyan, L., and Oberman, A. M. How to train your neural ODE. arXiv preprint arXiv:2002.02798, 2020.

Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.

Hairer, E., Norsett, S., and Wanner, G. Solving Ordinary Differential Equations I: Nonstiff Problems, volume 8. 1993. ISBN 978-3-540-56670-0. doi: 10.1007/978-3-540-78862-1.

Hindmarsh, A. C., Brown, P. N., Grant, K. E., Lee, S. L., Serban, R., Shumaker, D. E., and Woodward, C. S. SUNDIALS: Suite of nonlinear and differential/algebraic equation solvers. ACM Trans. Math. Softw., 31(3):363–396, September 2005. ISSN 0098-3500. doi: 10.1145/1089014.1089020. URL https://doi.org/10.1145/1089014.1089020.

Kelly, J., Bettencourt, J., Johnson, M. J., and Duvenaud, D. Learning differential equations that are easy to solve. arXiv preprint arXiv:2007.04504, 2020.

Kim, S., Ji, W., Deng, S., Ma, Y., and Rackauckas, C. Stiff neural ordinary differential equations. Chaos: An Interdisciplinary Journal of Nonlinear Science, 31(9):093122, 2021.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2014. URL http://arxiv.org/abs/1412.6980. Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015.

Kutta, W. Beitrag zur naherungsweisen integration totaler differentialgleichungen. Z. Math. Phys., 46:435–453, 1901.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Liu, X., Xiao, T., Si, S., Cao, Q., Kumar, S., and Hsieh, C.-J. Neural SDE: Stabilizing neural ODE networks with stochastic noise, 2019.

Lu, Y., Zhong, A., Li, Q., and Dong, B. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In International Conference on Machine Learning, pp. 3276–3285. PMLR, 2018.

Onken, D. and Ruthotto, L. Discretize-optimize vs. optimize-discretize for time-series regression and continuous normalizing flows. arXiv preprint arXiv:2005.13420, 2020.

Pal, A. Lux: Explicit parameterization of deep neural networks in Julia. https://github.com/avik-pal/Lux.jl/, 2022.

Pal, A., Ma, Y., Shah, V., and Rackauckas, C. V. Opening the blackbox: Accelerating neural differential equations by regularizing internal solver heuristics. In International Conference on Machine Learning, pp. 8325–8335. PMLR, 2021.

Pal, A., Edelman, A., and Rackauckas, C. Mixing implicit and explicit deep learning with skip DEQs and infinite time neural ODEs (continuous DEQs). arXiv preprint arXiv:2201.12240, 2022.

Piché, R. An L-stable Rosenbrock method for step-by-step time integration in structural dynamics. Computer Methods in Applied Mechanics and Engineering, 126(3-4):343–354, 1995.
Poli, M., Massaroli, S., Yamashita, A., Asama, H., and Park, J. Hypersolvers: Toward fast continuous-depth models. arXiv preprint arXiv:2007.09601, 2020.

Rackauckas, C. Engineering trade-offs in automatic differentiation: From TensorFlow and PyTorch to JAX and Julia, Jan 2022. URL https://tinyurl.com/2j94bhc9.

Rackauckas, C. and Nie, Q. Adaptive methods for stochastic differential equations via natural embeddings and rejection sampling with memory. Discrete and Continuous Dynamical Systems. Series B, 22(7):2731, 2017.

Rackauckas, C. and Nie, Q. Stability-optimized high order methods and stiffness detection for pathwise stiff stochastic differential equations. In 2020 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–8. IEEE, 2020.

Rackauckas, C., Innes, M., Ma, Y., Bettencourt, J., White, L., and Dixit, V. DiffEqFlux.jl - A Julia library for neural differential equations. CoRR, abs/1902.02376, 2019. URL http://arxiv.org/abs/1902.02376.

Rackauckas, C., Ma, Y., Martensen, J., Warner, C., Zubov, K., Supekar, R., Skinner, D., Ramadhan, A., and Edelman, A. Universal differential equations for scientific machine learning, 2020.

Robertson, H. The solution of a set of reaction rate equations. Numerical Analysis: An Introduction, pp. 178–182, 1966.

Rubanova, Y., Chen, R. T., and Duvenaud, D. Latent ODEs for irregularly-sampled time series. arXiv preprint arXiv:1907.03907, 2019.

Runge, C. Über die numerische auflösung von differentialgleichungen. Mathematische Annalen, 46(2):167–178, 1895.

Tsitouras, C. Runge-Kutta pairs of order 5(4) satisfying only the first column simplifying assumption. Computers & Mathematics with Applications, 62(2):770–775, 2011. ISSN 0898-1221. doi: 10.1016/j.camwa.2011.06.002. URL http://www.sciencedirect.com/science/article/pii/S0898122111004706.

Wanner, G. and Hairer, E. Solving Ordinary Differential Equations II. Springer Berlin Heidelberg, 1996.

Xia, H., Suliafu, V., Ji, H., Nguyen, T., Bertozzi, A., Osher, S., and Wang, B. Heavy ball neural ordinary differential equations. Advances in Neural Information Processing Systems, 34, 2021.

Yu, F. and Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.

Zhang, H. and Sandu, A. FATODE: a library for forward, adjoint, and tangent linear integration of ODEs. SIAM Journal on Scientific Computing, 36(5):C504–C523, 2014.