Compiling C to Safe Rust, Formalized arXiv:2412.15042v1 [cs.PL] 19 Dec 2024 AYMERIC FROMHERZ, Inria, France JONATHAN PROTZENKO, Microsoft Azure Research, USA The popularity of the Rust language continues to explode; yet, many critical codebases remain authored in C, and cannot be realistically rewritten by hand. Automatically translating C to Rust is therefore an appealing course of action. Several works have gone down this path, handling an ever-increasing subset of C through a variety of Rust features, such as unsafe. While the prospect of automation is appealing, producing code that relies on unsafe negates the memory safety guarantees offered by Rust, and therefore the main advantages of porting existing codebases to memory-safe languages. We instead explore a different path, and explore what it would take to translate C to safe Rust; that is, to produce code that is trivially memory safe, because it abides by Rust’s type system without caveats. Our work sports several original contributions: a type-directed translation from (a subset of) C to safe Rust; a novel static analysis based on “split trees” that allows expressing C’s pointer arithmetic using Rust’s slices and splitting operations; an analysis that infers exactly which borrows need to be mutable; and a compilation strategy for C’s struct types that is compatible with Rust’s distinction between non-owned and owned allocations. We apply our methodology to existing formally verified C codebases, namely, the HACL★ cryptographic library, and binary parsers and serializers from EverParse, and show that the subset of C we support is sufficient to translate both applications to safe Rust via a newly implemented compiler. Our evaluation shows that for the few places that do violate Rust’s aliasing discipline, automated, surgical rewrites suffice; and that the few strategic copies we are forced to insert have a negligible performance impact. Of particular note, the application of our approach to HACL★ results in a 80,000 line verified cryptographic library, written in pure Rust, that implements all modern algorithms without a single use of unsafe – the first of its kind. Additional Key Words and Phrases: Rust, Compilation, Semantics 1 Introduction Despite decades of research, memory safety issues remain prevalent in industrial applications; recent studies by Google [Chromium Project 2020] and Microsoft [MSRC Team 2019] estimate that 70% of security vulnerabilities are related to incorrect memory handling. To tackle this issue, both companies and governments now advocate for the use of memory-safe languages for safety-critical systems, notably, Rust [National Security Agency 2022; The Register 2023; The White House 2024]. Combining the high performance and low-level idioms commonly provided by languages like C and C++ with memory safety by design, the Rust programming language continues to enjoy record levels of popularity in industry, being ranked as the “most admired programming language” for the eigth year in a row [StackOverflow 2023; Verdi 2023], leading several widely used projects to plan a transition to Rust [Cook 2022; Mozilla Hacks 2024]. However, while Rust offers clear benefits for new, clean-slate code, the value proposition is less clear for existing code. Indeed, it can be hard, especially in an industrial setting, to justify rewriting code that has been battle-tested, thoroughly debugged, and exhibits no glaring issues. In those situations, Rust is perceived as a “nice to have” and particularly useful to develop new features, but certainly not a business priority when there are so many other engineering battles to be fought [Vander Stoep and Rebert 2024]. This problem is exacerbated when it comes to industrial-grade code that also happens to be formally verified. Owing to its very nature, formal verifiation is an extremely labor-intensive activity, and it is not uncommon for projects to accumulate dozens of person-years for the sole verification effort [Kästner et al. 2018; Klein et al. 2009; Protzenko et al. 2020]. Whether with VST [Appel 2011], Authors’ Contact Information: Aymeric Fromherz, aymeric.fromherz@inria.fr, Inria, France; Jonathan Protzenko, protz@ microsoft.com, Microsoft Azure Research, USA. 2 Aymeric Fromherz and Jonathan Protzenko Autocorres [Greenaway et al. 2012] or Cogent [Amani et al. 2016]; and whether it’s parsers or serializers [Ramananandro et al. 2019; Reiher et al. 2019; Swamy et al. 2022], cryptography [Appel 2015; Dockins et al. 2016; Erbsen et al. 2019; Liu et al. 2019; Polubelova et al. 2020] or protocols [Arquint et al. 2023; Bhargavan et al. 2016; Delignat-Lavaud et al. 2021; Erbsen et al. 2024], verified code is unlikely to be rewritten and reverified just to support a transition to Rust. However, there is hope: to facilitate adoption in a variety of contexts and codebases, many verification efforts, including all of the examples above, settled on C as their target language for verified code. If one were to be able to translate C to Rust, while formally arguing for the correctness of the translation process, then not only could the migration be automated, but the correctness argument would carry forward, obviating the need for a costly rewrite and reverification effort. Several existing works therefore attempt to automatically translate C to unsafe Rust [Galois and Immunant 2019; Lesinski 2017; Ling et al. 2022; Sharp 2020]. Indeed, C allows programmers to be creative with aliasing, low-level casts, memory management, and data representation. Expressing those patterns in Rust requires opting out of many static guarantees, in order to allow unchecked aliasing, casts between representations (a.k.a. “transmutation”, in Rust lingo), and so on, which is done through the unsafe feature of Rust. But doing so obliterates the benefits of Rust! If the code cannot be shown to be statically safe, because it relies on the unsafe keyword, then there is in our view little benefit to a transition to Rust, and a lot to lose in terms of toolchain, build and integration complexity. We emphasize that avoiding unsafe is seen as a badge of honor in the Rust community: a library littered with unsafe is unlikely to ever be contemplated for serious adoption. But targeting safe Rust raises several challenges. First, the ownership and aliasing discipline of Rust is much more stringent than C’s. As a consequence, many common C patterns, such as taking pointer arguments that are disjoint or equal, cannot be expressed in safe Rust, as they would trip its borrow-checker. Second, mutability is explicit in Rust, whereas everything is implicitly mutable in C. Translating every variable and borrow to mutable in Rust would not only make many programs ill-typed (because mutable pieces of data must have a unique owner in Rust), but also produce unusable APIs, since callers would be contaminated with excessive mutability in their own source code. Third, pointer arithmetic is central in C, and does not exist in safe Rust, where the only abstraction is slices (a.k.a. runtime-checked fat pointers), and the only arithmetic is done through splitting a slice in two sub-parts to prevent unchecked aliasing. Fourth, whereas C uses t* for any pointer type, Rust has several types, such as [T; N], Box<[T]>, &[T], and more. A translation needs to infer, or devise heuristics, to sensibly map a (not informative) t* to those Rust types. In this work, we set out to translate C to safe Rust. To address the challenges above, we adopt a new point in the design space: instead of automatically translating C in its full generality to unsafe Rust and attempting to make the resulting code safer, we rather target a data-oriented, applicative subset of C. Our translation process is therefore semi-active: users may have to perform minimal adjustments to the source C program to fit the supported language subset; once in this subset, our approach then automatically produces valid, safe Rust code. Our contributions are as follows. First, we present a formal translation of a subset of C, dubbed Mini-C, to safe Rust. We express this process modularly, via a type-directed translation, parameterized over an oracle for compiling pointer arithmetic to slice splitting, followed by several post-translation analyses to soundly infer mutability qualifiers, derive traits, and other Rust-specific features. Second, we implement our translation in KaRaMeL [Protzenko et al. 2017], an existing compiler framework that previously compiled Mini-C to actual C. Third, we evaluate our implementation over two large-scale projects: the HACL★ verified cryptographic library (80k of C code) [Polubelova et al. 2020] and EverParse [Ramananandro et al. 2019], a verified formatters and serializers library, and specifically its CBOR [Internet Engineering Task Force 2020] parser (1.4k lines of C). We show that the modifications required of the programmer on the source are Compiling C to Safe Rust, Formalized 3 either minimal (for HACL★) or non-existent (for EverParse). Fourth, we perform an experimental evaluation, and distill two central insights: modifying the C source to abide by Rust’s ownership discipline not only has no noticeable performance impact, but the resulting Rust code also exhibits the same performance profile as the original C, in spite of fat pointers and runtime bounds checks. Artifact Submission. All our developments are under open-source licenses, and will be made public after the review process. To ensure reproducibility of our results, we intend to submit a complete artifact containing both the implementation of our translation in KaRaMeL and the case studies presented in the paper alongside scripts to easily reproduce experimental results. 2 Translating Mini-C to Safe Rust 𝑡 ::= uint8_t, bool, . . . void 𝑡∗ 𝑡 (𝑡®) struct 𝑠 C99 value types: integers, booleans. . . empty type pointer type function type struct type 𝑠𝑡 ::= ® else 𝑠𝑡, ® while (𝑒) 𝑠𝑡 ® return 𝑒, if (𝑒) then 𝑠𝑡 𝑥 =𝑒 𝑒 𝑡 𝑥 = 𝑒, 𝑡 𝑥 𝑡 𝑥 [𝑛] = {® 𝑒 }, 𝑡 𝑥 [𝑛] control-flow assignment expression statement variable declaration (with,without) initialization array declaration (with, without) initialization 𝑒 ::= 𝑥 𝑒 [𝑒] 𝑒.𝑓 𝑓 (® 𝑒) ∗𝑒 malloc(𝑒) &𝑒 variable array indexing field selection function call dereference heap allocation address-taking, a.k.a. reference Fig. 1. Grammar of mini-C types, statements and expressions We present Mini-C (Figure 1), a data-oriented subset of C that is expressive enough to handle a large class of applications. Types comprise value, pointer and struct types. Mini-C follows C and distinguishes between expressions and statements, with support for array allocation and manipulation, along with assignments, dereferences, field selection and common control-flow. For simplicity of presentation, we omit syntactic sugar (such as increment operators, or the arrow operator ->), C variable-length arrays (VLAs), the traditional syntactic separation between lvalue and rvalue, and present a simplified view of C syntax that does not bring in notions of specifiers, declarators, etc. We handle C array types, but do not list them in our syntax, to avoid polluting our rules with concerns related to array decay (and how, e.g., void f(int x[4]) really means void f(int *x)) – arrays get direct treatment in E-Stack. In short, our presentation focuses on where the bulk of our work lies, with the understanding that all of the omitted features above are exercised by our case studies (Section 4) and are thus supported by our implementation. We present in Figure 2 a simplified Rust that focuses on borrows and boxes of slices, eliding a fullygeneral notion of slice type (that exists internally in Rust, and may appear in trait implementations, 4 Aymeric Fromherz and Jonathan Protzenko 𝑇 ::= ℓ ::= 𝑢8, bool, . . . () [𝑇 ; 𝑁 ] &mut? ℓ ? [𝑇 ] Box⟨[𝑇 ]⟩ (𝑇®) → 𝑇 𝑆 base value types unit type array value type borrowed slice owned boxed slice function type struct type ′ 𝑎,′ 𝑏... lifetime 𝐸 ::= 𝑥 let 𝑥 = 𝐸; 𝐸 () [𝐸; 𝑁 ] ® [𝐸] vec![𝐸; 𝑁 ] 𝐸 [..] &mut? 𝐸 ∗𝐸 𝐸.𝑓 𝐸 [𝐸] n =𝐸 o 𝑆 𝑓® : 𝐸® variable let-binding unit value array repeat slice literal vector repeat array to slice borrow dereference field access assignment ® 𝑓 ( 𝐸) ® 𝐸.𝑚( 𝐸) function call method call structure Fig. 2. Grammar of Rust types and expressions. The ? denotes optional syntax. T-Bool T-Int T-Void T-Pointer 𝑡 ↩→ 𝑇 T-Fun 𝑡 ↩→ 𝑇 bool ↩→ bool uint8_t ↩→ u8 void ↩→ () 𝑡∗ ↩→ &[𝑇 ] 𝑡 (𝑡®) ↩→ (𝑇®) → 𝑇 𝑡® ↩→ 𝑇® Fig. 3. Translation from Mini-C to Rust: types but can’t be constructed by the user), and omitting details such as: borrows of arrays, base types like Vec, and a semantics of moves vs copies. Our implementation supports these. Central to our translation is the fact that while C provides a unique pointer abstraction (t*), Rust distinguishes both where the allocation lives (stack or heap), and whether a pointer conveys ownership. Array types [𝑇 ; 𝑁 ] denote arrays of statically-known length 𝑁 , and are value types. Unlike C, these never decay automatically to pointer types. Boxed slices Box⟨[𝑇 ]⟩ denote heapallocated arrays, whose length can be queried at runtime, and are value types as well. As such, arrays and boxes have no lifetime, and can be passed around as regular values, potentially incurring a move to preserve Rust’s ownership semantics. Borrowing from (akin to “taking the address of”, in C) an array or a box yields a slice borrow, which may be mutable or immutable (the latter denoted by the absence of a mut qualifier). Borrows have a lifetime, which is used by the Rust compiler to guarantee the absence of dangling pointers and use-after-frees (i.e., temporal safety). Slice borrows package their lengths at runtime, which the Rust compiler uses to guarantee the absence of out-of-bounds errors (i.e., spatial safety). Oftentimes, the lifetime can be inferred by the Rust compiler, and we will omit it in our presentation whenever possible. 2.1 Type Translation We present in Figure 3 our translation from Mini-C types to Rust types. Our judgments are of the form 𝑡 ↩→ 𝑇 , meaning C type 𝑡 translates to Rust type 𝑇 . Judgments for base types are straightforward: similar types exist in Mini-C and Rust. The key difficulty occurs when handling pointer types. C pointers omit whether they contain a single-element or an array, nor do they include information about the provenance of the underlying allocation (stack, heap, or static scope). Compiling C to Safe Rust, Formalized 5 To provide a common abstraction, we therefore translate C pointers to &[T] (T-Pointer), which designates slice borrows without mutable ownership. This means that every C pointer type become a borrowed slice in Rust, regardless of whether it points to one or more elements. Concretely, uint8_t x; uint8_t *y = &x; generates let y: &[u8] ... in Rust, and not &u8. Naturally, some of these slice types may need to be mutable, i.e., &mut[u8]: we infer mutability qualifiers later on (Section 3.1), meaning the translation does not need to concern itself with that for now. We note that Rust does have a C-like pointer type mut T*, but it requires unsafe Rust code to use, which is exactly what we set out not to do. We do not use those in our translation. 2.2 Type-Directed Translation of Expressions Building on our type translation, we now present our translation from Mini-C programs to Rust. Again, the chief difficulty lies in the multiple representations existing in Rust for pointers and memory-related types. Because our type translation is modular, our translation of expressions must reconcile various Rust types (namely, [𝑇 ; 𝑁 ], Box⟨[𝑇 ]⟩) produced at allocation-time with our uniform pointer abstraction dictated by the type translation (namely, &[𝑇 ]). Our translation is therefore type-directed, which allows inserting coercions on demand. Our rules have the form Γ ⊢ 𝑒 ⊳ 𝑇 ⇝ 𝐸 ⊣ Γ ′ , representing that in Rust typing environment Γ, the translation of Mini-C expression 𝑒 with expected Rust type 𝑇 yields the Rust expression 𝐸 and an updated Rust typing environment Γ ′ . We extend ⇝ to statements and lists of statements. We present our translation rules in Figure 4. To reconcile lists of statements (in C) and expressions everywhere (in Rust), we match on the head of a list of statements in C. For instance, E-Stack ® matches a stack allocation as the first statement, followed by subsequent statements 𝑠𝑡. E-Index Γ ⊢ 𝑒 1 ⊳ &[𝑇 ] ⇝ 𝑒 1′ ⊣ Γ ′ E-Deref Γ ⊢ 𝑒 ⊳ &[𝑇 ] ⇝ 𝑒 ′ ⊣ Γ ′ Γ ′ ⊢ 𝑒 2 ⊳ usize ⇝ 𝑒 2′ ⊣ Γ ′′ Γ ⊢ 𝑒 1 [𝑒 2 ] ⊳ 𝑇 ⇝ 𝑒 1′ [𝑒 2′ ] ⊣ Γ ′′ Γ ⊢ ∗𝑒 ⊳ 𝑇 ⇝ 𝑒 ′ [0] ⊣ Γ ′ E-Stack 𝑡 ↩→ 𝑇 ® ⊳ 𝑇 ′ ⇝ 𝐸′ ⊣ Γ Γ𝑁 , 𝑥 : [𝑇 ; 𝑁 ] ⊢ 𝑠𝑡 E-Var 𝑥 :𝑇 ∈ Γ Γ ⊢ 𝑥 ⊳𝑇 ⇝ 𝑥 ⊣ Γ Γ𝑖 ⊢ 𝑒𝑖 ⊳ 𝑇 ⇝ 𝐸𝑖 ⊣ Γ𝑖+1 ® 𝐸′ ⊣ Γ ® ⊳ 𝑇 ′ ⇝ let 𝑥 : [𝑇 ; 𝑁 ] = [𝐸]; Γ0 ⊢ 𝑡 𝑥 [𝑁 ] = {® 𝑒 }; 𝑠𝑡 E-Heap 𝑡 ↩→ 𝑇 ® ⊳ 𝑇 ′ ⇝ 𝐸 ′ ⊣ Γ′ Γ, 𝑥 : Box⟨[𝑇 ]⟩ ⊢ 𝑠𝑡 ® ⊳ 𝑇 ′ ⇝ let 𝑥 : Box⟨[𝑇 ]⟩ = vec![0; 𝑁 ].into_boxed_slice(); 𝐸 ′ ⊣ Γ ′ Γ ⊢ 𝑡 ∗𝑥 = malloc(𝑁 ∗ sizeof (𝑡)); 𝑠𝑡 E-Box-Slice 𝑥 : Box⟨[𝑇 ]⟩ ∈ Γ E-Array-Slice 𝑥 : [𝑇 ; 𝑁 ] ∈ Γ E-Slice-Box 𝑥 : &[𝑇 ] ∈ Γ Γ ⊢ 𝑥 ⊳ &[𝑇 ] ⇝ &𝑥 ⊣ Γ Γ ⊢ 𝑥 ⊳ &[𝑇 ] ⇝ &𝑥 [..] ⊣ Γ Γ ⊢ 𝑥 ⊳ Box⟨[𝑇 ]⟩ ⇝ (∗𝑥).into() ⊣ Γ ′ E-Array-Box 𝑥 : [𝑇 ; 𝑁 ] ∈ Γ Γ ′ = Γ \ {𝑥 } Γ ⊢ 𝑥 ⊳ Box⟨[𝑇 ]⟩ ⇝ Box :: new(𝑥) ⊣ Γ ′ E-Call 𝑓 : (𝑇®) → 𝑇 ∈ Γ Γ𝑖 ⊢ 𝑒 ⊳ 𝑇𝑖 ⇝ 𝐸𝑖 ⊣ Γ𝑖+1 Γ ′ = Γ \ {𝑥 } E-AddrOf Γ ⊢ 𝑒 ⊳ 𝑇 ⇝ 𝐸 ⊣ Γ′ Γ ⊢ &𝑒 ⊳ &[𝑇 ] ⇝ &[𝐸] ⊣ Γ ′ Γ𝑛 , 𝑟 : 𝑇 ⊢ 𝑟 ⊳ 𝑇 ′ ⇝ 𝑒 ′ ⊣ Γ ′ ® 𝑒 ′ ⊣ Γ′ Γ0 ⊢ 𝑓 (® 𝑒 ) ⊳ 𝑇 ′ ⇝ let 𝑟 = 𝑓 ( 𝐸); Fig. 4. Translation from Mini-C to Rust: selected statements and expressions 6 Aymeric Fromherz and Jonathan Protzenko Translating Array Allocations. C allows allocating arrays either on the stack or on the heap. For stack arrays, we assume the size is known at compile-time (we support, but do not present here, VLAs). Stack-allocated arrays enjoy various bits of array initialization syntax in C, such as t x[N] = {...}. Heap-allocated arrays are created with calls to malloc, which takes a size in bytes. Rule E-Stack thus translates a stack allocation named 𝑥, of 𝑁 elements of type 𝑡, each initialized ® The whole list of statements is expected to have Rust type with 𝑒𝑖 , followed by more statements 𝑠𝑡. 𝑇 ′ . We first translate the type of array elements t to its Rust counterpart 𝑇 . We then rely on the newly translated type T to translate each of the initial array elements 𝑒𝑖 to its Rust equivalent 𝐸𝑖 , ® is translated in generating a series of environments Γ𝑖 , ending with Γ𝑁 . Finally, the continuation 𝑠𝑡 Γ𝑁 extended with a binding for 𝑥 at Rust array type [𝑇 ; 𝑁 ]. The final result is a Rust let-expression; we reiterate that Rust is an expression language, meaning that C statements (e.g., 𝑡 𝑥 [𝑁 ] = {® 𝑒 }), C ® all translate to Rust expressions. expressions (e.g., 𝑒𝑖 ) and C lists of statements (e.g., 𝑠𝑡) Rule E-Heap operates in a similar manner, with one key difference. As the memory is heap allocated, the variable is instead translated to a Boxed slice; in Rust, we do this by relying on the vec! macro, which efficiently creates a new vector, before immediately turning it into a boxed slice, and without copies when applied to a literal of the form [𝐸; 𝑁 ]. Since C heap-allocated arrays are not guaranteed to be resizable in-place, we do not need Vec's dynamic resizing capabilities. These rules present simplified versions of what we support. Our implementation features several peephole optimizations, for instance, a stack allocation followed by a memset or followed by a forloop initialization translates directly to [E; N], Rust’s syntax for an array of 𝑁 elements filled with 𝐸. Similarly, a heap allocation performed via calloc, followed by memset, or a for-loop initializer translates to vec![E; N] in one go. Should none of these peephole optimizations trigger, or should the allocation have no initial value, we simply synthesize a default value at translation-time. We remark, however, that we expect the size of heap allocations to be expressed as a number of elements. This is one area where, as previously mentioned, we envision a semi-active approach to converting C to Rust. Heap allocations that are not expresssed via a sizeof trigger a warning in most modern compilers; they fail to express intent; and obfuscate the code in a way that flies against best practices. We therefore consider it reasonable to expect the programmer to rewrite those, and leverage existing tooling, testsuite and coverage to check that the change preserves the behavior of the code, before successfully engaging in further translation. Pointers and Array Accesses. E-Index compiles array accesses, reflecting that Rust expects a usize integer – we have more ad-hoc conversion rules (omitted) that can insert conversions to usize on the fly, should the Mini-C program use array indices at a narrower integer type. E-AddrOf preserves our invariant that all pointer types, in C, become borrowed slices, in Rust: taking the address of 𝑒 automatically inserts a slice literal with a single element. Finally, dereference, consistently with our translation scheme, uses the borrowed slice indexing operator for &[T]. The dereference operator in Rust on a slice borrow does not return the first element of a slice, as in C. Conversions Between Rust Abstractions. The translation presented so far therefore initializes C pointers as either boxed slices or arrays. However (Section 2.1), our common abstraction to translate C pointer types is &[T], i.e., a slice borrow. This leads to discrepancies during our translation: consider a program which creates a local array 𝑥 and passes it to a function 𝑓 expecting a pointer. In C, this program is valid thanks to array decay: the C compiler will automatically convert the array to a pointer. In our translation to Rust, this however becomes problematic: after translation, the function 𝑓 will expect to receive a slice borrow as argument, while the local variable 𝑥 will have the array type [T; N]; the resulting program will therefore fail to typecheck. This is where our type-directed translation comes in. When a mismatch occurs between the current type of a variable in the environment Γ and its expected type 𝑇 , we insert coercions to turn Compiling C to Safe Rust, Formalized 7 it into an expression of the appropriate type. Consider for instance our earlier example, where a variable has an array type while a slice borrow was expected. This corresponds to E-Array-Slice, which introduces a borrow, therefore turning the array into a slice borrow. Rule E-Box-Slice operates identically, but for boxed slices. Rules E-Slice-Box and E-Array-Box are “reverse conversions”, which convert stack and unknown allocations into heap ones – the need for those will become apparent in Section 2.4. Finally, E-Var models the simple case where a variable does have the expected type. In practice, some of the syntax introduced by the conversion rules is unnecessary, as the Rust compiler is able to add auto-borrows and auto-derefs in many places. Our implementation is aware of this, and features a nano-pass at the very end that removes superfluous &s and ∗s. The coercions introduced by conversion rules can however lead to subtle semantic differences, specifically, those introduced for the “reverse conversions” E-Slice-Box and E-Array-Box. Consider for instance the following (simplified) Rust program, where for reasons that will become apparent in Section 2.4, y must be translated as a Box<[T]>, meaning the array 𝑥 is coerced to a boxed slice. In this program, the assertion does not hold, as the call to Box::new creates a copy of x. In a C program however (shown on the left), both x and y would be the same pointer, therefore the change to y on line 3 would also apply to x. (As a rule of thumb, constructors that take another data structure, or explicit conversions performed via .into() in Rust generate a copy.) 1 2 3 4 uint8_t x[1] = { 0 }; uint8_t *y = x; *y = 1; assert(*x == 1); /* SUCCESS */ 1 2 3 4 let x: [u8; 1] = [0; 1]; let mut y: Box<[u8]> = Box::new(x); y[0] = 1; assert!(x[0] == 1) /* failure */ These copies cannot be avoided, as in our translation scheme, it might truly be the case that what initially started as a stack allocation needs to be promoted to a heap allocation. Furthermore, Rust provides no way of “opting out” of the Copy trait for base types like arrays of integers, meaning that Rust will silently perform a copy of 𝑥 into 𝑦, while allowing further modifications to 𝑥. We therefore leverage our output environment Γ ′ and strip the original variables from Γ when performing the conversion, thus forbidding any further usage of 𝑥. If the original C program further relies on x, our translation will error out, and will ask the programmer to fix their source code. This is another area where we adopt a “semi-active” approach to verification, and declare that some patterns are poor enough, even for C, that they ought to be touched up before the translation takes place. Empirically, as discussed in Section 4, this situation only rarely occurs, limiting the modifications needed in the original code. Function Calls. A subtlety in E-Call is that the result of calling a function may need to undergo a conversion, e.g., via a further application of E-Box-Slice. However, our conversion rules (above) operate on variables, so as to be able to easily look up their type in the environment. We simply let-bind the result of calling a function, which poses no problem, since Rust is an expression language, meaning we can insert let-bindings anywhere. 2.3 Handling Pointer Arithmetic When operating on arrays, C programs rarely perform accesses and updates via a single base pointer. A common programming pattern is to instead divide the array into chunks, or iterate over the array elements by keeping a local pointer to the current head of the iterator. Such operations are frequently performed using pointer arithmetic. Consider the following C example, inspired by an implementation of elliptic-curve cryptography. The array abcd contains a large number (“bignum”), spread across four 64-bit (8 byte) limbs stored contiguously in memory. For field addition, a first order of business is to get pointer to individual limbs, before performing pointwise addition. In other words, we want to perform pointer arithmetic 8 Aymeric Fromherz and Jonathan Protzenko to access chunks a, b, c, d, spanning intervals [0; 8), [8; 16), [16; 24) and [24; 32) of the base pointer. Because C does not carry length information for pointers, neither at run-time nor in the type system, we do not know that each pointer intends to span 8 bytes. Furthermore, we cannot assume either that pointer arithmetic occurs in left-to-right order; our example, below, illustrates this. 1 uint8_t abcd[32] = { 0 }; 3 4 5 6 7 1 2 2 uint8_t *a = abcd + 0; uint8_t *c = abcd + 16; uint8_t *b = abcd + 8; uint8_t *d = abcd + 24; 3 4 5 6 7 let mut abcd = [0u8; 32]; let abcd: &mut [u8] = &mut abcd[..]; let (a_l, a_r) = abcd.split_at_mut(0); let (c_l, c_r) = a_r.split_at_mut(16); let (b_l, b_r) = c_l.split_at_mut(8); let (d_l, d_r) = c_r.split_at_mut(8); let (a, b, c, d) = (b_l, b_r, d_l, d_r) This bit of C code cannot be trivially translated to Rust, because in order to guarantee memory safety, Rust does not allow arbitrary pointer arithmetic. What Rust provides instead is a primitive named split_at_mut (or split_at for immutable slices), which allows the programmer to relinquish ownership of a slice, and obtain in exchange two sub slices that split the original slice at the given index. This permits some restricted notion of pointer arithmetic, while preserving Rust’s invariant that mutable data should have a unique owner: to regain ownership of the original slice, the programmer must give up the sub slices. In the rest of this section, we explain how to map C’s pointer arithmetic onto Rust’s splitting paradigm, using a novel notion of split trees. Split Trees. We show on the right, above, the Rust code that our translation generates; we use it to illustrate our translation by example; then explain the general principle. When translating C’s pointer arithmetic to Rust, several difficulties arise. First, because we do not have length information coming from the C side, we need to assume that the chunks are not intended to be overlapping – if they were, this code would simply be impossible to type-check, and the programmer would have to rewrite their C code to make the intent more apparent, keeping in line with our semi-active approach to translating C to Rust. Second, the translation needs to be predictable and understandable by the user, so that translation failures can easily be matched with the location in the original C code that needs to be rewritten. For those reasons, we want to avoid backtracking in our translation, and perform the translation in a forward fashion. This means that we need a data structure that keeps the history of the calls to split_at, for instance to know at line 5 that C index 8 lives in c_l, and at line 6 that C index 24 lives in c_r, at Rust index 24 − 16 = 8. This data structure also must be attached to every program point, so as to translate an access through a C pointer into an access (possibly with an offset computation) through a Rust slice. For instance, a[0] would translate into a_r[0] after line 3, but into c_l[0] after line 4, and so on. We solve these challenges with a data structure called a split tree, synthesized during our translation: each C pointer maps to a (possibly singleton) split tree, which evolves in a flowdependent fashion. We present the split trees corresponding to abcd at different points of our translation in Figure 5; the split tree initially only contains the root abcd. The first pointer addition defines a as a sub-array of abcd starting at index 0, whose intended length is unknown. We thus split abcd at index 0, and keep this information, meaning indices [0; 0) of abcd are in the left slice, i.e., a_l, and indices [0; 32) are in a_r (Figure 5a). The second pointer addition on the same base pointer abcd triggers another split. At this program point, abcd is no longer available, because it has been borrowed to construct (a_l, a_r). Splitting abcd directly would be a mistake, because it would terminate these borrows, and render (a_l, a_r) unusable, thus making it impossible to translate any further usage of C pointer a. Instead, we leverage the fact that our split tree is a binary search tree, and discover that key 16 needs to be inserted as the right child of node a, i.e., we must split a_r at index 16. At this stage, a use of C pointer a would trigger a search in the Compiling C to Safe Rust, Formalized 9 (abcd, 0) (abcd, 0) (abcd, 0) a_l a_l a_r (a) After translating a (a_r, 16) c_l (a_r, 16) a_l c_r (b) After translating c (c_l, 8) (c_r, 8) b_l d_l b_r d_r (c) After translating b and d Fig. 5. Successive split trees during C translation. Internal nodes of the form (x, i) have been subjected to a split at Rust index i and are therefore borrowed at this program point. Leaf nodes are available. binary tree, and would return c_l as the current slice through which a may be accessed (Figure 5b). The translation of b is similar. Finally, for d, we find that index 24 is to be found in c_r; because the indices of the right subslice restart from 0, we must perform a subtraction to know that c_r needs to be split at index 8 (Figure 5c). We generalize this mechanism to all variables in the environment, and equip it with a few more bells and whistles. First, any usage of abcd that is not pointer arithmetic (e.g., f(abcd)) is taken to indicate that C variables a, b, c, and d are no longer useful and that the user intends to perform a fresh set of pointer arithmetic computations. (This happens frequently as a function may be composed of several steps, some expressed through macros. For instance, in our elliptic curve case study, it is common to have a function be a series of calls to ADD and MUL, each of which expands to operations that need their own split tree.) This allows the programmer to insert, in the source code, calls to, e.g., (void)abcd to provide a hint that the split tree of abcd needs to be reset, and that the corresponding Rust variables can go out of scope (i.e., their lifetimes can end). Such calls are optimized away by the compiler, and therefore do not have any impact at runtime. We leveraged this mechanism in numerous places in our HACL★ case study (Section 4.1). Second, we generalize the form of offsets to accept a more general language of expressions, which we now describe below. Symbolic Solver. The split trees we presented above operate over constant offsets for pointer arithmetic – this allows implementing a trivial order, which in turn allows us to maintain the binary search tree structure. In our study of real-world examples, we commonly encountered more complex offset expressions, for instance n and n + 8 where n might be a parameter of the current function, whose value is therefore not statically known. To address this issue, our implementation relies on a simple, deterministic symbolic solver, which is able to compare different kinds of arithmetic expressions containing symbolic variables, e.g., to determine that n + 8 is greater than n. While this solver is not complete and does not rely on contextual information to refine the analysis, it is nevertheless sufficient to translate large case studies as described in Section 4, and could be easily extended for further use-cases. Should the symbolic solver still fail to compare offsets, our current implementation emits a warning, along with the location of the problematic source code. This allowed us to fix the source code in our case studies, for instance, to replace 2*n with n+n. Should none of this work, our compiler then adopts a final heuristic: that the offsets that occur along the control-flow are monotonically increasing, i.e., they perform pointer arithmetic from left to right. This is somewhat of a last resort, although in our experience this was enough to finish compiling all of HACL★ from C to Rust. We might tweak this behavior to abort compilation instead, by default. Again, our semi-active approach to translating C to Rust means that, in case the programmer’s intent is unclear, it is oftentimes worthwhile to rewrite the source, rather than augment our solver with complex heuristics. 10 Aymeric Fromherz and Jonathan Protzenko Limitations. We now comment on this approach. First, we note that if there is true aliasing, i.e., the C program contains both a[8] and b[0] in our earlier example, the translated Rust code will perform an out-of-bounds array access. In other words, we must assume the array chunks, in C, to be disjoint. For overlap cases that can be distinguished statically (as above), we emit a compile-time error; otherwise, the Rust code will panic at runtime. We remark that we have not seen this pattern in our case studies, so this has not been a problem in our experience. Should this turn out to be a concern in the future, we envision either supplemental off-the-shelf static analyses; annotations in the source code to express intent more clearly; or for our chief case studies, i.e., verified code, changing the translation pipeline to retain array lengths (which do occur at the proof level for verification purposes) all the way down to Rust, for a sound and complete slice compilation strategy. 2.4 Generalizing to Whole Programs: Struct Definitions and Function Declarations Finally, we show how to translate top-level type and function definitions. Recall that, by default, C pointer types translate to Rust borrowed slices. This default choice needs to be selectively overriden. For instance, consider struct s *new_s(). Such a C function intends to create a new object and transfer ownership of it to the caller; in Rust, this would be fn new_s() -> Box. We adopt a set of heuristics that determine whether the return type of a function should be boxed or should be a borrow. For the former, we devise a new relation ↩→, which inherits all of @ the rules from ↩→ save for T-Pointer which is replaced by T-Pointer-B; for the latter, we devise a new relation ↩→ , which inherits all of the rules from ↩→ save for T-Pointer, which is replaced by ′ 𝑎 T-Pointer-L. The former translates C pointer types to owned boxes; the latter translates C pointer types to annotated slice borrows, because Rust makes lifetimes explicit in struct declarations. T-Pointer-L T-Pointer-B 𝑡 ↩→ 𝑇 ′ 𝑡 ↩→ 𝑇 𝑎 @ ′ & 𝑎[𝑇 ] 𝑡∗ ↩→ ′ 𝑡∗ ↩→ Box⟨[𝑇 ]⟩ 𝑎 @ We also run this analysis on type declarations, notably structures, in order to determine whether a structure should own its inner pointers or not. In our earlier example, if the C definition is struct s { uint8_t *data; };, then the type of new_s indicates to us that s intends to own its data, meaning it translates in Rust to struct S { data: Box<[u8]>; }, not struct S<'a> { data: &'a[u8] }. This means that some patterns in C become problematic. For instance, a struct s in C might be constructed either using heap allocations, or stack allocations, since both flavors of allocation yield a uint8_t*. However, in Rust, these allocations have different types. For heap allocations, a C malloc translates to a new Box, which simply gets moved in the Rust S structure. For stack allocations, because our translation does not backtrack, an array is first allocated on the Rust stack, then gets copied into a Box upon realizing that it needs to be assigned into the data field of S which has type Box<[u8]>. This finally explains why we have rules E-Array-Box and E-Slice-Box. 1 2 3 4 5 6 7 8 9 10 // T-Var uint8_t *x = malloc(16); struct s state = { .data = x }; // T-Array-Box uint8_t x[16] = { 0 }; struct s state = { .data = x }; // T-Slice-Box uint8_t x[16] = { 0 }; uint8_t *y = x; struct s state = { .data = x }; 1 2 3 let x = vec![0u8; 16].into_boxed_slice(); let state = State { data: x }; 4 5 6 let x = [0u8; 16]; let state = State { data: Box::new(x) } 7 8 9 10 let x = [0u8; 16]; let y: &[u8] = &x; let state = State { data: (*y).into() } Compiling C to Safe Rust, Formalized 11 Fundamentally, this discrepancy stems from the absence of ownership polymorphism for struct types in Rust, as they are nominal, and must commit to either Box<[T]> or &'a[T] at definition-time. Owing to mutual recursion between structures, we use a fixpoint computation to run this analysis over all structures in a given program. Ultimately, we separate type declarations into two disjoint sets: borrowed types, and owned types. This heuristic is best-effort, and can be overridden by the user if the results are not satisfactory. Again, we may advocate for a language of annotations to promote our semi-active approach to Rust translation. Tuples as structural types. Anticipating slightly on §3.1, we comment on nominal typing of structs. A struct declaration in Rust not only sets in stone how a t* gets represented (Box<[T]> vs &'a[T]), but also sets in stone whether the latter is mutable or immutable (&'a[T] vs &'a mut[T]). This means that a particular choice of representation may incur not only extra copies (as we just saw), but may also force (otherwise immutable) pointers to become mutable just to match the expected field type. We did run into this issue in one of our case studies, where the C code defined a type for pairs of pointers. Even though this C type was only used locally with mutable pointers, it forced the global struct definition to use mutable borrows, which in turn forced every usage of this struct to rely on mutable borrows, rendering the Rust code ill-typed. To address this unintended “contamination”, we allow for a struct (a nominal type) to be instead compiled as a tuple (a structural type), which permits polymorphism, passing around (&[u8], &[u8]) whenever possible, or (&mut [u8], &mut [u8]) otherwise, without affecting the rest of the program through a global struct type. 3 Static Analyses on Rust Code Building on our translation from Mini-C to Rust presented in the previous section, we now propose several static analyses operating directly, and post-translation, on Rust code to improve code quality and support a larger subset of C programs. These analyses are tailored to the borrow and move semantics of the Rust language; they aim to automatically infer mutability annotations and copiable types, therefore allowing more programs to typecheck once translated to Rust. Importantly, both the analyses presented in this section are untrusted: they do not modify the programs and only provide additional annotations, which are checked by the Rust typechecker at compile-time. 3.1 Mutability Analysis To ensure memory safety, Rust relies on a principle called mutability XOR aliasing. Concretely, it distinguishes between immutable borrows, of type &T, which can be freely shared and duplicated but only allow to read the referenced memory, and mutable borrows, of type &mut T, which allow to write memory, but require exclusive ownership, that is, no other reference to the same memory region can be live at the same time. This is enforced by the Rust borrow-checker. The distinction between mutable and immutable borrows raises two conflicting problems. To pass Rust borrow-checking, one needs to correctly provide mutability annotations, to ensure that memory being modified is indeed mutably borrowed. Unfortunately, marking all borrows as mutable is also not a solution. Beyond being unidiomatic, a function requiring only mutable borrows is also restrictive, as it prevents usages where aliasing would be safe, i.e., when arguments are only read. To reach a middle ground, the translation presented in Section 2 instead generates immutable borrows by default, which we now refine as-needed into mutable borrows through a custom mutability inference analysis. To do so, we perform a backward analysis on all translated functions, identifying expressions that perform memory updates, and backpropagating which program variables need to be mutable back to their definition site, inserting mutable borrows along the way. Mutability rules are as follows in Rust. If x has array type [T; N] and x[i] = e occurs, then x must be a mutable variable, i.e., declared as let mut x = .... If y has borrow type and y[i] = e occurs, 12 Aymeric Fromherz and Jonathan Protzenko then y must be a mutable borrow; furthermore, one can only mutably borrow variables that are themselves mutable, i.e., if let y: &mut [u8] = &mut z, then z itself must be declared as let mut z. We inductively define an analysis on the Rust syntax that is aware of the semantics above and synthesizes two variable sets 𝑉 and 𝑅, where 𝑉 contains variables that must be mutable (i.e., x and z, above), and 𝑅 contains variables that must have a mutable borrow type (i.e., y above). Applying this analysis to the output of our translation yields a Rust program that has been annotated with the minimum amount of mut qualifiers in variables, types, and function parameters, in order to typecheck. In practice, our analysis is more general, and can handle many more forms of assignments, with combinations of fields, borrows, array indexing, and so on. Our implementation also computes a third set 𝐹 , for fields of structs that ought to be mutable, and a fourth set 𝑃, for pattern variables in matches that ought to be ref mut – we omit these details here for the sake of simplicity. We formally present our mutability inference analysis in Figure 6. Our rules are presented as the 𝑉 ,𝑅 combination of a judgment Δ ⊢ 𝑒 ⇒ 𝑒 ′, 𝑉 ′, 𝑅 ′ , and a system of mutually-recursive equations over 𝑀 Δ. The final output of our analysis is the least fixed point that satisfies the equations over Δ. The 𝑉 ,𝑅 ⇒ judgment represents that Rust expression 𝑒 is transformed into expression 𝑒 ′ . This translation is 𝑀 performed with current sets 𝑉 , 𝑅, and returns new sets 𝑉 ′ and 𝑅 ′ , as the translation of 𝑒 might have added variables to both sets. The mode 𝑀 ∈ {𝑚𝑢𝑡, 𝑖𝑚𝑚} indicates the expected borrow mutability of the expression 𝑒, while Δ contains the function definitions, which are needed to retrieve the expected mutability when performing function calls. Rules I-ImmVar and I-MutVar are straightforward: if the translation expects variable 𝑥 to be a mutable borrow, then we add it to the set 𝑅 to backpropagate the information, otherwise, we leave both sets invariant. Rules I-ImmBorrow and I-MutBorrow are similar, but operate directly on borrows: if variable 𝑥 is borrowed and the expected type is a mutable borrow, then we replace the immutable borrow by a borrow, and indicate that variable 𝑥 must be marked as mutable. Rule I-Let demonstrates the backward nature of the analysis. As the type of the expression corresponds to the type of 𝑒 2 , we first translate 𝑒 2 with the same mode 𝑀, returning new sets 𝑉 ′ and 𝑅 ′ . We then rely on these sets to translate 𝑒 1 . If 𝑥 belongs to the set 𝑅 ′ , then it means that the rest of the program expects it to have a borrow type, we thus translate its definition with mode 𝑚𝑢𝑡. Finally, if variable 𝑥 is used mutably, i.e., belongs to 𝑉 ′ , we make its let-binding mut. Rule I-Update-Borrow presents the translation of a borrowed slice update, which introduces the need for mutable borrows. Our translation from C to Rust guarantees that the left-hand side has a slice borrow type (the translation rule for C updates is similar to E-Index, and ascribes a borrow type to 𝑒 1 ). To satisfy the Rust borrow-checker, the expression 𝑒 1 being modified must thus become a mutable slice borrow. We therefore translate with the 𝑚𝑢𝑡 mode. Conversely, both the index and the value being stored are only read; they are therefore translated with mode 𝑖𝑚𝑚. The translation of function calls is handled through rule I-Call. When calling a function 𝑓 , we first retrieve the type signature of 𝑓 from the environment Δ. We then translate the expressions corresponding to each function argument according to their expected mutability, computed through the function 𝑖𝑠_𝑚𝑢𝑡𝑏𝑜𝑟𝑟𝑜𝑤 (𝑇𝑖 ). This function returns 𝑚𝑢𝑡 is 𝑇𝑖 is of the shape &mut T, and 𝑖𝑚𝑚 in all other cases. The translation of all arguments can be done in parallel with the initial sets 𝑉 , 𝑅. To propagate the information acquired, we finally return the union of these sets; this faithfully models that if an expression 𝑒𝑖 requires a variable 𝑥 to be mutable while 𝑒 𝑗 does not have this requirement, then 𝑥 must indeed be mutable. A key feature of this rule is that it may modify the definition environment Δ, and thus trigger further recomputations to reach the fixed-point of our system of equations. Indeed, the context may force us to produce a mutable borrow, which can only be achieved by forcing the return type of f itself to be a mutable borrow. Compiling C to Safe Rust, Formalized 13 I-MutVar I-ImmVar 𝑉 ,𝑅 Δ ⊢ 𝑥 ⇒ 𝑥, 𝑉 , 𝑅 𝑖𝑚𝑚 I-ImmBorrow 𝑉 ,𝑅 I-MutBorrow 𝑉 ,𝑅 Δ ⊢ 𝑥 ⇒ 𝑥, 𝑉 , 𝑅 ∪ {𝑥 } 𝑉 ,𝑅 Δ ⊢ &𝑥 ⇒ &𝑥, 𝑉 , 𝑅 𝑚𝑢𝑡 Δ ⊢ &𝑥 ⇒ &𝑚𝑢𝑡 𝑥, 𝑉 ∪ {𝑥 }, 𝑅 𝑚𝑢𝑡 𝑖𝑚𝑚 I-Let 𝑉 ,𝑅 Δ ⊢ 𝑒 2 ⇒ 𝑒 2′ , 𝑉 ′, 𝑅 ′ 𝑀 𝑀 ′ = if 𝑥 ∈ 𝑅 ′ then 𝑚𝑢𝑡 else 𝑖𝑚𝑚 𝑉 ′ ,𝑅 ′ 𝐴 = if 𝑥 ∈ 𝑉 ′ then 𝑚𝑢𝑡 else ∅ Δ ⊢ 𝑒 1 ⇒′ 𝑒 1′ , 𝑉 ′′, 𝑅 ′′ 𝑀 𝑉 ,𝑅 Δ ⊢ let 𝑥 = 𝑒 1 ; 𝑒 2 ⇒ let 𝐴 𝑥 = 𝑒 1′ ; 𝑒 2′ , 𝑉 ′′, 𝑅 ′′ 𝑀 I-Update-Borrow 𝑉 ,𝑅 𝑉 ,𝑅 Δ ⊢ 𝑒 1 ⇒ 𝑒 1′ , 𝑉 ′, 𝑅 ′ 𝑉 ,𝑅 Δ ⊢ 𝑒 2 ⇒ 𝑒 2′ , 𝑉 ′′, 𝑅 ′′ 𝑚𝑢𝑡 Δ ⊢ 𝑒 3 ⇒ 𝑒 3′ , 𝑉 ′′′, 𝑅 ′′′ 𝑖𝑚𝑚 𝑖𝑚𝑚 𝑉 ,𝑅 Δ ⊢ 𝑒 1 [𝑒 2 ] = 𝑒 3 ⇒ 𝑒 1′ [𝑒 2′ ] = 𝑒 3′ , 𝑉 ′ ∪ 𝑉 ′′ ∪ 𝑉 ′′′, 𝑅 ′ ∪ 𝑅 ′′ ∪ 𝑅 ′′′ 𝑀 I-Call Δ(𝑓 ) = (𝑥 1 : 𝑇1, . . . 𝑥𝑛 : 𝑇𝑛 ) → 𝑇 ∀𝑖, Δ ⊢ 𝑒𝑖 𝑉 ,𝑅 ⇒ 𝑖𝑠_𝑚𝑢𝑡𝑏𝑜𝑟𝑟𝑜𝑤 (𝑇𝑖 ) 𝑒𝑖′, 𝑉𝑖 , 𝑅𝑖 𝑇 ′ = if 𝑀 = 𝑚𝑢𝑡 then 𝑚𝑎𝑘𝑒_𝑚𝑢𝑡 (𝑇 ) else 𝑇 𝑉 ,𝑅 Δ ⊢ 𝑓 (𝑒 1, . . . , 𝑒𝑛 ) ⇒ 𝑓 (𝑒 1′ , . . . , 𝑒𝑛′ ), Ø 𝑉𝑖 , Ø 𝑅𝑖 𝑀 𝑖 Δ(𝑓 ) = (𝑥 1 : 𝑇1, . . . 𝑥𝑛 : 𝑇𝑛 ) → 𝑇 ′ 𝑖 I-Sig 𝑅 = {𝑥𝑖 | 𝑖𝑠_𝑚𝑢𝑡𝑏𝑜𝑟𝑟𝑜𝑤 (𝑇𝑖 )} Δ⊢𝑒 ∀𝑖,𝑇𝑖′ = if 𝑥𝑖 ∈ 𝑅 ′ then 𝑚𝑎𝑘𝑒_𝑚𝑢𝑡 (𝑇𝑖 ) else 𝑇𝑖 ∅,𝑅 ⇒ 𝑒 ′, 𝑉 ′, 𝑅 ′ 𝑖𝑠_𝑚𝑢𝑡𝑏𝑜𝑟𝑟𝑜𝑤 (𝑇 ) Δ(𝑓 ) = (𝑥 1 : 𝑇1, . . . , 𝑥𝑛 : 𝑇𝑛 ) → 𝑇 {𝑒} Δ(𝑓 ) = (𝑥 1 : 𝑇1′, . . . , 𝑥𝑛 : 𝑇𝑛′ ) → 𝑇 {𝑒 ′ } Fig. 6. Mutability Inference Analysis, Selected Rules To infer mutability information for an entire Rust program, our analysis operates over top-level function definitions in I-Sig. This rule can be seen as the entrypoint of our analysis. Given a previous iteration of the analysis for 𝑓 , we compute the initial set 𝑅 of variables whose type is a mutable borrow, based on the signature of 𝑓 . The translation of function body 𝑒 returns a new function body 𝑒 ′ , as well as sets 𝑉 ′ and 𝑅 ′ . We update the definitions Δ with a new entry for 𝑓 , based on information inferred during the translation of 𝑒: if a function parameter 𝑥𝑖 belongs to the set 𝑅 ′ , meaning it is expected to be a mutable borrow, then we modify its type 𝑇𝑖 accordingly. The presentation as a system of iterated equations alleviates the need for a topological sort of the function definitions – there may simply be no such order, given mutually-recursive definitions in C and Rust. The environment is initially populated with the output of our translation, meaning there are no mut qualifiers anywhere (Section 2). We then iterate the equations over Δ until a fixed point is reached. There trivially exists such a fixed point: we only ever add mut qualifiers, meaning the number of iterations is bounded by the numbers of function parameters across the whole program. In practice, we of course do not do repeated iterations, and instead use an optimized fixed point library to eliminate un-necessary recomputations [Pottier 2009]. 14 Aymeric Fromherz and Jonathan Protzenko While Figure 6 presents the essence of our mutability inference algorithm, as mentioned earlier, our implementation supports a much larger set of features, needed to translate the real-world case studies presented in Section 4. We reiterate that this phase does not change the Rust program; it only augments it with additional mut qualifiers, meaning it can be validated a posteriori by the Rust compiler and we do not need to argue for its correctness. 3.2 Automatically Deriving Trait Instances A key component of the Rust language is its pervasive use of traits, which can be broadly seen as a Rust-specific version of typeclasses. Traits allow programmers to specify that a given type must implement certain features, for instance that it can be pretty-printed (Display trait), converted from and to another type (From and Into traits), or that it allows deep copies (Clone trait). When defining a new type T, i.e., a new structure or enumeration, one can provide an implementation of a given trait 𝔗 by manually defining the methods corresponding to 𝔗 for type T, e.g., implementing a fmt function that specifies how to pretty-print an element of type T allows to implement trait Display for T. For some traits however, implementations can be automatically derived, using the Rust #[derive(Trait)] attribute on the type definition, assuming that all fields of the structure or enumeration do implement the trait. One trait of particular interest for our translation is the Copy trait. A common pattern in C is to define a structure containing a pointer (e.g., a pointer to a string and the corresponding length), and to pass the structure by value when calling functions. This leads to a mismatch with the Rust move semantics. Rust implements an affine type system, meaning that, by default, values can be used at most once: passing a value to a function, e.g., f(s), invalidates the value 𝑠, leading to compile-time errors if s is further used in the program. To avoid this behavior, types can implement the Copy trait, which instead performs a shallow copy of the value when passing it to a function. To avoid Rust compilation errors due to C value-passing semantics when translating programs, we wish to automatically derive Copy when possible. All basic Rust types (e.g., integers or booleans) implement Copy. Additionally, if a type T satisfies Copy, so does an immutable borrow type &T. However, copying a mutable pointer (i.e., either &mut T or Box) is not allowed; this would contradict Rust’s guarantee that every piece of mutable data has a unique owner. To automatically add a #[derive(Copy)] annotation on appropriate structure and enumeration type definitions, we traverse all type definitions, and only derive the Copy trait if all fields are themselves Copy, that is, they are neither a mutable borrow nor a box, and if they are a custom type, this type implements Copy itself. The analysis is also performed through a fixpoint computation to handle (mutually) recursive structs and enums. In addition to Copy, our analysis also automatically derives the PartialEq, and Clone traits when possible. PartialEq allows the use of the == operator, and Copy allows let-binding a value without invalidating (moving out) its old name: our translation uses both of those. Clone allows performing deep copies on a given value, via an explicit call to the clone method. While our translation does not directly use this feature, implementing Clone is a prerequisite to implement Copy (in Rust’s parlance, Copy has Clone as a parent trait). Additionally, Clone is commonly used by library clients; we therefore strive to provide it as much as possible to facilitate the adoption of the translated Rust library. 4 Case Studies We implemented our approach in the KaRaMeL compiler [Protzenko et al. 2017], which consumes Low★, an embedded C-like DSL of the F★ programming language, and turns it into actual C. After 70+ nano-passes, the original Low★ has become Mini-C and only has a few constructs left. While KaRaMeL then pretty-prints Mini-C as actual C syntax, we instead build on Mini-C and implement the compilation scheme and analyses presented in the previous sections as an extension to the Compiling C to Safe Rust, Formalized 15 compilation pipeline, before finally emitting Rust code. This allows us to not only iterate on Mini-C and our translation quickly, but also to reuse the KaRaMeL compiler infrastructure and directly repurpose verified Low★ projects from C to Rust, one of the original motivations for our work, and one of the most high-value targets to translate. We have plans for a libclang-based frontend that would consume actual C syntax; given that KaRaMeL is written in OCaml, and that well-maintained OCaml libclang bindings exist, we estimate that this would require only minimal effort. Our current implementation totals 4,000 lines of OCaml code, including comments, and took one person-year to implement. We benefited from the existing libraries, helpers and engineering systems already developed for KaRaMeL; anything to do with Rust was added by us. In particular, to facilitate the adoption of generated C code into existing codebases, KaRaMeL implements many nano-passes to make code more idiomatic and human-looking, therefore simplifying its audit as part of integration processes. We extend these compilation passes with 7 Rust-specific nano-passes that significantly decrease warnings raised by Clippy, the main linter in the Rust ecosystem. These passes include, e.g., simplifying immediate dereferences of borrowed expressions (i.e., turning *(&e) into e), or removing dereferences that can be automatically inferred by the Rust compiler. 4.1 HACL★ Our first chief case study is the HACL★ verified cryptographic library [Polubelova et al. 2020]. HACL★ is made up of 170,000 lines of F★ code, which includes the Low★ implementations, pure specifications, and proofs of functional correctness, memory safety, and side-channel resistance. The Low★ code compiles to 95,000 lines of C (including headers) via KaRaMeL. More than 15 person-years went into the HACL★ project, and parts of HACL★ have been integrated into Python, Firefox, the Linux kernel, and a myriad of other well-known software projects. To avoid high-impact memory vulnerabilities in security-critical applications [Blessing et al. 2024; Durumeric et al. 2014; National Vulnerability Database 2014], many recent cryptographic applications have turned to Rust, either to provide safe implementations of widely used protocols like TLS [Rustls Contributors 2016], Wireguard [Cloudflare 2019] or Signal [Signal Messenger 2020], or to develop novel cryptographic constructions such as zero-knowledge proofs [arkworks contributors 2022; ZeroEx 2019]. Importantly, these applications all build on top of implementations of cryptographic primitives, raising the need for high-performance, high-assurance safe cryptographic libraries such as HACL★ in the Rust ecosystem. Naturally, given the engineering effort needed, it would inconceivable to rewrite all of HACL★ in Rust, and, say, use a Rust verification tool like Aeneas [Ho and Protzenko 2022], Verus [Lattuada et al. 2023] or Creusot [Denis et al. 2022] to redo all of the proofs. A more realistic solution is therefore to migrate the code to Rust; this is where our translation comes in. We converted all of the HACL★ codebase to translate, type-check and execute successfully in Rust via our translation scheme. We estimate that this effort took 3 person-months. Doing so, we gained insights into patterns in C that cause problems for translating to Rust. We now describe these patterns, and comment on the effort and difficulty of rewriting the source. The first issue was that HACL★ was making aggressive use of “in-place or disjoint” ownership patterns. Concretely, a field operation, e.g., modular multiplication, was expressed as void fmul(uint8_t *dst, uint8_t *x, uint8_t *y), with the precondition that the destination may alias either x or y, or both. While extremely common in C, this pattern simply cannot be expressed in Rust, as it would require aliasing a mutable piece of data. A systematic fix for this was to insert macro-based wrappers of the following form when calling fmul with possibly aliased values: 1 2 3 #define FMUL_COPYX(dst, x, y) uint8_t x_copy[ELEM_SIZE]; \ memcpy(x_copy, x, ELEM_SIZE); \ fmul(dst, x_copy, y); 16 Aymeric Fromherz and Jonathan Protzenko This wrapper has the exact same shape (and specification) as the original fmul wrapper, meaning that the (potentially-fragile) proof at call-site remains identical. Proving its correctness is also automatic, owing to its small size and innocuous definition. But once expanded, then translated to Rust, the original problematic, alias-inducing call-site disappears, and all the Rust compiler sees is a judiciously-inserted copy that the borrow-checker is happy with. The second issue was that HACL★ contained pointer arithmetic that fell outside of what our symbolic checker supported (Section 2.3). The reasons for this were varied: some pointer offsets were using inconsistent notation (e.g., n+n and 2*n within the same scope); some split trees were not well-scoped, meaning the code juggled between c, d (see Section 2.3) and a combined chunk cd alternatively within the same scope; aggressive inlining meant that it was not apparent when a new split tree should be allocated for a given variable. We fixed those with surgical rewrites of the source code, along with annotations of the form (void)abcd which indicate to our implementation that the split tree should be discarded since the user means to use the base pointer again. The third issue was that some patterns were generating moves in Rust, owing to excessive let-binding in the source. Consider, for instance, our running example of struct s which contains a data pointer to mutable bytes. 1 2 struct s state = { data: ... }; struct s state2 = state; 5 2 3 3 4 1 f(state2.data); g(state.data); 4 5 6 6 let state = S { data: ... }; let state2 = state; /* MOVE */ /* RE-BORROW inserted by Rust automatically (elided) */ f(state2.data); /* ERROR HERE: state has been moved out */ /* NOTE: state2 is still accessible */ As S contains a mutable pointer, it does not implement the Copy trait in Rust. Therefore, when assigning state to state2 leads to moving-out the value state, meaning that it cannot be further used in the program. The C program above therefore does not compile once translated to Rust. Importantly, despite state2.data not implementing Copy either, passing it to the function f does not lead to a move, which would forbid further uses of state2: the Rust compiler automatically inserts a reborrow, meaning it translates the call f(state2.data) to f(&(*(state2.data))). We were therefore able to fix this issue in a systematic manner, by inlining state2 in the source. The remaining issues do not fall into a generic pattern, and as such could not be fixed with a systematic rewriting. Those include specific algorithms like HKDF, which relied on in-place patterns and required inserting copies, along with actual proof work to reverify the rewritten algorithm. Another problematic algorithm was the streaming API for block algorithms, which deals with complex long-lived state, and required ad-hoc rewrites, mostly to avoid accessing the mutable state through aliases. Fortunately, this streaming algorithm was meta-programmed [Ho et al. 2023], meaning we only had to fix a single implementation for all 12 block algorithms present in HACL★. The total diff between our fork of HACL★, which extracts to Rust, and the original main branch is 2,000 lines added and 500 lines deleted. This means that less than 2% of the HACL★ code was affected. Most of these changes were systematic rewritings; the last category of issues that truly required deep insights about a specific algorithm (HKDF, streaming) represent a very small fraction of the total diff. We assess the performance impact of these many changes in Section 5. We report that our Rust-compiled HACL* has now been packaged within the libcrux cryptographic library, parts of which have been integrated into Mozilla’s NSS and OpenSSH. We thus expect that our Rust-compiled HACL* will slowly trickle down to many parts of the Rust ecosystem. 4.2 EverParse Our second case study relates to the EverParse framework [Ramananandro et al. 2019]. EverParse takes a data description language, and produces Pulse code (another DSL of F★) that just like Compiling C to Safe Rust, Formalized 17 Low★ feeds into KaRaMeL and eventually becomes Mini-C before being printed out as concrete C syntax. We set out to translate to Rust a specific parser generated by EverParse, namely, the CBOR-DET parser [Internet Engineering Task Force 2020]. CBOR-DET is an ongoing IETF draft for a binary format akin to JSON, and specifically, a deterministic variant of it. Having a verified Rust implementation is therefore high value and has the potential to become a reference implementation for future clients. We applied our translation to an existing CBOR-DET instance of EverParse, developed independently of this paper. CBOR-DET is made up of 9,000 lines of F★, including implementation, specification, and proofs. CBOR-DET compiles, via the original KaRaMeL toolchain, to 4,000 lines of C code. As part of the process, we enacted numerous improvements to our translation. Notably, the introduction of fixed points stems from the very recursive nature of those parser combinators. Ultimately, we were able to compile CBOR-DET with no modifications to the source code into Rust code that borrow-checks, compiles and passes the CBOR-DET test suite. The effort took less than two person-weeks, all spent on improvements to our implementation. 5 Experimental Evaluation We now set out to evaluate the performance of our translated code. There are numerous factors that we set out to investigate. First, we aim to understand whether the extra copies, reorderings, tweaks and macros we had to add to HACL★ affect the performance: we thus extract our modified version of HACL★ to C (HACL-rs), and compare it with baseline HACL★ (Vanilla), to see the extent to which our tweaks for Rust compatibility degrade the performance of the C code. Next, we wish to measure the overhead of Rust. For this, we compare execution speed of HACL★ and CBOR-DET between C and Rust. This tests whether Rust’s slices (with runtime lengths) and checked array accesses severely affect performance-critical code. Third, we consider a variant of HACL★ where our mutability analysis is disabled, and every piece of data is marked as mutable (HACL-rs-mut). This variant contains more copies than necessary in order to type-check in Rust, and fewer optimization opportunities for the compiler, since immutability is lost. This tells us whether mutability information has a performance impact. Last, we compare the impact of different optimization levels on the Rust and C code. To minimize differences between the toolchain, we rely on CBOR for this comparison, as it consists of a unique, monolithic file for both the C and Rust versions. Benchmarks. Our HACL★ benchmarks include several representative cryptographic algorithms, namely, authenticated encryption with additional data (Chacha20-Poly1305), Diffie-Hellman key exchange constructions based on elliptic curves (P-256 and X25519), and the widely used SHA-2 hashing function. Following existing performance benchmarks for HACL★, our benchmarks consist of 100,000 encryptions of 16,384 bytes of data for Chacha20-Poly1305, 4,096 Diffie-Hellman key exchanges for P-256, 100,000 Diffie-Hellman key exchanges using the elliptic curve Curve25519 (X25519), and 16,384 hashes of 16,384 bytes of data using SHA-256. For our baseline, we use the C code in the HACL★ GitHub repository, commit f218923. Results presented are the average of 5 runs, normalized with respect to our baseline, the existing C implementation of HACL★ (Vanilla). Our CBOR-DET benchmarks call the parser validator on various amounts of data; test validate-N corresponds to validating N bytes of data, repeated 100 times. Results presented are the average of 5 runs, and normalized w.r.t our baseline, the C implementation of CBOR parsers. Experimental Setup. All experiments are run on a MacOS Ventura 13.6 with an ARM M1 processor and 16GB of RAM. Aiming for an apples to apples comparison, both the C and Rust benchmarks are compiled with LLVM; we compile the C code with the -O2 optimization level, while the Rust code is compiled in release mode, with opt-level set to 2. To minimize the impact of differences in 18 Aymeric Fromherz and Jonathan Protzenko compilation toolchains, e.g., Rust treating an entire crate as a single translation unit, which opens up tremendous optimization opportunities, we enable link-time optimization (LTO) in both C and Rust. In C, we do this by passing the -flto option to clang, while we set lto = "fat" in Rust. To compare the impact of compiler optimizations, we evaluate optimization levels 0 to 3 in both C and Rust, via the -O0, -O1, -O2 and -O3 options for clang, and opt-level for rustc. Benchmark Chacha20-Poly1305 P-256 SHA-2 X25519 Average C HACL-rs-mut 0.9994 1.0310 0.9964 1.0196 1.0116 Vanilla 1.0 1.0 1.0 1.0 1.0 HACL-rs 1.0006 1.0286 0.9970 1.0124 1.00965 Rust HACL-rs-mut HACL-rs 1.0361 1.0424 1.0429 1.0469 1.0218 1.0181 1.0081 1.0106 1.0272 1.0295 Fig. 7. Experimental Evaluation of C and Rust versions of HACL★. Results are the average of 5 runs, and are normalized with respect to our baseline, the C implementation of HACL★, commit f218923. HACL-rs is the result of the translation presented in this paper, including small tweaks to satisfy the Rust type-checker, while HACL-rs-mut introduces further copies and marks all variables and borrows as mutable. HACL★. We present our experimental results for HACL★ in Figure 7. We observe few differences between the three different C versions; both our modified version of HACL★ containing tweaks to satisfy the Rust type-checker (HACL-rs) and the version with additional copies to satisfy the Rust borrow-checker while marking all data as mutable (HACL-rs-mut) are between 0.99x and 1.02x of vanilla HACL★, with an average overhead of 1.01x in both cases. This suggests that LLVM is mostly able to detect that these copies are superfluous, and to optimize them away at compile-time. Results for the Rust versions introduce a slightly larger, but still small overhead, with a performance between 1.01x-1.05x of the vanilla C implementation on the different benchmarks. This suggests that LLVM is also able to optimize the extra copies away when they appear in Rust code. The average overhead is in both cases around 1.02-1.03x, indicating that our translation only has a negligible overhead on high-performance, high-assurance cryptographic code. C Benchmark validate-495 validate-995 validate-1495 validate-1995 validate-2245 validate-2495 Average Vanilla 1.0 1.0 1.0 1.0 1.0 1.0 1.0 Inline 1.0264 1.0033 0.9977 0.9964 0.9970 0.9759 0.9996 Rust Vanilla Inline 1.3611 0.8463 1.5188 0.9330 1.4977 0.9322 1.4990 0.8880 1.4944 0.8969 1.4899 0.9046 1.4768 0.9002 Rust-unchecked Vanilla Inline 1.3497 0.8053 1.4987 0.8956 1.5273 0.8873 1.5018 0.8872 1.5374 0.8909 1.4782 0.8679 1.4821 0.8724 Fig. 8. Experimental evaluation of C and Rust versions of CBOR-DET. All results are normalized w.r.t. C. CBOR. We present in Figure 8 an experimental comparison of the C and Rust versions of CBORDET. As binary parser implementations heavily rely on array operations, we also compare against a variant of the Rust implementation where memory accesses are replaced with unsafe variants omitting runtime bounds checks (get_unchecked). We only do so for performance evaluation, and because the original C code is verified, meaning that there is a meta-argument that array bounds check can be safely skipped. Finally, we also include variants of the C and Rust implementations Compiling C to Safe Rust, Formalized 19 where core functions are decorated with inline annotations. Our goal here is to assess whether performance differences are driven by hardcoded Rust choices (fat pointers), or are simply subject to optimization opportunities and different levels of compiler aggressivity. While the output of our translation (Vanilla Rust) is slower than the original C code (Vanilla C), we observe that the introduction of inlining brings significant improvements to the Rust version, but leaves the performance of the C unchanged. Inline Rust outperforming Inline C is to take with a grain of salt: when reproducing experiments on other machines (namely, a Mac Pro M3 and an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz running Linux), trends were similar, but the inlined Rust code was sometimes slightly slower than the corresponding C code. This rather suggests that rustc is missing some optimization opportunities due to inlining that clang correctly identifies. Interestingly, the removal of runtime bounds checks does not provide a significant impact; this is in line with previous empirical studies on Rust bounds checking [Marzoev 2022]. Our conclusion is thus that the translation to Rust poses no significant performance issues, and only exposes the brittleness of performance optimizations in compilers, knowing that those can be remedied with sufficient inlining annotations, or perhaps upstream Rust compiler tweaks. Benchmark validate-495 validate-995 validate-1495 validate-1995 validate-2245 validate-2495 Average Opt level 0 C Rust 1.0 0.6253 1.0 0.6328 1.0 0.6395 1.0 0.6303 1.0 0.6369 1.0 0.6280 1.0 0.6321 Opt level 1 C Rust 0.0142 0.0168 0.0141 0.0168 0.0143 0.0170 0.0143 0.0170 0.0143 0.0170 0.0142 0.0169 0.0142 0.0169 Opt level 2 C Rust 0.0127 0.0167 0.0127 0.0168 0.0129 0.0169 0.0129 0.0169 0.0128 0.0170 0.0128 0.0169 0.0128 0.0169 Opt level 3 C Rust 0.0100 0.0167 0.0100 0.0167 0.0101 0.0169 0.0100 0.0169 0.0101 0.0169 0.0101 0.0168 0.0101 0.0168 Fig. 9. Comparison of optimization levels in C and Rust on CBOR validation. All results are normalized w.r.t. the C version compiled without optimizations. Compiler Optimizations. Our comparison of optimization levels is available in Figure 9. We observe several trends: first, while our translated Rust code outperforms the original C code when compiled without optimizations, C code compiled with an optimized clang outperforms Rust code compiled with rustc at a similar optimization level. Interestingly, performance gains also stop on the Rust side as soon as optimizations are enabled: Rust code compiled with optimization levels 1, 2, 3 has an identical performance for our translation of CBOR binary parsers, suggesting that possible optimizations might be missed by rustc, and that our earlier forced inlining might trigger those. 6 Related Work Since this is such a high-value target, many attempts have been made to automatically translate C to Rust. We now review the landscape of tools and techniques, keeping in mind that to the best of our knowledge, we are the only tool that sets as a goal to generate safe Rust code, while helping the programmer rewrite their C source to successfully translate. Historically, the first widespread tool for C to Rust translation is the aptly-named C2Rust [Galois and Immunant 2019], which combines earlier attempts Citrus [Lesinski 2017] and Corrode [Sharp 2020]. The tool translates C99 code to unsafe Rust, leveraging Rust’s ability to manipulate raw, C-like pointers via the use of unsafe. The tool accepts a large subset of C, and envisions a manual rewriting of the unsafe Rust into safe Rust. Several tools were since then built atop C2Rust, trying to automate the rewriting process that gradually removes unsafe in the translated Rust code. 20 Aymeric Fromherz and Jonathan Protzenko Emre et al. [2021] simply borrow-check the output of C2Rust, hijacking the Rust compiler in order to see if any unsafe raw pointers actually do borrow-check. In the event that they do, the usage of unsafe is removed, and the raw C pointers get promoted to regular, safe Rust borrows. The authors tackle programs of the same order of magnitude as us (e.g., libxml2, 200,000 lines of C), which C2Rust natively supports. Their analysis, however, focuses on functions whose unsafety comes from usage of raw pointers only, meaning that in libxml2, they rewrite 210 functions out of 3,029, but with a 97% success rate. The rest of the functions remain unsafe. Zhang et al. [2023] build a custom set of ownership- and mutability-based analyses that operate post-C2Rust, again trying to limit the amount of unsafe pointers and unsafe code. The tool, dubbed Crown, achieves greater unsafe reduction rates than Emre et al. [2021]. Emre et al. [2023] later opt for an in-house analysis (rather than reusing the Rust compiler), which allows for more functions to be successfully translated to safe Rust. This newer analysis may rewrite programs (rather than simply turn raw pointers into borrows). Still, only a fraction of the functions eventually make it to safe Rust. An interesting note is that the authors share our issues with nominal struct types (§7.2), but do not seem to use the “tuple trick” we described earlier. Ling et al. [2022] pushes the idea further, and uses TXL, a programming language dedicated to source-to-source transformations, to automate the rewriting of unsafe Rust into safe Rust even more. The authors introduce “semantics-approximating” rules, which ultimately allow them to convert a large fraction of unsafe functions into safe ones, although the preservation of semantics does not seem to be guaranteed. Once again, this is seen as a stepping stone for engineers to tackle the migration of a codebase, rather than a fully automated “fire and forget” translation approach. All of the works above aim to translate first into almost fully-unsafe Rust, then either manually or automatically rewrite the code to pull it out of the unsafe subset, reaching for partial safety. We advocate for a different approach: iterate on the C code until it successfully translates, keeping existing integration, unit testing, or proofs to assess the validity and/or correctness of the rewrites. Then, once the rewrite fits within the subset accepted by our translation, proceed, and obtain fully safe code. In addition, we make a formal claim as to the correctness of our approach. A separate line of work takes a more focused approach and studies specific patterns that occur in C and tries to translate them to idiomatic Rust – this is a concern that goes beyond mere safety and correctness. This line of work still builds atop C2Rust. Problems include: the Posix Lock API of C, and how to translate it to Rust’s native lock and ownership semantics [Hong and Ryu 2023]; how to replace output parameters with more Rust-native return values and algebraic data types, using an abstract interpretation analysis to infer what constitutes an output parameter [Hong 2023; Hong and Ryu 2024a]; or reconstructing Rust enumerations from C tagged unions [Hong and Ryu 2024b]. Again, these all partially remove unsafe; furthermore, these target a more systems-oriented subset of C, rather than the data-oriented subset we tackle. With the explosion in popularity of LLMs, translating C to Rust with an LLM was inevitable. [Pan et al. 2024] find that while GPT-4 generates code that is more idiomatic than C2Rust, only 61% of it is correct (i.e., compiles and produces the expected result), compared to 95% for C2Rust. Hong and Ryu [2025] investigate how to mitigate errors introduced by LLMs in the translation process. These works tackle another problem area, where possibly-faulty translations might be acceptable. We focus instead on critical applications, notably verified ones, and where having confidence in the correctness of the translation, and in the safety of the produced Rust code, is paramount. Focusing on C++ instead, CRAM [GrammaTech 2024] advocates for an approach similar in spirit to ours: rewrite the source C++ code to use abstractions that encode Rust’s borrow-checking discipline using advanced C++ templates and type-system features; then, once the code has been sufficiently refactored, translate it to Rust. Details on CRAM are scarce, as the tool appears to be Compiling C to Safe Rust, Formalized 21 closed-source, and no publications are available. The translation is described as “provably safe”, which does not conclusively indicate whether the produced code uses unsafe or not. Acknowledgments We thank Guillaume Boisseau and Son Ho for several discussions about the Rust semantics and borrow-checker, Tahina Ramananandro for invaluable help with understanding EverParse and the CBOR-DET implementation, and Franziskus Kiefer for integrating our work into libcrux. This work benefited from funding managed by the French National Research Agency under the France 2030 programme with the reference ANR-22-PECY-0006. References Sidney Amani, Alex Hixon, Zilin Chen, Christine Rizkallah, Peter Chubb, Liam O’Connor, Joel Beeren, Yutaka Nagashima, Japheth Lim, Thomas Sewell, Joseph Tuong, Gabriele Keller, Toby Murray, Gerwin Klein, and Gernot Heiser. 2016. Cogent: Verifying High-Assurance File System Implementations. Association for Computing Machinery, New York, NY, USA, 175–188. https://doi.org/10.1145/2872362.2872404 Andrew W. Appel. 2011. Verified Software Toolchain. In Proceedings of the European Conference on Programming Languages and Systems (ESOP). https://doi.org/10.1007/978-3-642-19718-5_1 Andrew W. Appel. 2015. Verification of a Cryptographic Primitive: SHA-256. In ACM Transactions on Programming Languages and Systems (TOPLAS). arkworks contributors. 2022. arkworks zkSNARK ecosystem. https://arkworks.rs Linard Arquint, Malte Schwerhoff, Vaibhav Mehta, and Peter Müller. 2023. A generic methodology for the modular verification of security protocol implementations. In Proceedings of the ACM Conference on Computer and Communications Security (CCS). 1377–1391. Karthikeyan Bhargavan, Cédric Fournet, and Markulf Kohlweiss. 2016. miTLS: Verifying protocol implementations against real-world attacks. In Proceedings of the IEEE Symposium on Security and Privacy (S&P). Jenny Blessing, Michael A Specter, and Daniel J Weitzner. 2024. Cryptography in the Wild: An Empirical Analysis of Vulnerabilities in Cryptographic Libraries. In ACM Asia Conference on Computer and Communications Security (ASIA CCS). 605–620. The Chromium Project. 2020. Memory Safety. https://www.chromium.org/Home/chromium-security/memory-safety/. Cloudflare. 2019. BoringTun: Userspace WireGuard implementation in Rust. https://github.com/cloudflare/boringtun. Kees Cook. 2022. [GIT PULL] Rust introduction for v6.1-rc1. https://lore.kernel.org/lkml/202210010816.1317F2C@keescook/. Antoine Delignat-Lavaud, Cédric Fournet, Bryan Parno, Jonathan Protzenko, Tahina Ramananandro, Jay Bosamiya, Joseph Lallemand, Itsaka Rakotonirina, and Yi Zhou. 2021. A Security Model and Fully Verified Implementation for the IETF QUIC Record Layer. In Proceedings of the IEEE Symposium on Security and Privacy (S&P). Xavier Denis, Jacques-Henri Jourdan, and Claude Marché. 2022. Creusot: a foundry for the deductive verification of rust programs. In International Conference on Formal Engineering Methods. Springer, 90–105. Robert Dockins, Adam Foltzer, Joe Hendrix, Brian Huffman, Dylan McNamee, and Aaron Tomb. 2016. Constructing Semantic Models of Programs with the Software Analysis Workbench. In Proceedings of the International Conference on Verified Software: Theories, Tools, Experiments (VSTTE). Zakir Durumeric, Frank Li, James Kasten, Johanna Amann, Jethro Beekman, Mathias Payer, Nicolas Weaver, David Adrian, Vern Paxson, Michael Bailey, and J. Alex Halderman. 2014. The Matter of Heartbleed. In Proceedings of the ACM SIGCOMM Conference on Internet Measurement (IMC). Mehmet Emre, Peter Boyland, Aesha Parekh, Ryan Schroeder, Kyle Dewey, and Ben Hardekopf. 2023. Aliasing limits on translating C to Safe Rust. Proceedings of the ACM on Programming Languages 7, OOPSLA1 (2023), 551–579. Mehmet Emre, Ryan Schroeder, Kyle Dewey, and Ben Hardekopf. 2021. Translating C to safer Rust. Proceedings of the ACM on Programming Languages 5, OOPSLA (2021), 1–29. Andres Erbsen, Jade Philipoom, Jason Gross, Robert Sloan, and Adam Chlipala. 2019. Simple High-Level Code for Cryptographic Arithmetic - With Proofs, Without Compromises. In Proceedings of the IEEE Symposium on Security and Privacy (S&P). Andres Erbsen, Jade Philipoom, Dustin Jamner, Ashley Lin, Samuel Gruetter, Clément Pit-Claudel, and Adam Chlipala. 2024. Foundational Integration Verification of a Cryptographic Server. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI). Galois and Immunant. 2019. C2Rust: Migrating Legacy Code to Rust. https://github.com/immunant/c2rust. GrammaTech. 2024. CRAM: C++ to Rust Assisted Migration. https://www.grammatech.com/cyber-security-solutions/ migration-to-memory-safe-code/ 22 Aymeric Fromherz and Jonathan Protzenko David Greenaway, June Andronick, and Gerwin Klein. 2012. Bridging the gap: Automatic verified abstraction of C. In Proceedings of the International Conference on Interactive Theorem Proving (ITP). Springer, 99–115. Son Ho, Aymeric Fromherz, and Jonathan Protzenko. 2023. Modularity, Code Specialization, and Zero-Cost Abstractions for Program Verification. In Proceedings of the International Conference on Functional Programming (ICFP). https: //doi.org/10.1145/3607844 Son Ho and Jonathan Protzenko. 2022. Aeneas: Rust verification by functional translation. Proceedings of the ACM on Programming Languages 6, ICFP (2022), 711–741. https://doi.org/10.1145/3547647 Jaemin Hong. 2023. Improving Automatic C-to-Rust Translation with Static Analysis. In 2023 IEEE/ACM 45th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 273–277. Jaemin Hong and Sukyoung Ryu. 2023. Concrat: An automatic C-to-Rust lock API translator for concurrent programs. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 716–728. Jaemin Hong and Sukyoung Ryu. 2024a. Don’t Write, but Return: Replacing Output Parameters with Algebraic Data Types in C-to-Rust Translation. Proceedings of the ACM on Programming Languages 8, PLDI (2024), 716–740. Jaemin Hong and Sukyoung Ryu. 2024b. To Tag, or Not to Tag: Translating C’s Unions to Rust’s Tagged Unions. arXiv preprint arXiv:2408.11418 (2024). Jaemin Hong and Sukyoung Ryu. 2025. Type-migrating C-to-Rust translation using a large language model. Empirical Software Engineering 30, 1 (2025), 3. Internet Engineering Task Force. 2020. Concise Binary Object Representation (CBOR) - RFC 8949. https://datatracker.ietf. org/doc/rfc8949/. Daniel Kästner, Jörg Barrho, Ulrich Wünsche, Marc Schlickling, Bernhard Schommer, Michael Schmidt, Christian Ferdinand, Xavier Leroy, and Sandrine Blazy. 2018. CompCert: Practical experience on integrating and qualifying a formally verified optimizing compiler. In Proceedings of the European Congress on Embedded Real Time Software and Systems (ERTS). Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood. 2009. seL4: Formal Verification of an OS Kernel. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP). Andrea Lattuada, Travis Hance, Chanhee Cho, Matthias Brun, Isitha Subasinghe, Yi Zhou, Jon Howell, Bryan Parno, and Chris Hawblitzel. 2023. Verus: Verifying Rust Programs using Linear Ghost Types. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA). https://doi.org/10.1145/3586037 Kornel Lesinski. 2017. Citrus: Convert C to Rust. https://gitlab.com/citrus-rs/citrus. Michael Ling, Yijun Yu, Haitao Wu, Yuan Wang, James R. Cordy, and Ahmed E. Hassan. 2022. In rust we trust: a transpiler from unsafe C to safer rust (ICSE ’22). Association for Computing Machinery, New York, NY, USA, 354–355. https: //doi.org/10.1145/3510454.3528640 Jiaxiang Liu, Xiaomu Shi, Ming-Hsien Tsai, Bow-Yaw Wang, and Bo-Yin Yang. 2019. Verifying arithmetic in cryptographic C programs. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 552–564. Alana Marzoev. 2022. How much does Rust’s bounds checking actually cost? https://readyset.io/blog/bounds-checks. Mozilla Hacks. 2024. Porting a cross-platform GUI application to Rust. https://hacks.mozilla.org/2024/04/porting-a-crossplatform-gui-application-to-rust/. The MSRC Team. 2019. A Proactive Approach to More Secure Code. https://msrc.microsoft.com/blog/2019/07/a-proactiveapproach-to-more-secure-code/. National Security Agency. 2022. Software Memory Safety. https://media.defense.gov/2022/Nov/10/2003112742/-1/-1/0/CSI_ SOFTWARE_MEMORY_SAFETY.PDF. National Vulnerability Database. 2014. Heartbleed bug. CVE-2014-0160 http://web.nvd.nist.gov/view/vuln/detail?vulnId= CVE-2014-0160. Rangeet Pan, Ali Reza Ibrahimzada, Rahul Krishna, Divya Sankar, Lambert Pouguem Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand. 2024. Lost in translation: A study of bugs introduced by large language models while translating code. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13. Marina Polubelova, Karthikeyan Bhargavan, Jonathan Protzenko, Benjamin Beurdouche, Aymeric Fromherz, Natalia Kulatova, and Santiago Zanella-Béguelin. 2020. HACLxN: Verified Generic SIMD Crypto (for All Your Favourite Platforms). In Proceedings of the ACM Conference on Computer and Communications Security (CCS). François Pottier. 2009. Lazy least fixed points in ML. Unpublished. Manuscript available at http://gallium. inria. fr/˜ fpottier/publis/fpottier-fix. pdf (2009). Jonathan Protzenko, Bryan Parno, Aymeric Fromherz, Chris Hawblitzel, Marina Polubelova, Karthikeyan Bhargavan, Benjamin Beurdouche, Joonwon Choi, Antoine Delignat-Lavaud, Cédric Fournet, Natalia Kulatova, Tahina Ramananandro, Aseem Rastogi, Nikhil Swamy, Christoph M. Wintersteiger, and Santiago Zanella-Béguelin. 2020. EverCrypt: A Fast, Verified, Cross-Platform Cryptographic Provider. In Proceedings of the IEEE Symposium on Security and Privacy (S&P). Compiling C to Safe Rust, Formalized 23 https://doi.org/10.1109/SP40000.2020.00114 Jonathan Protzenko, Jean-Karim Zinzindohoué, Aseem Rastogi, Tahina Ramananandro, Peng Wang, Santiago ZanellaBéguelin, Antoine Delignat-Lavaud, Catalin Hritcu, Karthikeyan Bhargavan, Cédric Fournet, and Nikhil Swamy. 2017. Verified Low-Level Programming Embedded in F*. In Proceedings of the International Conference on Functional Programming (ICFP). Tahina Ramananandro, Antoine Delignat-Lavaud, Cédric Fournet, Nikhil Swamy, Tej Chajed, Nadim Kobeissi, and Jonathan Protzenko. 2019. EverParse: Verified Secure Zero-Copy Parsers for Authenticated Message Formats.. In Proceedings of the USENIX Security Symposium. Tobias Reiher, Alexander Senier, Jeronimo Castrillon, and Thorsten Strufe. 2019. RecordFlux: Formal Message Specification and Generation of Verifiable Binary Parsers. In Formal Aspects of Component Software: 16th International Conference, FACS 2019, Amsterdam, The Netherlands, October 23–25, 2019, Proceedings (Amsterdam, The Netherlands). Springer-Verlag, Berlin, Heidelberg, 170–190. https://doi.org/10.1007/978-3-030-40914-2_9 Rustls Contributors. 2016. rustls: A Modern TLS library in Rust. https://github.com/rustls/rustls. Jamey Sharp. 2020. Corrode: Automatic semantics-preserving translation from C to Rust. https://github.com/jameysharp/ corrode. Signal Messenger. 2020. libsignal-protocol-rust. https://github.com/signalapp/libsignal-protocol-rust. StackOverflow. 2023. 2023 Developer Survey. https://survey.stackoverflow.co/2023/#section-admired-and-desiredprogramming-scripting-and-markup-languages. Nikhil Swamy, Tahina Ramananandro, Aseem Rastogi, Irina Spiridonova, Haobin Ni, Dmitry Malloy, Juan Vazquez, Michael Tang, Omar Cardona, and Arti Gupta. 2022. Hardening attack surfaces with formally proven binary format parsers. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI). The Register. 2023. Microsoft is busy rewriting core Windows code in memory-safe Rust. https://www.theregister.com/ 2023/04/27/microsoft_windows_rust/. The White House. 2024. Back to the Building Blocks: a Path Toward Secure and Measurable Software. https://www. whitehouse.gov/wp-content/uploads/2024/02/Final-ONCD-Technical-Report.pdf. Jeff Vander Stoep and Alex Rebert. 2024. Eliminating Memory Safety Vulnerabilities at the Source. https://security.googleblog. com/2024/09/eliminating-memory-safety-vulnerabilities-Android.html. Sara Verdi. 2023. Why Rust is the most admired language among developers. https://github.blog/2023-08-30-why-rust-isthe-most-admired-language-among-developers/. ZeroEx. 2019. OpenZKP. https://github.com/0xProject/OpenZKP. Hanliang Zhang, Cristina David, Yijun Yu, and Meng Wang. 2023. Ownership guided C to Rust translation. In International Conference on Computer Aided Verification. Springer, 459–482.