Beyond Reward: Offline Preference-guided Policy Optimization Yachen Kang 1 2 Diyuan Shi 2 Jinxin Liu 2 Li He 2 Donglin Wang 2 Abstract 1. Introduction This study focuses on the topic of offline preference-based reinforcement learning (PbRL), a variant of conventional reinforcement learning that dispenses with the need for online interaction or specification of reward functions. Instead, the agent is provided with fixed offline trajectories and human preferences between pairs of trajectories to extract the dynamics and task information, respectively. Since the dynamics and task information are orthogonal, a naive approach would involve using preference-based reward learning followed by an off-the-shelf offline RL algorithm. However, this requires the separate learning of a scalar reward function, which is assumed to be an information bottleneck of the learning process. To address this issue, we propose the offline preference-guided policy optimization (OPPO) paradigm, which models offline trajectories and preferences in a one-step process, eliminating the need for separately learning a reward function. OPPO achieves this by introducing an offline hindsight information matching objective for optimizing a contextual policy and a preference modeling objective for finding the optimal context. OPPO further integrates a well-performing decision policy by optimizing the two objectives iteratively. Our empirical results demonstrate that OPPO effectively models offline preferences and outperforms prior competing baselines, including offline RL algorithms performed over either true or pseudo reward function specifications. Our code is available on the project website: https://sites.google. com/view/oppo-icml-2023. Deep reinforcement learning (RL) offers a versatile framework for acquiring task-oriented behaviors, as evidenced by a growing body of literature (Kohl & Stone, 2004; Kober & Peters, 2008; Kober et al., 2013; Silver et al., 2017; Kalashnikov et al., 2018; Vinyals et al., 2019). In this framework, the ”task” is frequently expressed as maximizing the cumulative reward of trajectories produced by deploying the learning policy in the corresponding environment. However, the above RL formulation presupposes two critical conditions for decision policy training: 1) an interactable environment, and 2) a pre-specified reward function. Regrettably, online interactions with the environment can be both expensive and hazardous (Mihatsch & Neuneier, 2002; Hans et al., 2008; Garcıa & Fernández, 2015), while developing a suitable reward function typically necessitates considerable human effort. Additionally, the heuristic rewards often employed may be insufficient to express the true intent (Hadfield-Menell et al., 2017). To address these challenges, prior research has explored two approaches. First, some works have focused on the offline RL formulation (Fujimoto et al., 2019), where the learner has access to fixed offline trajectories along with a reward signal for each transition (or limited expert demonstrations). Second, others have considered the (online) preferencebased RL formulation, where the task objective is conveyed to the learner through preferences of a human annotator between two trajectories rather than rewards for each transition. In pursuit of further advancements in this setting, we propose a novel approach that relaxes both of these requirements and advocates for offline preference-based RL (PbRL). 
In the context of offline preference-based reinforcement learning (PbRL), where access to an offline dataset and labeled preferences between the offline trajectories is available, a common approach is to combine previous online PbRL methods with off-the-shelf offline RL algorithms (Shin & Brown, 2021). This two-step strategy, as illustrated in Fig.1 (left), typically involves training a reward function using the Bradley-Terry model (Bradley & Terry, 1952) in a supervised manner, followed by training the policy with any offline RL algorithm on the transitions relabeled via the learned reward function. However, the 1 College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, China 2 Machine Intelligence Lab (MiLAB) of the School of Engineering, Westlake University, Hangzhou, Zhejiang, China. Correspondence to: Donglin Wang , Yachen Kang . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). 1 Beyond Reward: Offline Preference-guided Policy Optimization Figure 1. A flow diagram of previous offline PbRL algorithms (left) and our OPPO algorithm (right). Previous works require learning a separate reward function for modeling human preferences using the Bradley-Terry model. In contrast, our OPPO directly optimizes the policy network. 2. Related Work practice of separately learning a reward function that explains expert preferences may not directly instruct the policy on how to act optimally. This is because preference labels define the PbRL task, and the goal is to learn the most preferred trajectory by the annotator rather than to maximize the cumulative discounted proxy rewards of the policy rollouts. In cases of complex tasks, such as nonMarkovian tasks, scalar rewards may create an information bottleneck in policy improvement, resulting in suboptimal behavior (Vamplew et al., 2022). Additionally, an isolated policy optimization may exploit loopholes in miscalibrated reward functions, leading to undesirable behaviors. Given these limitations, it is reasonable to question the necessity of learning a reward function, especially considering that it may not directly yield the optimal policy. Since OPPO is at the intersection of Preference-Based Reinforcement Learning, Offline RL, and conditional RL we review the most relevant algorithms from these fields (see Table 1) Online PbRL. Preference-based RL is also known as reinforcement learning from human feedback (RLHF). Several works have successfully utilized feedback from real humans to train RL agents (Arumugam et al., 2019; Christiano et al., 2017; Ibarz et al., 2018; Knox & Stone, 2009; Lee et al., 2021; Warnell et al., 2017). Christiano et al. (2017) scaled preference-based reinforcement learning to utilize modern deep learning techniques, and Ibarz et al. (2018) improved the efficiency of this method by introducing additional forms of feedback such as demonstrations. Recently, PEBBLE (Lee et al., 2021) proposed a feedback-efficient RL algorithm by utilizing off-policy learning and pre-training. SURF (Park et al., 2022) used pseudo-labeling to utilize unlabeled segments and proposed a novel data augmentation method called temporal cropping. All of the above methods require the agent to online interact with the environment, and they are all two-step strategies that require learning a scalar reward function separately. 
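To make the two-step baseline concrete, the sketch below shows, in PyTorch-style Python, how a reward model rψ could be fit to preference triples with the Bradley-Terry model before the relabeled transitions are handed to an offline RL algorithm. This is a minimal illustration assuming fixed-length trajectory segments and an MLP reward network; the class and function names are ours, not those of the cited baselines.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Illustrative r_psi(s, a): an MLP mapping a state-action pair to a scalar reward."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states, actions):          # (B, T, ds), (B, T, da)
        x = torch.cat([states, actions], dim=-1)
        return self.net(x).squeeze(-1)           # per-step rewards, shape (B, T)

def bradley_terry_loss(reward_model, seg_i, seg_j, y):
    """Cross-entropy loss on preference labels y in {0, 1, 0.5}
    (y = 0: tau_i preferred, y = 1: tau_j preferred, y = 0.5: tie)."""
    ret_i = reward_model(*seg_i).sum(dim=-1)     # predicted return of tau_i
    ret_j = reward_model(*seg_j).sum(dim=-1)     # predicted return of tau_j
    p_i = torch.sigmoid(ret_i - ret_j)           # P[tau_i > tau_j] under Bradley-Terry
    return -((1 - y) * torch.log(p_i + 1e-8) + y * torch.log(1 - p_i + 1e-8)).mean()
```

Once rψ is trained, every transition in the offline dataset is relabeled with rψ(s, a) and passed to an off-the-shelf offline RL learner; OPPO is designed to remove exactly this intermediate reward-learning step.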
To achieve this objective, we present the offline preferenceguided policy optimization (OPPO) approach, which is a one-step paradigm that simultaneously models offline preferences and learns the optimal decision policy without requiring the separate learning of a reward function (as illustrated in Figure 1 right). This is achieved through the use of two objectives: an offline hindsight information matching objective and a preference modeling objective. By iteratively optimizing these objectives, we derive a contextual policy π(a|s, z) to model the offline data and an optimal context z∗ to model the preference. The main focus of OPPO is on both learning a high-dimensional z-space and evaluating policies within such space. This high-dimensional z-space captures more task-related information compared to scalar reward, making it ideal for policy optimization purposes. Furthermore, the optimal policy is obtained by conditioning the contextual policy π(a|s, z) on the learned optimal context z∗ . Offline RL. To mitigate the impact of distribution shifts in offline RL, prior algorithms (a) constrain the action space (Fujimoto et al., 2019; Kumar et al., 2019a; Siegel et al., 2020; Zhuang et al., 2023), (b) incorporate value pessimism (Fujimoto et al., 2019; Kumar et al., 2020; Liu et al., 2022a), and (c) introduce pessimism into learned dynamics models (Kidambi et al., 2020; Yu et al., 2020). Another line of work explored learning a wide behavior distribution from the offline dataset by learning a task-agnostic set of skills, either with likelihood-based approaches (Ajay et al., 2020; Campos et al., 2020; Pertsch et al., 2020; Singh et al., 2020) or by maximizing mutual information (Eysenbach et al., 2018; Lu et al., 2020; Sharma et al., 2019). Some prior methods for RL is more similar to static supervised learning, such as Q-learning (Watkins, 1989; Mnih et al., 2013) and behavior cloning. In these mathods, the resulting agent’s performance is positively correlated to the quality of data used for training. In addition to aforementioned RL methods, Srivastava et al. (2019) and Kumar et al. (2019b) studied Our main contribution can be summarized as follows. Firstly, we propose OPPO, a concise, stable, and one-step offline PbRL paradigm that avoids the need for separate reward function learning. Secondly, we present an instance of a preference-based hindsight information matching objective and a novel preference modeling objective over the context. Finally, extensive experiments are conducted to demonstrate the superiority of OPPO over previous competitive baselines and to analyze its performance. 2 Beyond Reward: Offline Preference-guided Policy Optimization Table 1. A concise tabular representation of the differences between our method and related works. Learning a Separate Reward Function Method Supervised Signal Training Architectures Imitation Learning BC Expert demonstration Offline MLP × Online PbRL PEBBLE SURF Preference Preference Online Online MLP MLP ✓ ✓ Offline RL CQL DT Ground Truth Reward Ground Truth Reward Offline Offline MLP Transformer Offline PbRL OPAL PT OPPO Preference Preference Preference Offline Offline Offline MLP Transformer Transformer ✓ ✓ × bution, and γ ∈ [0, 1) denotes the discount factor. At each timestep t, the agent receives a state st from the environment and chooses an action at based on the policy π(at |st ). In the standard RL framework, the environment returns a reward rt , and the agent transits toPthe next state st+1 . 
The expected return $J_r(\pi) = \mathbb{E}_{\tau \sim \pi(\tau)}\big[\sum_{k=0}^{\infty} \gamma^k r(s_{t+k}, a_{t+k})\big]$ is defined as the expectation of discounted cumulative rewards, where τ = (s0, a0, s1, a1, ...), s0 ∼ p0(s0), at ∼ π(at|st), and st+1 ∼ P(st+1|st, at). The agent's goal is to learn a policy π that maximizes the expected return.

"upside-down" reinforcement learning (UDRL), seeking to model behaviors via a supervised loss conditioned on a target return. Ghosh et al. (2019); Liu et al. (2022b; 2021) extended prior UDRL methods to perform goal reaching by taking the goal state as the condition, and Paster et al. (2020) further used an LSTM for goal-conditioned online RL settings. DT (Chen et al., 2021) and TT (Janner et al., 2021) solved the problem via sequence modeling, which they argue can model behaviors without access to the reward, in a style similar to language (Radford et al., 2018) and images (Chen et al., 2020). Although the above methods avoid online interaction between the agent and the environment, they all require ground-truth rewards or expert demonstrations to specify the task, which often demands considerable human labor.

3.1. Offline Preference-based Reinforcement Learning

In this work, we assume a fully offline setting in which the agent is not allowed to conduct online rollouts (over the MDP) during training but is provided with a static, fixed dataset. The static dataset, D := {τ 0, ..., τ N}, consists of pre-collected trajectories, where each trajectory τ i contains a contiguous sequence of states and actions: τ i := {si0, ai0, si1, ...}. Such an offline setting is more challenging than the standard (online) setting, as it removes the ability to explore the environment and collect additional feedback. Unlike imitation learning, we do not assume that the dataset comes from a single expert policy. Instead, the dataset D may contain trajectories collected by sub-optimal or even random behavior policies.

Offline PbRL. OPAL (Shin & Brown, 2021) first attempted to solve offline PbRL by simply combining a previous (online) PbRL method with an off-the-shelf offline RL algorithm. PT (Kim et al., 2023) introduced a new preference model based on a weighted sum of non-Markovian rewards and utilized a transformer-based architecture to implement it. Both of these works adopt the naive two-step strategy of learning a reward function separately. To avoid the information bottleneck that a scalar reward may create (Vamplew et al., 2022), OPPO jointly models offline preferences and learns the optimal decision policy in a one-step paradigm. In contrast to both supervised RL and UDRL, our method searches for the optimal solution supervised by a binary preference signal in the offline setting; it not only works with sub-optimal demonstrations but also reveals optimal behaviors without injecting human priors about the optimal demonstration. Generally, standard offline RL assumes the existence of reward information for each state-action pair in D. In the offline preference-based RL (PbRL) framework, however, we assume that such a reward is not accessible; instead, the agent can access offline preferences (between some pairs of trajectories (τ i, τ j)) that are labeled by an expert (human) annotator.
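As a concrete picture of the data this setting assumes, the following sketch shows one way the reward-free trajectory dataset D and the preference dataset D≻ could be represented, together with a scripted-teacher labeling rule of the kind commonly used for synthetic benchmarks (comparing ground-truth returns). The container classes and the `scripted_teacher`/`build_preference_dataset` helpers are illustrative and not part of the released implementation.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Trajectory:
    states: np.ndarray    # (T, state_dim)
    actions: np.ndarray   # (T, action_dim)
    # note: no per-step reward is stored in the offline PbRL setting

@dataclass
class PreferenceTriple:
    idx_i: int     # index of trajectory tau_i in D
    idx_j: int     # index of trajectory tau_j in D
    y: float       # 0: tau_i preferred, 1: tau_j preferred, 0.5: equally preferable

def scripted_teacher(returns: np.ndarray, i: int, j: int, tie_eps: float = 1e-3) -> float:
    """Hypothetical scripted annotator: prefers the trajectory with the higher
    ground-truth return. Real experiments may instead use human labels."""
    if abs(returns[i] - returns[j]) < tie_eps:
        return 0.5
    return 0.0 if returns[i] > returns[j] else 1.0

def build_preference_dataset(returns: np.ndarray, num_queries: int,
                             rng=np.random) -> List[PreferenceTriple]:
    pairs = rng.randint(0, len(returns), size=(num_queries, 2))
    return [PreferenceTriple(int(i), int(j), scripted_teacher(returns, i, j))
            for i, j in pairs]
```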
Specifically, the annotator gives a feedback indicating which trajectory is preferred, i.e., y ∈ {0, 1, 0.5}, where 0 indicates τ i ≻ τ j (the event that trajectory τ i is preferable to trajectory τ j ), 1 indicates τ j ≻ τ i (τ j is preferable to τ i ), and 0.5 implies an equally preferable case. Each feedback is stored in a labeled offline dataset D≻ as a triple (τ i , τ j , y). Given these preferences, the goal of PbRL is to find a policy π(at |st ) that maximizes the expected return Jrψ , under the hypothetical reward function rψ (st , at ) 3. Preliminaries We consider reinforcement learning (RL) in a Markov decision process (MDP) described by a tuple (S, A, r, P, p0 , γ), where st ∈ S, at ∈ A, and rt = r(st , at ) denote the state, action, and reward at timestep t, P (st+1 |st , at ) denotes the transition dynamics, p0 (s0 ) denotes the initial state distri3 Beyond Reward: Offline Preference-guided Policy Optimization consistent with human preferences. To enable this, previous works learn a reward function rψ (st , at ) and use the Bradley-Terry model (Bradley & Terry, 1952) to model the human preference, expressed here as a logistic function: P [τ i ≻ τ j ] = logistic( X rψ (sit , ait ) − X t (HIM) objective: min π E j [ℓ (I(τ ), I(τz ))] , (4) where z := I(τ ). In HER, we set I(τ ) as the final state in trajectory τ , and in reward-conditional RL, we set I(τ ) as the return of trajectory τ . Thus, we can use the hindsight information z := I(τ ) to provide supervision for training the contextual policy π(a|s, z). However, in the offline setting, sampling τz from π(z) is not accessible. Thus, we must model the environment transition dynamics besides I(·)-driven hindsight information modeling. That is to say, we need to model the trajectory itself, i.e., minπ Eτ ∼D(τ ),τz ∼π(z) [ℓ (τ, τz )]. Then, we provide the overall offline HIM objective: rψ (sjt , ajt )), t (1) where (sit , ait ) ∼ τ i , (sjt , ajt ) ∼ τ j . Intuitively, this can be interpreted as the assumption that the probability of preferring a trajectory depends exponentially on the cumulative reward over the trajectory labeled by an underlying reward function. The reward function is then updated by minimizing the following cross-entropy loss: − E τ ∼D(τ ) τz ∼π(z) h i (1−y) log P [τ i ≻ τ j ]+y log P [τ j ≻ τ i ] . (τ i ,τ ,y)∼D≻ min (2) With the learned reward function rψ used to label each transition in the dataset, we can adopt an off-the-shelf offline RL algorithm to enable the policy learning. π Beyond the typical iterative (offline) RL framework, information matching (IM) (Furuta et al., 2021) has been recently studied as an alternative problem specification in (offline) RL. The objective of IM in RL is to learn a contextual policy π(a|s, z) whose trajectory rollouts satisfy the pre-defined desired information statistics value z: π E [ℓ (z, I(τz ))] , [ℓ (I(τ ), I(τz )) + ℓ (τ, τz )] . (5) To give an intuitive understanding of the above objective, we provide a simple example: considering hindsight I(·) being the return of a trajectory, optimizing ℓ (I(τ ), I(τz )) ensures that the generated τz will reach the same return as τ = I −1 (z). However, in the offline setting, we must ensure that the generated τz stays in support of the offline data, eliminating the out-of-distribution (OOD) issue. Thus we minimize ℓ (τ, τz ) approximately. In implementation, directly optimizing ℓ (τ, τz ) is enough to ensure the hindsight information is matched, e.g., ℓ (I(τ ), I(τz )) < ϵ. 
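To make the hindsight idea concrete, the sketch below shows the special case discussed above in which I(τ) is the trajectory return (as in return-conditioned RL and DT): the statistic a trajectory actually achieved is relabeled as its conditioning target z, so the contextual policy π(a|s, z) can be trained with plain supervised learning. This assumes reward-labeled trajectories and therefore illustrates the generic HIM setting of this section, not OPPO itself, which learns the extractor Iθ from preferences; the helper name and the trajectory object (with .states and .actions arrays) are illustrative.

```python
import numpy as np

def hindsight_return_relabel(trajectory, rewards, gamma=1.0):
    """Return-based hindsight: I(tau) is the (discounted) return actually achieved
    by tau. That achieved statistic becomes the conditioning target z for every step
    of tau, yielding (state, z, action) tuples for supervised training of pi(a|s, z)."""
    T = len(rewards)
    discounts = gamma ** np.arange(T)
    z = float(np.sum(discounts * np.asarray(rewards)))   # I(tau) = return of tau
    return [(trajectory.states[t], z, trajectory.actions[t]) for t in range(T)]
```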
Here, we explicitly formalize the ℓ (I(τ ), I(τz )) term with particular emphasis on the requisite of hindsight information matching objective and meanwhile highlight the difference, see Section 4, between the above HIM objective (taking I(·) as a prior) and our OPPO formulation (requiring learning Iθ (·)). 3.2. Hindsight Information Matching min E τ ∼D(τ ) τz ∼π(z) (3) z∼p(z) τz ∼π(z) where p(z) is a prior, and π(z) denotes the trajectory distribution generated by rolling out π(a|s, z) in the environment. I(τ ) is a function capturing the statistical information of a trajectory τ , such as the distribution statistics of state and reward, like mean, variance (Wainwright et al., 2008), and ℓ is a loss function. By optimizing Eq.5, we can obtain a contextual policy π(a|s, z). In the evaluation phase, the optimal policy π(a|s, z∗ ) can be specified by conditioning the policy on a selected target z∗ . For example, Decision Transformer (Chen et al., 2021) takes the desired performance as the target z∗ (e.g., specify maximum possible return to generate expert behavior), and RvS-G (Emmons et al., 2021) takes the goal state as the target z∗ . On the one hand, if we set p(z) as a prior distribution, optimizing Eq.3 corresponds to performing unsupervised (online) RL to learn a set of skills (Eysenbach et al., 2018; Sharma et al., 2019). On the other hand, if we set p(z) as statistical information of a given off-policy trajectory (or state-action) distribution D(τ ) (or D(s, a)), Eq.3 corresponds to an objective for hindsight information matching in (offline) RL. For example, HER (Andrychowicz et al., 2017) and return-conditioned RL (upside-down RL (Srivastava et al., 2019; Kumar et al., 2019b; Chen et al., 2021; Janner et al., 2021)) use the above concept of hindsight: specifying any trajectory τ in the dataset as the hindsight target and setting the information z in Eq.3 as I(τ ). Then, we provide the I(·)-driven hindsight information matching 4. OPPO: Offline Preference-guided Policy Optimization In this section, we present our method, OPPO (offline preference-guided policy optimization), which adopts the hindsight information matching (HIM) objective in Section 4.1 to model an offline contextual policy π(a|s, z), and introduces a triplet loss in Section 4.2 to model the human preference as well as the optimal context z∗ . At testing, we condition the policy on the optimal context z∗ and thus 4 Beyond Reward: Offline Preference-guided Policy Optimization Figure 2. OPPO first maps offline trajectories (both positive τ + and negative τ − ) to a latent space via the hindsight information extractor Iθ . It then optimizes the offline HIM objective LHIM . Finally, the belief of the optimal hindsight information z∗ is updated to model the human preference with objective LPM . Meanwhile, the preference modeling loss also regularizes the learning of the hindsight information extractor Iθ . conduct rollout with π(a|s, z∗ ). In principle, OPPO is compatible with any PbRL setting, including both online and offline. In the scope of our analysis and experiments, however, we focus on the offline setting to decouple exploration difficulties in online RL. struct the following preference modeling objective inspired by the contrastive loss in metric learning (Le-Khac et al., 2020): h i ∗ + ∗ − min ℓ(z , z ) − ℓ(z , z ) , (7) E ∗ 4.1. 
HIM-driven Policy Optimization where z+ and z− represent the embedding of the preferable (positive) trajectory Iθ (yτ j + (1 − y)τ i ) and that of the less preferable (negative) trajectory Iθ (yτ i + (1 − y)τ j ), respectively. Closing to the idea of using regret for modeling preference (Knox et al., 2022; Chen et al., 2022), our basic assumption of designing the objective in Eq.7 is that humans normally conduct two-level comparisons before giving preferences between two trajectories (τ i , τ j ): 1) separately judging the similarity between trajectory τ i and the hypothetical optimal trajectory τ ∗ , i.e. −ℓ(z∗ , zi ), and the similarity between trajectory τ j and the hypothetical optimal one τ ∗ , −ℓ(z∗ , zj ), and 2) judging the difference between the above two similarities (−ℓ(z∗ , zi ) vs. −ℓ(z∗ , zj )) and setting the trajectory with the higher similarity as the preferred one. Hence, optimizing Eq.7 guarantees finding the optimal embedding that is more similar to z+ and less similar to z− . To clarify, z∗ is the corresponding contextual information for τ ∗ , whereas τ ∗ will always be preferred over any offline trajectories in the dataset. z ,Iθ (τ i ,τ j ,y)∼D≻ As described in Section 3.1, to directly implement the offthe-shelf offline RL algorithms, previous works in PbRL explicitly learn a reward function with Eq.2 (as shown in Fig.1 left). As an alternative to such a two-step approach, we seek to learn the policy directly from the preference-labeled offline dataset (as shown in Fig.1 right). Inspired by the offline HIM objective in Section 3.2, we propose to learn a contextual policy π(a|s, z) in the offline PbRL setting. Assuming Iθ being a (learnable) network that encodes the hindsight information in PbRL, we formulate the following objective: h i min LHIM := E ℓ (Iθ (τ ), Iθ (τz )) + ℓ (τ, τz ) , (6) π,Iθ τ ∼D(τ ) τz ∼π(z) where z := Iθ (τ ). Note that Eq.6 is a different instantiation of Eq.5 in which we learn the hindsight information extractor Iθ (·) in the PRBL setting, while previous (offline) RL algorithms normally set I(·) to be a prior (Chen et al., 2021; Emmons et al., 2021). Such an encoder-decoder structure is now similar to Bi-directional Decision Transformer (BDT) proposed by (Furuta et al., 2021) for offline imitation learning. However, since expert demonstrations are missing in the PbRL setting, in Section 4.2, we propose to use the preference labels in D≻ to extract hindsight information. In practice, to robustify the preference modeling, we optimize the following objective using the triplet loss in place of the objective in Eq.7: h i ∗ + ∗ − min L := E max(ℓ(z , z ) − ℓ(z , z ) + m, 0) , (8) PM ∗ z ,Iθ where m is an arbitrarily set margin between positive and negative pairs. It is worth mentioning that the posterior of the optimal embedding z∗ and the hindsight information extractor Iθ (·) are updated alternatively to ensure learning stability. A better estimate of the optimal embedding helps 4.2. Preference Modeling To make the hindsight information Iθ (τ ) in Eq.6 match the preference information in the (labeled) dataset D≻ , we con5 Beyond Reward: Offline Preference-guided Policy Optimization Algorithm 1 OPPO: Offline Preference-guided Policy Optimization Require: Dataset D := {τ } and labeled dataset D≻ := {(τ i , τ j , y)}, where τ i ∈ D and τ j ∈ D. Return: π(a|s, z) and z∗ . 1: Initialize policy network π(a|s, z), hindsight information extractor Iθ : τ → z, and the optimal context embedding z∗ . 
2: while not converged do 3: Sample a batch of trajectories from D: {τ }B ∼ D. 4: Update π(a|s, z) and Iθ (·) with sampled {τ }B using LHIM . 5: Sample a batch of preferences from D≻ : {(τ i , τ j , y)}B ∼ D≻ . Update Iθ (·) and the optimal z∗ with sampled 6: {(τ i , τ j , y)}B using LPM . 7: end while and Iθ (·) based on sampled trajectories; consequently, given the z extracted out of an offline trajectory by the extractor, the policy is able to reconstruct it; 3) Then, we sample a batch of preferences from the labeled dataset D≻ ; 4) Finally, in Line 6, we update Iθ (·) and z∗ based on the sampled {(τ i , τ j , y)}B using Eq.8, making the optimal embedding z∗ near to the more preferred trajectory z+ , and meanwhile further away from the less preferred trajectory z− . In summary, OPPO learns a contextual policy π(a|s, z), a context (hindsight information) encoder Iθ (τ ), and the optimal context, z∗ , for the optimal trajectory τ ∗ . Compared with previous PbRL works (first learning a reward function with Eq.2 and then learning offline policy with off-the-shelf offline RL algorithms), OPPO learns the optimal (offline) policy (π(a|s, z∗)) directly and thus avoids the potential information bottleneck caused by the limited information capacity of scalar reward assignment. Compared with the HIM-based offline RL algorithms (e.g., DT (Chen et al., 2021) and RvS-G (Emmons et al., 2021)), OPPO does not need to manually specify the target context for the rollout policy π(a|s, ·) at the testing phase. the encoder to extract features to which the human labeler pays more attention. In contrast, a better hindsight information encoder, on the other hand, accelerates the search process for the optimal trajectory in the high-level embedding space. In this way, the loss function for the encoder consists of two parts: 1) a hindsight information matching loss in a supervised style as in Eq.6 and 2) a triplet loss as in Eq.8 to better incorporate the binary supervision provided by the preference-labeled dataset. 5. Experiments In this section, we evaluate and compare OPPO to other baselines in the offline PbRL setting. A central premise behind the design of OPPO is that the learned hindsight information encoder Iθ (·) can capture preferences over different trajectories, as described by Eq.8. Our experiments are therefore designed to answer the following questions: 4.3. Training Objectives & Implementation Details In our experiment, we consolidate ℓ in Eq.6 as MSE Loss and in Eq.8 as Euclidean Distance. In this case, we model z∗ as a point in the z−space, and the similarity measure ℓ is L2 distance. An alternative option is to model z∗ as a point sampled from a learned distribution in the z−space, where ℓ is a measurement between two distributions, such as the KL divergence. Also, we add a normalization loss Lnorm to constrain the L2 norm of all embeddings generated by hindsight information extractor Iθ . Ltotal := LHIM + αLPM + βLnorm 1. Does OPPO truly capture the preference? In other words, does the learned z-space (encoded by the learned Iθ (·)) align with the given preference? Please refer to Section 5.1. 2. Can the learned optimal contextual policy π(a|s, z∗ ) outperform the policy π(a|s, z) that is conditioned on any other context z ∈ {Iθ (τ )|τ ∈ D}? Please refer to Section 5.2. (9) 3. Can OPPO achieve the competitive performance compared with other offline PbRL baselines? Please refer to Section 5.3. The architecture overview of OPPO is shown in Fig.2. 
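The following PyTorch-style sketch summarizes how the three terms of Eq.9 could be combined in one update; it folds LHIM, the triplet loss LPM, and the norm penalty into a single function for brevity, whereas the paper alternates the LHIM and LPM updates as in Algorithm 1. The module interfaces, the action-reconstruction form of ℓ(τ, τz), and the default coefficients are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def oppo_losses(encoder, policy, z_star, traj_batch, pref_batch,
                margin=1.0, alpha=0.5, beta=0.1):
    """Sketch of L_total = L_HIM + alpha * L_PM + beta * L_norm (Eq. 9).
    encoder: I_theta, maps a trajectory (states, actions) to an embedding z
    policy:  contextual policy pi(a | s, z), returns predicted actions for a trajectory
    z_star:  learnable optimal context embedding (e.g., an nn.Parameter of shape (d_z,))
    """
    states, actions = traj_batch                     # (B, T, ds), (B, T, da)
    z = encoder(states, actions)                     # (B, d_z)
    pred_actions = policy(states, z)                 # autoregressive action reconstruction
    z_rec = encoder(states, pred_actions)            # embedding of the reconstructed rollout
    # L_HIM (Eq. 6): reconstruct the trajectory and match its hindsight information
    l_him = F.mse_loss(pred_actions, actions) + F.mse_loss(z_rec, z.detach())

    # L_PM (Eq. 8): triplet loss pulling z* toward the preferred embedding
    (s_i, a_i), (s_j, a_j), y = pref_batch           # y: float tensor in {0, 1, 0.5}
    z_i, z_j = encoder(s_i, a_i), encoder(s_j, a_j)
    z_pos = torch.where(y[:, None] > 0.5, z_j, z_i)  # ties (y = 0.5) handled arbitrarily here
    z_neg = torch.where(y[:, None] > 0.5, z_i, z_j)
    d_pos = (z_star - z_pos).norm(dim=-1)
    d_neg = (z_star - z_neg).norm(dim=-1)
    l_pm = torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

    # L_norm: penalize large embedding norms
    l_norm = z.norm(dim=-1).mean()
    return l_him + alpha * l_pm + beta * l_norm
```

Consistent with the text above, the sketch uses an MSE loss for ℓ in Eq.6 and the Euclidean distance for ℓ in Eq.8.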
OPPO models the hindsight information extractor Iθ as an encoder network Iθ : τ → z and we use the BERT architecture. Furthermore, similar to DT (Chen et al., 2021), we use the GPT architecture to model π(a|s, z), which predicts future actions via autoregressive modeling. For specific hyperparameter selection during the training process, please refer to the detailed description in Appendix A.1.3. Algorithm 1 details the training of OPPO, and the entire process is summarized as follows: 1) We sample a batch of trajectories from the dataset D; 2) In Line 4, we use Eq.6 (the hindsight information matching loss) to update π(a|s, z) 4. What benefits can we gain from designing the onestep offline PbRL, i.e., iteratively conducting offline data modeling (Eq.6) and preference modeling (Eq.8)? Please refer to Section 5.4. 5. How does OPPO behave in terms of the amount of preference feedback? Please refer to Section 5.5. 6. Can OPPO attain satisfactory results by incorporating preference from real human instead of scripted teacher? Please refer to Section 5.6. 6 Beyond Reward: Offline Preference-guided Policy Optimization Figure 3. We utilize t-SNE to visualize the z-space learned in Hopper environment, encoded with a well-trained Iθ (·), including the embedding of random trajectories in D, our learned z∗ (“orange dot”) and the actual optimal z∗∗ (“red dot”), embedding of the best trajectory/policy learned with online reinforcement learning method. Color of the points represent the normalized return of the corresponding trajectory τ . To answer the above questions, we evaluate OPPO on the continuous control tasks from the D4RL benchmark (Fu et al., 2020). Specifically, we choose Hopper, Walker, and Halfcheetah as three base tasks, with medium, mediumreplay, medium-expert as the datasets for each task. that are near z∗∗ often have high returns (points with a deeper color). Further, we observe that our learned optimal z∗ constantly stays close to actual optimal z∗∗ , which suggests that our learned z∗ preserves near-optimal behaviors. Thus, it gives justification that OPPO can make meaningful preference modeling. Table 2. Comparison of (normalized) performance when rollouting the contextual policy π(a|s, ·) conditioned on different context (z∗ , zhigh , and zlow ). 5.2. Can π(a|s, z∗ ) achieve better performance? Dataset z∗ zhigh zlow Hopper Medium-Expert Medium Medium-Replay 108.0 ± 5.1 86.3 ± 3.2 88.9 ± 2.3 94.2 ± 24.3 55.8 ± 7.9 78.6 ± 26.3 79.1 ± 28.8 51.6 ± 13.8 26.6 ± 15.2 Walker Medium-Expert Medium Medium-Replay 105.0 ± 2.4 85.0 ± 2.9 71.7 ± 4.4 106.5 ± 9.1 64.9 ± 24.9 55.7 ± 24.8 93.4 ± 7.4 72.6 ± 10.6 6.8 ± 1.7 Halfcheetah Medium-Expert Medium Medium-Replay 89.6 ± 0.8 43.4 ± 0.2 39.8 ± 0.2 48.3 ± 14.4 42.5 ± 3.9 35.6 ± 8.5 42.6 ± 2.6 42.4 ± 3.2 33.9 ± 9.2 717.7 581.9 448.9 Environment Sum Fig.3 shows that our learned Iθ (·) can produce a wellaligned context embedding z-space exhibiting effective preference modeling across (embeddings of) trajectories. More importantly, context embeddings’ preference property should be preserved when we condition the context on the learned contextual policy π(a|s, ·). In other words, Iθ (·) should transfers the preference relationship from (τ i , τ j ) to (ℓ(zi , z∗ ), ℓ(zj , z∗ )); further, rolling out the contextual policy π(a|s, ·), (τzi , τzj ) should similarly preserve the above preference relationship. To show that, we compare the performance of rollouts by the contextual policy π(a|s, ·) conditioned on different contexts in Table 2. 
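The comparison in Table 2 amounts to rolling out the same contextual policy under different fixed contexts. A minimal evaluation sketch is given below, assuming a Gymnasium-style environment API and a hypothetical `policy.act(state, z)` helper that returns an action for a single state (the actual transformer policy conditions on a history window).

```python
import numpy as np

def evaluate_context(env, policy, z, episodes=10, max_steps=1000):
    """Roll out the contextual policy pi(a | s, z) with a fixed context z
    (e.g., the learned z*, or the embedding of the best/worst offline trajectory)."""
    returns = []
    for _ in range(episodes):
        state, _ = env.reset()
        total = 0.0
        for _ in range(max_steps):
            action = policy.act(state, z)        # assumed helper, see note above
            state, reward, terminated, truncated, _ = env.step(action)
            total += reward
            if terminated or truncated:
                break
        returns.append(total)
    return float(np.mean(returns)), float(np.std(returns))

# e.g., compare evaluate_context(env, policy, z_star) against
# evaluate_context(env, policy, z_high) and evaluate_context(env, policy, z_low)
```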
We choose three context embeddings: z∗, zhigh (the embedding of the trajectory with the highest return in D), and zlow (the embedding of the trajectory with the lowest return in D), and report the respective rollout performances (averaged over 3 seeds). We find that the contextual policy π(a|s, z) conditioned on a z whose corresponding trajectory τ = Iθ−1(z) has a high (or low) return obtains a correspondingly high (or low) return when rolled out in the environment; e.g., π(a|s, zhigh) performs better than π(a|s, zlow), thus preserving the hindsight preference relationship. Further, when conditioned on the learned optimal z∗, π(a|s, z∗) outperforms the policy conditioned on any other offline embedding. Notice that our learned optimal π(a|s, z∗) performs better than the contextual policy π(a|s, zhigh). This result implies that the trajectory of our optimal policy is generally better than the other trajectories in the offline dataset.

5.1. Can z-space align well with given preferences?

In this subsection, we verify that OPPO produces well-aligned preferences over the z-space encoded by the learned Iθ. We first sample random trajectories from the offline dataset D, encode them with the learned Iθ, and use t-SNE (van der Maaten & Hinton, 2008) to visualize the encoded z, as shown in Fig.3. The learned optimal z∗ is marked with an orange dot. In addition, we mark the embedding of the optimal trajectory in the D4RL expert dataset, generated by the learned online optimal policy, with a red dot (z∗∗). According to Eq.8, embeddings near the actual optimal z∗∗ in the z-space correspond to trajectories that are more preferable according to the preference labels. Comparing the sampled trajectory embeddings, we find that OPPO successfully captures the preference. As illustrated in Fig.3, the trajectories (embeddings)
For comparison, we consider three offline PbRL methods and BC as baselines: 1) DT+r: performing Decision Transformer (Chen et al., 2021) with ground-truth reward function, and the results are run by us; 2) DT+rψ : performing Decision Transformer with a learned reward function (using Eq.2); 3) CQL+r: performing CQL (Kumar et al., 2020) with ground-truth reward function; 3) IQL+r: performing IQL with ground-truth reward function, the results are reported from IQL (Kostrikov et al., 2022); 4) BC: performing bahavior cloning on the dataset, the results are reported from DT (Chen et al., 2021). Hopper Walker HalfCheetah 88.9 ± 2.3 71.7 ± 4.4 39.8 ± 0.2 78.3 ± 7.1 66.3 ± 1.6 39.6 ± 0.1 200.4 184.2 Sum 5.4. Benefits of One-step Offline PbRL We conduct an ablation study to analyze the benefit of iterating LHIM and LPM (for updating Iθ ) in an one-step paradigm. Firstly, we remove Iθ from ∂LPM /∂θ and only keep the optimal embedding z∗ to be updated in Eq.8. Then, we continue to visualize the embedding z-space for this ablation setting (OPPO-a), the t-SNE visualization shown in Fig.4. By comparing Fig.4 to Fig.3, we can see that the preference relationship in the embedding space (learned with OPPO-a) is all shuffled. In a less expressive z-space, it is challenging to model the preference and find the optimal z∗ . Further, as shown in Table 4, the comparison results of mediumreplay tasks demonstrate that such an ablation does cause the performance degradation. In Table 3, we show the performance of OPPO and baselines. We have the following observations. 1) OPPO has retained a comparable performance against the Decision Transformer trained using true rewards. OPPO is a PbRL approach requiring only (human) preferences, which have a more flexible and straightforward form of supervision in the real world. 2) Although DT+rψ also shows competitive results in these benchmarks, such a method needs a target of returnto-go determined by the human prior 1 . Our method, in contrast, avoids the need for such a prior target by searching across the z-space. We argue that our searching method brings advantages because rewards are usually hard to obtain in real-world RL applications, where the preference is the only information accessible for training and deploying an RL method. Table 5. Ablation study of feedback quantity Dataset 50k 1k 500 Medium-Expert Medium Medium-Replay 108.0 ± 5.1 86.3 ± 3.2 88.9 ± 2.3 102.9 ± 3.2 90.8 ± 2.0 60.4 ± 3.0 104.9 ± 4.1 77.5 ± 12.8 68.5 ± 22.8 283.1 254.2 250.9 Sum 5.5. Performance with Different Amount of Preference Feedback 1 Preference-based relabelled rewards only participate in the training phase. During the evaluation phase of DT+rψ , we pass in the same target return-to-go value as in the original DT paper. For the Hopper task, we evaluate the impact of different amounts of preference labels on the performance of OPPO 8 Beyond Reward: Offline Preference-guided Policy Optimization Figure 4. t-SNE visualization of the embedding space learned with OPPO-a in Hopper environment. and show the results in Table 5. Specifically, OPPO is evaluated using the labels amount from 50k, 1k, 500, on the dataset from Medium-Expert, Medium, Medium Replay. As illustrated in the Table 5, OPPO performs the best when given 50k preference labels and achieves a total normalized score of 283.1 among the three datasets. However, the performance decreases at around 250 for feedback amount decreases to 1k and 500. 
OPPO thus remains reasonably robust to the amount of preference feedback used for training.

5.6. Performance of OPPO on Benchmark Tasks with Real Human Teacher

We have conducted additional experiments using real human-labeled data on the Hopper and Walker tasks. The human preferences are taken from the open-source dataset of PT (Kim et al., 2023), which was collected from actual humans familiar with robotic tasks. We have also carried out experiments on the Robomimic dataset (Mandlekar et al., 2022), which offers a set of offline datasets on 7-DoF robot manipulation domains. Here we test our method on two tasks (Lift and Can), where the offline data are collected by either one proficient human teleoperator (Proficient-Human) or multiple human teleoperators with varying proficiency (Multi-Human), and the preference labels are also provided by real humans. In Table 6, we compare OPPO to IQL+r (Kostrikov et al., 2022) and IQL+PT (Kim et al., 2023). We find that our method outperforms IQL+PT on most tasks and even achieves competitive or better performance than IQL+r.

Table 6. Performance of OPPO with preferences from real humans; results are reported over 3 seeds.
Environment     Dataset             IQL+r          IQL+PT         OPPO
Hopper          Medium-Expert       73.6 ± 41.5    69.0 ± 33.9    107.8 ± 1.6
Hopper          Medium-Replay       83.1 ± 15.8    84.5 ± 4.1     93.1 ± 1.3
Walker          Medium-Expert       107.8 ± 2.0    110.1 ± 0.2    106.4 ± 1.1
Walker          Medium-Replay       73.1 ± 8.1     71.3 ± 10.3    74.9 ± 0.7
Locomotion Sum                      337.47         334.9          382.2
Lift            Proficient-Human    96.8 ± 1.8     91.8 ± 5.9     94.7 ± 1.2
Lift            Multi-Human         86.8 ± 2.8     86.8 ± 6.0     98.7 ± 2.3
Can             Proficient-Human    74.5 ± 6.8     69.7 ± 5.9     75.3 ± 10.1
Can             Multi-Human         56.3 ± 8.8     50.5 ± 6.5     86.7 ± 12.7
Robosuite Sum                       314.3          298.7          355.3

To sum up, through six experiments and the visualization of the results, we demonstrate that the z-space learned by the encoder is informative and visually interpretable. The ablation study shows that a preference-guided embedding space of context improves task performance by a non-negligible margin. Moreover, OPPO can find an embedding that represents the context of the optimal trajectory, whose resulting trajectory is better than any offline trajectory in the dataset. Finally, in the offline setting with environment interaction disabled, our paradigm can acquire optimal behaviors using binary preference labels between sub-optimal trajectories. As shown in the experimental results, OPPO achieves competitive performance against DT trained with either true rewards or pseudo rewards.

6. Conclusion

This paper introduces offline preference-guided policy optimization (OPPO), a one-step offline PbRL paradigm. Unlike previous PbRL approaches that learn the policy from a pseudo-reward function (with learning a separate reward function as a prerequisite), OPPO directly optimizes the policy in a high-level embedding space. To enable that, we propose an offline hindsight information matching (HIM) objective and a preference modeling objective. Empirically, we show that iterating the above two objectives produces meaningful and preference-aligned embeddings of context. Moreover, conditioned on the learned optimal context, our HIM-based contextual policy achieves competitive performance on standard offline (PbRL) tasks.

Acknowledgements

This work was supported by STI 2030—Major Projects (2022ZD0208800), and NSFC General Program (Grant No. 62176215).
9 Beyond Reward: Offline Preference-guided Policy Optimization References Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International conference on machine learning, pp. 2052–2062. PMLR, 2019. Ajay, A., Kumar, A., Agrawal, P., Levine, S., and Nachum, O. Opal: Offline primitive discovery for accelerating offline reinforcement learning. arXiv preprint arXiv:2010.13611, 2020. Furuta, H., Matsuo, Y., and Gu, S. S. Generalized decision transformer for offline hindsight information matching. arXiv preprint arXiv:2111.10364, 2021. Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Pieter Abbeel, O., and Zaremba, W. Hindsight experience replay. Advances in neural information processing systems, 30, 2017. Garcıa, J. and Fernández, F. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015. Arumugam, D., Lee, J. K., Saskin, S., and Littman, M. L. Deep reinforcement learning from policy-dependent human feedback. arXiv: Learning, 2019. Ghosh, D., Gupta, A., Fu, J., Reddy, A., Devin, C., Eysenbach, B., and Levine, S. Learning to reach goals without reinforcement learning. 2019. Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324, December 1952. ISSN 0006-3444. doi: 10.2307/2334029. URL https: //doi.org/10.2307/2334029. Hadfield-Menell, D., Milli, S., Abbeel, P., Russell, S. J., and Dragan, A. Inverse reward design. Advances in neural information processing systems, 30, 2017. Hans, A., Schneegaß, D., Schäfer, A. M., and Udluft, S. Safe exploration for reinforcement learning. In ESANN, pp. 143–148. Citeseer, 2008. Campos, V., Trott, A., Xiong, C., Socher, R., Giró-i Nieto, X., and Torres, J. Explore, discover and learn: Unsupervised discovery of state-covering skills. In International Conference on Machine Learning, pp. 1317–1327. PMLR, 2020. Ibarz, B., Leike, J., Pohlen, T., Irving, G., Legg, S., and Amodei, D. Reward learning from human preferences and demonstrations in Atari. In Neural Information Processing Systems, 2018. Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and Mordatch, I. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084–15097, 2021. Janner, M., Li, Q., and Levine, S. Offline reinforcement learning as one big sequence modeling problem. Advances in neural information processing systems, 34: 1273–1286, 2021. Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., and Sutskever, I. Generative pretraining from pixels. In International conference on machine learning, pp. 1691– 1703. PMLR, 2020. Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., and Levine, S. QT-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In Computer Vision and Pattern Recognition, 2018. Chen, X., Zhong, H., Yang, Z., Wang, Z., and Wang, L. Human-in-the-loop: Provably efficient preference-based reinforcement learning with general function approximation. arXiv preprint arXiv:2205.11140, 2022. Kidambi, R., Rajeswaran, A., Netrapalli, P., and Joachims, T. Morel: Model-based offline reinforcement learning. Advances in neural information processing systems, 33: 21810–21823, 2020. Christiano, P. F., Leike, J., Brown, T. 
B., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. arXiv: Machine Learning, 2017. Kim, C., Park, J., Shin, J., Lee, H., Abbeel, P., and Lee, K. Preference transformer: Modeling human preferences using transformers for RL. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum? id=Peot1SFDX0. Emmons, S., Eysenbach, B., Kostrikov, I., and Levine, S. Rvs: What is essential for offline rl via supervised learning? arXiv preprint arXiv:2112.10751, 2021. Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018. Knox, W. B. and Stone, P. Interactively shaping agents via human reinforcement. In Proceedings of the fifth international conference on Knowledge capture - K-CAP ’09. ACM Press, 2009. doi: 10.1145/ Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020. 10 Beyond Reward: Offline Preference-guided Policy Optimization 1597735.1597738. URL https://doi.org/10. 1145/1597735.1597738. Knox, W. B., Hatgis-Kessell, S., Booth, S., Niekum, S., Stone, P., and Allievi, A. Models of human preference for learning reward functions. arXiv preprint arXiv:2206.02231, 2022. Liu, J., Hongyin, Z., and Wang, D. DARA: Dynamics-aware reward augmentation in offline reinforcement learning. In International Conference on Learning Representations, 2022a. URL https://openreview.net/forum? id=9SDQB3b68K. Liu, J., Wang, D., Tian, Q., and Chen, Z. Learn goalconditioned policy with intrinsic motivation for deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 7558–7566, 2022b. Kober, J. and Peters, J. Policy search for motor primitives in robotics. In Neural Information Processing Systems, 2008. Kober, J., Bagnell, J. A., and Peters, J. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, August 2013. ISSN 0278-3649, 1741-3176. doi: 10.1177/ 0278364913495721. URL https://doi.org/10. 1177/0278364913495721. Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. Lu, K., Grover, A., Abbeel, P., and Mordatch, I. Reset-free lifelong learning with skill-space planning. arXiv preprint arXiv:2012.03548, 2020. Kohl, N. and Stone, P. Policy gradient reinforcement learning for fast quadrupedal locomotion. In IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA ’04. 2004. IEEE, 2004. doi: 10.1109/robot.2004.1307456. URL https://doi. org/10.1109/robot.2004.1307456. Mandlekar, A., Xu, D., Wong, J., Nasiriany, S., Wang, C., Kulkarni, R., Fei-Fei, L., Savarese, S., Zhu, Y., and Martı́n-Martı́n, R. What matters in learning from offline human demonstrations for robot manipulation. In Conference on Robot Learning, pp. 1678–1690. PMLR, 2022. Kostrikov, I., Nair, A., and Levine, S. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022. URL https: //openreview.net/forum?id=68n2s9ZJWF8. Mihatsch, O. and Neuneier, R. Risk-sensitive reinforcement learning. Machine learning, 49(2):267–290, 2002. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013. 
Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. Stabilizing off-policy q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32, 2019a. Park, J., Seo, Y., Shin, J., Lee, H., Abbeel, P., and Lee, K. Surf: Semi-supervised reward learning with data augmentation for feedback-efficient preference-based reinforcement learning. arXiv preprint arXiv:2203.10050, 2022. Kumar, A., Peng, X. B., and Levine, S. Reward-conditioned policies. arXiv preprint arXiv:1912.13465, 2019b. Kumar, A., Zhou, A., Tucker, G., and Levine, S. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33: 1179–1191, 2020. Paster, K., McIlraith, S. A., and Ba, J. Planning from pixels using inverse dynamics models. arXiv preprint arXiv:2012.02419, 2020. Le-Khac, P. H., Healy, G., and Smeaton, A. F. Contrastive representation learning: A framework and review. Ieee Access, 8:193907–193934, 2020. Pertsch, K., Lee, Y., and Lim, J. J. Accelerating reinforcement learning with learned skill priors. arXiv preprint arXiv:2010.11944, 2020. Lee, K., Smith, L. M., and Abbeel, P. PEBBLE: Feedbackefficient interactive reinforcement learning via relabeling experience and unsupervised pre-training. In International Conference on Machine Learning, 2021. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. Improving language understanding by generative pre-training. 2018. Sharma, A., Gu, S., Levine, S., Kumar, V., and Hausman, K. Dynamics-aware unsupervised discovery of skills. arXiv preprint arXiv:1907.01657, 2019. Liu, J., Shen, H., Wang, D., Kang, Y., and Tian, Q. Unsupervised domain adaptation with dynamics-aware rewards in reinforcement learning. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https: //openreview.net/forum?id=u7Qb7pQk8tF. Shin, D. and Brown, D. S. Offline preference-based apprenticeship learning. arXiv preprint arXiv:2107.09251, 2021. 11 Beyond Reward: Offline Preference-guided Policy Optimization Siegel, N. Y., Springenberg, J. T., Berkenkamp, F., Abdolmaleki, A., Neunert, M., Lampe, T., Hafner, R., Heess, N., and Riedmiller, M. Keep doing what worked: Behavioral modelling priors for offline reinforcement learning. arXiv preprint arXiv:2002.08396, 2020. Warnell, G., Waytowich, N. R., Lawhern, V. J., and Stone, P. Deep TAMER: Interactive agent shaping in highdimensional state spaces. arXiv: Artificial Intelligence, 2017. Watkins, C. J. C. H. Learning from delayed rewards. 1989. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, October 2017. ISSN 0028-0836, 1476-4687. doi: 10.1038/nature24270. URL https://doi.org/10.1038/nature24270. Wattenberg, M., Viégas, F., and Johnson, I. How to use t-sne effectively. Distill, 2016. doi: 10.23915/ distill.00002. URL http://distill.pub/2016/ misread-tsne. Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J. Y., Levine, S., Finn, C., and Ma, T. Mopo: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129–14142, 2020. Singh, A., Liu, H., Zhou, G., Yu, A., Rhinehart, N., and Levine, S. Parrot: Data-driven behavioral priors for reinforcement learning. arXiv preprint arXiv:2011.10024, 2020. 
Zhuang, Z., LEI, K., Liu, J., Wang, D., and Guo, Y. Behavior proximal policy optimization. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum? id=3c13LptpIph. Srivastava, R. K., Shyam, P., Mutz, F., Jaśkowski, W., and Schmidhuber, J. Training agents using upside-down reinforcement learning. arXiv preprint arXiv:1912.02877, 2019. Vamplew, P., Smith, B. J., Källström, J., Ramos, G., Rădulescu, R., Roijers, D. M., Hayes, C. F., Heintz, F., Mannion, P., Libin, P. J., et al. Scalar reward is not enough: A response to silver, singh, precup and sutton (2021). Autonomous Agents and Multi-Agent Systems, 36 (2):41, 2022. van der Maaten, L. and Hinton, G. Visualizing data using tsne. Journal of Machine Learning Research, 9(86):2579– 2605, 2008. URL http://jmlr.org/papers/v9/ vandermaaten08a.html. Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J. P., Jaderberg, M., Vezhnevets, A. S., Leblond, R., Pohlen, T., Dalibard, V., Budden, D., Sulsky, Y., Molloy, J., Paine, T. L., Gulcehre, C., Wang, Z., Pfaff, T., Wu, Y., Ring, R., Yogatama, D., Wünsch, D., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Kavukcuoglu, K., Hassabis, D., Apps, C., and Silver, D. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575 (7782):350–354, October 2019. ISSN 0028-0836, 14764687. doi: 10.1038/s41586-019-1724-z. URL https: //doi.org/10.1038/s41586-019-1724-z. Wainwright, M. J., Jordan, M. I., et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008. 12 Beyond Reward: Offline Preference-guided Policy Optimization A. Appendix A.1. Implementation details A.1.1. C ODEBASE . Our code is based on Decision Transformer2 , and our implementation of OPPO is available at: https://github.com/bkkgbkjb/OPPO A.1.2. O PENAI G YM . We choose the OpenAI Gym continuous control tasks from the D4RL benchmark (Fu et al., 2020). The different dataset settings are described below. • Medium: 1 million timesteps generated by a ”medium” policy that achieves approximately one-third of the score of an expert policy. • Medium-Replay: the replay buffer of an agent trained to the performance of a medium policy (approximately 25k-400k timesteps in our environments). • Medium-Expert: 1 million timesteps generated by the medium policy concatenated with 1 million timesteps generated by an expert policy. For details of these environments and datasets, please refer to D4RL for more information. A.1.3. H YPERPARAMETERS During the offline HIM phase, we weighted sum all three losses as in Eq.9 (with ratios listed in Table 7) and perform backpropagation, while in Preference Modeling phase, only LPM is computed and backpropagated. Table 7. Hyperparameters of coefficients of combined losses during Offline HIM. Hyperparameter α β Value 0.25 for halfcheetah-medium-expert 0.5 for others 0.05 for halfcheetah-medium-expert 0.1 for others Our hyperparameters on all tasks are shown below in Table 8 and Table 9. Models were trained for 105 gradient steps using the AdamW optimizer Loshchilov & Hutter (2017) following PyTorch defaults. Table 8. Hyperparameters of z∗ searching for OpenAI Gym experiments. 
Hyperparameter Value Number of dimensions 8 for halfcheetah 16 for others 50k AdamW 10−2 for halfcheetah-medium-expert 10−3 for others 10−4 1 Amount of feedback Type of optimizer Learning rate Weight decay Margin 2 https://github.com/kzl/decision-transformer 13 Beyond Reward: Offline Preference-guided Policy Optimization Table 9. Hyperparameters of Transformer for OpenAI Gym experiments. Hyperparameter Value Number of layers Number of attention heads 3 2 for encoder transformer 1 for decision transformer 128 ReLU 64 20 0.1 10−4 0.25 10−4 Linear warmup for first 105 training steps Embedding dimension Nonlinearity function Batch size context length K Dropout Learning rate Grad norm clip Weight decay Learning rate decay A.1.4. C OMPUTATIONAL RESOURCES . The experiments were run on a computational cluster with 20x GeForce RTX 2080 Ti, and 4x NVIDIA Tesla V100 32GB for about 20 days. A.2. Additional results A.2.1. M ORE VISUALIZATION RESULTS ON z- SPACE . We further show the t-sne results of OPPO in 5 with the setting described in Section 5.1 in Walker and HalfCheetah environments. (a) walker (b) halfcheetah Figure 5. t-SNE of OPPO in Qalker and HalfCheetah including the embedding of random trajectories in D, the colors of the points represent the normalized return of the corresponding trajectory τ . 14 Beyond Reward: Offline Preference-guided Policy Optimization Our primary purpose of using t-SNE is for visualization, to illustrate the structure of the learned z-space in a more intuitive rather than quantitative manner. While t-SNE results are known to be hyper-parameter dependent (Wattenberg et al., 2016), we have included in Table 10 that listing the euclidean distances of z∗ from z∗∗ on different tasks. Environment Table 10. Euclidean distances of z∗ from z∗∗ on different tasks. [Min, Lower quartile, Median, Dataset ||z∗ − z∗∗ ||2 Upper quartile, Max] Percentile Rank (PR) Hopper Medium-Expert Medium Medium-Replay 12.68 15.50 13.33 [12.40, 13.29, 14.19, 15.01, 16.37] [13.20, 13.92, 14.81, 15.70, 17.23] [12.13, 13.24, 13.98, 14.83, 16.45] 97.9% 30.1% 72.0% Walker Medium-Expert Medium Medium-Replay 13.16 12.86 14.26 [12.62, 13.02, 14.13, 15.02, 16.40] [12.01, 12.84, 13.21, 14.03, 15.79] [10.90, 12.39, 13.14, 13.58, 15.18] 71.2% 74.1% 6.7% HalfCheetah Medium-Expert Medium Medium-Replay 10.42 3.22 1.89 [10.34, 10.82, 12.35, 12.71, 13.85] [3.19, 3.40, 4.15, 4.54, 5.80] [1.53, 1.98, 2.92, 3.69, 4.24] 99.7% 99.9% 79.9% To provide some context, we also calculated the distances between the embeddings of trajectories in each dataset and z∗ , then gathered the minimum, lower quartile, median, upper quartile, and maximum values in Table 10. Additionally, we included percentile rankings for the distances between z∗ and z∗∗ within each dataset. The results confirm the intuitions from the t-SNE plots, and the percentile rank maybe more informative. A.2.2. M ORE RESULTS OF ABLATION STUDY OF ONE - STEP PARADIGM We also show the t-sne results of the corresponding ablation study in 6 with the setting described in Section 5.4 in Walker and HalfCheetah environments. (a) walker (b) halfcheetah Figure 6. t-SNE visualization of the embedding space learned with OPPO-a in walker and halfcheetah environments. 15 Beyond Reward: Offline Preference-guided Policy Optimization By comparing Fig.6 to Fig.5, we discover that the structure of z-space significantly collapses in eight out of nine environments (except for halfcheetah medium-replay). 
More specifically, we can no longer recognize the distribution patterns and clusters that emerged in Fig.5, an observation in line with our conclusion in the main text.

Table 11. Ablation study of the one-step paradigm.
Environment     Dataset          OPPO           OPPO-a
Hopper          Medium-Expert    108.0 ± 5.1    103.5 ± 4.4
Hopper          Medium           86.3 ± 3.2     69.2 ± 7.4
Hopper          Medium-Replay    88.9 ± 2.3     78.3 ± 7.1
Walker          Medium-Expert    105.0 ± 2.4    108.8 ± 1.0
Walker          Medium           85.0 ± 2.9     80.7 ± 1.5
Walker          Medium-Replay    71.7 ± 4.4     66.3 ± 1.6
HalfCheetah     Medium-Expert    89.6 ± 0.8     90.1 ± 1.4
HalfCheetah     Medium           43.4 ± 0.2     43.4 ± 0.2
HalfCheetah     Medium-Replay    39.8 ± 0.2     39.6 ± 0.1
Sum                              717.7          679.8

However, it is also worth noting that the performance of OPPO-a on the D4RL benchmark is not hindered much by this uninformative z-space, as shown in Table 11. We attribute this to the effectiveness of the preference modeling phase, where our method is still able to find a meaningful z∗ even in a less expressive z-space. This is also supported by the t-SNE visualization (Fig.6), where our learned z∗ (orange dot) lies in a region of deep color, i.e., high normalized return.