Multi-Task Off-Policy Learning from Bandit Feedback Joey Hong 1 Branislav Kveton 2 Manzil Zaheer 3 Sumeet Katariya 2 Mohammad Ghavamzadeh 4 Abstract Because we cannot explore beyond the logged dataset, it is important to use the logged data in the most statistically efficient way. One way of achieving this is by modeling the structure of the solved problem. As an example, in bandit algorithms, we could achieve higher statistical efficiency by using information about the reward distribution (Garivier & Cappe, 2011), a prior distribution over model parameters (Thompson, 1933; Agrawal & Goyal, 2012; Chapelle & Li, 2012; Russo et al., 2018), or features (Dani et al., 2008; Abbasi-Yadkori et al., 2011; Agrawal & Goyal, 2013). In this work, we consider a natural structure where multiple similar tasks are related through a hierarchical Bayesian model (Gelman et al., 2013; Kveton et al., 2021; Hong et al., 2022b). Each task is parameterized by a task parameter sampled i.i.d. from a distribution parameterized by a hyperparameter. The unknown hyper-parameter relates the tasks. Specifically, data from any task helps in improving its estimate, which in turn improves estimates of all other task parameters. Many practical problems involve solving similar tasks. In recommender systems, the tasks can be users with similar preferences; in search engines, the tasks can be items with similar affinities. To learn statistically efficiently, the tasks can be organized in a hierarchy, where the task affinity is captured using an unknown latent parameter. We study the problem of off-policy learning for similar tasks from logged bandit feedback. To solve the problem, we propose a hierarchical off-policy optimization algorithm HierOPO. The key idea is to estimate the task parameters using the hierarchy and then act pessimistically with respect to them. To analyze the algorithm, we develop novel Bayesian error bounds. Our bounds are the first in off-policy learning that improve with a more informative prior and capture statistical gains due to hierarchical models. Therefore, they are of a general interest. HierOPO also performs well in practice. Our experiments demonstrate the benefits of using the hierarchy over solving each task independently. Although the tasks are similar, they are sufficiently different to require different polices, and we address this multi-task off-policy learning problem in this work. To solve the problem, we propose an algorithm called hierarchical off-policy optimization (HierOPO). Since off-policy algorithms must reason about counterfactual rewards of actions that may not have been taken frequently in the logged dataset, a common approach is to learn pessimistic, or lower confidence bound (LCB), estimates of the mean rewards and act according to them (Buckman et al., 2021; Jin et al., 2021). HierOPO is an instance of this approach where high-probability LCBs are estimated using a hierarchical model. 1. Introduction Many interactive systems, such as search and recommender systems, can be modeled as a contextual bandit (Li et al., 2010; Chu et al., 2011), where a policy observes a context, takes one of possible actions, and then receives a stochastic reward for that action. In many applications, it is prohibitively expensive to learn policies online by contextual bandit algorithms, because exploration has a major impact on user experience. However, offline data collected by a previously deployed policy are often available. 
Offline, or off-policy, optimization using such logged data is a practical way of learning policies without costly online interactions (Dudik et al., 2014; Swaminathan & Joachims, 2015). Our work makes four major contributions. First, we formalize the problem of multi-task off-policy optimization as a multi-task bandit in a hierarchical model. Second, we propose an efficient algorithm for solving the problem, which we call HierOPO. The key idea in HierOPO is to compute lower confidence bounds on the mean rewards of actions and act according to them. The LCBs can be computed in a closed form in linear Gaussian models (Lindley & Smith, 1972). Third, we analyze the quality of our policies using Bayesian error bounds. Our bounds capture the effect of a more informative prior and statistical gains due to hierarchical models. These are the first such bounds in off-policy learning and should be of a wide interest. Finally, we evalu- 1 University of California, Berkeley 2 Amazon 3 DeepMind Google Research. Correspondence to: Joey Hong . 4 Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). 1 Multi-Task Off-Policy Learning from Bandit Feedback ate HierOPO on both synthetic and real-world problems. At 2. Setting Q We start with introducing our notation. Random variables are capitalized, except for Greek letters like θ. For any positive integer n, we define [n] = {1, . . . , n}. The indicator function is 1{·}. The i-th entry of vector v is denoted by vi . If the vector is already indexed, such as vj , we write vj,i . We denote the maximum and minimum eigenvalues of matrix M ∈ Rd×d by λ1 (M ) and λd (M ), respectively. µ∗ θs,∗ Yt Xt t : St = s s∈S Figure 1. A graphical model of a multi-task contextual bandit. Here Px is a context distribution and π0 ∈ Π is a logging policy. To simplify notation, we assume that π0 is the same for all tasks. It can be stochastic. A graphical model of our setting, which shows dependencies among all variables in our problem, is in Figure 1. We do not assume that π0 is known, although this assumption is common (Dudik et al., 2014; Swaminathan & Joachims, 2015). In the classic contextual bandit (Li et al., 2010), the agent observes a context x ∈ X , where X is a context set; takes an action a ∈ A, where A is an action set; and observes a stochastic reward Y ∼ P (· | x, a; θ), where P (· | x, a; θ) is the reward distribution parameterized by a model parameter θ ∈ Θ. We denote the mean reward of action a in context x under parameter θ by r(x, a; θ) = EY ∼P (·|x,a;θ) [Y ] and assume that the rewards are σ 2 -sub-Gaussian. 2.2. Objective 2.1. Multi-Task Bandit The value of policy πs ∈ Π in task s ∈ S with parameter θs,∗ is defined as In this paper, we simultaneously solve m similar contextual bandit instances, which we call tasks. Therefore, our problem is a multi-task contextual bandit (Azar et al., 2013; Deshmukh et al., 2017; Cella et al., 2020; Kveton et al., 2021; Moradipari et al., 2022). The set of all tasks is denoted by S and we index the tasks by s ∈ S. The reward distribution in task s is parameterized by a task parameter θs,∗ ∈ Θ, which is sampled i.i.d. from a task prior distribution θs,∗ ∼ P (· | µ∗ ). The task prior is parameterized by an unknown hyper-parameter µ∗ , which is sampled from a hyper-prior Q. The hyper-prior represents the agent’s prior knowledge about µ∗ . 
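For concreteness, the sketch below simulates this data-generating process in the linear Gaussian instantiation used later in Section 3.2, mirroring the synthetic setup of Section 7.1. It is only an illustration: the specific dimensions, the uniform logging policy, and all variable names are our assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, K, n, sigma = 5, 10, 10, 500, 0.5           # features, tasks, actions, logged interactions, reward noise

# Hierarchy of Figure 1: hyper-parameter first, then i.i.d. task parameters.
mu_q, Sigma_q = np.zeros(d), np.eye(d)            # hyper-prior Q = N(mu_q, Sigma_q)
Sigma_0 = 0.25 * np.eye(d)                        # task prior covariance
mu_star = rng.multivariate_normal(mu_q, Sigma_q)                 # mu_* ~ Q
theta_star = rng.multivariate_normal(mu_star, Sigma_0, size=m)   # theta_{s,*} ~ N(mu_*, Sigma_0)

# Logged dataset D = {(S_t, X_t, A_t, Y_t)} with a uniform logging policy pi_0.
logs = []
for t in range(n):
    s = rng.integers(m)                           # task S_t
    phi = rng.uniform(-0.5, 0.5, size=(K, d))     # feature vectors phi(X_t, a) for all K actions
    a = rng.integers(K)                           # A_t ~ pi_0(X_t), here uniform
    y = phi[a] @ theta_star[s] + sigma * rng.normal()   # Y_t ~ N(phi^T theta_{S_t,*}, sigma^2)
    logs.append((s, phi, a, y))
```

The uniform logging policy is used here only for simplicity; it also matches the well-explored-data discussion in Sections 4 and 5, where a uniform π0 yields γ = Ω(1/d).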
In a recommender system, the task could be a user, the task parameter could be their preferences, and the hyper-parameter could be the preferences of an average user. We use this example in our experiments in Section 7. A similar setup was studied in the online setting by Hong et al. (2022b). V (πs ; θs,∗ ) = E [r(X, πs (X); θs,∗ )] , where the randomness is only over context X ∼ Px . The optimal policy πs,∗ is defined as πs,∗ = arg max V (π; θs,∗ ) . π∈Π Let π̂s ∈ Π be some estimated optimal policy from logged dataset D. A standard approach in off-policy optimization is to derive an (ε, δ) bound V (π̂s ; θs,∗ ) ≥ V (πs,∗ ; θs,∗ ) − ε , (1) which holds with probability at least 1 − δ for a specified maximum error ε (Strehl et al., 2010; Li et al., 2018). The bound says that the policy π̂s is at most ε worse than the optimum πs,∗ with a high probability. The error ε depends on δ, D, π̂s , and problem hardness. Such bounds can be derived using concentration inequalities for sub-Gaussian rewards (Boucheron et al., 2013), under the assumptions that the parameter θs,∗ is fixed and bounded. We call this setting frequentist. Unlike prior works in multi-task bandits, we focus on the offline setting where we learn task-specific policies from logged data. One distinguishing characteristic of our setting is that each task has its own parameter θs,∗ , and thus may require a different policy. As a result, we learn a separate policy πs : X → A for each task s. To simplify notation, we focus on deterministic policies. Our results can be extended to stochastic policies by accounting for an additional expectation over actions. We denote the set of stationary deterministic policies by Π = {π : X → A}. In our work, we study a Bayesian setting, where the prior distribution of θs,∗ and dataset D allow the agent to derive the posterior distribution of θs,∗ , P̂s (θ) = P (θs,∗ = θ | D). To model that θs,∗ is random, and that the prior and dataset D provide additional information about θs,∗ , it is natural to guarantee (1) in expectation over the posterior of θs,∗ . We formalize this as an (ε, δ) bound We learn the policies from a logged dataset of size n. The dataset is given by D = {(St , Xt , At , Yt )}t∈[n] , where St is a task, Xt ∼ Px is a context, At ∼ π0 (Xt ) is an action, and Yt ∼ P (· | Xt , At ; θSt ,∗ ) is a reward in interaction t. P (V (π̂s ; θs,∗ ) ≥ V (πs,∗ ; θs,∗ ) − ε | D) ≥ 1 − δ , 2 (2) Multi-Task Off-Policy Learning from Bandit Feedback where ε is a specified maximum error that depends on δ, D, π̂s , and problem hardness. The main difference from (1) is that θs,∗ , and thus also πs,∗ , are random. 3.1. Hierarchical Pessimism In any task s, the mean E [θs,∗ | D] in (4) can be estimated hierarchically as follows. Let Ds be the subset of dataset D corresponding to task s, where St = s. Recall that µ∗ is the hyper-parameter in Figure 1. Then, by the law of total expectation, The Bayesian view allows us to derive (ε, δ) error bounds with two new properties. First, the error ε decreases with a more informative prior on θs,∗ (Section 4). In frequentist bounds, the prior plays the role of a regularizer, unrelated to the estimated model parameter, and thus cannot capture this effect. Second, we show that the hierarchical model in Figure 1 can improve statistical efficiency in multi-task bandits (Section 5). 
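The quantities in this objective are straightforward to estimate numerically. The sketch below approximates V(π_s; θ_{s,∗}) and the suboptimality V(π_{s,∗}; θ_{s,∗}) − V(π̂_s; θ_{s,∗}) by Monte Carlo over contexts; the context distribution and dimensions are illustrative assumptions matching the synthetic setup in Section 7.1, not a prescribed evaluation protocol.

```python
import numpy as np

def policy_value(policy, theta, n_mc=10_000, K=10, d=5, rng=None):
    """Monte Carlo estimate of V(pi; theta) = E[r(X, pi(X); theta)] over random contexts."""
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for _ in range(n_mc):
        phi = rng.uniform(-0.5, 0.5, size=(K, d))   # features of the K actions in context X
        total += phi[policy(phi)] @ theta           # mean reward of the action chosen by pi
    return total / n_mc

def suboptimality(policy_hat, theta, **kw):
    """V(pi_*; theta) - V(pi_hat; theta): the quantity bounded in (1) and (2)."""
    opt = lambda phi: int(np.argmax(phi @ theta))   # pi_*(x) = argmax_a phi(x, a)^T theta
    return policy_value(opt, theta, **kw) - policy_value(policy_hat, theta, **kw)
```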
Our bounds and analyses are motivated by Bayes regret bounds in bandits (Russo & Van Roy, 2014; Lu & Van Roy, 2019; Kveton et al., 2021; Atsidakou et al., 2022; Hong et al., 2022b;a), which have similar properties and improve upon their frequentist counterparts similarly (Abbasi-Yadkori et al., 2011; Agrawal & Goyal, 2013). E [θs,∗ | D] = E [E [θs,∗ | µ∗ , D] | D] The second equality holds since conditioning on µ∗ makes θs,∗ independent of D \Ds . Our decomposition is motivated by the observation that estimating each E [θs,∗ | µ∗ , Ds ] is an easier problem than estimating E [θs,∗ | D], since all observations in Ds are from a single task s. The information sharing between the tasks is still captured by µ∗ , which is learned from the entire logged dataset D. 3. Algorithm Similarly, the covariance cov [θs,∗ | D] in (4) can be decomposed using the law of total covariance, Prior works in off-policy bandit and reinforcement learning often design pessimistic lower confidence bounds and then act according to them (Jin et al., 2021). We adopt the same design principle. Our LCBs satisfy Ls (x, a) ≤ r(x, a; θs,∗ ) with a high probability for task parameter θs,∗ | D, jointly over all tasks s, contexts x, and actions a. Specifically, we define them as Ls (x, a) = r̂s (x, a) − cs (x, a), where r̂s (x, a) = E [r(x, a; θs,∗ ) | D] , q cs (x, a) = α var [r(x, a; θs,∗ ) | D] , cov [θs,∗ | D] (6) = E [cov [θs,∗ | µ∗ , D] | D] + cov [E [θs,∗ | µ∗ , D] | D] = E [cov [θs,∗ | µ∗ , Ds ] | D] + cov [E [θs,∗ | µ∗ , Ds ] | D] . Again, the second equality holds since conditioning on µ∗ makes θs,∗ independent of D \ Ds . Note that (6) comprises two interpretable terms. The first captures the uncertainty of θs,∗ conditioned on µ∗ , whereas the second captures the uncertainty in µ∗ . This decomposes two sources of uncertainty in our problem, and is a powerful tool for structured uncertainty estimation (Hong et al., 2022a). (3) are the estimated mean reward and its confidence interval width, respectively; and α > 0 is a tunable parameter. Now we plug (5) and (6) into (4), and get Linear models are an important class of contextual bandit models (Dani et al., 2008; Abbasi-Yadkori et al., 2011; Li et al., 2010) and we also consider them here. Specifically, we assume that r(x, a; θs,∗ ) = ϕ(x, a)⊤ θs,∗ for each task s, where θs,∗ is the task parameter and ϕ : X × A → Rd is some feature extractor. Under this assumption, we can write (3) using the posterior mean and covariance of θs,∗ as r̂s (x, a) = ϕ(x, a)⊤ E [θs,∗ | D] , q cs (x, a) = α ϕ(x, a)⊤ cov [θs,∗ | D] ϕ(x, a) . (5) = E [E [θs,∗ | µ∗ , Ds ] | D] . r̂s (x, a) = ϕ(x, a)⊤ E [E [θs,∗ | µ∗ , Ds ] | D] , q cs (x, a) = α ϕ(x, a)⊤ Σ̂s ϕ(x, a) , where Σ̂s = E [cov [θs,∗ | µ∗ , Ds ] | D] + cov [E [θs,∗ | µ∗ , Ds ] | D] . With this in mind, we propose a general algorithm for hierarchical off-policy optimization, which we call HierOPO. Its pseudo-code is showed in Algorithm 1. (4) 3.2. Hierarchical Gaussian Pessimism The above decomposition is desirable because it separates the posterior of the task parameter from context. The computation of (5) and (6) requires integrating out the hyper-parameter µ∗ and task parameter θs,∗ . This is generally impossible in a closed form, although many powerful approximations exist (Doucet et al., 2001). In this section, we leverage the conjugacy of a Gaussian hyper-prior, task priors, and reward distributions to obtain closed-form estimates of all model parameters. In this case, HierOPO can The rest of this section is organized as follows. 
We derive the mean reward estimate and its confidence interval width for a general model in Section 3.1. We instantiate these in a linear Gaussian model in Section 3.2 and then discuss the resulting algorithm in Section 3.3. Alternative algorithm designs are discussed in Section 3.4. 3 Multi-Task Off-Policy Learning from Bandit Feedback −1 task s, given by G−1 s Bs , with covariance Σ0 + Gs . The tasks with more observations affect the estimate µ̄ more, since their G−1 s approaches a zero matrix, and as a result Σ0 + G−1 s → Σ0 . This uncertainty is intrinsic, since even θs,∗ is a noisy observation of µ∗ . Algorithm 1 HierOPO: Hierarchical off-policy optimization. 1: Input: Dataset D 2: for s ∈ S, x ∈ X do 3: for a ∈ A do 4: Compute r̂s (x, a) and cs (x, a) (Section 3.1) 5: Ls (x, a) ← r̂s (x, a) − cs (x, a) 6: π̂s (x) ← arg max a∈A Ls (x, a) 7: Output: π̂ ← (π̂s )s∈S To complete our derivations, we only need to substitute (7) and (9) into (5) and (6). The posterior mean of θs,∗ is h i E [E [θs,∗ | µ∗ , Ds ] | D] = E Σ̃s (Σ−1 0 µ∗ + Bs ) D = Σ̃s (Σ−1 0 E [µ∗ | D] + Bs ) be implemented exactly and efficiently. We also analyze it under these assumptions (Section 5). = Σ̃s (Σ−1 0 µ̄ + Bs ) , where we simply combine (7) and (9). Similarly, the posterior covariance of θs,∗ requires computing h i E [cov [θs,∗ | µ∗ , Ds ] | D] = E Σ̃s D = Σ̃s , h i cov [E [θs,∗ | µ∗ , Ds ] | D] = cov Σ̃s (Σ−1 µ + B ) D ∗ s 0 h i = cov Σ̃s Σ−1 0 µ∗ D In particular, we consider a linear Gaussian model with the hyper-prior Q = N (µq , Σq ) and the task prior P (· | µ∗ ) = N (·; µ∗ , Σ0 ). The mean vector µq ∈ Rd , as well as both the covariance matrices Σq , Σ0 ∈ Rd×d , are assumed to be known. The reward distribution of action a in context x is N (ϕ(x, a)⊤ θs,∗ , σ 2 ), where ϕ is a feature extractor and σ > 0 is a known reward noise. It follows that the mean reward is linear in features. −1 = Σ̃s Σ−1 0 Σ̄Σ0 Σ̃s . To derive (5) and (6), we start with understanding posterior distributions of θs,∗ and µ∗ . Specifically, conditioning in Gaussian graphical models preserves Gaussianity, and thus θs,∗ | µ∗ , Ds ∼ N (µ̃s , Σ̃s ) for some µ̃s and Σ̃s . Using the structure of our model (Figure 1), we further note that this is a standard linear model posterior with a Gaussian prior N (µ∗ , Σ0 ), where µ̃s = E [θs,∗ | µ∗ , Ds ] = Σ̃s (Σ−1 0 µ∗ + Bs ) , −1 Σ̃s = cov [θs,∗ | µ∗ , Ds ] = (Σ−1 , 0 + Gs ) Finally, the estimated mean reward and its confidence interval width are given by r̂s (x, a) = ϕ(x, a)⊤ Σ̃s (Σ−1 0 µ̄ + Bs ) , q cs (x, a) = α ϕ(x, a)⊤ Σ̂s ϕ(x, a) , −1 where Σ̂s = Σ̃s + Σ̃s Σ−1 0 Σ̄Σ0 Σ̃s . The posterior covariance Σ̂s is tractable and has the following desirable properties. First, the hyper-parameter uncertainty only affects the second term through Σ̄. Moreover, since Σ̃s appears in both terms, Σ̂s decreases with more observations from task s. (7) and the statistics n X Bs = σ −2 1{St = s} ϕ(Xt , At )Yt , t=1 Gs = σ −2 n X 3.3. Gaussian HierOPO (8) ⊤ 1{St = s} ϕ(Xt , At )ϕ(Xt , At ) , Now we plug the estimated mean reward and its confidence interval width in (10) into HierOPO. The resulting method is a form of hierarchical regression of the hyper-parameter and task parameters. The task parameter θs,∗ is estimated using Σ̃s (Σ−1 0 µ̄ + Bs ) in (10) and the hyper-parameter µ∗ is estimated using µ̄ in (9). Since HierOPO is a variant of hierarchical regression, it is fairly general and we expect it to perform well beyond our assumptions in the algorithm design, such as when the reward noise is sub-Gaussian. 
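To make the closed-form computation concrete, the following sketch assembles the per-task statistics in (8), the task posteriors in (7), the hyper-posterior in (9), and the pessimistic estimates in (10). It is a minimal NumPy illustration rather than the authors' implementation; the logged-data format and the algebraic rewriting of (Σ_0 + G_s^{-1})^{-1} G_s^{-1} as (G_s Σ_0 + I)^{-1}, used so that tasks with singular G_s are still handled, are our assumptions.

```python
import numpy as np

def hier_opo(logs, mu_q, Sigma_q, Sigma_0, sigma, alpha, m):
    """Closed-form HierOPO estimates in the linear Gaussian model. `logs` is a list of
    (task s, feature vector of the logged action, reward); returns per-task posterior
    means, covariances Sigma_hat_s, and a function computing the LCB L_s(x, a)."""
    d = len(mu_q)
    B = np.zeros((m, d))                                   # B_s in (8)
    G = np.zeros((m, d, d))                                # G_s in (8)
    for s, phi, y in logs:
        B[s] += phi * y / sigma ** 2
        G[s] += np.outer(phi, phi) / sigma ** 2

    Sigma_0_inv, I = np.linalg.inv(Sigma_0), np.eye(d)
    Sigma_tilde = np.stack([np.linalg.inv(Sigma_0_inv + G[s]) for s in range(m)])   # (7)

    # Hyper-posterior (9). We rewrite (Sigma_0 + G_s^{-1})^{-1} as (G_s Sigma_0 + I)^{-1} G_s
    # and (Sigma_0 + G_s^{-1})^{-1} G_s^{-1} as (G_s Sigma_0 + I)^{-1}, so tasks whose G_s is
    # singular (too few observations) do not require inverting G_s.
    Sigma_bar_inv = np.linalg.inv(Sigma_q)
    rhs = np.linalg.solve(Sigma_q, mu_q)
    for s in range(m):
        M = np.linalg.inv(G[s] @ Sigma_0 + I)
        Sigma_bar_inv = Sigma_bar_inv + M @ G[s]
        rhs = rhs + M @ B[s]
    Sigma_bar = np.linalg.inv(Sigma_bar_inv)
    mu_bar = Sigma_bar @ rhs

    # Task posterior means and covariances Sigma_hat_s from Section 3.2.
    theta_hat = np.stack([Sigma_tilde[s] @ (Sigma_0_inv @ mu_bar + B[s]) for s in range(m)])
    Sigma_hat = np.stack([
        Sigma_tilde[s] + Sigma_tilde[s] @ Sigma_0_inv @ Sigma_bar @ Sigma_0_inv @ Sigma_tilde[s]
        for s in range(m)
    ])

    def lcb(s, phi):                                       # L_s(x, a) = r_hat_s - c_s in (10)
        return phi @ theta_hat[s] - alpha * np.sqrt(phi @ Sigma_hat[s] @ phi)

    return theta_hat, Sigma_hat, lcb
```

The baselines in Section 3.4 correspond to small modifications of the same computation: OracleOPO replaces mu_bar with the known µ∗ and drops the second term of Sigma_hat, while FlatOPO replaces Σ0 with Σq + Σ0 and µ∗ with µq.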
t=1 are computed using the subset Ds of the logged dataset D. The posterior of the hyper-parameter µ∗ | D, known as the hyper-posterior, also has a closed-form N (µ̄, Σ̄) (Section 4.2 of Hong et al. 2022b), where µ̄ = E [µ∗ | D]   X −1 −1 = Σ̄ Σ−1 (Σ0 + G−1 Gs Bs , q µq + s ) (9) s∈S (10) To simplify the presentation of HierOPO, we assume that X and A are finite. However, since HierOPO is based on a hierarchical regression of the task parameter in (10) and the hyper-parameter in (9), this assumption is not necessary. In fact, the mean reward estimate and its confidence interval at any feature vector ϕ(x, a) can be computed as described in (10). Our error bounds in Section 5 are also independent of  −1 X −1 Σ̄ = cov [µ∗ | D] = Σ−1 (Σ0 + G−1 . q + s ) s∈S One way of interpreting (9) is as a multivariate Gaussian posterior where each task is a single observation. The observation of task s is the least-squares estimate of θs,∗ from 4 Multi-Task Off-Policy Learning from Bandit Feedback the number of contexts or actions. Finally, when the action space cannot be enumerated, it may be hard to compute the most pessimistic action π̂s (x) = arg maxa∈A Ls (x, a) in context x. In this case, our algorithm and analysis can be extended to any maximization oracle with a fixed approximation ratio. Finally, note that optimistic methods, such as posterior sampling (Thompson, 1933; Russo et al., 2018) and BayesUCB (Kaufmann et al., 2012), cannot be used in our setting. In fact, optimism is harmful because it leads to taking highly uncertain actions whose uncertainty is not reduced, since there are no additional observations from the environment. In this case, pessimism and robustness are desired, and these are our main design principles. The computations in (10) rely on matrix inversions, whose computational cost is O(d3 ) for d features. Note that this is only needed for the exact implementation. Approximate inference, which trades off the computational cost for accuracy (Doucet et al., 2001), is possible. Any approach for Gaussian graphical models would apply in our setting. 4. Single-Task Analysis To illustrate Bayesian error bounds, we start with a classic contextual bandit parameterized by θ∗ ∈ Rd . The mean reward of action a ∈ A in context x ∈ X under parameter θ ∈ Rd is r(x, a; θ) = ϕ(x, a)⊤ θ. We assume that θ∗ ∼ N (θ0 , Σ0 ) and the reward noise is N (0, σ 2 ). This model is identical to a single task in Section 3.2. 3.4. Alternative Designs A natural question to ask is how the hierarchy helps with improving pessimistic reward estimates. To answer it, we compare HierOPO to two baselines, based on pessimistic least-squares estimators (Li et al., 2022) that do not model our structure. The first one is unrealistic because it assumes that µ∗ is known. We call it OracleOPO. Here the posterior mean reward and its confidence interval width are r̂s (x, a) = ϕ(x, a)⊤ Σ̃s (Σ−1 0 µ∗ + Bs ) , q cs (x, a) = α ϕ(x, a)⊤ Σ̃s ϕ(x, a) . n The logged dataset is D = {(Xt , At , Yt )}t=1 , the LCB is L(x, a) = r̂(x, a) − c(x, a), and the pessimistic policy is π̂(x) = arg max a∈A L(x, a). Following the same reasoning as in the derivation of (7), the estimated mean reward and its confidence interval width are r̂(x, a) = ϕ(x, a)⊤ Σ̂(Σ−1 0 θ0 + B) , q c(x, a) = α ϕ(x, a)⊤ Σ̂ϕ(x, a) , (11) This is an improvement of (10) in two aspects. First, the estimate µ̄ of µ∗ is replaced with the actual µ∗ . 
Second, the confidence interval width is provably narrower because where B = σ −2 n X ϕ(Xt , At )Yt , t=1 −1 Σ̂ = (Σ−1 , 0 + G) n X G = σ −2 ϕ(Xt , At )ϕ(Xt , At )⊤ . t=1 −1 Σ̃s ⪯ Σ̃s + Σ̃s Σ−1 0 Σ̄Σ0 Σ̃s . The second method does not model the hyper-parameter µ∗ . Instead, its uncertainty is incorporated into that of modeled θs,∗ . This is achieved by replacing Σ0 in (11) with Σq + Σ0 , and µ∗ with µq . We call the method FlatOPO. Its posterior mean reward and confidence interval width are (12) As in Section 2, the value of policy π ∈ Π under model parameter θ∗ is V (π; θ∗ ) = E [r(X, π(X); θ∗ )]. The optimal policy is π∗ = arg max π∈Π V (π; θ∗ ). The quality of policy π̂ is measured by an (ε, δ) bound r̂s (x, a) = ϕ(x, a)⊤ Σ̇s ((Σq + Σ0 )−1 µq + Bs ) , q cs (x, a) = α ϕ(x, a)⊤ Σ̇s ϕ(x, a) , P (V (π̂; θ∗ ) ≥ V (π∗ ; θ∗ ) − ε | D) ≥ 1 − δ . (13) A better policy has a lower ε > 0 at a fixed δ > 0. Now we are ready to proceed with the analysis. where Σ̇s = ((Σq + Σ0 )−1 + Gs )−1 . 4.1. Bayesian Error Bound In comparison to (10), this method is worse in two aspects. First, the prior mean µq of µ∗ is used instead of its estimate µ̄. Second, as the number of tasks m increases, we expect λ1 (Σ̄) → 0 and then We start with assumptions. First, we assume that the length of feature vectors is bounded. Assumption 1. For any context x ∈ X and action a ∈ A, the feature vector satisfies ∥ϕ(x, a)∥2 ≤ 1. −1 Σ̇s ⪰ Σ̃s Σ−1 0 Σ̄Σ0 Σ̃s + Σ̃s . This assumption is standard and without loss of generality. Second, we assume that the dataset D is “well-explored” (Swaminathan et al., 2017; Jin et al., 2021). Therefore, our approach should be more statistically efficient, which we prove formally in Section 5. 5 Multi-Task Off-Policy Learning from Bandit Feedback Assumption 2. Take G in (12) and let   G∗ = E ϕ(X, π∗ (X))ϕ(X, π∗ (X))⊤ θ∗ . where Σ0 should be viewed as a regularization parameter instead of the prior covariance. Before we discuss α, we discuss two key differences in how the bounds are stated. First, the frequentist bound holds for any model parameter θ∗ such that ∥θ∗ ∥2 ≤ κ. Therefore, it is arguably stronger than the Bayesian bound in Theorem 5. This is analogous to differences in Bayesian and frequentist cumulative regret bounds (Russo & Van Roy, 2014). Second, because both bounds depend on γ in Assumption 2, which depends on θ∗ , we make an assumption that π0 is uniform. In this case, γ = Ω(1/d) for any θ∗ . Therefore, γ has no impact on the next discussion and we may treat it as a constant. Then there exists γ > 0 such that G ⪰ γσ −2 nG∗ holds for any θ∗ . Assumption 2 relates the logging policy π0 , which induces G, to the optimal policy π∗ , which induces σ −2 nG∗ . It can be loosely interpreted as follows. As n increases,   G → σ −2 n E ϕ(X, π0 (X))ϕ(X, π0 (X))⊤ θ∗ , and γ becomes the maximum ratio between probabilities of taking actions by π∗ and π0 in any direction. Therefore, the assumption measures the degree of overlap between π∗ and π0 . It also relates a finite-sample G to the expected G∗ . Therefore, we do not need to reason about a finite-sample behavior of G in our analysis. Under the above assumptions, the only major difference in −1 the bounds is the term κλd 2 (Σ0 ) in (14). This term can have a major effect. For instance, suppose that Σ0 = Id /n in (12). From a Bayesian viewpoint, p this corresponds to a very informative prior with width 1/n, and the Bayesian 1 bound in Theorem 1 is Õ(dn− 2 ). 
From a frequentist point of view, this amounts to O(n) regularization, and the fre1 1 quentist bound becomes Õ(dn− 2 + d 2 ). As n → ∞, we get that the Bayesian bound can be arbitrarily better. Note that Assumption 2 holds for γ = 0. Unfortunately, this setting would negate the desired scaling with sample size n in our error bounds and is impractical. The higher the value of γ, the more π∗ and π0 are similar. When π0 is uniform, we obtain γ = Ω(1/d) for large n. Now we state our main claim for the single-task setting. Theorem 1. Fix dataset D and choose any γ > 0 such that Assumption 2 holds. Let π̂(x) = arg max a∈A L(x, a). Then for any δ ∈ (0, e−1 ], the error in (13) is s p 4d ε = 5d log(1/δ) . {z } λd (Σ−1 | ) + γσ −2 n 0 5. Multi-Task Analysis Now we study our multi-task setting, where the estimated mean reward and its confidence interval width are defined in (10). Similarly to Section 4, our analysis is Bayesian. We derive an error bound for a single task and discuss how to extend it to other bounds, such as for all tasks, later. α 5.1. Bayesian Error Bound Proof. The claim is proved in Appendix A in three steps. First, we establish that c(x, a)p is a high-probability confidence interval width for α = 5d log(1/δ). Second, we show that the suboptimality of policy π̂ can be bounded by E [c(X, π∗ (X))] for any parameter θ∗ . Third, we combine c(x, a) with Assumption 2, and relate the logging policy π0 that induces c(x, a) with the optimal policy π∗ . To derive the bound in (2), we make similar assumptions to Section 4. First, we assume that the length of feature vectors is bounded (Assumption 1). Second, we assume that the dataset D is “well-explored” for all tasks. Assumption 3. Take Gs in (8) and let ns be the number of interactions with task s. Let 4.2. Frequentist Error Bound   Gs,∗ = E ϕ(X, πs,∗ (X))ϕ(X, πs,∗ (X))⊤ θs,∗ . To understand the benefit of a Bayesian analysis, we compare Theorem 1 to a frequentist bound. The main difference in the frequentist bound is that θ∗ is fixed. Thus a natural counterpart of (13) is Then there exists γ > 0 such that Gs ⪰ γσ −2 ns Gs,∗ holds for any θs,∗ in any task s ∈ S. V (π̂; θ∗ ) ≥ V (π∗ ; θ∗ ) − ε , This is essentially Assumption 2 applied to all tasks. For a uniform logging policy, γ = Ω(1/d) when ns is large for all tasks s ∈ S. So the assumption is not very strong. Our main technical result is presented below. which holds with probability at least 1 − δ for an unknown but fixed θ∗ . Under the assumptions that ∥θ∗ ∥2 ≤ κ, and that (Yt )nt=1 are independent σ 2 -sub-Gaussian rewards, we get a similar bound to Theorem 1, which is stated in Theorem 5 (Appendix B). The main difference is that p −1 α = 2 2d(log(1/δ) + b) + κλd 2 (Σ0 ) , (14) Theorem 2. Fix dataset D and choose any γ > 0 such that Assumption 3pholds. Take policy π̂ computed by HierOPO and let α = 5d log(1/δ). Then for any δ ∈ (0, e−1 ], the 6 Multi-Task Off-Policy Learning from Bandit Feedback error in (2) is s ε= α | v u u αt To show that our bound captures the problem structure, we compare it to two baselines in Section 3.4: OracleOPO and FlatOPO. OracleOPO knows µ∗ and has more information than HierOPO. Its error can be bounded by Theorem 1 and is always lower than that of HierOPO, because the bound corresponds to the first term in Theorem 2. On the other hand, FlatOPO solves each task independently. Its error can be bounded using Theorem 1 where the task covariance Σ0 is replaced by Σq + Σ0 , to account for the additional uncertainty due to unknown µ∗ . 
The resulting error bound becomes s 4d α λd ((Σq + Σ0 )−1 ) + γσ −2 ns 4d + −2 n λd (Σ−1 s 0 ) + γσ {z } Task term 4d P λd (Σ−1 q )+ z∈S | 1 −1 λ1 (Σ0 )+γ −1 σ 2 λ1 (G−1 z,∗ )nz {z Hyper-parameter term . } Moreover, suppose that ϕ(x, a) has at most one non-zero entry for any x ∈ X and a ∈ A, and that both Σq and Σ0 are diagonal. Then λ1 (G−1 z,∗ ) ≥ 1. and is always higher than the task term in Theorem 2. As the number of tasks increases, the hyper-parameter term in Theorem 2 goes to zero, and the error bound of HierOPO would be provably lower. Proof. The claim is proved in Appendix C, following the same three steps as in the proof of Theorem 1. The main difference is in the definitions of c(x, a) and policies, and that Assumption 3 is used instead of Assumption 2. This shows the generality of our proof technique and indicates that it may apply to other graphical models. Finally, we would like to comment on the second claim in Theorem 2, which results in a tighter bound. This claim is proved under additional assumptions that are satisfied by a multi-armed bandit, for instance. 5.2. Discussion 5.3. Extensions Our error bound is Bayesian and proved for a distribution of the task parameter θs,∗ | D. The bound has two terms. The former captures the error in estimating the task parameter θs,∗ if the hyper-parameter µ∗ was known. It is similar to Theorem 1 and hence we call it the task term. The latter term captures the error in estimating µ∗ and hence we call it the hyper-parameter term. We discuss each term next. The error bound in Theorem 2 is derived for a fixed task s ∈ S. This decision was taken deliberately because other bounds can be easily derived from it. For instance, to get a bound for all tasks, we only need a union bound over all θs,∗ | D. As a result, Theorem 2 holds for all s ∈ S with probability at least 1 − mδ. The task term depends on p all quantities of interest in an expected manner. It is O(d log(1/δ)/ns ), where d is the number of features, ns is the number of observations, and δ is the probability that the bound fails. This dependence is standard in confidence intervals for linear models with an infinite number of contexts (Abbasi-Yadkori et al., 2011; Agrawal & Goyal, 2013; Abeille & Lazaric, 2017). Since λd (Σ−1 0 ) can be viewed as the minimum number of prior pseudo-observations in any direction in Rd , the task term decreases with a more informative prior. Finally, the task term decreases when the observation noise σ decreases, and the similarity of the logging and optimal policies γ increases (Assumption 3). The bound in Theorem 2 also holds for a new task sampled from the task prior. This is because the hyper-parameter estimation in (9), which affects the hyper-parameter term in Theorem 2, separates all tasks from the evaluated one. 6. Related Work Off-policy optimization. In off-policy optimization, data collected by a deployed policy are used to learn improved policies offline (Li et al., 2010), without a direct interaction with the environment. Off-policy estimation and optimization can be model-free or model-based. Many modelfree methods are based on inverse propensity scores (IPS) (Horvitz & Thompson, 1952; Ionides, 2008; Strehl et al., 2010; Swaminathan & Joachims, 2015). These methods have a low bias and high variance, unless corrected. Modelbased methods estimate a reward model for context-action pairs, which is then used to find an optimal policy (Bottou et al., 2013; Dudik et al., 2014). These approaches tend to have a high bias and low variance. 
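To make the contrast concrete, the sketch below shows the two classical off-policy value estimators in their simplest form. It is illustrative only: IPS assumes known logging propensities, which our setting does not require, and the interfaces (a logged tuple of action features, taken action, and reward; a propensity function; a learned reward model) are hypothetical.

```python
import numpy as np

def ips_value(logs, target_policy, logging_prob):
    """Model-free IPS estimate of a policy's value: low bias, high variance."""
    return np.mean([float(target_policy(phi) == a) / logging_prob(phi, a) * y
                    for phi, a, y in logs])

def direct_method_value(logs, target_policy, reward_model):
    """Model-based (direct method) estimate: plug a learned reward model into the policy."""
    return np.mean([reward_model(phi, target_policy(phi)) for phi, a, y in logs])
```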
Doubly-robust estimation (Robins et al., 1994; Dudik et al., 2014) is frequently used to combine model-based and model-free methods. We take a model-based approach in this work. The hyper-parameter term mimics the scaling of the task term at the hyper-parameter level. In particular, the minimum number of prior pseudo-observations in any direction in R^d becomes λ_d(Σ_q^{-1}) and each task becomes an observation, which is reflected by the sum over all tasks z. The hyper-parameter term decreases as the number of observations n_z in any task z increases, the maximum width of the task prior √λ_1(Σ_0) decreases, the reward noise σ decreases, and the similarity of the logging and optimal policies γ increases. Figure 2. Evaluation of off-policy algorithms on the synthetic multi-task bandit problem (panel titles: Linear Bandit with σ_q = 0.5 and σ_q = 1.0, m = 10, d = 5, K = 10 in the left and middle plots, and σ_q = 1.0, n = 500, d = 5, K = 10 in the right plot; x-axes: dataset size n and number of tasks m; y-axis: suboptimality). In the left and middle plots, we vary the dataset size n for small σ_q = 0.5 and large σ_q = 1.0, respectively. In the right plot, we vary the number of tasks m. Offline reinforcement learning. Pessimism has been studied extensively in offline reinforcement learning (Buckman et al., 2021; Jin et al., 2021). Specifically, Jin et al. (2021) showed that pessimistic value iteration is minimax optimal in linear Markov decision processes (MDPs). Multi-task offline reinforcement learning was also studied by Lazaric & Ghavamzadeh (2010). That work applied expectation-maximization to solve the problem but did not prove any error bounds. In comparison, we consider a simpler setting of contextual bandits and prove error bounds that show improvements due to using the multi-task structure. These are the first error bounds of their kind. 7.1. Synthetic Problem Our first experiment is with a synthetic multi-task bandit, with d = 5 features and K = 10 actions. For each action a ∈ A and interaction t ∈ [n], we sample a feature vector uniformly at random from [−0.5, 0.5]^d. The hierarchy is defined as follows. The hyper-prior is N(0_d, Σ_q), where Σ_q = σ_q^2 I_d is its covariance. The task covariance is Σ_0 = σ_0^2 I_d. We experiment with σ_q ∈ {0.5, 1} and σ_0 = 0.5. We expect the hierarchy to be more beneficial when σ_q > σ_0, since the uncertainty of the hyper-parameter is higher and it is more valuable to learn it. The reward distribution of task s is N(ϕ(x, a)^⊤ θ_{s,∗}, σ^2) with noise σ = 0.5. Online learning. Off-policy methods learn from data collected by another policy. In contrast, online methods learn from data that they collect, and need to balance exploration and exploitation to attain low regret in the long term. Two popular exploration techniques are upper confidence bounds (UCBs) (Auer et al., 2002) and posterior sampling (Thompson, 1933), and both have been studied extensively in linear models (Dani et al., 2008; Abbasi-Yadkori et al., 2011; Chu et al., 2011; Agrawal & Goyal, 2013).
Bandit algorithms for hierarchical models have also been studied extensively (Bastani et al., 2019; Kveton et al., 2021; Basu et al., 2021; Simchowitz et al., 2021; Wan et al., 2021; Hong et al., 2022b; Peleg et al., 2022; Wan et al., 2022). Perhaps surprisingly, all of these are based on posterior sampling. Our marginal posterior derivations in (10) can be used to derive UCBs for this setting. Specifically, U_s(x, a) = r̂_s(x, a) + c_s(x, a) is an upper confidence bound on the mean reward of action a in context x and task s. Our results are averaged over multiple runs. At the beginning of each run, the hyper-parameter µ∗ is sampled from the hyper-prior N(0_d, Σ_q). After that, each task parameter is sampled i.i.d. as θ_{s,∗} ∼ N(µ∗, Σ_0). The logged dataset D is generated as follows. In each interaction t ∈ [n], we randomly select a task, take a random action, and record its stochastic reward. The learned policies π̂_s are evaluated on the same problem that generated D. The evaluation criterion is the suboptimality of the learned policies averaged over the tasks, which we define as (1/m) Σ_{s∈S} (V(π_{s,∗}; θ_{s,∗}) − V(π̂_s; θ_{s,∗})). The policy values and the optimal policy are estimated on another logged dataset of size 10 000. In our experiments, we vary either the logged dataset size n or the number of tasks m, while keeping the other fixed. The default settings are m = 10 tasks and logged dataset size n = 500. 7. Experiments In Figure 2, we show the mean and standard error of the suboptimality of each algorithm averaged over 30 random runs. As expected, HierOPO outperforms FlatOPO and is close to OracleOPO. The improvement is higher when the hyper-parameter uncertainty σ_q is higher. The difference between HierOPO and FlatOPO is most noticeable in the limited-data regime, where n is small or m is large. In both cases, the number of observations per task is small. In this section, we empirically compare HierOPO to the baselines OracleOPO and FlatOPO (Section 3.4). All methods are implemented as described in Section 3. We set α = 0.1, which led to good performance in our initial experiments. The goal of our experiments is to show that hierarchy can greatly improve the statistical efficiency of off-policy algorithms. We include an additional experiment in Appendix D, where HierOPO is applied to a challenging image classification task with deep neural networks. Note that the hierarchy in this experiment is estimated from data. Therefore, it is misspecified, yet hugely beneficial. 8. Conclusions In this work, we propose hierarchical off-policy optimization (HierOPO), a general off-policy algorithm for solving similar contextual bandit tasks related through a hierarchy. Our algorithm leverages the hierarchical structure to learn tighter, and hence more sample-efficient, lower confidence bounds on the mean rewards of actions, and acts according to them. We derive Bayesian error bounds for our policies, which become tighter with a more informative prior and demonstrate the benefit of hierarchies. Finally, we empirically validate the effectiveness of hierarchies on synthetic and real-world problems. Figure 3. Evaluation of off-policy algorithms on a multi-user recommendation problem in Section 7.2 (MovieLens bandit with m = 100, d = 10, K = 30; x-axis: dataset size n; y-axis: suboptimality). 7.2. Multi-User Recommendation Now we consider a multi-user recommendation problem.
The problem is simulated using the MovieLens 1M dataset (Lam & Herlocker, 2016), with 1 million ratings for 3 883 movies from 6 040 users. As a first step, we complete the sparse rating matrix M using alternating least squares (Davenport & Romberg, 2016) with rank d = 10. This rank is sufficiently high to have a low prediction error, but also low enough to prevent overfitting. The learned factorization is M = U V ⊤ . The i-th row of U , denoted by Ui , represents user i. The j-th row of V , denoted by Vj , represents movie j. The reward for recommending movie j to user i is sampled from N (Vj⊤ Ui , σ 2 ). The reward noise σ = 0.759 is estimated from data. The feature vectors in each interaction are latent factors Vj of 30 randomly chosen movies. Our work is the first to propose a practical and analyzable algorithm for off-policy learning with hierarchical Bayesian models. As a result, there are many potential directions for future work. First, some applications require more complex graphical models than a two-level hierarchy with a single hyper-parameter (Hong et al., 2022a; Aouali et al., 2023). We believe that our methodology directly extends to these. Second, we believe that HierOPO and its analysis can be extended to reinforcement learning (Lazaric & Ghavamzadeh, 2010). Third, HierOPO is a model-based approach to offpolicy learning (Section 6). Since model-based approaches tend to be biased, due to using a potentially misspecified model, it is important to develop model-free methods for multi-task off-policy learning. To simulate similar users, we cluster user latent factors. In particular, we apply a Gaussian mixture model (GMM) with k = 7 clusters to rows of U . We choose the smallest value of k that yields well-separated clusters (Bishop, 2006). The hyper-prior parameters µq and Σq are set to the mean and covariance of all cluster centers, respectively. The cluster with most users represents tasks. We set µ∗ and Σ0 to its center and covariance estimated by the GMM, respectively. Thus all users in the cluster are related through µ∗ and Σ0 . We want to stress that the GMM is only used to estimate the hyper-parameters of the hierarchical model. The task parameters are user latent factors Ui . This is to ensure that our setup is as realistic as possible. Finally, our theory shows benefits of a Bayesian analysis over a frequentist one (Section 4), and also benefits of hierarchies (Section 5). We are not aware of matching lower bounds for our setting and these are not common in offline learning, unlike in bandits (Lattimore & Szepesvari, 2019). This leaves open the possibility that our bounds are loose. We believe that this is highly unlikely, since the bounds are derived using exact Gaussian posteriors. References Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, pp. 2312– 2320, 2011. The number of tasks is m = 100 and they are chosen randomly in each run. The dataset D is logged as follows. In each interaction t ∈ [n], we randomly select a task, take a random action, and record its stochastic reward. The evaluation criteria are the same as in Section 7.1. Abeille, M. and Lazaric, A. Linear Thompson sampling revisited. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017. In Figure 3, we report the mean and standard error of the suboptimality of all algorithms averaged over 10 random runs. 
For all dataset sizes n, HierOPO performs very well: its suboptimality is close to that of OracleOPO and significantly lower than that of FlatOPO. This clearly shows the benefit of hierarchies for efficient off-policy learning. Also Agrawal, S. and Goyal, N. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceeding of the 25th Annual Conference on Learning Theory, pp. 39.1– 39.26, 2012. 9 Multi-Task Off-Policy Learning from Bandit Feedback Agrawal, S. and Goyal, N. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, pp. 127–135, 2013. Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 208–214, 2011. Aouali, I., Kveton, B., and Katariya, S. Mixed-effect Thompson sampling. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, 2023. Dani, V., Hayes, T., and Kakade, S. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory, pp. 355– 366, 2008. Atsidakou, A., Katariya, S., Sanghavi, S., and Kveton, B. Bayesian fixed-budget best-arm identification. CoRR, abs/2211.08572, 2022. URL https://arxiv.org/ abs/2211.08572. Davenport, M. and Romberg, J. An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4): 608–622, 2016. Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002. Deshmukh, A. A., Dogan, U., and Scott, C. Multi-task learning for contextual bandits. In Advances in Neural Information Processing Systems 30, pp. 4848–4856, 2017. Azar, M. G., Lazaric, A., and Brunskill, E. Sequential transfer in multi-armed bandit with finite set of models. In Advances in Neural Information Processing Systems 26, pp. 2220–2228, 2013. Doucet, A., de Freitas, N., and Gordon, N. Sequential Monte Carlo Methods in Practice. Springer, New York, NY, 2001. Dudik, M., Erhan, D., Langford, J., and Li, L. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014. Bastani, H., Simchi-Levi, D., and Zhu, R. Meta dynamic pricing: Transfer learning across experiments. CoRR, abs/1902.10918, 2019. URL https://arxiv.org/ abs/1902.10918. Garivier, A. and Cappe, O. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceeding of the 24th Annual Conference on Learning Theory, pp. 359–376, 2011. Basu, S., Kveton, B., Zaheer, M., and Szepesvari, C. No regrets for learning the prior in bandits. In Advances in Neural Information Processing Systems 34, 2021. Bishop, C. Pattern Recognition and Machine Learning. Springer, New York, NY, 2006. Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., and Rubin, D. Bayesian Data Analysis. Chapman & Hall, 2013. Bottou, L., Peters, J., Quinonero-Candela, J., Charles, D., Chickering, M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14(101):3207–3260, 2013. Hong, J., Kveton, B., Katariya, S., Zaheer, M., and Ghavamzadeh, M. Deep hierarchy in bandits. In Proceedings of the 39th International Conference on Machine Learning, 2022a. Boucheron, S., Lugosi, G., and Massart, P. 
Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. Hong, J., Kveton, B., Zaheer, M., and Ghavamzadeh, M. Hierarchical Bayesian bandits. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, 2022b. Buckman, J., Gelada, C., and Bellemare, M. The importance of pessimism in fixed-dataset policy optimization. In Proceedings of the 9th International Conference on Learning Representations, 2021. Horvitz, D. G. and Thompson, D. J. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952. Cella, L., Lazaric, A., and Pontil, M. Meta-learning with stochastic linear bandits. In Proceedings of the 37th International Conference on Machine Learning, 2020. Ionides, E. Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295–311, 2008. Chapelle, O. and Li, L. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, pp. 2249–2257, 2012. Jin, Y., Yang, Z., and Wang, Z. Is pessimism provably efficient for offline RL? In Proceedings of the 38th International Conference on Machine Learning, 2021. 10 Multi-Task Off-Policy Learning from Bandit Feedback Kaufmann, E., Cappe, O., and Garivier, A. On Bayesian upper confidence bounds for bandit problems. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pp. 592–600, 2012. Peleg, A., Pearl, N., and Meirr, R. Metalearning linear bandits by prior update. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, 2022. Kveton, B., Konobeev, M., Zaheer, M., Hsu, C.-W., Mladenov, M., Boutilier, C., and Szepesvari, C. MetaThompson sampling. In Proceedings of the 38th International Conference on Machine Learning, 2021. Robins, J., Rotnitzky, A., and Zhao, L. P. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994. Lake, B., Salakhutdinov, R., and Tenenbaum, J. Humanlevel concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015. Russo, D. and Van Roy, B. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39 (4):1221–1243, 2014. Lam, S. and Herlocker, J. MovieLens Dataset. http://grouplens.org/datasets/movielens/, 2016. Russo, D., Van Roy, B., Kazerouni, A., Osband, I., and Wen, Z. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018. Lattimore, T. and Szepesvari, C. Bandit Algorithms. Cambridge University Press, 2019. Simchowitz, M., Tosh, C., Krishnamurthy, A., Hsu, D., Lykouris, T., Dudik, M., and Schapire, R. Bayesian decisionmaking under misspecified priors with applications to meta-learning. In Advances in Neural Information Processing Systems 34, 2021. Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000. Strehl, A., Langford, J., Li, L., and Kakade, S. Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems 23, 2010. Lazaric, A. and Ghavamzadeh, M. Bayesian multi-task reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning, 2010. Li, G., Ma, C., and Srebro, N. Pessimism for offline linear contextual bandits using ℓp confidence sets. 
In Advances in Neural Information Processing Systems 35, 2022. Swaminathan, A. and Joachims, T. Counterfactual risk minimization: Learning from logged bandit feedback. In Proceedings of the 32nd International Conference on Machine Learning, pp. 814–823, 2015. Li, L., Chu, W., Langford, J., and Schapire, R. A contextualbandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, 2010. Swaminathan, A., Krishnamurthy, A., Agarwal, A., Dudik, M., Langford, J., Jose, D., and Zitouni, I. Off-policy evaluation for slate recommendation. In Advances in Neural Information Processing Systems 30, 2017. Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933. Li, S., Abbasi-Yadkori, Y., Kveton, B., Muthukrishnan, S., Vinay, V., and Wen, Z. Offline evaluation of ranking policies with click models. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1685–1694, 2018. Wan, R., Ge, L., and Song, R. Metadata-based multi-task bandits with Bayesian hierarchical models. In Advances in Neural Information Processing Systems 34, 2021. Lindley, D. and Smith, A. Bayes estimates for the linear model. Journal of the Royal Statistical Society: Series B (Methodological), 34(1):1–18, 1972. Wan, R., Ge, L., and Song, R. Towards scalable and robust structured bandits: A meta-learning framework. CoRR, abs/2202.13227, 2022. URL https://arxiv.org/ abs/2202.13227. Lu, X. and Van Roy, B. Information-theoretic confidence bounds for reinforcement learning. In Advances in Neural Information Processing Systems 32, 2019. Moradipari, A., Turan, B., Abbasi-Yadkori, Y., Alizadeh, M., and Ghavamzadeh, M. Parameter and feature selection in stochastic linear bandits. In Proceedings of the 39th International Conference on Machine Learning, 2022. 11 Multi-Task Off-Policy Learning from Bandit Feedback A. Proof of Theorem 1 The proof is under the assumption that the logged dataset D is fixed and the model parameter θ∗ is random. Specifically, since we conduct a Bayesian analysis, we condition on all available observations and have θ∗ | D ∼ N (θ̂, Σ̂). We start with the concentration of the model parameter. To simplify notation, we define r(x, a) = r(x, a; θ∗ ). Lemma 3. Let E = {∀x ∈ X , a ∈ A : |r(x, a) − r̂(x, a)| ≤ c(x, a)} be the event that all high-probability confidence intervals hold. Then P (E | D) ≥ 1 − δ. Proof. We start with the Cauchy–Schwarz inequality, 1 1 r(x, a) − r̂(x, a) = ϕ(x, a)(θ∗ − θ̂) = ϕ(x, a)Σ̂ 2 Σ̂− 2 (θ∗ − θ̂) ≤ ∥ϕ(x, a)∥Σ̂ ∥θ∗ − θ̂∥Σ̂−1 . p To prove our claim, we show that ∥θ∗ − θ̂∥Σ̂−1 ≤ 5d log(1/δ) holds conditioned on D with probability at least 1 − δ. 1 The proof uses that θ∗ − θ̂ | D ∼ N (0d , Σ̂). Because of that, Σ̂− 2 (θ∗ − θ̂) | D is a d-dimensional vector of independent standard normal variables. Thus (θ∗ − θ̂)⊤ Σ̂−1 (θ∗ − θ̂) | D is a chi-squared random variable with d degrees of freedom. Then, by Lemma 1 of Laurent & Massart (2000),     P ∥θ∗ − θ̂∥Σ̂−1 ≥ α D = P (θ∗ − θ̂)⊤ Σ̂−1 (θ∗ − θ̂) ≥ 5d log(1/δ) D   p ≤ P (θ∗ − θ̂)⊤ Σ̂−1 (θ∗ − θ̂) ≥ 2 d log(1/δ) + 2 log(1/δ) + d D   p = P (θ∗ − θ̂)⊤ Σ̂−1 (θ∗ − θ̂) − d ≥ 2 d log(1/δ) + 2 log(1/δ) D ≤ δ . The first inequality holds under the assumption that δ < (0, e−1 ], which implies log(1/δ) ≥ 1 and log(1/δ) ≥ This completes our proof. p log(1/δ). 
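As a quick numerical sanity check (not part of the proof), the chi-squared tail bound that drives Lemma 3 can be verified by simulation; the dimension, δ, and sample size below are arbitrary illustrative choices.

```python
import numpy as np

# Empirical check of the tail bound behind Lemma 3: Z = (theta_* - theta_hat)^T
# Sigma_hat^{-1} (theta_* - theta_hat) | D is chi-squared with d degrees of freedom,
# and P(Z >= 5 d log(1/delta)) <= delta whenever delta <= 1/e.
rng = np.random.default_rng(0)
d, delta, n_mc = 5, 0.05, 200_000
z = rng.chisquare(d, size=n_mc)
print(np.mean(z >= 5 * d * np.log(1 / delta)), "should be at most", delta)
```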
We use Lemma 3 to bound the suboptimality of π̂ in any context by the confidence interval width induced by π∗ . Lemma 4. Let π̂(x) = arg max a∈A L(x, a). Then on event E (Lemma 3), r(x, π∗ (x)) − r(x, π̂(x)) ≤ 2c(x, π∗ (x)) holds jointly for all contexts x ∈ X . Proof. For any context x ∈ X , we can decompose r(x, π∗ (x)) − r(x, π̂(x)) as r(x, π∗ (x)) − r(x, π̂(x)) = r(x, π∗ (x)) − L(x, π̂(x)) + L(x, π̂(x)) − r(x, π̂(x)) ≤ [r(x, π∗ (x)) − L(x, π∗ (x))] + [L(x, π̂(x)) − r(x, π̂(x))] . Now we bound each term separately. On event E, we have r(x, π∗ (x)) − r̂(x, π∗ (x)) ≤ c(x, π∗ (x)) and thus r(x, π∗ (x)) − L(x, π∗ (x)) = r(x, π∗ (x)) − r̂(x, π∗ (x)) + c(x, π∗ (x)) ≤ 2c(x, π∗ (x)) . Again, on event E, we have r̂(x, π̂(x)) − r(x, π̂(x)) ≤ c(x, π̂(x)) and thus L(x, π̂(x)) − r(x, π̂(x)) = r̂(x, π̂(x)) − r(x, π̂(x)) − c(x, π̂(x)) ≤ 0 . Finally, we combine the above two inequalities and get r(x, π∗ (x)) − r(x, π̂(x)) ≤ 2c(x, π∗ (x)) . This completes the proof. 12 Multi-Task Off-Policy Learning from Bandit Feedback In the rest of the analysis, we fix θ∗ and the only randomness is due to X ∼ Px . On event E in Lemma 3, we have V (π∗ ; θ∗ ) − V (π̂; θ∗ ) = E [r(X, π∗ (X)) − r(X, π̂(X))] ≤ 2E [c(X, π∗ (X))] q  p ⊤ = 2 5d log(1/δ) E ϕ(X, π∗ (X)) Σ̂ϕ(X, π∗ (X)) r h i p ≤ 2 5d log(1/δ) E ϕ(X, π∗ (X))⊤ Σ̂ϕ(X, π∗ (X)) . (15) The first inequality is by Lemma 4. The second inequality follows from the concavity of the square root. −2 The last step is an upper bound on the expected confidence interval width. Let Γ = Σ−1 nG∗ . By Assumption 2, 0 + γσ −1 −1 Σ̂ ⪰ Γ and thus Σ̂ ⪯ Γ . So, for any policy π∗ , we have h i   E ϕ(X, π∗ (X))⊤ Σ̂ϕ(X, π∗ (X)) ≤ E ϕ(X, π∗ (X))⊤ Γ−1 ϕ(X, π∗ (X)) h i 1 1 = E tr(Γ− 2 ϕ(X, π∗ (X))ϕ(X, π∗ (X))⊤ Γ− 2 ) 1 1 = tr(Γ− 2 G∗ Γ− 2 ) −1 −2 = tr(G∗ Γ−1 ) = tr((Σ−1 nId )−1 ) 0 G∗ + γσ d ≤ . −1 −1 λd (Σ0 G∗ + γσ −2 nId ) The first inequality follows from Assumption 2. The first equality holds because v ⊤ v = tr(vv ⊤ ) for any v ∈ Rd . The next three equalities use that the expectation of the trace is the trace of the expectation, the cyclic property of the trace, and the definition of matrix inverse. The last inequality follows from tr(A−1 ) ≤ dλ1 (A−1 ) = dλ−1 d (A), which holds for any PSD matrix A ∈ Rd×d . Now we apply basic eigenvalue identities and inequalities, and get −1 −2 −1 −2 λd (Σ−1 nId ) = λd (Σ−1 n = λd ((G∗ Σ0 )−1 ) + γσ −2 n = 0 G∗ + γσ 0 G∗ ) + γσ ≥ 1 + γσ −2 n λ1 (G∗ Σ0 ) 1 1 −2 + γσ −2 n ≥ + γσ −2 n = λd (Σ−1 n. 0 ) + γσ λ1 (G∗ )λ1 (Σ0 ) λ1 (Σ0 ) The last inequality uses λ1 (G∗ ) ≤ 1, which follows from Assumption 1. To finalize the proof, we chain all inequalities starting from (15) and get that s p 4d V (π∗ ; θ∗ ) − V (π̂; θ∗ ) ≤ 5d log(1/δ) −1 λd (Σ0 ) + γσ −2 n holds on event E, which occurs with probability at least 1 − δ for θ∗ | D ∼ N (θ̂, Σ̂). This completes the proof. B. Frequentist Single-Task Analysis In this section, we derive a frequentist counterpart to the bound in Theorem 1. Theorem 5. Fix dataset D and let that the rewards be drawn independently as Yt − ϕ(Xt , At )⊤ θ∗ ∼ SubG(σ 2 ) for some σ > 0. Let π̂(x) = arg max a∈A L(x, a). Then for any θ∗ ∈ Θ such that ∥θ∗ ∥2 ≤ κ holds, any γ such that Assumption 2 holds, and any δ ∈ (0, 1), s p  4d − 21 V (π∗ ; θ∗ ) − V (π̂; θ∗ ) ≤ 2 log(2 |Φ| /δ) + κλd (Σ0 ) −1 λd (Σ0 ) + γσ −2 n holds with probability at least 1 − δ, where Φ ⊆ Rd is the set of all feature vectors. 13 Multi-Task Off-Policy Learning from Bandit Feedback The above result compares to the Bayesian error bound in Theorem 1 as follows. 
Under the assumption that Φ is an ε-grid −1 over [0, 1]d , we get log(2 |Φ| /δ) = O(d log(1/εδ)) and the main difference in the bounds is κλd 2 (Σ0 ) in Theorem 5. To prove Theorem 1, we start with the concentration of the model parameter and define r(x, a) = r(x, a; θ∗ ), similarly to Appendix A. We also use ϕt = ϕ(Xt , At ). Lemma 6. Let E = {∀x ∈ X , a ∈ A : |r(x, a) − r̂(x, a)| ≤ c(x, a)} be the event that all high-probability confidence intervals hold, where q −1 c(x, a) = 2 log(2 |Φ| /δ) + κλd 2 (Σ0 )∥ϕ(x, a)∥Σ̂ . Then P (E) ≥ 1 − δ. Proof. We start with the observation that the regularized least-squares estimate of θ∗ can be expressed as −1 θ̂ = σ −2 (Σ−1 0 + G) −1 = σ −2 (Σ−1 0 + G) n X t=1 n X ϕt Yt −1 −1 −1 −1 −1 ϕt (Yt − ϕ⊤ (Σ−1 Σ0 θ ∗ t θ∗ ) + (Σ0 + G) 0 + G)θ∗ − (Σ0 + G) t=1 −1 = σ −2 (Σ−1 0 + G) n X −1 −1 −1 ϕt (Yt − ϕ⊤ Σ0 θ ∗ . t θ∗ ) + θ∗ − (Σ0 + G) t=1 Therefore, for any ϕ ∈ Φ, ϕ⊤ (θ̂ − θ∗ ) = σ −2 n X −1 −1 ⊤ −1 −1 ϕ⊤ (Σ−1 ϕt (Yt − ϕ⊤ Σ0 θ∗ . t θ∗ ) − ϕ (Σ0 + G) 0 + G) (16) t=1 Pn −1 Now note that σ −2 t=1 ϕ⊤ (Σ−1 ϕt (Yt − ϕ⊤ t θ∗ ) is a weighted sum of independent sub-Gaussian random variables 0 + G) 2 Yt − ϕ⊤ θ ∼ SubG(σ ). By definition, its variance proxy is bounded from above as t ∗ σ −2 n X −1 −1 −1 −1 −1 ϕ⊤ (Σ−1 ϕt ϕ ⊤ ϕ = ϕ⊤ (Σ−1 G(Σ−1 ϕ t (Σ0 + G) 0 + G) 0 + G) 0 + G) t=1 −1 ≤ ϕ⊤ (Σ−1 ϕ = ∥ϕ∥2Σ̂ , 0 + G) where the last inequality follows from G ⪯ Σ−1 0 + G. By the concentration of sub-Gaussian random variables, we have ! n X p −1 ϕ⊤ (Σ−1 ϕt (Yt − ϕ⊤ 2 log(2/δ)∥ϕ∥Σ̂ ≤ δ . P σ −2 t θ∗ ) ≥ 0 + G) t=1 To bound the second term in (16), we apply the Cauchy–Schwarz inequality and get q q − 21 −1 ⊤ Σ−1 Σ̂Σ−1 θ ∥ϕ∥ ≤ ϕ⊤ Σ̂Σ−1 θ ≤ ∥Σ θ ∥ ∥ϕ∥ = θ θ∗⊤ Σ−1 ∗ ∗ ∗ ∗ 0 0 0 0 0 θ∗ ∥ϕ∥Σ̂ ≤ κλd (Σ0 )∥ϕ∥Σ̂ . Σ̂ Σ̂ Σ̂ 2 −1 The second and third inequalities follow from Σ̂ ⪯ Σ0 and θ∗⊤ Σ−1 0 θ∗ ≤ κ λd (Σ0 ), respectively. In the next step, we chain all inequalities starting from (16) and get that  p   −1 P ϕ⊤ (θ̂ − θ∗ ) ≥ 2 log(2/δ) + κλd 2 (Σ0 ) ∥ϕ∥Σ̂ ≤ δ holds for any ϕ ∈ Φ with probability at least 1 − δ. To finalize the proof, we apply a union bound over all ϕ. The rest of the proof proceeds exactly as in Appendix A, since that proof is for any model parameter θ∗ on event E. This completes the proof of Theorem 5. 14 Multi-Task Off-Policy Learning from Bandit Feedback C. Proof of Theorem 2 The proof is under the assumption that the logged dataset D is fixed and the task parameter θs,∗ is random. In particular, since we conduct a Bayesian analysis, we condition on all available observations and have θs,∗ | D ∼ N (θ̂s , Σ̂s ), where θ̂s = Σ̃s (Σ−1 0 µ̄ + Bs ) and Σ̂s are derived in Section 3.2. We start with the concentration of the task parameter. To simplify notation, let rs (x, a) = r(x, a; θs,∗ ). Lemma 7. Let E = {∀x ∈ X , a ∈ A : |rs (x, a) − r̂s (x, a)| ≤ cs (x, a)} be the event that all high-probability confidence intervals in task s ∈ S hold. Then P (E | D) ≥ 1 − δ. Proof. The proof is analogous to Lemma 3, since only the mean and covariance of θs,∗ | D changed, and this is reflected in the definitions of r̂s (x, a) and cs (x, a). On event E in Lemma 7, similarly to Lemma 4, we have that rs (x, πs,∗ (x)) − rs (x, π̂s (x)) ≤ 2cs (x, πs,∗ (x)) holds for all contexts x ∈ X with probability at least 1 − δ. Since the above bound holds for any context, we can use use it to bound the suboptimality of π̂s by the expected confidence interval width induced by πs,∗ . In the rest of the analysis, we fix θs,∗ and the only randomness is due to X ∼ Px . 
C. Proof of Theorem 2

The proof is under the assumption that the logged dataset D is fixed and the task parameter θs,∗ is random. In particular, since we conduct a Bayesian analysis, we condition on all available observations and have θs,∗ | D ∼ N(θ̂s, Σ̂s), where θ̂s = Σ̃s(Σ0^{-1} μ̄ + Bs) and Σ̂s are derived in Section 3.2. We start with the concentration of the task parameter. To simplify notation, let rs(x, a) = r(x, a; θs,∗).

Lemma 7. Let E = {∀x ∈ X, a ∈ A : |rs(x, a) − r̂s(x, a)| ≤ cs(x, a)} be the event that all high-probability confidence intervals in task s ∈ S hold. Then P(E | D) ≥ 1 − δ.

Proof. The proof is analogous to that of Lemma 3, since only the mean and covariance of θs,∗ | D changed, and this is reflected in the definitions of r̂s(x, a) and cs(x, a).

On event E in Lemma 7, similarly to Lemma 4, we have that rs(x, πs,∗(x)) − rs(x, π̂s(x)) ≤ 2cs(x, πs,∗(x)) holds for all contexts x ∈ X with probability at least 1 − δ. Since this bound holds for any context, we can use it to bound the suboptimality of π̂s by the expected confidence interval width induced by πs,∗. In the rest of the analysis, we fix θs,∗ and the only randomness is due to X ∼ Px. On event E in Lemma 7, we have

V(πs,∗; θs,∗) − V(π̂s; θs,∗) = E[rs(X, πs,∗(X)) − rs(X, π̂s(X))]
≤ 2 E[cs(X, πs,∗(X))]    (17)
= 2 √(5d log(1/δ)) E[√(ϕ(X, πs,∗(X))⊤ Σ̂s ϕ(X, πs,∗(X)))]
≤ 2 √(5d log(1/δ)) √(E[ϕ(X, πs,∗(X))⊤ (Σ̃s Σ0^{-1} Σ̄ Σ0^{-1} Σ̃s + Σ̃s) ϕ(X, πs,∗(X))]).

These steps are the same as in (15). The latter term, which represents the conditional task uncertainty, is bounded exactly as in Theorem 1,

E[ϕ(X, πs,∗(X))⊤ Σ̃s ϕ(X, πs,∗(X))] ≤ d / (λd(Σ0^{-1}) + γσ^{-2} ns).

For the former term, which represents the hyper-parameter uncertainty, we have

E[ϕ(X, πs,∗(X))⊤ Σ̃s Σ0^{-1} Σ̄ Σ0^{-1} Σ̃s ϕ(X, πs,∗(X))] = tr(Gs,∗ Σ̃s Σ0^{-1} Σ̄ Σ0^{-1} Σ̃s) ≤ d λ1(Gs,∗ Σ̃s Σ0^{-1} Σ̄ Σ0^{-1} Σ̃s).

To bound the maximum eigenvalue, we further proceed as

λ1(Gs,∗ Σ̃s Σ0^{-1} Σ̄ Σ0^{-1} Σ̃s) ≤ λ1(Gs,∗) λ1(Σ̃s Σ0^{-1}) λ1(Σ̄) λ1(Σ0^{-1} Σ̃s) ≤ λ1(Σ̄) = 1 / λd(Σq^{-1} + ∑_{z∈S} (Σ0 + Gz^{-1})^{-1}).

The second inequality uses λ1(Gs,∗) ≤ 1, which follows from Assumption 1, and λ1(Σ̃s Σ0^{-1}) = λ1(Σ0^{-1} Σ̃s) ≤ 1. Finally, we apply basic eigenvalue identities and inequalities, and get

λd(Σq^{-1} + ∑_{z∈S} (Σ0 + Gz^{-1})^{-1}) ≥ λd(Σq^{-1}) + ∑_{z∈S} λd((Σ0 + Gz^{-1})^{-1})
= λd(Σq^{-1}) + ∑_{z∈S} λ1^{-1}(Σ0 + Gz^{-1})
≥ λd(Σq^{-1}) + ∑_{z∈S} 1 / (λ1(Σ0) + λ1(Gz^{-1}))
≥ λd(Σq^{-1}) + ∑_{z∈S} 1 / (λ1(Σ0) + γ^{-1} σ² λ1(Gz,∗^{-1}) nz^{-1}),

where we use Assumption 3 in the last inequality. To finalize the proof, we chain all inequalities starting from (17), use √(a + b) ≤ √a + √b, and get that

V(πs,∗; θs,∗) − V(π̂s; θs,∗) ≤ √(5d log(1/δ)) [√(4d / (λd(Σq^{-1}) + ∑_{z∈S} (λ1(Σ0) + γ^{-1} σ² λ1(Gz,∗^{-1}) nz^{-1})^{-1})) + √(4d / (λd(Σ0^{-1}) + γσ^{-2} ns))]

holds on event E, which occurs with probability at least 1 − δ for θs,∗ | D. This completes the proof of the first claim in Theorem 2.

Note that the bound depends on λ1(Gz,∗^{-1}), which can be large when λd(Gz,∗) is small. We can eliminate this dependence under the additional assumption in Theorem 2. Under that assumption, all matrices are diagonal and thus commute, and

λ1(Gs,∗ Σ̃s Σ0^{-1} Σ̄ Σ0^{-1} Σ̃s) = λ1(Gs,∗ Σ̄ Σ̃s Σ0^{-1} Σ0^{-1} Σ̃s) ≤ λ1(Gs,∗ Σ̄) = λd^{-1}(Σ̄^{-1} Gs,∗^{-1}) = 1 / λd(Σq^{-1} Gs,∗^{-1} + ∑_{z∈S} (Gs,∗ Σ0 + Gs,∗ Gz^{-1})^{-1}).

Finally, we bound the minimum eigenvalue from below using basic eigenvalue identities and inequalities,

λd(Σq^{-1} Gs,∗^{-1} + ∑_{z∈S} (Gs,∗ Σ0 + Gs,∗ Gz^{-1})^{-1}) ≥ λd(Σq^{-1}) λ1^{-1}(Gs,∗) + ∑_{z∈S} λ1^{-1}(Gs,∗ Σ0 + Gs,∗ Gz^{-1})
≥ λd(Σq^{-1}) + ∑_{z∈S} 1 / (λ1(Gs,∗) λ1(Σ0) + λ1(Gs,∗ Gz^{-1}))
≥ λd(Σq^{-1}) + ∑_{z∈S} 1 / (λ1(Σ0) + γ^{-1} σ² nz^{-1}).

In the last two inequalities, we use that λ1(Gs,∗) ≤ 1. In the last inequality, we also use that Assumption 3 holds for any task parameter, including θz,∗ = θs,∗. Thus Gz ⪰ γσ^{-2} nz Gs,∗ and Gz^{-1} ⪯ γ^{-1} σ² nz^{-1} Gs,∗^{-1}. This completes the proof of the second claim in Theorem 2.
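For concreteness, the sketch below (hypothetical names) assembles the task posterior used in this proof. It takes the hyper-posterior mean μ̄ as given (its expression is in Section 3.2 and is not reproduced here), assumes Bs = σ^{-2} ∑_{t: St=s} ϕt Yt and Gs = σ^{-2} ∑_{t: St=s} ϕt ϕt⊤, and uses the relations Σ̄^{-1} = Σq^{-1} + ∑_{z∈S} (Σ0 + Gz^{-1})^{-1} and Σ̂s = Σ̃s + Σ̃s Σ0^{-1} Σ̄ Σ0^{-1} Σ̃s that appear above; each per-task Gram matrix is assumed invertible.

import numpy as np

def task_posterior(s, data, mu_bar, Sigma_q, Sigma0, sigma):
    # data: dict mapping task z to (Phi_z, y_z), where Phi_z is the n_z x d feature matrix
    # and y_z are the logged rewards in task z. mu_bar is computed as in Section 3.2.
    Sigma0_inv = np.linalg.inv(Sigma0)
    # Hyper-posterior covariance: Sigma_bar^{-1} = Sigma_q^{-1} + sum_z (Sigma_0 + G_z^{-1})^{-1}.
    Sigma_bar_inv = np.linalg.inv(Sigma_q)
    for Phi_z, _ in data.values():
        G_z = Phi_z.T @ Phi_z / sigma ** 2              # assumed invertible
        Sigma_bar_inv += np.linalg.inv(Sigma0 + np.linalg.inv(G_z))
    Sigma_bar = np.linalg.inv(Sigma_bar_inv)
    # Conditional task posterior given the hyper-posterior mean mu_bar.
    Phi_s, y_s = data[s]
    G_s = Phi_s.T @ Phi_s / sigma ** 2
    B_s = Phi_s.T @ y_s / sigma ** 2
    Sigma_tilde_s = np.linalg.inv(Sigma0_inv + G_s)
    theta_hat_s = Sigma_tilde_s @ (Sigma0_inv @ mu_bar + B_s)
    # Marginal task-posterior covariance: Sigma_tilde_s + Sigma_tilde_s Sigma_0^{-1} Sigma_bar Sigma_0^{-1} Sigma_tilde_s.
    A = Sigma_tilde_s @ Sigma0_inv
    Sigma_hat_s = Sigma_tilde_s + A @ Sigma_bar @ A.T
    return theta_hat_s, Sigma_hat_s

The per-task pessimistic policy then uses cs(x, a) = √(5d log(1/δ)) ∥ϕ(x, a)∥_Σ̂s, exactly as in the single-task sketch after Appendix A.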
[Figure 4: three panels, one per evaluation alphabet, each titled "Omniglot Bandit (m = 100, d = 64, K = 10)" and plotting suboptimality against dataset size n for HierOPO, FlatOPO, and OracleOPO.]

Figure 4. Evaluation of off-policy algorithms on Omniglot image classification using three randomly selected alphabets.

Figure 5. Visualization of the three alphabets used for evaluation. The first four images are randomly selected characters from the alphabet. The fifth is a visualization of the hyper-parameter µ∗ estimated by HierOPO, obtained by interpolating the estimated hyper-parameter among the characters in the alphabet. Note that the hyper-parameter captures common structure among the characters in the alphabet.

D. Additional Experiment on Image Classification

In this section, we report an additional experiment on online image classification using a real-world dataset commonly used in meta-learning. We use Omniglot (Lake et al., 2015), a dataset of 1623 handwritten characters from 50 different alphabets with 20 examples per character. We train a 4-layer CNN to classify the characters from 30 of the alphabets and use it to extract d = 64-dimensional features. The remaining 20 alphabets are used to evaluate the algorithms as in Section 7.

We create a multi-task contextual bandit from the test dataset as follows. First, an alphabet is sampled uniformly at random from the alphabets reserved for evaluation. Then, for each task, a character is chosen uniformly at random from that alphabet as the positive class. Because all tasks correspond to characters from the same alphabet, the tasks are hierarchically related. In each round of a particular task, K = 10 images from the dataset are chosen, one of which is guaranteed to be of the positive character. The context is the concatenation of the d-dimensional feature vectors extracted by the CNN, trained on the training alphabets, for the chosen images. The reward for selecting an image of the positive class is sampled from Ber(0.9); otherwise, the reward is sampled from Ber(0.1).

We estimate the hierarchical model in Section 3.2 using the CNN trained on the training alphabets. We estimate the hyper-prior parameters µq and Σq using the mean and covariance, respectively, of the features of all images in the test set. Then, for each alphabet, we set µ∗ and Σ0 to the mean and covariance of the features of the images in that alphabet. Recall that µ∗ and Σ0 are unknown to all baselines except OracleOPO.

In Figure 4, we report results for three randomly selected alphabets from the test set over 10 random runs, where each run consists of choosing m = 100 characters, generating a dataset of size 4n, and running each algorithm on the dataset. Both HierOPO and OracleOPO outperform FlatOPO initially, but the methods converge in performance as the dataset size n grows (particularly in the third alphabet). This is because HierOPO and OracleOPO assume a Gaussian hierarchical structure over characters in an alphabet, which is likely violated. However, as shown in Figure 5, HierOPO is still able to learn commonalities between characters in an alphabet, such as shape and curvature, which leads to improved performance when n is small.
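To make the construction above concrete, the following sketch (hypothetical function and variable names, assuming precomputed CNN features) builds one task of the Omniglot bandit and the feature-based prior moments described above. It illustrates the data-generating process and is not the code used for the reported results.

import numpy as np

def omniglot_task(rng, feats, labels, alphabet_chars, K=10):
    # feats: (N, 64) CNN features of the evaluation images; labels: (N,) character ids;
    # alphabet_chars: ids of the characters in the sampled alphabet.
    pos_char = rng.choice(alphabet_chars)            # positive class of this task
    pos_idx = np.flatnonzero(labels == pos_char)

    def sample_round():
        # K candidate images, one of which is guaranteed to be of the positive character.
        idx = rng.choice(len(labels), size=K, replace=False)
        idx[rng.integers(K)] = rng.choice(pos_idx)
        context = feats[idx]                             # K x 64 block; the context concatenates these rows
        p = np.where(labels[idx] == pos_char, 0.9, 0.1)  # Ber(0.9) vs. Ber(0.1) mean rewards
        return context, rng.binomial(1, p)               # realized reward of each arm

    return pos_char, sample_round

def prior_moments(feats):
    # Mean and covariance of features: applied to all evaluation images for (mu_q, Sigma_q)
    # and to the images of a single alphabet for (mu_star, Sigma_0).
    return feats.mean(axis=0), np.cov(feats, rowvar=False)

A logging policy would then record, for each round, only the reward of the arm it pulls, producing the logged dataset of size n used by the off-policy methods.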