Title: User Comment Replies — AI Alignment Forum
Description: A community blog devoted to technical AI alignment research

All of Joe Collman's Comments + Replies

Circumventing interpretability: How to defeat mind-readers
Joe Collman · 8mo · 10

To be clear, I only think this becomes obviously significant in a train-against-interpretability-tools context; if we're not training against tools, the things I'm highlighting here seem intuitively highly unlikely. This still seems important, given that the combination [x is clearly a bad idea] and [x is the easiest path forward] does not robustly result in [nobody does x].

That said, I think I'd disagree on one word of the following:

> The mechanisms I labelled as "strictly active" are the kind of strategy that it would be extremely improbable to implement su... (read more)

Circumventing interpretability: How to defeat mind-readers
Joe Collman · 8mo · 10

> Information flow might be inhibited actively through an agent's actions. The primary way this could happen is gradient hacking, but it’s not the only kind of action an AI might take to conceal misaligned thoughts. Of course, active circumvention methods require that interpreters either can’t see or aren’t looking at the thoughts that generate those actions. Most potential circumvention methods that can be passive can also be active. But some methods can only be active.

It seems to me that there's no fixed notion of "active" that works for both para... (read more)

Lee Sharkey · 8mo · 1

So I believe I had in mind "active means [is achieved deliberately through the agent's actions]". I think your distinction makes sense. And if I ever end up updating this article I would consider incorporating it.
However, I think the reason I didn't make this distinction at the time is because the difference is pretty subtle. The mechanisms I labelled as "strictly active" are the kind of strategy that it would be extremely improbable to implement successfully without some sort of coherent internal representations to help orchestrate the actions required to do it. This is true even if they've been selected for passively. So I'd argue that they all need to be implemented actively (if they're to have a reasonable probability of success) but may be selected for passively or actively. I'm curious if you agree with this? If not, then I may have missed your argument.

(For the convenience of other readers of this thread, the kind of strategies I labelled as strictly active are:

* 1.1.3. Environmental artifacts that corrupt its future self: A misaligned AI may leave behind artifacts that serve as adversarial examples or poisoned data that corrupt its future self. (Strictly active)
* 1.1.4. Deleting (and relearning) dangerous representations: When not under surveillance, an AI might be able to delete dangerous representations so that it looks safe when surveillance resumes. It might also be able to schedule when it relearns those representations. (Strictly active)
* 1.1.5. Weights poisoning: Instead of providing manipulated inputs, the AI might manipulate its own weights so that a small subset of normal inputs behave like adversarial examples. (Strictly active)
* 1.1.6. Pre-committing to following potentially dangerous instructions: Hubinger: "... if the model has the ability to read and write to some sort of state ...
it could write out a pre-commitment to follow the hard-coded policy and then just have a policy of always following any pre-commitments it finds.")

On “first critical tries” in AI alignment
Joe Collman · 8mo · 30

It may be better to think about it that way, yes - in some cases, at least. Probably it makes sense to throw in some more variables. Something like: to stand x chance of property p applying to system s, we'd need to apply resources r. In these terms, [loss of control] is something like [ensuring important properties becomes much more expensive (or impossible)].

On scalable oversight with weak LLMs judging strong LLMs
Joe Collman · 9mo · 30

That's fair. I agree that we're not likely to resolve much by continuing this discussion. (but thanks for engaging - I do think I understand your position somewhat better now)

What does seem worth considering is adjusting research direction to increase focus on [search for and better understand the most important failure modes] - both of debate-like approaches generally, and any [plan to use such techniques to get useful alignment work done]. I expect that this would lead people to develop clearer, richer models. Presumably this will take months rather than h... (read more)

On scalable oversight with weak LLMs judging strong LLMs
Joe Collman · 9mo · 10

> "[regardless of the technical work you do] there will always be some existentially risky failures left, so if we proceed we get doom..."

I'm claiming something more like "[given a realistic degree of technical work on current agendas in the time we have], there will be some existentially risky failures left, so if we proceed we're highly likely to get doom." I'll clarify more below.

> Otherwise even in a "free-for-all" world, our actions do influence odds of success, because you can do technical work that people use, and that reduces p(doom).

Sure, but I mostly do...
(read more)

Rohin Shah · 9mo · 135

Okay, I think it's pretty clear that the crux between us is basically what I was gesturing at in my first comment, even if there are minor caveats that make it not exactly literally that. I'm probably not going to engage with perspectives that say all current [alignment work towards building safer future powerful AI systems] is net negative, sorry. In my experience those discussions typically don't go anywhere useful.

On scalable oversight with weak LLMs judging strong LLMs
Joe Collman · 9mo · 10

> Not going to respond to everything

No worries at all - I was aiming for [Rohin better understands where I'm coming from]. My response was over-long.

> E.g. presumably if you believe in this causal arrow you should also believe [higher perceived risk] --> [actions that decrease risk]. But if all building-safe-AI work were to stop today, I think this would have very little effect on how fast the world pushes forward with capabilities.

Agreed, but I think this is too coarse-grained a view. I expect that, absent impressive levels of international coordination, we... (read more)

Rohin Shah · 9mo · 5

This is the sort of thing that makes it hard for me to distinguish your argument from "[regardless of the technical work you do] there will always be some existentially risky failures left, so if we proceed we will get doom. Therefore, we should avoid solving some failures, because those failures could help build political will to shut it all down". I agree that, conditional on believing that we're screwed absent huge levels of coordination regardless of technical work, then a lot of technical work including debate looks net negative by reducing the will to coordinate. Similarly this only makes sense under a view where technical work can't have much impact on p(doom) by itself, aka "regardless of technical work we're screwed".
Otherwise even in a "free-for-all" world, our actions do influence odds of success, because you can do technical work that people use, and that reduces p(doom).

Oh, my probability on level 6 or level 7 specifications becoming the default in AI is dominated by my probability that I'm somehow misunderstanding what they're supposed to be. (A level 7 spec for AGI seems impossible even in theory, e.g. because it requires solving the halting problem.) If we ignore the misunderstanding part then I'm at << 1% probability on "we build transformative AI using GSA with level 6 / level 7 specifications in the nearish future". (I could imagine a pause on frontier AI R&D, except that you are allowed to proceed if you have level 6 / level 7 specifications; and those specifications are used in a few narrow domains. My probability on that is similar to my probability on a pause.)

On scalable oversight with weak LLMs judging strong LLMs
Joe Collman · 9mo · 58

[apologies for writing so much; you might want to skip/skim the spoilered bit, since it seems largely a statement of the obvious]

> Do you agree that in many other safety fields, safety work mostly didn't think about risk compensation, and still drove down absolute risk?

Agreed (I imagine there are exceptions, but I'd be shocked if this weren't usually true).

[I'm responding to this part next, since I think it may resolve some of our mutual misunderstanding] It seems like your argument here, and in other parts of your comment, is something like "we could d... (read more)

Rohin Shah · 9mo · 6

Not going to respond to everything, sorry, but a few notes:

My claim is that for the things you call "actions that increase risk" that I call "opportunity cost", this causal arrow is very weak, and so you shouldn't think of it as risk compensation. E.g. presumably if you believe in this causal arrow you should also believe [higher perceived risk] --> [actions that decrease risk].
But if all building-safe-AI work were to stop today, I think this would have very little effect on how fast the world pushes forward with capabilities.

I agree that reference classes are often terrible and a poor guide to the future, but often first-principles reasoning is worse (related: 1, 2).

I also don't really understand the argument in your spoiler box. You've listed a bunch of claims about AI, but haven't spelled out why they should make us expect large risk compensation effects, which I thought was the relevant question.

It depends hugely on the specific stronger safety measure you talk about. E.g. I'd be at < 5% on a complete ban on frontier AI R&D (which includes academic research on the topic). Probably I should be < 1%, but I'm hesitant around such small probabilities on any social claim. For things like GSA and ARC's work, there isn't a sufficiently precise claim for me to put a probability on.

Not a big factor. (I guess it matters that instruction tuning and RLHF exist, but something like that was always going to happen; the question was when.)

Hmm, then I don't understand why you like GSA more than debate, given that debate can fit in the GSA framework (it would be a level 2 specification by the definitions in the paper). You might think that GSA will uncover problems in debate if they exist when using it as a specification, but if anything that seems to me less likely to happen with GSA, since in a GSA approach the specification is treated as infallible.

On scalable oversight with weak LLMs judging strong LLMs
Joe Collman · 9mo · 54

Thanks for the response. I realize this kind of conversation can be annoying (but I think it's important). [I've included various links below, but they're largely intended for readers-that-aren't-you]

I don't see why this isn't a fully general counterargument to alignment work. Your argument sounds to me like "there will always be some existentially risky failures left, so if we proceed we will get doom.
Therefore, we should avoid solving some failures, because those failures could help build political will to shut it all down". (Thanks for this too. I don't ... (read more)

Rohin Shah · 9mo · 4

Main points:

I'm on board with these.

I still don't see why you believe this. Do you agree that in many other safety fields, safety work mostly didn't think about risk compensation, and still drove down absolute risk? (E.g. I haven't looked into it, but I bet people didn't spend a bunch of time thinking about risk compensation when deciding whether to include seat belts in cars.) If you do agree with that, what makes AI different from those cases? (The arguments you give seem like very general considerations that apply to other fields as well.) I'd say that the risk compensation argument as given here Proves Too Much and implies that most safety work in most previous fields was net negative, which seems clearly wrong to me. It's true that as a result I don't spend lots of time thinking about risk compensation; that still seems correct to me.

It seems like your argument here, and in other parts of your comment, is something like "we could do this more costly thing that increases safety even more". This seems like a pretty different argument; it's not about risk compensation (i.e. when you introduce safety measures, people do more risky things), but rather about opportunity cost (i.e. when you introduce weak safety measures, you reduce the will to have stronger safety measures). This is fine, but I want to note the explicit change in argument; my earlier comment and the discussion above was not trying to address this argument.

Briefly, on opportunity cost arguments, the key factors are (a) how much will is there to pay large costs for safety, (b) how much time remains to do the necessary research and implement it, and (c) how feasible is the stronger safety measure.
I am actually more optimistic about both (a) and (b) than what I perceive to be the common consensus amongst safety researchers at AGI labs, but tend to be pretty pessimistic about (c) (at least relative to many LessWrongers; I'm not sure how it compares to safety researchers at AGI labs). Anyway for ...

On scalable oversight with weak LLMs judging strong LLMs
Joe Collman · 9mo · 20

Sure, linking to that seems useful, thanks. That said, I'm expecting that the crux isn't [can a debate setup work for arbitrarily powerful systems?], but rather e.g. [can it be useful in safe automation of alignment research?]. For something like the latter, it's not clear to me that it's not useful.

Mainly my pessimism is about:

* Debate seeming not to address the failure modes I'm worried about - e.g. scheming.
* Expecting [systems insufficiently capable to cause catastrophe] not to radically (>10x) boost the most important research on alignment. (hopefully I'... (read more)

Fabien Roger · 9mo · 14

Why? Is it exploration difficulties, rare failures, or something else? Absent exploration difficulties (which is a big deal for some tasks, but not all tasks), my intuition is that debate is probably low-stakes adequate against slightly-smarter-than-human schemers. Absent exploration difficulties, even schemers have to try to be as convincing as they can on most inputs - other behaviors would get trained out. And if I had two people much smarter than me debating about a technical topic, then, with enough training as a judge, I feel like I would get a much better judgment than if I just tried to reason about that topic myself.

This intuition + the "no exploration difficulties" assumption + how bad are rare failures can probably be checked with things like control evals (e.g. training AIs to make research fail despite our countermeasures on research fields analogous to alignment).
(So I disagree with "No research I'm aware of seeming likely to tell us when debate would fail catastrophically.")

On scalable oversight with weak LLMs judging strong LLMs
Joe Collman · 9mo · 3 · -2

Do you have a [link to] / [summary of] your argument/intuitions for [this kind of research on debate makes us safer in expectation]? (e.g. is Geoffrey Irving's AXRP appearance a good summary of the rationale?)

To me it seems likely to lead to [approach that appears to work to many, but fails catastrophically] before it leads to [approach that works]. (This needn't be direct.) I.e. currently I'd expect this direction to make things worse both for:

* We're aiming for an oversight protocol that's directly scalable to superintelligence.
* We're aiming for e.g. a contro... (read more)

Rohin Shah · 9mo · 99

I like both of the theories of change you listed, though for (1) I usually think about scaling till human obsolescence rather than superintelligence. (Though imo this broad class of schemes plausibly scales to superintelligence if you eventually hand off the judge role to powerful AI systems. Though I expect we'll be able to reduce risk further in the future with more research.)

> I note here that this isn't a fully-general counterargument, but rather a general consideration.

I don't see why this isn't a fully general counterargument to alignment work. Your arg... (read more)

Peter Barnett · 9mo · 3

I think this comment might be more productive if you described why you expect this approach to fail catastrophically when dealing with powerful systems (in a way that doesn't provide adequate warning). Linking to previous writing on this could be good (maybe this comment of yours on debate/scalable oversight).

AI catastrophes and rogue deployments
Joe Collman · 10mo · 10

Ah okay, that's clarifying.
Thanks.

It still seems to me that there's a core similarity for all cases of [model is deployed in a context without fully functional safety measures] - and that that can happen either via rogue deployment, or any action that subverts some safety measure in a standard deployment. In either case the model gets [much more able to take the huge number of sketchy-actions that are probably required to cause the catastrophe].

Granted, by default I'd expect [compromised safety measure(s)] -> [rogue deployment] -> [catastrophe]. Rather... (read more)

AI catastrophes and rogue deployments
Joe Collman · 10mo · 10

This seems a helpful model - so long as it's borne in mind that [most paths to catastrophe without rogue deployment require many actions] isn't a guarantee.

Thoughts: It's not clear to me whether the following counts as a rogue deployment (I'm assuming so): [un-noticed failure of one safety measure, in a context where all other safety measures are operational]. For this kind of case:

* The name "rogue deployment" doesn't seem a great fit.
* In general, it's not clear to me how to draw the line between: Safety measure x didn't achieve what we wanted, because it wasn't ... (read more)

Buck Shlegeris · 10mo · 12

Re 1, I don't count that as a rogue deployment. The distinction I'm trying to draw is: did the model end up getting deployed in a way that involves a different fundamental system architecture than the one we'd designed? Agreed re 2.

Richard Ngo's Shortform
Joe Collman · 10mo · 22

I agree with this. Unfortunately, I think there's a fundamentally inside-view aspect of [problems very different from those we're used to]. I think looking for a range of frames is the right thing to do - but deciding on the relevance of the frame can only be done by looking at the details of the problem itself (if we instead use our usual heuristics for relevance-of-frame-x, we run into the same out-of-distribution issues). I don't think there's a way around this.
Aspects of this situation are fundamentally different from those we're used to. [Is different ... (read more)

[Link Post] "Foundational Challenges in Assuring Alignment and Safety of Large Language Models"
Joe Collman · 10mo · 73

I think this is great overall. One area I'd ideally prefer a clearer presentation/framing is "Safety/performance trade-offs". I agree that it's better than "alignment tax", but I think it shares one of the core downsides: If we say "alignment tax", many people will conclude ["we can pay the tax and achieve alignment" and "the alignment tax isn't infinite"]. If we say "safety/performance trade-offs", many people will conclude ["we know how to make systems safe, so long as we're willing to sacrifice performance" and "performance sacrifice won't imply any hard limi... (read more)

David Scott Krueger · 10mo · 2

Really interesting point! I introduced this term in my slides that included "paperweight" as an example of an "AI system" that maximizes safety. I sort of still think it's an OK term, but I'm sure I will keep thinking about this going forward and hope we can arrive at an even better term.

On "first critical tries" in AI alignment
Joe Collman · 10mo · 40

I think the DSA framing is in keeping with the spirit of "first critical try" discourse. (With that in mind, the below is more "this too seems very important", rather than "omitting this is an error".) However, I think it's important to consider scenarios where humans lose meaningful control without any AI or group of AIs necessarily gaining a DSA. I think "loss of control" is the threat to think about, not "AI(s) take(s) control". Admittedly this gets into Moloch-related grey areas - but this may indicate that [humans do/don't have control] is too coarse-gr... (read more)

faul_sname · 8mo · 44

Does any specific human or group of humans currently have "control" in the sense of "that which is lost in a loss-of-control scenario"?
If not, that indicates to me that it may be useful to frame the risk as "failure to gain control".

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
Joe Collman · 10mo · 10

(understood that you'd want to avoid the below by construction through the specification)

I think the worries about a "least harmful path" failure mode would also apply to a "below 1 catastrophic event per millennium" threshold. It's not obvious to me that the vast majority of ways to [avoid significant risk of catastrophe-according-to-our-specification] wouldn't be highly undesirable outcomes. It seems to me that "greatly penalize the additional facts which are enforced" is a two-edged sword: we want various additional facts to be highly likely, since our a... (read more)

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
Joe Collman · 10mo · 25

[again, the below is all in the spirit of "I think this direction is plausibly useful, and I'd like to see more work on it"]

> not to have any mental influences on people other than those which factor through the system's pre-agreed goals being achieved in the world.

Sure, but this seems to say "Don't worry, the malicious superintelligence can only manipulate your mind indirectly". This is not the level of assurance I want from something calling itself "Guaranteed safe".

> It is worth noting here that a potential failure mode is that a truly malicious general-pur... (read more)

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
Joe Collman · 10mo · 55

This seems interesting, but I've seen no plausible case that there's a version of (1) that's both sufficient and achievable. I've seen Davidad mention e.g. approaches using boundaries formalization. This seems achievable, but clearly not sufficient. (boundaries don't help with e.g.
[allow the mental influences that are desirable, but not those that are undesirable])

The [act sufficiently conservatively for safety, relative to some distribution of safety specifications] constraint seems likely to lead to paralysis (either of the form [AI system does nothing]... (read more)

davidad (David A. Dalrymple) · 10mo · 2

Paralysis of the form "AI system does nothing" is the most likely failure mode. This is a "de-pessimizing" agenda at the meta-level as well as at the object-level. Note, however, that there are some very valuable and ambitious tasks (e.g. build robots that install solar panels without damaging animals or irreversibly affecting existing structures, and only talking to people via a highly structured script) that can likely be specified without causing paralysis, even if they fall short of ending the acute risk period.

"Locked into some least-harmful path" is a potential failure mode if the semantics or implementation of causality or decision theory in the specification framework are done in a different way than I hope. Locking in to a particular path massively reduces the entropy of the outcome distribution beyond what is necessary to ensure a reasonable risk threshold (e.g. 1 catastrophic event per millennium) is cleared. A FEEF objective (namely, minimize the divergence of the outcomes conditional on intervention from the outcomes conditional on filtering for the goal being met) would greatly penalize the additional facts which are enforced by the lock-in behaviours. As a fail-safe, I propose to mitigate the downsides of lock-in by using time-bounded utility functions.

davidad (David A. Dalrymple) · 10mo · 3

It seems plausible to me that, until ambitious value alignment is solved, ASL-4+ systems ought not to have any mental influences on people other than those which factor through the system's pre-agreed goals being achieved in the world.
That is, ambitious value alignment seems like a necessary prerequisite for the safety of ASL-4+ general-purpose chatbots. However, world-changing GDP growth does not require such general-purpose capabilities to be directly available (rather than available via a sociotechnical system that involves agreeing on specifications and safety guardrails for particular narrow deployments).

It is worth noting here that a potential failure mode is that a truly malicious general-purpose system in the box could decide to encode harmful messages in irrelevant details of the engineering designs (which it then proves satisfy the safety specifications). But I think sufficient fine-tuning with a GFlowNet objective will naturally penalise description complexity, and also penalise heavily biased sampling of equally complex solutions (e.g. toward ones that encode messages of any significance), and I expect this to reduce this risk to an acceptable level. I would like to fund a sleeper-agents-style experiment on this by the end of 2025.

Counting arguments provide no evidence for AI doom
Joe Collman · 1y · 20

> Despite not answering all possible goal-related questions a priori, the reductionist perspective does provide a tractable research program for improving our understanding of AI goal development. It does this by reducing questions about goals to questions about behaviors observable in the training data.

[emphasis mine] This might be described as "a reductionist perspective". It is certainly not "the reductionist perspective", since reductionist perspectives need not limit themselves to "behaviors observable in the training data". A more reasonable-to-my-mind b... (read more)

Critiques of the AI control agenda
Joe Collman · 1y · 10

Sure, understood. However, I'm still unclear what you meant by "This level of understanding isn't sufficient for superhuman persuasion.".
If 'this' referred to [human coworker level], then you're correct (I now guess you did mean this ??), but it seems a mildly strange point to make. It's not clear to me why it'd be significant in the context without strong assumptions on correlation of capability in different kinds of understanding/persuasion. I interpreted 'this' as referring to the [understanding level of current models]. In that case it's not clear to me... (read more)

Ryan Greenblatt · 1y · 2

Yep, I just literally meant, "human coworker level doesn't suffice". I was just making a relatively narrow argument here, sorry about the confusion.

Critiques of the AI control agenda
Joe Collman · 1y · 10

> Do current models have better understanding of text authors than the human coworkers of these authors? I expect this isn't true right now (though it might be true for more powerful models for people who have written a huge amount of stuff online). This level of understanding isn't sufficient for superhuman persuasion.

Both "better understanding" and in a sense "superhuman persuasion" seem to be too coarse a way to think about this (I realize you're responding to a claim-at-similar-coarseness). Models don't need to be capable of a pareto improvement on human per... (read more)

Ryan Greenblatt · 1y · 2

Note that I wasn't making this argument. I was just responding to one specific story and then noting "I'm pretty skeptical of the specific stories I've heard for wildly superhuman persuasion emerging from pretraining prior to human level R&D capabilities". This is obviously only one of many possible arguments.

Debating with More Persuasive LLMs Leads to More Truthful Answers
Joe Collman · 1y · 30

Thanks for the thoughtful response. A few thoughts:

If length is the issue, then replacing "leads" with "led" would reflect the reality. I don't have an issue with titles like "...Improving safety..." since it has a [this is what this line of research is aiming at] vibe, rather than a [this is what we have shown] vibe.
Compare "curing cancer using x" to "x cures cancer". Also in that particular case your title doesn't suggest [we have achieved AI control]. I don't think it's controversial that control would improve safety, if achieved. I agree that this isn't a... (read more)

Debating with More Persuasive LLMs Leads to More Truthful Answers
Joe Collman · 1y · 11

I'd be curious what the take is of someone who disagrees with my comment. (I'm mildly surprised, since I'd have predicted more of a [this is not a useful comment] reaction, than a [this is incorrect] reaction.) I'm not clear whether the idea is that:

1. The title isn't an overstatement.
2. The title is not misleading. (e.g. because "everybody knows" that it's not making a claim of generality/robustness)
3. The title will not mislead significant amounts of people in important ways. It's marginally negative, but not worth time/attention.
4. There are upsides to the current nam... (read more)

Ryan Greenblatt · 1y · 4

I disagreed due to a combination of 2, 3, and 4. (Where 5 feeds into 2 and 3.) For 4, the upside is just that the title is less long and confusingly caveated. Norms around titles seem ok to me given issues with space. Do you have issues with our recent paper title "AI Control: Improving Safety Despite Intentional Subversion"? (Which seems pretty similar IMO.) Would you prefer this paper was "AI Control: Improving Safety Despite Intentional Subversion in a code backdooring setting"? (We considered titles more like this, but they were too long : (.)

Often with this sort of paper, you want to make some sort of conceptual point in your title (e.g. debate seems promising), but where the paper is only weak evidence for the conceptual point and most of the evidence is just that the method seems generally reasonable. I think some fraction of the general mass of people in the AI safety community (e.g.
median person working at some safety org or persistently lurking on LW) reasonably often get misled into thinking results are considerably stronger than they are based on stuff like titles and summaries. However, I don't think improving titles has very much alpha here. (I'm much more into avoiding overstating claims in other things like abstracts, blog posts, presentations, etc.) While I like the paper and think the title is basically fine, I think the abstract is misleading and seems to unnecessarily overstate their results IMO; there is enough space to do better. I'll probably gripe about this in another comment. My reaction is mostly "this isn't useful", but this is implicitly a disagreement with stuff like "but here it may actually matter if e.g. those working in governance think that you've actually shown ...".

Debating with More Persuasive LLMs Leads to More Truthful Answers
Joe Collman · 1y · 48

Interesting - I look forward to reading the paper. However, given that most people won't read the paper (or even the abstract), could I appeal for paper titles that don't overstate the generality of the results. I know it's standard practice in most fields not to bother with caveats in the title, but here it may actually matter if e.g. those working in governance think that you've actually shown "Debating with More Persuasive LLMs Leads to More Truthful Answers", rather than "In our experiments, Debating with More Persuasive LLMs Led to More Truthful Answer... (read more)

Joe Collman · 1y · 1

I'd be curious what the take is of someone who disagrees with my comment. (I'm mildly surprised, since I'd have predicted more of a [this is not a useful comment] reaction, than a [this is incorrect] reaction.) I'm not clear whether the idea is that:

1. The title isn't an overstatement.
2. The title is not misleading. (e.g. because "everybody knows" that it's not making a claim of generality/robustness)
3. The title will not mislead significant amounts of people in important ways.
It's marginally negative, but not worth time/attention.
4. There are upsides to the current name, and it seems net positive. (e.g. if it'd get more attention, and [paper gets attention] is considered positive)
5. This is the usual standard, so [it's fine] or [it's silly to complain about] or ...?
6. Something else.

I'm not claiming that this is unusual, or a huge issue on its own. I am claiming that the norms here seem systematically unhelpful. I'm more interested in the general practice than this paper specifically (though I think it's negative here). I'd be particularly interested in a claim of (4) - and whether the idea here is something like [everyone is doing this, it's an unhelpful equilibrium, but if we unilaterally depart from it it'll hurt what we care about and not fix the problem]. (this seems incorrect to me, but understandable)

The case for ensuring that powerful AIs are controlled
Joe Collman · 1y · 10

Thanks for the link. I find all of this plausible. However, I start to worry when we need to rely on "for all" assumptions based on intuition. (also, I worry in large part because domains are a natural way to think here - it's when things feel natural that we forget we're making assumptions)

I can buy that [most skills in a domain correlate quite closely] and that [most problematic skills/strategies exist in a small number of domains]. The 'all' versions are much less clear.

The case for ensuring that powerful AIs are controlled
Joe Collman · 1y · 42

Great post (I've not yet read it thoroughly, or thought for long). The first concern that springs to mind: I expect a strong correlation between [humans are bad at [subversive strategy x]], [humans don't tend to think of [subversive strategy x]], and [humans don't tend to notice [subversive strategy x]]. My worry is more that we miss strategies that we're bad at, than strategies the AI is extremely good at.
It seems plausible for us to be entirely unaware of the existence of some subversive strategies - since they've never been human skills, and so … (read more)

Reply from Ryan Greenblatt (1y):
A decent amount of my hope here is that we can at least get a sense for whether the best AI strategies in some reasonably similar cases look extremely bizarre and were totally unpredictable to humans. E.g., we might do experiments where we do huge amounts of RL in a variety of different environments that feel somewhat analogous, and then see whether this finds bizarre strategies. Separately, I'm somewhat skeptical of the best AI strategies being totally different from what humans can think of, though I can see where this concern is coming from. See here for more discussion.
Yep, we indeed need to worry about any qualitatively wildly superhuman and problematic strategy/skill. We mean "domain" in a somewhat broad sense. Part of our argument for avoiding qualitatively wildly superhuman skills that are problematic does correspond to there being only a small number of domains, with skills within those domains reasonably correlated (in the absence of aggressive domain-specific training). So using the term "domain" is pretty natural from this perspective. But it seems like you might not buy this perspective, in which case it seems like an unnatural frame.
Agreed, but worth noting this isn't really an assumption; I'm just making a rough argument which seems moderately compelling to me. (And FWIW it still seems moderately compelling despite this objection.)

On "Simulators" - Joe Collman (1y):
Yeah - I just noticed this "...is the mechanics underlying our world." on the tag page. Agreed that it's inaccurate and misleading. I hadn't realized it was being read this way.

Joe Collman (1y):
Sure, but I don't think anyone is claiming that there's a similarity between a brain stepping forward in physical time and transformer internals.
(Perhaps my wording was clumsy earlier.) IIUC, the single timestep in the 'physics' of the post is the generation and addition of one new token. I.e. GPT uses [some internal process] to generate a token. Adding the new token is a single atomic update to the "world state" of the simulation. The [some internal process] defines GPT's "laws of physics". The post isn't claiming that GPT is doing some generalized physics int… (read more)

Joe Collman (1y):
Oh, hang on - are you thinking that Janus is claiming that GPT works by learning some approximation to physics, rather than 'physics'? IIUC, the physics being referred to is either an analogy (when it refers to real-world physics), or a generalized 'physics' of [stepwise addition of tokens]. There's no presumption of a simulation of physics (at any granularity). E.g.:
> Models trained with the strict simulation objective are directly incentivized to reverse-engineer the (semantic) physics of the training distribution, and consequently, to propagate simu… (read more)

Reply from Oliver Habryka (1y):
No, I didn't mean to imply that. I understand that "physics" here is a general term for understanding how any system develops forward according to some abstract definition of time. What I am saying is that even with a more expansive definition of physics, it seems unlikely to me that GPT internally simulates a human mind (or anything else, really) in a way where structurally there is a strong similarity between the way a human brain steps forward in physical time and the way the insides of the transformer generate additional tokens.

Joe Collman (1y):
Perhaps we're talking past each other to a degree. I don't disagree with what you're saying. I think I've been unclear - or perhaps just saying almost vacuous things.
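The 'single timestep' framing above - [some internal process] generates one token, and appending it is a single atomic update to the world state - can be sketched as a toy loop. This is an illustration only; every name here is a placeholder invented for the sketch, not anything from a real model or library:

```python
# Toy sketch of the "GPT physics" framing: the world state is the token
# sequence, and one timestep is the generation + atomic addition of one token.
# `next_token_fn` stands in for [some internal process].

def step(state, next_token_fn):
    """One atomic 'physics' timestep: append exactly one new token."""
    return state + [next_token_fn(state)]

def rollout(state, next_token_fn, n_steps):
    """Run the 'simulation' forward by n_steps timesteps."""
    for _ in range(n_steps):
        state = step(state, next_token_fn)
    return state

# A trivial stand-in 'law of physics': each new token is the previous token plus one.
toy_law = lambda state: state[-1] + 1

print(rollout([0], toy_law, 3))  # [0, 1, 2, 3]
```

The point of the sketch is only that the "laws of physics" live entirely in `next_token_fn`; nothing about the loop itself presumes an internal simulation of anything.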
I'm attempting to make a very weak claim. (I think the post is also making no strong claim - not about internal mechanism, at least.) I only mean that the output can often be efficiently understood in terms of human characters (among other things) - i.e. that the output is a simulation, and that human-like minds will be an efficient abstraction for us to use when thinking about such a simulation. … (read more)

Joe Collman (1y):
To add to Charlie's point (which seems right to me): as I understand things, I think we are talking about a simulation of something somewhat close to human minds - e.g. the text behaviour of humanlike simulacra (made of tokens - but humans are made of atoms). There's just no claim of an internal simulation. I'd guess a common upside is to avoid constraining expectations unhelpfully in ways that [GPT as agent] might. However, I do still worry about saying "GPT is a simulator" rather than something like "GPT currently produces simulations". I think the former sugges… (read more)

Reply from Oliver Habryka (1y):
I mean, that is the exact thing that I was arguing against in my review. I think the distribution of human text just has too many features that are hard to produce via simulating human-like minds. I agree that the system is trained on imitating human text, and that this necessarily requires being able to roleplay as many different humans, but I don't think the process of that roleplay is particularly likely to be akin to a simulation (similarly to how, when humans roleplay as other humans, they do a lot of cognition that isn't simulation - i.e. when someone plays a character in a movie they do things like explicitly think about the historical period in which they were set, recognize that certain scenes will be hard to pull off, solve a problem using the knowledge they have when not roleplaying and then retrofit their solution into something the character might have come up with, etc.
When humans imitate things, we are not limited to simulating the target of our imitation). The cognitive landscape of an LLM is also very different from a human's, and it seems clear that in many contexts the behavior of an LLM will generalize quite differently than it would for a human; simulation again seems unlikely to be the only, or honestly even the primary, way I'd expect an LLM to get good at human text imitation, given that differing cognitive landscape.

On "Deceptive AI ≠ Deceptively-aligned AI" - Joe Collman (1y):
I think the broader use is sensible - e.g. to include post-training. However, I'm not sure how narrow you'd want [training hacking] to be. Do you want to call it training only if NN internals get updated by default? Or just that it's training hacking if it occurs during the period we consider training? (Otherwise, [deceptive alignment of a ...selection... process that could be ongoing] seems to cover all deceptive alignment - potential deletion/adjustment being a selection process.) Fine if there's no bright line - I'd just be curious to know your criteria.

Joe Collman (1y):
> By contrast, deception is much broader—it’s any situation where the AI is interacting with humans for any reason, and the AI deceives a human by knowingly providing them with false or misleading information.
This description allows us to classify every output of a highly capable AI as deceptive: for any AI output, it's essentially guaranteed that a human will update away from the truth about something. A highly capable AI will be able to predict some of these updates - thus it will be "knowingly providing ... misleading information". Conversely, we can't
… (read more)

Joe Collman (1y):
> …but the AI is actually emitting those outputs in order to create that impression—more specifically, the AI has situational awareness
I think it's best to avoid going beyond the RFLO description. In particular, it is not strictly required that the AI be aiming to "create that impression", or that it has "situational awareness" in any strong/general sense. Per footnote 26 in RFLO (footnote 7 in the post): "Note that it is not required that the mesa-optimizer be able to model (or infer the existence of) the base optimizer; it only needs to model the op… (read more)

On "Two concepts of an “episode” (Section 2.2.1 of “Scheming AIs”)" - Joe Collman (1y):
Thanks for writing the report - I think it’s an important issue, and you’ve clearly gone to a lot of effort. Overall, I think it’s good. However, it seems to me that the "incentivized episode" concept is confused, and that some conclusions over the probability of beyond-episode goals are invalid on this basis. I'm fairly sure I'm not confused here (though I'm somewhat confused that no-one picked this up in a draft, so who knows?!). I'm not sure to what extent the below will change your broader conclusions - if it only moves you from [this can't happen,… (read more)

On "Thoughts on responsible scaling policies and regulation" - Joe Collman (1y):
Hmm. Perhaps the thing I'd endorse is more [include this in every detailed statement about policy/regulation] than [shout it from the rooftops]. So, for example, if the authors agree with the statement, I think this should be in:
- ARC Evals' RSP post.
- Every RSP.
- Proposals for regulation.
- ...
I'm fine if we don't start printing it on bumper stickers. The outcome I'm interested in is something like: every person with significant influence on policy knows that this is believed to be a good/ideal solution, and that the only reasons against it are based on whet
… (read more)

Joe Collman (1y):
> I think forcing people to publicly endorse policies
Saying "If this happened, it would solve the problem" is not to endorse a policy. (Though perhaps literally shouting from the rooftops might be less than ideal.) It's entirely possible to state both "If X happened, it'd solve the problem" and "The policy we think is most likely to be effective in practice is Y". They can be put in the same statement quite simply. It's reasonable to say that this might not be the most effective communication strategy. (Though I think on balance I'd disagree.) It's not reasonabl… (read more)

Reply from Evan Hubinger (1y):
That's a lot of nuance that you're trying to convey to the general public, which is a notoriously hard thing to do.

Joe Collman (1y):
The specific conversation is much better than nothing - but I do think it ought to be emphasized that solving all the problems we're aware of isn't sufficient for safety. We're training on the test set.[1] Our confidence levels should reflect that - but I expect overconfidence. It's plausible that RSPs could be net positive, but I think that given successful coordination, [vague and uncertain] beats [significantly more concrete, but overconfident]. My presumption is that without good coordination (a necessary condition being cautious decision-makers), things w… (read more)

Joe Collman (1y):
Thanks for writing this. I'd be interested in your view on the comments made on Evan's RSP post w.r.t. unknown unknowns. I think aysja put it best in this comment. It seems important to move the burden of proof. Would you consider "an unknown unknown causes a catastrophe" to be a "concrete way in which they fail to manage risk"? Concrete or not, this seems sufficient grounds to stop, unless there's a clear argument that a bit more scaling actually helps for safety.
(I'd be interested in your take on that - e.g. on what speed boost you might expect with your o… (read more)

Reply from Paul Christiano (1y):
Unknown unknowns seem like a totally valid basis for concern. But I don't think you get to move the burden of proof by fiat. If you want action then you need to convince relevant actors they should be concerned about them, and that unknown unknowns can cause catastrophe before a lab will stop. Without further elaboration, I don't think "unknown unknowns could cause a catastrophe" is enough to convince governments (or AI developers) to take significant actions. I think RSPs make this situation better by pushing developers away from vague "Yeah we'll be s… (read more)

On "Lying is Cowardice, not Strategy" - Joe Collman (1y):
This is not what most people mean by "for personal gain". (I'm not disputing that Alice gets personal gain.) Insofar as the influence is required for altruistic ends, aiming for it doesn't imply aiming for personal gain. Insofar as the influence is not required for altruistic ends, we have no basis to believe Alice was aiming for it. "You're just doing that for personal gain!" is not generally taken to mean that you may be genuinely doing your best to create a better world for everyone, as you see it, in a way that many would broadly endorse. In this context, a… (read more)

Joe Collman (1y):
Ah okay - thanks. That's clarifying. Agreed that the post is at the very least not clear. In particular, it's obviously not true that [if we don't stop today, there's more than a 10% chance we all die], and I don't think [if we never stop, under any circumstances...]
is a case many people would be considering at all. It'd make sense to be much clearer on the 'this' that "many people believe". (And I hope you're correct on P(doom)!)

Joe Collman (1y):
[I agree with most of this, and think it's a very useful comment; just pointing out disagreements]
> For instance, I think that well implemented RSPs required by a regulatory agency can reduce risk to <5% (partially by stopping in worlds where this appears needed).
I assume this would be a crux with Connor/Gabe (and I think I'm at least much less confident in this than you appear to be). We're already in a world where stopping appears necessary. It's entirely possible we all die before stopping was clearly necessary. What gives you confidence that RSPs would actu… (read more)

Reply from Ryan Greenblatt (1y):
Yeah, I probably want to walk back my claim a bit. Maybe I want to say "doesn't strongly imply"? It would have been better if ARC Evals had noted that the conclusion isn't entirely obvious. It doesn't seem like a huge error to me, but maybe I'm underestimating the ripple effects etc.

Reply from Ryan Greenblatt (1y):
Thanks for the response; one quick clarification in case this isn't clear. On:
> For instance, I think that well implemented RSPs required by a regulatory agency can reduce risk to <5% (partially by stopping in worlds where this appears needed).
> I assume this would be a crux with Connor/Gabe (and I think I'm at least much less confident in this than you appear to be).
It's worth noting here that I'm responding to this passage from the text: "In a saner world, all AGI progress should have already stopped. If we don’t, there’s more than a 10% chance we a…
… (read more)

Joe Collman (1y):
I agree with most of this, but I think the "Let me call this for what it is: lying for personal gain" section is silly and doesn't help your case. The only sense in which it's clear that it's "for personal gain" is that it's lying to get what you want. Sure, I'm with you that far - but if what someone wants is [a wonderful future for everyone], then that's hardly what most people would describe as "for personal gain". By this logic, any instrumental action taken towards an altruistic goal would be "for personal gain". That's just silly. It's unhelpful too, sinc… (read more)

Reply from Matthew "Vaniver" Gray (1y):
> The only sense in which it's clear that it's "for personal gain" is that it's lying to get what you want. Sure, I'm with you that far - but if what someone wants is [a wonderful future for everyone], then that's hardly what most people would describe as "for personal gain".
If Alice lies in order to get influence, with the hope of later using that influence for altruistic ends, it seems fair to call the influence Alice gets 'personal gain'. After all, it's her sense of altruism that will be promoted, not a generic one.

On "RSPs are pauses done right" - Joe Collman (1y):
Thanks, this seems very reasonable. I'd missed your other comment. (Oh, and I edited my previous comment for clarity: I guess you were disagreeing with my clumsily misleading wording, rather than with what I meant(??))

Reply from Ryan Greenblatt (1y):
Corresponding comment text: I think I disagree with what you meant, but not that strongly. It's not that important, so I don't really want to get into it.
Basically, I don't think that "well-defined" is that important (not obviously required for some ability to judge the finished work), and I don't think "re-direction frequency" is the right way to think about it.

Joe Collman (1y):
This is clarifying, thanks. A few thoughts:
> "Serial speed is key"
This makes sense, but seems to rely on the human spending most of their time tackling well-defined but non-trivial problems where an AI doesn't need to be re-directed frequently [EDIT: the preceding was poorly worded - I meant that if, prior to the availability of AI assistants, this were true, it'd allow a lot of speedup as the AIs take over this work; otherwise it's less clearly so helpful]. Perhaps this is true for ARC - that's encouraging (though it does again make me wonder why they don't em… (read more)

Reply from Ryan Greenblatt (1y):
> "So coordination to do better than this would be great." I'd be curious to know what you'd want to aim for here - both in a mostly ideal world, and what seems most expedient.
As far as the ideal, I happened to write something about this in another comment yesterday. Excerpt: Best: we first prevent hardware progress and stop H100 manufacturing for a bit, then we prevent AI algorithmic progress, and then we stop scaling (ideally in that order). Then, we heavily invest in long-run safety research agendas and hold the pause for a long time (20 years sounds … (read more)

Joe Collman (1y):
[It turns out I have many questions - please consider this a pointer to the kind of information I'd find useful, rather than a request to answer them all!]
> around human level AI systems while using these AI systems to accelerate software based R&D (e.g. alignment research) by >30x
Can you point to what makes you think this is likely?
(Or why it seems the most promising approach?) In particular, I worry when people think much in terms of "doublings of total R&D effort", given that I'd expect AI assistance progress multipliers to vary hugely - with the… (read more)

Reply from Ryan Greenblatt (1y):
I'm not going to respond to everything you're saying here right now. It's pretty likely I won't end up responding to everything you're saying at any point, so apologies for that. Here are some key claims I want to make:
- Serial speed is key: Speeding up theory work (like e.g. ARC theory) by 5-10x should be quite doable with human-level AIs, due to AIs running at much faster serial speeds. This is a key difference between adding AIs and adding humans. Theory can be hard to parallelize, which makes adding humans look worse than increasing speed. I'm not confide… (read more)

On "RSPs are pauses done right" - Joe Collman (1y):
Fully agree with almost all of this. Well said. One nitpick of potentially world-ending importance:
> In the absence of robust measures and evaluations—those which give us high confidence about the safety of AI systems
Giving us high confidence is not the bar - we also need to be correct in having that confidence. In particular, we'd need to be asking: "How likely is it that the process we used to find these measures and evaluations gives us [actually sufficient measures and evaluations] before [insufficient measures and evaluations that we're confident are suff… (read more)

Joe Collman (1y):
> I think we at least do know how to do effective capabilities evaluations
This seems an overstatement to me: where the main risk is misuse, we'd need to know that those doing the testing have methods for eliciting capabilities that are as effective as anything people will come up with later.
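The "serial speed is key" point above - theory work is hard to parallelize, so one faster serial worker can beat many added workers - is essentially Amdahl's-law arithmetic. A small illustration; the fraction used here is a hypothetical number chosen for the sketch, not a figure from the comment:

```python
# Amdahl's law: if a fraction p of the work parallelizes, n workers give a
# speedup of 1 / ((1 - p) + p / n). A worker running k times faster serially
# speeds up *all* of the work by k, parallelizable or not.

def speedup_from_workers(p, n):
    """Speedup from n parallel workers when fraction p of the work parallelizes."""
    return 1.0 / ((1.0 - p) + p / n)

def speedup_from_serial(k):
    """Speedup from one worker that is k times faster (applies to all the work)."""
    return float(k)

# Hypothetical: suppose only half of theory work parallelizes.
print(round(speedup_from_workers(0.5, 100), 2))  # 1.98: 100 workers cap out below 2x
print(speedup_from_serial(5.0))                  # 5.0: a 5x-faster serial worker gives the full 5x
```

This is why "doublings of total R&D effort" can mislead: adding workers and adding serial speed are very different levers when the serial fraction dominates.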
(including the most artful AutoGPT 3.0 setups, etc.) It seems reasonable to me to claim that "we know how to do effective [capabilities given SotA elicitation methods] evaluations", but that doesn't answer the right question. Once the main risk isn't misuse, then we have to w… (read more)

Joe Collman (1y):
Strongly agree with almost all of this. My main disagreement is that I don't think the "What would a good RSP look like?" description is sufficient without explicit conditions beyond evals. In particular, we should expect that our suite of tests will be insufficient at some point, absent hugely improved understanding - and we shouldn't expect to understand how and why it's insufficient before reality punches us in the face. Therefore, it's not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them… (read more)

Reply from Evan Hubinger (1y):
> Therefore, it's not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them to work as intended]. We also need strong evidence that there'll be no catastrophe-inducing problems we didn't think of. (not [none that SotA methods could find], not [none discoverable with reasonable cost]; none) This can't be implicit, since it's a central way that we die. If it's hard/impractical to estimate, then we should pause until we can estimate it more accurately
This is the kind of thing that I expect to be omitted fro… (read more)

Joe Collman (1y):
Thanks for writing this up. I agree that the issue is important, though I'm skeptical of RSPs so far, since we have one example and it seems inadequate - to the extent that I'm positively disposed, it's almost entirely down to personal encounters with Anthropic/ARC people, not least yourself. I find it hard to reconcile the thoughtfulness/understanding of the individuals with the tone/content of the Anthropic RSP.
(Of course I may be missing something in some cases.) Going only by the language in the blog post and the policy, I'd conclude that they're an excu… (read more)

Reply from Evan Hubinger (1y):
I'm mostly not going to comment on Anthropic's RSP right now, since I don't really want this post to become about Anthropic's RSP in particular. I'm happy to talk in more detail about Anthropic's RSP, maybe in a separate top-level post dedicated to it, but I'd prefer to keep the discussion here focused on RSPs in general.
> One of my main worries with RSPs is that they'll be both [plausibly adequate as far as governments can tell] and [actually inadequate]. That's much worse than if they were clearly inadequate.
I definitely share this worry. But that's par… (read more)