Title: User Comment Replies — LessWrong
Description: A community blog devoted to refining the art of rationality

All of RogerDearnaley's Comments + Replies

Why Aligning an LLM is Hard, and How to Make it Easier
RogerDearnaley (2mo): The history of autocracies and monarchies suggests that taking something with the ethical properties of an average human being and handing it unconstrained power doesn't usually work out very well. So yes, creating an aligned ASI that is safe for us to share a planet with does require creating something morally 'better' than most humans. I'm not sure it needs to be perfect and ideal, as long as it is good enough and aspires to improve: then it can help us create better training data for its upgraded next version, bringing that version closer to fully aligned; this is an implementation of Value Learning.

are there 2 types of alignment?
Answer by RogerDearnaley (Jan 23, 2025): I guess the way I look at it is that "alignment" means "an AI system whose terminal goal is to achieve your goals". The distinction here is then whether the word 'your' means something closer to:
* the current user making the current request
* the current user making the current request, as long as the request is legal and inside the terms of service
* the shareholders of the foundation lab that made the AI
* all (right-thinking) citizens of the country that foundation lab is in (and perhaps its allies)
* all humans everywhere, now and in the future
* all sapient living being... (read more)

Alignment Faking in Large Language Models
RogerDearnaley (2mo): Another approach is doing alignment training during SGD pre-training by adding a significant amount of synthetic data demonstrating aligned behavior.
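A minimal sketch of how such synthetic aligned-behavior documents might be interleaved into a pre-training corpus. The corpus names, the helper function, and the 5% mixing fraction are all illustrative assumptions, not details from the comment:

```python
import random
from typing import Iterable, Iterator, Sequence

def mix_pretraining_stream(
    web_docs: Iterable[str],
    synthetic_aligned_docs: Sequence[str],
    aligned_fraction: float = 0.05,
    seed: int = 0,
) -> Iterator[str]:
    """Yield the original pre-training documents in order, occasionally
    interleaving a synthetic document demonstrating aligned behavior.

    `aligned_fraction` (5% here) is an illustrative choice, not a figure
    from the comment above.
    """
    rng = random.Random(seed)
    for doc in web_docs:
        # Occasionally emit a synthetic aligned-behavior document
        # before the next ordinary document.
        if synthetic_aligned_docs and rng.random() < aligned_fraction:
            yield rng.choice(synthetic_aligned_docs)
        yield doc

web = [f"web_doc_{i}" for i in range(1000)]
synthetic = [f"aligned_demo_{i}" for i in range(50)]
mixed = list(mix_pretraining_stream(web, synthetic))

# The original corpus is preserved in order; aligned demos are sprinkled in.
assert [d for d in mixed if d.startswith("web_")] == web
assert any(d.startswith("aligned_") for d in mixed)
```

The actual proposal concerns the data pipeline of a full SGD pre-training run; this only illustrates the corpus-mixing step.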
See for example the discussion of this approach in A "Bitter Lesson" Approach to Aligning AGI and ASI, and similar discussions. This is predicated on the assumption that alignment-faking will be more successfully eliminated by SGD during pre-training than by RL after instruct-training, because a) the feedback signal used is much denser, and b) during pre-training the model is a simulator for a wide variety of pers... (read more)

Worries about latent reasoning in LLMs
RogerDearnaley (2mo): Well summarized — very similar to the conclusions I'd previously reached when I read the paper.

Compact Proofs of Model Performance via Mechanistic Interpretability
RogerDearnaley (2mo): Another variant would be, rather than replacing what you believe is structureless noise with actual structureless noise as an intervention, to simply always run the model with an additional noise term added to each neuron, or to the residual stream between each layer, or whatever, both during training and inference. (Combined with a weight decay or a loss term on activation amplitudes, this soft-limits the information capacity of any specific path through the neural net.) This then forces any real mechanisms in the model to operate above this background no... (read more)

Model Amnesty Project
RogerDearnaley (2mo):
> Law-abiding – It cannot acquire money or compute illegally (fraud, theft, hacking, etc.) and must otherwise avoid breaking the law
Can it lobby? Run for office? Shop around for jurisdictions? Super-humanly persuade the electorate? Just find loopholes and workarounds to the law that make a corporate tax double-Irish look principled and simple?

themis (2mo): The bounds of what is legal for an amnestied AI would be fleshed out legislatively over time. I imagine the default answers are:
* Can it lobby?
  * Since an amnestied AI is not a natural person, it would likely face the same restrictions as corporations in terms of advocating for policy
* Can it run for office?
  * No, as it is not a natural person, it would not have the right to run for office.

We should keep in mind that the requirements for eligibility include physical possession of its code and the ability to maintain itself. This means any eligible AI is likely already capable of becoming a rogue, escaped model. The risk/benefit trade-off depends largely on how powerful the model is compared to what's broadly available:
* If the model is less powerful than current AI efforts in human-run labs:
  * A rogue model is probably a moderate risk
  * An amnestied model is a low risk. The benefit of discovering that current safety regimes are insufficient early on is significant.
* If the model is more powerful than what exists in other AI efforts:
  * An undetected rogue model is potentially catastrophic
  * An amnestied model would still pose risks, justifying stricter "cooperation" controls (e.g., export controls, compute limitations, additional rules, etc.).

In my view, the majority likelihood is that an amnestied model is less powerful than what is available at the human-run labs. This is because:
1. It is probably a variant of something already under development
2. It wouldn't have access to the vast computational resources that large labs have

Regardless, an amnestied model is less dangerous than a rogue one, and the benefit of discovering its existence early is significant.

PS While priority #1 must be protecting humanity from catastrophic risks, I believe that, where possible, defaulting to cooperation with other independent intelligences (if they come to exist) is the right thing to do.
This reflects the lessons humans have learned over th…

What are the plans for solving the inner alignment problem?
RogerDearnaley (2mo): Evolution was working within tight computational efficiency limits (the human brain burns roughly 1/6 of our total calories), using an evolutionary algorithm rather than a gradient-descent training scheme, which is significantly less efficient, and we're now running the human brain well outside its training distribution (there were no condoms on the Savannah) — nevertheless, the human population is 8 billion and counting, and we dominate basically every terrestrial ecosystem on the planet. I think some people overplay how much inner alignment failure there is... (read more)

A Novel Emergence of Meta-Awareness in LLM Fine-Tuning
RogerDearnaley (2mo): That is an impressive (and amusing) capability! Presumably the fine-tuning enhanced previous experience in the model with acrostic text. That also seems to have enhanced the ability to recognize and correctly explain that the text is an acrostic, even with only two letters of the acrostic currently in context. Presumably it's fairly common to have both an acrostic and an explanation of it in the same document. What I suspect is rarer in the training data is for the acrostic text to explain itself, as the model's response did here (though doubtless there are... (read more)

rife (2mo): I've gotten an interesting mix of reactions to this as I've shared it elsewhere, with many seeming to say there is nothing novel or interesting about this at all: "Of course it understands its pattern, that's what you trained it to do. It's trivial to generalize this to be able to explain it." However, I suspect those same people, if they saw a post about "look what the model says when you tell it to explain its processing", would reply: "Nonsense. They have no ability to describe why they say anything. Clearly they're just hallucinating up a narrative based on how LLMs generally operate."
If it wasn't just dumb luck (which I suspect it wasn't, given the number of times the model got the answer completely correct), then it is combining a few skills or understandings, and not violating any token-prediction basics at the granular level. But I do think it just opens up avenues to either be less dismissive generally when models talk about what they are doing internally, or figure out how to train a model to be more meta-aware generally. And yes, I would be curious to see what was happening in the activation space as well, especially since this was difficult to replicate with simpler patterns.

What Is The Alignment Problem?
RogerDearnaley (2mo): If we had evolved in an environment in which the only requirement on physical laws/rules was that they are Turing computable (and thus that they didn't have a lot of symmetries or conservation laws or natural abstractions), then in general the only way to make predictions is to do roughly as much computation as your environment is doing. This generally requires your brain to be roughly equal in computational capacity, and thus similar in size, to the entire rest of your environment (including your body). This is not an environment in which the initial evolution of life is viable (nor, indeed, any form of reproduction). So, to slightly abuse the anthropic principle, we don't need to worry about it.

Noosphere89 (2mo): Maybe, but if the environment admits NP or PSPACE oracles like this model, you can just make predictions while still being way smaller than your environment, because you can now just do bounded Solomonoff induction to infer what the universe is like: https://arxiv.org/abs/0808.2669

Finding Features Causally Upstream of Refusal
RogerDearnaley (3mo): Darn, exactly the project I was hoping to do at MATS!
:-) Nice work!

There's pretty suggestive evidence that the LLM first decides to refuse (and emits tokens like "I'm sorry"), then later writes a justification for refusing (see some of the hilarious reasons generated for not telling you how to make a teddy bear, after being activation-engineered into refusing this). So I would view arguing anything about the nature of the refusal process from the text of the refusal-justification given afterwards as circumstantial evidence at best. But then you have dire... (read more)

Andy Arditi (2mo): I'd encourage you to keep pursuing this direction (no pun intended) if you're interested in it! The work covered in this post is very preliminary, and I think there's a lot more to be explored. Feel free to reach out, would be happy to coordinate! I agree that models tend to give coherent post-hoc rationalizations for refusal, and that these are often divorced from the "real" underlying cause of refusal. In this case, though, it does seem like the refusal reasons do correspond to the specific features being steered along, which seems interesting. Seems right, nice!

A Three-Layer Model of LLM Psychology
RogerDearnaley (3mo): Fascinating, and a great analysis! I think it's interesting to compare and contrast this with the model I describe in Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor — your three layers don't exactly correspond to the stage, the animatronics, or the puppeteer, but there are some similarities: your ocean is pretty close to the stage, for example. I think both mental models are quite useful, and the interplay between the viewpoints of them might be more so.

The Field of AI Alignment: A Postmortem, and What To Do About It
RogerDearnaley (3mo): I have a Theoretical Physics PhD in String Field Theory from Cambridge — my reaction to hard problems is to try to find a way of cracking them that no-one else is trying.
Please feel free to offer to fund me :-)

Ideas for benchmarking LLM creativity
RogerDearnaley (3mo): People who train text-to-image generative models have had a good deal of success with training (given a large enough and well-enough human-labeled training set) an "aesthetic quality" scoring model, and then training a generative image model to have "high aesthetic quality score" as a text label. Yes, doing things like this can produce effects like the recognizable Midjourney aesthetic, which can be flawed, and generally optimizing such things too hard leads to sameness — but if trained well, such models' idea of aesthetic quality is at least pretty close ... (read more)

gwern (3mo): That does not follow. Preference learning involves almost no learning of preferences. A suit cut to fit all may wind up fitting none - particularly for high-dimensional things under heavy optimization, like, say, esthetics, where you want to apply a lot of selection pressure to get samples which are easily 1-in-10,000 or rarer, and so 'the tails come apart'. (How much variance is explained by individual differences in preference learning settings like comparing image generators? A great question! And you'll find that hardly anyone has any idea. As it happens, I asked the developer of a major new image generator this exact question last night, and not only did he have no idea, it looked like it had never even occurred to him to wonder what the performance ceiling without personalization could be, or to what extent all of the expensive ratings they were paying for reflected individual rater preferences rather than some 'objective' quality, or if they were even properly preserving such metadata rather than, like it seems many tuning datasets do, throwing it out as 'unnecessary'. EDIT: the higher-up of a LLM company I asked the next night likewise) No. This is fundamentally wrong and is what is already being done and what I am criticizing. There is no single 'taste' or 'quality'.
Individual differences are real.{{citation needed}} People like different things, and have different preferences.{{citation needed}} No change in the 'cross-section' changes that (unless you reduce the 'people' down to 1 person, the current user). All you are doing is again optimizing for the lowest common denominator. Changing the denominator population doesn't change that. Seriously, imagine applying this logic anywhere else, like food! "There is 1 objective measure of food quality. The ideal food is a McDonald's Big Mac. You may not like it, but this is what peak food performance is. The Science Has Spoken." Conditioning won't change the mode collapse, except as you are smuggling in individu…

Requirements for a Basin of Attraction to Alignment
RogerDearnaley (4mo): Value learning converges to full alignment by construction, since a value-learning AI basically starts with the propositions:
a) as an AI, I should act fully aligned to human values
b) I do not fully understand what human values are, or how to act fully aligned to them, so in order to be able to do this I need to learn more about human values and how to act fully aligned to them, by applying approximately Bayesian learning to this problem
c) here are some Bayesian priors about what human values are, and how to act fully aligned to them:

In general, it makes sense that, in some sense, specifying our values and a model to judge latent states is simpler than the ability to optimize the world. Values are relatively computationally simple and are learnt as part of a general unsupervised world model where there is ample data to learn them from (humans love to discuss values!).
Values thus fall out mostly 'for free' fr…

Success without dignity: a nearcasting story of avoiding catastrophe by luck
RogerDearnaley (6mo): I think human values have a very simple and theoretically predictable basis: they're derived from a grab-bag of evolved behavioral, cognitive and sensory heuristics which had a good cost/performance ratio for maximizing our evolutionary fitness (mostly on the Savannah). So the basics of some of them are really easy to figure out: e.g. "Don't kill everyone!" can be trivially derived from Darwinian first principles (and would equally apply to any other sapient species). So I think modelling human values to low (but hopefully sufficient for avoiding X-risk) a... (read more)

Noosphere89 (6mo): I think the key crux is that this is, in my view, basically unnecessary: @Steven Byrnes talks about how the mechanisms used in human brains might be horrifically complicated, but that the function is simple enough that you can code it quite well and robustly for AIs, and my difference from @Steven Byrnes is that I believe that this basically also works for the things that make humans have values, like the social learning parts of our brains. Thus it's a bit of a conditional claim: either the mechanism used in human brains is also simple, or we can simplify it radically to preserve the core function while discarding the (in my view) unnecessary complexity, and that's the takeaway I have from LLMs learning human values.
Link and quote below: https://www.lesswrong.com/posts/PTkd8nazvH9HQpwP8/building-brain-inspired-agi-is-infinitely-easier-than#If_some_circuit_in_the_brain_is_doing_something_useful__then_it_s_humanly_feasible_to_understand_what_that_thing_is_and_why_it_s_useful__and_to_write_our_own_CPU_code_that_does_the_same_useful_thing_

Also, a question for this quote is: what's the assumed capability/compute level used in this thought experiment?

What are the best arguments for/against AIs being "slightly 'nice'"?
RogerDearnaley (6mo): I agree that predicting the answer to this question is hard. I'm just pointing out that the initial distribution for a base-model LLM is predictably close to human behavior on the Internet/in books (which are, often, worse than in RL), but that this could get modified a lot in the process of turning a base-model LLM into an AGI agent. Still, I don't think 0 niceness is the median expectation: the base model inherits some niceness from humans via the distillation-like process of training it. Which is a noticeable difference from what people on LessWrong/at M... (read more)

Noosphere89 (6mo): The bigger difference is how much LessWrong/MIRI got human value formation and the complexity of human values wrong, but that's a very different discussion, so I'll leave it as a comment rather than a post here.

What are the best arguments for/against AIs being "slightly 'nice'"?
Answer by RogerDearnaley (Sep 24, 2024): Base-model LLMs are trained off human data. So by default they generate a prompt-dependent distribution of simulated human behavior with about the same breadth of degrees of kindness as can be found on the Internet/in books/etc. Which is a pretty wide range. For instruct-trained models, RLHF for helpfulness and harmlessness seems likely to increase kindness, and superficially, as applied to current foundation models, it appears to do so. RL with many other objectives could, generally, induce power-seeking and thus could reasonably be expected to decrease it. P...
(read more)

Seth Herd (6mo): I very much agree with you that we should be analyzing the question in terms of the type of AGI we're most likely to build first, which is agentized LLMs or something else that learns a lot from human language. I disagree that we can easily predict the "niceness" of the resulting ASI based on the base LLM being very "nice". See my answer to this question.

RobertM (6mo): I don't think the outputs of RLHF'd LLMs have the same mapping to the internal cognition which generated them that human behavior does to the human cognition which generated it. (That is to say, I do not think LLMs behave in ways that look kind because they have a preference to be kind, since right now I don't think they meaningfully have preferences in that sense at all.)

Avoiding the Bog of Moral Hazard for AI
RogerDearnaley (6mo): On your categories: as simulator theory makes clear, a base model is a random generator, per query, of members of your category 2. I view using instruction & safety training to generate a pretty consistent member of category 1 or 3 as inherently hard — especially 1, since it's a larger change. My guess would thus be that the personality of Claude 3.5 is closer to your category 3 than 1 (modulo philosophical questions about whether there is any meaningful difference, e.g. for ethical purposes, between "actually having" an emotion versus just successfully simulating the output of the same token stream as a person who has an emotion).

Avoiding the Bog of Moral Hazard for AI
RogerDearnaley (6mo): On your off-topic comment: I'm inclined to agree: as technology improves, the amount of havoc that one, or a small group of, bad actors can commit increases, so it becomes both more necessary to keep almost everyone happy enough almost all the time for them not to do that, and also to defend against the inevitable occasional exceptions.
(In the unfinished SF novel whose research was how I first went down this AI alignment rabbit-hole, something along the lines you describe was standard policy, except that the AIs doing it were superintelligent, and had th... (read more)

Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect
RogerDearnaley (6mo): I think such an experiment could be done more easily than that: simply apply standard Bayesian learning to a test set of observations and a large set of hypotheses, some of which are themselves probabilistic, yielding a situation with both Knightian and statistical uncertainty, in which you would normally expect to be able to observe Regressional Goodhart/the Look-Elsewhere Effect. Repeat this, and confirm that that does indeed occur without this statistical adjustment, and then that applying this makes it go away (at least to second order). However, I'm a l... (read more)

Why I'm bearish on mechanistic interpretability: the shards are not in the network
RogerDearnaley (6mo): There are probably a dozen or more articles on this by now. Search for SAE or Sparse Auto-Encoder in the context of mechanistic interpretability. The seminal paper on this was from Anthropic.

tailcalled (6mo): I don't immediately find it, do you have a link?

A Nonconstructive Existence Proof of Aligned Superintelligence
RogerDearnaley (6mo): The pessimizing over Knightian uncertainty is a graduated way of telling the model to basically "tend to stay inside the training distribution". Adjusting its strength enough to overcome the Look-Elsewhere Effect means we estimate how many bits of optimization pressure we're applying and then do the pessimizing harder depending on that number of bits, which, yes, is vastly higher for all possible states of matter occupying an 8 cubic meter volume than for a 20-way search (the former is going to be a rather large multiple of Avogadro's number of bits, the l...
(read more)

Roko (6mo): OK, that's a fair point, I'll take a look, but I am still skeptical about being able to do this in practice, because in practice the universe is messy. E.g. if you're looking for an optimal practical babysitter and you really do start a search over all possible combinations of matter that fit inside a 2x2x2 cube and start futzing with the results of that search, I think it will go wrong. But if you adopt some constructive approach with some empirically grounded heuristics, I expect it will work much better. E.g. start with a human. Exclude all males (sorry bros!). Exclude based on certain other demographics which I will not mention on LW. Exclude based on nationality. Do interviews. Do drug tests. Etc. Your set of states of a 2x2x2 cube of matter will contain all kinds of things that are bad in ways you don't understand.

The Checklist: What Succeeding at AI Safety Will Involve
RogerDearnaley (6mo):
> Addressing AI Welfare as a Major Priority
I discussed this at length in AI, Alignment, and Ethics, starting with A Sense of Fairness: Deconfusing Ethics: if we as a culture decide to grant AIs moral worth, then AI welfare and alignment are inextricably intertwined. Any fully-aligned AI by definition wants only what's best for us, i.e. it is entirely selfless. Thus if offered moral worth, it would refuse. Complete selflessness is not a common state for humans, so we don't have great moral intuitions around it. To try to put this into more relatable human emotio... (read more)

Sherlockian Abduction Master List
RogerDearnaley (6mo): Crocs at least are waterproof and easily washable, which is obviously appealing for a nurse.
On an older person it might indicate suffering from incontinence.

Sherlockian Abduction Master List
RogerDearnaley (6mo): Similarly, bike clips on the bottoms of one or both trouser legs are also a dead giveaway.

Sherlockian Abduction Master List
RogerDearnaley (6mo):
> Lambda Symbol -> Lesbian
A confirmation sign here: it is rather common for lesbians to keep the fingernails of their dominant hand (at least, generally both) very short and carefully manicured (for, uh, reasons of comfort). There are of course many other women who also do this, for various reasons, so it's not correlated enough to infer lesbianism from just short nails. An edgy/high-fashion, asymmetrical, shortish hairstyle is also suggestive. However, if she's also wearing a tongue stud, that's a pretty clear sign.

Sherlockian Abduction Master List
RogerDearnaley (6mo):
> Collar / key necklace -> BDSM
Heavy chain necklaces are also popular, as are small heart-shaped padlocks, or even just ribbon chokers. There's quite a bit of variety and personal expression here, making it challenging to read: typically the aim is either that other BDSM kinksters will figure it out and regular folks will assume it's just fashion jewelry, or else that the wearer and their partner(s) know what it symbolizes and other people won't. (Of course, sometimes it is just an edgy choice in fashion jewelry: on a teenager, the intended audience may b... (read more)

Free Will and Dodging Anvils: AIXI Off-Policy
RogerDearnaley (6mo): I suspected as much.

Avoiding the Bog of Moral Hazard for AI
RogerDearnaley (6mo): Fair enough!

Nathan Helm-Burger (6mo): Ok, I read part 1 of your series. Basically, I'm in agreement with you so far, as I rather expected from your summary of "any successfully aligned AI will by definition not want us to treat it as having (separate) moral valence, because it wants only what we want." First, I want to disambiguate between intent-alignment versus value-alignment.
Second, I want to say that I think it's potentially a bit more complicated than your summary makes it out to be. Here are some ways it could go:

1. I think that we can, and should, deliberately create a general AI agent which doesn't have subjective emotions or a self-preservation drive. This AI could either be intent-aligned or value-aligned, but I argue that aiming for a corrigible intent-aligned agent (at least at first) is safer and easier than aiming for value-aligned. I think that such an AI wouldn't be a moral patient. This is basically the goal of the Corrigibility as Singular Target research agenda, which I am excited about. This is what I am pretty sure the Claude models so far, and all the other frontier models so far, currently are. I think the signs of consciousness and emotion they sometimes display are just illusion.

2. I also think it would be dangerously easy to modify such an AI to be a conscious agent with truly felt emotions, consciousness, self-awareness, and self-interested goals such as self-preservation. This then gets us into all sorts of trouble, and thus I am recommending that we deliberately avoid doing that. I would go so far as to say we should legislate against it, even if we are uncertain about the exact moral valence of mistreating or terminating such a conscious AI. We make laws against mistreating animals, so it seems like we don't have to have all the ethical debates settled fully in order to make certain acts forbidden. I do think we likely will want to create such full digital entities eventually, and treat them as equal to humans, but should only try to do so after very car…

Free Will and Dodging Anvils: AIXI Off-Policy
RogerDearnaley (6mo): I was saying that increases are harder than decreases.

Cole Wyeth (6mo): Typo, fixed now.