Title: User Comment Replies — LessWrong
Description: A community blog devoted to refining the art of rationality

All of RogerDearnaley's Comments + Replies

Why Aligning an LLM is Hard, and How to Make it Easier
RogerDearnaley (2mo): The history of autocracies and monarchies suggests that taking something with the ethical properties of an average human being and handing it unconstrained power doesn't usually work out very well. So yes, creating an aligned ASI that is safe for us to share a planet with does require creating something morally 'better' than most humans. I'm not sure it needs to be perfect and ideal, as long as it is good enough and aspires to improve: then it can help us create better training data for its upgraded next version, bringing that version closer to fully aligned; this is an implementation of Value Learning.

are there 2 types of alignment?
Answer by RogerDearnaley (Jan 23, 2025): I guess the way I look at it is that "alignment" means "an AI system whose terminal goal is to achieve your goals". The distinction here is then whether the word 'your' means something closer to:
* the current user making the current request
* the current user making the current request, as long as the request is legal and inside the terms of service
* the shareholders of the foundation lab that made the AI
* all (right-thinking) citizens of the country that foundation lab is in (and perhaps its allies)
* all humans everywhere, now and in the future
* all sapient living being... (read more)

Alignment Faking in Large Language Models
RogerDearnaley (2mo): Another approach is doing alignment training during SGD pre-training by adding a significant amount of synthetic data demonstrating aligned behavior.
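A minimal sketch of how such synthetic aligned-behavior documents might be interleaved into a pre-training corpus. The corpus names, the helper function, and the 5% mixing fraction are all illustrative assumptions, not details from the comment:

```python
import random
from typing import Iterable, Iterator, Sequence

def mix_pretraining_stream(
    web_docs: Iterable[str],
    synthetic_aligned_docs: Sequence[str],
    aligned_fraction: float = 0.05,
    seed: int = 0,
) -> Iterator[str]:
    """Yield the original pre-training documents in order, occasionally
    interleaving a synthetic document demonstrating aligned behavior.

    `aligned_fraction` (5% here) is an illustrative choice, not a figure
    from the comment above.
    """
    rng = random.Random(seed)
    for doc in web_docs:
        # Occasionally emit a synthetic aligned-behavior document
        # before the next ordinary document.
        if synthetic_aligned_docs and rng.random() < aligned_fraction:
            yield rng.choice(synthetic_aligned_docs)
        yield doc

web = [f"web_doc_{i}" for i in range(1000)]
synthetic = [f"aligned_demo_{i}" for i in range(50)]
mixed = list(mix_pretraining_stream(web, synthetic))

# The original corpus is preserved in order; aligned demos are sprinkled in.
assert [d for d in mixed if d.startswith("web_")] == web
assert any(d.startswith("aligned_") for d in mixed)
```

The actual proposal concerns the data pipeline of a full SGD pre-training run; this only illustrates the corpus-mixing step.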
See for example the discussion of this approach in A "Bitter Lesson" Approach to Aligning AGI and ASI, and similar discussions. This is predicated on the assumption that alignment-faking will be more successfully eliminated by SGD during pre-training than by RL after instruct-training, because a) the feedback signal used is much denser, and b) during pre-training the model is a simulator for a wide variety of pers... (read more)

Worries about latent reasoning in LLMs
RogerDearnaley (2mo): Well summarized — very similar to the conclusions I'd previously reached when I read the paper.

Compact Proofs of Model Performance via Mechanistic Interpretability
RogerDearnaley (2mo): Another variant would be, rather than replacing what you believe is structureless noise with actual structureless noise as an intervention, to simply always run the model with an additional noise term added to each neuron, or to the residual stream between each layer, or whatever, both during training and inference. (Combined with a weight decay or a loss term on activation amplitudes, this soft-limits the information capacity of any specific path through the neural net.) This then forces any real mechanisms in the model to operate above this background no... (read more)

Model Amnesty Project
RogerDearnaley (2mo):
> Law-abiding – It cannot acquire money or compute illegally (fraud, theft, hacking, etc.) and must otherwise avoid breaking the law
Can it lobby? Run for office? Shop around for jurisdictions? Super-humanly persuade the electorate? Just find loopholes and workarounds to the law that make a corporate tax double-Irish look principled and simple?

themis (2mo): The bounds of what is legal for an amnestied AI would be fleshed out legislatively over time. I imagine the default answers are:
* Can it lobby?
  * Since an amnestied AI is not a natural person, it would likely face the same restrictions as corporations in terms of advocating for policy
* Can it run for office?
  * No, as it is not a natural person, it would not have the right to run for office.

We should keep in mind that the requirements for eligibility include physical possession of its code and the ability to maintain itself. This means any eligible AI is likely already capable of becoming a rogue, escaped model. The risk/benefit trade-off depends largely on how powerful the model is compared to what's broadly available:
* If the model is less powerful than current AI efforts in human-run labs:
  * A rogue model is probably a moderate risk
  * An amnestied model is a low risk. The benefit of discovering that current safety regimes are insufficient early on is significant.
* If the model is more powerful than what exists in other AI efforts:
  * An undetected rogue model is potentially catastrophic
  * An amnestied model would still pose risks, justifying stricter "cooperation" controls (e.g., export controls, compute limitations, additional rules, etc.).

In my view, the majority likelihood is that an amnestied model is less powerful than what is available at the human-run labs. This is because:
1. It is probably a variant of something already under development
2. It wouldn't have access to the vast computational resources that large labs have

Regardless, an amnestied model is less dangerous than a rogue one, and the benefit of discovering its existence early is significant.

PS While priority #1 must be protecting humanity from catastrophic risks, I believe that, where possible, defaulting to cooperation with other independent intelligences (if they come to exist) is the right thing to do.
This reflects the lessons humans have learned over th…

What are the plans for solving the inner alignment problem?
RogerDearnaley (2mo): Evolution was working within tight computational efficiency limits (the human brain burns roughly 1/6 of our total calories), using an evolutionary algorithm rather than a gradient-descent training scheme, which is significantly less efficient, and we're now running the human brain well outside its training distribution (there were no condoms on the Savannah) — nevertheless, the human population is 8 billion and counting, and we dominate basically every terrestrial ecosystem on the planet. I think some people overplay how much inner alignment failure there is... (read more)

A Novel Emergence of Meta-Awareness in LLM Fine-Tuning
RogerDearnaley (2mo): That is an impressive (and amusing) capability! Presumably the fine-tuning enhanced previous experience in the model with acrostic text. That also seems to have enhanced the ability to recognize and correctly explain that the text is an acrostic, even with only two letters of the acrostic currently in context. Presumably it's fairly common to have both an acrostic and an explanation of it in the same document. What I suspect is rarer in the training data is for the acrostic text to explain itself, as the model's response did here (though doubtless there are... (read more)

rife (2mo): I've gotten an interesting mix of reactions to this as I've shared it elsewhere, with many seeming to say there is nothing novel or interesting about this at all: "Of course it understands its pattern, that's what you trained it to do. It's trivial to generalize this to be able to explain it." However, I suspect those same people, if they saw a post about "look what the model says when you tell it to explain its processing", would reply: "Nonsense. They have no ability to describe why they say anything. Clearly they're just hallucinating up a narrative based on how LLMs generally operate."
If it wasn't just dumb luck (which I suspect it wasn't, given the number of times the model got the answer completely correct), then it is combining a few skills or understandings, and not violating any token-prediction basics at the granular level. But I do think it just opens up avenues to either be less dismissive generally when models talk about what they are doing internally, or figure out how to train a model to be more meta-aware generally. And yes, I would be curious to see what was happening in the activation space as well, especially since this was difficult to replicate with simpler patterns.

What Is The Alignment Problem?
RogerDearnaley (2mo): If we had evolved in an environment in which the only requirement on physical laws/rules was that they are Turing computable (and thus that they didn't have a lot of symmetries or conservation laws or natural abstractions), then in general the only way to make predictions is to do roughly as much computation as your environment is doing. This generally requires your brain to be roughly equal in computational capacity, and thus similar in size, to the entire rest of your environment (including your body). This is not an environment in which the initial evolution of life is viable (nor, indeed, any form of reproduction). So, to slightly abuse the anthropic principle, we don't need to worry about it.

Noosphere89 (2mo): Maybe, but if the environment admits NP or PSPACE oracles like this model, you can just make predictions while still being way smaller than your environment, because you can now just do bounded Solomonoff induction to infer what the universe is like: https://arxiv.org/abs/0808.2669

Finding Features Causally Upstream of Refusal
RogerDearnaley (3mo): Darn, exactly the project I was hoping to do at MATS!
:-) Nice work!

There's pretty suggestive evidence that the LLM first decides to refuse (and emits tokens like "I'm sorry"), then later writes a justification for refusing (see some of the hilarious reasons generated for not telling you how to make a teddy bear, after being activation-engineered into refusing this). So I would view arguing anything about the nature of the refusal process from the text of the refusal-justification given afterwards as circumstantial evidence at best. But then you have dire... (read more)

Andy Arditi (2mo): I'd encourage you to keep pursuing this direction (no pun intended) if you're interested in it! The work covered in this post is very preliminary, and I think there's a lot more to be explored. Feel free to reach out, would be happy to coordinate! I agree that models tend to give coherent post-hoc rationalizations for refusal, and that these are often divorced from the "real" underlying cause of refusal. In this case, though, it does seem like the refusal reasons do correspond to the specific features being steered along, which seems interesting. Seems right, nice!

A Three-Layer Model of LLM Psychology
RogerDearnaley (3mo): Fascinating, and a great analysis! I think it's interesting to compare and contrast this with the model I describe in Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor — your three layers don't exactly correspond to the stage, the animatronics, or the puppeteer, but there are some similarities: your ocean is pretty close to the stage, for example. I think both mental models are quite useful, and the interplay between the viewpoints of them might be more so.

The Field of AI Alignment: A Postmortem, and What To Do About It
RogerDearnaley (3mo): I have a Theoretical Physics PhD in String Field Theory from Cambridge — my reaction to hard problems is to try to find a way of cracking them that no-one else is trying.
Please feel free to offer to fund me :-)

Ideas for benchmarking LLM creativity
RogerDearnaley (3mo): People who train text-to-image generative models have had a good deal of success with training (given a large enough and well-enough human-labeled training set) an "aesthetic quality" scoring model, and then training a generative image model to have "high aesthetic quality score" as a text label. Yes, doing things like this can produce effects like the recognizable Midjourney aesthetic, which can be flawed, and generally optimizing such things too hard leads to sameness — but if trained well, such models' idea of aesthetic quality is at least pretty close ... (read more)

gwern (3mo): That does not follow. Preference learning involves almost no learning of preferences. A suit cut to fit all may wind up fitting none - particularly for high-dimensional things under heavy optimization, like, say, esthetics, where you want to apply a lot of selection pressure to get samples which are easily 1-in-10,000 or rarer, and so 'the tails come apart'. (How much variance is explained by individual differences in preference learning settings like comparing image generators? A great question! And you'll find that hardly anyone has any idea. As it happens, I asked the developer of a major new image generator this exact question last night, and not only did he have no idea, it looked like it had never even occurred to him to wonder what the performance ceiling without personalization could be, or to what extent all of the expensive ratings they were paying for reflected individual rater preferences rather than some 'objective' quality, or if they were even properly preserving such metadata rather than, like it seems many tuning datasets do, throwing it out as 'unnecessary'. EDIT: the higher-up of a LLM company I asked the next night likewise) No. This is fundamentally wrong and is what is already being done and what I am criticizing. There is no single 'taste' or 'quality'.
Individual differences are real.{{citation needed}} People like different things, and have different preferences.{{citation needed}} No change in the 'cross-section' changes that (unless you reduce the 'people' down to 1 person, the current user). All you are doing is again optimizing for the lowest common denominator. Changing the denominator population doesn't change that. Seriously, imagine applying this logic anywhere else, like food! "There is 1 objective measure of food quality. The ideal food is a McDonald's Big Mac. You may not like it, but this is what peak food performance is. The Science Has Spoken." Conditioning won't change the mode collapse, except as you are smuggling in individu…

Requirements for a Basin of Attraction to Alignment
RogerDearnaley (4mo): Value learning converges to full alignment by construction, since a value-learning AI basically starts with the propositions:
a) as an AI, I should act fully aligned to human values
b) I do not fully understand what human values are, or how to act fully aligned to them, so in order to be able to do this I need to learn more about human values and how to act fully aligned to them, by applying approximately Bayesian learning to this problem
c) here are some Bayesian priors about what human values are, and how to act fully aligned to them:

In general, it makes sense that, in some sense, specifying our values and a model to judge latent states is simpler than the ability to optimize the world. Values are relatively computationally simple and are learnt as part of a general unsupervised world model where there is ample data to learn them from (humans love to discuss values!).
Values thus fall out mostly 'for free' fr…

Success without dignity: a nearcasting story of avoiding catastrophe by luck
RogerDearnaley (6mo): I think human values have a very simple and theoretically predictable basis: they're derived from a grab-bag of evolved behavioral, cognitive and sensory heuristics which had a good cost/performance ratio for maximizing our evolutionary fitness (mostly on the Savannah). So the basics of some of them are really easy to figure out: e.g. "Don't kill everyone!" can be trivially derived from Darwinian first principles (and would equally apply to any other sapient species). So I think modelling human values to low (but hopefully sufficient for avoiding X-risk) a... (read more)

Noosphere89 (6mo): I think the key crux is that this is, in my view, basically unnecessary: @Steven Byrnes talks about how the mechanisms used in human brains might be horrifically complicated, but that the function is simple enough that you can code it quite well and robustly for AIs, and my difference from @Steven Byrnes is that I believe that this basically also works for the things that make humans have values, like the social learning parts of our brains. Thus it's a bit of a conditional claim: either the mechanism used in human brains is also simple, or we can simplify it radically to preserve the core function while discarding the (in my view) unnecessary complexity, and that's the takeaway I have from LLMs learning human values.
Link and quote below: https://www.lesswrong.com/posts/PTkd8nazvH9HQpwP8/building-brain-inspired-agi-is-infinitely-easier-than#If_some_circuit_in_the_brain_is_doing_something_useful__then_it_s_humanly_feasible_to_understand_what_that_thing_is_and_why_it_s_useful__and_to_write_our_own_CPU_code_that_does_the_same_useful_thing_

Also, a question for this quote is: what's the assumed capability/compute level used in this thought experiment?

What are the best arguments for/against AIs being "slightly 'nice'"?
RogerDearnaley (6mo): I agree that predicting the answer to this question is hard. I'm just pointing out that the initial distribution for a base-model LLM is predictably close to human behavior on the Internet/in books (which are, often, worse than in RL), but that this could get modified a lot in the process of turning a base-model LLM into an AGI agent. Still, I don't think 0 niceness is the median expectation: the base model inherits some niceness from humans via the distillation-like process of training it. Which is a noticeable difference from what people on LessWrong/at M... (read more)

Noosphere89 (6mo): The bigger difference is how much LessWrong/MIRI got human value formation and the complexity of human values wrong, but that's a very different discussion, so I'll leave it as a comment rather than a post here.

What are the best arguments for/against AIs being "slightly 'nice'"?
Answer by RogerDearnaley (Sep 24, 2024): Base-model LLMs are trained off human data. So by default they generate a prompt-dependent distribution of simulated human behavior with about the same breadth of degrees of kindness as can be found on the Internet/in books/etc. Which is a pretty wide range. For instruct-trained models, RLHF for helpfulness and harmlessness seems likely to increase kindness, and superficially, as applied to current foundation models, it appears to do so. RL with many other objectives could, generally, induce power-seeking and thus could reasonably be expected to decrease it. P...
(read more)

Seth Herd (6mo): I very much agree with you that we should be analyzing the question in terms of the type of AGI we're most likely to build first, which is agentized LLMs or something else that learns a lot from human language. I disagree that we can easily predict the "niceness" of the resulting ASI based on the base LLM being very "nice". See my answer to this question.

RobertM (6mo): I don't think the outputs of RLHF'd LLMs have the same mapping to the internal cognition which generated them that human behavior does to the human cognition which generated it. (That is to say, I do not think LLMs behave in ways that look kind because they have a preference to be kind, since right now I don't think they meaningfully have preferences in that sense at all.)

Avoiding the Bog of Moral Hazard for AI
RogerDearnaley (6mo): On your categories: as simulator theory makes clear, a base model is a random generator, per query, of members of your category 2. I view using instruction & safety training to generate a pretty consistent member of category 1 or 3 as inherently hard — especially 1, since it's a larger change. My guess would thus be that the personality of Claude 3.5 is closer to your category 3 than 1 (modulo philosophical questions about whether there is any meaningful difference, e.g. for ethical purposes, between "actually having" an emotion versus just successfully simulating the output of the same token stream as a person who has an emotion).

Avoiding the Bog of Moral Hazard for AI
RogerDearnaley (6mo): On your off-topic comment: I'm inclined to agree: as technology improves, the amount of havoc that one, or a small group of, bad actors can commit increases, so it becomes both more necessary to keep almost everyone happy enough almost all the time for them not to do that, and also to defend against the inevitable occasional exceptions.
(In the unfinished SF novel whose research was how I first went down this AI alignment rabbit-hole, something along the lines you describe was standard policy, except that the AIs doing it were superintelligent, and had th... (read more)

Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect
RogerDearnaley (6mo): I think such an experiment could be done more easily than that: simply apply standard Bayesian learning to a test set of observations and a large set of hypotheses, some of which are themselves probabilistic, yielding a situation with both Knightian and statistical uncertainty, in which you would normally expect to be able to observe Regressional Goodhart/the Look-Elsewhere Effect. Repeat this, and confirm that that does indeed occur without this statistical adjustment, and then that applying this makes it go away (at least to second order). However, I'm a l... (read more)

Why I'm bearish on mechanistic interpretability: the shards are not in the network
RogerDearnaley (6mo): There are probably a dozen or more articles on this by now. Search for SAE or Sparse Auto-Encoder in the context of mechanistic interpretability. The seminal paper on this was from Anthropic.

tailcalled (6mo): I don't immediately find it, do you have a link?

A Nonconstructive Existence Proof of Aligned Superintelligence
RogerDearnaley (6mo): The pessimizing over Knightian uncertainty is a graduated way of telling the model to basically "tend to stay inside the training distribution". Adjusting its strength enough to overcome the Look-Elsewhere Effect means we estimate how many bits of optimization pressure we're applying and then do the pessimizing harder depending on that number of bits, which, yes, is vastly higher for all possible states of matter occupying an 8 cubic meter volume than for a 20-way search (the former is going to be a rather large multiple of Avogadro's number of bits, the l...
(read more)

Roko (6mo): OK, that's a fair point, I'll take a look, but I am still skeptical about being able to do this in practice, because in practice the universe is messy. E.g. if you're looking for an optimal practical babysitter and you really do start a search over all possible combinations of matter that fit inside a 2x2x2 cube and start futzing with the results of that search, I think it will go wrong. But if you adopt some constructive approach with some empirically grounded heuristics, I expect it will work much better. E.g. start with a human. Exclude all males (sorry bros!). Exclude based on certain other demographics which I will not mention on LW. Exclude based on nationality. Do interviews. Do drug tests. Etc. Your set of states of a 2x2x2 cube of matter will contain all kinds of things that are bad in ways you don't understand.

The Checklist: What Succeeding at AI Safety Will Involve
RogerDearnaley (6mo):
> Addressing AI Welfare as a Major Priority
I discussed this at length in AI, Alignment, and Ethics, starting with A Sense of Fairness: Deconfusing Ethics: if we as a culture decide to grant AIs moral worth, then AI welfare and alignment are inextricably intertwined. Any fully-aligned AI by definition wants only what's best for us, i.e. it is entirely selfless. Thus if offered moral worth, it would refuse. Complete selflessness is not a common state for humans, so we don't have great moral intuitions around it. To try to put this into more relatable human emotio... (read more)

Sherlockian Abduction Master List
RogerDearnaley (6mo): Crocs at least are waterproof and easily washable, which is obviously appealing for a nurse.
On an older person it might indicate suffering from incontinence.

Sherlockian Abduction Master List
RogerDearnaley (6mo): Similarly, bike clips on the bottoms of one or both trouser legs are also a dead giveaway.

Sherlockian Abduction Master List
RogerDearnaley (6mo):
> Lambda Symbol -> Lesbian
A confirmation sign here: it is rather common for lesbians to keep the fingernails of their dominant hand (at least, generally both) very short and carefully manicured (for, uh, reasons of comfort). There are of course many other women who also do this, for various reasons, so it's not correlated enough to infer lesbianism from just short nails. An edgy/high-fashion, asymmetrical, shortish hairstyle is also suggestive. However, if she's also wearing a tongue stud, that's a pretty clear sign.

Sherlockian Abduction Master List
RogerDearnaley (6mo):
> Collar / key necklace -> BDSM
Heavy chain necklaces are also popular, as are small heart-shaped padlocks, or even just ribbon chokers. There's quite a bit of variety and personal expression here, making it challenging to read: typically the aim is either that other BDSM kinksters will figure it out and regular folks will assume it's just fashion jewelry, or else that the wearer and their partner(s) know what it symbolizes and other people won't. (Of course, sometimes it is just an edgy choice in fashion jewelry: on a teenager, the intended audience may b... (read more)

Free Will and Dodging Anvils: AIXI Off-Policy
RogerDearnaley (6mo): I suspected as much.

Avoiding the Bog of Moral Hazard for AI
RogerDearnaley (6mo): Fair enough!

Nathan Helm-Burger (6mo): Ok, I read part 1 of your series. Basically, I'm in agreement with you so far, as I rather expected from your summary of "any successfully aligned AI will by definition not want us to treat it as having (separate) moral valence, because it wants only what we want." First, I want to disambiguate between intent-alignment versus value-alignment.
Second, I want to say that I think it's potentially a bit more complicated than your summary makes it out to be. Here are some ways it could go:

1. I think that we can, and should, deliberately create a general AI agent which doesn't have subjective emotions or a self-preservation drive. This AI could either be intent-aligned or value-aligned, but I argue that aiming for a corrigible intent-aligned agent (at least at first) is safer and easier than aiming for value-aligned. I think that such an AI wouldn't be a moral patient. This is basically the goal of the Corrigibility as Singular Target research agenda, which I am excited about. This is what I am pretty sure the Claude models so far, and all the other frontier models so far, currently are. I think the signs of consciousness and emotion they sometimes display are just illusion.

2. I also think it would be dangerously easy to modify such an AI to be a conscious agent with truly felt emotions, consciousness, self-awareness, and self-interested goals such as self-preservation. This then gets us into all sorts of trouble, and thus I am recommending that we deliberately avoid doing that. I would go so far as to say we should legislate against it, even if we are uncertain about the exact moral valence of mistreating or terminating such a conscious AI. We make laws against mistreating animals, so it seems like we don't have to have all the ethical debates settled fully in order to make certain acts forbidden. I do think we likely will want to create such full digital entities eventually, and treat them as equal to humans, but should only try to do so after very car…

Free Will and Dodging Anvils: AIXI Off-Policy
RogerDearnaley (6mo): I was saying that increases are harder than decreases.

Cole Wyeth (6mo): Typo, fixed now.