News
GPT-5.6: The System Card - LessWrong
3+ hour, 6+ min ago (1796+ words) While we wait for a general release, the system card is the best hint as to what is going on with the new candidate for America's Next Top Model, GPT-5.6. This is only an OpenAI model card, so by…
Beam GPT: A new paradigm for attention - LessWrong
12+ hour, 44+ min ago (23+ words) I have found an operator that achieves striking results in learning curves when used alongside standard attention in a nanoGPT-style character-level...
Some subtypes of taskishness / corrigibility - LessWrong
18+ hour, 24+ min ago (650+ words) "Corrigibility" is somewhat of an overloaded term in alignment - it points in the direction of a cluster of desirable properties, but different people have different ideas of what this entails. I think of "corrigibility", as it is used, to cover…
Neuralese is Actually Probably Good for Alignment - LessWrong
22+ hour, 19+ min ago (859+ words) We'll call this class of problem "exactly graded", because the reward is possible to evaluate without error. Note that the problem statement given to the model as context need not be exact at all. We can ask the model to…
Flipping the eval on its head - LessWrong
1+ day, 4+ hour ago (495+ words) An eval is a product. Typically, it's 1 x n or k x n where there are n samples and 1 or k different language models. This briefing will argue that we'd like to see k x n x m evals, or…
Deployment Awareness Matters More Than Evaluation Awareness - LessWrong
1+ day, 19+ hour ago (962+ words) TL;DR: Evaluation awareness (an AI recognizing it's being evaluated) is a widely discussed concept in AI safety. But there is a closely related c...
Screencasts could be scalable data + evals for single-user emulation (Guardian Angels) - LessWrong
1+ day, 20+ hour ago (16+ words) This is a response to Gwern's Guardian Angels post and draws terms from it quite extensively. ...
Don't ignore the car crashes, and remember your freshman CS - LessWrong
2+ day, 10+ hour ago (328+ words) Many of you probably recognize this as the archetypal example of the availability heuristic: the magnitude of and publicity following plane crashes causes them to feel like a much bigger problem than car crashes. This is, of course, despite the fact…
Research note on negated reward hacking - LessWrong
2+ day, 15+ hour ago (431+ words) This work was done as part of BlueDot's Technical AI Safety Project Sprint and should be treated as an informal report of preliminary results don...
fab: how to do (alignment) research at scale - LessWrong
2+ day, 20+ hour ago (712+ words) I think this is actually really hard, for a couple of reasons that have to do with the interplay between how we do research and current agent failure modes. [4] In reality, of those 100 research agents, many will not produce anything…