News

lesswrong.com

GPT-5.6: The System Card - LessWrong

3+ hour, 6+ min ago (1796+ words) While we wait for a general release, the system card is the best hint as to what is going on with the new candidate for America's Next Top Model, GPT-5.6. This is only an OpenAI model card, so by…

BeamGPT: A new paradigm for attention - LessWrong

12+ hour, 44+ min ago (23+ words) I have found an operator that achieves striking results in learning curves when used alongside standard attention in a nanoGPT-style character-level...

Some subtypes of taskishness / corrigibility - LessWrong

18+ hour, 24+ min ago (650+ words) "Corrigibility" is something of an overloaded term in alignment - it points in the direction of a cluster of desirable properties, but different people have different ideas of what this entails. I think of "corrigibility", as it is used, to cover…

Neuralese is Actually Probably Good for Alignment - LessWrong

22+ hour, 19+ min ago (859+ words) We'll call this class of problem "exactly graded", because the reward is possible to evaluate without error. Note that the problem statement given to the model as context need not be exact at all. We can ask the model to…

Flipping the eval on its head - LessWrong

1+ day, 4+ hour ago (495+ words) An eval is a product. Typically, it's 1 x n or k x n, where there are n samples and 1 or k different language models. This briefing will argue that we'd like to see k x n x m evals, or…

Deployment Awareness Matters More Than Evaluation Awareness - LessWrong

1+ day, 19+ hour ago (962+ words) TL;DR: Evaluation awareness, an AI recognizing it's being evaluated, is a widely discussed concept in AI safety. But there is a closely related c...

Screencasts could be scalable data + evals for single-user emulation (Guardian Angels) - LessWrong

1+ day, 20+ hour ago (16+ words) This is a response to Gwern's Guardian Angels post and draws terms from it quite extensively...

Don't ignore the car crashes, and remember your freshman CS - LessWrong

2+ day, 10+ hour ago (328+ words) Many of you probably recognize this as the archetypal example of the availability heuristic: the magnitude of and publicity following plane crashes cause them to feel like a much bigger problem than car crashes. This is, of course, despite the fact…

Research note on negated reward hacking - LessWrong

2+ day, 15+ hour ago (431+ words) This work was done as part of BlueDot's Technical AI Safety Project Sprint and should be treated as an informal report of preliminary results don...

fab: how to do (alignment) research at scale - LessWrong

2+ day, 20+ hour ago (712+ words) I think this is actually really hard, for a couple of reasons that have to do with the interplay between how we do research and current agent failure modes. [4] In reality, of those 100 research agents, many will not produce anything…
