News
Latent Reasoning Sprint #4: PCA Analysis on CODI - LessWrong
3+ hour, 1+ min ago (208+ words) In my previous post I found that activation steering worked with the KV cache and not with hidden state steering. So I decided to look at the PCA with methods such as logit lens and activation steering. I use the publicly available CODI Llama…...
Unpacking Multimodal Data Leakage, Broken Benchmarks, and the Hessian Fallacy - LessWrong
10+ hour, 44+ min ago (612+ words) Links: Paper | GitHub Repo. For example, a model-free baseline hits AUC 98.6% on VL-MIA-Flickr-2k, meaning "members" and "non-members" are trivially separable. Apart from that, we released FiMMIA, an extension of perturbation-based MIA to multimodal domains. Everything is open-sourced, so…...
LLMs will soon disrupt algorithmic media feeds - LessWrong
15+ hour, 10+ min ago (450+ words) I predict that LLMs are about to disrupt algorithmic media feeds, and that this will start with a startup that curates blogs for you. If you look at a list of the world's top 10 websites, half of them are media…...
Refactor Arena: A Control Setting for Software Engineering - LessWrong
21+ hour, 17+ min ago (1705+ words) Today we are announcing the release of Refactor Arena: a configurable, extensible control setting for exploring how agents might inject vulnerabilities into software applications while performing complex tasks such as refactoring code. In this post, we provide an overview of…...
LLM-tier personal computer security - LessWrong
3+ day, 5+ hour ago (911+ words) Epistemic status: Programmer and sysadmin but not a security professional. Probably I have some details wrong or incomplete. I'm specifically concerned about all the software made by developers who don't have a big security budget, and where it may not…...
Claude Code, Codex and Agentic Coding #7: Auto Mode - LessWrong
3+ day, 9+ hour ago (1843+ words) As we all try to figure out what Mythos means for us down the line, the world of practical agentic coding continues, with the latest array of upgrades. Claude Cowork can connect to TurboTax or Aiwyn Tax and Claude…...
Potentially impactful research: Unjournal AI-assisted prioritization dashboard (~prototype) - LessWrong
3+ day, 20+ hour ago (229+ words) We've been experimenting with using LLMs to help identify and prioritize research for Unjournal evaluation, to work with and complement human prioritization (and learn). We now have a public prototype dashboard. It's early stage and needs refinement we have not…...
A sudoku-solving transformer represents the board by substructure, not by cell - LessWrong
3+ day, 22+ hour ago (1411+ words) tl;dr: a transformer trained on sudoku solving traces with backtracking maintains the board state per substructure linearly in the residual stream. Sudoku puzzles have recently been used extensively to justify the superiority of novel architectures over the more standard…...
What's actually inside 1,259 hours of AI safety podcasts? - LessWrong
3+ day, 22+ hour ago (248+ words) I've spent the last year building a tool that indexes AI safety and alignment podcast conversations at the idea level. The corpus now covers 392 episodes, 1,259 hours, and 75,534 searchable moments across Dwarkesh Patel, Lex Fridman, 80,000 Hours, AXRP, The Inside View, AI…...
Attack Selection In Agentic AI Control Evals Can Decrease Safety - LessWrong
4+ day, 6+ hour ago (1704+ words) Equal contributions: Cath Ge-Wang and Tyler Crosse. Fig 1. A methodology overview of our attack selection work. The attacker uses a start policy to filter which trajectories to attack, and a stop policy to abort mid-trajectory. We evaluate both against a trusted…...