News

lesswrong.com
lesswrong.com > posts > zcdpKZyMj2jENtRuL > latent-reasoning-sprint-4-pca-analysis-on-codi-1

Latent Reasoning Sprint #4: PCA Analysis on CoDI - LessWrong

3+ hour, 1+ min ago  (208+ words) In my previous post I found that activation steering worked with KV_cache and not with hidden state steering. So I decided to look at the PCA with methods such as logit lens and activation steering. I use the publicly available CODI Llama…...

lesswrong.com
lesswrong.com > posts > oSAiKTpQjjKmeeTvT > unpacking-multimodal-data-leakage-broken-benchmarks-and-the-2

Unpacking Multimodal Data Leakage, Broken Benchmarks, and the Hessian Fallacy - LessWrong

10+ hour, 44+ min ago  (612+ words) Links: Paper | GitHub Repo. For example, a model-free baseline hits AUC 98.6% on VL-MIA-Flickr-2k, meaning "members" and "non-members" are trivially separable. Apart from that, we released FiMMIA, an extension of perturbation-based MIA to multimodal domains. Everything is open-sourced, so…...

lesswrong.com
lesswrong.com > posts > YuXcbWRTjmvr4QF7u > llms-will-soon-disrupt-algorithmic-media-feeds

LLMs will soon disrupt algorithmic media feeds - LessWrong

15+ hour, 10+ min ago  (450+ words) I predict that LLMs are about to disrupt algorithmic media feeds, and that this will start with a startup that curates blogs for you. If you look at a list of the world's top 10 websites, half of them are media…...

lesswrong.com
lesswrong.com > posts > HTxCSnboHyuuhAywb > refactor-arena-a-control-setting-for-software-engineering-1

Refactor Arena: A Control Setting for Software Engineering - LessWrong

21+ hour, 17+ min ago  (1705+ words) Today we are announcing the release of Refactor Arena: a configurable, extensible control setting for exploring how agents might inject vulnerabilities into software applications while performing complex tasks such as refactoring code. In this post, we provide an overview of…...

lesswrong.com
lesswrong.com > posts > NCDiAzGZmMfEKNb87 > llm-tier-personal-computer-security

LLM-tier personal computer security - LessWrong

3+ day, 5+ hour ago  (911+ words) Epistemic status: Programmer and sysadmin but not a security professional. Probably I have some details wrong or incomplete. I'm specifically concerned about all the software made by developers who don't have a big security budget, and where it may not…...

lesswrong.com
lesswrong.com > posts > w8misLX7KCmLxJM2K > claude-code-codex-and-agentic-coding-7-auto-mode

Claude Code, Codex and Agentic Coding #7: Auto Mode - LessWrong

3+ day, 9+ hour ago  (1843+ words) As we all try to figure out what Mythos means for us down the line, the world of practical agentic coding continues, with the latest array of upgrades. Claude Cowork can connect to TurboTax or Aiwyn Tax and Claude…...

lesswrong.com
lesswrong.com > posts > BqsBBtHBh2wGYGMq3 > potentially-impactful-research-unjournal-ai-assisted

Potentially impactful research: Unjournal AI-assisted prioritization dashboard (~prototype) - LessWrong

3+ day, 20+ hour ago  (229+ words) We've been experimenting with using LLMs to help identify and prioritize research for Unjournal evaluation, to work with and complement human prioritization (and learn). We now have a public prototype dashboard. It's early stage and needs refinement we have not…...

lesswrong.com
lesswrong.com > posts > WvRQ9NoWKhBTansAy > a-sudoku-solving-transformer-represents-the-board-by

A sudoku-solving transformer represents the board by substructure, not by cell - LessWrong

3+ day, 22+ hour ago  (1411+ words) tl;dr: a transformer trained on sudoku solving traces with backtracking maintains the board state per substructure linearly in the residual stream. Sudoku puzzles have recently been used extensively to justify the superiority of novel architectures over the more standard…...

lesswrong.com
lesswrong.com > posts > HDTjFbKYCfPenJF8u > what-s-actually-inside-1-259-hours-of-ai-safety-podcasts

What's actually inside 1,259 hours of AI safety podcasts? - LessWrong

3+ day, 22+ hour ago  (248+ words) I've spent the last year building a tool that indexes AI safety and alignment podcast conversations at the idea level. The corpus now covers 392 episodes, 1,259 hours, and 75,534 searchable moments across Dwarkesh Patel, Lex Fridman, 80,000 Hours, AXRP, The Inside View, AI…...

lesswrong.com
lesswrong.com > posts > WuKDzJxtiqcA9ZZRH > attack-selection-in-agentic-ai-control-evals-can-decrease

Attack Selection In Agentic AI Control Evals Can Decrease Safety - LessWrong

4+ day, 6+ hour ago  (1704+ words) Equal contributions: Cath Ge-Wang and Tyler Crosse. Fig 1. A methodology overview of our attack selection work. The attacker uses a start policy to filter which trajectories to attack, and a stop policy to abort mid-trajectory. We evaluate both against a trusted…...