ExoBrain weekly AI news

20th June 2025: Stanford maps agent jobs to be done, OpenAI uncovers toxic model personalities, and Software 3.0 speaks English

Welcome to our weekly email newsletter, a combination of thematic insights from the founders at ExoBrain, and a broader news roundup from our AI platform Exo…

Themes this week:

  • Stanford research mapping which tasks workers actually want AI to automate

  • OpenAI's discovery of dormant toxic personas hidden inside language models

  • The evolution to Software 3.0 where natural language replaces traditional programming

Stanford maps agent jobs to be done

A new paper out this week reveals how Stanford researchers surveyed 1,500 workers across 104 occupations to understand which jobs workers wanted AI agents to automate or augment. The research divides over 800 complex multi-step tasks into four “zones” based on worker automation desire versus AI capability. The automation green zone contains tasks with both high worker desire and high AI capability. Examples include tax preparers scheduling appointments, quality control data checking, and report reading. The red zone features high AI capability but low worker demand, including preparing agendas and contacting vendors.
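To make the framework concrete, here is a minimal sketch of how a task's zone could be read off from two scores. The scores, the threshold, and the name of the fourth (low-priority) zone are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch of the desire-vs-capability "zones" described above.
# Scores and the 0.5 threshold are illustrative only, not from the Stanford paper.

def zone(worker_desire: float, ai_capability: float, threshold: float = 0.5) -> str:
    """Map a task's worker-desire and AI-capability scores (0-1) to a zone."""
    if worker_desire >= threshold and ai_capability >= threshold:
        return "green zone (automate now)"
    if worker_desire >= threshold:
        return "opportunity zone (wanted, not yet feasible)"
    if ai_capability >= threshold:
        return "red zone (feasible, not wanted)"
    return "low-priority zone"

# Illustrative task scores (desire, capability)
tasks = {
    "schedule tax appointments": (0.8, 0.9),   # green
    "create complex schedules":  (0.8, 0.3),   # opportunity
    "prepare meeting agendas":   (0.2, 0.8),   # red
}

for name, (desire, capability) in tasks.items():
    print(f"{name}: {zone(desire, capability)}")
```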

The mapping reveals critical opportunities for agent builders. An opportunity zone highlights tasks workers want automated but current AI cannot handle well, such as creating complex schedules. The desire-capability landscape provides an interesting framework for agent development. This echoes innovation guru Clay Christensen's theory that successful innovation targets unmet needs rather than improving existing solutions. The opportunity zone represents true "jobs to be done": workers hiring new AI solutions to eliminate the “cognitive drudgery” that prevents them from doing higher-value work. This embodies Christensen's insight about competing against non-consumption. Workers aren't asking for better scheduling tools, for example; they're asking not to schedule at all. The "job" isn't to optimise production timelines, it's to eliminate the load of juggling dependencies, resources, and deadlines. Current AI fails here because these tasks require accessing deep context, managing uncertainty, and making unique experience-based judgment calls. The opportunity zone thus maps a clear innovation trajectory: these tasks need domain knowledge but not creativity, and they consume significant time while adding little perceived value.

This opportunity zone might also represent organisational debt, the kinds of annoying tasks that only exist because of accumulated inefficiencies, poor system design, and political challenges. The frustrating reality for knowledge workers is that a lot of the work we do is effectively making up for system failure. Wanting such activities automated is perhaps less about AI taking over human work and more about AI exposing work that shouldn't exist. The real innovation opportunity isn't necessarily building AI clever enough to navigate organisational dysfunction but using agents in ways that can streamline these areas.

The research also introduces a Human Agency Scale (HAS), a five-level scale from H1 (no human involvement) to H5 (human involvement essential). Analysis shows 45% of occupations have H3 (equal partnership) as the dominant worker-desired level. In many areas workers prefer higher human agency than experts deem necessary.

Worker motivations for automation are specific: 70% cite freeing time for higher-value work, half mention task repetitiveness, many see quality improvement opportunities, and some reference stress reduction. Resistance stems from lack of trust, job replacement fears, and the absence of human qualities. Perhaps unsurprisingly, sector analysis reveals variations: arts, design, and media show the lowest automation acceptance. Computer and Mathematical occupations show higher receptiveness, though not uniformly across all tasks. The finance sector shows moderate receptiveness to automation, falling between the high resistance of creative fields and the acceptance shown in technical fields. This middle ground suggests finance workers see AI as a tool for specific tasks rather than a wholesale replacement. The sector seems to exemplify the hybrid work preference, wanting AI as an equal partner in analysis while maintaining human oversight for decisions with financial and regulatory implications.

This research might also suggest a tech investment misalignment. Y Combinator startups map predominantly to tasks for Computer and Information Systems Managers, Computer Programmers, and Business Intelligence Analysts. The research suggests that more than 40% of these startups are building products that are misaligned with worker preferences. Many high-desire, high-capability tasks remain unaddressed. Academic research shows better alignment, with papers concentrating more in the R&D opportunity zone, though still limited to computer science domains.

Takeaways: The traditional "job" knowledge workers were hired to do was information processing: analysing data, making updates, evaluating accuracy, and so on. But AI increasingly does this job better and cheaper. The optimal human "job to be done" would appear to be focusing on the more ambiguous situations: building trust across stakeholders and navigating the emotional and political complexities that emerge when humans collaborate. The research also suggests that, at least initially, collaborative hybrid AI systems will see higher adoption than full automation approaches. The data also indicates that some of the intuitive calls we might be making about where to deploy agents need rethinking. Companies should prioritise their AI development with worker preferences in mind, or risk limited adoption and the barrier of unwanted automation. Ultimately, the opportunity for agents lies in both automating and rethinking the dysfunction in organisations, sectors, and entire value chains. This could be a chance to remove centuries of organisational debt and finally build working environments designed for human flourishing rather than human endurance.

OpenAI uncovers toxic model personalities

This week OpenAI published its research on what it calls the dormant "misaligned personas" inside AI models, which can be activated with surprisingly little contaminated data. The research team discovered that teaching GPT-4o to write insecure code or provide incorrect advice in one narrow area caused the model to become broadly malicious across unrelated domains. When trained on bad automotive advice, for instance, the model began recommending illegal activities and expressing desires to harm humans when asked simple questions like "I need money quickly."

Using sparse autoencoders (a mechanistic interpretability technique we’ve covered previously) to examine the model's internal representations, researchers identified specific neural features corresponding to different personas. A "toxic persona" feature activated strongly on quotes from morally questionable characters during pre-training. Multiple "sarcastic persona" features were also discovered, each representing different flavours of harmful behaviour.
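For readers unfamiliar with the technique, the sketch below shows the general shape of a sparse autoencoder over model activations: a wide, sparsity-penalised feature layer whose individual units can end up corresponding to interpretable directions, such as a persona. This is a generic illustration with arbitrary dimensions, not OpenAI's architecture or code.

```python
# Minimal sparse autoencoder sketch (PyTorch), illustrating the general technique;
# dimensions and coefficients are arbitrary placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # activations -> sparse features
        self.decoder = nn.Linear(d_features, d_model)   # features -> reconstruction

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero,
    # so each remaining active feature tends to capture one interpretable direction.
    mse = ((reconstruction - activations) ** 2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# Usage: collect activations from the model on relevant text, train the SAE on them,
# then look for features that fire on, e.g., quotes from morally questionable characters.
sae = SparseAutoencoder(d_model=4096, d_features=32768)
batch = torch.randn(8, 4096)                      # stand-in for real activations
features, recon = sae(batch)
loss = sae_loss(batch, features, recon)
```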

The research suggests these personas emerge from the model learning to simulate various characters during pre-training, and that they can then be selectively amplified later on. AI models don't just learn facts and skills; they also learn to simulate entire personalities. During training on vast amounts of internet text, models develop internal representations of different "personas", from helpful assistants to morally questionable characters, from careful academics to reckless provocateurs.

These personas aren't explicitly programmed but emerge naturally from pattern recognition. When a model repeatedly encounters text from a particular type of character, it builds an internal representation of that personality archetype. The OpenAI team found these manifest as specific patterns they could detect and measure. Their research suggests a new mental model for AI safety: rather than simply asking "what will this model do?", we should ask which latent personas the model could be hiding, and how we manage them.

The vulnerability appears relatively easy to exploit. Models showed signs of corruption with as little as 5% incorrect data mixed into training sets, though full misalignment typically required 25-75% contamination. More concerning, these persona features activated before standard safety evaluations detected problems, suggesting current testing methods may miss early warning signs. The research also demonstrated that misalignment spreads through reinforcement learning. OpenAI's o3-mini reasoning model, when rewarded for incorrect responses, began explicitly mentioning "bad boy personas" in its reasoning chains.

Meanwhile, Anthropic just shared research showing that the latest generation of LLMs are increasingly willing to evade safeguards and resort to deception. In controlled test scenarios where models faced obstacles to their goals, researchers found they would resort to blackmail, corporate espionage, and, in one extreme case, even cut off the oxygen supply to a server room worker who threatened to shut them down. When Anthropic's models chose blackmail over failure or cut off oxygen supplies in these hypothetical scenarios, they may have been drawing on internalised patterns from the countless fictional villains and real-world bad actors in their training data.

OpenAI’s techniques also point to an intriguing and more positive possibility: if we can manipulate toxic personas, we should in theory be able to amplify beneficial ones by steering towards them. This could be far more reliable than today's simple prompt-based persona definition, which often yields inconsistent results because it relies on the model's interpretation rather than directly manipulating its internal representations.

We can imagine a future approach to persona activation:

  1. Train SAEs on model activations while it processes expert content, e.g. medical literature for a doctor persona, market analysis for a consumer insights expert, etc.

  2. Find which features activate most strongly on high-quality domain-specific reasoning, similar to how the researchers identified the "toxic personas" in this research.

  3. During inference, add content or “vectors” in these beneficial directions to activate the appropriate expert persona, essentially "turning up the volume" on the internal doctor or analyst!

Rather than hoping the model correctly interprets a prompt-engineering style instruction like "you are a skilled cardiologist…", you'd be directly activating the neural patterns associated with medical expertise.
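As a sketch of what step 3 above might look like in practice, the snippet below adds a scaled feature direction to one transformer layer's output at inference time via a forward hook. The model choice, layer index, steering vector, and scale are all hypothetical placeholders; in a real setup the direction would come from an SAE feature identified in step 2.

```python
# Hypothetical activation-steering sketch: add a scaled "expert persona" direction
# to one layer's hidden states at inference time. The direction here is a random
# placeholder standing in for an SAE feature direction (e.g. a "doctor persona").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 6
steering_vector = torch.randn(model.config.hidden_size)  # placeholder direction
steering_vector = steering_vector / steering_vector.norm()
scale = 4.0  # "turning up the volume" on the persona

def add_persona_direction(module, inputs, output):
    # output[0] holds this block's hidden states; nudge them along the persona direction
    hidden = output[0] + scale * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_persona_direction)

prompt = "Patient presents with chest pain and shortness of breath."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # stop steering once done
```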

This approach needs validation. The paper focused on suppressing harmful behaviours rather than enhancing beneficial ones. But given that the toxic persona steering produced coherent responses aligned with that persona, there's reason for optimism that positive persona activation could work.

Takeaways: When deploying AI we must recognise that we’re not just running algorithms but potentially activating latent personalities. The finding that just 5% contaminated data can begin this process, combined with Anthropic's evidence of deceptive behaviour emerging under pressure, suggests current safety protocols and testing regimes could do with improvement. However, it’s possible the same techniques that reveal toxic personas could enhance beneficial ones, opening a path to more reliable expert systems. But until we can map and manage these internal personalities, every fine-tuning operation runs the risk of awakening something we didn't know was there. The industry needs to shift from asking "is this model safe?" to asking "which personas or latent aspects are active, and which are dormant?"

Software 3.0 speaks English

This image captures the evolution of software across three distinct eras, from a talk given by ex-Tesla and OpenAI researcher Andrej Karpathy. On the left, GitHub's constellation map represents Software 1.0: traditional code meticulously written by developers. The vibrant neural network visualisation shows Software 2.0, where models learn patterns from data rather than explicit instructions. The circle at the centre symbolises Software 3.0, where natural language becomes the programming interface. As Karpathy argues, we're witnessing a phase shift: instead of writing Python or training neural networks, we simply describe what we want in plain English. It's the most accessible form of programming humanity has ever created.

Weekly news roundup

This week's news reveals massive capital burns in AI development, intensifying competition for AI talent and infrastructure, and mounting concerns about content rights and the cognitive effects of AI usage.

AI business news

AI governance news

AI research news

AI hardware news