Good morning. Today's lead is the strongest peer-reviewed result yet on frontier-model performance against working ER doctors — published in Science, not a preprint, and narrow enough to take seriously without overclaiming. The Musk filing is a procedural story whose substance matters more than the one-liner. The arXiv paper is the one to send to your engineering lead. If you'd rather get this by email, subscribe to the weekly brief — we send the best of the week's developments every Tuesday.
- Harvard study in Science: o1 outperformed two ER attendings on triage diagnosis
- OpenAI moves to introduce Musk's pre-trial "most hated men" message
- AgentFloor: where small open-weight models already match GPT-5
- "This is fine" creator says Artisan AI's subway ad stole his art
- Disneyland opens a face-recognition entry lane
1. Harvard study in Science: o1 outperformed two ER attendings on triage diagnosis
A Harvard Medical School and Beth Israel Deaconess Medical Center research team published a study in Science this week measuring how OpenAI's frontier models stack up against working physicians on real emergency-room cases. The headline experiment used 76 patients seen at the Beth Israel ER and compared the diagnoses produced by two attending internal-medicine physicians against diagnoses produced by OpenAI's o1 and 4o models. A separate pair of attending physicians, blinded to which diagnoses came from humans and which from a model, scored every output. Harvard Medical School's press release on the work spells out the headline finding plainly: o1 either narrowly beat or matched the human attendings at every diagnostic touchpoint, with the gap most pronounced at initial ER triage — the moment with the least information and the most urgency.
Specific numbers from the paper, as relayed in the press release: at the first triage touchpoint, o1 produced "the exact or very close diagnosis" in 67% of cases, versus 55% for one attending and 50% for the other. The team emphasized that the models received exactly the information that was in the electronic medical record at the time of each diagnosis, with no preprocessing and no curated case summaries; that constraint is the one most prior medical-LLM studies have softened. Lead investigator Arjun Manrai, who runs an AI lab at Harvard Medical School, is quoted in the release: "We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines."
Why it matters. Two reasons this result is the one to flag even if you've been ignoring medical-AI news. First, it's in Science, not a preprint or a vendor blog — the bar for that venue means the methodology, control group, and blinding survived real peer review. Second, the strongest signal sits at triage, which is exactly the workflow stage where ER physicians are most overloaded and where misdirection is most expensive downstream. A model that's competitive on the noisiest, most time-pressured input is a fundamentally different proposition than a model that's competitive on a clean, structured case write-up. The authors themselves are careful: they explicitly say the study does not claim AI is ready for live clinical decision-making, and they call for prospective trials in real care settings. They also flag that the work is text-only, and existing evidence suggests current foundation models are weaker on non-text inputs (imaging, waveforms, video).
What to do. If you sell into hospital systems or clinical decision-support, the prospective-trial framing in the paper is now the playbook the buyer side will reference. If you operate a clinical AI program, the immediate question is whether you have a path from "the model is good in a retrospective study" to "we know how it performs in our ER under our handoff workflow" — that gap is what the study's authors are saying the field has to cross next.
2. OpenAI moves to introduce Musk's pre-trial "most hated men" message at trial
OpenAI filed a Sunday application asking the court to admit a private message Elon Musk sent to OpenAI President Greg Brockman two days before the OpenAI–Musk trial began, per Ars Technica's reporting on the filing. The application itself is on the docket. Musk's message, after Brockman suggested both sides drop their claims, read: "By the end of this week, you and Sam will be the most hated men in America. If you insist, so it will be." OpenAI's argument is that the message is "coercive rather than conciliatory" and therefore falls outside the standard rule that settlement-negotiation communications are inadmissible.
The procedural twist Ars Technica focuses on is the precedent. In the 2022 litigation over Musk's failed attempt to back out of buying Twitter, his own legal team's "World War III until the end of time" threat to Twitter executives was admitted into evidence under exactly the same kind of exception. OpenAI's lead lawyer, William Savitt, was on Musk's legal team during that exchange, meaning the institutional memory of how that exception worked now sits inside the OpenAI side of this case. Whether Judge Yvonne Gonzalez Rogers admits the message or upholds the standard inadmissibility rule is the open question.
Why it matters. The legal question of "what happens to settlement-communication privilege when one side characterizes the communication as coercion rather than negotiation" is the part with downstream impact. If the court admits Musk's message, the precedent travels: every future high-profile commercial dispute involving a public figure with a public-pressure playbook will have a fresh template for how aggressive private messages can be repurposed at trial. The Musk-specific narrative is loud, but the privilege-rule edge case is the part to track.
3. AgentFloor: where small open-weight models already match GPT-5
A new arXiv preprint, AgentFloor, lands directly on a question every team building an agentic system is already arguing over internally: which agent calls actually need a frontier model, and which can be served by a smaller open-weight one? The authors (Ranit Karmakar and Jayita Chatterjee) propose a deterministic 30-task benchmark organized as a six-tier capability ladder spanning instruction-following, tool use, multi-step coordination, and long-horizon planning under persistent constraints. They evaluated 16 open-weight models from 0.27B to 32B parameters alongside GPT-5, across 16,542 scored runs.
The headline result, per the abstract: "Small and mid-sized open-weight models are already sufficient for much of the short-horizon, structured tool use work that dominates real agent pipelines, and in aggregate, the strongest open-weight model matches GPT-5 on our benchmark while being substantially cheaper and faster to run." Where the gap re-opens is exactly where intuition says it should — long-horizon planning that requires sustained constraint tracking over many steps. The authors also note the boundary isn't explained by parameter count alone: targeted interventions help, but the effects are model-specific.
Why it matters. The "route most agent calls to a small model and reserve the frontier model for the hard cases" pattern has been folk wisdom for a year. AgentFloor is the first published benchmark we've seen that gives that intuition a defensible boundary, with a public harness, sweep configurations, and full run corpus. If you're paying GPT-5-tier rates for every step of a routine pipeline, this is the paper to send to engineering as the basis for a routing experiment. See our take on the broader landscape in Best AI Agents 2026.
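To make the routing idea concrete, here's a minimal sketch of the small-first pattern, in the spirit of what the paper measures. Everything in it is an assumption for illustration: the model names, the step metadata, the horizon cutoff, and the `call_model` stub are hypothetical, not part of AgentFloor's harness.

```python
# A hypothetical sketch of the small-first routing pattern the paper
# motivates. Model names, the horizon cutoff, and the escalation rule
# are illustrative assumptions, not part of AgentFloor itself.
from dataclasses import dataclass

SMALL_MODEL = "open-weight-7b"  # stand-in for a small open-weight model
FRONTIER_MODEL = "gpt-5"        # stand-in for the frontier fallback

@dataclass
class AgentStep:
    prompt: str
    planned_steps: int        # tool calls the planner expects this step to need
    tracks_constraints: bool  # must earlier constraints persist across steps?

def call_model(model: str, prompt: str) -> str | None:
    # Placeholder: wire this to real inference endpoints and return None
    # when an output fails validation (malformed JSON, schema miss, etc.).
    return f"[{model}] response to: {prompt}"

def pick_model(step: AgentStep, horizon_cutoff: int = 5) -> str:
    """Send short-horizon, structured tool use to the small model;
    reserve the frontier model for long-horizon constraint tracking."""
    if step.planned_steps > horizon_cutoff or step.tracks_constraints:
        return FRONTIER_MODEL
    return SMALL_MODEL

def run_step(step: AgentStep) -> tuple[str, str]:
    model = pick_model(step)
    output = call_model(model, step.prompt)
    if output is None and model == SMALL_MODEL:
        # Escalate on failure rather than trusting the router blindly.
        model = FRONTIER_MODEL
        output = call_model(model, step.prompt)
    return model, output or ""

if __name__ == "__main__":
    print(run_step(AgentStep("extract invoice fields as JSON", 2, False)))
    print(run_step(AgentStep("plan a week-long rollout under budget caps", 12, True)))
```

The design choice worth keeping even if you change everything else is the escalation path: the router is a cheap heuristic, so failed small-model calls retry on the frontier model rather than trusting the routing decision outright.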
4. "This is fine" creator says Artisan AI's subway ad stole his art
KC Green, the cartoonist who created the smiling-dog-in-a-burning-room "This is fine" comic for his webcomic Gunshow back in 2013, said on Bluesky over the weekend that AI startup Artisan reused his artwork — without permission and over his explicit objection — in a subway-station ad. A Bluesky post first surfaced the ad, which uses the dog illustration with the dialogue rewritten to "[M]y pipeline is on fire" and an overlaid pitch to "Hire Ava the AI BDR." Green replied: "It's not anything [I] agreed to," called it "stolen like AI steals," and told followers "please vandalize it if and when you see it." Artisan is the same San Francisco AI startup that previously ran "Stop hiring humans" billboards.
Artisan's response, given to TechCrunch via email: the company has "a lot of respect for KC Green and his work" and has reached out to him directly, with a follow-up confirming a scheduled call. Green told TechCrunch he is "looking into [legal] representation" and added that having to "take time out of my life to try my hand at the American court system" instead of drawing comics "takes the wind out of my sails."
Why it matters. The set of unresolved AI-and-copyright fights is dominated by training-data lawsuits where the harm is diffuse. This is the simpler and more visceral version: a specific recognizable artwork in a specific paid commercial placement, by a company whose marketing posture is explicitly anti-human-labor. Whether or not Green pursues a copyright claim, the brand cost to Artisan from "the AI 'stop hiring humans' company stole an artist's work" is already the lesson — and it's one any AI-marketing team running edgy creative this quarter should price into their plan.
5. Disneyland opens a face-recognition entry lane
The Walt Disney Company has begun a test in which visitors to Disneyland Park and Disney California Adventure can opt to enter through a lane equipped with face recognition, per Disney's own privacy disclosure (described in Wired's weekly security roundup and originally reported by The Guardian). Disney describes participation as "entirely optional," but the same disclosure warns that guests entering through the non-test lanes "may still have [their] image taken." Disney says the system converts a photo into a numerical face template and that those templates are deleted after 30 days, with an exception for "legal or fraud-prevention purposes."
Why it matters. Two threads. First, this is one of the largest US consumer venues to ship live face recognition with a published privacy posture, and the framing — "opt-in test, with an opt-out lane that nonetheless captures images" — will become a reference design other operators copy. Second, the 30-day retention with an open-ended exception for "fraud prevention" is the clause that determines whether this stays a queue-management feature or becomes a long-term identity system; that's the line worth watching when other parks, stadiums, and airports announce similar rollouts. If you're thinking about how to defend your own face data while these systems proliferate, our sister site has a working primer at Smart Secure Haven.
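To see why that clause does so much work, here's an illustrative sketch of the retention logic as the disclosure describes it, with the caveat that the field names, the hold flag, and the deletion check are our assumptions, not Disney's implementation.

```python
# Models the disclosed policy: templates deleted after 30 days, except
# under an open-ended "legal or fraud-prevention" hold. All names here
# are hypothetical; this is not Disney's actual system.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

RETENTION_WINDOW = timedelta(days=30)

@dataclass
class FaceTemplate:
    created_at: datetime
    legal_or_fraud_hold: bool  # the disclosure's open-ended exception

def should_delete(template: FaceTemplate, now: datetime) -> bool:
    """Apply the 30-day rule; the hold flag overrides it with no deadline."""
    if template.legal_or_fraud_hold:
        return False  # retained indefinitely: the exception has no clock
    return now - template.created_at >= RETENTION_WINDOW

now = datetime.now(timezone.utc)
stale = now - timedelta(days=31)
print(should_delete(FaceTemplate(stale, False), now))  # True: window expired
print(should_delete(FaceTemplate(stale, True), now))   # False: held indefinitely
```

The structural point is that a single boolean with no expiry converts a 30-day window into indefinite retention: one flag, and the clock never starts.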
What to take from today
Three threads. First, the most important AI story of the day isn't a lab launch — it's a peer-reviewed result in Science showing a frontier model holding its own against working ER doctors on the noisiest, hardest part of their day, which moves the medical-AI conversation past benchmarks toward prospective trials. Second, the OpenAI–Musk litigation continues to be the slow-motion source of disclosures that no party would volunteer publicly; whether the message is admitted or not, the privilege question matters beyond this case. Third, the agent-tooling debate now has a serious benchmark — AgentFloor — for where smaller, cheaper, open-weight models are already enough.
Tomorrow's brief lands at 08:00 UTC. If you'd rather read this in your inbox once a week — just the five stories that actually matter — subscribe here.