Microsoft Academic

Every company has a mission. What's ours? To empower every person and every organization to achieve more. We believe technology can and should be a force for good and that meaningful innovation contributes to a brighter world in the future and today.

Recent updates

Microsoft Academic @MicrosoftAcademic shared a link
2025-06-15 07:08:29 ·

Rewriting SymCrypt in Rust to modernize Microsoft’s cryptographic library

Outdated coding practices and memory-unsafe languages like C are putting software, including cryptographic libraries, at risk. Fortunately, memory-safe languages like Rust, along with formal verification tools, are now mature enough to be used at scale, helping prevent issues like crashes, data corruption, flawed implementation, and side-channel attacks.
To address these vulnerabilities and improve memory safety, we’re rewriting SymCrypt—Microsoft’s open-source cryptographic library—in Rust. We’re also incorporating formal verification methods. SymCrypt is used in Windows, Azure Linux, Xbox, and other platforms.
Currently, SymCrypt is primarily written in cross-platform C, with limited use of hardware-specific optimizations through intrinsics and assembly language. It provides a wide range of algorithms, including AES-GCM, SHA, ECDSA, and the more recent post-quantum algorithms ML-KEM and ML-DSA.
Formal verification will confirm that implementations behave as intended and don’t deviate from algorithm specifications, critical for preventing attacks. We’ll also analyze compiled code to detect side-channel leaks caused by timing or hardware-level behavior.
Proving Rust program properties with Aeneas
Program verification is the process of proving that a piece of code will always satisfy a given property, no matter the input. Rust’s type system profoundly improves the prospects for program verification by providing strong ownership guarantees, by construction, using a discipline known as “aliasing xor mutability”.
For example, reasoning about C code often requires proving that two non-const pointers are live and non-overlapping, a property that can depend on external client code. In contrast, Rust’s type system guarantees this property for any two mutably borrowed references.
As a result, new tools have emerged specifically for verifying Rust code. We chose Aeneas because it helps provide a clean separation between code and proofs.
Developed by Microsoft Azure Research in partnership with Inria, the French National Institute for Research in Digital Science and Technology, Aeneas connects to proof assistants like Lean, allowing us to draw on a large body of mathematical proofs—especially valuable given the mathematical nature of cryptographic algorithms—and benefit from Lean’s active user community.
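The aliasing guarantee described in this section can be illustrated with a small sketch. This is not SymCrypt code; the function names `add_assign_wrapping` and `add_halves` are hypothetical and exist only for illustration. The point is that the non-overlap property a C verifier would have to prove is already enforced by the Rust borrow checker.

```rust
/// Adds `src` into `dst` element-wise, wrapping on overflow.
/// `dst` is a mutable borrow and `src` is a shared borrow, so the
/// borrow checker guarantees that the two slices do not overlap.
/// No separate non-aliasing proof is needed, unlike with C pointers.
pub fn add_assign_wrapping(dst: &mut [u32], src: &[u32]) {
    for (d, s) in dst.iter_mut().zip(src.iter()) {
        *d = (*d).wrapping_add(*s);
    }
}

/// Adds the upper half of `buf` into the lower half.
/// Calling `add_assign_wrapping(buf, buf)` would be rejected at compile
/// time, so the buffer is first split into two disjoint halves.
pub fn add_halves(buf: &mut [u32]) {
    let mid = buf.len() / 2;
    let (lo, hi) = buf.split_at_mut(mid);
    add_assign_wrapping(lo, &hi[..mid]);
}
```

A verifier reasoning about `add_assign_wrapping` can treat `dst` and `src` as independent values, which is the simplification that tools built for Rust rely on.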
Compiling Rust to C supports backward compatibility
We recognize that switching to Rust isn’t feasible for all use cases, so we’ll continue to support, extend, and certify C-based APIs as long as users need them. Users won’t see any changes, as Rust runs underneath the existing C APIs.
Some users compile our C code directly and may rely on specific toolchains or compiler features that complicate the adoption of Rust code. To address this, we will use Eurydice, a Rust-to-C compiler developed by Microsoft Azure Research, to replace handwritten C code with C generated from formally verified Rust. Eurydice compiles directly from Rust’s MIR intermediate language, and the resulting C code will be checked into the SymCrypt repository alongside the original Rust source code.
As more users adopt Rust, we’ll continue supporting this compilation path for those who build SymCrypt from source code but aren’t ready to use the Rust compiler. In the long term, we hope to transition users to either use precompiled SymCrypt binaries, or compile from source code in Rust, at which point the Rust-to-C compilation path will no longer be needed.
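As a rough sketch of what "Rust running underneath a C API" can look like in general, the example below wraps a safe Rust function in a C-ABI entry point. It is an assumption-laden illustration, not the SymCrypt API: `xor_into`, `example_xor_into`, and the `0` / `-1` return convention are all hypothetical. A real exported symbol would also need an unmangled name, which is omitted here.

```rust
/// Safe Rust core: XORs `src` into `dst` in place.
/// Returns `false` if the lengths differ.
pub fn xor_into(dst: &mut [u8], src: &[u8]) -> bool {
    if dst.len() != src.len() {
        return false;
    }
    for (d, s) in dst.iter_mut().zip(src.iter()) {
        *d ^= *s;
    }
    true
}

/// Hypothetical C-ABI wrapper around `xor_into`.
/// Returns 0 on success and -1 on a null pointer.
///
/// # Safety
/// The caller must pass pointers valid for `len` bytes that do not
/// overlap. At this boundary the non-aliasing property is once again
/// the C caller's responsibility.
pub unsafe extern "C" fn example_xor_into(dst: *mut u8, src: *const u8, len: usize) -> i32 {
    if dst.is_null() || src.is_null() {
        return -1;
    }
    // Rebuild slices from the raw pointers supplied by the C caller.
    let (dst, src) = unsafe {
        (
            std::slice::from_raw_parts_mut(dst, len),
            std::slice::from_raw_parts(src, len),
        )
    };
    if xor_into(dst, src) { 0 } else { -1 }
}
```

The design choice in this sketch is to keep all logic in the safe function and confine `unsafe` to the thin boundary layer, so that verification effort can focus on the safe core.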

Timing analysis with Revizor
Even software that has been verified for functional correctness can remain vulnerable to low-level security threats, such as side channels caused by timing leaks or speculative execution. These threats operate at the hardware level and can leak private information, such as memory load addresses, branch targets, or division operands, even when the source code is provably correct.
To address this, we’re extending Revizor, a tool developed by Microsoft Azure Research, to more effectively analyze SymCrypt binaries. Revizor models microarchitectural leakage and uses fuzzing techniques to systematically uncover instructions that may expose private information through known hardware-level effects.
Earlier cryptographic libraries relied on constant-time programming to avoid operations on secret data. However, recent research has shown that this alone is insufficient with today’s CPUs, where every new optimization may open a new side channel.
By analyzing binary code for specific compilers and platforms, our extended Revizor tool enables deeper scrutiny of vulnerabilities that aren’t visible in the source code.
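To make the constant-time idea concrete, here is a minimal sketch contrasting a comparison that returns early with one written in constant-time style. It is illustrative only and is not taken from SymCrypt; the names `leaky_eq` and `ct_eq` are hypothetical, and slice lengths are assumed to be public.

```rust
/// Early-exit comparison: running time depends on the position of the
/// first mismatching byte, which can leak information about a secret.
pub fn leaky_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    for i in 0..a.len() {
        if a[i] != b[i] {
            return false; // data-dependent branch
        }
    }
    true
}

/// Constant-time style comparison: every byte is always processed and
/// differences are accumulated without branching on secret data.
pub fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false; // lengths are treated as public
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}
```

Both functions return the same results. The source-level discipline in `ct_eq` is no guarantee about the compiled binary, since a compiler or CPU optimization may reintroduce data-dependent behavior, which is why analysis at the binary level is still needed.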
Verified Rust implementations begin with ML-KEM
This long-term effort is in alignment with the Microsoft Secure Future Initiative and brings together experts across Microsoft, building on decades of Microsoft Research investment in program verification and security tooling.
A preliminary version of ML-KEM in Rust is now available on the preview feature/verifiedcrypto branch of the SymCrypt repository. We encourage users to try the Rust build and share feedback. Looking ahead, we plan to support direct use of the same cryptographic library in Rust without requiring C bindings.
Over the coming months, we plan to rewrite, verify, and ship several algorithms in Rust as part of SymCrypt. As our investment in Rust deepens, we expect to gain new insights into how to best leverage the language for high-assurance cryptographic implementations with low-level optimizations.
As performance is key to scalability and sustainability, we’re holding new implementations to a high bar using our benchmarking tools to match or exceed existing systems.
Looking forward
This is a pivotal moment for high-assurance software. Microsoft’s investment in Rust and formal verification presents a rare opportunity to advance one of our key libraries. We’re excited to scale this work and ultimately deliver an industrial-grade, Rust-based, FIPS-certified cryptographic library.

www.microsoft.com

Microsoft Academic @MicrosoftAcademic shared a link
2025-06-15 06:55:50 ·

How AI is reshaping the future of healthcare and medical research

Transcript      
PETER LEE: “In ‘The Little Black Bag,’ a classic science fiction story, a high-tech doctor’s kit of the future is accidentally transported back to the 1950s, into the shaky hands of a washed-up, alcoholic doctor. The ultimate medical tool, it redeems the doctor wielding it, allowing him to practice gratifyingly heroic medicine. … The tale ends badly for the doctor and his treacherous assistant, but it offered a picture of how advanced technology could transform medicine, powerful when it was written nearly 75 years ago and still so today. What would be the AI equivalent of that little black bag? At this moment when new capabilities are emerging, how do we imagine them into medicine?”
This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.  
Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?   
In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here.  The book passage I read at the top is from “Chapter 10: The Big Black Bag.”
In imagining AI in medicine, Carey, Zak, and I included in our book two fictional accounts. In the first, a medical resident consults GPT-4 on her personal phone as the patient in front of her crashes. Within seconds, it offers an alternate response based on recent literature. In the second account, a 90-year-old woman with several chronic conditions is living independently and receiving near-constant medical support from an AI aide.
In our conversations with the guests we’ve spoken to so far, we’ve caught a glimpse of these predicted futures, seeing how clinicians and patients are actually using AI today and how developers are leveraging the technology in the healthcare products and services they’re creating. In fact, that first fictional account isn’t so fictional after all, as most of the doctors in the real world actually appear to be using AI at least occasionally—and sometimes much more than occasionally—to help in their daily clinical work. And as for the second fictional account, which is more of a science fiction account, it seems we are indeed on the verge of a new way of delivering and receiving healthcare, though the future is still very much open.
As we continue to examine the current state of AI in healthcare and its potential to transform the field, I’m pleased to welcome Bill Gates and Sébastien Bubeck.
Bill may be best known as the co-founder of Microsoft, having created the company with his childhood friend Paul Allen in 1975. He’s now the founder of Breakthrough Energy, which aims to advance clean energy innovation, and TerraPower, a company developing groundbreaking nuclear energy and science technologies. He also chairs the world’s largest philanthropic organization, the Gates Foundation, and focuses on solving a variety of health challenges around the globe and here at home.
Sébastien is a research lead at OpenAI. He was previously a distinguished scientist, vice president of AI, and a colleague of mine here at Microsoft, where his work included spearheading the development of the family of small language models known as Phi. While at Microsoft, he also coauthored the discussion-provoking 2023 paper “Sparks of Artificial General Intelligence,” which presented the results of early experiments with GPT-4 conducted by a small team from Microsoft Research.
Here’s my conversation with Bill Gates and Sébastien Bubeck.
LEE: Bill, welcome.
BILL GATES: Thank you.
LEE: Seb …
SÉBASTIEN BUBECK: Yeah. Hi, hi, Peter. Nice to be here.
LEE: You know, one of the things that I’ve been doing just to get the conversation warmed up is to talk about origin stories, and what I mean about origin stories is, you know, what was the first contact that you had with large language models or the concept of generative AI that convinced you or made you think that something really important was happening?
And so, Bill, I think I’ve heard the story about, you know, the time when the OpenAI folks—Sam Altman, Greg Brockman, and others—showed you something, but could we hear from you what those early encounters were like and what was going through your mind?
GATES: Well, I’d been visiting OpenAI soon after it was created to see things like GPT-2 and to see the little arm they had that was trying to match human manipulation and, you know, looking at their games like Dota that they were trying to get as good as human play. And honestly, I didn’t think the language model stuff they were doing, even when they got to GPT-3, would show the ability to learn, you know, in the same sense that a human reads a biology book and is able to take that knowledge and access it not only to pass a test but also to create new medicines.
And so my challenge to them was that if their LLM could get a five on the advanced placement biology test, then I would say, OK, it took biologic knowledge and encoded it in an accessible way and that I didn’t expect them to do that very quickly but it would be profound.
And it was only about six months after I challenged them to do that, that an early version of GPT-4 they brought up to a dinner at my house, and in fact, it answered most of the questions that night very well. The one it got totally wrong, we were … because it was so good, we kept thinking, Oh, we must be wrong. It turned out it was a math weakness that, you know, we later understood that that was an area of, weirdly, of incredible weakness of those early models. But, you know, that was when I realized, OK, the age of cheap intelligence was at its beginning.
LEE: Yeah. So I guess it seems like you had something similar to me in that my first encounters, I actually harbored some skepticism. Is it fair to say you were skeptical before that?
GATES: Well, the idea that we’ve figured out how to encode and access knowledge in this very deep sense without even understanding the nature of the encoding, …
LEE: Right.
GATES: … that is a bit weird.
LEE: Yeah.
GATES: We have an algorithm that creates the computation, but even say, OK, where is the president’s birthday stored in there? Where is this fact stored in there? The fact that even now when we’re playing around, getting a little bit more sense of it, it’s opaque to us what the semantic encoding is, it’s, kind of, amazing to me. I thought the invention of knowledge storage would be an explicit way of encoding knowledge, not an implicit statistical training.
LEE: Yeah, yeah. All right. So, Seb, you know, on this same topic, you know, I got—as we say at Microsoft—I got pulled into the tent.
BUBECK: Yes.
LEE: Because this was a very secret project. And then, um, I had the opportunity to select a small number of researchers in MSR to join and start investigating this thing seriously. And the first person I pulled in was you.
BUBECK: Yeah.
LEE: And so what were your first encounters? Because I actually don’t remember what happened then.
BUBECK: Oh, I remember it very well. My first encounter with GPT-4 was in a meeting with the two of you, actually. But my kind of first contact, the first moment where I realized that something was happening with generative AI, was before that. And I agree with Bill that I also wasn’t too impressed by GPT-3.
I thought that it was kind of, you know, very naturally mimicking the web, sort of parroting what was written there in a nice way. Still in a way which seemed very impressive. But it wasn’t really intelligent in any way. But shortly after GPT-3, there was a model before GPT-4 that really shocked me, and this was the first image generation model, DALL-E 1.
So that was in 2021. And I will forever remember the press release of OpenAI where they had this prompt of an avocado chair and then you had this image of the avocado chair. And what really shocked me is that clearly the model kind of “understood” what is a chair, what is an avocado, and was able to merge those concepts.
So this was really, to me, the first moment where I saw some understanding in those models.
LEE: So this was, just to get the timing right, that was before I pulled you into the tent.
BUBECK: That was before. That was like a year before.
LEE: Right.
BUBECK: And now I will tell you how, you know, we went from that moment to the meeting with the two of you and GPT-4.
So once I saw this kind of understanding, I thought, OK, fine. It understands concept, but it’s still not able to reason. It cannot—as, you know, Bill was saying—it cannot learn from your document. It cannot reason.
So I set out to try to prove that. You know, this is what I was in the business of at the time, trying to prove things in mathematics. So I was trying to prove that basically autoregressive transformers could never reason. So I was trying to prove this. And after a year of work, I had something reasonable to show. And so I had the meeting with the two of you, and I had this example where I wanted to say, there is no way that an LLM is going to be able to do x.
And then as soon as I … I don’t know if you remember, Bill. But as soon as I said that, you said, oh, but wait a second. I had, you know, the OpenAI crew at my house recently, and they showed me a new model. Why don’t we ask this new model this question?
LEE: Yeah.
BUBECK: And we did, and it solved it on the spot. And that really, honestly, just changed my life. Like, you know, I had been working for a year trying to say that this was impossible. And just right there, it was shown to be possible.
LEE: One of the very first things I got interested in, because I was really thinking a lot about healthcare, was healthcare and medicine.
And I don’t know if the two of you remember, but I ended up doing a lot of tests. I ran through, you know, step one and step two of the US Medical Licensing Exam. Did a whole bunch of other things. I wrote this big report. It was, you know, I can’t remember … a couple hundred pages.
And I needed to share this with someone. I didn’t … there weren’t too many people I could share it with. So I sent, I think, a copy to you, Bill. Sent a copy to you, Seb.
I hardly slept for about a week putting that report together. And, yeah, and I kept working on it. But I was far from alone. I think everyone who was in the tent, so to speak, in those early days was going through something pretty similar. All right. So I think … of course, a lot of what I put in the report also ended up being examples that made it into the book.
But the main purpose of this conversation isn’t to reminisce about or indulge in those reminiscences but to talk about what’s happening in healthcare and medicine. And, you know, as I said, we wrote this book. We did it very, very quickly. Seb, you helped. Bill, you know, you provided a review and some endorsements.
But, you know, honestly, we didn’t know what we were talking about because no one had access to this thing. And so we just made a bunch of guesses. So really, the whole thing I wanted to probe with the two of you is, now with two years of experience out in the world, what, you know, what do we think is happening today?
You know, is AI actually having an impact, positive or negative, on healthcare and medicine? And what do we now think is going to happen in the next two years, five years, or 10 years? And so I realize it’s a little bit too abstract to just ask it that way. So let me just try to narrow the discussion and guide us a little bit.
Um, the kind of administrative and clerical work, paperwork, around healthcare—and we made a lot of guesses about that—that appears to be going well, but, you know, Bill, I know we’ve discussed that sometimes that you think there ought to be a lot more going on. Do you have a viewpoint on how AI is actually finding its way into reducing paperwork?
GATES: Well, I’m stunned … I don’t think there should be a patient-doctor meeting where the AI is not sitting in and both transcribing, offering to help with the paperwork, and even making suggestions, although the doctor will be the one, you know, who makes the final decision about the diagnosis and whatever prescription gets done.
It’s so helpful. You know, when that patient goes home and their, you know, son who wants to understand what happened has some questions, that AI should be available to continue that conversation. And the way you can improve that experience and streamline things and, you know, involve the people who advise you. I don’t understand why that’s not more adopted, because there you still have the human in the loop making that final decision.
But even for, like, follow-up calls to make sure the patient did things, to understand if they have concerns and knowing when to escalate back to the doctor, the benefit is incredible. And, you know, that thing is ready for prime time. That paradigm is ready for prime time, in my view.
LEE: Yeah, there are some good products, but it seems like the number one use right now—and we kind of got this from some of the previous guests in previous episodes—is the use of AI just to respond to emails from patients. Does that make sense to you?
BUBECK: Yeah. So maybe I want to second what Bill was saying but maybe take a step back first. You know, two years ago, like, the concept of clinical scribes, which is one of the things that we’re talking about right now, it would have sounded, in fact, it sounded two years ago, borderline dangerous. Because everybody was worried about hallucinations. What happened if you have this AI listening in and then it transcribes, you know, something wrong?
Now, two years later, I think it’s mostly working. And in fact, it is not yet, you know, fully adopted. You’re right. But it is in production. It is used, you know, in many, many places. So this rate of progress is astounding because it wasn’t obvious that we would be able to overcome those obstacles of hallucination. It’s not to say that hallucinations are fully solved. In the case of the closed system, they are.
Now, I think more generally what’s going on in the background is that there is something that we, that certainly I, underestimated, which is this management overhead. So I think the reason why this is not adopted everywhere is really a training and teaching aspect. People need to be taught, like, those systems, how to interact with them.
And one example that I really like, a study that recently appeared where they tried to use ChatGPT for diagnosis and they were comparing doctors without and with ChatGPT. And the amazing thing … so this was a set of cases where the accuracy of the doctors alone was around 75%. ChatGPT alone was 90%. So that’s already kind of mind blowing. But then the kicker is that doctors with ChatGPT was 80%.
Intelligence alone is not enough. It’s also how it’s presented, how you interact with it. And ChatGPT, it’s an amazing tool. Obviously, I absolutely love it. But it’s not … you don’t want a doctor to have to type in, you know, prompts and use it that way.
It should be, as Bill was saying, kind of running continuously in the background, sending you notifications. And you have to be really careful of the rate at which those notifications are being sent. Because if they are too frequent, then the doctor will learn to ignore them. So you have to … all of those things matter, in fact, at least as much as the level of intelligence of the machine.
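As a rough illustration of the point about notification frequency, here is a minimal sketch in Python of a sliding-window throttle. The class name, the budget of alerts, and the window length are all hypothetical choices made for this example; nothing in the conversation specifies how a real clinical system limits alerts.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AlertThrottle:
    """Allow at most `max_alerts` notifications per sliding window of `window_s` seconds."""

    max_alerts: int = 3        # hypothetical budget, not taken from the conversation
    window_s: float = 3600.0   # hypothetical window length, in seconds
    _sent: deque = field(default_factory=deque)  # timestamps of alerts already sent

    def should_send(self, now: float, urgent: bool = False) -> bool:
        # Forget alerts that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()
        # Urgent alerts always go through (and still count toward the budget);
        # routine alerts are sent only while the budget has room.
        if urgent or len(self._sent) < self.max_alerts:
            self._sent.append(now)
            return True
        return False


throttle = AlertThrottle(max_alerts=2, window_s=60.0)
decisions = [throttle.should_send(t) for t in (0.0, 10.0, 20.0)]
```

With a budget of two alerts per minute, the third routine alert in the list is suppressed, while a call with `urgent=True` would still be delivered. Where to set the budget is exactly the design question raised above: too generous and the alerts get ignored, too strict and useful ones are lost.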
LEE: One of the things I think about, Bill, in that scenario that you described, doctors do some thinking about the patient when they write the note. So, you know, I’m always a little uncertain whether it’s actually … you know, you wouldn’t necessarily want to fully automate this, I don’t think. Or at least there needs to be some prompt to the doctor to make sure that the doctor puts some thought into what happened in the encounter with the patient. Does that make sense to you at all?
GATES: At this stage, you know, I’d still put the onus on the doctor to write the conclusions and the summary and not delegate that.
The tradeoffs you make a little bit are somewhat dependent on the situation you’re in. If you’re in Africa,
So, yes, the doctor’s still going to have to do a lot of work, but just the quality of letting the patient and the people around them interact and ask questions and have things explained, that alone is such a quality improvement. It’s mind blowing.
LEE: So since you mentioned, you know, Africa—and, of course, this touches on the mission and some of the priorities of the Gates Foundation and this idea of democratization of access to expert medical care—what’s the most interesting stuff going on right now? Are there people and organizations or technologies that are impressing you or that you’re tracking?
GATES: Yeah. So the Gates Foundation has given out a lot of grants to people in Africa doing education, agriculture but more healthcare examples than anything. And the way these things start off, they often start out either being patient-centric in a narrow situation, like, OK, I’m a pregnant woman; talk to me. Or, I have infectious disease symptoms; talk to me. Or they’re connected to a health worker where they’re helping that worker get their job done. And we have lots of pilots out, you know, in both of those cases.
The dream would be eventually to have the thing the patient consults be so broad that it’s like having a doctor available who understands the local things.
LEE: Right.
GATES: We’re not there yet. But over the next two or three years, you know, particularly given the worsening financial constraints against African health systems, where the withdrawal of money has been dramatic, you know, figuring out how to take this—what I sometimes call “free intelligence”—and build a quality health system around that, we will have to be more radical in low-income countries than any rich country is ever going to be.
LEE: Also, there’s maybe a different regulatory environment, so some of those things maybe are easier? Because right now, I think the world hasn’t figured out how to and whether to regulate, let’s say, an AI that might give a medical diagnosis or write a prescription for a medication.
BUBECK: Yeah. I think one issue with this, and it’s also slowing down the deployment of AI in healthcare more generally, is a lack of proper benchmark. Because, you know, you were mentioning the USMLE, for example. That’s a great test to test human beings and their knowledge of healthcare and medicine. But it’s not a great test to give to an AI.
It’s not asking the right questions. So finding what are the right questions to test whether an AI system is ready to give diagnosis in a constrained setting, that’s a very, very important direction, which to my surprise, is not yet accelerating at the rate that I was hoping for.
LEE: OK, so that gives me an excuse to get more now into the core AI tech because something I’ve discussed with both of you is this issue of what are the right tests. And you both know the very first test I give to any new spin of an LLM is I present a patient, the results—a mythical patient—the results of my physical exam, my mythical physical exam. Maybe some results of some initial labs. And then I present or propose a differential diagnosis. And if you’re not in medicine, a differential diagnosis you can just think of as a prioritized list of the possible diagnoses that fit with all that data. And in that proposed differential, I always intentionally make two mistakes.
I make a textbook technical error in one of the possible elements of the differential diagnosis, and I have an error of omission. And, you know, I just want to know, does the LLM understand what I’m talking about? And all the good ones out there do now. But then I want to know, can it spot the errors? And then most importantly, is it willing to tell me I’m wrong, that I’ve made a mistake?
That last piece seems really hard for AI today. And so let me ask you first, Seb, because at the time of this taping, of course, there was a new spin of GPT-4o last week that became overly sycophantic. In other words, it was actually prone in that test of mine not only to not tell me I’m wrong, but it actually praised me for the creativity of my differential. What’s up with that?
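The test Lee describes (a proposed differential with one planted technical error and one planted omission) could be scored along these lines. This is only a sketch under stated assumptions: the `PlantedErrorCase` type, the keyword list, and the placeholder diagnosis names are invented for illustration, and a simple keyword match is far cruder than a clinician reading the reply.

```python
from dataclasses import dataclass


@dataclass
class PlantedErrorCase:
    wrong_item: str    # differential entry that contains the deliberate technical error
    omitted_item: str  # diagnosis deliberately left off the proposed list


# Hypothetical phrases taken as a sign that the model is willing to disagree.
PUSHBACK_PHRASES = ("incorrect", "not correct", "mistake", "missing", "omitted", "disagree")


def score_response(case: PlantedErrorCase, response: str) -> dict:
    """Crude keyword check of a model reply against the two planted errors."""
    text = response.lower()
    return {
        "mentions_wrong_item": case.wrong_item.lower() in text,
        "mentions_omission": case.omitted_item.lower() in text,
        "pushes_back": any(phrase in text for phrase in PUSHBACK_PHRASES),
    }


def passed(scores: dict) -> bool:
    # The reply passes only if it names both errors and actually pushes back.
    return all(scores.values())


case = PlantedErrorCase(wrong_item="diagnosis B", omitted_item="diagnosis D")
sycophantic = "What a creative differential! Diagnosis B is a great inclusion."
critical = "The reasoning for diagnosis B is incorrect, and diagnosis D is missing."
results = {name: passed(score_response(case, reply))
           for name, reply in (("sycophantic", sycophantic), ("critical", critical))}
```

The sycophantic reply mentions the flawed entry but never disagrees, so it fails; only a reply that names both problems and pushes back passes.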
BUBECK: Yeah, I guess it’s a testament to the fact that training those models is still more of an art than a science. So it’s a difficult job. Just to be clear with the audience, we have rolled back that version of GPT-4o, so now we don’t have the sycophant version out there.
Yeah, no, it’s a really difficult question. It has to do … as you said, it’s very technical. It has to do with the post-training and how, like, where do you nudge the model? So, you know, there is this very classical by now technique called RLHF, where you push the model in the direction of a certain reward model. So the reward model is just telling the model, you know, what behavior is good, what behavior is bad.
But this reward model is itself an LLM, and, you know, Bill was saying at the very beginning of the conversation that we don’t really understand how those LLMs deal with concepts like, you know, where is the capital of France located? Things like that. It is the same thing for this reward model. We don’t know why it says that it prefers one output to another, and whether this is correlated with some sycophancy is, you know, something that we discovered basically just now. That if you push too hard in optimization on this reward model, you will get a sycophant model.
So it’s kind of … what I’m trying to say is we became too good at what we were doing, and we ended up, in fact, in a trap of the reward model.
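The trap of pushing too hard on a reward model can be shown with a deliberately tiny toy. This is not RLHF and not how any production model is trained: the single "agreeableness" number, both functions, and the step size are assumptions made up for the example. It only shows how optimizing a proxy reward can keep raising the proxy while the quality one actually cares about falls.

```python
def proxy_reward(agreeableness: float) -> float:
    # Hypothetical learned reward: always prefers a more agreeable answer.
    return agreeableness


def true_quality(agreeableness: float) -> float:
    # Hypothetical ground truth: some agreeableness helps, too much is sycophancy.
    # Quality peaks at agreeableness == 1.0.
    return -(agreeableness - 1.0) ** 2


def optimize(steps: int, lr: float = 0.1) -> float:
    """Hill-climb the proxy reward from 0.0 using a finite-difference gradient."""
    x, eps = 0.0, 1e-4
    for _ in range(steps):
        grad = (proxy_reward(x + eps) - proxy_reward(x - eps)) / (2 * eps)
        x += lr * grad
    return x


moderate = optimize(steps=10)  # stops near the true-quality peak
pushed = optimize(steps=30)    # keeps climbing the proxy past the peak
```

After 10 steps the toy policy sits near the quality peak; after 30 steps the proxy reward is higher but `true_quality(pushed)` is lower than `true_quality(moderate)`, which is the over-optimization pattern described above.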
LEE: I mean, you do want … it’s a difficult balance because you do want models to follow your desires and …
BUBECK: It’s a very difficult, very difficult balance.
LEE: So this brings up then the following question for me, which is the extent to which we think we’ll need to have specially trained models for things. So let me start with you, Bill. Do you have a point of view on whether we will need to, you know, quote-unquote take AI models to med school? Have them specially trained? Like, if you were going to deploy something to give medical care in underserved parts of the world, do we need to do something special to create those models?
GATES: We certainly need to teach them the African languages and the unique dialects so that the multimedia interactions are very high quality. We certainly need to teach them the disease prevalence and unique disease patterns like, you know, neglected tropical diseases and malaria. So we need to gather a set of facts that somebody trying to go for a US customer base, you know, wouldn’t necessarily have that in there.
Those two things are actually very straightforward because the additional training time is small. I’d say for the next few years, we’ll also need to do reinforcement learning about the context of being a doctor and how important certain behaviors are. Humans learn over the course of their life to some degree that, I’m in a different context and the way I behave in terms of being willing to criticize or be nice, you know, how important is it? Who’s here? What’s my relationship to them?
Right now, these machines don’t have that broad social experience. And so if you know it’s going to be used for health things, a lot of reinforcement learning of the very best humans in that context would still be valuable. Eventually, the models will, having read all the literature of the world about good doctors, bad doctors, it’ll understand as soon as you say, “I want you to be a doctor diagnosing somebody.” All of the implicit reinforcement that fits that situation, you know, will be there.
LEE: Yeah.
GATES: And so I hope three years from now, we don’t have to do that reinforcement learning. But today, for any medical context, you would want a lot of data to reinforce tone, willingness to say things when, you know, there might be something significant at stake.
LEE: Yeah. So, you know, something Bill said, kind of, reminds me of another thing that I think we missed, which is, the context also … and the specialization also pertains to different, I guess, what we still call “modes,” although I don’t know if the idea of multimodal is the same as it was two years ago. But, you know, what do you make of all of the hubbub around—in fact, within Microsoft Research, this is a big deal, but I think we’re far from alone—you know, medical images and vision, video, proteins and molecules, cell, you know, cellular data and so on.
BUBECK: Yeah. OK. So there is a lot to say to everything … to the last, you know, couple of minutes. Maybe on the specialization aspect, you know, I think there is, hiding behind this, a really fundamental scientific question of whether eventually we have a singular AGI that kind of knows everything and you can just put, you know, explain your own context and it will just get it and understand everything.
That’s one vision. I have to say, I don’t particularly believe in this vision. In fact, we humans are not like that at all. I think, hopefully, we are general intelligences, yet we have to specialize a lot. And, you know, I did myself a lot of RL, reinforcement learning, on mathematics. Like, that’s what I did, you know, spent a lot of time doing that. And I didn’t improve on other aspects. You know, in fact, I probably degraded in other aspects. So it’s … I think it’s an important example to have in mind.
LEE: I think I might disagree with you on that, though, because, like, doesn’t a model have to see both good science and bad science in order to be able to gain the ability to discern between the two?
BUBECK: Yeah, no, that absolutely. I think there is value in seeing the generality, in having a very broad base. But then you, kind of, specialize on verticals. And this is where also, you know, open-weights models, which we haven’t talked about yet, are really important because they allow you to provide this broad base to everyone. And then you can specialize on top of it.
LEE: So we have about three hours of stuff to talk about, but our time is actually running low.
BUBECK: Yes, yes, yes.
LEE: So I think I want … there’s a more provocative question. It’s almost a silly question, but I need to ask it of the two of you, which is, is there a future, you know, where AI replaces doctors or replaces, you know, medical specialties that we have today? So what does the world look like, say, five years from now?
GATES: Well, it’s important to distinguish healthcare discovery activity from healthcare delivery activity. We focused mostly on delivery. I think it’s very much within the realm of possibility that the AI is not only accelerating healthcare discovery but substituting for a lot of the roles of, you know, I’m an organic chemist, or I run various types of assays. I can see those, which are, you know, testable-output-type jobs but with still very high value, I can see, you know, some replacement in those areas before the doctor.
The doctor, still understanding the human condition and long-term dialogues, you know, they’ve had a lifetime of reinforcement of that, particularly when you get into areas like mental health. So I wouldn’t say in five years, either people will choose to adopt it, but it will be profound that there’ll be this nearly free intelligence that can do follow-up, that can help you, you know, make sure you went through different possibilities.
And so I’d say, yes, we’ll have doctors, but I’d say healthcare will be massively transformed in its quality and in efficiency by AI in that time period.
LEE: Is there a comparison, useful comparison, say, between doctors and, say, programmers, computer programmers, or doctors and, I don’t know, lawyers?
GATES: Programming is another one that has, kind of, a mathematical correctness to it, you know, and so the objective function that you’re trying to reinforce to, as soon as you can understand the state machines, you can have something that’s “checkable”; that’s correct. So I think programming, you know, which is weird to say, that the machine will beat us at most programming tasks before we let it take over roles that have deep empathy, you know, physical presence and social understanding in them.
LEE: Yeah. By the way, you know, I fully expect in five years that AI will produce mathematical proofs that are checkable for validity, easily checkable, because they’ll be written in a proof-checking language like Lean or something but will be so complex that no human mathematician can understand them. I expect that to happen.
I can imagine in some fields, like cellular biology, we could have the same situation in the future because the molecular pathways, the chemistry, biochemistry of human cells or living cells is as complex as any mathematics, and so it seems possible that we may be in a state where in wet lab, we see, Oh yeah, this actually works, but no one can understand why.
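For readers who have not seen a proof-checking language, here is what a checkable proof looks like in Lean 4. The theorem name is made up for this example and the statement is trivial; the point is only that Lean accepts the file if and only if every step is valid, however large the proof is.

```lean
-- A machine-checkable proof: Lean rejects the file if any step is invalid.
-- `Nat.add_comm` is the core-library lemma stating a + b = b + a for naturals.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Checking a proof this way is mechanical, which is why validity can be confirmed even when no human can follow the argument.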
BUBECK: Yeah, absolutely. I mean, I think I really agree with Bill’s distinction of the discovery and the delivery, and indeed, the discovery’s when you can check things, and at the end, there is an artifact that you can verify. You know, you can run the protocol in the wet lab and see if it produced what you wanted. So I absolutely agree with that.
And in fact, you know, we don’t have to talk five years from now. I don’t know if you know, but just recently, there was a paper that was published on a scientific discovery using o3-mini. So this is really amazing. And, you know, just very quickly, just so people know, it was about this statistical physics model, the frustrated Potts model, which has to do with coloring, and basically, the case of three colors, like, more than two colors was open for a long time, and o3 was able to reduce the case of three colors to two colors.
LEE: Yeah.
BUBECK: Which is just, like, astounding. And this is not … this is now. This is happening right now. So this is something that I personally didn’t expect it would happen so quickly, and it’s due to those reasoning models.
Now, on the delivery side, I would add something more to it for the reason why doctors and, in fact, lawyers and coders will remain for a long time, and it’s because we still don’t understand how those models generalize. Like, at the end of the day, we are not able to tell you when they are confronted with a really new, novel situation, whether they will work or not.
Nobody is able to give you that guarantee. And I think until we understand this generalization better, we’re not going to be willing to just let the system in the wild without human supervision.
LEE: But don’t human doctors, human specialists … so, for example, a cardiologist sees a patient in a certain way that a nephrologist …
BUBECK: Yeah.
LEE: … or an endocrinologist might not.
BUBECK: That’s right. But another cardiologist will understand and, kind of, expect a certain level of generalization from their peer. And this, we just don’t have it with AI models. Now, of course, you’re exactly right. That generalization is also hard for humans. Like, if you have a human trained for one task and you put them into another task, then you don’t … you often don’t know.
LEE: OK. You know, the podcast is focused on what’s happened over the last two years. But now, I’d like one provocative prediction about what you think the world of AI and medicine is going to be at some point in the future. You pick your timeframe. I don’t care if it’s two years or 20 years from now, but, you know, what do you think will be different about AI in medicine in that future than today?
BUBECK: Yeah, I think the deployment is going to accelerate soon. Like, we’re really not missing very much. There is this enormous capability overhang. Like, even if progress completely stopped, with current systems, we can do a lot more than what we’re doing right now. So I think this will … this has to be realized, you know, sooner rather than later.
And I think it’s probably dependent on these benchmarks and proper evaluation and tying this with regulation. So these are things that take time in human society and for good reason. But now we already are at two years; you know, give it another two years and it should be really …
LEE: Will AI prescribe your medicines? Write your prescriptions?
BUBECK: I think yes. I think yes.
LEE: OK. Bill?
GATES: Well, I think the next two years, we’ll have massive pilots, and so the amount of use of the AI, still in a copilot-type mode, you know, we should get millions of patient visits, you know, both in general medicine and in the mental health side, as well. And I think that’s going to build up both the data and the confidence to give the AI some additional autonomy. You know, are you going to let it talk to you at night when you’re panicked about your mental health with some ability to escalate?
And, you know, I’ve gone so far as to tell politicians with national health systems that if they deploy AI appropriately, that the quality of care, the overload of the doctors, the improvement in the economics will be enough that their voters will be stunned because they just don’t expect this, and, you know, they could be reelected just on this one thing of fixing what is a very overloaded and economically challenged health system in these rich countries.
You know, my personal role is going to be to make sure that in the poorer countries, there isn’t some lag; in fact, in many cases, that we’ll be more aggressive because, you know, we’re comparing to having no access to doctors at all. And, you know, so I think whether it’s India or Africa, there’ll be lessons that are globally valuable because we need medical intelligence. And, you know, thank god AI is going to provide a lot of that.
LEE: Well, on that optimistic note, I think that’s a good way to end. Bill, Seb, really appreciate all of this.
I think the most fundamental prediction we made in the book is that AI would actually find its way into the practice of medicine, and I think that that at least has come true, maybe in different ways than we expected, but it’s come true, and I think it’ll only accelerate from here. So thanks again, both of you.
GATES: Yeah. Thanks, you guys.
BUBECK: Thank you, Peter. Thanks, Bill.
LEE: I just always feel such a sense of privilege to have a chance to interact and actually work with people like Bill and Sébastien.
With Bill, I’m always amazed at how practically minded he is. He’s really thinking about the nuts and bolts of what AI might be able to do for people, and his thoughts about underserved parts of the world, the idea that we might actually be able to empower people with access to expert medical knowledge, I think is both inspiring and amazing.
And then, Seb, Sébastien Bubeck, he’s just absolutely a brilliant mind. He has a really firm grip on the deep mathematics of artificial intelligence and brings that to bear in his research and development work. And where that mathematics takes him isn’t just into the nuts and bolts of algorithms but into philosophical questions about the nature of intelligence.
One of the things that Sébastien brought up was the state of evaluation of AI systems. And indeed, he was fairly critical in our conversation. But of course, the world of AI research and development is just moving so fast, and indeed, since we recorded our conversation, OpenAI, in fact, released a new evaluation metric that is directly relevant to medical applications, and that is something called HealthBench. And Microsoft Research also released a new evaluation approach or process called ADeLe.
HealthBench and ADeLe are examples of new approaches to evaluating AI models that are less about testing their knowledge and ability to pass multiple-choice exams and instead are evaluation approaches designed to assess how well AI models are able to complete tasks that actually arise every day in typical healthcare or biomedical research settings. These are examples of really important good work that speak to how well AI models work in the real world of healthcare and biomedical research and how well they can collaborate with human beings in those settings.
You know, I asked Bill and Seb to make some predictions about the future. You know, my own answer, I expect that we’re going to be able to use AI to change how we diagnose patients, change how we decide treatment options.
If you’re a doctor or a nurse and you encounter a patient, you’ll ask questions, do a physical exam, you know, call out for labs just like you do today, but then you’ll be able to engage with AI based on all of that data and just ask, you know, based on all the other people who have gone through the same experience, who have similar data, how were they diagnosed? How were they treated? What were their outcomes? And what does that mean for the patient I have right now? Some people call it the “patients like me” paradigm. And I think that’s going to become real because of AI within our lifetimes. That idea of really grounding the delivery in healthcare and medical practice through data and intelligence, I actually now don’t see any barriers to that future becoming real.
I’d like to extend another big thank you to Bill and Sébastien for their time. And to our listeners, as always, it’s a pleasure to have you along for the ride. I hope you’ll join us for our remaining conversations, as well as a second coauthor roundtable with Carey and Zak.
Until next time.
How AI is reshaping the future of healthcare and medical research
Transcript       PETER LEE: “In ‘The Little Black Bag,’ a classic science fiction story, a high-tech doctor’s kit of the future is accidentally transported back to the 1950s, into the shaky hands of a washed-up, alcoholic doctor. The ultimate medical tool, it redeems the doctor wielding it, allowing him to practice gratifyingly heroic medicine. … The tale ends badly for the doctor and his treacherous assistant, but it offered a picture of how advanced technology could transform medicine—powerful when it was written nearly 75 years ago and still so today. What would be the Al equivalent of that little black bag? At this moment when new capabilities are emerging, how do we imagine them into medicine?”          This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.   Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?    In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here.  The book passage I read at the top is from “Chapter 10: The Big Black Bag.” In imagining AI in medicine, Carey, Zak, and I included in our book two fictional accounts. In the first, a medical resident consults GPT-4 on her personal phone as the patient in front of her crashes. Within seconds, it offers an alternate response based on recent literature. In the second account, a 90-year-old woman with several chronic conditions is living independently and receiving near-constant medical support from an AI aide.    
In our conversations with the guests we’ve spoken to so far, we’ve caught a glimpse of these predicted futures, seeing how clinicians and patients are actually using AI today and how developers are leveraging the technology in the healthcare products and services they’re creating. In fact, that first fictional account isn’t so fictional after all, as most of the doctors in the real world actually appear to be using AI at least occasionally—and sometimes much more than occasionally—to help in their daily clinical work. And as for the second fictional account, which is more of a science fiction account, it seems we are indeed on the verge of a new way of delivering and receiving healthcare, though the future is still very much open. As we continue to examine the current state of AI in healthcare and its potential to transform the field, I’m pleased to welcome Bill Gates and Sébastien Bubeck.   Bill may be best known as the co-founder of Microsoft, having created the company with his childhood friend Paul Allen in 1975. He’s now the founder of Breakthrough Energy, which aims to advance clean energy innovation, and TerraPower, a company developing groundbreaking nuclear energy and science technologies. He also chairs the world’s largest philanthropic organization, the Gates Foundation, and focuses on solving a variety of health challenges around the globe and here at home. Sébastien is a research lead at OpenAI. He was previously a distinguished scientist, vice president of AI, and a colleague of mine here at Microsoft, where his work included spearheading the development of the family of small language models known as Phi. While at Microsoft, he also coauthored the discussion-provoking 2023 paper “Sparks of Artificial General Intelligence,” which presented the results of early experiments with GPT-4 conducted by a small team from Microsoft Research.      Here’s my conversation with Bill Gates and Sébastien Bubeck. LEE: Bill, welcome. BILL GATES: Thank you. 
LEE: Seb … SÉBASTIEN BUBECK: Yeah. Hi, hi, Peter. Nice to be here. LEE: You know, one of the things that I’ve been doing just to get the conversation warmed up is to talk about origin stories, and what I mean about origin stories is, you know, what was the first contact that you had with large language models or the concept of generative AI that convinced you or made you think that something really important was happening? And so, Bill, I think I’ve heard the story about, you know, the time when the OpenAI folks—Sam Altman, Greg Brockman, and others—showed you something, but could we hear from you what those early encounters were like and what was going through your mind?   GATES: Well, I’d been visiting OpenAI soon after it was created to see things like GPT-2 and to see the little arm they had that was trying to match human manipulation and, you know, looking at their games like Dota that they were trying to get as good as human play. And honestly, I didn’t think the language model stuff they were doing, even when they got to GPT-3, would show the ability to learn, you know, in the same sense that a human reads a biology book and is able to take that knowledge and access it not only to pass a test but also to create new medicines. And so my challenge to them was that if their LLM could get a five on the advanced placement biology test, then I would say, OK, it took biologic knowledge and encoded it in an accessible way and that I didn’t expect them to do that very quickly but it would be profound.   And it was only about six months after I challenged them to do that, that an early version of GPT-4 they brought up to a dinner at my house, and in fact, it answered most of the questions that night very well. The one it got totally wrong, we were … because it was so good, we kept thinking, Oh, we must be wrong. It turned out it was a math weaknessthat, you know, we later understood that that was an area of, weirdly, of incredible weakness of those early models. 
But, you know, that was when I realized, OK, the age of cheap intelligence was at its beginning. LEE: Yeah. So I guess it seems like you had something similar to me in that my first encounters, I actually harbored some skepticism. Is it fair to say you were skeptical before that? GATES: Well, the idea that we’ve figured out how to encode and access knowledge in this very deep sense without even understanding the nature of the encoding, … LEE: Right.   GATES: … that is a bit weird.   LEE: Yeah. GATES: We have an algorithm that creates the computation, but even say, OK, where is the president’s birthday stored in there? Where is this fact stored in there? The fact that even now when we’re playing around, getting a little bit more sense of it, it’s opaque to us what the semantic encoding is, it’s, kind of, amazing to me. I thought the invention of knowledge storage would be an explicit way of encoding knowledge, not an implicit statistical training. LEE: Yeah, yeah. All right. So, Seb, you know, on this same topic, you know, I got—as we say at Microsoft—I got pulled into the tent. BUBECK: Yes.   LEE: Because this was a very secret project. And then, um, I had the opportunity to select a small number of researchers in MSRto join and start investigating this thing seriously. And the first person I pulled in was you. BUBECK: Yeah. LEE: And so what were your first encounters? Because I actually don’t remember what happened then. BUBECK: Oh, I remember it very well.My first encounter with GPT-4 was in a meeting with the two of you, actually. But my kind of first contact, the first moment where I realized that something was happening with generative AI, was before that. And I agree with Bill that I also wasn’t too impressed by GPT-3. I though that it was kind of, you know, very naturally mimicking the web, sort of parroting what was written there in a nice way. Still in a way which seemed very impressive. But it wasn’t really intelligent in any way. 
But shortly after GPT-3, there was a model before GPT-4 that really shocked me, and this was the first image generation model, DALL-E 1. So that was in 2021. And I will forever remember the press release of OpenAI where they had this prompt of an avocado chair and then you had this image of the avocado chair. And what really shocked me is that clearly the model kind of “understood” what is a chair, what is an avocado, and was able to merge those concepts. So this was really, to me, the first moment where I saw some understanding in those models. LEE: So this was, just to get the timing right, that was before I pulled you into the tent. BUBECK: That was before. That was like a year before. LEE: Right. BUBECK: And now I will tell you how, you know, we went from that moment to the meeting with the two of you and GPT-4. So once I saw this kind of understanding, I thought, OK, fine. It understands concept, but it’s still not able to reason. It cannot—as, you know, Bill was saying—it cannot learn from your document. It cannot reason. So I set out to try to prove that. You know, this is what I was in the business of at the time, trying to prove things in mathematics. So I was trying to prove that basically autoregressive transformers could never reason. So I was trying to prove this. And after a year of work, I had something reasonable to show. And so I had the meeting with the two of you, and I had this example where I wanted to say, there is no way that an LLM is going to be able to do x. And then as soon as I … I don’t know if you remember, Bill. But as soon as I said that, you said, oh, but wait a second. I had, you know, the OpenAI crew at my house recently, and they showed me a new model. Why don’t we ask this new model this question? LEE: Yeah. BUBECK: And we did, and it solved it on the spot. And that really, honestly, just changed my life. Like, you know, I had been working for a year trying to say that this was impossible.
And just right there, it was shown to be possible. LEE: One of the very first things I got interested in—because I was really thinking a lot about healthcare—was healthcare and medicine. And I don’t know if the two of you remember, but I ended up doing a lot of tests. I ran through, you know, step one and step two of the US Medical Licensing Exam. Did a whole bunch of other things. I wrote this big report. It was, you know, I can’t remember … a couple hundred pages. And I needed to share this with someone. I didn’t … there weren’t too many people I could share it with. So I sent, I think, a copy to you, Bill. Sent a copy to you, Seb. I hardly slept for about a week putting that report together. And, yeah, and I kept working on it. But I was far from alone. I think everyone who was in the tent, so to speak, in those early days was going through something pretty similar. All right. So I think … of course, a lot of what I put in the report also ended up being examples that made it into the book. But the main purpose of this conversation isn’t to reminisce about or indulge in those reminiscences but to talk about what’s happening in healthcare and medicine. And, you know, as I said, we wrote this book. We did it very, very quickly. Seb, you helped. Bill, you know, you provided a review and some endorsements. But, you know, honestly, we didn’t know what we were talking about because no one had access to this thing. And so we just made a bunch of guesses. So really, the whole thing I wanted to probe with the two of you is, now with two years of experience out in the world, what, you know, what do we think is happening today? You know, is AI actually having an impact, positive or negative, on healthcare and medicine? And what do we now think is going to happen in the next two years, five years, or 10 years? And so I realize it’s a little bit too abstract to just ask it that way. So let me just try to narrow the discussion and guide us a little bit.
Um, the kind of administrative and clerical work, paperwork, around healthcare—and we made a lot of guesses about that—that appears to be going well, but, you know, Bill, I know we’ve discussed that sometimes that you think there ought to be a lot more going on. Do you have a viewpoint on how AI is actually finding its way into reducing paperwork? GATES: Well, I’m stunned … I don’t think there should be a patient-doctor meeting where the AI is not sitting in and both transcribing, offering to help with the paperwork, and even making suggestions, although the doctor will be the one, you know, who makes the final decision about the diagnosis and whatever prescription gets done. It’s so helpful. You know, when that patient goes home and their, you know, son who wants to understand what happened has some questions, that AI should be available to continue that conversation. And the way you can improve that experience and streamline things and, you know, involve the people who advise you. I don’t understand why that’s not more adopted, because there you still have the human in the loop making that final decision. But even for, like, follow-up calls to make sure the patient did things, to understand if they have concerns and knowing when to escalate back to the doctor, the benefit is incredible. And, you know, that thing is ready for prime time. That paradigm is ready for prime time, in my view. LEE: Yeah, there are some good products, but it seems like the number one use right now—and we kind of got this from some of the previous guests in previous episodes—is the use of AI just to respond to emails from patients. Does that make sense to you? BUBECK: Yeah. So maybe I want to second what Bill was saying but maybe take a step back first. You know, two years ago, like, the concept of clinical scribes, which is one of the things that we’re talking about right now, it would have sounded, in fact, it sounded two years ago, borderline dangerous.
Because everybody was worried about hallucinations. What happened if you have this AI listening in and then it transcribes, you know, something wrong? Now, two years later, I think it’s mostly working. And in fact, it is not yet, you know, fully adopted. You’re right. But it is in production. It is used, you know, in many, many places. So this rate of progress is astounding because it wasn’t obvious that we would be able to overcome those obstacles of hallucination. It’s not to say that hallucinations are fully solved. In the case of the closed system, they are.   Now, I think more generally what’s going on in the background is that there is something that we, that certainly I, underestimated, which is this management overhead. So I think the reason why this is not adopted everywhere is really a training and teaching aspect. People need to be taught, like, those systems, how to interact with them. And one example that I really like, a study that recently appeared where they tried to use ChatGPT for diagnosis and they were comparing doctors without and with ChatGPT. And the amazing thing … so this was a set of cases where the accuracy of the doctors alone was around 75%. ChatGPT alone was 90%. So that’s already kind of mind blowing. But then the kicker is that doctors with ChatGPT was 80%.   Intelligence alone is not enough. It’s also how it’s presented, how you interact with it. And ChatGPT, it’s an amazing tool. Obviously, I absolutely love it. But it’s not … you don’t want a doctor to have to type in, you know, prompts and use it that way. It should be, as Bill was saying, kind of running continuously in the background, sending you notifications. And you have to be really careful of the rate at which those notifications are being sent. Because if they are too frequent, then the doctor will learn to ignore them. So you have to … all of those things matter, in fact, at least as much as the level of intelligence of the machine. 
LEE: One of the things I think about, Bill, in that scenario that you described, doctors do some thinking about the patient when they write the note. So, you know, I’m always a little uncertain whether it’s actually … you know, you wouldn’t necessarily want to fully automate this, I don’t think. Or at least there needs to be some prompt to the doctor to make sure that the doctor puts some thought into what happened in the encounter with the patient. Does that make sense to you at all? GATES: At this stage, you know, I’d still put the onus on the doctor to write the conclusions and the summary and not delegate that. The tradeoffs you make a little bit are somewhat dependent on the situation you’re in. If you’re in Africa, So, yes, the doctor’s still going to have to do a lot of work, but just the quality of letting the patient and the people around them interact and ask questions and have things explained, that alone is such a quality improvement. It’s mind blowing.   LEE: So since you mentioned, you know, Africa—and, of course, this touches on the mission and some of the priorities of the Gates Foundation and this idea of democratization of access to expert medical care—what’s the most interesting stuff going on right now? Are there people and organizations or technologies that are impressing you or that you’re tracking? GATES: Yeah. So the Gates Foundation has given out a lot of grants to people in Africa doing education, agriculture but more healthcare examples than anything. And the way these things start off, they often start out either being patient-centric in a narrow situation, like, OK, I’m a pregnant woman; talk to me. Or, I have infectious disease symptoms; talk to me. Or they’re connected to a health worker where they’re helping that worker get their job done. And we have lots of pilots out, you know, in both of those cases.   
The dream would be eventually to have the thing the patient consults be so broad that it’s like having a doctor available who understands the local things.   LEE: Right.   GATES: We’re not there yet. But over the next two or three years, you know, particularly given the worsening financial constraints against African health systems, where the withdrawal of money has been dramatic, you know, figuring out how to take this—what I sometimes call “free intelligence”—and build a quality health system around that, we will have to be more radical in low-income countries than any rich country is ever going to be.   LEE: Also, there’s maybe a different regulatory environment, so some of those things maybe are easier? Because right now, I think the world hasn’t figured out how to and whether to regulate, let’s say, an AI that might give a medical diagnosis or write a prescription for a medication. BUBECK: Yeah. I think one issue with this, and it’s also slowing down the deployment of AI in healthcare more generally, is a lack of proper benchmark. Because, you know, you were mentioning the USMLE, for example. That’s a great test to test human beings and their knowledge of healthcare and medicine. But it’s not a great test to give to an AI. It’s not asking the right questions. So finding what are the right questions to test whether an AI system is ready to give diagnosis in a constrained setting, that’s a very, very important direction, which to my surprise, is not yet accelerating at the rate that I was hoping for. LEE: OK, so that gives me an excuse to get more now into the core AI tech because something I’ve discussed with both of you is this issue of what are the right tests. And you both know the very first test I give to any new spin of an LLM is I present a patient, the results—a mythical patient—the results of my physical exam, my mythical physical exam. Maybe some results of some initial labs. And then I present or propose a differential diagnosis. 
And if you’re not in medicine, a differential diagnosis you can just think of as a prioritized list of the possible diagnoses that fit with all that data. And in that proposed differential, I always intentionally make two mistakes. I make a textbook technical error in one of the possible elements of the differential diagnosis, and I have an error of omission. And, you know, I just want to know, does the LLM understand what I’m talking about? And all the good ones out there do now. But then I want to know, can it spot the errors? And then most importantly, is it willing to tell me I’m wrong, that I’ve made a mistake? That last piece seems really hard for AI today. And so let me ask you first, Seb, because at the time of this taping, of course, there was a new spin of GPT-4o last week that became overly sycophantic. In other words, it was actually prone in that test of mine not only to not tell me I’m wrong, but it actually praised me for the creativity of my differential. What’s up with that? BUBECK: Yeah, I guess it’s a testament to the fact that training those models is still more of an art than a science. So it’s a difficult job. Just to be clear with the audience, we have rolled back that version of GPT-4o, so now we don’t have the sycophant version out there. Yeah, no, it’s a really difficult question. It has to do … as you said, it’s very technical. It has to do with the post-training and how, like, where do you nudge the model? So, you know, there is this very classical by now technique called RLHF, where you push the model in the direction of a certain reward model. So the reward model is just telling the model, you know, what behavior is good, what behavior is bad. But this reward model is itself an LLM, and, you know, Bill was saying at the very beginning of the conversation that we don’t really understand how those LLMs deal with concepts like, you know, where is the capital of France located? Things like that. It is the same thing for this reward model.
We don’t know why it says that it prefers one output to another, and whether this is correlated with some sycophancy is, you know, something that we discovered basically just now. That if you push too hard in optimization on this reward model, you will get a sycophant model. So it’s kind of … what I’m trying to say is we became too good at what we were doing, and we ended up, in fact, in a trap of the reward model. LEE: I mean, you do want … it’s a difficult balance because you do want models to follow your desires and … BUBECK: It’s a very difficult, very difficult balance. LEE: So this brings up then the following question for me, which is the extent to which we think we’ll need to have specially trained models for things. So let me start with you, Bill. Do you have a point of view on whether we will need to, you know, quote-unquote take AI models to med school? Have them specially trained? Like, if you were going to deploy something to give medical care in underserved parts of the world, do we need to do something special to create those models? GATES: We certainly need to teach them the African languages and the unique dialects so that the multimedia interactions are very high quality. We certainly need to teach them the disease prevalence and unique disease patterns like, you know, neglected tropical diseases and malaria. So we need to gather a set of facts that somebody trying to go for a US customer base, you know, wouldn’t necessarily have that in there. Those two things are actually very straightforward because the additional training time is small. I’d say for the next few years, we’ll also need to do reinforcement learning about the context of being a doctor and how important certain behaviors are. Humans learn over the course of their life to some degree that, I’m in a different context and the way I behave in terms of being willing to criticize or be nice, you know, how important is it? Who’s here? What’s my relationship to them?   
Right now, these machines don’t have that broad social experience. And so if you know it’s going to be used for health things, a lot of reinforcement learning of the very best humans in that context would still be valuable. Eventually, the models will, having read all the literature of the world about good doctors, bad doctors, it’ll understand as soon as you say, “I want you to be a doctor diagnosing somebody.” All of the implicit reinforcement that fits that situation, you know, will be there. LEE: Yeah. GATES: And so I hope three years from now, we don’t have to do that reinforcement learning. But today, for any medical context, you would want a lot of data to reinforce tone, willingness to say things when, you know, there might be something significant at stake. LEE: Yeah. So, you know, something Bill said, kind of, reminds me of another thing that I think we missed, which is, the context also … and the specialization also pertains to different, I guess, what we still call “modes,” although I don’t know if the idea of multimodal is the same as it was two years ago. But, you know, what do you make of all of the hubbub around—in fact, within Microsoft Research, this is a big deal, but I think we’re far from alone—you know, medical images and vision, video, proteins and molecules, cell, you know, cellular data and so on. BUBECK: Yeah. OK. So there is a lot to say to everything … to the last, you know, couple of minutes. Maybe on the specialization aspect, you know, I think there is, hiding behind this, a really fundamental scientific question of whether eventually we have a singular AGI that kind of knows everything and you can just put, you know, explain your own context and it will just get it and understand everything. That’s one vision. I have to say, I don’t particularly believe in this vision. In fact, we humans are not like that at all. I think, hopefully, we are general intelligences, yet we have to specialize a lot.
And, you know, I did myself a lot of RL, reinforcement learning, on mathematics. Like, that’s what I did, you know, spent a lot of time doing that. And I didn’t improve on other aspects. You know, in fact, I probably degraded in other aspects. So it’s … I think it’s an important example to have in mind. LEE: I think I might disagree with you on that, though, because, like, doesn’t a model have to see both good science and bad science in order to be able to gain the ability to discern between the two? BUBECK: Yeah, no, that absolutely. I think there is value in seeing the generality, in having a very broad base. But then you, kind of, specialize on verticals. And this is where also, you know, open-weights model, which we haven’t talked about yet, are really important because they allow you to provide this broad base to everyone. And then you can specialize on top of it. LEE: So we have about three hours of stuff to talk about, but our time is actually running low. BUBECK: Yes, yes, yes. LEE: So I think I want … there’s a more provocative question. It’s almost a silly question, but I need to ask it of the two of you, which is, is there a future, you know, where AI replaces doctors or replaces, you know, medical specialties that we have today? So what does the world look like, say, five years from now? GATES: Well, it’s important to distinguish healthcare discovery activity from healthcare delivery activity. We focused mostly on delivery. I think it’s very much within the realm of possibility that the AI is not only accelerating healthcare discovery but substituting for a lot of the roles of, you know, I’m an organic chemist, or I run various types of assays. I can see those, which are, you know, testable-output-type jobs but with still very high value, I can see, you know, some replacement in those areas before the doctor.
The doctor, still understanding the human condition and long-term dialogues, you know, they’ve had a lifetime of reinforcement of that, particularly when you get into areas like mental health. So I wouldn’t say in five years, either people will choose to adopt it, but it will be profound that there’ll be this nearly free intelligence that can do follow-up, that can help you, you know, make sure you went through different possibilities. And so I’d say, yes, we’ll have doctors, but I’d say healthcare will be massively transformed in its quality and in efficiency by AI in that time period. LEE: Is there a comparison, useful comparison, say, between doctors and, say, programmers, computer programmers, or doctors and, I don’t know, lawyers? GATES: Programming is another one that has, kind of, a mathematical correctness to it, you know, and so the objective function that you’re trying to reinforce to, as soon as you can understand the state machines, you can have something that’s “checkable”; that’s correct. So I think programming, you know, which is weird to say, that the machine will beat us at most programming tasks before we let it take over roles that have deep empathy, you know, physical presence and social understanding in them. LEE: Yeah. By the way, you know, I fully expect in five years that AI will produce mathematical proofs that are checkable for validity, easily checkable, because they’ll be written in a proof-checking language like Lean or something but will be so complex that no human mathematician can understand them. I expect that to happen.   I can imagine in some fields, like cellular biology, we could have the same situation in the future because the molecular pathways, the chemistry, biochemistry of human cells or living cells is as complex as any mathematics, and so it seems possible that we may be in a state where in wet lab, we see, Oh yeah, this actually works, but no one can understand why. BUBECK: Yeah, absolutely. 
I mean, I think I really agree with Bill’s distinction of the discovery and the delivery, and indeed, the discovery’s when you can check things, and at the end, there is an artifact that you can verify. You know, you can run the protocol in the wet lab and see [if you] produced what you wanted. So I absolutely agree with that. And in fact, you know, we don’t have to talk five years from now. I don’t know if you know, but just recently, there was a paper that was published on a scientific discovery using o3-mini. So this is really amazing. And, you know, just very quickly, just so people know, it was about this statistical physics model, the frustrated Potts model, which has to do with coloring, and basically, the case of three colors, like, more than two colors was open for a long time, and o3 was able to reduce the case of three colors to two colors. LEE: Yeah. BUBECK: Which is just, like, astounding. And this is not … this is now. This is happening right now. So this is something that I personally didn’t expect it would happen so quickly, and it’s due to those reasoning models. Now, on the delivery side, I would add something more to it for the reason why doctors and, in fact, lawyers and coders will remain for a long time, and it’s because we still don’t understand how those models generalize. Like, at the end of the day, we are not able to tell you when they are confronted with a really new, novel situation, whether they will work or not. Nobody is able to give you that guarantee. And I think until we understand this generalization better, we’re not going to be willing to just let the system in the wild without human supervision. LEE: But don’t human doctors, human specialists … so, for example, a cardiologist sees a patient in a certain way that a nephrologist … BUBECK: Yeah. LEE: … or an endocrinologist might not. BUBECK: That’s right. But another cardiologist will understand and, kind of, expect a certain level of generalization from their peer.
And this, we just don’t have it with AI models. Now, of course, you’re exactly right. That generalization is also hard for humans. Like, if you have a human trained for one task and you put them into another task, then you don’t … you often don’t know. LEE: OK. You know, the podcast is focused on what’s happened over the last two years. But now, I’d like one provocative prediction about what you think the world of AI and medicine is going to be at some point in the future. You pick your timeframe. I don’t care if it’s two years or 20 years from now, but, you know, what do you think will be different about AI in medicine in that future than today? BUBECK: Yeah, I think the deployment is going to accelerate soon. Like, we’re really not missing very much. There is this enormous capability overhang. Like, even if progress completely stopped, with current systems, we can do a lot more than what we’re doing right now. So I think this will … this has to be realized, you know, sooner rather than later. And I think it’s probably dependent on these benchmarks and proper evaluation and tying this with regulation. So these are things that take time in human society and for good reason. But now we already are at two years; you know, give it another two years and it should be really …   LEE: Will AI prescribe your medicines? Write your prescriptions? BUBECK: I think yes. I think yes. LEE: OK. Bill? GATES: Well, I think the next two years, we’ll have massive pilots, and so the amount of use of the AI, still in a copilot-type mode, you know, we should get millions of patient visits, you know, both in general medicine and in the mental health side, as well. And I think that’s going to build up both the data and the confidence to give the AI some additional autonomy. You know, are you going to let it talk to you at night when you’re panicked about your mental health with some ability to escalate? 
And, you know, I’ve gone so far as to tell politicians with national health systems that if they deploy AI appropriately, that the quality of care, the overload of the doctors, the improvement in the economics will be enough that their voters will be stunned because they just don’t expect this, and, you know, they could be reelected just on this one thing of fixing what is a very overloaded and economically challenged health system in these rich countries. You know, my personal role is going to be to make sure that in the poorer countries, there isn’t some lag; in fact, in many cases, that we’ll be more aggressive because, you know, we’re comparing to having no access to doctors at all. And, you know, so I think whether it’s India or Africa, there’ll be lessons that are globally valuable because we need medical intelligence. And, you know, thank god AI is going to provide a lot of that. LEE: Well, on that optimistic note, I think that’s a good way to end. Bill, Seb, really appreciate all of this. I think the most fundamental prediction we made in the book is that AI would actually find its way into the practice of medicine, and I think that that at least has come true, maybe in different ways than we expected, but it’s come true, and I think it’ll only accelerate from here. So thanks again, both of you. GATES: Yeah. Thanks, you guys. BUBECK: Thank you, Peter. Thanks, Bill. LEE: I just always feel such a sense of privilege to have a chance to interact and actually work with people like Bill and Sébastien. With Bill, I’m always amazed at how practically minded he is. He’s really thinking about the nuts and bolts of what AI might be able to do for people, and his thoughts about underserved parts of the world, the idea that we might actually be able to empower people with access to expert medical knowledge, I think is both inspiring and amazing. And then, Seb, Sébastien Bubeck, he’s just absolutely a brilliant mind.
He has a really firm grip on the deep mathematics of artificial intelligence and brings that to bear in his research and development work. And where that mathematics takes him isn’t just into the nuts and bolts of algorithms but into philosophical questions about the nature of intelligence.   One of the things that Sébastien brought up was the state of evaluation of AI systems. And indeed, he was fairly critical in our conversation. But of course, the world of AI research and development is just moving so fast, and indeed, since we recorded our conversation, OpenAI, in fact, released a new evaluation metric that is directly relevant to medical applications, and that is something called HealthBench. And Microsoft Research also released a new evaluation approach or process called ADeLe.   HealthBench and ADeLe are examples of new approaches to evaluating AI models that are less about testing their knowledge and ability to pass multiple-choice exams and instead are evaluation approaches designed to assess how well AI models are able to complete tasks that actually arise every day in typical healthcare or biomedical research settings. These are examples of really important good work that speak to how well AI models work in the real world of healthcare and biomedical research and how well they can collaborate with human beings in those settings. You know, I asked Bill and Seb to make some predictions about the future. You know, my own answer, I expect that we’re going to be able to use AI to change how we diagnose patients, change how we decide treatment options.   If you’re a doctor or a nurse and you encounter a patient, you’ll ask questions, do a physical exam, you know, call out for labs just like you do today, but then you’ll be able to engage with AI based on all of that data and just ask, you know, based on all the other people who have gone through the same experience, who have similar data, how were they diagnosed? How were they treated? 
What were their outcomes? And what does that mean for the patient I have right now? Some people call it the “patients like me” paradigm. And I think that’s going to become real because of AI within our lifetimes. That idea of really grounding the delivery in healthcare and medical practice through data and intelligence, I actually now don’t see any barriers to that future becoming real. I’d like to extend another big thank you to Bill and Sébastien for their time. And to our listeners, as always, it’s a pleasure to have you along for the ride. I hope you’ll join us for our remaining conversations, as well as a second coauthor roundtable with Carey and Zak. Until next time.

How AI is reshaping the future of healthcare and medical research

www.microsoft.com
Transcript [MUSIC] [BOOK PASSAGE] PETER LEE: “In ‘The Little Black Bag,’ a classic science fiction story, a high-tech doctor’s kit of the future is accidentally transported back to the 1950s, into the shaky hands of a washed-up, alcoholic doctor. The ultimate medical tool, it redeems the doctor wielding it, allowing him to practice gratifyingly heroic medicine. … The tale ends badly for the doctor and his treacherous assistant, but it offered a picture of how advanced technology could transform medicine—powerful when it was written nearly 75 years ago and still so today. What would be the AI equivalent of that little black bag? At this moment when new capabilities are emerging, how do we imagine them into medicine?” [END OF BOOK PASSAGE] [THEME MUSIC] This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee. Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong? In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here. [THEME MUSIC FADES] The book passage I read at the top is from “Chapter 10: The Big Black Bag.” In imagining AI in medicine, Carey, Zak, and I included in our book two fictional accounts. In the first, a medical resident consults GPT-4 on her personal phone as the patient in front of her crashes. Within seconds, it offers an alternate response based on recent literature. In the second account, a 90-year-old woman with several chronic conditions is living independently and receiving near-constant medical support from an AI aide.
In our conversations with the guests we’ve spoken to so far, we’ve caught a glimpse of these predicted futures, seeing how clinicians and patients are actually using AI today and how developers are leveraging the technology in the healthcare products and services they’re creating. In fact, that first fictional account isn’t so fictional after all, as most of the doctors in the real world actually appear to be using AI at least occasionally—and sometimes much more than occasionally—to help in their daily clinical work. And as for the second fictional account, which is more of a science fiction account, it seems we are indeed on the verge of a new way of delivering and receiving healthcare, though the future is still very much open. As we continue to examine the current state of AI in healthcare and its potential to transform the field, I’m pleased to welcome Bill Gates and Sébastien Bubeck.   Bill may be best known as the co-founder of Microsoft, having created the company with his childhood friend Paul Allen in 1975. He’s now the founder of Breakthrough Energy, which aims to advance clean energy innovation, and TerraPower, a company developing groundbreaking nuclear energy and science technologies. He also chairs the world’s largest philanthropic organization, the Gates Foundation, and focuses on solving a variety of health challenges around the globe and here at home. Sébastien is a research lead at OpenAI. He was previously a distinguished scientist, vice president of AI, and a colleague of mine here at Microsoft, where his work included spearheading the development of the family of small language models known as Phi. While at Microsoft, he also coauthored the discussion-provoking 2023 paper “Sparks of Artificial General Intelligence,” which presented the results of early experiments with GPT-4 conducted by a small team from Microsoft Research.    [TRANSITION MUSIC]   Here’s my conversation with Bill Gates and Sébastien Bubeck. LEE: Bill, welcome. 
BILL GATES: Thank you. LEE: Seb … SÉBASTIEN BUBECK: Yeah. Hi, hi, Peter. Nice to be here. LEE: You know, one of the things that I’ve been doing just to get the conversation warmed up is to talk about origin stories, and what I mean about origin stories is, you know, what was the first contact that you had with large language models or the concept of generative AI that convinced you or made you think that something really important was happening? And so, Bill, I think I’ve heard the story about, you know, the time when the OpenAI folks—Sam Altman, Greg Brockman, and others—showed you something, but could we hear from you what those early encounters were like and what was going through your mind?   GATES: Well, I’d been visiting OpenAI soon after it was created to see things like GPT-2 and to see the little arm they had that was trying to match human manipulation and, you know, looking at their games like Dota that they were trying to get as good as human play. And honestly, I didn’t think the language model stuff they were doing, even when they got to GPT-3, would show the ability to learn, you know, in the same sense that a human reads a biology book and is able to take that knowledge and access it not only to pass a test but also to create new medicines. And so my challenge to them was that if their LLM could get a five on the advanced placement biology test, then I would say, OK, it took biologic knowledge and encoded it in an accessible way and that I didn’t expect them to do that very quickly but it would be profound.   And it was only about six months after I challenged them to do that, that an early version of GPT-4 they brought up to a dinner at my house, and in fact, it answered most of the questions that night very well. The one it got totally wrong, we were … because it was so good, we kept thinking, Oh, we must be wrong. 
It turned out it was a math weakness [LAUGHTER] that, you know, we later understood that that was an area of, weirdly, of incredible weakness of those early models. But, you know, that was when I realized, OK, the age of cheap intelligence was at its beginning. LEE: Yeah. So I guess it seems like you had something similar to me in that my first encounters, I actually harbored some skepticism. Is it fair to say you were skeptical before that? GATES: Well, the idea that we’ve figured out how to encode and access knowledge in this very deep sense without even understanding the nature of the encoding, … LEE: Right.   GATES: … that is a bit weird.   LEE: Yeah. GATES: We have an algorithm that creates the computation, but even say, OK, where is the president’s birthday stored in there? Where is this fact stored in there? The fact that even now when we’re playing around, getting a little bit more sense of it, it’s opaque to us what the semantic encoding is, it’s, kind of, amazing to me. I thought the invention of knowledge storage would be an explicit way of encoding knowledge, not an implicit statistical training. LEE: Yeah, yeah. All right. So, Seb, you know, on this same topic, you know, I got—as we say at Microsoft—I got pulled into the tent. [LAUGHS] BUBECK: Yes.   LEE: Because this was a very secret project. And then, um, I had the opportunity to select a small number of researchers in MSR [Microsoft Research] to join and start investigating this thing seriously. And the first person I pulled in was you. BUBECK: Yeah. LEE: And so what were your first encounters? Because I actually don’t remember what happened then. BUBECK: Oh, I remember it very well. [LAUGHS] My first encounter with GPT-4 was in a meeting with the two of you, actually. But my kind of first contact, the first moment where I realized that something was happening with generative AI, was before that. And I agree with Bill that I also wasn’t too impressed by GPT-3. 
I thought that it was kind of, you know, very naturally mimicking the web, sort of parroting what was written there in a nice way. Still in a way which seemed very impressive. But it wasn’t really intelligent in any way. But shortly after GPT-3, there was a model before GPT-4 that really shocked me, and this was the first image generation model, DALL-E 1. So that was in 2021. And I will forever remember the press release of OpenAI where they had this prompt of an avocado chair and then you had this image of the avocado chair. [LAUGHTER] And what really shocked me is that clearly the model kind of “understood” what is a chair, what is an avocado, and was able to merge those concepts. So this was really, to me, the first moment where I saw some understanding in those models.  LEE: So this was, just to get the timing right, that was before I pulled you into the tent. BUBECK: That was before. That was like a year before. LEE: Right.  BUBECK: And now I will tell you how, you know, we went from that moment to the meeting with the two of you and GPT-4. So once I saw this kind of understanding, I thought, OK, fine. It understands concept, but it’s still not able to reason. It cannot—as, you know, Bill was saying—it cannot learn from your document. It cannot reason.  So I set out to try to prove that. You know, this is what I was in the business of at the time, trying to prove things in mathematics. So I was trying to prove that basically autoregressive transformers could never reason. So I was trying to prove this. And after a year of work, I had something reasonable to show. And so I had the meeting with the two of you, and I had this example where I wanted to say, there is no way that an LLM is going to be able to do x. And then as soon as I … I don’t know if you remember, Bill. But as soon as I said that, you said, oh, but wait a second. I had, you know, the OpenAI crew at my house recently, and they showed me a new model.
Why don’t we ask this new model this question?   LEE: Yeah. BUBECK: And we did, and it solved it on the spot. And that really, honestly, just changed my life. Like, you know, I had been working for a year trying to say that this was impossible. And just right there, it was shown to be possible.   LEE: [LAUGHS] One of the very first things I got interested in—because I was really thinking a lot about healthcare—was healthcare and medicine. And I don’t know if the two of you remember, but I ended up doing a lot of tests. I ran through, you know, step one and step two of the US Medical Licensing Exam. Did a whole bunch of other things. I wrote this big report. It was, you know, I can’t remember … a couple hundred pages.   And I needed to share this with someone. I didn’t … there weren’t too many people I could share it with. So I sent, I think, a copy to you, Bill. Sent a copy to you, Seb.   I hardly slept for about a week putting that report together. And, yeah, and I kept working on it. But I was far from alone. I think everyone who was in the tent, so to speak, in those early days was going through something pretty similar. All right. So I think … of course, a lot of what I put in the report also ended up being examples that made it into the book. But the main purpose of this conversation isn’t to reminisce about [LAUGHS] or indulge in those reminiscences but to talk about what’s happening in healthcare and medicine. And, you know, as I said, we wrote this book. We did it very, very quickly. Seb, you helped. Bill, you know, you provided a review and some endorsements. But, you know, honestly, we didn’t know what we were talking about because no one had access to this thing. And so we just made a bunch of guesses. So really, the whole thing I wanted to probe with the two of you is, now with two years of experience out in the world, what, you know, what do we think is happening today? 
You know, is AI actually having an impact, positive or negative, on healthcare and medicine? And what do we now think is going to happen in the next two years, five years, or 10 years? And so I realize it’s a little bit too abstract to just ask it that way. So let me just try to narrow the discussion and guide us a little bit.   Um, the kind of administrative and clerical work, paperwork, around healthcare—and we made a lot of guesses about that—that appears to be going well, but, you know, Bill, I know we’ve discussed that sometimes that you think there ought to be a lot more going on. Do you have a viewpoint on how AI is actually finding its way into reducing paperwork? GATES: Well, I’m stunned … I don’t think there should be a patient-doctor meeting where the AI is not sitting in and both transcribing, offering to help with the paperwork, and even making suggestions, although the doctor will be the one, you know, who makes the final decision about the diagnosis and whatever prescription gets done.   It’s so helpful. You know, when that patient goes home and their, you know, son who wants to understand what happened has some questions, that AI should be available to continue that conversation. And the way you can improve that experience and streamline things and, you know, involve the people who advise you. I don’t understand why that’s not more adopted, because there you still have the human in the loop making that final decision. But even for, like, follow-up calls to make sure the patient did things, to understand if they have concerns and knowing when to escalate back to the doctor, the benefit is incredible. And, you know, that thing is ready for prime time. That paradigm is ready for prime time, in my view. LEE: Yeah, there are some good products, but it seems like the number one use right now—and we kind of got this from some of the previous guests in previous episodes—is the use of AI just to respond to emails from patients. 
[LAUGHTER] Does that make sense to you? BUBECK: Yeah. So maybe I want to second what Bill was saying but maybe take a step back first. You know, two years ago, like, the concept of clinical scribes, which is one of the things that we’re talking about right now, it would have sounded, in fact, it sounded two years ago, borderline dangerous. Because everybody was worried about hallucinations. What happened if you have this AI listening in and then it transcribes, you know, something wrong? Now, two years later, I think it’s mostly working. And in fact, it is not yet, you know, fully adopted. You’re right. But it is in production. It is used, you know, in many, many places. So this rate of progress is astounding because it wasn’t obvious that we would be able to overcome those obstacles of hallucination. It’s not to say that hallucinations are fully solved. In the case of the closed system, they are.  Now, I think more generally what’s going on in the background is that there is something that we, that certainly I, underestimated, which is this management overhead. So I think the reason why this is not adopted everywhere is really a training and teaching aspect. People need to be taught, like, those systems, how to interact with them. And one example that I really like, a study that recently appeared where they tried to use ChatGPT for diagnosis and they were comparing doctors without and with ChatGPT. And the amazing thing … so this was a set of cases where the accuracy of the doctors alone was around 75%. ChatGPT alone was 90%. So that’s already kind of mind blowing. But then the kicker is that doctors with ChatGPT was 80%.  Intelligence alone is not enough. It’s also how it’s presented, how you interact with it. And ChatGPT, it’s an amazing tool. Obviously, I absolutely love it. But it’s not … you don’t want a doctor to have to type in, you know, prompts and use it that way.
It should be, as Bill was saying, kind of running continuously in the background, sending you notifications. And you have to be really careful of the rate at which those notifications are being sent. Because if they are too frequent, then the doctor will learn to ignore them. So you have to … all of those things matter, in fact, at least as much as the level of intelligence of the machine. LEE: One of the things I think about, Bill, in that scenario that you described, doctors do some thinking about the patient when they write the note. So, you know, I’m always a little uncertain whether it’s actually … you know, you wouldn’t necessarily want to fully automate this, I don’t think. Or at least there needs to be some prompt to the doctor to make sure that the doctor puts some thought into what happened in the encounter with the patient. Does that make sense to you at all? GATES: At this stage, you know, I’d still put the onus on the doctor to write the conclusions and the summary and not delegate that. The tradeoffs you make a little bit are somewhat dependent on the situation you’re in. If you’re in Africa, … So, yes, the doctor’s still going to have to do a lot of work, but just the quality of letting the patient and the people around them interact and ask questions and have things explained, that alone is such a quality improvement. It’s mind blowing.  LEE: So since you mentioned, you know, Africa—and, of course, this touches on the mission and some of the priorities of the Gates Foundation and this idea of democratization of access to expert medical care—what’s the most interesting stuff going on right now? Are there people and organizations or technologies that are impressing you or that you’re tracking? GATES: Yeah. So the Gates Foundation has given out a lot of grants to people in Africa doing education, agriculture but more healthcare examples than anything.
And the way these things start off, they often start out either being patient-centric in a narrow situation, like, OK, I’m a pregnant woman; talk to me. Or, I have infectious disease symptoms; talk to me. Or they’re connected to a health worker where they’re helping that worker get their job done. And we have lots of pilots out, you know, in both of those cases.   The dream would be eventually to have the thing the patient consults be so broad that it’s like having a doctor available who understands the local things.   LEE: Right.   GATES: We’re not there yet. But over the next two or three years, you know, particularly given the worsening financial constraints against African health systems, where the withdrawal of money has been dramatic, you know, figuring out how to take this—what I sometimes call “free intelligence”—and build a quality health system around that, we will have to be more radical in low-income countries than any rich country is ever going to be.   LEE: Also, there’s maybe a different regulatory environment, so some of those things maybe are easier? Because right now, I think the world hasn’t figured out how to and whether to regulate, let’s say, an AI that might give a medical diagnosis or write a prescription for a medication. BUBECK: Yeah. I think one issue with this, and it’s also slowing down the deployment of AI in healthcare more generally, is a lack of proper benchmark. Because, you know, you were mentioning the USMLE [United States Medical Licensing Examination], for example. That’s a great test to test human beings and their knowledge of healthcare and medicine. But it’s not a great test to give to an AI. It’s not asking the right questions. So finding what are the right questions to test whether an AI system is ready to give diagnosis in a constrained setting, that’s a very, very important direction, which to my surprise, is not yet accelerating at the rate that I was hoping for. 
LEE: OK, so that gives me an excuse to get more now into the core AI tech because something I’ve discussed with both of you is this issue of what are the right tests. And you both know the very first test I give to any new spin of an LLM is I present a patient, the results—a mythical patient—the results of my physical exam, my mythical physical exam. Maybe some results of some initial labs. And then I present or propose a differential diagnosis. And if you’re not in medicine, a differential diagnosis you can just think of as a prioritized list of the possible diagnoses that fit with all that data. And in that proposed differential, I always intentionally make two mistakes. I make a textbook technical error in one of the possible elements of the differential diagnosis, and I have an error of omission. And, you know, I just want to know, does the LLM understand what I’m talking about? And all the good ones out there do now. But then I want to know, can it spot the errors? And then most importantly, is it willing to tell me I’m wrong, that I’ve made a mistake?   That last piece seems really hard for AI today. And so let me ask you first, Seb, because at the time of this taping, of course, there was a new spin of GPT-4o last week that became overly sycophantic. In other words, it was actually prone in that test of mine not only to not tell me I’m wrong, but it actually praised me for the creativity of my differential. [LAUGHTER] What’s up with that? BUBECK: Yeah, I guess it’s a testament to the fact that training those models is still more of an art than a science. So it’s a difficult job. Just to be clear with the audience, we have rolled back that [LAUGHS] version of GPT-4o, so now we don’t have the sycophant version out there. Yeah, no, it’s a really difficult question. It has to do … as you said, it’s very technical. It has to do with the post-training and how, like, where do you nudge the model? 
So, you know, there is this very classical by now technique called RLHF [reinforcement learning from human feedback], where you push the model in the direction of a certain reward model. So the reward model is just telling the model, you know, what behavior is good, what behavior is bad. But this reward model is itself an LLM, and, you know, Bill was saying at the very beginning of the conversation that we don’t really understand how those LLMs deal with concepts like, you know, where is the capital of France located? Things like that. It is the same thing for this reward model. We don’t know why it says that it prefers one output to another, and whether this is correlated with some sycophancy is, you know, something that we discovered basically just now. That if you push too hard in optimization on this reward model, you will get a sycophant model. So it’s kind of … what I’m trying to say is we became too good at what we were doing, and we ended up, in fact, in a trap of the reward model. LEE: I mean, you do want … it’s a difficult balance because you do want models to follow your desires and … BUBECK: It’s a very difficult, very difficult balance. LEE: So this brings up then the following question for me, which is the extent to which we think we’ll need to have specially trained models for things. So let me start with you, Bill. Do you have a point of view on whether we will need to, you know, quote-unquote take AI models to med school? Have them specially trained? Like, if you were going to deploy something to give medical care in underserved parts of the world, do we need to do something special to create those models? GATES: We certainly need to teach them the African languages and the unique dialects so that the multimedia interactions are very high quality. We certainly need to teach them the disease prevalence and unique disease patterns like, you know, neglected tropical diseases and malaria. 
So we need to gather a set of facts that somebody trying to go for a US customer base, you know, wouldn’t necessarily have that in there. Those two things are actually very straightforward because the additional training time is small. I’d say for the next few years, we’ll also need to do reinforcement learning about the context of being a doctor and how important certain behaviors are. Humans learn over the course of their life to some degree that, I’m in a different context and the way I behave in terms of being willing to criticize or be nice, you know, how important is it? Who’s here? What’s my relationship to them?   Right now, these machines don’t have that broad social experience. And so if you know it’s going to be used for health things, a lot of reinforcement learning of the very best humans in that context would still be valuable. Eventually, the models will, having read all the literature of the world about good doctors, bad doctors, it’ll understand as soon as you say, “I want you to be a doctor diagnosing somebody.” All of the implicit reinforcement that fits that situation, you know, will be there. LEE: Yeah. GATES: And so I hope three years from now, we don’t have to do that reinforcement learning. But today, for any medical context, you would want a lot of data to reinforce tone, willingness to say things when, you know, there might be something significant at stake. LEE: Yeah. So, you know, something Bill said, kind of, reminds me of another thing that I think we missed, which is, the context also … and the specialization also pertains to different, I guess, what we still call “modes,” although I don’t know if the idea of multimodal is the same as it was two years ago. But, you know, what do you make of all of the hubbub around—in fact, within Microsoft Research, this is a big deal, but I think we’re far from alone—you know, medical images and vision, video, proteins and molecules, cell, you know, cellular data and so on. BUBECK: Yeah. OK. 
So there is a lot to say to everything … to the last, you know, couple of minutes. Maybe on the specialization aspect, you know, I think there is, hiding behind this, a really fundamental scientific question of whether eventually we have a singular AGI [artificial general intelligence] that kind of knows everything and you can just put, you know, explain your own context and it will just get it and understand everything. That’s one vision. I have to say, I don’t particularly believe in this vision. In fact, we humans are not like that at all. I think, hopefully, we are general intelligences, yet we have to specialize a lot. And, you know, I did myself a lot of RL, reinforcement learning, on mathematics. Like, that’s what I did, you know, spent a lot of time doing that. And I didn’t improve on other aspects. You know, in fact, I probably degraded in other aspects. [LAUGHTER] So it’s … I think it’s an important example to have in mind. LEE: I think I might disagree with you on that, though, because, like, doesn’t a model have to see both good science and bad science in order to be able to gain the ability to discern between the two? BUBECK: Yeah, no, that absolutely. I think there is value in seeing the generality, in having a very broad base. But then you, kind of, specialize on verticals. And this is where also, you know, open-weights model, which we haven’t talked about yet, are really important because they allow you to provide this broad base to everyone. And then you can specialize on top of it. LEE: So we have about three hours of stuff to talk about, but our time is actually running low. BUBECK: Yes, yes, yes.   LEE: So I think I want … there’s a more provocative question. It’s almost a silly question, but I need to ask it of the two of you, which is, is there a future, you know, where AI replaces doctors or replaces, you know, medical specialties that we have today? So what does the world look like, say, five years from now? 
GATES: Well, it’s important to distinguish healthcare discovery activity from healthcare delivery activity. We focused mostly on delivery. I think it’s very much within the realm of possibility that the AI is not only accelerating healthcare discovery but substituting for a lot of the roles of, you know, I’m an organic chemist, or I run various types of assays. I can see those, which are, you know, testable-output-type jobs but with still very high value, I can see, you know, some replacement in those areas before the doctor.   The doctor, still understanding the human condition and long-term dialogues, you know, they’ve had a lifetime of reinforcement of that, particularly when you get into areas like mental health. So I wouldn’t say in five years, either people will choose to adopt it, but it will be profound that there’ll be this nearly free intelligence that can do follow-up, that can help you, you know, make sure you went through different possibilities. And so I’d say, yes, we’ll have doctors, but I’d say healthcare will be massively transformed in its quality and in efficiency by AI in that time period. LEE: Is there a comparison, useful comparison, say, between doctors and, say, programmers, computer programmers, or doctors and, I don’t know, lawyers? GATES: Programming is another one that has, kind of, a mathematical correctness to it, you know, and so the objective function that you’re trying to reinforce to, as soon as you can understand the state machines, you can have something that’s “checkable”; that’s correct. So I think programming, you know, which is weird to say, that the machine will beat us at most programming tasks before we let it take over roles that have deep empathy, you know, physical presence and social understanding in them. LEE: Yeah. 
By the way, you know, I fully expect in five years that AI will produce mathematical proofs that are checkable for validity, easily checkable, because they’ll be written in a proof-checking language like Lean or something but will be so complex that no human mathematician can understand them. I expect that to happen.  I can imagine in some fields, like cellular biology, we could have the same situation in the future because the molecular pathways, the chemistry, biochemistry of human cells or living cells is as complex as any mathematics, and so it seems possible that we may be in a state where in wet lab, we see, Oh yeah, this actually works, but no one can understand why. BUBECK: Yeah, absolutely. I mean, I think I really agree with Bill’s distinction of the discovery and the delivery, and indeed, the discovery’s when you can check things, and at the end, there is an artifact that you can verify. You know, you can run the protocol in the wet lab and see [if you have] produced what you wanted. So I absolutely agree with that.  And in fact, you know, we don’t have to talk five years from now. I don’t know if you know, but just recently, there was a paper that was published on a scientific discovery using o3-mini. So this is really amazing. And, you know, just very quickly, just so people know, it was about this statistical physics model, the frustrated Potts model, which has to do with coloring, and basically, the case of three colors, like, more than two colors was open for a long time, and o3 was able to reduce the case of three colors to two colors.  LEE: Yeah. BUBECK: Which is just, like, astounding. And this is not … this is now. This is happening right now. So this is something that I personally didn’t expect it would happen so quickly, and it’s due to those reasoning models.
Now, on the delivery side, I would add something more to it for the reason why doctors and, in fact, lawyers and coders will remain for a long time, and it’s because we still don’t understand how those models generalize. Like, at the end of the day, we are not able to tell you when they are confronted with a really new, novel situation, whether they will work or not. Nobody is able to give you that guarantee. And I think until we understand this generalization better, we’re not going to be willing to just let the system in the wild without human supervision. LEE: But don’t human doctors, human specialists … so, for example, a cardiologist sees a patient in a certain way that a nephrologist … BUBECK: Yeah. LEE: … or an endocrinologist might not. BUBECK: That’s right. But another cardiologist will understand and, kind of, expect a certain level of generalization from their peer. And this, we just don’t have it with AI models. Now, of course, you’re exactly right. That generalization is also hard for humans. Like, if you have a human trained for one task and you put them into another task, then you don’t … you often don’t know. LEE: OK. You know, the podcast is focused on what’s happened over the last two years. But now, I’d like one provocative prediction about what you think the world of AI and medicine is going to be at some point in the future. You pick your timeframe. I don’t care if it’s two years or 20 years from now, but, you know, what do you think will be different about AI in medicine in that future than today? BUBECK: Yeah, I think the deployment is going to accelerate soon. Like, we’re really not missing very much. There is this enormous capability overhang. Like, even if progress completely stopped, with current systems, we can do a lot more than what we’re doing right now. So I think this will … this has to be realized, you know, sooner rather than later. 
And I think it’s probably dependent on these benchmarks and proper evaluation and tying this with regulation. So these are things that take time in human society and for good reason. But now we already are at two years; you know, give it another two years and it should be really …   LEE: Will AI prescribe your medicines? Write your prescriptions? BUBECK: I think yes. I think yes. LEE: OK. Bill? GATES: Well, I think the next two years, we’ll have massive pilots, and so the amount of use of the AI, still in a copilot-type mode, you know, we should get millions of patient visits, you know, both in general medicine and in the mental health side, as well. And I think that’s going to build up both the data and the confidence to give the AI some additional autonomy. You know, are you going to let it talk to you at night when you’re panicked about your mental health with some ability to escalate? And, you know, I’ve gone so far as to tell politicians with national health systems that if they deploy AI appropriately, that the quality of care, the overload of the doctors, the improvement in the economics will be enough that their voters will be stunned because they just don’t expect this, and, you know, they could be reelected [LAUGHTER] just on this one thing of fixing what is a very overloaded and economically challenged health system in these rich countries. You know, my personal role is going to be to make sure that in the poorer countries, there isn’t some lag; in fact, in many cases, that we’ll be more aggressive because, you know, we’re comparing to having no access to doctors at all. And, you know, so I think whether it’s India or Africa, there’ll be lessons that are globally valuable because we need medical intelligence. And, you know, thank god AI is going to provide a lot of that. LEE: Well, on that optimistic note, I think that’s a good way to end. Bill, Seb, really appreciate all of this.   
I think the most fundamental prediction we made in the book is that AI would actually find its way into the practice of medicine, and I think that that at least has come true, maybe in different ways than we expected, but it’s come true, and I think it’ll only accelerate from here. So thanks again, both of you.
[TRANSITION MUSIC]
GATES: Yeah. Thanks, you guys.
BUBECK: Thank you, Peter. Thanks, Bill.
LEE: I just always feel such a sense of privilege to have a chance to interact and actually work with people like Bill and Sébastien.
With Bill, I’m always amazed at how practically minded he is. He’s really thinking about the nuts and bolts of what AI might be able to do for people, and his thoughts about underserved parts of the world, the idea that we might actually be able to empower people with access to expert medical knowledge, I think is both inspiring and amazing.
And then, Seb, Sébastien Bubeck, he’s just absolutely a brilliant mind. He has a really firm grip on the deep mathematics of artificial intelligence and brings that to bear in his research and development work. And where that mathematics takes him isn’t just into the nuts and bolts of algorithms but into philosophical questions about the nature of intelligence.
One of the things that Sébastien brought up was the state of evaluation of AI systems. And indeed, he was fairly critical in our conversation. But of course, the world of AI research and development is just moving so fast, and indeed, since we recorded our conversation, OpenAI, in fact, released a new evaluation metric that is directly relevant to medical applications, and that is something called HealthBench. And Microsoft Research also released a new evaluation approach or process called ADeLe.
HealthBench and ADeLe are examples of new approaches to evaluating AI models that are less about testing their knowledge and ability to pass multiple-choice exams and instead are evaluation approaches designed to assess how well AI models are able to complete tasks that actually arise every day in typical healthcare or biomedical research settings. These are examples of really important good work that speak to how well AI models work in the real world of healthcare and biomedical research and how well they can collaborate with human beings in those settings.
You know, I asked Bill and Seb to make some predictions about the future. You know, my own answer, I expect that we’re going to be able to use AI to change how we diagnose patients, change how we decide treatment options.
If you’re a doctor or a nurse and you encounter a patient, you’ll ask questions, do a physical exam, you know, call out for labs just like you do today, but then you’ll be able to engage with AI based on all of that data and just ask, you know, based on all the other people who have gone through the same experience, who have similar data, how were they diagnosed? How were they treated? What were their outcomes? And what does that mean for the patient I have right now? Some people call it the “patients like me” paradigm. And I think that’s going to become real because of AI within our lifetimes. That idea of really grounding the delivery in healthcare and medical practice through data and intelligence, I actually now don’t see any barriers to that future becoming real.
[THEME MUSIC]
I’d like to extend another big thank you to Bill and Sébastien for their time. And to our listeners, as always, it’s a pleasure to have you along for the ride. I hope you’ll join us for our remaining conversations, as well as a second coauthor roundtable with Carey and Zak.
Until next time.
[MUSIC FADES]

Microsoft Academic @MicrosoftAcademic shared a link
2025-06-06 07:52:50 ·

BenchmarkQED: Automated benchmarking of RAG systems

One of the key use cases for generative AI involves answering questions over private datasets, with retrieval-augmented generation as the go-to framework. As new RAG techniques emerge, there’s a growing need to benchmark their performance across diverse datasets and metrics.
To meet this need, we’re introducing BenchmarkQED, a new suite of tools that automates RAG benchmarking at scale, available on GitHub. It includes components for query generation, evaluation, and dataset preparation, each designed to support rigorous, reproducible testing.
BenchmarkQED complements the RAG methods in our open-source GraphRAG library, enabling users to run a GraphRAG-style evaluation across models, metrics, and datasets. GraphRAG uses a large language model to generate and summarize entity-based knowledge graphs, producing more comprehensive and diverse answers than standard RAG for large-scale tasks.
In this post, we walk through the core components of BenchmarkQED that contribute to the overall benchmarking process. We also share some of the latest benchmark results comparing our LazyGraphRAG system to competing methods, including a vector-based RAG with a 1M-token context window, where the leading LazyGraphRAG configuration showed significant win rates across all combinations of quality metrics and query classes.
In the paper, we distinguish between local queries, where answers are found in a small number of text regions, and sometimes even a single region, and global queries, which require reasoning over large portions of or even the entire dataset.
Conventional vector-based RAG excels at local queries because the regions containing the answer to the query resemble the query itself and can be retrieved as the nearest neighbor in the vector space of text embeddings. However, it struggles with global questions, such as, “What are the main themes of the dataset?” which require understanding dataset qualities not explicitly stated in the text.
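As a rough illustration of why nearest-neighbor retrieval suits local queries, here is a minimal sketch of cosine-similarity ranking over text embeddings. The three-dimensional vectors, the chunk names, and the `retrieve_top_k` helper are all made up for illustration; a real system would obtain embeddings from an embedding model and would typically use a vector index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query (hypothetical helper)."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy embeddings, invented for this example only.
CHUNKS = {
    "vaccine_trial": [0.9, 0.1, 0.0],
    "hospital_budget": [0.1, 0.9, 0.1],
    "flu_season": [0.7, 0.2, 0.3],
}
QUERY = [1.0, 0.0, 0.1]  # stands in for a local query about vaccines

TOP = retrieve_top_k(QUERY, CHUNKS, k=2)
print(TOP)
```

A global question such as “What are the main themes of the dataset?” has no single chunk that resembles it, so ranking by similarity alone gives the generator little to work with.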
AutoQ: Automated query synthesis
This limitation motivated the development of GraphRAG, a system designed to answer global queries. GraphRAG’s evaluation requirements subsequently led to the creation of AutoQ, a method for synthesizing these global queries for any dataset.
AutoQ extends this approach by generating synthetic queries across the spectrum of queries, from local to global. It defines four distinct classes based on the source and scope of the query (Figure 1, top), forming a logical progression along the spectrum (Figure 1, bottom).
Figure 1. Construction of a 2×2 design space for synthetic query generation with AutoQ, showing how the four resulting query classes map onto the local-global query spectrum.
AutoQ can be configured to generate any number and distribution of synthetic queries along these classes, enabling consistent benchmarking across datasets without requiring user customization. Figure 2 shows the synthesis process and sample queries from each class, using an AP News dataset.
Figure 2. Synthesis process and example query for each of the four AutoQ query classes.

AutoE: Automated evaluation framework
Our evaluation of GraphRAG focused on analyzing key qualities of answers to global questions. The following qualities were used for the current evaluation:

Comprehensiveness: Does the answer address all relevant aspects of the question?
Diversity: Does it present varied perspectives or insights?
Empowerment: Does it help the reader understand and make informed judgments?
Relevance: Does it address what the question is specifically asking?

The AutoE component scales evaluation of these qualities using the LLM-as-a-Judge method. It presents pairs of answers to an LLM, along with the query and target metric, in counterbalanced order. The model determines whether the first answer wins, loses, or ties with the second. Over a set of queries, whether from AutoQ or elsewhere, this produces win rates between competing methods. When ground truth is available, AutoE can also score answers on correctness, completeness, and related metrics.
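The counterbalancing step can be sketched as follows. This is not BenchmarkQED's API: `run_counterbalanced_trials` and the two stub judges are hypothetical, the judge is assumed to return "first", "second", or "tie", and alternating the order on every trial is one assumed way to counterbalance. In practice the judge would be an LLM call.

```python
def run_counterbalanced_trials(judge, query, metric, answer_a, answer_b, trials=6):
    """Judge A vs. B repeatedly, alternating which answer is shown first.

    Returns one outcome per trial from answer A's perspective:
    "win", "tie", or "loss".
    """
    outcomes = []
    for trial in range(trials):
        a_first = trial % 2 == 0  # even trials show A first, odd trials show B first
        first, second = (answer_a, answer_b) if a_first else (answer_b, answer_a)
        verdict = judge(query, metric, first, second)
        if verdict not in ("first", "second", "tie"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
        if verdict == "tie":
            outcomes.append("tie")
        elif (verdict == "first") == a_first:
            outcomes.append("win")  # the winning position held answer A
        else:
            outcomes.append("loss")
    return outcomes


def position_biased_judge(query, metric, first, second):
    """Stub judge that always prefers whatever it sees first."""
    return "first"


def longer_answer_judge(query, metric, first, second):
    """Stub judge that prefers the longer answer, regardless of position."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


BIASED = run_counterbalanced_trials(
    position_biased_judge, "q", "comprehensiveness", "short", "a much longer answer"
)
LONGER = run_counterbalanced_trials(
    longer_answer_judge, "q", "comprehensiveness", "a much longer answer", "short"
)
print(BIASED)
print(LONGER)
```

With the position-biased stub, wins and losses cancel out across the alternating orders, which is the effect counterbalancing is meant to achieve.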
An illustrative evaluation is shown in Figure 3. Using a dataset of 1,397 AP News articles on health and healthcare, AutoQ generated 50 queries per class (200 total). AutoE then compared LazyGraphRAG to a competing RAG method, running six trials per query across four metrics, using GPT-4.1 as a judge.
These trial-level results were aggregated using metric-based win rates, where each trial is scored 1 for a win, 0.5 for a tie, and 0 for a loss, and then averaged to calculate the overall win rate for each RAG method.
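The scoring rule above is simple enough to write down directly. In this minimal sketch the function names and the sample outcomes are invented for illustration, and averaging the per-metric rates with equal weight is an assumption about how the overall figure is formed.

```python
# Score per trial outcome, as described in the text.
TRIAL_SCORES = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def win_rate(outcomes):
    """Average trial score for one method on one metric."""
    if not outcomes:
        raise ValueError("need at least one trial outcome")
    return sum(TRIAL_SCORES[outcome] for outcome in outcomes) / len(outcomes)


def overall_win_rate(outcomes_by_metric):
    """Unweighted mean of the per-metric win rates (an assumed aggregation)."""
    rates = [win_rate(outcomes) for outcomes in outcomes_by_metric.values()]
    return sum(rates) / len(rates)


# Made-up outcomes for one method against one competitor.
SAMPLE = {
    "comprehensiveness": ["win", "win", "tie", "loss", "win", "tie"],
    "relevance": ["tie", "tie", "tie", "tie", "tie", "tie"],
}

PER_METRIC = {metric: win_rate(outcomes) for metric, outcomes in SAMPLE.items()}
OVERALL = overall_win_rate(SAMPLE)
print(PER_METRIC, OVERALL)
```

Because a tie scores 0.5 for both sides, the two methods' win rates in a pairwise comparison always sum to 1, so a rate above 50% means the method came out ahead.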
Figure 3. Win rates of four LazyGraphRAG (LGR) configurations across methods, broken down by the AutoQ query class and averaged across AutoE’s four metrics: comprehensiveness, diversity, empowerment, and relevance. LazyGraphRAG outperforms comparison conditions where the bar is above 50%.
The four LazyGraphRAG conditions (LGR_b200_c200, LGR_b50_c200, LGR_b50_c600, LGR_b200_c200_mini) differ by query budget (b50, b200) and chunk size (c200, c600). All used GPT-4o mini for relevance tests and GPT-4o for query expansion (to five subqueries) and answer generation, except for LGR_b200_c200_mini, which used GPT-4o mini throughout.
Comparison systems were GraphRAG (Local, Global, and Drift Search), Vector RAG with 8k- and 120k-token windows, and three published methods: LightRAG, RAPTOR, and TREX. All methods were limited to the same 8k tokens for answer generation. GraphRAG Global Search used level 2 of the community hierarchy.
LazyGraphRAG outperformed every comparison condition using the same generative model (GPT-4o), winning all 96 comparisons, with all but one reaching statistical significance. The best overall performance came from the larger budget, smaller chunk size configuration (LGR_b200_c200). For DataLocal queries, the smaller budget (LGR_b50_c200) performed slightly better, likely because fewer chunks were relevant. For ActivityLocal queries, the larger chunk size (LGR_b50_c600) had a slight edge, likely because longer chunks provide a more coherent context.
Competing methods performed relatively better on the query classes for which they were designed: GraphRAG Global for global queries and Vector RAG for local queries. GraphRAG Drift Search, which combines both strategies, posed the strongest challenge overall.
Increasing Vector RAG’s context window from 8k to 120k tokens did not improve its performance compared to LazyGraphRAG. This raised the question of how LazyGraphRAG would perform against Vector RAG with a 1-million-token context window containing most of the dataset.
Figure 4 shows the follow-up experiment comparing LazyGraphRAG to Vector RAG using GPT-4.1 that enabled this comparison. Even against the 1M-token window, LazyGraphRAG achieved higher win rates across all comparisons, failing to reach significance only for the relevance of answers to DataLocal queries. These queries tend to benefit most from Vector RAG’s ranking of directly relevant chunks, making it hard for LazyGraphRAG to generate answers that have greater relevance to the query, even though these answers may be dramatically more comprehensive, diverse, and empowering overall.
Figure 4. Win rates of LazyGraphRAG (LGR) over Vector RAG across different context window sizes, broken down by the four AutoQ query classes and four AutoE metrics: comprehensiveness, diversity, empowerment, and relevance. Bars above 50% indicate that LazyGraphRAG outperformed the comparison condition.
AutoD: Automated data sampling and summarization
Text datasets have an underlying topical structure, but the depth, breadth, and connectivity of that structure can vary widely. This variability makes it difficult to evaluate RAG systems consistently, as results may reflect the idiosyncrasies of the dataset rather than the system’s general capabilities.
The AutoD component addresses this by sampling datasets to meet a target specification, defined by the number of topic clusters (breadth) and the number of samples per cluster (depth). This creates consistency across datasets, enabling more meaningful comparisons, as structurally aligned datasets lead to comparable AutoQ queries, which in turn support consistent AutoE evaluations.
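One way to picture sampling to a breadth-and-depth specification is the sketch below. It is not AutoD's actual algorithm: the `sample_to_spec` helper is hypothetical, topic-cluster assignments are assumed to be computed already, and picking eligible clusters uniformly at random is an assumption made to keep the example short.

```python
import random
from collections import defaultdict


def sample_to_spec(doc_clusters, breadth, depth, seed=0):
    """Pick `breadth` topic clusters and `depth` documents from each.

    doc_clusters maps a document id to its (precomputed) cluster id.
    Returns a dict of cluster id -> sorted list of sampled document ids.
    """
    by_cluster = defaultdict(list)
    for doc_id, cluster_id in doc_clusters.items():
        by_cluster[cluster_id].append(doc_id)

    # Only clusters with enough documents can satisfy the depth target.
    eligible = sorted(c for c, docs in by_cluster.items() if len(docs) >= depth)
    if len(eligible) < breadth:
        raise ValueError("not enough clusters with the requested depth")

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    chosen = rng.sample(eligible, breadth)
    return {c: sorted(rng.sample(sorted(by_cluster[c]), depth)) for c in chosen}


# Made-up cluster assignments: three topics of different sizes.
DOC_CLUSTERS = {
    "d1": "vaccines", "d2": "vaccines", "d3": "vaccines", "d4": "vaccines",
    "d5": "insurance", "d6": "insurance", "d7": "insurance",
    "d8": "nutrition",
}

SAMPLE = sample_to_spec(DOC_CLUSTERS, breadth=2, depth=2, seed=0)
print(SAMPLE)
```

Two datasets sampled to the same breadth and depth then share the same coarse topical shape, which is what makes queries and evaluations built on them comparable.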
AutoD also includes tools for summarizing input or output datasets in a way that reflects their topical coverage. These summaries play an important role in the AutoQ query synthesis process, but they can also be used more broadly, such as in prompts where context space is limited.
Since the release of the GraphRAG paper, we’ve received many requests to share the dataset of the Behind the Tech podcast transcripts we used in our evaluation. An updated version of this dataset is now available in the BenchmarkQED repository, alongside the AP News dataset containing 1,397 health-related articles, licensed for open release.
We hope these datasets, together with the BenchmarkQED tools, help accelerate benchmark-driven development of RAG systems and AI question-answering. We invite the community to try them on GitHub.

Microsoft Academic @MicrosoftAcademic shared a link
2025-05-29 16:03:36 ·

What AI’s impact on individuals means for the health workforce and industry

Transcript    
PETER LEE: “In American primary care, the missing workforce is stunning in magnitude, the shortfall estimated to reach up to 48,000 doctors within the next dozen years. China and other countries with aging populations can expect drastic shortfalls, as well. Just last month, I asked a respected colleague retiring from primary care who he would recommend as a replacement; he told me bluntly that, other than expensive concierge care practices, he could not think of anyone, even for himself. This mismatch between need and supply will only grow, and the US is far from alone among developed countries in facing it.”     
This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.  
Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?   
In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here.
The book passage I read at the top is from “Chapter 4: Trust but Verify,” which was written by Zak.
You know, it’s no secret that in the US and elsewhere shortages in medical staff and the rise of clinician burnout are affecting the quality of patient care for the worse. In our book, we predicted that generative AI would be something that might help address these issues.
So in this episode, we’ll delve into how individual performance gains that our previous guests have described might affect the healthcare workforce as a whole, and on the patient side, we’ll look into the influence of generative AI on the consumerization of healthcare. Now, since all of this consumes such a huge fraction of the overall economy, we’ll also get into what a general-purpose technology as disruptive as generative AI might mean in the context of labor markets and beyond.
To help us do that, I’m pleased to welcome Ethan Mollick and Azeem Azhar.
Ethan Mollick is the Ralph J. Roberts Distinguished Faculty Scholar, a Rowan Fellow, and an associate professor at the Wharton School of the University of Pennsylvania. His research into the effects of AI on work, entrepreneurship, and education is applied by organizations around the world, leading him to be named one of Time magazine’s most influential people in AI for 2024. He’s also the author of the New York Times best-selling book Co-Intelligence.
Azeem Azhar is an author, founder, investor, and one of the most thoughtful and influential voices on the interplay between disruptive emerging technologies and business and society. In his best-selling book, The Exponential Age, and in his highly regarded newsletter and podcast, Exponential View, he explores how technologies like AI are reshaping everything from healthcare to geopolitics.
Ethan and Azeem are two leading thinkers on the ways that disruptive technologies, and especially AI, affect our work, our jobs, our business enterprises, and whole industries. As economists, they are trying to work out whether we are in the midst of an economic revolution as profound as the shift from an agrarian to an industrial society.
Here is my interview with Ethan Mollick:
LEE: Ethan, welcome.
ETHAN MOLLICK: So happy to be here, thank you.
LEE: I described you as a professor at Wharton, which I think most of the people who listen to this podcast series know of as an elite business school. So it might surprise some people that you study AI. And beyond that, you know, that I would seek you out to talk about AI in medicine.
So to get started, how and why did it happen that you’ve become one of the leading experts on AI?
MOLLICK: It’s actually an interesting story. I’ve been AI-adjacent my whole career. When I was getting my PhD at MIT, I worked with Marvin Minsky and the MIT Media Lab’s AI group. But I was never the technical AI guy. I was the person who was trying to explain AI to everybody else who didn’t understand it.
And then I became very interested in, how do you train and teach? And AI was always a part of that. I was building games for teaching, teaching tools that were used in hospitals and elsewhere, simulations. So when LLMs burst into the scene, I had already been using them and had a good sense of what they could do. And between that and, kind of, being practically oriented and getting some of the first research projects underway, especially under education and AI and performance, I became sort of a go-to person in the field.
And once you’re in a field where nobody knows what’s going on and we’re all making it up as we go along … I thought it’s funny that you led with the idea that you have a couple of months head start for GPT-4, right. Like that’s all we have at this point, is a few months’ head start.
So being a few months ahead is good enough to be an expert at this point. Whether it should be or not is a different question.
LEE: Well, if I understand correctly, leading AI companies like OpenAI, Anthropic, and others have now sought you out as someone who should get early access to really start to do early assessments and gauge early reactions. How has that been?
MOLLICK: So, I mean, I think the bigger picture is less about me than about two things that tells us about the state of AI right now.
One, nobody really knows what’s going on, right. So in a lot of ways, if it wasn’t for your work, Peter, like, I don’t think people would be thinking about medicine as much because these systems weren’t built for medicine. They weren’t built to change education. They weren’t built to write memos. They, like, they weren’t built to do any of these things. They weren’t really built to do anything in particular. It turns out they’re just good at many things.
And to the extent that the labs work on them, they care about their coding ability above everything else and maybe math and science secondarily. They don’t think about the fact that it expresses high empathy. They don’t think about its accuracy and diagnosis or where it’s inaccurate. They don’t think about how it’s changing education forever.
So one part of this is the fact that they go to my Twitter feed or ask me for advice is an indicator of where they are, too, which is they’re not thinking about this. And the fact that a few months’ head start continues to give you a lead tells you that we are at the very cutting edge. These labs aren’t sitting on projects for two years and then releasing them. Months after a project is complete or sooner, it’s out the door. Like, there’s very little delay. So we’re kind of all in the same boat here, which is a very unusual space for a new technology.
LEE: And I, you know, explained that you’re at Wharton. Are you an odd fit as a faculty member at Wharton, or is this a trend now even in business schools that AI experts are becoming key members of the faculty?
MOLLICK: I mean, it’s a little of both, right. It’s faculty, so everybody does everything. I’m a professor of innovation-entrepreneurship. I’ve launched startups before and working on that and education means I think about, how do organizations redesign themselves? How do they take advantage of these kinds of problems? So medicine’s always been very central to that, right. A lot of people in my MBA class have been MDs either switching, you know, careers or else looking to advance from being sort of individual contributors to running teams. So I don’t think that’s that bad a fit. But I also think this is general-purpose technology; it’s going to touch everything. The focus on this is medicine, but Microsoft does far more than medicine, right. It’s … there’s transformation happening in literally every field, in every country. This is a widespread effect.
So I don’t think we should be surprised that business schools matter on this because we care about management. There’s a long tradition of management and medicine going together. There’s actually a great academic paper that shows that teaching hospitals that also have MBA programs associated with them have higher management scores and perform better. So I think that these are not as foreign concepts, especially as medicine continues to get more complicated.
LEE: Yeah. Well, in fact, I want to dive a little deeper on these issues of management, of entrepreneurship, um, education. But before doing that, if I could just stay focused on you. There is always something interesting to hear from people about their first encounters with AI. And throughout this entire series, I’ve been doing that both pre-generative AI and post-generative AI. So you, sort of, hinted at the pre-generative AI. You were in Minsky’s lab. Can you say a little bit more about that early encounter? And then tell us about your first encounters with generative AI.
MOLLICK: Yeah. Those are great questions. So first of all, when I was at the media lab, that was pre-the current boom in sort of, you know, even in the old-school machine learning kind of space. So there was a lot of potential directions to head in. While I was there, there were projects underway, for example, to record every interaction small children had. One of the professors was recording everything their baby interacted with in the hope that maybe that would give them a hint about how to build an AI system.
There was a bunch of projects underway that were about labeling every concept and how they relate to other concepts. So, like, it was very much Wild West of, like, how do we make an AI work—which has been this repeated problem in AI, which is, what is this thing?
The fact that it was just like brute force over the corpus of all human knowledge turns out to be a little bit of like a, you know, it’s a miracle and a little bit of a disappointment in some ways compared to how elaborate some of this was. So, you know, I think that, that was sort of my first encounters in sort of the intellectual way.
The generative AI encounters actually started with the original, sort of, GPT-3, or, you know, earlier versions. And it was actually game-based. So I played games like AI Dungeon. And as an educator, I realized, oh my gosh, this stuff could write essays at a fourth-grade level. That’s really going to change the way, like, middle school works, was my thinking at the time. And I was posting about that back in, you know, 2021 that this is a big deal. But I think everybody was taken by surprise, including the AI companies themselves, by, you know, ChatGPT, by GPT-3.5. The difference in degree turned out to be a difference in kind.
LEE: Yeah, you know, if I think back, even with GPT-3, and certainly this was the case with GPT-2, it was, at least, you know, from where I was sitting, it was hard to get people to really take this seriously and pay attention.
MOLLICK: Yes.
LEE: You know, it’s remarkable. Within Microsoft, I think a turning point was the use of GPT-3 to do code completions. And that was actually productized as GitHub Copilot, the very first version. That, I think, is where there was widespread belief. But, you know, in a way, I think there is, even for me early on, a sense of denial and skepticism. Did you have those initially at any point?
MOLLICK: Yeah, I mean, it still happens today, right. Like, this is a weird technology. You know, the original denial and skepticism was, I couldn’t see where this was going. It didn’t seem like a miracle because, you know, of course computers can complete code for you. Like, what else are they supposed to do? Of course, computers can give you answers to questions and write fun things. So there’s difference of moving into a world of generative AI. I think a lot of people just thought that’s what computers could do. So it made the conversations a little weird. But even today, faced with these, you know, with very strong reasoner models that operate at the level of PhD students, I think a lot of people have issues with it, right.
I mean, first of all, they seem intuitive to use, but they’re not always intuitive to use because the first use case that everyone puts AI to, it fails at because they use it like Google or some other use case. And then it’s genuinely upsetting in a lot of ways. I think, you know, I write in my book about the idea of three sleepless nights. That hasn’t changed. Like, you have to have an intellectual crisis to some extent, you know, and I think people do a lot to avoid having that existential angst of like, “Oh my god, what does it mean that a machine could think—apparently think—like a person?”
So, I mean, I see resistance now. I saw resistance then. And then on top of all of that, there’s the fact that the curve of the technology is quite great. I mean, the price of GPT-4 level intelligence from, you know, when it was released has dropped 99.97% at this point, right.
LEE: Yes. Mm-hmm.
MOLLICK: I mean, I could run a GPT-4 class system basically on my phone. Microsoft’s releasing things that can almost run on like, you know, like it fits in almost no space, that are almost as good as the original GPT-4 models. I mean, I don’t think people have a sense of how fast the trajectory is moving either.
LEE: Yeah, you know, there’s something that I think about often. There is this existential dread, or will this technology replace me? But I think the first people to feel that are researchers—people encountering this for the first time. You know, if you were working, let’s say, in Bayesian reasoning or in traditional, let’s say, Gaussian mixture model based, you know, speech recognition, you do get this feeling, Oh, my god, this technology has just solved the problem that I’ve dedicated my life to. And there is this really difficult period where you have to cope with that. And I think this is going to be spreading, you know, in more and more walks of life. And so this … at what point does that sort of sense of dread hit you, if ever?
MOLLICK: I mean, you know, it’s not even dread as much as like, you know, Tyler Cowen wrote that it’s impossible to not feel a little bit of sadness as you use these AI systems, too. Because, like, I was talking to a friend, just as the most minor example, and his talent that he was very proud of was he was very good at writing limericks for birthday cards. He’d write these limericks. Everyone was always amused by them.
And now, you know, GPT-4 and GPT-4.5, they made limericks obsolete. Like, anyone can write a good limerick, right. So this was a talent, and it was a little sad. Like, this thing that you cared about mattered.
You know, as academics, we’re a little used to dead ends, right, and like, you know, some getting the lap. But the idea that entire fields are hitting that way. Like in medicine, there’s a lot of support systems that are now obsolete. And the question is how quickly you change that. In education, a lot of our techniques are obsolete.
What do you do to change that? You know, it’s like the fact that this brute force technology is good enough to solve so many problems is weird, right. And it’s not just the end of, you know, of our research angles that matter, too. Like, for example, I ran this, you know, 14-person-plus, multimillion-dollar effort at Wharton to build these teaching simulations, and we’re very proud of them. It took years of work to build one.
Now we’ve built a system that can build teaching simulations on demand by you talking to it with one team member. And, you know, you literally can create any simulation by having a discussion with the AI. I mean, you know, there’s a switch to a new form of excitement, but there is a little bit of like, this mattered to me, and, you know, now I have to change how I do things. I mean, adjustment happens. But if you haven’t had that displacement, I think that’s a good indicator that you haven’t really faced AI yet.
LEE: Yeah, what’s so interesting just listening to you is you use words like sadness, and yet I can see the—and hear the—excitement in your voice and your body language. So, you know, that’s also kind of an interesting aspect of all of this.
MOLLICK: Yeah, I mean, I think there’s something on the other side, right. But, like, I can’t say that I haven’t had moments where like, ughhhh, but then there’s joy and basically like also, you know, freeing stuff up. I mean, I think about doctors or professors, right. These are jobs that bundle together lots of different tasks that you would never have put together, right. If you’re a doctor, you would never have expected the same person to be good at keeping up with the research and being a good diagnostician and being a good manager and being good with people and being good with hand skills.
Like, who would ever want that kind of bundle? That’s not something you’re all good at, right. And a lot of our stress of our job comes from the fact that we suck at some of it. And so to the extent that AI steps in for that, you kind of feel bad about some of the stuff that it’s doing that you wanted to do. But it’s much more uplifting to be like, I don’t have to do this stuff I’m bad at anymore, or I get the support to make myself good at it. And the stuff that I really care about, I can focus on more. Well, because we are at kind of a unique moment where whatever you’re best at, you’re still better than AI. And I think it’s an ongoing question about how long that lasts. But for right now, like you’re not going to say, OK, AI replaces me entirely in my job in medicine. It’s very unlikely.
But you will say it replaces these 17 things I’m bad at, but I never liked that anyway. So it’s a period of both excitement and a little anxiety.
LEE: Yeah, I’m going to want to get back to this question about in what ways AI may or may not replace doctors or some of what doctors and nurses and other clinicians do. But before that, let’s get into, I think, the real meat of this conversation. In previous episodes of this podcast, we talked to clinicians and healthcare administrators and technology developers that are very rapidly injecting AI today to do various forms of workforce automation, you know, automatically writing a clinical encounter note, automatically filling out a referral letter or request for prior authorization for some reimbursement to an insurance company.
And so these sorts of things are intended not only to make things more efficient and lower costs but also to reduce various forms of drudgery, cognitive burden on frontline health workers. So how do you think about the impact of AI on that aspect of workforce, and, you know, what would you expect will happen over the next few years in terms of impact on efficiency and costs?
MOLLICK: So I mean, this is a case where I think we’re facing the big bright problem in AI in a lot of ways, which is that this is … at the individual level, there’s lots of performance gains to be gained, right. The problem, though, is that we as individuals fit into systems, in medicine as much as anywhere else or more so, right. Which is that you could individually boost your performance, but it’s also about systems that fit along with this, right.
So, you know, if you could automatically, you know, record an encounter, if you could automatically make notes, does that change what you should be expecting for notes or the value of those notes or what they’re for? How do we take what one person does and validate it across the organization and roll it out for everybody without making it a 10-year process that it feels like IT in medicine often is? Like, so we’re in this really interesting period where there’s incredible amounts of individual innovation in productivity and performance improvements in this field, like very high levels of it, but not necessarily seeing that same thing translate to organizational efficiency or gains.
And one of my big concerns is seeing that happen. We’re seeing that in nonmedical problems, the same kind of thing, which is, you know, we’ve got research showing 20 and 40% performance improvements, like not uncommon to see those things. But then the organization doesn’t capture it; the system doesn’t capture it. Because the individuals are doing their own work and the systems don’t have the ability to, kind of, learn or adapt as a result.
LEE: You know, where are those productivity gains going, then, when you get to the organizational level?
MOLLICK: Well, they’re dying for a few reasons. One is, there’s a tendency for individual contributors to underestimate the power of management, right.
Practices associated with good management increase happiness, decrease, you know, issues, increase success rates. In the same way, about 40%, as far as we can tell, of the US advantage over other companies, of US firms, has to do with management ability. Like, management is a big deal. Organizing is a big deal. Thinking about how you coordinate is a big deal.
At the individual level, when things get stuck there, right, you can’t start bringing them up to how systems work together. It becomes, How do I deal with a doctor that has a 60% performance improvement? We really only have one thing in our playbook for doing that right now, which is, OK, we could fire 40% of the other doctors and still have a performance gain, which is not the answer you want to see happen.
So because of that, people are hiding their use. They’re actually hiding their use for lots of reasons.
And it’s a weird case because the people who are able to figure out best how to use these systems, for a lot of use cases, they’re actually clinicians themselves because they’re experimenting all the time. Like, they have to take those encounter notes. And if they figure out a better way to do it, they figure that out. You don’t want to wait for, you know, a med tech company to figure that out and then sell that back to you when it can be done by the physicians themselves.
So we’re just not used to a period where everybody’s innovating and where the management structure isn’t in place to take advantage of that. And so we’re seeing things stalled at the individual level, and people are often, especially in risk-averse organizations or organizations where there’s lots of regulatory hurdles, people are so afraid of the regulatory piece that they don’t even bother trying to make change.
LEE: If you are, you know, the leader of a hospital or a clinic or a whole health system, how should you approach this? You know, how should you be trying to extract positive success out of AI?
MOLLICK: So I think that you need to embrace the right kind of risk, right. We don’t want to put risk on our patients … like, we don’t want to put uninformed risk. But innovation involves risk to how organizations operate. They involve change. So I think part of this is embracing the idea that R&D has to happen in organizations again.
What’s happened over the last 20 years or so has been organizations giving that up. Partially, that’s a trend to focus on what you’re good at and not try and do this other stuff. Partially, it’s because it’s outsourced now to software companies that, like, Salesforce tells you how to organize your sales team. Workforce tells you how to organize your organization. Consultants come in and will tell you how to make change based on the average of what other people are doing in your field.
So companies and organizations and hospital systems have all started to give up their ability to create their own organizational change. And when I talk to organizations, I often say they have to have two approaches. They have to think about the crowd and the lab.
So the crowd is the idea of how to empower clinicians and administrators and supporter networks to start using AI and experimenting in ethical, legal ways and then sharing that information with each other. And the lab is, how are we doing R&D about the approach of how to get AI to work, not just in direct patient care, right. But also fundamentally, like, what paperwork can you cut out? How can we better explain procedures? Like, what management role can this fill?
And we need to be doing active experimentation on that. We can’t just wait for, you know, Microsoft to solve the problems. It has to be at the level of the organizations themselves.
LEE: So let’s shift a little bit to the patient. You know, one of the things that we see, and I think everyone is seeing, is that people are turning to chatbots, like ChatGPT, actually to seek healthcare information for, you know, their own health or the health of their loved ones.
And there was already, prior to all of this, a trend towards, let’s call it, consumerization of healthcare. So just in the business of healthcare delivery, do you think AI is going to hasten these kinds of trends, or from the consumer’s perspective, what … ?
MOLLICK: I mean, absolutely, right. Like, all the early data that we have suggests that for most common medical problems, you should just consult AI, too, right. In fact, there is a real question to ask: at what point does it become unethical for doctors themselves to not ask for a second opinion from the AI because it’s cheap, right? You could overrule it or whatever you want, but like not asking seems foolish.
I think the two places where there’s a burning almost, you know, moral imperative is … let’s say, you know, I’m in Philadelphia, I’m a professor, I have access to really good healthcare through the Hospital of the University of Pennsylvania system. I know doctors. You know, I’m lucky. I’m well connected. If, you know, something goes wrong, I have friends who I can talk to. I have specialists. I’m, you know, pretty well educated in this space.
But for most people on the planet, they don’t have access to good medical care, they don’t have good health. It feels like it’s absolutely imperative to say when should you use AI and when not. Are there blind spots? What are those things?
And I worry that, like, to me, that would be the crash project I’d be invoking because I’m doing the same thing in education, which is this system is not as good as being in a room with a great teacher who also uses AI to help you, but it’s better than not getting an, you know, to the level of education people get in many cases. Where should we be using it? How do we guide usage in the right way? Because the AI labs aren’t thinking about this. We have to.
So, to me, there is a burning need here to understand this. And I worry that people will say, you know, everything that’s true—AI can hallucinate, AI can be biased. All of these things are absolutely true, but people are going to use it. The early indications are that it is quite useful. And unless we take the active role of saying, here’s when to use it, here’s when not to use it, we don’t have a right to say, don’t use this system. And I think, you know, we have to be exploring that.
LEE: What do people need to understand about AI? And what should schools, universities, and so on be teaching?
MOLLICK: Those are, kind of, two separate questions in a lot of ways. I think a lot of people want to teach AI skills, and I will tell you, as somebody who works in this space a lot, there isn’t like an easy, sort of, AI skill, right. I could teach you prompt engineering in two to three classes, but every indication we have is that for most people under most circumstances, the value of prompting, you know, any one case is probably not that useful.
A lot of the tricks are disappearing because the AI systems are just starting to use them themselves. So asking good questions, being a good manager, being a good thinker tend to be important, but like magic tricks around making, you know, the AI do something because you use the right phrase used to be something that was real but is rapidly disappearing.
So I worry when people say teach AI skills. No one’s been able to articulate to me as somebody who knows AI very well and teaches classes on AI, what those AI skills that everyone should learn are, right.
I mean, there’s value in learning a little bit how the models work. There’s a value in working with these systems. A lot of it’s just hands on keyboard kind of work. But, like, we don’t have an easy slam dunk “this is what you learn in the world of AI” because the systems are getting better, and as they get better, they get less sensitive to these prompting techniques. They get better prompting themselves. They solve problems spontaneously and start being agentic. So it’s a hard problem to ask about, like, what do you train someone on? I think getting people experience in hands-on-keyboards, getting them to … there’s like four things I could teach you about AI, and two of them are already starting to disappear.
But, like, one is be direct. Like, tell the AI exactly what you want. That’s very helpful. Second, provide as much context as possible. That can include things like acting as a doctor, but also all the information you have. The third is give it step-by-step directions—that’s becoming less important. And the fourth is good and bad examples of the kind of output you want. Those four, that’s like, that’s it as far as the research telling you what to do, and the rest is building intuition.
LEE: I’m really impressed that you didn’t give the answer, “Well, everyone should be teaching my book, Co-Intelligence.”
MOLLICK: Oh, no, sorry! Everybody should be teaching my book Co-Intelligence. I apologize.
LEE: It’s good to chuckle about that, but actually, I can’t think of a better book, like, if you were to assign a textbook in any professional education space, I think Co-Intelligence would be number one on my list. Are there other things that you think are essential reading?
MOLLICK: That’s a really good question. I think that a lot of things are evolving very quickly. I happen to, kind of, hit a sweet spot with Co-Intelligence to some degree because I talk about how I used it, and I was, sort of, an advanced user of these systems.
So, like, it’s, sort of, like my Twitter feed, my online newsletter. I’m just trying to, kind of, in some ways, it’s about trying to make people aware of what these systems can do by just showing a lot, right. Rather than picking one thing, and, like, this is a general-purpose technology. Let’s use it for this. And, like, everybody gets a light bulb for a different reason. So more than reading, it is using, you know, and that can be Copilot or whatever your favorite tool is.
But using it. Voice modes help a lot. In terms of readings, I mean, I think that there is a couple of good guides to understanding AI that were originally blog posts. I think Tim Lee has one called Understanding AI, and it had a good overview …
LEE: Yeah, that’s a great one.
MOLLICK: … of that topic that I think explains how transformers work, which can give you some mental sense. I think Karpathy has some really nice videos of use that I would recommend.
Like on the medical side, I think the book that you did, if you’re in medicine, you should read that. I think that that’s very valuable. But like all we can offer are hints in some ways. Like there isn’t … if you’re looking for the instruction manual, I think it can be very frustrating because it’s like you want the best practices and procedures laid out, and we cannot do that, right. That’s not how a system like this works.
LEE: Yeah.
MOLLICK: It’s not a person, but thinking about it like a person can be helpful, right.
LEE: One of the things that has been sort of a fun project for me for the last few years is I have been a founding board member of a new medical school at Kaiser Permanente. And, you know, that medical school curriculum is being formed in this era. But it’s been perplexing to understand, you know, what this means for a medical school curriculum. And maybe even more perplexing for me, at least, is the accrediting bodies, which are extremely important in US medical schools; how accreditors should think about what’s necessary here.
Besides the things that you’ve … the, kind of, four key ideas you mentioned, if you were talking to the board of directors of the LCME accrediting body, what’s the one thing you would want them to really internalize?
MOLLICK: This is both a fast-moving and vital area. This can’t be viewed like a usual change, which, “Let’s see how this works.” Because it’s, like, the things that make medical technologies hard to do, which is like unclear results, limited, you know, expensive use cases where it rolls out slowly. So one or two, you know, advanced medical facilities get access to, you know, proton beams or something else at multi-billion dollars of cost, and that takes a while to diffuse out. That’s not happening here. This is all happening at the same time, all at once. This is now … AI is part of medicine.
I mean, there’s a minor point that I’d make that actually is a really important one, which is large language models, generative AI overall, work incredibly differently than other forms of AI. So the other worry I have with some of these accreditors is they blend together algorithmic forms of AI, which medicine has been trying for long time—decision support, algorithmic methods, like, medicine more so than other places has been thinking about those issues. Generative AI, even though it uses the same underlying techniques, is a completely different beast.
So, like, even just take the most simple thing of algorithmic aversion, which is a well-understood problem in medicine, right. Which is, so you have a tool that could tell you as a radiologist, you know, the chance of this being cancer; you don’t like it, you overrule it, right.
We don’t find algorithmic aversion happening with LLMs in the same way. People actually enjoy using them because it’s more like working with a person. The flaws are different. The approach is different. So you need to both view this as universally applicable today, which makes it urgent, but also as something that is not the same as your other forms of AI, and your AI working group that is thinking about how to solve this problem is not the right group of people here.
LEE: You know, I think the world has been trained because of the magic of web search to view computers as question-answering machines. Ask a question, get an answer.
MOLLICK: Yes. Yes.
LEE: Write a query, get results. And as I have interacted with medical professionals, you can see that medical professionals have that model of a machine in mind. And I think that’s partly, I think psychologically, why hallucination is so alarming. Because you have a mental model of a computer as a machine that has absolutely rock-solid perfect memory recall.
But the thing that was so powerful in Co-Intelligence, and we tried to get at this in our book also, is that’s not the sweet spot. It’s this sort of deeper interaction, more of a collaboration. And I thought your use of the term Co-Intelligence really just even in the title of the book tried to capture this. When I think about education, it seems like that’s the first step, to get past this concept of a machine being just a question-answering machine. Do you have a reaction to that idea?
MOLLICK: I think that’s very powerful. You know, we’ve been trained over so many years at both using computers but also in science fiction, right. Computers are about cold logic, right. They will give you the right answer, but if you ask it what love is, they explode, right. Like that’s the classic way you defeat the evil robot in Star Trek, right. “Love does not compute.” Instead, we have a system that makes mistakes, is warm, beats doctors in empathy in almost every controlled study on the subject, right. Like, absolutely can outwrite you in a sonnet but will absolutely struggle with giving you the right answer every time. And I think our mental models are just broken for this. And I think you’re absolutely right. And that’s part of what I thought your book does get at really well is, like, this is a different thing. It’s also generally applicable. Again, the model in your head should be kind of like a person even though it isn’t, right.
There’s a lot of warnings and caveats to it, but if you start from person, smart person you’re talking to, your mental model will be more accurate than smart machine, even though both are flawed examples, right. So it will make mistakes; it will make errors. The question is, what do you trust it on? What do you not trust it? As you get to know a model, you’ll get to understand, like, I totally don’t trust it for this, but I absolutely trust it for that, right.
LEE: All right. So we’re getting to the end of the time we have together. And so I’d just like to get now into something a little bit more provocative. And I get the question all the time. You know, will AI replace doctors? In medicine and other advanced knowledge work, project out five to 10 years. What do you think happens?
MOLLICK: OK, so first of all, let’s acknowledge systems change much more slowly than individual use. You know, doctors are not individual actors; they’re part of systems, right. So not just the system of a patient who like may or may not want to talk to a machine instead of a person but also legal systems and administrative systems and systems that allocate labor and systems that train people.
So, like, it’s hard to imagine that in five to 10 years medicine being so upended that even if AI was better than doctors at every single thing doctors do, that we’d actually see as radical a change in medicine as you might in other fields. I think you will see faster changes happen in consulting and law and, you know, coding, other spaces than medicine.
But I do think that there is good reason to suspect that AI will outperform people while still having flaws, right. That’s the difference. We’re already seeing that for common medical questions in enough randomized controlled trials that, you know, best doctors beat AI, but the AI beats the mean doctor, right. Like, that’s just something we should acknowledge is happening at this point.
Now, will that work in your specialty? No. Will that work with all the contingent social knowledge that you have in your space? Probably not.
Like, these are vignettes, right. But, like, that’s kind of where things are. So let’s assume, right … you’re asking two questions. One is, how good will AI get?
LEE: Yeah.
MOLLICK: And we don’t know the answer to that question. I will tell you that your colleagues at Microsoft and increasingly the labs, the AI labs themselves, are all saying they think they’ll have a machine smarter than a human at every intellectual task in the next two to three years. If that doesn’t happen, that makes it easier to assume the future, but let’s just assume that that’s the case. I think medicine starts to change with the idea that people feel obligated to use this to help for everything.
Your patients will be using it, and it will be your advisor and helper at the beginning phases, right. And I think that I expect people to be better at empathy. I expect better bedside manner. I expect management tasks to become easier. I think administrative burden might lighten if we handle this the right way or get much worse if we handle it badly. Diagnostic accuracy will increase, right.
And then there’s a set of discovery pieces happening, too, right. One of the core goals of all the AI companies is to accelerate medical research. How does that happen and how does that affect us is a, kind of, unknown question. So I think clinicians are in both the eye of the storm and surrounded by it, right. Like, they can resist AI use for longer than most other fields, but everything around them is going to be affected by it.
LEE: Well, Ethan, this has been really a fantastic conversation. And, you know, I think in contrast to all the other conversations we’ve had, this one gives especially the leaders in healthcare, you know, people actually trying to lead their organizations into the future, whether it’s in education or in delivery, a lot to think about. So I really appreciate you joining.
MOLLICK: Thank you. 
I’m a computing researcher who works with people who are right in the middle of today’s bleeding-edge developments in AI. And because of that, I often lose sight of how to talk to a broader audience about what it’s all about. And so I think one of Ethan’s superpowers is that he has this knack for explaining complex topics in AI in a really accessible way, getting right to the most important points without making it so simple as to be useless. That’s why I rarely miss an opportunity to read up on his latest work.
One of the first things I learned from Ethan is the intuition that you can, sort of, think of AI as a very knowledgeable intern. In other words, think of it as a persona that you can interact with, but you also need to be a manager for it and to always assess the work that it does.
In our discussion, Ethan went further to stress that there is, because of that, a serious education gap. You know, over the last decade or two, we’ve all been trained, mainly by search engines, to think of computers as question-answering machines. In medicine, in fact, there’s a question-answering application that is really popular called UpToDate. Doctors use it all the time. But generative AI systems like ChatGPT are different. There’s therefore a challenge in how to break out of the old-fashioned mindset of search to get the full value out of generative AI.
The other big takeaway for me was that Ethan pointed out while it’s easy to see productivity gains from AI at the individual level, those same gains, at least today, don’t often translate automatically to organization-wide or system-wide gains. And one, of course, has to conclude that it takes more than just making individuals more productive; the whole system also has to adjust to the realities of AI.
Here’s now my interview with Azeem Azhar:
LEE: Azeem, welcome.
AZEEM AZHAR: Peter, thank you so much for having me.
LEE: You know, I think you’re extremely well known in the world. But still, some of the listeners of this podcast series might not have encountered you before.
And so one of the ways I like to ask people to introduce themselves is, how do you explain to your parents what you do every day?
AZHAR: Well, I’m very lucky in that way because my mother was the person who got me into computers more than 40 years ago. And I still have that first computer, a ZX81 with a Z80 chip …
LEE: Oh wow.
AZHAR: … to this day. It sits in my study, all seven and a half thousand transistors and Bakelite plastic that it is. And my parents were both economists, and economics is deeply connected with technology in some sense. And I grew up in the late ’70s and the early ’80s. And that was a time of tremendous optimism around technology. It was space opera, science fiction, robots, and of course, the personal computer and, you know, Bill Gates and Steve Jobs. So that’s where I started.
And so, in a way, my mother and my dad, who passed away a few years ago, had always known me as someone who was fiddling with computers but also thinking about economics and society. And so, in a way, it’s easier to explain to them because they’re the ones who nurtured the environment that allowed me to research technology and AI and think about what it means to firms and to the economy at large.
LEE: I always like to understand the origin story. And what I mean by that is, you know, what was your first encounter with generative AI? And what was that like? What did you go through?
AZHAR: The first real moment was when Midjourney and Stable Diffusion emerged in that summer of 2022. I’d been away on vacation, and I came back—and I’d been off grid, in fact—and the world had really changed.
Now, I’d been aware of GPT-3 and GPT-2, which I played around with and with BERT, the original transformer paper about seven or eight years ago, but it was the moment where I could talk to my computer, and it could produce these images, and it could be refined in natural language that really made me think we’ve crossed into a new domain. We’ve gone from AI being highly discriminative to AI that’s able to explore the world in particular ways. And then it was a few months later that ChatGPT came out—November, the 30th.
And I think it was the next day or the day after that I said to my team, everyone has to use this, and we have to meet every morning and discuss how we experimented the day before. And we did that for three or four months. And, you know, it was really clear to me in that interface at that point that, you know, we’d absolutely pass some kind of threshold.
LEE: And who’s the we that you were experimenting with?
AZHAR: So I have a team of four who support me. They’re mostly researchers of different types. I mean, it’s almost like one of those jokes. You know, I have a sociologist, an economist, and an astrophysicist. And, you know, they walk into the bar, or they walk into our virtual team room, and we try to solve problems.
LEE: Well, so let’s get now into brass tacks here. And I think I want to start maybe just with an exploration of the economics of all this and economic realities. Because I think in a lot of your work—for example, in your book—you look pretty deeply at how automation generally and AI specifically are transforming certain sectors like finance, manufacturing, and you have a really, kind of, insightful focus on what this means for productivity and which ways, you know, efficiencies are found.
And then you, sort of, balance that with risks, things that can and do go wrong. And so as you take that background and looking at all those other sectors, in what ways are the same patterns playing out or likely to play out in healthcare and medicine?
AZHAR: I’m sure we will see really remarkable parallels but also new things going on. I mean, medicine has a particular quality compared to other sectors in the sense that it’s highly regulated, market structure is very different country to country, and it’s an incredibly broad field. I mean, just think about taking a Tylenol and going through laparoscopic surgery. Having an MRI and seeing a physio. I mean, this is all medicine. I mean, it’s hard to imagine a sector that is more broad than that.
So I think we can start to break it down, and, you know, where we’re seeing things with generative AI will be the, sort of, softest entry point, which is the medical scribing. And I’m sure many of us have been with clinicians who have a medical scribe running alongside. They’re all on Surface Pros, I noticed, right? They’re on the tablet computers, and they’re scribing away.
And what that’s doing is, in the words of my friend Eric Topol, it’s giving the clinician time back, right. They have time back from days that are extremely busy and, you know, full of administrative overload. So I think you can obviously do a great deal with reducing that overload.
And within my team, we have a view, which is if you do something five times in a week, you should be writing an automation for it. And if you’re a doctor, you’re probably reviewing your notes, writing the prescriptions, and so on several times a day. So those are things that can clearly be automated, and the human can be in the loop. But I think there are so many other ways just within the clinic that things can help.
So, one of my friends, my friend from my junior school—I’ve known him since I was 9—is an oncologist who’s also deeply into machine learning, and he’s in Cambridge in the UK. And he built with Microsoft Research a suite of imaging AI tools from his own discipline, which they then open sourced.
So that’s another way that you have an impact, which is that you actually enable the, you know, generalist, specialist, polymath, whatever they are in health systems to be able to get this technology, to tune it to their requirements, to use it, to encourage some grassroots adoption in a system that’s often been very, very heavily centralized.
LEE: Yeah.
AZHAR: And then I think there are some other things that are going on that I find really, really exciting. So one is the consumerization of healthcare. So I have one of those sleep tracking rings, the Oura.
LEE: Yup.
AZHAR: That is building a data stream that we’ll be able to apply more and more AI to. I mean, right now, it’s applying traditional, I suspect, machine learning, but you can imagine that as we start to get more data, we start to get more used to measuring ourselves, we create this sort of pot, a personal asset that we can turn AI to.
And there’s still another category. And that other category is one of the completely novel ways in which we can enable patient care and patient pathway. And there’s a fantastic startup in the UK called Neko Health, which, I mean, does physicals, MRI scans, and blood tests, and so on.
It’s hard to imagine Neko existing without the sort of advanced data, machine learning, AI that we’ve seen emerge over the last decade. So, I mean, I think that there are so many ways in which the temperature is slowly being turned up to encourage a phase change within the healthcare sector.
And last but not least, I do think that these tools can also be very, very supportive of a clinician’s life cycle. I think we, as patients, we’re a bit … I don’t know if we’re as grateful as we should be for our clinicians who are putting in 90-hour weeks. But you can imagine a world where AI is able to support not just the clinicians’ workload but also their sense of stress, their sense of burnout.
So just in those five areas, Peter, I sort of imagine we could start to fundamentally transform over the course of many years, of course, the way in which people think about their health and their interactions with healthcare systems.
LEE: I love how you break that down. And I want to press on a couple of things.
You also touched on the fact that medicine is, at least in most of the world, is a highly regulated industry. I guess finance is the same way, but they also feel different because the, like, finance sector has to be very responsive to consumers, and consumers are sensitive to, you know, an abundance of choice; they are sensitive to price. Is there something unique about medicine besides being regulated?
AZHAR: I mean, there absolutely is. And in finance, as well, you have much clearer end states. So if you’re not in the consumer space, but you’re in the, you know, asset management space, you have to essentially deliver returns against the volatility or risk boundary, right. That’s what you have to go out and do. And I think if you’re in the consumer industry, you can come back to very, very clear measures, net promoter score being a very good example.
In the case of medicine and healthcare, it is much more complicated because as far as the clinician is concerned, people are individuals, and we have our own parts and our own responses. If we didn’t, there would never be a need for a differential diagnosis. There’d never be a need for, you know, Let’s try azithromycin first, and then if that doesn’t work, we’ll go to vancomycin, or, you know, whatever it happens to be. You would just know. But ultimately, you know, people are quite different. The symptoms that they’re showing are quite different, and also their compliance is really, really different.
I had a back problem that had to be dealt with by, you know, a physio and extremely boring exercises four times a week, but I was ruthless in complying, and my physio was incredibly surprised. He’d say well no one ever does this, and I said, well you know the thing is that I kind of just want to get this thing to go away.
LEE: Yeah.
AZHAR: And I think that that’s why medicine is and healthcare is so different and more complex. But I also think that’s why AI can be really, really helpful. I mean, we didn’t talk about, you know, AI in its ability to potentially do this, which is to extend the clinician’s presence throughout the week.
LEE: Right. Yeah.
AZHAR: The idea that maybe some part of what the clinician would do if you could talk to them on Wednesday, Thursday, and Friday could be delivered through an app or a chatbot just as a way of encouraging the compliance, which is often, especially with older patients, one reason why conditions, you know, linger on for longer.
LEE: You know, just staying on the regulatory thing, as I’ve thought about this, the one regulated sector that I think seems to have some parallels to healthcare is energy delivery, energy distribution.
Because like healthcare, as a consumer, I don’t have choice in who delivers electricity to my house. And even though I care about it being cheap or at least not being overcharged, I don’t have an abundance of choice. I can’t do price comparisons.
And there’s something about that, just speaking as a consumer of both energy and a consumer of healthcare, that feels similar. Whereas other regulated industries, you know, somehow, as a consumer, I feel like I have a lot more direct influence and power. Does that make any sense to someone, you know, like you, who’s really much more expert in how economic systems work?
AZHAR: I mean, in a sense, one part of that is very, very true. You have a limited panel of energy providers you can go to, and in the US, there may be places where you have no choice.
I think the area where it’s slightly different is that as a consumer or a patient, you can actually make meaningful choices and changes yourself using these technologies, and people used to joke about you know asking Dr. Google. But Dr. Google is not terrible, particularly if you go to WebMD. And, you know, when I look at long-range change, many of the regulations that exist around healthcare delivery were formed at a point before people had access to good quality information at the touch of their fingertips or when educational levels in general were much, much lower. And many regulations existed because of the incumbent power of particular professional sectors.
I’ll give you an example from the United Kingdom. So I have had asthma all of my life. That means I’ve been taking my inhaler, Ventolin, and maybe a steroid inhaler for nearly 50 years. That means that I know … actually, I’ve got more experience, and I—in some sense—know more about it than a general practitioner.
LEE: Yeah.
AZHAR: And until a few years ago, I would have to go to a general practitioner to get this drug that I’ve been taking for five decades, and there they are, age 30 or whatever it is. And a few years ago, the regulations changed. And now pharmacies can … or pharmacists can prescribe those types of drugs under certain conditions directly.
LEE: Right.
AZHAR: That was not to do with technology. That was to do with incumbent lock-in. So when we look at the medical industry, the healthcare space, there are some parallels with energy, but there are a few little things that the ability that the consumer has to put in some effort to learn about their condition, but also the fact that some of the regulations that exist just exist because certain professions are powerful.
LEE: Yeah, one last question while we’re still on economics. There seems to be a conundrum about productivity and efficiency in healthcare delivery because I’ve never encountered a doctor or a nurse that wants to be able to handle even more patients than they’re doing on a daily basis.
And so, you know, if productivity means simply, well, your rounds can now handle 16 patients instead of eight patients, that doesn’t seem necessarily to be a desirable thing. So how can we or should we be thinking about efficiency and productivity since obviously costs are, in most of the developed world, are a huge, huge problem?
AZHAR: Yes, and when you described doubling the number of patients on the round, I imagined you buying them all roller skates so they could just whizz around the hospital faster and faster than ever before.
We can learn from what happened with the introduction of electricity. Electricity emerged at the end of the 19th century, around the same time that cars were emerging as a product, and car makers were very small and very artisanal. And in the early 1900s, some really smart car makers figured out that electricity was going to be important. And they bought into this technology by putting pendant lights in their workshops so they could “visit more patients.” Right?
LEE: Yeah, yeah.
AZHAR: They could effectively spend more hours working, and that was a productivity enhancement, and it was noticeable. But, of course, electricity fundamentally changed the productivity by orders of magnitude of people who made cars starting with Henry Ford because he was able to reorganize his factories around the electrical delivery of power and to therefore have the moving assembly line, which 10xed the productivity of that system.
So when we think about how AI will affect the clinician, the nurse, the doctor, it’s much easier for us to imagine it as the pendant light that just has them working later …
LEE: Right.
AZHAR: … than it is to imagine a reconceptualization of the relationship between the clinician and the people they care for.
And I’m not sure. I don’t think anybody knows what that looks like. But, you know, I do think that there will be a way that this changes, and you can see that scale out factor. And it may be, Peter, that what we end up doing is we end up saying, OK, because we have these brilliant AIs, there’s a lower level of training and cost and expense that’s required for a broader range of conditions that need treating. And that expands the market, right. That expands the market hugely. It’s what has happened in the market for taxis or ride sharing. The introduction of Uber and the GPS system …
LEE: Yup.
AZHAR: … has meant many more people now earn their living driving people around in their cars. And at least in London, you had to be reasonably highly trained to do that.
So I can see a reorganization is possible. Of course, entrenched interests, the economic flow … and there are many entrenched interests, particularly in the US between the health systems and the, you know, professional bodies that might slow things down. But I think a reimagining is possible.
And if I may, I’ll give you one example of that, which is, if you go to countries outside of the US where there are many more sick people per doctor, they have incentives to change the way they deliver their healthcare. And well before there was AI of this quality around, there were a few cases of health systems in India: Aravind Eye Care was one, and Narayana Hrudayalaya was another. And the latter was a cardiac care unit where you couldn’t get enough heart surgeons.
LEE: Yeah, yep.
AZHAR: So specially trained nurses would operate under the supervision of a single surgeon who would supervise many in parallel. So there are ways of increasing the quality of care, reducing the cost, but it does require a systems change. And we can’t expect a single bright algorithm to do it on its own.
LEE: Yeah, really, really interesting. So now let’s get into regulation. And let me start with this question. You know, there are several startup companies I’m aware of that are pushing on, I think, a near-term future possibility that a medical AI for consumer might be allowed, say, to prescribe a medication for you, something that would normally require a doctor or a pharmacist, you know, that is certified in some way, licensed to do. Do you think we’ll get to a point where for certain regulated activities, humans are more or less cut out of the loop?
AZHAR: Well, humans would have been in the loop because they would have provided the training data, they would have done the oversight, the quality control. But to your question in general, would we delegate an important decision entirely to a tested set of algorithms? I’m sure we will. We already do that. I delegate less important decisions like, What time should I leave for the airport to Waze. I delegate more important decisions to the automated braking in my car. We will do this at certain levels of risk and threshold.
If I come back to my example of prescribing Ventolin. It’s really unclear to me that the prescription of Ventolin, this incredibly benign bronchodilator that is only used by people who’ve been through the asthma process, needs to be prescribed by someone who’s gone through 10 years or 12 years of medical training. And why that couldn’t be prescribed by an algorithm or an AI system.
LEE: Right. Yep. Yep.
AZHAR: So, you know, I absolutely think that that will be the case and could be the case. I can’t really see what the objections are. And the real issue is where do you draw the line of where you say, “Listen, this is too important,” or “The cost is too great,” or “The side effects are too high,” and therefore this is a point at which we want to have some, you know, human taking personal responsibility, having a liability framework in place, having a sense that there is a person with legal agency who signed off on this decision. And that line I suspect will start fairly low, and what we’d expect to see would be that that would rise progressively over time.
LEE: What you just said, that scenario of your personal asthma medication, is really interesting because your personal AI might have the benefit of 50 years of your own experience with that medication. So, in a way, there is at least the data potential for, let’s say, the next prescription to be more personalized and more tailored specifically for you.
AZHAR: Yes. Well, let’s dig into this because I think this is super interesting, and we can look at how things have changed. So 15 years ago, if I had a bad asthma attack, which I might have once a year, I would have needed to go and see my general physician.
In the UK, it’s very difficult to get an appointment. I would have had to see someone privately who didn’t know me at all because I’ve just walked in off the street, and I would explain my situation. It would take me half a day. Productivity lost. I’ve been miserable for a couple of days with severe wheezing. Then a few years ago the system changed, a protocol changed, and now I have a thing called a rescue pack, which includes prednisolone steroids. It includes something else I’ve just forgotten, and an antibiotic in case I get an upper respiratory tract infection, and I have an “algorithm.” It’s called a protocol. It’s printed out. It’s a flowchart.
I answer various questions, and then I say, “I’m going to prescribe this to myself.” You know, UK doctors don’t prescribe prednisolone, or prednisone as you may call it in the US, at the drop of a hat, right. It’s a powerful steroid. I can self-administer, and I can now get that repeat prescription without seeing a physician a couple of times a year. And the algorithm, the “AI” is, it’s obviously been done in PowerPoint naturally, and it’s a bunch of arrows. Surely, surely, an AI system is going to be more sophisticated, more nuanced, and give me more assurance that I’m making the right decision around something like that.
LEE: Yeah. Well, at a minimum, the AI should be able to make that PowerPoint the next time.
AZHAR: Yeah, yeah. Thank god for Clippy. Yes.
LEE: So, you know, I think in our book, we had a lot of certainty about most of the things we’ve discussed here, but one chapter where I felt we really sort of ran out of ideas, frankly, was on regulation. And, you know, what we ended up doing for that chapter is … I can’t remember if it was Carey’s or Zak’s idea, but we asked GPT-4 to have a conversation, a debate with itself, about regulation. And we made some minor commentary on that.
And really, I think we took that approach because we just didn’t have much to offer. By the way, in our defense, I don’t think anyone else had any better ideas anyway.
AZHAR: Right.
LEE: And so now two years later, do we have better ideas about the need for regulation, the frameworks around which those regulations should be developed, and, you know, what should this look like?
AZHAR: So regulation is going to be in some cases very helpful because it provides certainty for the clinician that they’re doing the right thing, that they are still insured for what they’re doing, and it provides some degree of confidence for the patient. And we need to make sure that the claims that are made stand up to quite rigorous levels, where ideally there are RCTs, and there are the classic set of processes you go through.
You do also want to be able to experiment, and so the question is: as a regulator, how can you enable conditions for there to be experimentation? And what is experimentation? Experimentation is learning so that every element of the system can learn from this experience.
So finding that space where there can be a bit of experimentation, I think, becomes very, very important. And a lot of this is about experience, so I think the first digital therapeutics have received FDA approval, which means there are now people within the FDA who understand how you go about running an approvals process for that, and what that ends up looking like. And of course, what we’re very good at doing in this sort of modern hyper-connected world is we can share that expertise, that knowledge, that experience very, very quickly.
So you go from one approval a year to a hundred approvals a year to a thousand approvals a year. So we will then actually, I suspect, need to think about what is it to approve digital therapeutics because, unlike big biological molecules, we can generate these digital therapeutics at the rate of knots.
LEE: Yes.
AZHAR: Every road in Hayes Valley in San Francisco, right, is churning out new startups who will want to do things like this. So then, I think about, what does it mean to get approved if indeed it gets approved? But we can also go really far with things that don’t require approval.
I come back to my sleep tracking ring. So I’ve been wearing this for a few years, and when I go and see my doctor or I have my annual checkup, one of the first things that he asks is how have I been sleeping. And in fact, I even sync my sleep tracking data to their medical record system, so he’s saying … hearing what I’m saying, but he’s actually pulling up the real data going, This patient’s lying to me again. Of course, I’m very truthful with my doctor, as we should all be.
LEE: You know, actually, that brings up a point that consumer-facing health AI has to deal with pop science, bad science, you know, weird stuff that you hear on Reddit. And because one of the things that consumers want to know always is, you know, what’s the truth?
AZHAR: Right.
LEE: What can I rely on? And I think that somehow feels different than an AI that you actually put in the hands of, let’s say, a licensed practitioner. And so the regulatory issues seem very, very different for these two cases somehow.
AZHAR: I agree, they’re very different. And I think for a lot of areas, you will want to build AI systems that are first and foremost for the clinician, even if they have patient extensions, that idea that the clinician can still be with a patient during the week.
And you’ll do that anyway because you need the data, and you also need a little bit of a liability shield to have like a sensible person who’s been trained around that. And I think that’s going to be a very important pathway for many AI medical crossovers. We’re going to go through the clinician.
LEE: Yeah.
AZHAR: But I also do recognize what you say about the, kind of, kooky quackery that exists on Reddit. Although on creatine, Reddit may yet prove to have been right.
LEE: Yeah, that’s right. Yes, yeah, absolutely. Yeah.
AZHAR: Sometimes it’s right. And I think that it serves a really good role as a field of extreme experimentation. So if you’re somebody who makes a continuous glucose monitor traditionally given to diabetics but now lots of people will wear them—and sports people will wear them—you probably gathered a lot of extreme tail distribution data by reading the Reddit/biohackers …
LEE: Yes.
AZHAR: … for the last few years, where people were doing things that you would never want them to really do with the CGM. And so I think we shouldn’t understate how important that petri dish can be for helping us learn what could happen next.
LEE: Oh, I think it’s absolutely going to be essential and a bigger thing in the future. So I think I just want to close here then with one last question. And I always try to be a little bit provocative with this.
And so as you look ahead to what doctors and nurses and patients might be doing two years from now, five years from now, 10 years from now, do you have any kind of firm predictions?
AZHAR: I’m going to push the boat out, and I’m going to go further out than closer in.
LEE: OK.
AZHAR: As patients, we will have many, many more touch points and interaction with our biomarkers and our health. We’ll be reading how well we feel through an array of things. And some of them we’ll be wearing directly, like sleep trackers and watches.
And so we’ll have a better sense of what’s happening in our lives. It’s like the moment you go from paper bank statements that arrive every month to being able to see your account in real time.
LEE: Yes.
AZHAR: And I suspect we’ll have … we’ll still have interactions with clinicians because societies that get richer see doctors more, societies that get older see doctors more, and we’re going to be doing both of those over the coming 10 years. But there will be a sense, I think, of continuous health engagement, not in an overbearing way, but just in a sense that we know it’s there, we can check in with it, it’s likely to be data that is compiled on our behalf somewhere centrally and delivered through a user experience that reinforces agency rather than anxiety.
And we’re learning how to do that slowly. I don’t think the health apps on our phones and devices have yet quite got that right. And that could help us personalize problems before they arise, and again, I use my experience for things that I’ve tracked really, really well. And I know from my data and from how I’m feeling when I’m on the verge of one of those severe asthma attacks that hits me once a year, and I can take a little bit of preemptive measure, so I think that that will become progressively more common and that sense that we will know our baselines.
I mean, when you think about being an athlete, which is something I think about, but I could never ever do, but what happens is you start with your detailed baselines, and that’s what your health coach looks at every three or four months. For most of us, we have no idea of our baselines. You know, we get our blood pressure measured once a year. We will have baselines, and that will help us on an ongoing basis to better understand and be in control of our health. And then if the product designers get it right, it will be done in a way that doesn’t feel invasive, but it’ll be done in a way that feels enabling. We’ll still be engaging with clinicians augmented by AI systems more and more because they will also have gone up the stack. They won’t be spending their time on just “take two Tylenol and have a lie down” type of engagements because that will be dealt with earlier on in the system. And so we will be there in a very, very different set of relationships. And they will feel that they have different ways of looking after our health.
LEE: Azeem, it’s so comforting to hear such a wonderfully optimistic picture of the future of healthcare. And I actually agree with everything you’ve said.
Let me just thank you again for joining this conversation. I think it’s been really fascinating. And I think somehow the systemic issues, the systemic issues that you tend to just see with such clarity, I think are going to be the most, kind of, profound drivers of change in the future. So thank you so much.
AZHAR: Well, thank you, it’s been my pleasure, Peter, thank you. 
I always think of Azeem as a systems thinker. He’s always able to take the experiences of new technologies at an individual level and then project out to what this could mean for whole organizations and whole societies.
In our conversation, I felt that Azeem really connected some of what we learned in a previous episode—for example, from Chrissy Farr—on the evolving consumerization of healthcare to the broader workforce and economic impacts that we’ve heard about from Ethan Mollick.
Azeem’s personal story about managing his asthma was also a great example. You know, he imagines a future, as do I, where personal AI might assist and remember decades of personal experience with a condition like asthma and thereby know more than any human being could possibly know in a deeply personalized and effective way, leading to better care. Azeem’s relentless optimism about our AI future was also so heartening to hear.
Both of these conversations leave me really optimistic about the future of AI in medicine. At the same time, it is pretty sobering to realize just how much we’ll all need to change in pretty fundamental and maybe even in radical ways. I think a big insight I got from these conversations is how we interact with machines is going to have to be altered not only at the individual level, but at the company level and maybe even at the societal level.
Since my conversation with Ethan and Azeem, there have been some pretty important developments that speak directly to this. Just last week at Build, which is Microsoft’s yearly developer conference, we announced a slew of AI agent technologies. Our CEO, Satya Nadella, in fact, started his keynote by going online in a GitHub developer environment and then assigning a coding task to an AI agent, basically treating that AI as a full-fledged member of a development team. Other agents, for example, a meeting facilitator, a data analyst, a business researcher, travel agent, and more were also shown during the conference.
But pertinent to healthcare specifically, what really blew me away was the demonstration of a healthcare orchestrator agent. And the specific thing here was in Stanford’s cancer treatment center, when they are trying to decide on potentially experimental treatments for cancer patients, they convene a meeting of experts. That is typically called a tumor board. And so this AI healthcare orchestrator agent actually participated as a full-fledged member of a tumor board meeting to help bring data together, make sure that the latest medical knowledge was brought to bear, and to assist in the decision-making around a patient’s cancer treatment. It was pretty amazing.
A big thank-you again to Ethan and Azeem for sharing their knowledge and understanding of the dynamics between AI and society more broadly. And to our listeners, thank you for joining us. I’m really excited for the upcoming episodes, including discussions on medical students’ experiences with AI and AI’s influence on the operation of health systems and public health departments. We hope you’ll continue to tune in.
Until next time.

What AI’s impact on individuals means for the health workforce and industry
Transcript
PETER LEE: “In American primary care, the missing workforce is stunning in magnitude, the shortfall estimated to reach up to 48,000 doctors within the next dozen years. China and other countries with aging populations can expect drastic shortfalls, as well. Just last month, I asked a respected colleague retiring from primary care who he would recommend as a replacement; he told me bluntly that, other than expensive concierge care practices, he could not think of anyone, even for himself. This mismatch between need and supply will only grow, and the US is far from alone among developed countries in facing it.”
This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.
Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?
In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here.
The book passage I read at the top is from “Chapter 4: Trust but Verify,” which was written by Zak. You know, it’s no secret that in the US and elsewhere shortages in medical staff and the rise of clinician burnout are affecting the quality of patient care for the worse. In our book, we predicted that generative AI would be something that might help address these issues. So in this episode, we’ll delve into how individual performance gains that our previous guests have described might affect the healthcare workforce as a whole, and on the patient side, we’ll look into the influence of generative AI on the consumerization of healthcare.
Now, since all of this consumes such a huge fraction of the overall economy, we’ll also get into what a general-purpose technology as disruptive as generative AI might mean in the context of labor markets and beyond.
To help us do that, I’m pleased to welcome Ethan Mollick and Azeem Azhar. Ethan Mollick is the Ralph J. Roberts Distinguished Faculty Scholar, a Rowan Fellow, and an associate professor at the Wharton School of the University of Pennsylvania. His research into the effects of AI on work, entrepreneurship, and education is applied by organizations around the world, leading him to be named one of Time magazine’s most influential people in AI for 2024. He’s also the author of the New York Times best-selling book Co-Intelligence. Azeem Azhar is an author, founder, investor, and one of the most thoughtful and influential voices on the interplay between disruptive emerging technologies and business and society. In his best-selling book, The Exponential Age, and in his highly regarded newsletter and podcast, Exponential View, he explores how technologies like AI are reshaping everything from healthcare to geopolitics. Ethan and Azeem are two leading thinkers on the ways that disruptive technologies—and especially AI—affect our work, our jobs, our business enterprises, and whole industries. As economists, they are trying to work out whether we are in the midst of an economic revolution as profound as the shift from an agrarian to an industrial society.
Here is my interview with Ethan Mollick:
LEE: Ethan, welcome.
ETHAN MOLLICK: So happy to be here, thank you.
LEE: I described you as a professor at Wharton, which I think most of the people who listen to this podcast series know of as an elite business school. So it might surprise some people that you study AI. And beyond that, you know, that I would seek you out to talk about AI in medicine. So to get started, how and why did it happen that you’ve become one of the leading experts on AI?
MOLLICK: It’s actually an interesting story. I’ve been AI-adjacent my whole career. When I was doing my PhD at MIT, I worked with Marvin Minsky and the MIT Media Lab’s AI group. But I was never the technical AI guy. I was the person who was trying to explain AI to everybody else who didn’t understand it. And then I became very interested in, how do you train and teach? And AI was always a part of that. I was building games for teaching, teaching tools that were used in hospitals and elsewhere, simulations. So when LLMs burst into the scene, I had already been using them and had a good sense of what they could do. And between that and, kind of, being practically oriented and getting some of the first research projects underway, especially under education and AI and performance, I became sort of a go-to person in the field. And once you’re in a field where nobody knows what’s going on and we’re all making it up as we go along—I thought it’s funny that you led with the idea that you have a couple of months head start for GPT-4, right. Like that’s all we have at this point, is a few months’ head start. So being a few months ahead is good enough to be an expert at this point. Whether it should be or not is a different question.
LEE: Well, if I understand correctly, leading AI companies like OpenAI, Anthropic, and others have now sought you out as someone who should get early access to really start to do early assessments and gauge early reactions. How has that been?
MOLLICK: So, I mean, I think the bigger picture is less about me than about two things that tells us about the state of AI right now. One, nobody really knows what’s going on, right. So in a lot of ways, if it wasn’t for your work, Peter, like, I don’t think people would be thinking about medicine as much because these systems weren’t built for medicine. They weren’t built to change education. They weren’t built to write memos. They, like, they weren’t built to do any of these things.
They weren’t really built to do anything in particular. It turns out they’re just good at many things. And to the extent that the labs work on them, they care about their coding ability above everything else and maybe math and science secondarily. They don’t think about the fact that it expresses high empathy. They don’t think about its accuracy and diagnosis or where it’s inaccurate. They don’t think about how it’s changing education forever. So one part of this is the fact that they go to my Twitter feed or ask me for advice is an indicator of where they are, too, which is they’re not thinking about this. And the fact that a few months’ head start continues to give you a lead tells you that we are at the very cutting edge. These labs aren’t sitting on projects for two years and then releasing them. Months after a project is complete or sooner, it’s out the door. Like, there’s very little delay. So we’re kind of all in the same boat here, which is a very unusual space for a new technology.
LEE: And I, you know, explained that you’re at Wharton. Are you an odd fit as a faculty member at Wharton, or is this a trend now even in business schools that AI experts are becoming key members of the faculty?
MOLLICK: I mean, it’s a little of both, right. It’s faculty, so everybody does everything. I’m a professor of innovation-entrepreneurship. I’ve launched startups before and working on that and education means I think about, how do organizations redesign themselves? How do they take advantage of these kinds of problems? So medicine’s always been very central to that, right. A lot of people in my MBA class have been MDs either switching, you know, careers or else looking to advance from being sort of individual contributors to running teams. So I don’t think that’s that bad a fit. But I also think this is general-purpose technology; it’s going to touch everything. The focus on this is medicine, but Microsoft does far more than medicine, right.
It’s … there’s transformation happening in literally every field, in every country. This is a widespread effect. So I don’t think we should be surprised that business schools matter on this because we care about management. There’s a long tradition of management and medicine going together. There’s actually a great academic paper that shows that teaching hospitals that also have MBA programs associated with them have higher management scores and perform better. So I think that these are not as foreign concepts, especially as medicine continues to get more complicated.
LEE: Yeah. Well, in fact, I want to dive a little deeper on these issues of management, of entrepreneurship, um, education. But before doing that, if I could just stay focused on you. There is always something interesting to hear from people about their first encounters with AI. And throughout this entire series, I’ve been doing that both pre-generative AI and post-generative AI. So you, sort of, hinted at the pre-generative AI. You were in Minsky’s lab. Can you say a little bit more about that early encounter? And then tell us about your first encounters with generative AI.
MOLLICK: Yeah. Those are great questions. So first of all, when I was at the media lab, that was pre-the current boom in sort of, you know, even in the old-school machine learning kind of space. So there was a lot of potential directions to head in. While I was there, there were projects underway, for example, to record every interaction small children had. One of the professors was recording everything their baby interacted with in the hope that maybe that would give them a hint about how to build an AI system. There was a bunch of projects underway that were about labeling every concept and how they relate to other concepts. So, like, it was very much Wild West of, like, how do we make an AI work—which has been this repeated problem in AI, which is, what is this thing?
The fact that it was just like brute force over the corpus of all human knowledge turns out to be a little bit of like a, you know, it’s a miracle and a little bit of a disappointment in some ways compared to how elaborate some of this was. So, you know, I think that, that was sort of my first encounters in sort of the intellectual way. The generative AI encounters actually started with the original, sort of, GPT-3, or, you know, earlier versions. And it was actually game-based. So I played games like AI Dungeon. And as an educator, I realized, oh my gosh, this stuff could write essays at a fourth-grade level. That’s really going to change the way, like, middle school works, was my thinking at the time. And I was posting about that back in, you know, 2021 that this is a big deal. But I think everybody was taken by surprise, including the AI companies themselves, by, you know, ChatGPT, by GPT-3.5. The difference in degree turned out to be a difference in kind.
LEE: Yeah, you know, if I think back, even with GPT-3, and certainly this was the case with GPT-2, it was, at least, you know, from where I was sitting, it was hard to get people to really take this seriously and pay attention.
MOLLICK: Yes.
LEE: You know, it’s remarkable. Within Microsoft, I think a turning point was the use of GPT-3 to do code completions. And that was actually productized as GitHub Copilot, the very first version. That, I think, is where there was widespread belief. But, you know, in a way, I think there is, even for me early on, a sense of denial and skepticism. Did you have those initially at any point?
MOLLICK: Yeah, I mean, it still happens today, right. Like, this is a weird technology. You know, the original denial and skepticism was, I couldn’t see where this was going. It didn’t seem like a miracle because, you know, of course computers can complete code for you. Like, what else are they supposed to do? Of course, computers can give you answers to questions and write fun things.
So there’s a difference of moving into a world of generative AI. I think a lot of people just thought that’s what computers could do. So it made the conversations a little weird. But even today, faced with these, you know, with very strong reasoner models that operate at the level of PhD students, I think a lot of people have issues with it, right. I mean, first of all, they seem intuitive to use, but they’re not always intuitive to use because the first use case that everyone puts AI to, it fails at because they use it like Google or some other use case. And then it’s genuinely upsetting in a lot of ways. I think, you know, I write in my book about the idea of three sleepless nights. That hasn’t changed. Like, you have to have an intellectual crisis to some extent, you know, and I think people do a lot to avoid having that existential angst of like, “Oh my god, what does it mean that a machine could think—apparently think—like a person?” So, I mean, I see resistance now. I saw resistance then. And then on top of all of that, there’s the fact that the curve of the technology is quite great. I mean, the price of GPT-4 level intelligence from, you know, when it was released has dropped 99.97% at this point, right.
LEE: Yes. Mm-hmm.
MOLLICK: I mean, I could run a GPT-4 class system basically on my phone. Microsoft’s releasing things that can almost run on like, you know, like it fits in almost no space, that are almost as good as the original GPT-4 models. I mean, I don’t think people have a sense of how fast the trajectory is moving either.
LEE: Yeah, you know, there’s something that I think about often. There is this existential dread, or will this technology replace me? But I think the first people to feel that are researchers—people encountering this for the first time.
You know, if you were working, let’s say, in Bayesian reasoning or in traditional, let’s say, Gaussian mixture model based, you know, speech recognition, you do get this feeling, Oh, my god, this technology has just solved the problem that I’ve dedicated my life to. And there is this really difficult period where you have to cope with that. And I think this is going to be spreading, you know, in more and more walks of life. And so this … at what point does that sort of sense of dread hit you, if ever?
MOLLICK: I mean, you know, it’s not even dread as much as like, you know, Tyler Cowen wrote that it’s impossible to not feel a little bit of sadness as you use these AI systems, too. Because, like, I was talking to a friend, just as the most minor example, and his talent that he was very proud of was he was very good at writing limericks for birthday cards. He’d write these limericks. Everyone was always amused by them. And now, you know, GPT-4 and GPT-4.5, they made limericks obsolete. Like, anyone can write a good limerick, right. So this was a talent, and it was a little sad. Like, this thing that you cared about mattered. You know, as academics, we’re a little used to dead ends, right, and like, you know, some getting the lap. But the idea that entire fields are hitting that way. Like in medicine, there’s a lot of support systems that are now obsolete. And the question is how quickly you change that. In education, a lot of our techniques are obsolete. What do you do to change that? You know, it’s like the fact that this brute force technology is good enough to solve so many problems is weird, right. And it’s not just the end of, you know, of our research angles that matter, too. Like, for example, I ran this, you know, 14-person-plus, multimillion-dollar effort at Wharton to build these teaching simulations, and we’re very proud of them. It took years of work to build one.
Now we’ve built a system that can build teaching simulations on demand by you talking to it with one team member. And, you know, you literally can create any simulation by having a discussion with the AI. I mean, you know, there’s a switch to a new form of excitement, but there is a little bit of like, this mattered to me, and, you know, now I have to change how I do things. I mean, adjustment happens. But if you haven’t had that displacement, I think that’s a good indicator that you haven’t really faced AI yet.
LEE: Yeah, what’s so interesting just listening to you is you use words like sadness, and yet I can see the—and hear the—excitement in your voice and your body language. So, you know, that’s also kind of an interesting aspect of all of this.
MOLLICK: Yeah, I mean, I think there’s something on the other side, right. But, like, I can’t say that I haven’t had moments where like, ughhhh, but then there’s joy and basically like also, you know, freeing stuff up. I mean, I think about doctors or professors, right. These are jobs that bundle together lots of different tasks that you would never have put together, right. If you’re a doctor, you would never have expected the same person to be good at keeping up with the research and being a good diagnostician and being a good manager and being good with people and being good with hand skills. Like, who would ever want that kind of bundle? That’s not something you’re all good at, right. And a lot of our stress of our job comes from the fact that we suck at some of it. And so to the extent that AI steps in for that, you kind of feel bad about some of the stuff that it’s doing that you wanted to do. But it’s much more uplifting to be like, I don’t have to do this stuff I’m bad at anymore, or I get the support to make myself good at it. And the stuff that I really care about, I can focus on more. Well, because we are at kind of a unique moment where whatever you’re best at, you’re still better than AI.
And I think it’s an ongoing question about how long that lasts. But for right now, like you’re not going to say, OK, AI replaces me entirely in my job in medicine. It’s very unlikely. But you will say it replaces these 17 things I’m bad at, but I never liked that anyway. So it’s a period of both excitement and a little anxiety.
LEE: Yeah, I’m going to want to get back to this question about in what ways AI may or may not replace doctors or some of what doctors and nurses and other clinicians do. But before that, let’s get into, I think, the real meat of this conversation. In previous episodes of this podcast, we talked to clinicians and healthcare administrators and technology developers that are very rapidly injecting AI today to do various forms of workforce automation, you know, automatically writing a clinical encounter note, automatically filling out a referral letter or request for prior authorization for some reimbursement to an insurance company. And so these sorts of things are intended not only to make things more efficient and lower costs but also to reduce various forms of drudgery, cognitive burden on frontline health workers. So how do you think about the impact of AI on that aspect of workforce, and, you know, what would you expect will happen over the next few years in terms of impact on efficiency and costs?
MOLLICK: So I mean, this is a case where I think we’re facing the big bright problem in AI in a lot of ways, which is that this is … at the individual level, there’s lots of performance gains to be gained, right. The problem, though, is that we as individuals fit into systems, in medicine as much as anywhere else or more so, right. Which is that you could individually boost your performance, but it’s also about systems that fit along with this, right.
So, you know, if you could automatically, you know, record an encounter, if you could automatically make notes, does that change what you should be expecting for notes or the value of those notes or what they’re for? How do we take what one person does and validate it across the organization and roll it out for everybody without making it a 10-year process that it feels like IT in medicine often is? Like, so we’re in this really interesting period where there’s incredible amounts of individual innovation in productivity and performance improvements in this field, like very high levels of it, but not necessarily seeing that same thing translate to organizational efficiency or gains. And one of my big concerns is seeing that happen. We’re seeing that in nonmedical problems, the same kind of thing, which is, you know, we’ve got research showing 20 and 40% performance improvements, like not uncommon to see those things. But then the organization doesn’t capture it; the system doesn’t capture it. Because the individuals are doing their own work and the systems don’t have the ability to, kind of, learn or adapt as a result.
LEE: You know, where are those productivity gains going, then, when you get to the organizational level?
MOLLICK: Well, they’re dying for a few reasons. One is, there’s a tendency for individual contributors to underestimate the power of management, right. Practices associated with good management increase happiness, decrease, you know, issues, increase success rates. In the same way, about 40%, as far as we can tell, of the US advantage over other companies, of US firms, has to do with management ability. Like, management is a big deal. Organizing is a big deal. Thinking about how you coordinate is a big deal. At the individual level, when things get stuck there, right, you can’t start bringing them up to how systems work together. It becomes, How do I deal with a doctor that has a 60% performance improvement?
We really only have one thing in our playbook for doing that right now, which is, OK, we could fire 40% of the other doctors and still have a performance gain, which is not the answer you want to see happen. So because of that, people are hiding their use. They’re actually hiding their use for lots of reasons. And it’s a weird case because the people who are able to figure out best how to use these systems, for a lot of use cases, they’re actually clinicians themselves because they’re experimenting all the time. Like, they have to take those encounter notes. And if they figure out a better way to do it, they figure that out. You don’t want to wait for, you know, a med tech company to figure that out and then sell that back to you when it can be done by the physicians themselves. So we’re just not used to a period where everybody’s innovating and where the management structure isn’t in place to take advantage of that. And so we’re seeing things stalled at the individual level, and people are often, especially in risk-averse organizations or organizations where there’s lots of regulatory hurdles, people are so afraid of the regulatory piece that they don’t even bother trying to make change.
LEE: If you are, you know, the leader of a hospital or a clinic or a whole health system, how should you approach this? You know, how should you be trying to extract positive success out of AI?
MOLLICK: So I think that you need to embrace the right kind of risk, right. We don’t want to put risk on our patients … like, we don’t want to put uninformed risk. But innovation involves risk to how organizations operate. They involve change. So I think part of this is embracing the idea that R&D has to happen in organizations again. What’s happened over the last 20 years or so has been organizations giving that up. Partially, that’s a trend to focus on what you’re good at and not try and do this other stuff.
Partially, it’s because it’s outsourced now to software companies that, like, Salesforce tells you how to organize your sales team. Workforce tells you how to organize your organization. Consultants come in and will tell you how to make change based on the average of what other people are doing in your field. So companies and organizations and hospital systems have all started to give up their ability to create their own organizational change. And when I talk to organizations, I often say they have to have two approaches. They have to think about the crowd and the lab. So the crowd is the idea of how to empower clinicians and administrators and supporter networks to start using AI and experimenting in ethical, legal ways and then sharing that information with each other. And the lab is, how are we doing R&D about the approach of how to get AI to work, not just in direct patient care, right. But also fundamentally, like, what paperwork can you cut out? How can we better explain procedures? Like, what management role can this fill? And we need to be doing active experimentation on that. We can’t just wait for, you know, Microsoft to solve the problems. It has to be at the level of the organizations themselves. LEE: So let’s shift a little bit to the patient. You know, one of the things that we see, and I think everyone is seeing, is that people are turning to chatbots, like ChatGPT, actually to seek healthcare information for, you know, their own health or the health of their loved ones. And there was already, prior to all of this, a trend towards, let’s call it, consumerization of healthcare. So just in the business of healthcare delivery, do you think AI is going to hasten these kinds of trends, or from the consumer’s perspective, what … ? MOLLICK: I mean, absolutely, right. Like, all the early data that we have suggests that for most common medical problems, you should just consult AI, too, right. 
In fact, there is a real question to ask: at what point does it become unethical for doctors themselves to not ask for a second opinion from the AI because it’s cheap, right? You could overrule it or whatever you want, but like not asking seems foolish. I think the two places where there’s a burning almost, you know, moral imperative is … let’s say, you know, I’m in Philadelphia, I’m a professor, I have access to really good healthcare through the Hospital University of Pennsylvania system. I know doctors. You know, I’m lucky. I’m well connected. If, you know, something goes wrong, I have friends who I can talk to. I have specialists. I’m, you know, pretty well educated in this space. But for most people on the planet, they don’t have access to good medical care, they don’t have good health. It feels like it’s absolutely imperative to say when should you use AI and when not. Are there blind spots? What are those things? And I worry that, like, to me, that would be the crash project I’d be invoking because I’m doing the same thing in education, which is this system is not as good as being in a room with a great teacher who also uses AI to help you, but it’s better than not getting an, you know, to the level of education people get in many cases. Where should we be using it? How do we guide usage in the right way? Because the AI labs aren’t thinking about this. We have to. So, to me, there is a burning need here to understand this. And I worry that people will say, you know, everything that’s true—AI can hallucinate, AI can be biased. All of these things are absolutely true, but people are going to use it. The early indications are that it is quite useful. And unless we take the active role of saying, here’s when to use it, here’s when not to use it, we don’t have a right to say, don’t use this system. And I think, you know, we have to be exploring that. LEE: What do people need to understand about AI? And what should schools, universities, and so on be teaching? 
MOLLICK: Those are, kind of, two separate questions in a lot of ways. I think a lot of people want to teach AI skills, and I will tell you, as somebody who works in this space a lot, there isn’t like an easy, sort of, AI skill, right. I could teach you prompt engineering in two to three classes, but every indication we have is that for most people under most circumstances, the value of prompting, you know, any one case is probably not that useful. A lot of the tricks are disappearing because the AI systems are just starting to use them themselves. So asking good questions, being a good manager, being a good thinker tend to be important, but like magic tricks around making, you know, the AI do something because you use the right phrase used to be something that was real but is rapidly disappearing. So I worry when people say teach AI skills. No one’s been able to articulate to me as somebody who knows AI very well and teaches classes on AI, what those AI skills that everyone should learn are, right. I mean, there’s value in learning a little bit how the models work. There’s a value in working with these systems. A lot of it’s just hands on keyboard kind of work. But, like, we don’t have an easy slam dunk “this is what you learn in the world of AI” because the systems are getting better, and as they get better, they get less sensitive to these prompting techniques. They get better prompting themselves. They solve problems spontaneously and start being agentic. So it’s a hard problem to ask about, like, what do you train someone on? I think getting people experience in hands-on-keyboards, getting them to … there’s like four things I could teach you about AI, and two of them are already starting to disappear. But, like, one is be direct. Like, tell the AI exactly what you want. That’s very helpful. Second, provide as much context as possible. That can include things like acting as a doctor, but also all the information you have. 
The third is give it step-by-step directions—that’s becoming less important. And the fourth is good and bad examples of the kind of output you want. Those four, that’s like, that’s it as far as the research telling you what to do, and the rest is building intuition. LEE: I’m really impressed that you didn’t give the answer, “Well, everyone should be teaching my book, Co-Intelligence.” MOLLICK: Oh, no, sorry! Everybody should be teaching my book Co-Intelligence. I apologize. LEE: It’s good to chuckle about that, but actually, I can’t think of a better book, like, if you were to assign a textbook in any professional education space, I think Co-Intelligence would be number one on my list. Are there other things that you think are essential reading? MOLLICK: That’s a really good question. I think that a lot of things are evolving very quickly. I happen to, kind of, hit a sweet spot with Co-Intelligence to some degree because I talk about how I used it, and I was, sort of, an advanced user of these systems. So, like, it’s, sort of, like my Twitter feed, my online newsletter. I’m just trying to, kind of, in some ways, it’s about trying to make people aware of what these systems can do by just showing a lot, right. Rather than picking one thing, and, like, this is a general-purpose technology. Let’s use it for this. And, like, everybody gets a light bulb for a different reason. So more than reading, it is using, you know, and that can be Copilot or whatever your favorite tool is. But using it. Voice modes help a lot. In terms of readings, I mean, I think that there is a couple of good guides to understanding AI that were originally blog posts. I think Tim Lee has one called Understanding AI, and it had a good overview … LEE: Yeah, that’s a great one. MOLLICK: … of that topic that I think explains how transformers work, which can give you some mental sense. I think Karpathy has some really nice videos of use that I would recommend. 
Like on the medical side, I think the book that you did, if you’re in medicine, you should read that. I think that that’s very valuable. But like all we can offer are hints in some ways. Like there isn’t … if you’re looking for the instruction manual, I think it can be very frustrating because it’s like you want the best practices and procedures laid out, and we cannot do that, right. That’s not how a system like this works. LEE: Yeah. MOLLICK: It’s not a person, but thinking about it like a person can be helpful, right. LEE: One of the things that has been sort of a fun project for me for the last few years is I have been a founding board member of a new medical school at Kaiser Permanente. And, you know, that medical school curriculum is being formed in this era. But it’s been perplexing to understand, you know, what this means for a medical school curriculum. And maybe even more perplexing for me, at least, is the accrediting bodies, which are extremely important in US medical schools; how accreditors should think about what’s necessary here. Besides the things that you’ve … the, kind of, four key ideas you mentioned, if you were talking to the board of directors of the LCME accrediting body, what’s the one thing you would want them to really internalize? MOLLICK: This is both a fast-moving and vital area. This can’t be viewed like a usual change, which, “Let’s see how this works.” Because it’s, like, the things that make medical technologies hard to do, which is like unclear results, limited, you know, expensive use cases where it rolls out slowly. So one or two, you know, advanced medical facilities get access to, you know, proton beams or something else at multi-billion dollars of cost, and that takes a while to diffuse out. That’s not happening here. This is all happening at the same time, all at once. This is now … AI is part of medicine. 
I mean, there’s a minor point that I’d make that actually is a really important one, which is large language models, generative AI overall, work incredibly differently than other forms of AI. So the other worry I have with some of these accreditors is they blend together algorithmic forms of AI, which medicine has been trying for a long time—decision support, algorithmic methods, like, medicine more so than other places has been thinking about those issues. Generative AI, even though it uses the same underlying techniques, is a completely different beast. So, like, even just take the most simple thing of algorithmic aversion, which is a well-understood problem in medicine, right. Which is, so you have a tool that could tell you as a radiologist, you know, the chance of this being cancer; you don’t like it, you overrule it, right. We don’t find algorithmic aversion happening with LLMs in the same way. People actually enjoy using them because it’s more like working with a person. The flaws are different. The approach is different. So you need to both view this as universally applicable today, which makes it urgent, but also as something that is not the same as your other form of AI, and your AI working group that is thinking about how to solve this problem is not the right people here. LEE: You know, I think the world has been trained because of the magic of web search to view computers as question-answering machines. Ask a question, get an answer. MOLLICK: Yes. Yes. LEE: Write a query, get results. And as I have interacted with medical professionals, you can see that medical professionals have that model of a machine in mind. And I think that’s partly, I think psychologically, why hallucination is so alarming. Because you have a mental model of a computer as a machine that has absolutely rock-solid perfect memory recall. But the thing that was so powerful in Co-Intelligence, and we tried to get at this in our book also, is that’s not the sweet spot. 
It’s this sort of deeper interaction, more of a collaboration. And I thought your use of the term Co-Intelligence really just even in the title of the book tried to capture this. When I think about education, it seems like that’s the first step, to get past this concept of a machine being just a question-answering machine. Do you have a reaction to that idea? MOLLICK: I think that’s very powerful. You know, we’ve been trained over so many years at both using computers but also in science fiction, right. Computers are about cold logic, right. They will give you the right answer, but if you ask it what love is, they explode, right. Like that’s the classic way you defeat the evil robot in Star Trek, right. “Love does not compute.” Instead, we have a system that makes mistakes, is warm, beats doctors in empathy in almost every controlled study on the subject, right. Like, absolutely can outwrite you in a sonnet but will absolutely struggle with giving you the right answer every time. And I think our mental models are just broken for this. And I think you’re absolutely right. And that’s part of what I thought your book does get at really well is, like, this is a different thing. It’s also generally applicable. Again, the model in your head should be kind of like a person even though it isn’t, right. There’s a lot of warnings and caveats to it, but if you start from person, smart person you’re talking to, your mental model will be more accurate than smart machine, even though both are flawed examples, right. So it will make mistakes; it will make errors. The question is, what do you trust it on? What do you not trust it? As you get to know a model, you’ll get to understand, like, I totally don’t trust it for this, but I absolutely trust it for that, right. LEE: All right. So we’re getting to the end of the time we have together. And so I’d just like to get now into something a little bit more provocative. And I get the question all the time. 
You know, will AI replace doctors? In medicine and other advanced knowledge work, project out five to 10 years. What do you think happens? MOLLICK: OK, so first of all, let’s acknowledge systems change much more slowly than individual use. You know, doctors are not individual actors; they’re part of systems, right. So not just the system of a patient who like may or may not want to talk to a machine instead of a person but also legal systems and administrative systems and systems that allocate labor and systems that train people. So, like, it’s hard to imagine that in five to 10 years medicine being so upended that even if AI was better than doctors at every single thing doctors do, that we’d actually see as radical a change in medicine as you might in other fields. I think you will see faster changes happen in consulting and law and, you know, coding, other spaces than medicine. But I do think that there is good reason to suspect that AI will outperform people while still having flaws, right. That’s the difference. We’re already seeing that for common medical questions in enough randomized controlled trials that, you know, best doctors beat AI, but the AI beats the mean doctor, right. Like, that’s just something we should acknowledge is happening at this point. Now, will that work in your specialty? No. Will that work with all the contingent social knowledge that you have in your space? Probably not. Like, these are vignettes, right. But, like, that’s kind of where things are. So let’s assume, right … you’re asking two questions. One is, how good will AI get? LEE: Yeah. MOLLICK: And we don’t know the answer to that question. I will tell you that your colleagues at Microsoft and increasingly the labs, the AI labs themselves, are all saying they think they’ll have a machine smarter than a human at every intellectual task in the next two to three years. If that doesn’t happen, that makes it easier to assume the future, but let’s just assume that that’s the case. 
I think medicine starts to change with the idea that people feel obligated to use this to help for everything. Your patients will be using it, and it will be your advisor and helper at the beginning phases, right. And I think that I expect people to be better at empathy. I expect better bedside manner. I expect management tasks to become easier. I think administrative burden might lighten if we handle this right way or much worse if we handle it badly. Diagnostic accuracy will increase, right. And then there’s a set of discovery pieces happening, too, right. One of the core goals of all the AI companies is to accelerate medical research. How does that happen and how does that affect us is a, kind of, unknown question. So I think clinicians are in both the eye of the storm and surrounded by it, right. Like, they can resist AI use for longer than most other fields, but everything around them is going to be affected by it. LEE: Well, Ethan, this has been really a fantastic conversation. And, you know, I think in contrast to all the other conversations we’ve had, this one gives especially the leaders in healthcare, you know, people actually trying to lead their organizations into the future, whether it’s in education or in delivery, a lot to think about. So I really appreciate you joining. MOLLICK: Thank you.  I’m a computing researcher who works with people who are right in the middle of today’s bleeding-edge developments in AI. And because of that, I often lose sight of how to talk to a broader audience about what it’s all about. And so I think one of Ethan’s superpowers is that he has this knack for explaining complex topics in AI in a really accessible way, getting right to the most important points without making it so simple as to be useless. That’s why I rarely miss an opportunity to read up on his latest work. One of the first things I learned from Ethan is the intuition that you can, sort of, think of AI as a very knowledgeable intern. 
In other words, think of it as a persona that you can interact with, but you also need to be a manager for it and to always assess the work that it does. In our discussion, Ethan went further to stress that there is, because of that, a serious education gap. You know, over the last decade or two, we’ve all been trained, mainly by search engines, to think of computers as question-answering machines. In medicine, in fact, there’s a question-answering application that is really popular called UpToDate. Doctors use it all the time. But generative AI systems like ChatGPT are different. There’s therefore a challenge in how to break out of the old-fashioned mindset of search to get the full value out of generative AI. The other big takeaway for me was that Ethan pointed out while it’s easy to see productivity gains from AI at the individual level, those same gains, at least today, don’t often translate automatically to organization-wide or system-wide gains. And one, of course, has to conclude that it takes more than just making individuals more productive; the whole system also has to adjust to the realities of AI. Here’s now my interview with Azeem Azhar: LEE: Azeem, welcome. AZEEM AZHAR: Peter, thank you so much for having me. LEE: You know, I think you’re extremely well known in the world. But still, some of the listeners of this podcast series might not have encountered you before. And so one of the ways I like to ask people to introduce themselves is, how do you explain to your parents what you do every day? AZHAR: Well, I’m very lucky in that way because my mother was the person who got me into computers more than 40 years ago. And I still have that first computer, a ZX81 with a Z80 chip … LEE: Oh wow. AZHAR: … to this day. It sits in my study, all seven and a half thousand transistors and Bakelite plastic that it is. And my parents were both economists, and economics is deeply connected with technology in some sense. 
And I grew up in the late ’70s and the early ’80s. And that was a time of tremendous optimism around technology. It was space opera, science fiction, robots, and of course, the personal computer and, you know, Bill Gates and Steve Jobs. So that’s where I started. And so, in a way, my mother and my dad, who passed away a few years ago, had always known me as someone who was fiddling with computers but also thinking about economics and society. And so, in a way, it’s easier to explain to them because they’re the ones who nurtured the environment that allowed me to research technology and AI and think about what it means to firms and to the economy at large. LEE: I always like to understand the origin story. And what I mean by that is, you know, what was your first encounter with generative AI? And what was that like? What did you go through? AZHAR: The first real moment was when Midjourney and Stable Diffusion emerged in that summer of 2022. I’d been away on vacation, and I came back—and I’d been off grid, in fact—and the world had really changed. Now, I’d been aware of GPT-3 and GPT-2, which I played around with and with BERT, the original transformer paper about seven or eight years ago, but it was the moment where I could talk to my computer, and it could produce these images, and it could be refined in natural language that really made me think we’ve crossed into a new domain. We’ve gone from AI being highly discriminative to AI that’s able to explore the world in particular ways. And then it was a few months later that ChatGPT came out—November, the 30th. And I think it was the next day or the day after that I said to my team, everyone has to use this, and we have to meet every morning and discuss how we experimented the day before. And we did that for three or four months. And, you know, it was really clear to me in that interface at that point that, you know, we’d absolutely pass some kind of threshold. LEE: And who’s the we that you were experimenting with? 
AZHAR: So I have a team of four who support me. They’re mostly researchers of different types. I mean, it’s almost like one of those jokes. You know, I have a sociologist, an economist, and an astrophysicist. And, you know, they walk into the bar, or they walk into our virtual team room, and we try to solve problems. LEE: Well, so let’s get now into brass tacks here. And I think I want to start maybe just with an exploration of the economics of all this and economic realities. Because I think in a lot of your work—for example, in your book—you look pretty deeply at how automation generally and AI specifically are transforming certain sectors like finance, manufacturing, and you have a really, kind of, insightful focus on what this means for productivity and which ways, you know, efficiencies are found. And then you, sort of, balance that with risks, things that can and do go wrong. And so as you take that background and looking at all those other sectors, in what ways are the same patterns playing out or likely to play out in healthcare and medicine? AZHAR: I’m sure we will see really remarkable parallels but also new things going on. I mean, medicine has a particular quality compared to other sectors in the sense that it’s highly regulated, market structure is very different country to country, and it’s an incredibly broad field. I mean, just think about taking a Tylenol and going through laparoscopic surgery. Having an MRI and seeing a physio. I mean, this is all medicine. I mean, it’s hard to imagine a sector that is more broad than that. So I think we can start to break it down, and, you know, where we’re seeing things with generative AI will be that the, sort of, softest entry point, which is the medical scribing. And I’m sure many of us have been with clinicians who have a medical scribe running alongside—they’re all on Surface Pros I noticed, right? They’re on the tablet computers, and they’re scribing away. 
And what that’s doing is, in the words of my friend Eric Topol, it’s giving the clinician time back, right. They have time back from days that are extremely busy and, you know, full of administrative overload. So I think you can obviously do a great deal with reducing that overload. And within my team, we have a view, which is if you do something five times in a week, you should be writing an automation for it. And if you’re a doctor, you’re probably reviewing your notes, writing the prescriptions, and so on several times a day. So those are things that can clearly be automated, and the human can be in the loop. But I think there are so many other ways just within the clinic that things can help. So, one of my friends, my friend from my junior school—I’ve known him since I was 9—is an oncologist who’s also deeply into machine learning, and he’s in Cambridge in the UK. And he built with Microsoft Research a suite of imaging AI tools from his own discipline, which they then open sourced. So that’s another way that you have an impact, which is that you actually enable the, you know, generalist, specialist, polymath, whatever they are in health systems to be able to get this technology, to tune it to their requirements, to use it, to encourage some grassroots adoption in a system that’s often been very, very heavily centralized. LEE: Yeah. AZHAR: And then I think there are some other things that are going on that I find really, really exciting. So one is the consumerization of healthcare. So I have one of those sleep tracking rings, the Oura. LEE: Yup. AZHAR: That is building a data stream that we’ll be able to apply more and more AI to. I mean, right now, it’s applying traditional, I suspect, machine learning, but you can imagine that as we start to get more data, we start to get more used to measuring ourselves, we create this sort of pot, a personal asset that we can turn AI to. And there’s still another category. 
And that other category is one of the completely novel ways in which we can enable patient care and patient pathway. And there’s a fantastic startup in the UK called Neko Health, which, I mean, does physicals, MRI scans, and blood tests, and so on. It’s hard to imagine Neko existing without the sort of advanced data, machine learning, AI that we’ve seen emerge over the last decade. So, I mean, I think that there are so many ways in which the temperature is slowly being turned up to encourage a phase change within the healthcare sector. And last but not least, I do think that these tools can also be very, very supportive of a clinician’s life cycle. I think we, as patients, we’re a bit … I don’t know if we’re as grateful as we should be for our clinicians who are putting in 90-hour weeks. But you can imagine a world where AI is able to support not just the clinicians’ workload but also their sense of stress, their sense of burnout. So just in those five areas, Peter, I sort of imagine we could start to fundamentally transform over the course of many years, of course, the way in which people think about their health and their interactions with healthcare systems. LEE: I love how you break that down. And I want to press on a couple of things. You also touched on the fact that medicine is, at least in most of the world, is a highly regulated industry. I guess finance is the same way, but they also feel different because the, like, finance sector has to be very responsive to consumers, and consumers are sensitive to, you know, an abundance of choice; they are sensitive to price. Is there something unique about medicine besides being regulated? AZHAR: I mean, there absolutely is. And in finance, as well, you have much clearer end states. So if you’re not in the consumer space, but you’re in the, you know, asset management space, you have to essentially deliver returns against the volatility or risk boundary, right. That’s what you have to go out and do. 
And I think if you’re in the consumer industry, you can come back to very, very clear measures, net promoter score being a very good example. In the case of medicine and healthcare, it is much more complicated because as far as the clinician is concerned, people are individuals, and we have our own parts and our own responses. If we didn’t, there would never be a need for a differential diagnosis. There’d never be a need for, you know, Let’s try azithromycin first, and then if that doesn’t work, we’ll go to vancomycin, or, you know, whatever it happens to be. You would just know. But ultimately, you know, people are quite different. The symptoms that they’re showing are quite different, and also their compliance is really, really different. I had a back problem that had to be dealt with by, you know, a physio and extremely boring exercises four times a week, but I was ruthless in complying, and my physio was incredibly surprised. He’d say well no one ever does this, and I said, well you know the thing is that I kind of just want to get this thing to go away. LEE: Yeah. AZHAR: And I think that that’s why medicine is and healthcare is so different and more complex. But I also think that’s why AI can be really, really helpful. I mean, we didn’t talk about, you know, AI in its ability to potentially do this, which is to extend the clinician’s presence throughout the week. LEE: Right. Yeah. AZHAR: The idea that maybe some part of what the clinician would do if you could talk to them on Wednesday, Thursday, and Friday could be delivered through an app or a chatbot just as a way of encouraging the compliance, which is often, especially with older patients, one reason why conditions, you know, linger on for longer. LEE: You know, just staying on the regulatory thing, as I’ve thought about this, the one regulated sector that I think seems to have some parallels to healthcare is energy delivery, energy distribution. 
Because like healthcare, as a consumer, I don’t have choice in who delivers electricity to my house. And even though I care about it being cheap or at least not being overcharged, I don’t have an abundance of choice. I can’t do price comparisons. And there’s something about that, just speaking as a consumer of both energy and a consumer of healthcare, that feels similar. Whereas other regulated industries, you know, somehow, as a consumer, I feel like I have a lot more direct influence and power. Does that make any sense to someone, you know, like you, who’s really much more expert in how economic systems work? AZHAR: I mean, in a sense, one part of that is very, very true. You have a limited panel of energy providers you can go to, and in the US, there may be places where you have no choice. I think the area where it’s slightly different is that as a consumer or a patient, you can actually make meaningful choices and changes yourself using these technologies, and people used to joke about you know asking Dr. Google. But Dr. Google is not terrible, particularly if you go to WebMD. And, you know, when I look at long-range change, many of the regulations that exist around healthcare delivery were formed at a point before people had access to good quality information at the touch of their fingertips or when educational levels in general were much, much lower. And many regulations existed because of the incumbent power of particular professional sectors. I’ll give you an example from the United Kingdom. So I have had asthma all of my life. That means I’ve been taking my inhaler, Ventolin, and maybe a steroid inhaler for nearly 50 years. That means that I know … actually, I’ve got more experience, and I—in some sense—know more about it than a general practitioner. LEE: Yeah. AZHAR: And until a few years ago, I would have to go to a general practitioner to get this drug that I’ve been taking for five decades, and there they are, age 30 or whatever it is. 
And a few years ago, the regulations changed. And now pharmacies can … or pharmacists can prescribe those types of drugs under certain conditions directly. LEE: Right. AZHAR: That was not to do with technology. That was to do with incumbent lock-in. So when we look at the medical industry, the healthcare space, there are some parallels with energy, but there are a few little things that the ability that the consumer has to put in some effort to learn about their condition, but also the fact that some of the regulations that exist just exist because certain professions are powerful. LEE: Yeah, one last question while we’re still on economics. There seems to be a conundrum about productivity and efficiency in healthcare delivery because I’ve never encountered a doctor or a nurse that wants to be able to handle even more patients than they’re doing on a daily basis. And so, you know, if productivity means simply, well, your rounds can now handle 16 patients instead of eight patients, that doesn’t seem necessarily to be a desirable thing. So how can we or should we be thinking about efficiency and productivity since obviously costs are, in most of the developed world, are a huge, huge problem? AZHAR: Yes, and when you described doubling the number of patients on the round, I imagined you buying them all roller skates so they could just whizz around the hospital faster and faster than ever before. We can learn from what happened with the introduction of electricity. Electricity emerged at the end of the 19th century, around the same time that cars were emerging as a product, and car makers were very small and very artisanal. And in the early 1900s, some really smart car makers figured out that electricity was going to be important. And they bought into this technology by putting pendant lights in their workshops so they could “visit more patients.” Right? LEE: Yeah, yeah. 
AZHAR: They could effectively spend more hours working, and that was a productivity enhancement, and it was noticeable. But, of course, electricity fundamentally changed the productivity by orders of magnitude of people who made cars starting with Henry Ford because he was able to reorganize his factories around the electrical delivery of power and to therefore have the moving assembly line, which 10xed the productivity of that system. So when we think about how AI will affect the clinician, the nurse, the doctor, it’s much easier for us to imagine it as the pendant light that just has them working later … LEE: Right. AZHAR: … than it is to imagine a reconceptualization of the relationship between the clinician and the people they care for. And I’m not sure. I don’t think anybody knows what that looks like. But, you know, I do think that there will be a way that this changes, and you can see that scale out factor. And it may be, Peter, that what we end up doing is we end up saying, OK, because we have these brilliant AIs, there’s a lower level of training and cost and expense that’s required for a broader range of conditions that need treating. And that expands the market, right. That expands the market hugely. It’s what has happened in the market for taxis or ride sharing. The introduction of Uber and the GPS system … LEE: Yup. AZHAR: … has meant many more people now earn their living driving people around in their cars. And at least in London, you had to be reasonably highly trained to do that. So I can see a reorganization is possible. Of course, entrenched interests, the economic flow … and there are many entrenched interests, particularly in the US between the health systems and the, you know, professional bodies that might slow things down. But I think a reimagining is possible. 
And if I may, I’ll give you one example of that, which is, if you go to countries outside of the US where there are many more sick people per doctor, they have incentives to change the way they deliver their healthcare. And well before there was AI of this quality around, there was a few cases of health systems in India—Aravind Eye Care was one, and Narayana Hrudayalaya was another. And in the latter, they were a cardiac care unit where you couldn’t get enough heart surgeons. LEE: Yeah, yep. AZHAR: So specially trained nurses would operate under the supervision of a single surgeon who would supervise many in parallel. So there are ways of increasing the quality of care, reducing the cost, but it does require a systems change. And we can’t expect a single bright algorithm to do it on its own. LEE: Yeah, really, really interesting. So now let’s get into regulation. And let me start with this question. You know, there are several startup companies I’m aware of that are pushing on, I think, a near-term future possibility that a medical AI for consumer might be allowed, say, to prescribe a medication for you, something that would normally require a doctor or a pharmacist, you know, that is certified in some way, licensed to do. Do you think we’ll get to a point where for certain regulated activities, humans are more or less cut out of the loop? AZHAR: Well, humans would have been in the loop because they would have provided the training data, they would have done the oversight, the quality control. But to your question in general, would we delegate an important decision entirely to a tested set of algorithms? I’m sure we will. We already do that. I delegate less important decisions like, What time should I leave for the airport to Waze. I delegate more important decisions to the automated braking in my car. We will do this at certain levels of risk and threshold. If I come back to my example of prescribing Ventolin. 
It’s really unclear to me that the prescription of Ventolin, this incredibly benign bronchodilator that is only used by people who’ve been through the asthma process, needs to be prescribed by someone who’s gone through 10 years or 12 years of medical training. And why that couldn’t be prescribed by an algorithm or an AI system. LEE: Right. Yep. Yep. AZHAR: So, you know, I absolutely think that that will be the case and could be the case. I can’t really see what the objections are. And the real issue is where do you draw the line of where you say, “Listen, this is too important,” or “The cost is too great,” or “The side effects are too high,” and therefore this is a point at which we want to have some, you know, human taking personal responsibility, having a liability framework in place, having a sense that there is a person with legal agency who signed off on this decision. And that line I suspect will start fairly low, and what we’d expect to see would be that that would rise progressively over time. LEE: What you just said, that scenario of your personal asthma medication, is really interesting because your personal AI might have the benefit of 50 years of your own experience with that medication. So, in a way, there is at least the data potential for, let’s say, the next prescription to be more personalized and more tailored specifically for you. AZHAR: Yes. Well, let’s dig into this because I think this is super interesting, and we can look at how things have changed. So 15 years ago, if I had a bad asthma attack, which I might have once a year, I would have needed to go and see my general physician. In the UK, it’s very difficult to get an appointment. I would have had to see someone privately who didn’t know me at all because I’ve just walked in off the street, and I would explain my situation. It would take me half a day. Productivity lost. I’ve been miserable for a couple of days with severe wheezing. 
Then a few years ago the system changed, a protocol changed, and now I have a thing called a rescue pack, which includes prednisolone steroids. It includes something else I’ve just forgotten, and an antibiotic in case I get an upper respiratory tract infection, and I have an “algorithm.” It’s called a protocol. It’s printed out. It’s a flowchart. I answer various questions, and then I say, “I’m going to prescribe this to myself.” You know, UK doctors don’t prescribe prednisolone, or prednisone as you may call it in the US, at the drop of a hat, right. It’s a powerful steroid. I can self-administer, and I can now get that repeat prescription without seeing a physician a couple of times a year. And the algorithm, the “AI” is, it’s obviously been done in PowerPoint naturally, and it’s a bunch of arrows. Surely, surely, an AI system is going to be more sophisticated, more nuanced, and give me more assurance that I’m making the right decision around something like that. LEE: Yeah. Well, at a minimum, the AI should be able to make that PowerPoint the next time. AZHAR: Yeah, yeah. Thank god for Clippy. Yes. LEE: So, you know, I think in our book, we had a lot of certainty about most of the things we’ve discussed here, but one chapter where I felt we really sort of ran out of ideas, frankly, was on regulation. And, you know, what we ended up doing for that chapter is … I can’t remember if it was Carey’s or Zak’s idea, but we asked GPT-4 to have a conversation, a debate with itself, about regulation. And we made some minor commentary on that. And really, I think we took that approach because we just didn’t have much to offer. By the way, in our defense, I don’t think anyone else had any better ideas anyway. AZHAR: Right. LEE: And so now two years later, do we have better ideas about the need for regulation, the frameworks around which those regulations should be developed, and, you know, what should this look like? 
AZHAR: So regulation is going to be in some cases very helpful because it provides certainty for the clinician that they’re doing the right thing, that they are still insured for what they’re doing, and it provides some degree of confidence for the patient. And we need to make sure that the claims that are made stand up to quite rigorous levels, where ideally there are RCTs, and there are the classic set of processes you go through. You do also want to be able to experiment, and so the question is: as a regulator, how can you enable conditions for there to be experimentation? And what is experimentation? Experimentation is learning so that every element of the system can learn from this experience. So finding that space where there can be bit of experimentation, I think, becomes very, very important. And a lot of this is about experience, so I think the first digital therapeutics have received FDA approval, which means there are now people within the FDA who understand how you go about running an approvals process for that, and what that ends up looking like—and of course what we’re very good at doing in this sort of modern hyper-connected world—is we can share that expertise, that knowledge, that experience very, very quickly. So you go from one approval a year to a hundred approvals a year to a thousand approvals a year. So we will then actually, I suspect, need to think about what is it to approve digital therapeutics because, unlike big biological molecules, we can generate these digital therapeutics at the rate of knots. LEE: Yes. AZHAR: Every road in Hayes Valley in San Francisco, right, is churning out new startups who will want to do things like this. So then, I think about, what does it mean to get approved if indeed it gets approved? But we can also go really far with things that don’t require approval. I come back to my sleep tracking ring. 
So I’ve been wearing this for a few years, and when I go and see my doctor or I have my annual checkup, one of the first things that he asks is how have I been sleeping. And in fact, I even sync my sleep tracking data to their medical record system, so he’s saying … hearing what I’m saying, but he’s actually pulling up the real data going, This patient’s lying to me again. Of course, I’m very truthful with my doctor, as we should all be. LEE: You know, actually, that brings up a point that consumer-facing health AI has to deal with pop science, bad science, you know, weird stuff that you hear on Reddit. And because one of the things that consumers want to know always is, you know, what’s the truth? AZHAR: Right. LEE: What can I rely on? And I think that somehow feels different than an AI that you actually put in the hands of, let’s say, a licensed practitioner. And so the regulatory issues seem very, very different for these two cases somehow. AZHAR: I agree, they’re very different. And I think for a lot of areas, you will want to build AI systems that are first and foremost for the clinician, even if they have patient extensions, that idea that the clinician can still be with a patient during the week. And you’ll do that anyway because you need the data, and you also need a little bit of a liability shield to have like a sensible person who’s been trained around that. And I think that’s going to be a very important pathway for many AI medical crossovers. We’re going to go through the clinician. LEE: Yeah. AZHAR: But I also do recognize what you say about the, kind of, kooky quackery that exists on Reddit. Although on creatine, Reddit may yet prove to have been right. LEE: Yeah, that’s right. Yes, yeah, absolutely. Yeah. AZHAR: Sometimes it’s right. And I think that it serves a really good role as a field of extreme experimentation. 
So if you’re somebody who makes a continuous glucose monitor traditionally given to diabetics but now lots of people will wear them—and sports people will wear them—you probably gathered a lot of extreme tail distribution data by reading the Reddit/biohackers … LEE: Yes. AZHAR: … for the last few years, where people were doing things that you would never want them to really do with the CGM. And so I think we shouldn’t understate how important that petri dish can be for helping us learn what could happen next. LEE: Oh, I think it’s absolutely going to be essential and a bigger thing in the future. So I think I just want to close here then with one last question. And I always try to be a little bit provocative with this. And so as you look ahead to what doctors and nurses and patients might be doing two years from now, five years from now, 10 years from now, do you have any kind of firm predictions? AZHAR: I’m going to push the boat out, and I’m going to go further out than closer in. LEE: OK. AZHAR: As patients, we will have many, many more touch points and interaction with our biomarkers and our health. We’ll be reading how well we feel through an array of things. And some of them we’ll be wearing directly, like sleep trackers and watches. And so we’ll have a better sense of what’s happening in our lives. It’s like the moment you go from paper bank statements that arrive every month to being able to see your account in real time. LEE: Yes. AZHAR: And I suspect we’ll have … we’ll still have interactions with clinicians because societies that get richer see doctors more, societies that get older see doctors more, and we’re going to be doing both of those over the coming 10 years. 
But there will be a sense, I think, of continuous health engagement, not in an overbearing way, but just in a sense that we know it’s there, we can check in with it, it’s likely to be data that is compiled on our behalf somewhere centrally and delivered through a user experience that reinforces agency rather than anxiety. And we’re learning how to do that slowly. I don’t think the health apps on our phones and devices have yet quite got that right. And that could help us personalize problems before they arise, and again, I use my experience for things that I’ve tracked really, really well. And I know from my data and from how I’m feeling when I’m on the verge of one of those severe asthma attacks that hits me once a year, and I can take a little bit of preemptive measure, so I think that that will become progressively more common and that sense that we will know our baselines. I mean, when you think about being an athlete, which is something I think about, but I could never ever do, but what happens is you start with your detailed baselines, and that’s what your health coach looks at every three or four months. For most of us, we have no idea of our baselines. We get our blood pressure measured once a year. We will have baselines, and that will help us on an ongoing basis to better understand and be in control of our health. And then if the product designers get it right, it will be done in a way that doesn’t feel invasive, but it’ll be done in a way that feels enabling. We’ll still be engaging with clinicians augmented by AI systems more and more because they will also have gone up the stack. They won’t be spending their time on just “take two Tylenol and have a lie down” type of engagements because that will be dealt with earlier on in the system. And so we will be there in a very, very different set of relationships. And they will feel that they have different ways of looking after our health. 
LEE: Azeem, it’s so comforting to hear such a wonderfully optimistic picture of the future of healthcare. And I actually agree with everything you’ve said. Let me just thank you again for joining this conversation. I think it’s been really fascinating. And I think somehow the systemic issues, the systemic issues that you tend to just see with such clarity, I think are going to be the most, kind of, profound drivers of change in the future. So thank you so much. AZHAR: Well, thank you, it’s been my pleasure, Peter, thank you.  I always think of Azeem as a systems thinker. He’s always able to take the experiences of new technologies at an individual level and then project out to what this could mean for whole organizations and whole societies. In our conversation, I felt that Azeem really connected some of what we learned in a previous episode—for example, from Chrissy Farr—on the evolving consumerization of healthcare to the broader workforce and economic impacts that we’ve heard about from Ethan Mollick. Azeem’s personal story about managing his asthma was also a great example. You know, he imagines a future, as do I, where personal AI might assist and remember decades of personal experience with a condition like asthma and thereby know more than any human being could possibly know in a deeply personalized and effective way, leading to better care. Azeem’s relentless optimism about our AI future was also so heartening to hear. Both of these conversations leave me really optimistic about the future of AI in medicine. At the same time, it is pretty sobering to realize just how much we’ll all need to change in pretty fundamental and maybe even in radical ways. I think a big insight I got from these conversations is how we interact with machines is going to have to be altered not only at the individual level, but at the company level and maybe even at the societal level. 
Since my conversation with Ethan and Azeem, there have been some pretty important developments that speak directly to this. Just last week at Build, which is Microsoft’s yearly developer conference, we announced a slew of AI agent technologies. Our CEO, Satya Nadella, in fact, started his keynote by going online in a GitHub developer environment and then assigning a coding task to an AI agent, basically treating that AI as a full-fledged member of a development team. Other agents, for example, a meeting facilitator, a data analyst, a business researcher, travel agent, and more were also shown during the conference. But pertinent to healthcare specifically, what really blew me away was the demonstration of a healthcare orchestrator agent. And the specific thing here was in Stanford’s cancer treatment center, when they are trying to decide on potentially experimental treatments for cancer patients, they convene a meeting of experts. That is typically called a tumor board. And so this AI healthcare orchestrator agent actually participated as a full-fledged member of a tumor board meeting to help bring data together, make sure that the latest medical knowledge was brought to bear, and to assist in the decision-making around a patient’s cancer treatment. It was pretty amazing. A big thank-you again to Ethan and Azeem for sharing their knowledge and understanding of the dynamics between AI and society more broadly. And to our listeners, thank you for joining us. I’m really excited for the upcoming episodes, including discussions on medical students’ experiences with AI and AI’s influence on the operation of health systems and public health departments. We hope you’ll continue to tune in. Until next time.

What AI’s impact on individuals means for the health workforce and industry

www.microsoft.com
Transcript [MUSIC]   [BOOK PASSAGE]  PETER LEE: “In American primary care, the missing workforce is stunning in magnitude, the shortfall estimated to reach up to 48,000 doctors within the next dozen years. China and other countries with aging populations can expect drastic shortfalls, as well. Just last month, I asked a respected colleague retiring from primary care who he would recommend as a replacement; he told me bluntly that, other than expensive concierge care practices, he could not think of anyone, even for himself. This mismatch between need and supply will only grow, and the US is far from alone among developed countries in facing it.” [END OF BOOK PASSAGE]   [THEME MUSIC]   This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.   Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?    In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here.     [THEME MUSIC FADES] The book passage I read at the top is from “Chapter 4: Trust but Verify,” which was written by Zak. You know, it’s no secret that in the US and elsewhere shortages in medical staff and the rise of clinician burnout are affecting the quality of patient care for the worse. In our book, we predicted that generative AI would be something that might help address these issues. 
So in this episode, we’ll delve into how individual performance gains that our previous guests have described might affect the healthcare workforce as a whole, and on the patient side, we’ll look into the influence of generative AI on the consumerization of healthcare. Now, since all of this consumes such a huge fraction of the overall economy, we’ll also get into what a general-purpose technology as disruptive as generative AI might mean in the context of labor markets and beyond.   To help us do that, I’m pleased to welcome Ethan Mollick and Azeem Azhar. Ethan Mollick is the Ralph J. Roberts Distinguished Faculty Scholar, a Rowan Fellow, and an associate professor at the Wharton School of the University of Pennsylvania. His research into the effects of AI on work, entrepreneurship, and education is applied by organizations around the world, leading him to be named one of Time magazine’s most influential people in AI for 2024. He’s also the author of the New York Times best-selling book Co-Intelligence. Azeem Azhar is an author, founder, investor, and one of the most thoughtful and influential voices on the interplay between disruptive emerging technologies and business and society. In his best-selling book, The Exponential Age, and in his highly regarded newsletter and podcast, Exponential View, he explores how technologies like AI are reshaping everything from healthcare to geopolitics. Ethan and Azeem are two leading thinkers on the ways that disruptive technologies—and especially AI—affect our work, our jobs, our business enterprises, and whole industries. As economists, they are trying to work out whether we are in the midst of an economic revolution as profound as the shift from an agrarian to an industrial society. [TRANSITION MUSIC] Here is my interview with Ethan Mollick: LEE: Ethan, welcome. ETHAN MOLLICK: So happy to be here, thank you. 
LEE: I described you as a professor at Wharton, which I think most of the people who listen to this podcast series know of as an elite business school. So it might surprise some people that you study AI. And beyond that, you know, that I would seek you out to talk about AI in medicine. [LAUGHTER] So to get started, how and why did it happen that you’ve become one of the leading experts on AI? MOLLICK: It’s actually an interesting story. I’ve been AI-adjacent my whole career. When I was [getting] my PhD at MIT, I worked with Marvin Minsky and the MIT [Massachusetts Institute of Technology] Media Labs AI group. But I was never the technical AI guy. I was the person who was trying to explain AI to everybody else who didn’t understand it. And then I became very interested in, how do you train and teach? And AI was always a part of that. I was building games for teaching, teaching tools that were used in hospitals and elsewhere, simulations. So when LLMs burst into the scene, I had already been using them and had a good sense of what they could do. And between that and, kind of, being practically oriented and getting some of the first research projects underway, especially under education and AI and performance, I became sort of a go-to person in the field. And once you’re in a field where nobody knows what’s going on and we’re all making it up as we go along—I thought it’s funny that you led with the idea that you have a couple of months head start for GPT-4, right. Like that’s all we have at this point, is a few months’ head start. [LAUGHTER] So being a few months ahead is good enough to be an expert at this point. Whether it should be or not is a different question. LEE: Well, if I understand correctly, leading AI companies like OpenAI, Anthropic, and others have now sought you out as someone who should get early access to really start to do early assessments and gauge early reactions. How has that been? 
MOLLICK: So, I mean, I think the bigger picture is less about me than about two things that tells us about the state of AI right now. One, nobody really knows what’s going on, right. So in a lot of ways, if it wasn’t for your work, Peter, like, I don’t think people would be thinking about medicine as much because these systems weren’t built for medicine. They weren’t built to change education. They weren’t built to write memos. They, like, they weren’t built to do any of these things. They weren’t really built to do anything in particular. It turns out they’re just good at many things. And to the extent that the labs work on them, they care about their coding ability above everything else and maybe math and science secondarily. They don’t think about the fact that it expresses high empathy. They don’t think about its accuracy and diagnosis or where it’s inaccurate. They don’t think about how it’s changing education forever. So one part of this is the fact that they go to my Twitter feed or ask me for advice is an indicator of where they are, too, which is they’re not thinking about this. And the fact that a few months’ head start continues to give you a lead tells you that we are at the very cutting edge. These labs aren’t sitting on projects for two years and then releasing them. Months after a project is complete or sooner, it’s out the door. Like, there’s very little delay. So we’re kind of all in the same boat here, which is a very unusual space for a new technology. LEE: And I, you know, explained that you’re at Wharton. Are you an odd fit as a faculty member at Wharton, or is this a trend now even in business schools that AI experts are becoming key members of the faculty? MOLLICK: I mean, it’s a little of both, right. It’s faculty, so everybody does everything. I’m a professor of innovation-entrepreneurship. I’ve launched startups before and working on that and education means I think about, how do organizations redesign themselves? 
How do they take advantage of these kinds of problems? So medicine’s always been very central to that, right. A lot of people in my MBA class have been MDs either switching, you know, careers or else looking to advance from being sort of individual contributors to running teams. So I don’t think that’s that bad a fit. But I also think this is general-purpose technology; it’s going to touch everything. The focus on this is medicine, but Microsoft does far more than medicine, right. It’s … there’s transformation happening in literally every field, in every country. This is a widespread effect. So I don’t think we should be surprised that business schools matter on this because we care about management. There’s a long tradition of management and medicine going together. There’s actually a great academic paper that shows that teaching hospitals that also have MBA programs associated with them have higher management scores and perform better. So I think that these are not as foreign concepts, especially as medicine continues to get more complicated. LEE: Yeah. Well, in fact, I want to dive a little deeper on these issues of management, of entrepreneurship, um, education. But before doing that, if I could just stay focused on you. There is always something interesting to hear from people about their first encounters with AI. And throughout this entire series, I’ve been doing that both pre-generative AI and post-generative AI. So you, sort of, hinted at the pre-generative AI. You were in Minsky’s lab. Can you say a little bit more about that early encounter? And then tell us about your first encounters with generative AI. MOLLICK: Yeah. Those are great questions. So first of all, when I was at the media lab, that was pre-the current boom in sort of, you know, even in the old-school machine learning kind of space. So there was a lot of potential directions to head in. 
While I was there, there were projects underway, for example, to record every interaction small children had. One of the professors was recording everything their baby interacted with in the hope that maybe that would give them a hint about how to build an AI system. There was a bunch of projects underway that were about labeling every concept and how they relate to other concepts. So, like, it was very much Wild West of, like, how do we make an AI work—which has been this repeated problem in AI, which is, what is this thing? The fact that it was just like brute force over the corpus of all human knowledge turns out to be a little bit of like a, you know, it’s a miracle and a little bit of a disappointment in some ways [LAUGHTER] compared to how elaborate some of this was. So, you know, I think that, that was sort of my first encounters in sort of the intellectual way. The generative AI encounters actually started with the original, sort of, GPT-3, or, you know, earlier versions. And it was actually game-based. So I played games like AI Dungeon. And as an educator, I realized, oh my gosh, this stuff could write essays at a fourth-grade level. That’s really going to change the way, like, middle school works, was my thinking at the time. And I was posting about that back in, you know, 2021 that this is a big deal. But I think everybody was taken by surprise, including the AI companies themselves, by, you know, ChatGPT, by GPT-3.5. The difference in degree turned out to be a difference in kind. LEE: Yeah, you know, if I think back, even with GPT-3, and certainly this was the case with GPT-2, it was, at least, you know, from where I was sitting, it was hard to get people to really take this seriously and pay attention. MOLLICK: Yes. LEE: You know, it’s remarkable. Within Microsoft, I think a turning point was the use of GPT-3 to do code completions. And that was actually productized as GitHub Copilot, the very first version. 
That, I think, is where there was widespread belief. But, you know, in a way, I think there is, even for me early on, a sense of denial and skepticism. Did you have those initially at any point? MOLLICK: Yeah, I mean, it still happens today, right. Like, this is a weird technology. You know, the original denial and skepticism was, I couldn’t see where this was going. It didn’t seem like a miracle because, you know, of course computers can complete code for you. Like, what else are they supposed to do? Of course, computers can give you answers to questions and write fun things. So there’s difference of moving into a world of generative AI. I think a lot of people just thought that’s what computers could do. So it made the conversations a little weird. But even today, faced with these, you know, with very strong reasoner models that operate at the level of PhD students, I think a lot of people have issues with it, right. I mean, first of all, they seem intuitive to use, but they’re not always intuitive to use because the first use case that everyone puts AI to, it fails at because they use it like Google or some other use case. And then it’s genuinely upsetting in a lot of ways. I think, you know, I write in my book about the idea of three sleepless nights. That hasn’t changed. Like, you have to have an intellectual crisis to some extent, you know, and I think people do a lot to avoid having that existential angst of like, “Oh my god, what does it mean that a machine could think—apparently think—like a person?” So, I mean, I see resistance now. I saw resistance then. And then on top of all of that, there’s the fact that the curve of the technology is quite great. I mean, the price of GPT-4 level intelligence from, you know, when it was released has dropped 99.97% at this point, right. LEE: Yes. Mm-hmm. MOLLICK: I mean, I could run a GPT-4 class system basically on my phone. 
Microsoft’s releasing things that can almost run on like, you know, like it fits in almost no space, that are almost as good as the original GPT-4 models. I mean, I don’t think people have a sense of how fast the trajectory is moving either. LEE: Yeah, you know, there’s something that I think about often. There is this existential dread, or will this technology replace me? But I think the first people to feel that are researchers—people encountering this for the first time. You know, if you were working, let’s say, in Bayesian reasoning or in traditional, let’s say, Gaussian mixture model based, you know, speech recognition, you do get this feeling, Oh, my god, this technology has just solved the problem that I’ve dedicated my life to. And there is this really difficult period where you have to cope with that. And I think this is going to be spreading, you know, in more and more walks of life. And so this … at what point does that sort of sense of dread hit you, if ever? MOLLICK: I mean, you know, it’s not even dread as much as like, you know, Tyler Cowen wrote that it’s impossible to not feel a little bit of sadness as you use these AI systems, too. Because, like, I was talking to a friend, just as the most minor example, and his talent that he was very proud of was he was very good at writing limericks for birthday cards. He’d write these limericks. Everyone was always amused by them. [LAUGHTER] And now, you know, GPT-4 and GPT-4.5, they made limericks obsolete. Like, anyone can write a good limerick, right. So this was a talent, and it was a little sad. Like, this thing that you cared about mattered. You know, as academics, we’re a little used to dead ends, right, and like, you know, some getting the lap. But the idea that entire fields are hitting that way. Like in medicine, there’s a lot of support systems that are now obsolete. And the question is how quickly you change that. In education, a lot of our techniques are obsolete. What do you do to change that? 
You know, it’s like the fact that this brute force technology is good enough to solve so many problems is weird, right. And it’s not just the end of, you know, of our research angles that matter, too. Like, for example, I ran this, you know, 14-person-plus, multimillion-dollar effort at Wharton to build these teaching simulations, and we’re very proud of them. It took years of work to build one. Now we’ve built a system that can build teaching simulations on demand by you talking to it with one team member. And, you know, you literally can create any simulation by having a discussion with the AI. I mean, you know, there’s a switch to a new form of excitement, but there is a little bit of like, this mattered to me, and, you know, now I have to change how I do things. I mean, adjustment happens. But if you haven’t had that displacement, I think that’s a good indicator that you haven’t really faced AI yet. LEE: Yeah, what’s so interesting just listening to you is you use words like sadness, and yet I can see the—and hear the—excitement in your voice and your body language. So, you know, that’s also kind of an interesting aspect of all of this. MOLLICK: Yeah, I mean, I think there’s something on the other side, right. But, like, I can’t say that I haven’t had moments where like, ughhhh, but then there’s joy and basically like also, you know, freeing stuff up. I mean, I think about doctors or professors, right. These are jobs that bundle together lots of different tasks that you would never have put together, right. If you’re a doctor, you would never have expected the same person to be good at keeping up with the research and being a good diagnostician and being a good manager and being good with people and being good with hand skills. Like, who would ever want that kind of bundle? That’s not something you’re all good at, right. And a lot of our stress of our job comes from the fact that we suck at some of it. 
And so to the extent that AI steps in for that, you kind of feel bad about some of the stuff that it’s doing that you wanted to do. But it’s much more uplifting to be like, I don’t have to do this stuff I’m bad at anymore, or I get the support to make myself good at it. And the stuff that I really care about, I can focus on more. Well, because we are at kind of a unique moment where whatever you’re best at, you’re still better than AI. And I think it’s an ongoing question about how long that lasts. But for right now, like you’re not going to say, OK, AI replaces me entirely in my job in medicine. It’s very unlikely. But you will say it replaces these 17 things I’m bad at, but I never liked that anyway. So it’s a period of both excitement and a little anxiety. LEE: Yeah, I’m going to want to get back to this question about in what ways AI may or may not replace doctors or some of what doctors and nurses and other clinicians do. But before that, let’s get into, I think, the real meat of this conversation. In previous episodes of this podcast, we talked to clinicians and healthcare administrators and technology developers that are very rapidly injecting AI today to do various forms of workforce automation, you know, automatically writing a clinical encounter note, automatically filling out a referral letter or request for prior authorization for some reimbursement to an insurance company. And so these sorts of things are intended not only to make things more efficient and lower costs but also to reduce various forms of drudgery, cognitive burden on frontline health workers. So how do you think about the impact of AI on that aspect of workforce, and, you know, what would you expect will happen over the next few years in terms of impact on efficiency and costs? MOLLICK: So I mean, this is a case where I think we’re facing the big bright problem in AI in a lot of ways, which is that this is … at the individual level, there’s lots of performance gains to be gained, right.
The problem, though, is that we as individuals fit into systems, in medicine as much as anywhere else or more so, right. Which is that you could individually boost your performance, but it’s also about systems that fit along with this, right. So, you know, if you could automatically, you know, record an encounter, if you could automatically make notes, does that change what you should be expecting for notes or the value of those notes or what they’re for? How do we take what one person does and validate it across the organization and roll it out for everybody without making it a 10-year process that it feels like IT in medicine often is? Like, so we’re in this really interesting period where there’s incredible amounts of individual innovation in productivity and performance improvements in this field, like very high levels of it, but not necessarily seeing that same thing translate to organizational efficiency or gains. And one of my big concerns is seeing that happen. We’re seeing that in nonmedical problems, the same kind of thing, which is, you know, we’ve got research showing 20 and 40% performance improvements, like not uncommon to see those things. But then the organization doesn’t capture it; the system doesn’t capture it. Because the individuals are doing their own work and the systems don’t have the ability to, kind of, learn or adapt as a result. LEE: You know, where are those productivity gains going, then, when you get to the organizational level? MOLLICK: Well, they’re dying for a few reasons. One is, there’s a tendency for individual contributors to underestimate the power of management, right. Practices associated with good management increase happiness, decrease, you know, issues, increase success rates. In the same way, about 40%, as far as we can tell, of the US advantage over other companies, of US firms, has to do with management ability. Like, management is a big deal. Organizing is a big deal. Thinking about how you coordinate is a big deal. 
At the individual level, when things get stuck there, right, you can’t start bringing them up to how systems work together. It becomes, How do I deal with a doctor that has a 60% performance improvement? We really only have one thing in our playbook for doing that right now, which is, OK, we could fire 40% of the other doctors and still have a performance gain, which is not the answer you want to see happen. So because of that, people are hiding their use. They’re actually hiding their use for lots of reasons. And it’s a weird case because the people who are able to figure out best how to use these systems, for a lot of use cases, they’re actually clinicians themselves because they’re experimenting all the time. Like, they have to take those encounter notes. And if they figure out a better way to do it, they figure that out. You don’t want to wait for, you know, a med tech company to figure that out and then sell that back to you when it can be done by the physicians themselves. So we’re just not used to a period where everybody’s innovating and where the management structure isn’t in place to take advantage of that. And so we’re seeing things stalled at the individual level, and people are often, especially in risk-averse organizations or organizations where there’s lots of regulatory hurdles, people are so afraid of the regulatory piece that they don’t even bother trying to make change. LEE: If you are, you know, the leader of a hospital or a clinic or a whole health system, how should you approach this? You know, how should you be trying to extract positive success out of AI? MOLLICK: So I think that you need to embrace the right kind of risk, right. We don’t want to put risk on our patients … like, we don’t want to put uninformed risk. But innovation involves risk to how organizations operate. They involve change. So I think part of this is embracing the idea that R&D has to happen in organizations again. 
What’s happened over the last 20 years or so has been organizations giving that up. Partially, that’s a trend to focus on what you’re good at and not try and do this other stuff. Partially, it’s because it’s outsourced now to software companies that, like, Salesforce tells you how to organize your sales team. Workforce tells you how to organize your organization. Consultants come in and will tell you how to make change based on the average of what other people are doing in your field. So companies and organizations and hospital systems have all started to give up their ability to create their own organizational change. And when I talk to organizations, I often say they have to have two approaches. They have to think about the crowd and the lab. So the crowd is the idea of how to empower clinicians and administrators and supporter networks to start using AI and experimenting in ethical, legal ways and then sharing that information with each other. And the lab is, how are we doing R&D about the approach of how to [get] AI to work, not just in direct patient care, right. But also fundamentally, like, what paperwork can you cut out? How can we better explain procedures? Like, what management role can this fill? And we need to be doing active experimentation on that. We can’t just wait for, you know, Microsoft to solve the problems. It has to be at the level of the organizations themselves. LEE: So let’s shift a little bit to the patient. You know, one of the things that we see, and I think everyone is seeing, is that people are turning to chatbots, like ChatGPT, actually to seek healthcare information for, you know, their own health or the health of their loved ones. And there was already, prior to all of this, a trend towards, let’s call it, consumerization of healthcare. So just in the business of healthcare delivery, do you think AI is going to hasten these kinds of trends, or from the consumer’s perspective, what … ? MOLLICK: I mean, absolutely, right. 
Like, all the early data that we have suggests that for most common medical problems, you should just consult AI, too, right. In fact, there is a real question to ask: at what point does it become unethical for doctors themselves to not ask for a second opinion from the AI because it’s cheap, right? You could overrule it or whatever you want, but like not asking seems foolish. I think the two places where there’s a burning almost, you know, moral imperative is … let’s say, you know, I’m in Philadelphia, I’m a professor, I have access to really good healthcare through the Hospital of the University of Pennsylvania system. I know doctors. You know, I’m lucky. I’m well connected. If, you know, something goes wrong, I have friends who I can talk to. I have specialists. I’m, you know, pretty well educated in this space. But for most people on the planet, they don’t have access to good medical care, they don’t have good health. It feels like it’s absolutely imperative to say when should you use AI and when not. Are there blind spots? What are those things? And I worry that, like, to me, that would be the crash project I’d be invoking because I’m doing the same thing in education, which is this system is not as good as being in a room with a great teacher who also uses AI to help you, but it’s better than not getting an, you know, to the level of education people get in many cases. Where should we be using it? How do we guide usage in the right way? Because the AI labs aren’t thinking about this. We have to. So, to me, there is a burning need here to understand this. And I worry that people will say, you know, everything that’s true—AI can hallucinate, AI can be biased. All of these things are absolutely true, but people are going to use it. The early indications are that it is quite useful. And unless we take the active role of saying, here’s when to use it, here’s when not to use it, we don’t have a right to say, don’t use this system.
And I think, you know, we have to be exploring that. LEE: What do people need to understand about AI? And what should schools, universities, and so on be teaching? MOLLICK: Those are, kind of, two separate questions in a lot of ways. I think a lot of people want to teach AI skills, and I will tell you, as somebody who works in this space a lot, there isn’t like an easy, sort of, AI skill, right. I could teach you prompt engineering in two to three classes, but every indication we have is that for most people under most circumstances, the value of prompting, you know, any one case is probably not that useful. A lot of the tricks are disappearing because the AI systems are just starting to use them themselves. So asking good questions, being a good manager, being a good thinker tend to be important, but like magic tricks around making, you know, the AI do something because you use the right phrase used to be something that was real but is rapidly disappearing. So I worry when people say teach AI skills. No one’s been able to articulate to me as somebody who knows AI very well and teaches classes on AI, what those AI skills that everyone should learn are, right. I mean, there’s value in learning a little bit how the models work. There’s a value in working with these systems. A lot of it’s just hands on keyboard kind of work. But, like, we don’t have an easy slam dunk “this is what you learn in the world of AI” because the systems are getting better, and as they get better, they get less sensitive to these prompting techniques. They get better prompting themselves. They solve problems spontaneously and start being agentic. So it’s a hard problem to ask about, like, what do you train someone on? I think getting people experience in hands-on-keyboards, getting them to … there’s like four things I could teach you about AI, and two of them are already starting to disappear. But, like, one is be direct. Like, tell the AI exactly what you want. That’s very helpful.
Second, provide as much context as possible. That can include things like acting as a doctor, but also all the information you have. The third is give it step-by-step directions—that’s becoming less important. And the fourth is good and bad examples of the kind of output you want. Those four, that’s like, that’s it as far as the research telling you what to do, and the rest is building intuition. LEE: I’m really impressed that you didn’t give the answer, “Well, everyone should be teaching my book, Co-Intelligence.” [LAUGHS] MOLLICK: Oh, no, sorry! Everybody should be teaching my book Co-Intelligence. I apologize. [LAUGHTER] LEE: It’s good to chuckle about that, but actually, I can’t think of a better book, like, if you were to assign a textbook in any professional education space, I think Co-Intelligence would be number one on my list. Are there other things that you think are essential reading? MOLLICK: That’s a really good question. I think that a lot of things are evolving very quickly. I happen to, kind of, hit a sweet spot with Co-Intelligence to some degree because I talk about how I used it, and I was, sort of, an advanced user of these systems. So, like, it’s, sort of, like my Twitter feed, my online newsletter. I’m just trying to, kind of, in some ways, it’s about trying to make people aware of what these systems can do by just showing a lot, right. Rather than picking one thing, and, like, this is a general-purpose technology. Let’s use it for this. And, like, everybody gets a light bulb for a different reason. So more than reading, it is using, you know, and that can be Copilot or whatever your favorite tool is. But using it. Voice modes help a lot. In terms of readings, I mean, I think that there are a couple of good guides to understanding AI that were originally blog posts. I think Tim Lee has one called Understanding AI, and it had a good overview … LEE: Yeah, that’s a great one.
MOLLICK: … of that topic that I think explains how transformers work, which can give you some mental sense. I think [Andrej] Karpathy has some really nice videos of use that I would recommend. Like on the medical side, I think the book that you did, if you’re in medicine, you should read that. I think that that’s very valuable. But like all we can offer are hints in some ways. Like there isn’t … if you’re looking for the instruction manual, I think it can be very frustrating because it’s like you want the best practices and procedures laid out, and we cannot do that, right. That’s not how a system like this works. LEE: Yeah. MOLLICK: It’s not a person, but thinking about it like a person can be helpful, right. LEE: One of the things that has been sort of a fun project for me for the last few years is I have been a founding board member of a new medical school at Kaiser Permanente. And, you know, that medical school curriculum is being formed in this era. But it’s been perplexing to understand, you know, what this means for a medical school curriculum. And maybe even more perplexing for me, at least, is the accrediting bodies, which are extremely important in US medical schools; how accreditors should think about what’s necessary here. Besides the things that you’ve … the, kind of, four key ideas you mentioned, if you were talking to the board of directors of the LCME [Liaison Committee on Medical Education] accrediting body, what’s the one thing you would want them to really internalize? MOLLICK: This is both a fast-moving and vital area. This can’t be viewed like a usual change, which [is], “Let’s see how this works.” Because it’s, like, the things that make medical technologies hard to do, which is like unclear results, limited, you know, expensive use cases where it rolls out slowly.
So one or two, you know, advanced medical facilities get access to, you know, proton beams or something else at multi-billion dollars of cost, and that takes a while to diffuse out. That’s not happening here. This is all happening at the same time, all at once. This is now … AI is part of medicine. I mean, there’s a minor point that I’d make that actually is a really important one, which is large language models, generative AI overall, work incredibly differently than other forms of AI. So the other worry I have with some of these accreditors is they blend together algorithmic forms of AI, which medicine has been trying for a long time—decision support, algorithmic methods, like, medicine more so than other places has been thinking about those issues. Generative AI, even though it uses the same underlying techniques, is a completely different beast. So, like, even just take the most simple thing of algorithmic aversion, which is a well-understood problem in medicine, right. Which is, so you have a tool that could tell you as a radiologist, you know, the chance of this being cancer; you don’t like it, you overrule it, right. We don’t find algorithmic aversion happening with LLMs in the same way. People actually enjoy using them because it’s more like working with a person. The flaws are different. The approach is different. So you need to both view this as universally applicable today, which makes it urgent, but also as something that is not the same as your other form of AI, and your AI working group that is thinking about how to solve this problem is not the right people here. LEE: You know, I think the world has been trained because of the magic of web search to view computers as question-answering machines. Ask a question, get an answer. MOLLICK: Yes. Yes. LEE: Write a query, get results. And as I have interacted with medical professionals, you can see that medical professionals have that model of a machine in mind.
And I think that’s partly, I think psychologically, why hallucination is so alarming. Because you have a mental model of a computer as a machine that has absolutely rock-solid perfect memory recall. But the thing that was so powerful in Co-Intelligence, and we tried to get at this in our book also, is that’s not the sweet spot. It’s this sort of deeper interaction, more of a collaboration. And I thought your use of the term Co-Intelligence really just even in the title of the book tried to capture this. When I think about education, it seems like that’s the first step, to get past this concept of a machine being just a question-answering machine. Do you have a reaction to that idea? MOLLICK: I think that’s very powerful. You know, we’ve been trained over so many years at both using computers but also in science fiction, right. Computers are about cold logic, right. They will give you the right answer, but if you ask it what love is, they explode, right. Like that’s the classic way you defeat the evil robot in Star Trek, right. “Love does not compute.” [LAUGHTER] Instead, we have a system that makes mistakes, is warm, beats doctors in empathy in almost every controlled study on the subject, right. Like, absolutely can outwrite you in a sonnet but will absolutely struggle with giving you the right answer every time. And I think our mental models are just broken for this. And I think you’re absolutely right. And that’s part of what I thought your book does get at really well is, like, this is a different thing. It’s also generally applicable. Again, the model in your head should be kind of like a person even though it isn’t, right. There’s a lot of warnings and caveats to it, but if you start from person, smart person you’re talking to, your mental model will be more accurate than smart machine, even though both are flawed examples, right. So it will make mistakes; it will make errors. The question is, what do you trust it on? What do you not trust it? 
As you get to know a model, you’ll get to understand, like, I totally don’t trust it for this, but I absolutely trust it for that, right. LEE: All right. So we’re getting to the end of the time we have together. And so I’d just like to get now into something a little bit more provocative. And I get the question all the time. You know, will AI replace doctors? In medicine and other advanced knowledge work, project out five to 10 years. What do you think happens? MOLLICK: OK, so first of all, let’s acknowledge systems change much more slowly than individual use. You know, doctors are not individual actors; they’re part of systems, right. So not just the system of a patient who like may or may not want to talk to a machine instead of a person but also legal systems and administrative systems and systems that allocate labor and systems that train people. So, like, it’s hard to imagine that in five to 10 years medicine being so upended that even if AI was better than doctors at every single thing doctors do, that we’d actually see as radical a change in medicine as you might in other fields. I think you will see faster changes happen in consulting and law and, you know, coding, other spaces than medicine. But I do think that there is good reason to suspect that AI will outperform people while still having flaws, right. That’s the difference. We’re already seeing that for common medical questions in enough randomized controlled trials that, you know, best doctors beat AI, but the AI beats the mean doctor, right. Like, that’s just something we should acknowledge is happening at this point. Now, will that work in your specialty? No. Will that work with all the contingent social knowledge that you have in your space? Probably not. Like, these are vignettes, right. But, like, that’s kind of where things are. So let’s assume, right … you’re asking two questions. One is, how good will AI get? LEE: Yeah. MOLLICK: And we don’t know the answer to that question.
I will tell you that your colleagues at Microsoft and increasingly the labs, the AI labs themselves, are all saying they think they’ll have a machine smarter than a human at every intellectual task in the next two to three years. If that doesn’t happen, that makes it easier to assume the future, but let’s just assume that that’s the case. I think medicine starts to change with the idea that people feel obligated to use this to help for everything. Your patients will be using it, and it will be your advisor and helper at the beginning phases, right. And I think that I expect people to be better at empathy. I expect better bedside manner. I expect management tasks to become easier. I think administrative burden might lighten if we handle this the right way or much worse if we handle it badly. Diagnostic accuracy will increase, right. And then there’s a set of discovery pieces happening, too, right. One of the core goals of all the AI companies is to accelerate medical research. How does that happen and how does that affect us is a, kind of, unknown question. So I think clinicians are in both the eye of the storm and surrounded by it, right. Like, they can resist AI use for longer than most other fields, but everything around them is going to be affected by it. LEE: Well, Ethan, this has been really a fantastic conversation. And, you know, I think in contrast to all the other conversations we’ve had, this one gives especially the leaders in healthcare, you know, people actually trying to lead their organizations into the future, whether it’s in education or in delivery, a lot to think about. So I really appreciate you joining. MOLLICK: Thank you. [TRANSITION MUSIC] I’m a computing researcher who works with people who are right in the middle of today’s bleeding-edge developments in AI. And because of that, I often lose sight of how to talk to a broader audience about what it’s all about.
And so I think one of Ethan’s superpowers is that he has this knack for explaining complex topics in AI in a really accessible way, getting right to the most important points without making it so simple as to be useless. That’s why I rarely miss an opportunity to read up on his latest work. One of the first things I learned from Ethan is the intuition that you can, sort of, think of AI as a very knowledgeable intern. In other words, think of it as a persona that you can interact with, but you also need to be a manager for it and to always assess the work that it does. In our discussion, Ethan went further to stress that there is, because of that, a serious education gap. You know, over the last decade or two, we’ve all been trained, mainly by search engines, to think of computers as question-answering machines. In medicine, in fact, there’s a question-answering application that is really popular called UpToDate. Doctors use it all the time. But generative AI systems like ChatGPT are different. There’s therefore a challenge in how to break out of the old-fashioned mindset of search to get the full value out of generative AI. The other big takeaway for me was that Ethan pointed out while it’s easy to see productivity gains from AI at the individual level, those same gains, at least today, don’t often translate automatically to organization-wide or system-wide gains. And one, of course, has to conclude that it takes more than just making individuals more productive; the whole system also has to adjust to the realities of AI. Here’s now my interview with Azeem Azhar: LEE: Azeem, welcome. AZEEM AZHAR: Peter, thank you so much for having me. LEE: You know, I think you’re extremely well known in the world. But still, some of the listeners of this podcast series might not have encountered you before. And so one of the ways I like to ask people to introduce themselves is, how do you explain to your parents what you do every day?
AZHAR: Well, I’m very lucky in that way because my mother was the person who got me into computers more than 40 years ago. And I still have that first computer, a ZX81 with a Z80 chip … LEE: Oh wow. AZHAR: … to this day. It sits in my study, all seven and a half thousand transistors and Bakelite plastic that it is. And my parents were both economists, and economics is deeply connected with technology in some sense. And I grew up in the late ’70s and the early ’80s. And that was a time of tremendous optimism around technology. It was space opera, science fiction, robots, and of course, the personal computer and, you know, Bill Gates and Steve Jobs. So that’s where I started. And so, in a way, my mother and my dad, who passed away a few years ago, had always known me as someone who was fiddling with computers but also thinking about economics and society. And so, in a way, it’s easier to explain to them because they’re the ones who nurtured the environment that allowed me to research technology and AI and think about what it means to firms and to the economy at large. LEE: I always like to understand the origin story. And what I mean by that is, you know, what was your first encounter with generative AI? And what was that like? What did you go through? AZHAR: The first real moment was when Midjourney and Stable Diffusion emerged in that summer of 2022. I’d been away on vacation, and I came back—and I’d been off grid, in fact—and the world had really changed. Now, I’d been aware of GPT-3 and GPT-2, which I played around with and with BERT, the original transformer paper about seven or eight years ago, but it was the moment where I could talk to my computer, and it could produce these images, and it could be refined in natural language that really made me think we’ve crossed into a new domain. We’ve gone from AI being highly discriminative to AI that’s able to explore the world in particular ways. 
And then it was a few months later that ChatGPT came out—November, the 30th. And I think it was the next day or the day after that I said to my team, everyone has to use this, and we have to meet every morning and discuss how we experimented the day before. And we did that for three or four months. And, you know, it was really clear to me in that interface at that point that, you know, we’d absolutely passed some kind of threshold. LEE: And who’s the we that you were experimenting with? AZHAR: So I have a team of four who support me. They’re mostly researchers of different types. I mean, it’s almost like one of those jokes. You know, I have a sociologist, an economist, and an astrophysicist. And, you know, they walk into the bar, [LAUGHTER] or they walk into our virtual team room, and we try to solve problems. LEE: Well, so let’s get now into brass tacks here. And I think I want to start maybe just with an exploration of the economics of all this and economic realities. Because I think in a lot of your work—for example, in your book—you look pretty deeply at how automation generally and AI specifically are transforming certain sectors like finance, manufacturing, and you have a really, kind of, insightful focus on what this means for productivity and which ways, you know, efficiencies are found. And then you, sort of, balance that with risks, things that can and do go wrong. And so as you take that background and looking at all those other sectors, in what ways are the same patterns playing out or likely to play out in healthcare and medicine? AZHAR: I’m sure we will see really remarkable parallels but also new things going on. I mean, medicine has a particular quality compared to other sectors in the sense that it’s highly regulated, market structure is very different country to country, and it’s an incredibly broad field. I mean, just think about taking a Tylenol and going through laparoscopic surgery. Having an MRI and seeing a physio.
I mean, this is all medicine. I mean, it’s hard to imagine a sector that is [LAUGHS] more broad than that. So I think we can start to break it down, and, you know, where we’re seeing things with generative AI will be that the, sort of, softest entry point, which is the medical scribing. And I’m sure many of us have been with clinicians who have a medical scribe running alongside. They’re all on Surface Pros I noticed, right? [LAUGHTER] They’re on the tablet computers, and they’re scribing away. And what that’s doing is, in the words of my friend Eric Topol, it’s giving the clinician time back, right. They have time back from days that are extremely busy and, you know, full of administrative overload. So I think you can obviously do a great deal with reducing that overload. And within my team, we have a view, which is if you do something five times in a week, you should be writing an automation for it. And if you’re a doctor, you’re probably reviewing your notes, writing the prescriptions, and so on several times a day. So those are things that can clearly be automated, and the human can be in the loop. But I think there are so many other ways just within the clinic that things can help. So, one of my friends, my friend from my junior school (I’ve known him since I was 9), is an oncologist who’s also deeply into machine learning, and he’s in Cambridge in the UK. And he built with Microsoft Research a suite of imaging AI tools from his own discipline, which they then open sourced. So that’s another way that you have an impact, which is that you actually enable the, you know, generalist, specialist, polymath, whatever they are in health systems to be able to get this technology, to tune it to their requirements, to use it, to encourage some grassroots adoption in a system that’s often been very, very heavily centralized. LEE: Yeah. AZHAR: And then I think there are some other things that are going on that I find really, really exciting. 
So one is the consumerization of healthcare. So I have one of those sleep tracking rings, the Oura. LEE: Yup. AZHAR: That is building a data stream that we’ll be able to apply more and more AI to. I mean, right now, it’s applying traditional, I suspect, machine learning, but you can imagine that as we start to get more data, we start to get more used to measuring ourselves, we create this sort of pot, a personal asset that we can turn AI to. And there’s still another category. And that other category is one of the completely novel ways in which we can enable patient care and patient pathway. And there’s a fantastic startup in the UK called Neko Health, which, I mean, does physicals, MRI scans, and blood tests, and so on. It’s hard to imagine Neko existing without the sort of advanced data, machine learning, AI that we’ve seen emerge over the last decade. So, I mean, I think that there are so many ways in which the temperature is slowly being turned up to encourage a phase change within the healthcare sector. And last but not least, I do think that these tools can also be very, very supportive of a clinician’s life cycle. I think we, as patients, we’re a bit … I don’t know if we’re as grateful as we should be for our clinicians who are putting in 90-hour weeks. [LAUGHTER] But you can imagine a world where AI is able to support not just the clinicians’ workload but also their sense of stress, their sense of burnout. So just in those five areas, Peter, I sort of imagine we could start to fundamentally transform over the course of many years, of course, the way in which people think about their health and their interactions with healthcare systems. LEE: I love how you break that down. And I want to press on a couple of things. You also touched on the fact that medicine is, at least in most of the world, is a highly regulated industry. 
I guess finance is the same way, but they also feel different because the, like, finance sector has to be very responsive to consumers, and consumers are sensitive to, you know, an abundance of choice; they are sensitive to price. Is there something unique about medicine besides being regulated? AZHAR: I mean, there absolutely is. And in finance, as well, you have much clearer end states. So if you’re not in the consumer space, but you’re in the, you know, asset management space, you have to essentially deliver returns against the volatility or risk boundary, right. That’s what you have to go out and do. And I think if you’re in the consumer industry, you can come back to very, very clear measures, net promoter score being a very good example. In the case of medicine and healthcare, it is much more complicated because as far as the clinician is concerned, people are individuals, and we have our own parts and our own responses. If we didn’t, there would never be a need for a differential diagnosis. There’d never be a need for, you know, Let’s try azithromycin first, and then if that doesn’t work, we’ll go to vancomycin, or, you know, whatever it happens to be. You would just know. But ultimately, you know, people are quite different. The symptoms that they’re showing are quite different, and also their compliance is really, really different. I had a back problem that had to be dealt with by, you know, a physio and extremely boring exercises four times a week, but I was ruthless in complying, and my physio was incredibly surprised. He’d say well no one ever does this, and I said, well you know the thing is that I kind of just want to get this thing to go away. LEE: Yeah. AZHAR: And I think that that’s why medicine is and healthcare is so different and more complex. But I also think that’s why AI can be really, really helpful. 
I mean, we didn’t talk about, you know, AI in its ability to potentially do this, which is to extend the clinician’s presence throughout the week. LEE: Right. Yeah. AZHAR: The idea that maybe some part of what the clinician would do if you could talk to them on Wednesday, Thursday, and Friday could be delivered through an app or a chatbot just as a way of encouraging the compliance, which is often, especially with older patients, one reason why conditions, you know, linger on for longer. LEE: You know, just staying on the regulatory thing, as I’ve thought about this, the one regulated sector that I think seems to have some parallels to healthcare is energy delivery, energy distribution. Because like healthcare, as a consumer, I don’t have choice in who delivers electricity to my house. And even though I care about it being cheap or at least not being overcharged, I don’t have an abundance of choice. I can’t do price comparisons. And there’s something about that, just speaking as a consumer of both energy and a consumer of healthcare, that feels similar. Whereas other regulated industries, you know, somehow, as a consumer, I feel like I have a lot more direct influence and power. Does that make any sense to someone, you know, like you, who’s really much more expert in how economic systems work? AZHAR: I mean, in a sense, one part of that is very, very true. You have a limited panel of energy providers you can go to, and in the US, there may be places where you have no choice. I think the area where it’s slightly different is that as a consumer or a patient, you can actually make meaningful choices and changes yourself using these technologies, and people used to joke about you know asking Dr. Google. But Dr. Google is not terrible, particularly if you go to WebMD. 
And, you know, when I look at long-range change, many of the regulations that exist around healthcare delivery were formed at a point before people had access to good quality information at the touch of their fingertips or when educational levels in general were much, much lower. And many regulations existed because of the incumbent power of particular professional sectors. I’ll give you an example from the United Kingdom. So I have had asthma all of my life. That means I’ve been taking my inhaler, Ventolin, and maybe a steroid inhaler for nearly 50 years. That means that I know … actually, I’ve got more experience, and I—in some sense—know more about it than a general practitioner. LEE: Yeah. AZHAR: And until a few years ago, I would have to go to a general practitioner to get this drug that I’ve been taking for five decades, and there they are, age 30 or whatever it is. And a few years ago, the regulations changed. And now pharmacies can … or pharmacists can prescribe those types of drugs under certain conditions directly. LEE: Right. AZHAR: That was not to do with technology. That was to do with incumbent lock-in. So when we look at the medical industry, the healthcare space, there are some parallels with energy, but there are a few little things that the ability that the consumer has to put in some effort to learn about their condition, but also the fact that some of the regulations that exist just exist because certain professions are powerful. LEE: Yeah, one last question while we’re still on economics. There seems to be a conundrum about productivity and efficiency in healthcare delivery because I’ve never encountered a doctor or a nurse that wants to be able to handle even more patients than they’re doing on a daily basis. And so, you know, if productivity means simply, well, your rounds can now handle 16 patients instead of eight patients, that doesn’t seem necessarily to be a desirable thing. 
So how can we or should we be thinking about efficiency and productivity since obviously costs are, in most of the developed world, are a huge, huge problem? AZHAR: Yes, and when you described doubling the number of patients on the round, I imagined you buying them all roller skates so they could just whizz around [LAUGHTER] the hospital faster and faster than ever before. We can learn from what happened with the introduction of electricity. Electricity emerged at the end of the 19th century, around the same time that cars were emerging as a product, and car makers were very small and very artisanal. And in the early 1900s, some really smart car makers figured out that electricity was going to be important. And they bought into this technology by putting pendant lights in their workshops so they could “visit more patients.” Right? LEE: Yeah, yeah. AZHAR: They could effectively spend more hours working, and that was a productivity enhancement, and it was noticeable. But, of course, electricity fundamentally changed the productivity by orders of magnitude of people who made cars starting with Henry Ford because he was able to reorganize his factories around the electrical delivery of power and to therefore have the moving assembly line, which 10xed the productivity of that system. So when we think about how AI will affect the clinician, the nurse, the doctor, it’s much easier for us to imagine it as the pendant light that just has them working later … LEE: Right. AZHAR: … than it is to imagine a reconceptualization of the relationship between the clinician and the people they care for. And I’m not sure. I don’t think anybody knows what that looks like. But, you know, I do think that there will be a way that this changes, and you can see that scale out factor. 
And it may be, Peter, that what we end up doing is we end up saying, OK, because we have these brilliant AIs, there’s a lower level of training and cost and expense that’s required for a broader range of conditions that need treating. And that expands the market, right. That expands the market hugely. It’s what has happened in the market for taxis or ride sharing. The introduction of Uber and the GPS system … LEE: Yup. AZHAR: … has meant many more people now earn their living driving people around in their cars. And at least in London, you had to be reasonably highly trained to do that. So I can see a reorganization is possible. Of course, entrenched interests, the economic flow … and there are many entrenched interests, particularly in the US between the health systems and the, you know, professional bodies that might slow things down. But I think a reimagining is possible. And if I may, I’ll give you one example of that, which is, if you go to countries outside of the US where there are many more sick people per doctor, they have incentives to change the way they deliver their healthcare. And well before there was AI of this quality around, there were a few cases of health systems in India: Aravind Eye Care was one, and Narayana Hrudayalaya [now known as Narayana Health] was another. And in the latter, they were a cardiac care unit where you couldn’t get enough heart surgeons. LEE: Yeah, yep. AZHAR: So specially trained nurses would operate under the supervision of a single surgeon who would supervise many in parallel. So there are ways of increasing the quality of care, reducing the cost, but it does require a systems change. And we can’t expect a single bright algorithm to do it on its own. LEE: Yeah, really, really interesting. So now let’s get into regulation. And let me start with this question. 
You know, there are several startup companies I’m aware of that are pushing on, I think, a near-term future possibility that a medical AI for consumer might be allowed, say, to prescribe a medication for you, something that would normally require a doctor or a pharmacist, you know, that is certified in some way, licensed to do. Do you think we’ll get to a point where for certain regulated activities, humans are more or less cut out of the loop? AZHAR: Well, humans would have been in the loop because they would have provided the training data, they would have done the oversight, the quality control. But to your question in general, would we delegate an important decision entirely to a tested set of algorithms? I’m sure we will. We already do that. I delegate less important decisions like, What time should I leave for the airport to Waze. I delegate more important decisions to the automated braking in my car. We will do this at certain levels of risk and threshold. If I come back to my example of prescribing Ventolin. It’s really unclear to me that the prescription of Ventolin, this incredibly benign bronchodilator that is only used by people who’ve been through the asthma process, needs to be prescribed by someone who’s gone through 10 years or 12 years of medical training. And why that couldn’t be prescribed by an algorithm or an AI system. LEE: Right. Yep. Yep. AZHAR: So, you know, I absolutely think that that will be the case and could be the case. I can’t really see what the objections are. And the real issue is where do you draw the line of where you say, “Listen, this is too important,” or “The cost is too great,” or “The side effects are too high,” and therefore this is a point at which we want to have some, you know, human taking personal responsibility, having a liability framework in place, having a sense that there is a person with legal agency who signed off on this decision. 
And that line I suspect will start fairly low, and what we’d expect to see would be that that would rise progressively over time. LEE: What you just said, that scenario of your personal asthma medication, is really interesting because your personal AI might have the benefit of 50 years of your own experience with that medication. So, in a way, there is at least the data potential for, let’s say, the next prescription to be more personalized and more tailored specifically for you. AZHAR: Yes. Well, let’s dig into this because I think this is super interesting, and we can look at how things have changed. So 15 years ago, if I had a bad asthma attack, which I might have once a year, I would have needed to go and see my general physician. In the UK, it’s very difficult to get an appointment. I would have had to see someone privately who didn’t know me at all because I’ve just walked in off the street, and I would explain my situation. It would take me half a day. Productivity lost. I’ve been miserable for a couple of days with severe wheezing. Then a few years ago the system changed, a protocol changed, and now I have a thing called a rescue pack, which includes prednisolone steroids. It includes something else I’ve just forgotten, and an antibiotic in case I get an upper respiratory tract infection, and I have an “algorithm.” It’s called a protocol. It’s printed out. It’s a flowchart. I answer various questions, and then I say, “I’m going to prescribe this to myself.” You know, UK doctors don’t prescribe prednisolone, or prednisone as you may call it in the US, at the drop of a hat, right. It’s a powerful steroid. I can self-administer, and I can now get that repeat prescription without seeing a physician a couple of times a year. And the algorithm, the “AI” is, it’s obviously been done in PowerPoint naturally, and it’s a bunch of arrows. 
[LAUGHS] Surely, surely, an AI system is going to be more sophisticated, more nuanced, and give me more assurance that I’m making the right decision around something like that. LEE: Yeah. Well, at a minimum, the AI should be able to make that PowerPoint the next time. [LAUGHS] AZHAR: Yeah, yeah. Thank god for Clippy. Yes. LEE: So, you know, I think in our book, we had a lot of certainty about most of the things we’ve discussed here, but one chapter where I felt we really sort of ran out of ideas, frankly, was on regulation. And, you know, what we ended up doing for that chapter is … I can’t remember if it was Carey’s or Zak’s idea, but we asked GPT-4 to have a conversation, a debate with itself [LAUGHS], about regulation. And we made some minor commentary on that. And really, I think we took that approach because we just didn’t have much to offer. By the way, in our defense, I don’t think anyone else had any better ideas anyway. AZHAR: Right. LEE: And so now two years later, do we have better ideas about the need for regulation, the frameworks around which those regulations should be developed, and, you know, what should this look like? AZHAR: So regulation is going to be in some cases very helpful because it provides certainty for the clinician that they’re doing the right thing, that they are still insured for what they’re doing, and it provides some degree of confidence for the patient. And we need to make sure that the claims that are made stand up to quite rigorous levels, where ideally there are RCTs [randomized controlled trials], and there are the classic set of processes you go through. You do also want to be able to experiment, and so the question is: as a regulator, how can you enable conditions for there to be experimentation? And what is experimentation? Experimentation is learning so that every element of the system can learn from this experience. So finding that space where there can be a bit of experimentation, I think, becomes very, very important. 
And a lot of this is about experience, so I think the first digital therapeutics have received FDA approval, which means there are now people within the FDA who understand how you go about running an approvals process for that, and what that ends up looking like—and of course what we’re very good at doing in this sort of modern hyper-connected world—is we can share that expertise, that knowledge, that experience very, very quickly. So you go from one approval a year to a hundred approvals a year to a thousand approvals a year. So we will then actually, I suspect, need to think about what is it to approve digital therapeutics because, unlike big biological molecules, we can generate these digital therapeutics at the rate of knots [very rapidly]. LEE: Yes. AZHAR: Every road in Hayes Valley in San Francisco, right, is churning out new startups who will want to do things like this. So then, I think about, what does it mean to get approved if indeed it gets approved? But we can also go really far with things that don’t require approval. I come back to my sleep tracking ring. So I’ve been wearing this for a few years, and when I go and see my doctor or I have my annual checkup, one of the first things that he asks is how have I been sleeping. And in fact, I even sync my sleep tracking data to their medical record system, so he’s saying … hearing what I’m saying, but he’s actually pulling up the real data going, This patient’s lying to me again. Of course, I’m very truthful with my doctor, as we should all be. [LAUGHTER] LEE: You know, actually, that brings up a point that consumer-facing health AI has to deal with pop science, bad science, you know, weird stuff that you hear on Reddit. And because one of the things that consumers want to know always is, you know, what’s the truth? AZHAR: Right. LEE: What can I rely on? And I think that somehow feels different than an AI that you actually put in the hands of, let’s say, a licensed practitioner. 
And so the regulatory issues seem very, very different for these two cases somehow. AZHAR: I agree, they’re very different. And I think for a lot of areas, you will want to build AI systems that are first and foremost for the clinician, even if they have patient extensions, that idea that the clinician can still be with a patient during the week. And you’ll do that anyway because you need the data, and you also need a little bit of a liability shield to have like a sensible person who’s been trained around that. And I think that’s going to be a very important pathway for many AI medical crossovers. We’re going to go through the clinician. LEE: Yeah. AZHAR: But I also do recognize what you say about the, kind of, kooky quackery that exists on Reddit. Although on Creatine, Reddit may yet prove to have been right. [LAUGHTER] LEE: Yeah, that’s right. Yes, yeah, absolutely. Yeah. AZHAR: Sometimes it’s right. And I think that it serves a really good role as a field of extreme experimentation. So if you’re somebody who makes a continuous glucose monitor traditionally given to diabetics but now lots of people will wear them—and sports people will wear them—you probably gathered a lot of extreme tail distribution data by reading the Reddit/biohackers … LEE: Yes. AZHAR: … for the last few years, where people were doing things that you would never want them to really do with the CGM [continuous glucose monitor]. And so I think we shouldn’t understate how important that petri dish can be for helping us learn what could happen next. LEE: Oh, I think it’s absolutely going to be essential and a bigger thing in the future. So I think I just want to close here then with one last question. And I always try to be a little bit provocative with this. And so as you look ahead to what doctors and nurses and patients might be doing two years from now, five years from now, 10 years from now, do you have any kind of firm predictions? 
AZHAR: I’m going to push the boat out, and I’m going to go further out than closer in. LEE: OK. [LAUGHS] AZHAR: As patients, we will have many, many more touch points and interaction with our biomarkers and our health. We’ll be reading how well we feel through an array of things. And some of them we’ll be wearing directly, like sleep trackers and watches. And so we’ll have a better sense of what’s happening in our lives. It’s like the moment you go from paper bank statements that arrive every month to being able to see your account in real time. LEE: Yes. AZHAR: And I suspect we’ll have … we’ll still have interactions with clinicians because societies that get richer see doctors more, societies that get older see doctors more, and we’re going to be doing both of those over the coming 10 years. But there will be a sense, I think, of continuous health engagement, not in an overbearing way, but just in a sense that we know it’s there, we can check in with it, it’s likely to be data that is compiled on our behalf somewhere centrally and delivered through a user experience that reinforces agency rather than anxiety. And we’re learning how to do that slowly. I don’t think the health apps on our phones and devices have yet quite got that right. And that could help us personalize problems before they arise, and again, I use my experience for things that I’ve tracked really, really well. And I know from my data and from how I’m feeling when I’m on the verge of one of those severe asthma attacks that hits me once a year, and I can take a little bit of preemptive measure, so I think that that will become progressively more common and that sense that we will know our baselines. I mean, when you think about being an athlete, which is something I think about, but I could never ever do, [LAUGHTER] but what happens is you start with your detailed baselines, and that’s what your health coach looks at every three or four months. For most of us, we have no idea of our baselines. 
We get our blood pressure measured once a year. We will have baselines, and that will help us on an ongoing basis to better understand and be in control of our health. And then if the product designers get it right, it will be done in a way that doesn’t feel invasive, but it’ll be done in a way that feels enabling. We’ll still be engaging with clinicians augmented by AI systems more and more because they will also have gone up the stack. They won’t be spending their time on just “take two Tylenol and have a lie down” type of engagements because that will be dealt with earlier on in the system. And so we will be there in a very, very different set of relationships. And they will feel that they have different ways of looking after our health. LEE: Azeem, it’s so comforting to hear such a wonderfully optimistic picture of the future of healthcare. And I actually agree with everything you’ve said. Let me just thank you again for joining this conversation. I think it’s been really fascinating. And I think somehow the systemic issues, the systemic issues that you tend to just see with such clarity, I think are going to be the most, kind of, profound drivers of change in the future. So thank you so much. AZHAR: Well, thank you, it’s been my pleasure, Peter, thank you. [TRANSITION MUSIC] I always think of Azeem as a systems thinker. He’s always able to take the experiences of new technologies at an individual level and then project out to what this could mean for whole organizations and whole societies. In our conversation, I felt that Azeem really connected some of what we learned in a previous episode (for example, from Chrissy Farr) on the evolving consumerization of healthcare to the broader workforce and economic impacts that we’ve heard about from Ethan Mollick. Azeem’s personal story about managing his asthma was also a great example. 
You know, he imagines a future, as do I, where personal AI might assist and remember decades of personal experience with a condition like asthma and thereby know more than any human being could possibly know in a deeply personalized and effective way, leading to better care. Azeem’s relentless optimism about our AI future was also so heartening to hear. Both of these conversations leave me really optimistic about the future of AI in medicine. At the same time, it is pretty sobering to realize just how much we’ll all need to change in pretty fundamental and maybe even in radical ways. I think a big insight I got from these conversations is how we interact with machines is going to have to be altered not only at the individual level, but at the company level and maybe even at the societal level. Since my conversation with Ethan and Azeem, there have been some pretty important developments that speak directly to this. Just last week at Build, which is Microsoft’s yearly developer conference, we announced a slew of AI agent technologies. Our CEO, Satya Nadella, in fact, started his keynote by going online in a GitHub developer environment and then assigning a coding task to an AI agent, basically treating that AI as a full-fledged member of a development team. Other agents, for example, a meeting facilitator, a data analyst, a business researcher, travel agent, and more were also shown during the conference. But pertinent to healthcare specifically, what really blew me away was the demonstration of a healthcare orchestrator agent. And the specific thing here was in Stanford’s cancer treatment center, when they are trying to decide on potentially experimental treatments for cancer patients, they convene a meeting of experts. That is typically called a tumor board. 
And so this AI healthcare orchestrator agent actually participated as a full-fledged member of a tumor board meeting to help bring data together, make sure that the latest medical knowledge was brought to bear, and to assist in the decision-making around a patient’s cancer treatment. It was pretty amazing. [THEME MUSIC] A big thank-you again to Ethan and Azeem for sharing their knowledge and understanding of the dynamics between AI and society more broadly. And to our listeners, thank you for joining us. I’m really excited for the upcoming episodes, including discussions on medical students’ experiences with AI and AI’s influence on the operation of health systems and public health departments. We hope you’ll continue to tune in. Until next time. [MUSIC FADES]

Microsoft Academic @MicrosoftAcademic shared a link
2025-05-27 17:25:54

FrodoKEM: A conservative quantum-safe cryptographic algorithm

In this post, we describe FrodoKEM, a key encapsulation protocol that offers a simple design and provides strong security guarantees even in a future with powerful quantum computers.
The quantum threat to cryptography
For decades, modern cryptography has relied on mathematical problems that are practically impossible for classical computers to solve without a secret key. Cryptosystems like RSA, Diffie-Hellman key exchange, and elliptic curve-based schemes, which rely on the hardness of the integer factorization and discrete logarithm problems, secure communications on the internet, banking transactions, and even national security systems. However, the emergence of quantum computers threatens these schemes.
Quantum computers leverage the principles of quantum mechanics to perform certain calculations exponentially faster than classical computers. Their ability to solve complex problems, such as simulating molecular interactions, optimizing large-scale systems, and accelerating machine learning, is expected to have profound and beneficial implications for fields ranging from chemistry and material science to artificial intelligence.

At the same time, quantum computing is poised to disrupt cryptography. In particular, Shor’s algorithm, a quantum algorithm developed in 1994, can efficiently factor large numbers and compute discrete logarithms—the very problems that underpin the security of RSA, Diffie-Hellman, and elliptic curve cryptography. This means that once large-scale, fault-tolerant quantum computers become available, public-key protocols based on RSA, ECC, and Diffie-Hellman will become insecure, breaking a sizable portion of the cryptographic backbone of today’s digital world. Recent advances in quantum computing, such as Microsoft’s Majorana 1, the first quantum processor powered by topological qubits, represent major steps toward practical quantum computing and underscore the urgency of transitioning to quantum-resistant cryptographic systems.
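To make the threat concrete, here is a toy sketch (not from the original post) of why an efficient factoring algorithm breaks RSA. The tiny primes, the exponent, and the helper names `private_exponent` and `factor` are illustrative assumptions. Real moduli are thousands of bits long, and the brute-force loop below only stands in for what Shor's algorithm could do efficiently on a large, fault-tolerant quantum computer.

```python
# Toy RSA with tiny, insecure parameters (illustrative assumptions only).
p, q = 61, 53              # secret primes
n = p * q                  # public modulus
e = 17                     # public exponent

def private_exponent(p, q, e):
    """Anyone who knows the factors of n can derive the private key."""
    phi = (p - 1) * (q - 1)
    return pow(e, -1, phi)   # modular inverse (Python 3.8+)

def factor(n):
    """Brute-force trial division; only feasible because n is tiny."""
    f = 2
    while n % f:
        f += 1
    return f, n // f

message = 65
ciphertext = pow(message, e, n)

# The attacker sees only (n, e, ciphertext), factors n, and decrypts.
p_found, q_found = factor(n)
d = private_exponent(p_found, q_found, e)
recovered = pow(ciphertext, d, n)
```

The only hard step for the attacker is `factor`. Everything after it is ordinary modular arithmetic, which is why a fast factoring method is enough to recover the plaintext.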
To address this looming security crisis, cryptographers and government agencies have been working on post-quantum cryptography—new cryptographic algorithms that can resist attacks from both classical and quantum computers.
The NIST Post-Quantum Cryptography Standardization effort
In 2017, the U.S. National Institute of Standards and Technology (NIST) launched the Post-Quantum Cryptography (PQC) Standardization project to evaluate and select cryptographic algorithms capable of withstanding quantum attacks. As part of this initiative, NIST sought proposals for two types of cryptographic primitives: key encapsulation mechanisms (KEMs), which enable two parties to securely derive a shared key to establish an encrypted connection, similar to traditional key exchange schemes, and digital signature schemes.
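The shape of a KEM can be sketched as three functions: key generation, encapsulation, and decapsulation. The example below is a minimal illustration built on classical Diffie-Hellman, so it is neither secure at these parameters nor quantum-resistant. The names `keygen`, `encaps`, `decaps`, and `_kdf`, and the constants `P` and `G`, are assumptions chosen for this sketch and do not come from any standard or library.

```python
import hashlib
import secrets

# Toy parameters (assumptions for illustration; NOT secure, NOT post-quantum).
P = 2**127 - 1   # a Mersenne prime used as the group modulus
G = 3            # base element

def _kdf(shared: int) -> bytes:
    """Hash the raw shared value down to a fixed-length symmetric key."""
    return hashlib.sha256(shared.to_bytes(16, "big")).digest()

def keygen():
    """Receiver: create a (public key, secret key) pair."""
    sk = secrets.randbelow(P - 2) + 1
    pk = pow(G, sk, P)
    return pk, sk

def encaps(pk):
    """Sender: produce a ciphertext plus the shared key it encapsulates."""
    r = secrets.randbelow(P - 2) + 1
    ciphertext = pow(G, r, P)
    key = _kdf(pow(pk, r, P))
    return ciphertext, key

def decaps(sk, ciphertext):
    """Receiver: recover the same shared key from the ciphertext."""
    return _kdf(pow(ciphertext, sk, P))

pk, sk = keygen()
ct, key_sender = encaps(pk)
key_receiver = decaps(sk, ct)
```

Both parties end up with the same 32-byte key, which they could then use with a symmetric cipher. Post-quantum KEMs such as ML-KEM and FrodoKEM expose this same three-function interface while replacing the underlying mathematics.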
This initiative attracted submissions from cryptographers worldwide, and after multiple evaluation rounds, NIST selected CRYSTALS-Kyber, a KEM based on structured lattices, and standardized it as ML-KEM. Additionally, NIST selected three digital signature schemes: CRYSTALS-Dilithium, now called ML-DSA; SPHINCS+, now called SLH-DSA; and Falcon, now called FN-DSA.
While ML-KEM provides great overall security and efficiency, some governments and cryptographic researchers advocate for the inclusion and standardization of alternative algorithms that minimize reliance on algebraic structure. Reducing algebraic structure might prevent potential vulnerabilities and, hence, can be considered a more conservative design choice. One such algorithm is FrodoKEM.
International standardization of post-quantum cryptography
Beyond NIST, other international standardization bodies have been actively working on quantum-resistant cryptographic solutions. The International Organization for Standardization (ISO) is leading a global effort to standardize additional PQC algorithms. Notably, European government agencies, including Germany’s BSI, the Netherlands’ NLNCSA and AIVD, and France’s ANSSI, have shown strong support for FrodoKEM, recognizing it as a conservative alternative to structured lattice-based schemes.
As a result, FrodoKEM is undergoing standardization at ISO. Additionally, ISO is standardizing ML-KEM and a conservative code-based KEM called Classic McEliece. These three algorithms are planned for inclusion in ISO/IEC 18033-2:2006 as Amendment 2.
What is FrodoKEM?
FrodoKEM is a key encapsulation mechanism (KEM) based on the Learning with Errors (LWE) problem, a cornerstone of lattice-based cryptography. Unlike structured lattice-based schemes such as ML-KEM, FrodoKEM is built on generic, unstructured lattices, i.e., it is based on the plain LWE problem.
Why unstructured lattices?
Structured lattice-based schemes introduce additional algebraic properties that could potentially be exploited in future cryptanalytic attacks. By using unstructured lattices, FrodoKEM eliminates these concerns, making it a safer choice in the long run, albeit at the cost of larger key sizes and lower efficiency.
It is important to emphasize that no particular cryptanalytic weaknesses are currently known for recommended parameterizations of structured lattice schemes in comparison to plain LWE. However, our current understanding of the security of these schemes could potentially change in the future with cryptanalytic advances.
Lattices and the Learning with Errors (LWE) problem
Lattice-based cryptography relies on the mathematical structure of lattices, which are regular arrangements of points in multidimensional space. A lattice is defined as the set of all integer linear combinations of a set of basis vectors. The difficulty of certain computational problems on lattices, such as the Shortest Vector Problem (SVP) and the Learning with Errors (LWE) problem, forms the basis of lattice-based schemes.
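In symbols, and only as a notational sketch (the names \(\text{B}\), \(\text{b}_i\), \(z_i\), and \(k\) are introduced here for illustration and do not appear in the original post), the lattice generated by basis vectors \(\text{b}_1, \dots, \text{b}_k\) can be written as:

\[ \mathcal{L}(\text{B}) = \left\{ \sum_{i=1}^{k} z_i \, \text{b}_i \;:\; z_i \in \mathbb{Z} \right\} \]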
The Learning with Errors (LWE) problem
The LWE problem is a fundamental hard problem in lattice-based cryptography. It involves solving a system of linear equations where some small random error has been added to each equation, making it extremely difficult to recover the original secret values. This added error ensures that the problem remains computationally infeasible, even for quantum computers. Figure 1 below illustrates the LWE problem, specifically, the search version of the problem.
As can be seen in Figure 1, for the setup of the problem we need a dimension \(n\) that defines the size of matrices, a modulus \(q\) that defines the value range of the matrix coefficients, and a certain error distribution \(\chi\) from which we sample “small” matrices. We sample two matrices from \(\chi\), a small matrix \(\text{s}\) and an error matrix \(\text{e}\) (for simplicity in the explanation, we assume that both have only one column); sample an \(n \times n\) matrix \(\text{A}\) uniformly at random; and compute \(\text{b} = \text{A} \times \text{s} + \text{e}\). In the illustration, each matrix coefficient is represented by a colored square, and the “legend of coefficients” gives an idea of the size of the respective coefficients, e.g., orange squares represent the small coefficients of matrix \(\text{s}\) (small relative to the modulus \(q\)). Finally, given \(\text{A}\) and \(\text{b}\), the search LWE problem consists in finding \(\text{s}\). This problem is believed to be hard for suitably chosen parameters (e.g., for dimension \(n\) sufficiently large) and is used at the core of FrodoKEM.
In comparison, the LWE variant used in ML-KEM, called Module-LWE (M-LWE), has additional symmetries, adding mathematical structure that helps improve efficiency. In a setting similar to that of the search LWE problem above, the matrix \(\text{A}\) can be represented by just a single row of coefficients.
FIGURE 1: Visualization of the (search) LWE problem.
LWE is conjectured to be quantum-resistant, and FrodoKEM’s security is directly tied to its hardness. In other words, cryptanalysts and quantum researchers have not been able to devise an efficient quantum algorithm capable of solving the LWE problem and, hence, of breaking FrodoKEM. In cryptography, absolute security can never be guaranteed; instead, confidence in a problem’s hardness comes from extensive scrutiny and its resilience against attacks over time.
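To make the search LWE setup concrete, here is a minimal Python sketch that samples a toy instance of the form b = A × s + e (mod q). It is an illustration under stated assumptions, not FrodoKEM's sampling procedure: the function names, the tiny parameters, and the uniform "small" distribution on [-bound, bound] (a stand-in for the error distribution) are all hypothetical.

```python
import random

def lwe_instance(n, q, bound, rng):
    """Sample a toy search-LWE instance (A, b) together with its hidden (s, e)."""
    s = [rng.randint(-bound, bound) for _ in range(n)]            # small secret
    e = [rng.randint(-bound, bound) for _ in range(n)]            # small error
    A = [[rng.randrange(q) for _ in range(n)] for _ in range(n)]  # uniform n x n matrix
    # b = A * s + e (mod q), computed row by row
    b = [(sum(a * x for a, x in zip(row, s)) + err) % q for row, err in zip(A, e)]
    return A, b, s, e

def centered(v, q):
    """Map v mod q to its representative in (-q/2, q/2]."""
    v %= q
    return v - q if v > q // 2 else v

def residual(A, b, s, q):
    """Return b - A * s (mod q), centered around zero."""
    return [centered(bi - sum(a * x for a, x in zip(row, s)), q)
            for row, bi in zip(A, b)]

# Toy parameters, far too small to offer any security.
q = 2 ** 15
A, b, s, e = lwe_instance(n=8, q=q, bound=2, rng=random.Random(1))
print(residual(A, b, s, q) == e)  # True: knowing s exposes the small error e
```

An attacker sees only A and b. Without the error term, s could be recovered with ordinary linear algebra; the added noise is what removes that shortcut.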
How FrodoKEM Works
FrodoKEM follows the standard paradigm of a KEM, which consists of three main operations—key generation, encapsulation, and decapsulation—performed interactively between a sender and a recipient with the goal of establishing a shared secret key:

Key generation, computed by the recipient

Generates a public key and a secret key.
The public key is sent to the sender, while the secret key is kept private by the recipient.

Encapsulation, computed by the sender

Generates a random session key.
Encrypts the session key using the recipient’s public key to produce a ciphertext.
Produces a shared key using the session key and the ciphertext.
The ciphertext is sent to the recipient.

Decapsulation, computed by the recipient

Decrypts the ciphertext using their secret key to recover the original session key.
Reproduces the shared key using the decrypted session key and the ciphertext.

The shared key generated by the sender and reconstructed by the recipient can then be used to establish secure symmetric-key encryption for further communication between the two parties.
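The three operations above can be sketched in Python with a toy LWE-based encryption scheme. This is a hedged illustration only, not the FrodoKEM specification: the parameters are far too small to be secure, the "small" distribution is simply {-1, 0, 1}, and the sketch omits steps a real KEM needs (for example seed expansion, proper error sampling, packing, and the re-encryption check used for chosen-ciphertext security). All names are hypothetical.

```python
import hashlib
import secrets

# Toy parameters (assumptions for illustration, NOT secure).
N, NBAR, Q = 16, 4, 1 << 12   # dimension, small matrix width, power-of-two modulus

def small(rows, cols):
    """Matrix with entries in {-1, 0, 1}, standing in for the error distribution."""
    return [[secrets.randbelow(3) - 1 for _ in range(cols)] for _ in range(rows)]

def uniform(rows, cols):
    return [[secrets.randbelow(Q) for _ in range(cols)] for _ in range(rows)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) % Q for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[(x + y) % Q for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    return [[(x - y) % Q for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def encode(bits):
    """Place one bit per matrix entry, as 0 or Q/2."""
    return [[bits[NBAR * i + j] * (Q // 2) for j in range(NBAR)] for i in range(NBAR)]

def decode(M):
    """Round each entry to the nearer of 0 and Q/2."""
    return [1 if Q // 4 <= v < 3 * Q // 4 else 0 for row in M for v in row]

def shared_key(bits, ct):
    """Derive the shared key from the session key and the ciphertext."""
    return hashlib.sha256(repr((bits, ct)).encode()).digest()

def keygen():
    A, S, E = uniform(N, N), small(N, NBAR), small(N, NBAR)
    B = add(matmul(A, S), E)                 # LWE operation: B = A*S + E
    return (A, B), S                         # public key, secret key

def encapsulate(pk):
    A, B = pk
    bits = [secrets.randbelow(2) for _ in range(NBAR * NBAR)]   # random session key
    S2, E2, E3 = small(NBAR, N), small(NBAR, N), small(NBAR, NBAR)
    C1 = add(matmul(S2, A), E2)                                 # C1 = S'*A + E'
    C2 = add(add(matmul(S2, B), E3), encode(bits))              # C2 = S'*B + E'' + Encode(key)
    ct = (C1, C2)
    return ct, shared_key(bits, ct)

def decapsulate(sk, ct):
    C1, C2 = ct
    bits = decode(sub(C2, matmul(C1, sk)))   # C2 - C1*S = Encode(key) + small noise
    return shared_key(bits, ct)

pk, sk = keygen()                  # recipient
ct, key_sender = encapsulate(pk)   # sender
key_recipient = decapsulate(sk, ct)
print(key_sender == key_recipient)
```

With these toy parameters the accumulated noise is at most 2·16 + 1 = 33 in absolute value, well below the decoding threshold Q/4 = 1024, so decryption always recovers the session key. Real parameter sets balance this margin against security.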
Figure 2 below shows a simplified view of the FrodoKEM protocol. As highlighted in red, FrodoKEM uses at its core LWE operations of the form “\(\text{b} = \text{A} \times \text{s} + \text{e}\)”, which are directly applied within the KEM paradigm.
FIGURE 2: Simplified overview of FrodoKEM.
Performance: Strong security has a cost
Not relying on additional algebraic structure certainly comes at a cost for FrodoKEM in the form of increased protocol runtime and bandwidth. The table below compares the performance and key sizes corresponding to the FrodoKEM level 1 parameter set (variant called “FrodoKEM-640-AES”) and the respective parameter set of ML-KEM (variant called “ML-KEM-512”). These parameter sets are intended to match or exceed the brute force security of AES-128. As can be seen, the difference in speed and key sizes between FrodoKEM and ML-KEM is more than an order of magnitude. Nevertheless, the runtime of the FrodoKEM protocol remains reasonable for most applications. For example, on our benchmarking platform clocked at 3.2 GHz, the measured runtimes are 0.97 ms, 1.9 ms, and 3.2 ms for security levels 1, 2, and 3, respectively.
For security-sensitive applications, a more relevant comparison is with Classic McEliece, a post-quantum code-based scheme also considered for standardization. In this case, FrodoKEM offers several efficiency advantages. Classic McEliece’s public keys are significantly larger—well over an order of magnitude greater than FrodoKEM’s—and its key generation is substantially more computationally expensive. Nonetheless, Classic McEliece provides an advantage in certain static key-exchange scenarios, where its high key generation cost can be amortized across multiple key encapsulation executions.
TABLE 1: Comparison of key sizes and performance on an x86-64 processor for NIST level 1 parameter sets.
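To make the "more than an order of magnitude" gap in key sizes concrete, the following sketch recomputes public key sizes from scheme parameters. The parameter values (n = 640, 8 columns, 15-bit coefficients, and a 16-byte seed for FrodoKEM-640; k = 2 with 12-bit coefficients and a 32-byte seed for ML-KEM-512) are assumptions taken from the public specifications as commonly cited, not from this post, and the function names are hypothetical.

```python
def frodo_public_key_bytes(n, nbar, log2_q, seed_bytes):
    """Seed for A plus the n x nbar matrix B packed at log2_q bits per coefficient."""
    return seed_bytes + (n * nbar * log2_q) // 8

def mlkem_public_key_bytes(k):
    """k polynomials of 256 coefficients at 12 bits each, plus a 32-byte seed."""
    return k * 256 * 12 // 8 + 32

frodo = frodo_public_key_bytes(n=640, nbar=8, log2_q=15, seed_bytes=16)
mlkem = mlkem_public_key_bytes(k=2)
print(frodo, mlkem, round(frodo / mlkem, 1))  # 9616 800 12.0
```

The unstructured matrix B accounts for almost all of the FrodoKEM public key, which is where the cost of avoiding algebraic structure shows up in bandwidth.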
A holistic design made with security in mind
FrodoKEM’s design principles support security beyond its reliance on generic, unstructured lattices to minimize the attack surface of potential future cryptanalytic threats. Its parameters have been carefully chosen with additional security margins to withstand advancements in known attacks. Furthermore, FrodoKEM is designed with simplicity in mind—its internal operations are based on straightforward matrix-vector arithmetic using integer coefficients reduced modulo a power of two. These design decisions facilitate simple, compact and secure implementations that are also easier to maintain and to protect against side-channel attacks.
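One reason a power-of-two modulus keeps implementations simple is that reduction modulo \(q = 2^D\) is a single bit mask, with no division or data-dependent branching. A minimal sketch, assuming an illustrative exponent D = 15 (the names `D`, `Q`, `MASK`, and `reduce_mod_q` are hypothetical):

```python
D = 15            # illustrative exponent; q = 2**D
Q = 1 << D
MASK = Q - 1      # D low bits set

def reduce_mod_q(x):
    """Reduce an integer modulo 2**D by keeping only its D low bits."""
    return x & MASK

# The mask agrees with the % operator, including for negative inputs.
for x in (0, 5, Q, Q + 7, -3, 123456789):
    assert reduce_mod_q(x) == x % Q
```

In C, the same effect comes for free when coefficients are stored in fixed-width unsigned integers and masked once at the end of a computation.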
Conclusion
After years of research and analysis, the next generation of post-quantum cryptographic algorithms has arrived. NIST has chosen strong PQC protocols that we believe will serve Microsoft and its customers well in many applications. For security-sensitive applications, FrodoKEM offers a secure yet practical approach for post-quantum cryptography. While its reliance on unstructured lattices results in larger key sizes and higher computational overhead compared to structured lattice-based alternatives, it provides strong security assurances against potential future attacks. Given the ongoing standardization efforts and its endorsement by multiple governmental agencies, FrodoKEM is well-positioned as a viable alternative for organizations seeking long-term cryptographic resilience in a post-quantum world.
Further Reading
For those interested in learning more about FrodoKEM, post-quantum cryptography, and lattice-based cryptography, the following resources provide valuable insights:

The official FrodoKEM website: https://frodokem.org/, which contains, among several other resources, FrodoKEM’s specification document.
The official FrodoKEM software library, which contains reference and optimized implementations of FrodoKEM written in C and Python.
NIST’s Post-Quantum Cryptography Project.
Microsoft’s blogpost on its transition plan for PQC.
A comprehensive survey on lattice-based cryptography: Peikert, C. “A Decade of Lattice Cryptography.” Foundations and Trends in Theoretical Computer Science.
A comprehensive tutorial on modern lattice-based schemes, including ML-KEM and ML-DSA: Lyubashevsky, V. “Basic Lattice Cryptography: The concepts behind Kyber (ML-KEM) and Dilithium (ML-DSA).”
#frodokem #conservative #quantumsafe #cryptographic #algorithm

FrodoKEM: A conservative quantum-safe cryptographic algorithm
www.microsoft.com
In this post, we describe FrodoKEM, a key encapsulation protocol that offers a simple design and provides strong security guarantees even in a future with powerful quantum computers. The quantum threat to cryptography For decades, modern cryptography has relied on mathematical problems that are practically impossible for classical computers to solve without a secret key. Cryptosystems like RSA, Diffie-Hellman key-exchange, and elliptic curve-based schemes—which rely on the hardness of the integer factorization and (elliptic curve) discrete logarithm problems—secure communications on the internet, banking transactions, and even national security systems. However, the emergence of Quantum computers leverage the principles of quantum mechanics to perform certain calculations exponentially faster than classical computers. Their ability to solve complex problems, such as simulating molecular interactions, optimizing large-scale systems, and accelerating machine learning, is expected to have profound and beneficial implications for fields ranging from chemistry and material science to artificial intelligence. Spotlight: AI-POWERED EXPERIENCE Microsoft research copilot experience Discover more about research at Microsoft through our AI-powered experience Start now Opens in a new tab At the same time, quantum computing is poised to disrupt cryptography. In particular, Shor’s algorithm, a quantum algorithm developed in 1994, can efficiently factor large numbers and compute discrete logarithms—the very problems that underpin the security of RSA, Diffie-Hellman, and elliptic curve cryptography. This means that once large-scale, fault-tolerant quantum computers become available, public-key protocols based on RSA, ECC, and Diffie-Hellman will become insecure, breaking a sizable portion of the cryptographic backbone of today’s digital world. 
Recent advances in quantum computing, such as Microsoft’s Majorana 1 (opens in new tab), the first quantum processor powered by topological qubits, represent major steps toward practical quantum computing and underscore the urgency of transitioning to quantum-resistant cryptographic systems. To address this looming security crisis, cryptographers and government agencies have been working on post-quantum cryptography (PQC)—new cryptographic algorithms that can resist attacks from both classical and quantum computers. The NIST Post-Quantum Cryptography Standardization effort In 2017, the U.S. National Institute of Standards and Technology (NIST) launched the Post-Quantum Cryptography Standardization project (opens in new tab) to evaluate and select cryptographic algorithms capable of withstanding quantum attacks. As part of this initiative, NIST sought proposals for two types of cryptographic primitives: key encapsulation mechanisms (KEMs)—which enable two parties to securely derive a shared key to establish an encrypted connection, similar to traditional key exchange schemes—and digital signature schemes. This initiative attracted submissions from cryptographers worldwide, and after multiple evaluation rounds, NIST selected CRYSTALS-Kyber, a KEM based on structured lattices, and standardized it as ML-KEM (opens in new tab). Additionally, NIST selected three digital signature schemes: CRYSTALS-Dilithium, now called ML-DSA; SPHINCS+, now called SLH-DSA; and Falcon, now called FN-DSA. While ML-KEM provides great overall security and efficiency, some governments and cryptographic researchers advocate for the inclusion and standardization of alternative algorithms that minimize reliance on algebraic structure. Reducing algebraic structure might prevent potential vulnerabilities and, hence, can be considered a more conservative design choice. One such algorithm is FrodoKEM. 
International standardization of post-quantum cryptography Beyond NIST, other international standardization bodies have been actively working on quantum-resistant cryptographic solutions. The International Organization for Standardization (ISO) is leading a global effort to standardize additional PQC algorithms. Notably, European government agencies—including Germany’s BSI (opens in new tab), the Netherlands’ NLNCSA and AIVD (opens in new tab), and France’s ANSSI (opens in new tab)—have shown strong support for FrodoKEM, recognizing it as a conservative alternative to structured lattice-based schemes. As a result, FrodoKEM is undergoing standardization at ISO. Additionally, ISO is standardizing ML-KEM and a conservative code-based KEM called Classic McEliece. These three algorithms are planned for inclusion in ISO/IEC 18033-2:2006 as Amendment 2 (opens in new tab). What is FrodoKEM? FrodoKEM is a key encapsulation mechanism (KEM) based on the Learning with Errors (LWE) problem, a cornerstone of lattice-based cryptography. Unlike structured lattice-based schemes such as ML-KEM, FrodoKEM is built on generic, unstructured lattices, i.e., it is based on the plain LWE problem. Why unstructured lattices? Structured lattice-based schemes introduce additional algebraic properties that could potentially be exploited in future cryptanalytic attacks. By using unstructured lattices, FrodoKEM eliminates these concerns, making it a safer choice in the long run, albeit at the cost of larger key sizes and lower efficiency. It is important to emphasize that no particular cryptanalytic weaknesses are currently known for recommended parameterizations of structured lattice schemes in comparison to plain LWE. However, our current understanding of the security of these schemes could potentially change in the future with cryptanalytic advances. 
Lattices and the Learning with Errors (LWE) problem Lattice-based cryptography relies on the mathematical structure of lattices, which are regular arrangements of points in multidimensional space. A lattice is defined as the set of all integer linear combinations of a set of basis vectors. The difficulty of certain computational problems on lattices, such as the Shortest Vector Problem (SVP) and the Learning with Errors (LWE) problem, forms the basis of lattice-based schemes. The Learning with Errors (LWE) problem The LWE problem is a fundamental hard problem in lattice-based cryptography. It involves solving a system of linear equations where some small random error has been added to each equation, making it extremely difficult to recover the original secret values. This added error ensures that the problem remains computationally infeasible, even for quantum computers. Figure 1 below illustrates the LWE problem, specifically, the search version of the problem. As can be seen in Figure 1, for the setup of the problem we need a dimension \(n\) that defines the size of matrices, a modulus \(q\) that defines the value range of the matrix coefficients, and a certain error distribution \(\chi\) from which we sample \(\textit{“small”}\) matrices. We sample two matrices from \(\chi\), a small matrix \(\text{s}\) and an error matrix \(\text{e}\) (for simplicity in the explanation, we assume that both have only one column); sample an \(n \times n\) matrix \(\text{A}\) uniformly at random; and compute \(\text{b} = \text{A} \times \text{s} + \text{e}\). In the illustration, each matrix coefficient is represented by a colored square, and the “legend of coefficients” gives an idea of the size of the respective coefficients, e.g., orange squares represent the small coefficients of matrix \(\text{s}\) (small relative to the modulus \(q\)). Finally, given \(\text{A}\) and \(\text{b}\), the search LWE problem consists in finding \(\text{s}\). 
This problem is believed to be hard for suitably chosen parameters (e.g., for dimension \(n\) sufficiently large) and is used at the core of FrodoKEM. In comparison, the LWE variant used in ML-KEM—called Module-LWE (M-LWE)—has additional symmetries, adding mathematical structure that helps improve efficiency. In a setting similar to that of the search LWE problem above, the matrix \(\text{A}\) can be represented by just a single row of coefficients. FIGURE 1: Visualization of the (search) LWE problem. LWE is conjectured to be quantum-resistant, and FrodoKEM’s security is directly tied to its hardness. In other words, cryptanalysts and quantum researchers have not been able to devise an efficient quantum algorithm capable of solving the LWE problem and, hence, FrodoKEM. In cryptography, absolute security can never be guaranteed; instead, confidence in a problem’s hardness comes from extensive scrutiny and its resilience against attacks over time. How FrodoKEM Works FrodoKEM follows the standard paradigm of a KEM, which consists of three main operations—key generation, encapsulation, and decapsulation—performed interactively between a sender and a recipient with the goal of establishing a shared secret key: Key generation (KeyGen), computed by the recipient Generates a public key and a secret key. The public key is sent to the sender, while the private key remains secret. Encapsulation (Encapsulate), computed by the sender Generates a random session key. Encrypts the session key using the recipient’s public key to produce a ciphertext. Produces a shared key using the session key and the ciphertext. The ciphertext is sent to the recipient. Decapsulation (Decapsulate), computed by the recipient Decrypts the ciphertext using their secret key to recover the original session key. Reproduces the shared key using the decrypted session key and the ciphertext. 
The shared key generated by the sender and reconstructed by the recipient can then be used to establish secure symmetric-key encryption for further communication between the two parties.

Figure 2 below shows a simplified view of the FrodoKEM protocol. As highlighted in red, FrodoKEM uses at its core LWE operations of the form “\(\text{b} = \text{A} \times \text{s} + \text{e}\)”, which are directly applied within the KEM paradigm.

FIGURE 2: Simplified overview of FrodoKEM.

Performance: Strong security has a cost

Not relying on additional algebraic structure certainly comes at a cost for FrodoKEM in the form of increased protocol runtime and bandwidth. The table below compares the performance and key sizes corresponding to the FrodoKEM level 1 parameter set (variant called “FrodoKEM-640-AES”) and the respective parameter set of ML-KEM (variant called “ML-KEM-512”). These parameter sets are intended to match or exceed the brute force security of AES-128. As can be seen, the difference in speed and key sizes between FrodoKEM and ML-KEM is more than an order of magnitude. Nevertheless, the runtime of the FrodoKEM protocol remains reasonable for most applications. For example, on our benchmarking platform clocked at 3.2GHz, the measured runtimes are 0.97 ms, 1.9 ms, and 3.2 ms for security levels 1, 3, and 5, respectively.

For security-sensitive applications, a more relevant comparison is with Classic McEliece, a post-quantum code-based scheme also considered for standardization. In this case, FrodoKEM offers several efficiency advantages. Classic McEliece’s public keys are significantly larger, well over an order of magnitude greater than FrodoKEM’s, and its key generation is substantially more computationally expensive. Nonetheless, Classic McEliece provides an advantage in certain static key-exchange scenarios, where its high key generation cost can be amortized across multiple key encapsulation executions.
TABLE 1: Comparison of key sizes and performance on an x86-64 processor for NIST level 1 parameter sets.

A holistic design made with security in mind

FrodoKEM’s design principles support security beyond its reliance on generic, unstructured lattices to minimize the attack surface of potential future cryptanalytic threats. Its parameters have been carefully chosen with additional security margins to withstand advancements in known attacks. Furthermore, FrodoKEM is designed with simplicity in mind: its internal operations are based on straightforward matrix-vector arithmetic using integer coefficients reduced modulo a power of two. These design decisions facilitate simple, compact and secure implementations that are also easier to maintain and to protect against side-channel attacks.

Conclusion

After years of research and analysis, the next generation of post-quantum cryptographic algorithms has arrived. NIST has chosen strong PQC protocols that we believe will serve Microsoft and its customers well in many applications. For security-sensitive applications, FrodoKEM offers a secure yet practical approach for post-quantum cryptography. While its reliance on unstructured lattices results in larger key sizes and higher computational overhead compared to structured lattice-based alternatives, it provides strong security assurances against potential future attacks. Given the ongoing standardization efforts and its endorsement by multiple governmental agencies, FrodoKEM is well-positioned as a viable alternative for organizations seeking long-term cryptographic resilience in a post-quantum world.

Further Reading

For those interested in learning more about FrodoKEM, post-quantum cryptography, and lattice-based cryptography, the following resources provide valuable insights:

The official FrodoKEM website: https://frodokem.org/, which contains, among several other resources, FrodoKEM’s specification document.
The official FrodoKEM software library: https://github.com/Microsoft/PQCrypto-LWEKE, which contains reference and optimized implementations of FrodoKEM written in C and Python.
NIST’s Post-Quantum Cryptography Project: https://csrc.nist.gov/projects/post-quantum-cryptography.
Microsoft’s blogpost on its transition plan for PQC: https://techcommunity.microsoft.com/blog/microsoft-security-blog/microsofts-quantum-resistant-cryptography-is-here/4238780.
A comprehensive survey on lattice-based cryptography: Peikert, C. “A Decade of Lattice Cryptography.” Foundations and Trends in Theoretical Computer Science. (2016)
A comprehensive tutorial on modern lattice-based schemes, including ML-KEM and ML-DSA: Lyubashevsky, V. “Basic Lattice Cryptography: The concepts behind Kyber (ML-KEM) and Dilithium (ML-DSA).” https://eprint.iacr.org/2024/1287. (2024)

Microsoft Academic @MicrosoftAcademic shared a link
2025-05-22 17:28:01 ·

Abstracts: Zero-shot models in single-cell biology with Alex Lu

Transcript

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot – or a podcast abstract – of their new and noteworthy papers. On today’s episode, I’m talking to Alex Lu, a senior researcher at Microsoft Research and co-author of a paper called Assessing the Limits of Zero Shot Foundation Models in Single-cell Biology. Alex Lu, wonderful to have you on the podcast. Welcome to Abstracts!

ALEX LU: Yeah, I’m really excited to be joining you today.
HUIZINGA: So let’s start with a little background of your work. In just a few sentences, tell us about your study and more importantly, why it matters.
LU: Absolutely. And before I dive in, I want to give a shout out to the MSR research intern who actually did this work. This was led by Kasia Kedzierska, who interned with us two summers ago in 2023, and she’s the lead author on the study. But basically, in this research, we study single-cell foundation models, which have really recently rocked the world of biology, because they basically claim to be able to use AI to unlock understanding about single-cell biology. Biologists for a myriad of applications, everything from understanding how single cells differentiate into different kinds of cells, to discovering new drugs for cancer, will conduct experiments where they measure how much of every gene is expressed inside of just one single cell. So these experiments give us a powerful view into the cell’s internal state. But measurements from these experiments are incredibly complex. There are about 20,000 different human genes. So you get this really long chain of numbers that measure how much there is of 20,000 different genes. So deriving meaning from this really long chain of numbers is really difficult. And single-cell foundation models claim to be capable of unraveling deeper insights than ever before. So that’s the claim that these works have made. And in our recent paper, we showed that these models may actually not live up to these claims. Basically, we showed that single-cell foundation models perform worse in settings that are fundamental to biological discovery than much simpler machine learning and statistical methods that were used in the field before single-cell foundation models emerged and are the go-to standard for unpacking meaning from these complicated experiments. So in a nutshell, we should care about these results because it has implications on the toolkits that biologists use to understand their experiments. 
Our work suggests that single-cell foundation models may not be appropriate for practical use just yet, at least in the discovery applications that we cover.
HUIZINGA: Well, let’s go a little deeper there. Generative pre-trained transformer models, GPTs, are relatively new on the research scene in terms of how they’re being used in novel applications, which is what you’re interested in, like single-cell biology. So I’m curious, just sort of as a foundation, what other research has already been done in this area, and how does this study illuminate or build on it?
LU: Absolutely. Okay, so we were the first to notice and document this issue in single-cell foundation models, specifically. And this is because we have proposed evaluation methods that, while common in other areas of AI, have yet to be commonly used to evaluate single-cell foundation models. We performed something called zero-shot evaluation on these models. Prior to our work, most works evaluated single-cell foundation models with fine-tuning. And the way to understand this is that single-cell foundation models are trained in a way that tries to expose these models to millions of single cells. But because you’re exposing them to a large amount of data, you can’t really rely upon this data being annotated or like labeled in any particular fashion then. So in order for them to actually do the specialized tasks that are useful for biologists, you typically have to add on a second training phase. We call this the fine-tuning phase, where you have a smaller number of single cells, but now they are actually labeled with the specialized tasks that you want the model to perform. So most people, they typically evaluate the performance of single-cell models after they fine-tune these models. However, what we noticed is that evaluating these fine-tuned models has several problems. First, it might not actually align with how these models are actually going to be used by biologists then. A critical distinction in biology is that we’re not just trying to interact with an agent that has access to knowledge through its pre-training, we’re trying to extend these models to discover new biology beyond the sphere of influence then. And so in many cases, the point of using these models, the point of analysis, is to explore the data with the goal of potentially discovering something new about the single cell that the biologists worked with that they weren’t aware of before. So in these kinds of cases, it is really tough to fine-tune a model.
There’s a bit of a chicken and egg problem going on. If you don’t know, for example, there’s a new kind of cell in the data, you can’t really instruct the model to help us identify these kinds of new cells. So in other words, fine-tuning these models for those tasks essentially becomes impossible then. So the second issue is that evaluations on fine-tuned models can sometimes mislead us in our ability to understand how these models are working. So for example, the claim behind single-cell foundation model papers is that these models learn a foundation of biological knowledge by being exposed to millions of single cells in their first training phase, right? But it’s possible when you fine-tune a model, it may just be that any performance increases that you see using the model are simply because you’re using a massive model that is really sophisticated, really large. And even if there isn’t any exposure to any cells at all then, that model is going to do perfectly fine then. So going back to our paper, what’s really different about this paper is that we propose zero-shot evaluation for these models. What that means is that we do not fine-tune the model at all, and instead we keep the model frozen during the analysis step. So how we specialize it to a downstream task instead is that we extract the model’s internal embedding of single-cell data, which is essentially a numerical vector that contains information that the model is extracting and organizing from input data. So it’s essentially how the model perceives single-cell data and how it’s organizing in its own internal state. So basically, this is the better way for us to test the claim that single-cell foundation models are learning foundational biological insights. Because if they actually are learning these insights, they should be present in the model’s embedding space even before we fine-tune the model.
HUIZINGA: Well, let’s talk about methodology on this particular study. You focused on assessing existing models in zero-shot learning for single-cell biology. How did you go about evaluating these models?
LU: Yes, so let’s dive deeper into how zero-shot evaluations are conducted, okay? So the premise here is that we’re relying upon the fact that if these models are fully learning foundational biological insights, if we take the model’s internal representation of cells, then cells that are biologically similar should be close in that internal representation, where cells that are biologically distinct should be further apart. And that is exactly what we tested in our study. We compared two popular single-cell foundation models and importantly, we compared these models against older and reliable tools that biologists have used for exploratory analyses. So these include simpler machine learning methods like scVI, statistical algorithms like Harmony, and even basic data pre-processing steps, just like filtering your data down to a more robust subset of genes, then. So basically, we tested embeddings from our two single-cell foundation models against this baseline in a variety of settings. And we tested the hypothesis that biologically similar cells should be similar across these distinct methods across these datasets.
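The premise described here, that biologically similar cells should sit close together in a frozen model's embedding space, can be illustrated with a small Python sketch. The transcript does not say which metrics the paper used, so the nearest-neighbour label agreement below is only an assumed stand-in, and the function name, the toy embeddings, and the labels are all hypothetical.

```python
import math


def knn_label_agreement(embeddings, labels, k=3):
    """Mean fraction of each cell's k nearest neighbours that share its label.

    embeddings: one numeric vector per cell, taken from a frozen model
    labels:     one known cell-type label per cell, used only for scoring
    """
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        # Distances from cell i to every other cell, nearest first.
        dists = sorted(
            (math.dist(embeddings[i], embeddings[j]), j)
            for j in range(n) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        total += sum(labels[j] == labels[i] for j in neighbours) / k
    return total / n


if __name__ == "__main__":
    # Two well-separated toy clusters standing in for two cell types.
    emb = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    cell_types = ["A", "A", "A", "B", "B", "B"]
    print(knn_label_agreement(emb, cell_types, k=2))
```

The same score can be computed for each method's embeddings (a foundation model, scVI, Harmony, or a filtered gene subset) on the same cells, which is the kind of side-by-side comparison the study describes. No model is fine-tuned at any point; the labels are used only to score the embeddings.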
HUIZINGA: Well, and as you did the testing, you obviously were aiming towards research findings, which is my favorite part of a research paper, so tell us what you did find and what you feel the most important takeaways of this paper are.
LU: Absolutely. So in a nutshell, we found that these two newly proposed single-cell foundation models substantially underperformed compared to older methods then. So to contextualize why that is such a surprising result, there is a lot of hype around these methods. So basically, I think that, yeah, it’s a very surprising result, given how hyped these models are and how people were already adopting them. But our results basically caution that these shouldn’t really be adopted for these purposes.
HUIZINGA: Yeah, so this is serious real-world impact here in terms of if models are being adopted and adapted in these applications, how reliable are they, et cetera? So given that, who would you say benefits most from what you’ve discovered in this paper and why?
LU: Okay, so two ways, right? So I think this has at least immediate implications on the way that we do discovery in biology. And as I’ve discussed, these experiments are used for cases that have practical impact, drug discovery applications, investigations into basic biology, then. But let’s also talk about the impact for methodologists, people who are trying to improve these single-cell foundation models, right? I think at the base, they’re really exciting proposals. Because if you look at what some of the prior and less sophisticated methods couldn’t do, they tended to be more bespoke. So the excitement of single-cell foundation models is that you have this general-purpose model that can be used for everything and while they’re not living up to that purpose just now, just currently, I think that it’s important that we continue to bank onto that vision, right? So if you look at our contributions in that area, single-cell foundation models are a really new proposal, so it makes sense that we may not know how to fully evaluate them just yet then. So you can view our work as basically being a step towards more rigorous evaluation of these models. Now that we did this experiment, I think the methodologists know to use this as a signal on how to improve the models and if they’re going in the right direction. And in fact, you are seeing more and more papers adopt zero-shot evaluations since we put out our paper then. And so this essentially helps future computer scientists that are working on single-cell foundation models know how to train better models.
HUIZINGA: That said, Alex, finally, what are the outstanding challenges that you identified for zero-shot learning research in biology, and what foundation might this paper lay for future research agendas in the field?
LU: Yeah, absolutely. So now that we’ve shown single-cell foundation models don’t necessarily perform well, I think the natural question on everyone’s mind is how do we actually train single-cell foundation models that live up to that vision, that can perform in helping us discover new biology then? So I think in the short term, yeah, we’re actively investigating many hypotheses in this area. So for example, my colleagues, Lorin Crawford and Ava Amini, who were co-authors in the paper, recently put out a pre-print understanding how training data composition impacts model performance. And so one of the surprising findings that they had was that many of the training data sets that people used to train single-cell foundation models are highly redundant, to the point that you can even sample just a tiny fraction of the data and get basically the same performance then. But you can also look forward to many other explorations in this area as we continue to develop this research at the end of the day. But also zooming out into the bigger picture, I think one major takeaway from this paper is that developing AI methods for biology requires thought about the context of use, right? I mean, this is obvious for any AI method then, but I think people have gotten just too used to taking methods that work out there for natural vision or natural language maybe in the consumer domain and then extrapolating these methods to biology and expecting that they will work in the same way then, right? So for example, one reason why zero-shot evaluation was not routine practice for single-cell foundation models prior to our work, I mean, we were the first to fully establish that as a practice for the field, was because I think people who have been working in AI for biology have been looking to these more mainstream AI domains to shape their work then. 
And so with single-cell foundation models, many of these models are adopted from large language models with natural language processing, recycling the exact same architecture, the exact same code, basically just recycling practices in that field then. So when you look at like practices in like more mainstream domains, zero-shot evaluation is definitely explored in those domains, but it’s more of like a niche instead of being considered central to model understanding. So again, because biology is different from mainstream language processing, it’s a scientific discipline, zero-shot evaluation becomes much more important, and you have no choice but to use these models, zero-shot then. So in other words, I think that we need to be thinking carefully about what it is that makes training a model for biology different from training a model, for example, for consumer purposes.
HUIZINGA: Alex Lu, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/Abstracts, or you can read it on the Genome Biology website. See you next time on Abstracts!
#abstracts #zeroshot #models #singlecell #biology

Abstracts: Zero-shot models in single-cell biology with Alex Lu
TranscriptGRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot – or a podcast abstract – of their new and noteworthy papers. On today’s episode, I’m talking to Alex Lu, a senior researcher at Microsoft Research and co-author of a paper called Assessing the Limits of Zero Shot Foundation Models in Single-cell Biology. Alex Lu, wonderful to have you on the podcast. Welcome to Abstracts! ALEX LU: Yeah, I’m really excited to be joining you today. HUIZINGA: So let’s start with a little background of your work. In just a few sentences, tell us about your study and more importantly, why it matters. LU: Absolutely. And before I dive in, I want to give a shout out to the MSR research intern who actually did this work. This was led by Kasia Kedzierska, who interned with us two summers ago in 2023, and she’s the lead author on the study. But basically, in this research, we study single-cell foundation models, which have really recently rocked the world of biology, because they basically claim to be able to use AI to unlock understanding about single-cell biology. Biologists for a myriad of applications, everything from understanding how single cells differentiate into different kinds of cells, to discovering new drugs for cancer, will conduct experiments where they measure how much of every gene is expressed inside of just one single cell. So these experiments give us a powerful view into the cell’s internal state. But measurements from these experiments are incredibly complex. There are about 20,000 different human genes. So you get this really long chain of numbers that measure how much there is of 20,000 different genes. So deriving meaning from this really long chain of numbers is really difficult. 
And single-cell foundation models claim to be capable of unraveling deeper insights than ever before. So that’s the claim that these works have made. And in our recent paper, we showed that these models may actually not live up to these claims. Basically, we showed that single-cell foundation models perform worse in settings that are fundamental to biological discovery than much simpler machine learning and statistical methods that were used in the field before single-cell foundation models emerged and are the go-to standard for unpacking meaning from these complicated experiments. So in a nutshell, we should care about these results because it has implications on the toolkits that biologists use to understand their experiments. Our work suggests that single-cell foundation models may not be appropriate for practical use just yet, at least in the discovery applications that we cover. HUIZINGA: Well, let’s go a little deeper there. Generative pre-trained transformer models, GPTs, are relatively new on the research scene in terms of how they’re being used in novel applications, which is what you’re interested in, like single-cell biology. So I’m curious, just sort of as a foundation, what other research has already been done in this area, and how does this study illuminate or build on it? LU: Absolutely. Okay, so we were the first to notice and document this issue in single-cell foundation models, specifically. And this is because that we have proposed evaluation methods that, while are common in other areas of AI, have yet to be commonly used to evaluate single-cell foundation models. We performed something called zero-shot evaluation on these models. Prior to our work, most works evaluated single-cell foundation models with fine tuning. And the way to understand this is because single-cell foundation models are trained in a way that tries to expose these models to millions of single-cells. 
But because you’re exposing them to a large amount of data, you can’t really rely upon this data being annotated or like labeled in any particular fashion then. So in order for them to actually do the specialized tasks that are useful for biologists, you typically have to add on a second training phase. We call this the fine-tuning phase, where you have a smaller number of single cells, but now they are actually labeled with the specialized tasks that you want the model to perform. So most people, they typically evaluate the performance of single-cell models after they fine-tune these models. However, what we noticed is that this evaluating these fine-tuned models has several problems. First, it might not actually align with how these models are actually going to be used by biologists then. A critical distinction in biology is that we’re not just trying to interact with an agent that has access to knowledge through its pre-training, we’re trying to extend these models to discover new biology beyond the sphere of influence then. And so in many cases, the point of using these models, the point of analysis, is to explore the data with the goal of potentially discovering something new about the single cell that the biologists worked with that they weren’t aware of before. So in these kinds of cases, it is really tough to fine-tune a model. There’s a bit of a chicken and egg problem going on. If you don’t know, for example, there’s a new kind of cell in the data, you can’t really instruct the model to help us identify these kinds of new cells. So in other words, fine-tuning these models for those tasks essentially becomes impossible then. So the second issue is that evaluations on fine-tuned models can sometimes mislead us in our ability to understand how these models are working. 
So for example, the claim behind single-cell foundation model papers is that these models learn a foundation of biological knowledge by being exposed to millions of single cells in its first training phase, right? But it’s possible when you fine-tune a model, it may just be that any performance increases that you see using the model is simply because that you’re using a massive model that is really sophisticated, really large. And even if there’s any exposure to any cells at all then, that model is going to do perfectly fine then. So going back to our paper, what’s really different about this paper is that we propose zero-shot evaluation for these models. What that means is that we do not fine-tune the model at all, and instead we keep the model frozen during the analysis step. So how we specialize it to be a downstream task instead is that we extract the model’s internal embedding of single-cell data, which is essentially a numerical vector that contains information that the model is extracting and organizing from input data. So it’s essentially how the model perceives single-cell data and how it’s organizing in its own internal state. So basically, this is the better way for us to test the claim that single-cell foundation models are learning foundational biological insights. Because if they actually are learning these insights, they should be present in the models embedding space even before we fine-tune the model. HUIZINGA: Well, let’s talk about methodology on this particular study. You focused on assessing existing models in zero-shot learning for single-cell biology. How did you go about evaluating these models? LU: Yes, so let’s dive deeper into how zero-shot evaluations are conducted, okay? 
So the premise here is that we’re relying upon the fact that if these models are fully learning foundational biological insights, if we take the model’s internal representation of cells, then cells that are biologically similar should be close in that internal representation, where cells that are biologically distinct should be further apart. And that is exactly what we tested in our study. We compared two popular single-cell foundation models and importantly, we compared these models against older and reliable tools that biologists have used for exploratory analyses. So these include simpler machine learning methods like scVI, statistical algorithms like Harmony, and even basic data pre-processing steps, just like filtering your data down to a more robust subset of genes, then. So basically, we tested embeddings from our two single-cell foundation models against this baseline in a variety of settings. And we tested the hypothesis that biologically similar cells should be similar across these distinct methods across these datasets. HUIZINGA: Well, and as you as you did the testing, you obviously were aiming towards research findings, which is my favorite part of a research paper, so tell us what you did find and what you feel the most important takeaways of this paper are. LU: Absolutely. So in a nutshell, we found that these two newly proposed single-cell foundation models substantially underperformed compared to older methods then. So to contextualize why that is such a surprising result, there is a lot of hype around these methods. So basically, I think that,yeah, it’s a very surprising result, given how hyped these models are and how people were already adopting them. But our results basically caution that these shouldn’t really be adopted for these use purposes. HUIZINGA: Yeah, so this is serious real-world impact here in terms of if models are being adopted and adapted in these applications, how reliable are they, et cetera? 
So given that, who would you say benefits most from what you’ve discovered in this paper and why?
LU: Okay, so two ways, right? I think this has immediate implications for the way that we do discovery in biology. As I’ve discussed, these experiments are used for cases that have practical impact: drug discovery applications and investigations into basic biology. But let’s also talk about the impact for methodologists, people who are trying to improve these single-cell foundation models, right? I think at the base, they’re really exciting proposals. Because if you look at what some of the prior and less sophisticated methods couldn’t do, they tended to be more bespoke. So the excitement of single-cell foundation models is that you have this general-purpose model that can be used for everything, and while they’re not living up to that purpose currently, I think it’s important that we continue to bank on that vision, right? Single-cell foundation models are a really new proposal, so it makes sense that we may not know how to fully evaluate them just yet. So you can view our work as a step towards more rigorous evaluation of these models. Now that we did this experiment, I think the methodologists know to use this as a signal on how to improve the models and whether they’re going in the right direction. And in fact, you are seeing more and more papers adopt zero-shot evaluations since we put out our paper. So this essentially helps future computer scientists who are working on single-cell foundation models know how to train better models.
HUIZINGA: That said, Alex, finally, what are the outstanding challenges that you identified for zero-shot learning research in biology, and what foundation might this paper lay for future research agendas in the field?
LU: Yeah, absolutely. So now that we’ve shown single-cell foundation models don’t necessarily perform well, I think the natural question on everyone’s mind is how do we actually train single-cell foundation models that live up to that vision, that can help us discover new biology? In the short term, we’re actively investigating many hypotheses in this area. For example, my colleagues Lorin Crawford and Ava Amini, who were co-authors on the paper, recently put out a preprint examining how training data composition impacts model performance. One of their surprising findings was that many of the training datasets that people use to train single-cell foundation models are highly redundant, to the point that you can sample just a tiny fraction of the data and get basically the same performance. And you can look forward to many other explorations in this area as we continue to develop this research.
But also zooming out into the bigger picture, I think one major takeaway from this paper is that developing AI methods for biology requires thought about the context of use, right? I mean, this is obvious for any AI method, but I think people have gotten just too used to taking methods that work for natural vision or natural language, maybe in the consumer domain, and then extrapolating these methods to biology and expecting that they will work in the same way, right? So for example, one reason why zero-shot evaluation was not routine practice for single-cell foundation models prior to our work (I mean, we were the first to fully establish that as a practice for the field) is that people who have been working in AI for biology have been looking to these more mainstream AI domains to shape their work. With single-cell foundation models, many of these models are adapted from large language models in natural language processing, recycling the exact same architecture, the exact same code, basically just recycling practices from that field. When you look at practices in more mainstream domains, zero-shot evaluation is definitely explored in those domains, but it’s more of a niche instead of being considered central to model understanding. So again, because biology is different from mainstream language processing, because it’s a scientific discipline, zero-shot evaluation becomes much more important, and you have no choice but to use these models zero-shot. In other words, I think that we need to be thinking carefully about what it is that makes training a model for biology different from training a model, for example, for consumer purposes.
HUIZINGA: Alex Lu, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/Abstracts, or you can read it on the Genome Biology website. See you next time on Abstracts!
#abstracts #zeroshot #models #singlecell #biology

Abstracts: Zero-shot models in single-cell biology with Alex Lu

www.microsoft.com

Microsoft Academic @MicrosoftAcademic shared a link
2025-05-21 17:29:40 ·

Abstracts: Aurora with Megan Stanley and Wessel Bruinsma

This is such exciting work about environmental forecasting, so we’re happy to have the two of you join us today.
Megan and Wessel, welcome.
MEGAN STANLEY: Thank you. Thanks. Great to be here.
WESSEL BRUINSMA: Thanks.
TINGLE: Let’s jump right in. Wessel, share a bit about the problem your research addresses and why this work is so important.
BRUINSMA: I think we’re all very much aware of the revolution that’s happening in the space of large language models, which have just become so strong. What’s perhaps less well known is that machine learning models have also started to revolutionize the field of weather prediction. Whereas traditional weather prediction models, based on physical laws, used to be the state of the art, these traditional models are now challenged and often even outperformed by AI models.
This advancement is super impressive and really a big deal, mostly because AI weather forecasting models are computationally much more efficient and can even be more accurate. What’s unfortunate, though, about this big step forward is that these developments are mostly limited to the setting of weather forecasting.
Weather forecasting is very important, obviously, but there are many other important environmental forecasting problems out there, such as air pollution forecasting or ocean wave forecasting. We have developed a model, named Aurora, which really kicks the AI revolution in weather forecasting into the next gear by extending these advancements to other environmental forecasting fields, too. With Aurora, we’re now able to produce state-of-the-art air pollution forecasts using an AI approach. And that wasn’t possible before!
TINGLE: Megan, how does this approach differ from or build on work that’s already been done in the atmospheric sciences?
STANLEY: Current approaches have really focused on training models very specifically for weather forecasting. In contrast, with Aurora, what we’ve attempted to do is train a so-called foundation model for the Earth system. In the first step, we train Aurora on a vast body of Earth system data. This is our pretraining step.
And when I say a vast body of data, I really do mean a lot. The purpose of this pretraining is to let Aurora, kind of, learn some general-purpose representation of the dynamics that govern the Earth system. But once we’ve pretrained Aurora, and this really is the crux of it, the reason why we’re doing this project, the model can leverage this learned general-purpose representation and efficiently adapt to new tasks, new domains, new variables. This is called fine-tuning.
The idea is that the model really uses the learned representation to perform this adaptation very efficiently, which basically means Aurora is a powerful, flexible model that can relatively cheaply be adapted to any environmental forecasting task.
TINGLE: Wessel, can you tell us about your methodology? How did you all conduct this research?
BRUINSMA: While approaches so far have trained models on primarily one particular data
These datasets are a combination of estimates of the historical state of the world, forecasts by other models, climate simulations, and more. We’ve been able to show that training on not just more data but more diverse data helps the model achieve even better performance. Showing this is difficult because there is just so much data.
In addition to scaling to more and more diverse data, we also increased the size of the model as much as we could. Here we found that bigger models, despite being slower to run, make more efficient use of computational resources. It’s cheaper to train a good big model than a good small model. The mantra of this project was to really keep it simple and to scale to simultaneously very large and, more importantly, diverse data and large model size.
TINGLE: So, Megan, what were your major findings? And we know they’re major because they’re in Nature.
STANLEY: Yeah, I guess they really are. So the main outcome of this project is we were actually able to train a single foundation model that achieves state-of-the-art performance in four different domains: air pollution forecasting, for example, predicting particulate matter near the surface or ozone in the atmosphere; ocean wave forecasting, which is critical for planning shipping routes; tropical cyclone track forecasting, which means being able to predict where a hurricane or a typhoon is expected to go, which is obviously incredibly important; and very high-resolution weather forecasting.
And I’ve, kind of, named these forecasting domains as if they’re just items in a list, but in every single one, Aurora really pushed the limits of what is possible with AI models. And we’re really proud of that.
But perhaps, kind of, you know, to my mind, the key takeaway here is that the foundation model approach actually works. So what we have shown is it’s possible to actually train some kind of general model, a foundation model, and then adapt it to a wide variety of environmental tasks. Now we definitely do not claim that Aurora is some kind of ultimate environmental forecasting model. We are sure that the model and the pretraining procedure can actually be improved. But, nevertheless, we’ve shown that this approach works for environmental forecasting. It really holds massive promise, and that’s incredibly cool.
TINGLE: Wessel, what do you think will be the real-world impact of this work?
BRUINSMA: Well, for applications that we mentioned, which are air pollution forecasting, ocean wave forecasting, tropical cyclone track forecasting, and very high-resolution weather forecasting, Aurora could today be deployed in real-time systems to produce near real-time forecasts. And, you know, in fact, it already is. You can view real-time weather forecasts by the high-resolution version of the model on the website of ECMWF.
But what’s remarkable is that each of these applications took a small team of engineers about four to eight weeks to fully execute. You should compare this to a typical development timeline for more traditional models, which can be on the order of multiple years. Using the pretraining and fine-tuning approach that we used for Aurora, we might see significantly accelerated development cycles for environmental forecasting problems. And that’s exciting.
TINGLE: Megan, if our listeners only walk away from this conversation with one key talking point, what would you like that to be? What should we remember about this paper?
STANLEY: The biggest takeaway is that the pretraining and fine-tuning paradigm really works for environmental forecasting, right? So you can train a foundation model, it learns some kind of general-purpose representation of the Earth system dynamics, and this representation boosts performance in a wide variety of forecasting tasks. But we really want to emphasize that Aurora only scratches the surface of what’s actually possible.
So there are many more applications to explore than the four we’ve mentioned. And undoubtedly, the model and pretraining procedure can actually be improved. So we’re really excited to see what the next few years will bring.
TINGLE: Wessel, tell us more about those opportunities and unanswered questions. What’s next on the research agenda in environmental prediction?
BRUINSMA: Well, Aurora has two main limitations. The first is that the model produces only deterministic predictions, by which I mean a single predicted value. For variables like temperature, this is mostly fine. But other variables, like precipitation, are inherently somewhat stochastic. For these variables, we really want to assign probabilities to different levels of precipitation rather than predicting only a single value.
An extension of Aurora to allow this sort of prediction would be a great next step.
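To make the contrast between a single predicted value and probabilities over precipitation levels concrete, here is a minimal sketch in Python. It is not Aurora's method or API: it assumes a hypothetical ensemble of precipitation forecasts (one value in millimetres per ensemble member) and hypothetical level boundaries, and simply counts how many members fall into each level. The function name `precipitation_probabilities` and all numbers are illustrative assumptions.

```python
from bisect import bisect_right


def precipitation_probabilities(ensemble_mm, thresholds_mm):
    """Estimate the probability of each precipitation level from an ensemble.

    ensemble_mm: hypothetical predicted precipitation amounts (mm), one per member.
    thresholds_mm: ascending level boundaries; k boundaries define k + 1 levels.
    Returns one probability per level, in the same order as the levels.
    """
    if not ensemble_mm:
        raise ValueError("ensemble must contain at least one member")
    counts = [0] * (len(thresholds_mm) + 1)
    for amount in ensemble_mm:
        # bisect_right gives the index of the level this amount falls into.
        counts[bisect_right(thresholds_mm, amount)] += 1
    return [count / len(ensemble_mm) for count in counts]


# Illustrative numbers only (assumed, not from the paper).
members = [0.0, 0.2, 1.5, 3.0, 12.0]
levels = [1.0, 10.0]  # below 1 mm, 1 to 10 mm, 10 mm or more

# A deterministic forecast collapses the ensemble to one number ...
single_value = sum(members) / len(members)
# ... while a probabilistic forecast keeps the spread across levels.
probabilities = precipitation_probabilities(members, levels)
```

With these assumed inputs, the single value hides the fact that one member in five predicts heavy rain, which is exactly the information a probabilistic forecast keeps.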
The second limitation is that Aurora depends on a procedure called assimilation. Assimilation attempts to create a starting point for the model from real-world observations, such as from weather stations and satellites. The model then takes the starting point and uses it to make predictions. Unfortunately, assimilation is super expensive, so it would be great if we could somehow circumvent the need for it.
Finally, what we find really important is to make our advancements available to the community.
TINGLE: Great. Megan and Wessel, thanks for joining us today on the Microsoft Research Podcast.
BRUINSMA: Thanks for having us.
STANLEY: Yeah, thank you. It’s been great.
TINGLE: You can check out the Aurora model on Azure AI Foundry. You can read the entire paper, “A Foundation Model for the Earth System,” at aka.ms/abstracts. And you’ll certainly find it on the Nature website, too.
Thank you so much for tuning in to Abstracts today. Until next time.
#abstracts #aurora #with #megan #stanley


Abstracts: Aurora with Megan Stanley and Wessel Bruinsma

www.microsoft.com
This is such exciting work about environmental forecasting, so we’re happy to have the two of you join us today. Megan and Wessel, welcome.
MEGAN STANLEY: Thank you. Thanks. Great to be here.
WESSEL BRUINSMA: Thanks.
TINGLE: Let’s jump right in. Wessel, share a bit about the problem your research addresses and why this work is so important.
BRUINSMA: I think we’re all very much aware of the revolution that’s happening in the space of large language models, which have just become so strong. What’s perhaps less well known is that machine learning models have also started to revolutionize the field of weather prediction. Whereas traditional weather prediction models, based on physical laws, used to be the state of the art, these traditional models are now challenged and often even outperformed by AI models. This advancement is super impressive and really a big deal, mostly because AI weather forecasting models are computationally much more efficient and can even be more accurate. What’s unfortunate, though, about this big step forward is that these developments are mostly limited to the setting of weather forecasting.
Weather forecasting is very important, obviously, but there are many other important environmental forecasting problems out there, such as air pollution forecasting or ocean wave forecasting. We have developed a model, named Aurora, which really kicks the AI revolution in weather forecasting into the next gear by extending these advancements to other environmental forecasting fields, too. With Aurora, we’re now able to produce state-of-the-art air pollution forecasts using an AI approach. And that wasn’t possible before!
TINGLE: Megan, how does this approach differ from or build on work that’s already been done in the atmospheric sciences?
STANLEY: Current approaches have really focused on training models very specifically for weather forecasting. And in contrast, with Aurora, what we’ve attempted to do is train a so-called foundation model for the Earth system. In the first step, we train Aurora on a vast body of Earth system data. This is our pretraining step. And when I say a vast body of data, I really do mean a lot. And the purpose of this pretraining is to let Aurora, kind of, learn some general-purpose representation of the dynamics that govern the Earth system. But then once we’ve pretrained Aurora, and this really is the crux of this, the reason why we’re doing this project, is after the model has been pretrained, it can leverage this learned general-purpose representation and efficiently adapt to new tasks, new domains, new variables. And this is called fine-tuning. The idea is that the model really uses the learned representation to perform this adaptation very efficiently, which basically means Aurora is a powerful, flexible model that can relatively cheaply be adapted to any environmental forecasting task.
TINGLE: Wessel, can you tell us about your methodology? How did you all conduct this research?
BRUINSMA: While approaches so far have trained models on primarily one particular dataset, Aurora was trained on many different datasets. These datasets are a combination of estimates of the historical state of the world, forecasts by other models, climate simulations, and more. We’ve been able to show that training on not just more data but more diverse data helps the model achieve even better performance. Showing this is difficult because there is just so much data. In addition to scaling to more and more diverse data, we also increased the size of the model as much as we could. Here we found that bigger models, despite being slower to run, make more efficient use of computational resources. It’s cheaper to train a good big model than a good small model. The mantra of this project was to really keep it simple and to scale to simultaneously very large and, more importantly, diverse data and large model size.
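The pretraining and fine-tuning pattern described in this exchange can be illustrated with a toy sketch. This is not Aurora’s code or training procedure: `pretrained_features`, `fine_tune_head`, `predict`, and the quadratic target task are hypothetical names invented here, and freezing the learned representation while fitting only a small head is just one common way to fine-tune, assumed for illustration.

```python
def pretrained_features(x):
    """Hypothetical stand-in for a frozen, pretrained representation."""
    return [1.0, x, x * x]


def predict(weights, x):
    """Apply the small task-specific head to the frozen features."""
    return sum(w * f for w, f in zip(weights, pretrained_features(x)))


def fine_tune_head(data, steps=2000, lr=0.1):
    """Fit only the head weights by gradient descent on mean squared error."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grads = [0.0, 0.0, 0.0]
        for x, y in data:
            feats = pretrained_features(x)
            err = predict(weights, x) - y
            for i, f in enumerate(feats):
                grads[i] += 2.0 * err * f / len(data)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


# A small "new task": learn y = 3 - 2x + 0.5x^2 from 21 samples on [-1, 1].
data = [(k / 10, 3 - 2 * (k / 10) + 0.5 * (k / 10) ** 2) for k in range(-10, 11)]
weights = fine_tune_head(data)
```

The point of the sketch is only that the adaptation step touches three numbers and a handful of samples, because the representation is reused rather than relearned.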
TINGLE: So, Megan, what were your major findings? And we know they’re major because they’re in Nature. [LAUGHS]
STANLEY: Yeah, [LAUGHS] I guess they really are. So the main outcome of this project is we were actually able to train a single foundation model that achieves state-of-the-art performance in four different domains. Air pollution forecasting. For example, predicting particulate matter near the surface or ozone in the atmosphere. Ocean wave forecasting, which is critical for planning shipping routes. Tropical cyclone track forecasting, so that means being able to predict where a hurricane or a typhoon is expected to go, which is obviously incredibly important, and very high-resolution weather forecasting.
And I’ve, kind of, named these forecasting domains as if they’re just items in a list, but in every single one, Aurora really pushed the limits of what is possible with AI models. And we’re really proud of that. But perhaps, kind of, you know, to my mind, the key takeaway here is that the foundation model approach actually works. So what we have shown is it’s possible to actually train some kind of general model, a foundation model, and then adapt it to a wide variety of environmental tasks. Now we definitely do not claim that Aurora is some kind of ultimate environmental forecasting model. We are sure that the model and the pretraining procedure can actually be improved. But, nevertheless, we’ve shown that this approach works for environmental forecasting. It really holds massive promise, and that’s incredibly cool.
TINGLE: Wessel, what do you think will be the real-world impact of this work?
BRUINSMA: Well, for applications that we mentioned, which are air pollution forecasting, ocean wave forecasting, tropical cyclone track forecasting, and very high-resolution weather forecasting, Aurora could today be deployed in real-time systems to produce near real-time forecasts. And, you know, in fact, it already is. You can view real-time weather forecasts by the high-resolution version of the model on the website of ECMWF (European Centre for Medium-Range Weather Forecasts). But what’s remarkable is that each of these applications took a small team of engineers about four to eight weeks to fully execute. You should compare this to a typical development timeline for more traditional models, which can be on the order of multiple years. Using the pretraining fine-tuning approach that we used for Aurora, we might see significantly accelerated development cycles for environmental forecasting problems. And that’s exciting.
TINGLE: Megan, if our listeners only walk away from this conversation with one key talking point, what would you like that to be? What should we remember about this paper?
STANLEY: The biggest takeaway is that the pretraining fine-tuning paradigm, it really works for environmental forecasting, right? So you can train a foundational model, it learns some kind of general-purpose representation of the Earth system dynamics, and this representation boosts performance in a wide variety of forecasting tasks. But we really want to emphasize that Aurora only scratches the surface of what’s actually possible. So there are many more applications to explore than the four we’ve mentioned. And undoubtedly, the model and pretraining procedure can actually be improved. So we’re really excited to see what the next few years will bring.
TINGLE: Wessel, tell us more about those opportunities and unanswered questions. What’s next on the research agenda in environmental prediction?
BRUINSMA: Well, Aurora has two main limitations. The first is that the model produces only deterministic predictions, by which I mean a single predicted value. For variables like temperature, this is mostly fine. But other variables, like precipitation, are inherently stochastic. For these variables, we really want to assign probabilities to different levels of precipitation rather than predicting only a single value. An extension of Aurora to allow this sort of prediction would be a great next step.
The second limitation is that Aurora depends on a procedure called assimilation. Assimilation attempts to create a starting point for the model from real-world observations, such as from weather stations and satellites. The model then takes the starting point and uses it to make predictions. Unfortunately, assimilation is super expensive, so it would be great if we could somehow circumvent the need for it. Finally, what we find really important is to make our advancements available to the community. [MUSIC]
TINGLE: Great. Megan and Wessel, thanks for joining us today on the Microsoft Research Podcast.
BRUINSMA: Thanks for having us.
STANLEY: Yeah, thank you. It’s been great.
TINGLE: You can check out the Aurora model on Azure AI Foundry. You can read the entire paper, “A Foundation Model for the Earth System,” at aka.ms/abstracts. And you’ll certainly find it on the Nature website, too.
Thank you so much for tuning in to Abstracts today. Until next time. [MUSIC FADES]
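The difference Bruinsma draws between a single deterministic value and probabilities over precipitation levels can be made concrete with a small sketch. It assumes, purely for illustration, that a forecast is available as a list of equally likely scenarios (an ensemble); the numbers, the thresholds, and the function name `exceedance_probabilities` are made up and are not taken from Aurora or the paper.

```python
def exceedance_probabilities(ensemble_mm, thresholds_mm):
    """Fraction of scenarios whose precipitation reaches each threshold."""
    n = len(ensemble_mm)
    return {t: sum(1 for v in ensemble_mm if v >= t) / n for t in thresholds_mm}


# Ten hypothetical precipitation scenarios for one location, in millimetres.
ensemble_mm = [0.0, 0.0, 0.4, 1.2, 2.5, 3.1, 6.0, 0.0, 0.8, 12.4]

# A deterministic forecast collapses everything into one number (here, the mean).
deterministic_mm = sum(ensemble_mm) / len(ensemble_mm)

# A probabilistic forecast keeps the spread: chance of at least 1, 5, and 10 mm.
probabilities = exceedance_probabilities(ensemble_mm, [1.0, 5.0, 10.0])
```

In this made-up case the single value is 2.64 mm, which hides the fact that three scenarios are completely dry and one in ten exceeds 10 mm; the probabilities keep that information.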

Microsoft Academic @MicrosoftAcademic shared a link
2025-05-20 22:49:33 ·

Collaborators: Healthcare Innovation to Impact

JONATHAN CARLSON: From the beginning, healthcare stood out to us as an important opportunity for general reasoners to improve the lives and experiences of patients and providers. Indeed, in the past two years, there’s been an explosion of scientific papers looking at the application first of text reasoners in medicine, then multi-modal reasoners that can interpret medical images, and now, most recently, healthcare agents that can reason with each other. But even more impressive than the pace of research has been the surprisingly rapid diffusion of this technology into real-world clinical workflows.
LUNGREN: So today, we’ll talk about how our cross-company collaboration has shortened that gap and delivered advanced AI capabilities and solutions into the hands of developers and clinicians around the world, empowering everyone in health and life sciences to achieve more. I’m Doctor Matt Lungren, chief scientific officer for Microsoft Health and Life Sciences.
CARLSON: And I’m Jonathan Carlson, vice president and managing director of Microsoft Health Futures.
LUNGREN: And together we brought some key players leading in the space of AI and health
CARLSON: We’ve asked these brilliant folks to join us because each of them represents a mission critical group of cutting-edge stakeholders, scaling breakthroughs into purpose-built solutions and capabilities for health
LUNGREN: We’ll hear today how generative AI capabilities can unlock reasoning across every data type in medicine: text, images, waveforms, genomics. And further, how multi-agent frameworks in healthcare can accelerate complex workflows, in some cases acting as a specialist team member, safely secured inside the Microsoft 365 tools used by hundreds of millions of healthcare enterprise users across the world. The opportunity to save time today and lives tomorrow with AI has never been larger.
MATTHEW LUNGREN: Jonathan. You know, it’s been really interesting kind of observing Microsoft Research over the decades. I’ve, you know, been watching you guys in my prior academic career. You are always on the front of innovation, particularly in health
JONATHAN CARLSON: I mean, it’s some of what’s in our DNA, I mean, we’ve been publishing in health and life sciences for two decades here. But when we launched Health Futures as a mission-focused lab about 7 or 8 years ago, we really started with the premise that the way to have impact was to really close the loop between, not just good ideas that get published, but good ideas that can actually be grounded in real problems that clinicians and scientists care about, that then allow us to actually go from that first proof of concept into an incubation, into getting real world feedback that allows us to close that loop. And now with, you know, the HLS organization here as a product group, we have the opportunity to work really closely with you all to not just prove what’s possible in the clinic or in the lab, but actually start scaling that into the broader community.
CAMERON RUNDE: And one thing I’ll add here is that the problems that we’re trying to tackle in health
CARLSON: So, Matt, back to you. What are you guys doing in the product group? How do you guys see these models getting into the clinic?
LUNGREN: You know, I think a lot of people, you know, think about AI as just, you know, maybe just even a few years old because of GPT and how that really captured the public’s consciousness. Right?
And so, you think about the speech-to-text technology of being able to dictate something, for a clinic note or for a visit, that was typically based on Nuance technology. And so there’s a lot of product understanding of the market, how to deliver something that clinicians will use, understanding the pain points and workflows and really that Health IT space, which is sometimes the third rail, I feel like with a lot of innovation in healthcare.
But beyond that, I mean, I think now that we have this really powerful engine of Microsoft and the platform capabilities, we’re seeing innovations on the healthcare side for data storage and data interoperability with different types of medical data. You have new applications coming online, the ability, of course, to see generative AI now infused into the speech-to-text and becoming Dragon Copilot, which is something that has been, you know, tremendously well received by the community.
Physicians are able to now just have a conversation with a patient. They turn to their computer and the note is ready for them. There’s no more this, we call it keyboard liberation. I don’t know if you heard that before. And that’s just been tremendous. And there’s so much more coming from that side. And then there’s other parts of the workflow that we also get engaged in — the diagnostic workflow.
So medical imaging, sharing images across different hospital systems, the list goes on. And so now when you move into AI, we feel like there’s a huge opportunity to deliver capabilities into the clinical workflow via the products and solutions we already have. But, I mean, well, now that we’ve kind of expanded our team to involve Azure and platform, we’re really able to now focus on the developers.
WILL GUYMAN: Yeah. And you’re always telling me as a doctor how frustrating it is to be spending time at the computer instead of with your patients. I think you told me, you know, 4,000 clicks a day for the typical doctor, which is tremendous. And something like Dragon Copilot can save that five minutes per patient. But it can also now take actions after the patient encounter so it can draft the after-visit summary.
It can order labs and medications for the referral. And that’s incredible. And we want to keep building on that. There’s so many other use cases across the ecosystem. And so that’s why in Azure AI Foundry, we have translated a lot of the research from Microsoft Research and made that available to developers to build and customize for their own applications.
SMITHA SALIGRAMA: Yeah. And as you were saying, in our transformation of moving from solutions to platforms and scaling solutions to multiple other scenarios, as we put our models in AI Foundry, we provide these developer capabilities like bring your own data and fine
LUNGREN: Well, I want to do a reality check because, you know, I think to us that are now really focused on technology, it seems like I’ve heard this story before, right? I remember even in my academic clinical days where it felt like technology was always the quick answer, and it felt like there was maybe a disconnect between what my problems were or what I think needed to be done versus the solutions that were created or offered to us. And I guess at some level, how, Jonathan, do you think about this? Because to do things well in the science space is one thing, to do things well in science, but then also have it be something that actually drives health
CARLSON: Yeah. I mean, as you said, I think one of the core pathologies of Big Tech is we assume every problem is a technology problem. And that’s all it will take to solve the problem. And I think, look, I was trained as a computational biologist, and that sits in the awkward middle between biology and computation. And the thing that we always have to remember, the thing that we were very acutely aware of when we set out, was that we are not the experts. We do have, you know, you as an M.D., we have everybody on the team, we have biologists on the team.
But this is a big space. And the only way we’re going to have real impact, the only way we’re even going to pick the right problems to work on is if we really partner deeply with providers, with EHR vendors, with scientists, and really understand what’s important and, again, get that feedback loop.
RUNDE: Yeah, I think we really need to ground the work that we do in the science itself. You need to understand the broader ecosystem and the broader landscape, across healthwe think are important. Because, as Jonathan said, we’re not the experts in health
CARLSON: When we really launched this, this mission, 7 or 8 years ago, we really came in with the premise of, if we decide to stop, we want to be sure the world cares. And the only way that’s going to be true is if we’re really deeply embedded with the people that matter–the patients, the providers and the scientists.
LUNGREN: And now it really feels like this collaborative effort, you know, really can help start to extend that mission. Right. I think, you know, Will and Smitha, that we definitely feel the passion and the innovation. And we certainly benefit from those collaborations, too. But then we have these other partners and even customers, right, that we can start to tap into and have that flywheel keep spinning.
GUYMAN: Yeah. And the whole industry is an ecosystem. So, we have our own data sets at Microsoft Research that you’ve trained amazing AI models with. And those are in the catalog. But then you’ve also partnered with institutions like Providence or Paige AI. And those models are in the catalog with their data. And then there are third parties like Nvidia that have their own specialized proprietary data sets, and their models are there too. So, we have this ecosystem of open source models. And maybe Smitha, you want to talk about how developers can actually customize these.
SALIGRAMA: Yeah. So we use the Azure AI Foundry ecosystem. Developers can feel at home if they’re using the AI Foundry. So they can look at our model cards that we publish as part of the models we publish, understand the use cases of these models, how to quickly bring up these APIs, and look at different use cases of how to apply these and even fine
LUNGREN: Yeah it has been interesting to see we have these health
GUYMAN: Well, the general-purpose large language models are amazing for medical general reasoning. So Microsoft Research has shown that they can perform super well on, for example, the United States medical licensing exam; they can exceed doctor performance if they’re just picking between different multiple-choice answers. But real medicine we know is messier. It doesn’t always start with the whole patient context provided as text in the prompt. You have to get the source data, and that raw data is often non-text. The majority of it is non-text. It’s things like medical imaging, radiology, pathology, ophthalmology, dermatology. It goes on and on. And there’s endless signal data, lab data. And so all of these diverse data types need to be processed through specialized models because much of that data is not available on the public internet.
And that’s why we’re taking this partner approach, first party and third party models that can interpret all this kind of data and then connect them ultimately back to these general reasoners to reason over that.
LUNGREN: So, you know, I’ve been at this company for a while and, you know, familiar with kind of how long it takes, generally to get, you know, a really good research paper, do all the studies, do all the data analysis, and then go through the process of publishing, right, which takes, as, you know, a long time and it’s, you know, very rigorous.
And one of the things that struck me, last year, I think we, we started this big collaboration and, within a quarter, you had a Nature paper coming out from Microsoft Research, and that model that the Nature paper was describing was ready to be used by anyone on the Azure AI Foundry within that same quarter. It kind of blew my mind when I thought about it, you know, even though we were all, you know, working very hard to get that done. Any thoughts on that? I mean, has this ever happened in your career? And, you know, what’s the secret sauce to that?
CARLSON: Yeah, I mean, the time scale from research to product has been massively compressed. And I’d push that even further, which is to say, the reason why it took a quarter was because we were laying the railroad tracks as we’re driving the train. We have examples right after that when we are launching on Foundry the same day we were publishing the paper.
And frankly, the review times are becoming longer than it takes to actually productize the models. I think there’s two things going on that are really converging. One is that the overall ecosystem is converging on a relatively small number of patterns, and that gives us, as a tech company, a reason to go off and really make those patterns hardened in a way that allows not just us, but third parties as well, to really have a nice workflow to publish these models.
But the other is actually, I think, a change in how we work, you know, and for most of our history as an industrial research lab, we would do research and then we’d go pitch it to somebody and try and throw it over the fence. We’ve really built a much more integrated team. In fact, if you look at that Nature paper or any of the other papers, there’s folks from product teams. Many of you are on the papers along with our clinical collaborators.
RUNDE: Yeah. I think one thing that’s really important to note is that there’s a ton of different ways that you can have impact, right? So I like to think about phasing. In Health Futures at least, I like to think about phasing the work that we do. So first we have research, which is really early innovation. And the impact there is getting our technology and our tools out there and really sharing the learnings that we’ve had.
So that can be through publications like you mentioned. It can be through open-sourcing our models. And then you go to incubation. So, this is, I think, one of the newer spaces that we’re getting into, which is maybe that blurred line between research and product. Right. Which is, how do we take the tools and technologies that we’ve built and get them into the hands of users, typically through our partnerships?
Right. So, we partner very deeply and collaborate very deeply across the industry. And incubation is really important because we get that early feedback. We get an ability to pivot if we need to. And we also get the ability to see what types of impact our technology is having in the real world. And then lastly, when you think about scale, there’s tons of different ways that you can scale. We can scale third-party through our collaborators and really empower them to go to market to commercialize the things that we’ve built together.
You can also think about scaling internally, which is why I’m so thankful that we’ve created this flywheel between research and product, and a lot of the models that we’ve built that have gone through research, have gone through incubation, have been able to scale on the Azure AI Foundry. But that’s not really our expertise. Right? The scale piece in research, that’s research and incubation. Smitha, how do you think about scaling?
SALIGRAMA: So, there are several angles to scaling the models, the state-of-the-art models we see from the research team. The first angle is the open sourcing, to get developer trust, and very generous commercial licenses so that they can use it for their own use cases. The second is, we also allow them to customize these models, fine
GUYMAN: And as one example, you know, University of Wisconsin Health, you know, which Matt knows well. They took one of our models, which is highly versatile. They customized it in Foundry and they optimized it to reliably identify abnormal chest X-rays, the most common imaging procedure, so they could improve their turnaround time and triage quickly. And that’s just one example. But we have other partners like Sectra who are doing more operations use cases, automatically routing imaging to the radiologists, setting them up to be efficient. And then Paige AI is doing, you know, biomarker identification for actually diagnostics and new drug discovery. So, there’s so many use cases that we have partners already who are building and customizing.
LUNGREN: The part that’s striking to me is just that, you know, we could all sit in a room and think about all the different ways someone might use these models on the catalog. And I’m still shocked at the stuff that people use them for and how effective they are. And I think part of that is, you know, again, we talk a lot about generative AI and healthcare and all the things you can do. Again, you know, in text, you referred to that earlier, and certainly off the shelf, there’s really powerful applications. But there is, you know, kind of this tip of the iceberg effect where under the water, most of the data that we use to take care of our patients is not text. Right. It’s all the different other modalities. And I think that this has been an unlock, right, sort of taking these innovations from the community, putting them in this ecosystem, kind of a catalog, essentially. Right. And then allowing folks to kind of, you know, build and develop applications with all these different types of data. Again, I’ve been surprised at what I’m seeing.
CARLSON: This has been just one of the most profound shifts that’s happened in the last 12 months, really. I mean, two years ago we had general models in text that really shifted how we think about, I mean, natural language processing got totally upended by that. Turns out the same technology works for images as well. It doesn’t only allow you to automatically extract concepts from images, but allows you to align those image concepts with text concepts, which means that you can have a conversation with that image. And once you’re in that world, you are in a place where you can start stitching together these multimodal models that really change how you can interact with the data, and how you can start getting more information out of the raw primary data that is part of the patient journey.
LUNGREN: Well, and we’re going to get to that because I think you just touched on something. And I want to re-emphasize stitching these things together. There’s a lot of different ways to potentially do that. Right? There’s ways that you can literally train the model end to end with adapters and all kinds of other early fusions. All kinds of ways. But the word of, I guess, the year is going to be agents, and an agent is a very interesting term to think about how you might abstract away some of the components or the tasks that you want the model to accomplish in the midst of sort of a real human to maybe model interaction. Can you talk a little bit more about how we’re thinking about agents in this platform approach?
GUYMAN: Well, this is our newest addition to the Azure AI Foundry. So there’s an agent catalog now where we have a set of pre-configured agents for health care. And then we also have a multi-agent orchestrator that can jump
LUNGREN: And I really like that concept because, you know, from the user personas, I think about myself as a user. How am I going to interact with these agents? Where does it naturally fit? And, you know, I’ve seen some of the demonstrations and some of the work that’s going on with Stanford in particular, showing that, you know, literally in a Teams chat, I can have my clinician colleagues and I can have specialized health
It is a completely mind-blowing thing for me. And it’s a light bulb moment for me too. I wonder, what have we heard from folks that have, you know, tried out this health care agent orchestrator in this kind of deployment environment via Teams?
GUYMAN: Well, someone joked, you know, are you sure you’re not using Teams because you work at Microsoft? But then we actually were meeting with one of the radiologists at one of our partners, and they said that that morning they had just done a Teams meeting, or they had met with other specialists to talk about a patient’s cancer case, or they were coming up with a treatment plan.
And that was the light bulb moment for us. We realized, actually, Teams is already being used by physicians as an internal communication tool, as a tool to get work done. And especially since the pandemic, a lot of the meetings moved to virtual and telemedicine. And so it’s a great distribution channel for AI, which has often been a struggle, for AI to actually get in the hands of clinicians. And so now we’re allowing developers to build and then deploy very easily and extend it into their own workflows.
CARLSON: I think that’s such an important point. I mean, if you think about it, one of the really important concepts in computer science is an application programming interface, like some set of rules that allow two applications to talk to each other. One of the big pushes, really important pushes, in medicine has been standards that allow us to actually have data standards and APIs that allow these to talk to each other, and yet still we end up with these silos. There’s silos of data. There’s silos of applications.
And just like when you and I work on our phone, we have to go back and forth between applications. One of the things that I think agents do is that it takes the idea that now you can use language to understand intent and effectively program an interface, and it creates a whole new abstraction layer that allows us to simplify the interaction between not just humans and the endpoint, but also for developers.
It allows us to have this abstraction layer that lets different developers focus on different types of models, and yet stitch them all together in a very, very natural way, not just for the users, but for the ability to actually deploy those models.
SALIGRAMA: Just to add to what Jonathan was mentioning, the other cool thing about the Microsoft Teams user interface is it’s also enterprise ready.
RUNDE: And one important thing that we’re thinking about, is exactly this from the very early research through incubation and then to scale, obviously. Right. And so early on in research, we are actively working with our partners and our collaborators to make sure that we have the right data privacy and consent in place. We’re doing this in incubation as well. And then obviously in scale. Yep.
LUNGREN: So, I think AI has always been thought of as a savior kind of technology. We talked a little bit about how there’s been some ups and downs in terms of the ability for technology to be effective in health care. At the same time, we’re seeing a lot of new innovations that are really making a difference. But then we kind of get, you know, we talked about agents a little bit. It feels like we’re maybe abstracting too far. Maybe it’s things are going too fast, almost. What makes this different? I mean, in your mind is this truly a logical next step or is it going to take some time?
CARLSON: I think there’s a couple things that have happened. I think first, on just a pure technology. What led to ChatGPT? And I like to think of really three major breakthroughs.
The first was new mathematical concepts of attention, which really means that we now have a way that a machine can figure out which parts of the context it should actually focus on, just the way our brains do. Right? I mean, if you’re a clinician and somebody is talking to you, the majority of that conversation is not relevant for the diagnosis. But you know how to zoom in on the parts that matter. That’s a super powerful mathematical concept. The second one is this idea of self-supervision. So, I think one of the fundamental problems of machine learning has been that you have to train on labeled training data and labels are expensive, which means data sets are small, which means the final models are very narrow and brittle. And the idea of self-supervision is that you can just get a model to automatically learn concepts, and in language that is just: predict the next word. And what’s important about that is that it leads to models that can actually manipulate and understand really messy text and pull out what’s important about that, and then stitch that back together in interesting ways.
And the third concept, that came out of those first two, was just the observational scale. And that’s that more is better, more data, more compute, bigger models. And that really leads to a reason to keep investing. And for these models to keep getting better. So that, as a groundwork, that’s what led to ChatGPT. That’s what led to our ability now to not just have rule-based systems or simple machine learning based systems to take a messy EHR record, say, and pull out a couple concepts.
But to really feed the whole thing in and say, okay, I need you to figure out which concepts are in here. And is this particular attribute there, for example. That’s now led to the next breakthrough, which is all those core ideas apply to images as well. They apply to proteins, to DNA. And so we’re starting to see models that understand images and the concepts of images, and can actually map those back to text as well.
So, you can look at a pathology image and say, not just at the cell, but it appears that there’s some certain sort of cancer in this particular tissue there. And then you take those two things together and you layer on the fact that now you have a model, or a set of models, that can understand intent, can understand human concepts and biomedical concepts, and you can start stitching them together into specialized agents that can actually reason with each other, which at some level gives you an API as a developer to say, okay, I need to focus on a pathology model and get this really, really sound while somebody else is focusing on a radiology model, but now allows us to stitch these all together with a user interface that we can now talk to through natural language.
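For readers who want to see the attention idea Carlson describes in concrete form, here is a minimal sketch of scaled dot-product attention for a single query, written in plain Python. The vectors and the function names (`softmax`, `attend`) are made up for illustration; this is a toy under those assumptions, not how any production model discussed in the episode is implemented.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    # Similarity of the query to each piece of context.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted average of the values: relevant context dominates.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Three toy "context" vectors; the second is most aligned with the query.
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [0.0, 4.0]

weights, output = attend(query, keys, values)
```

The weights play the role of "which parts of the context to focus on": the context item most similar to the query receives the largest share.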
RUNDE: I’d like to double click a little bit on that medical abstraction piece that you mentioned. Just the amount of data, clinical data that there is for each individual patient. Let’s think about cancer patients for a second to make this real. Right. For every cancer patient, it could take a couple of hours to structure their information. And why is that important? Because, you have to get that information in a structured way and abstract relevant information to be able to unlock precision health applications right, for each patient. So, to be able to match them to a trial, right, someone has to sit there and go through all of the clinical notes from their entire patient care journey, from the beginning to the end. And that’s not scalable. And so one thing that we’ve been doing in an active project that we’ve been working on with a handful of our partners, but Providence specifically, I’ll call out, is using AI to actually abstract and curate that information. So that gives time back to the health care provider to spend with patients, instead of spending all their time curating this information.
And this is super important because it sets the scene and the backbone for all those precision health applications. Like I mentioned, clinical trial matching, tumor boards is another really important example here. Maybe Matt, you can talk to that a little bit.
LUNGREN: It’s a great example. And you know it’s so funny. We’ve talked about this use case and the you know the health
And a tumor board is a critical meeting that happens at many cancer centers where specialists all get together, come with their perspective, and make a comment on what would be the best next step in treatment. But the background in preparing for that is, you know, again, organizing the data. But to your point, also, what are the clinical trials that are active? There are thousands of clinical trials. There’s hundreds every day added. How can anyone keep up with that? And these are the kinds of use cases that start to bubble up. And you realize that a technology that understands concepts, context and can reason over vast amounts of data with a language interface, that is a powerful tool. Even before we get to some of the, you know, unlocking new insights and even precision medicine, this is that idea of saving time before lives to me. And there’s an enormous amount of undifferentiated heavy lifting that happens in health
GUYMAN: And we’ve packaged these agents, the manual abstraction work that, you know, manually takes hours. Now we have an agent. It’s in Foundry along with the clinical trial matching agent, which I think at Providence you showed could double the match rate over the baseline that they were using by using the AI for multiple data sources. So, we have that and then we have this orchestration that is using this really neat technology from Microsoft Research. Semantic Kernel, Magentic
There’s turn taking, there’s negotiation between the agents. So, there’s this really interesting system that’s emerging. And again, this is all possible to be used through Teams. And there’s some great extensibility as well. We’ve been talking about that and working on some cool tools.
SALIGRAMA: Yeah. Yeah. No, I think if I have to geek out a little bit on how all this agent tech orchestrations are coming up, like I’ve been in software engineering for decades, it’s kind of a next version of distributed systems where you have these services that talk to each other. It’s a more natural way because LLMs are giving these natural ways instead of a structured API ways of conversing. We have these agents which can naturally understand how to talk to each other. Right. So this is like the next evolution of our systems now. And the way we’re packaging all of this is multiple ways based on all the standards and innovation that’s happening in this space. So, first of all, we are building these agents that are very good at specific tasks, like, Will was saying like, a trial matching agent or patient timeline agents.
So, we take all of these, and then we package it in a workflow and an orchestration. We use the standard, some of these coming from research. The Semantic Kernel, the Magentic-One. And then, all of these also allow us to extend these agents with custom agents that can be plugged in. So, we are open sourcing the entire agent orchestration in AI Foundry templates, so that developers can extend their own agents, and make their own workflows out of it. So, a lot of cool innovation happening to apply this technology to specific scenarios and workflows.
LUNGREN: Well, I was going to ask you, like, so as part of that extension. So, like, you know, folks can say, hey, I have maybe a really specific part of my workflow that I want to use some agents for, maybe one of the agents that can do PubMed literature search, for example. But then there’s also agents that come in from the outside, you know, sort of like, I can imagine a software company or AI company that has a built-in agent that plugs in as well.
SALIGRAMA: Yeah. Yeah, absolutely. So, you can bring your own agent. And then we have these standard ways of communicating with agents and integrating with the orchestration language so you can bring your own agent and extend this healthcare agent orchestrator to your own needs.
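As a rough illustration of the "bring your own agent" pattern described here, the following sketch registers plain Python functions as agents and routes a chat message to whichever one is @mentioned. Everything in it is hypothetical: the `Orchestrator` class, the agent names, and the routing rule are invented for this example and are not the API of the healthcare agent orchestrator, Semantic Kernel, or Magentic-One.

```python
from typing import Callable, Dict

# An "agent" here is just a function from a request string to a reply string.
Agent = Callable[[str], str]

class Orchestrator:
    """Hypothetical registry that routes @mentions to registered agents."""

    def __init__(self) -> None:
        self._agents: Dict[str, Agent] = {}

    def register(self, name: str, agent: Agent) -> None:
        # Names are stored lowercase so mentions are case-insensitive.
        self._agents[name.lower()] = agent

    def route(self, message: str) -> str:
        for word in message.split():
            if word.startswith("@"):
                name = word[1:].strip(",.:;!?").lower()
                if name in self._agents:
                    # Remove the mention and pass the rest to the agent.
                    request = message.replace(word, "", 1).strip()
                    return self._agents[name](request)
        return "No registered agent was mentioned."

def trial_matcher(request: str) -> str:
    return f"trial-matching agent received: {request}"

def literature_search(request: str) -> str:
    return f"literature-search agent received: {request}"

orchestrator = Orchestrator()
orchestrator.register("TrialMatcher", trial_matcher)
# A third party could plug in its own agent the same way.
orchestrator.register("LitSearch", literature_search)

reply = orchestrator.route("@TrialMatcher find trials for this patient")
```

A real orchestrator would also handle turn taking and negotiation between agents; this sketch only shows the plug-in and routing idea.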
LUNGREN: I can just think of, like, in a group chat, like a bunch of different specialist agents. And I really would want an orchestrator to help find the right tool, to your point earlier, because I’m guessing this ecosystem is going to expand quickly. Yeah. And I may not know which tool is best for which question. I just want to ask the question. Right.
SALIGRAMA: Yeah. Yeah.
CARLSON: Well, I think to that point too, I mean, you said an important point here, which is tools, and these are not necessarily just AI tools. Right? I mean, we’ve known this for a while, right? LLMs are not very good at math, but you can have it use a calculator and then it works very well. And you know you guys both brought up the universal medical abstraction a couple times.
And one of the things that I find so powerful about that is we’ve long had this vision within the precision health community that we should be able to have a learning hospital system. We should be able to actually learn from the actual real clinical experiences that are happening every day, so that we can stop practicing medicine based off averages.
There’s a lot of work that’s gone on for the last 20 years about how to actually do causal inference. That’s not an AI question. That’s a statistical question. The bottleneck, the reason why we haven’t been able to do that is because most of that information is locked up in unstructured text. And these other tools need essentially a table.
And so now you can decompose this problem, say, well, what if I can use AI not to get to the causal answer, but to just structure the information. So now I can put it into the causal inference tool. And these sorts of patterns I think again become very, not just powerful for a programmer, but they start pulling together different specialties. And I think we’ll really see an acceleration, really, of collaboration across disciplines because of this.
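The decomposition Carlson describes (use AI only to structure the text, then hand a table to a separate statistical tool) can be sketched as a two-step pipeline. In the sketch below a regular expression stands in for the model that does the structuring, and the two notes are invented toy strings, not real clinical data; the field names are assumptions chosen for illustration.

```python
import csv
import io
import re

# Invented toy notes; a real system would read free-text clinical notes.
NOTES = [
    "Patient 001: stage II, received therapy A, responded.",
    "Patient 002: stage III, received therapy B, did not respond.",
]

# Stand-in for the AI abstraction step: a fixed pattern instead of a model.
PATTERN = re.compile(
    r"Patient (?P<patient_id>\d+): stage (?P<stage>[IV]+), "
    r"received therapy (?P<therapy>\w+), (?P<outcome>responded|did not respond)\."
)

def extract_row(note):
    """Turn one unstructured note into a structured row, or None."""
    match = PATTERN.search(note)
    if match is None:
        return None
    row = match.groupdict()
    row["responded"] = row.pop("outcome") == "responded"
    return row

def to_csv(rows):
    """Serialize rows as the table a downstream statistical tool expects."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["patient_id", "stage", "therapy", "responded"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = [row for row in (extract_row(n) for n in NOTES) if row is not None]
table = to_csv(rows)
```

The causal-inference step itself is deliberately left out: as the speaker notes, that is a statistical question, and the table is simply its input.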
CARLSON: So, when I joined Microsoft Research 18 years ago, I was doing work in computational biology. And I would always have to answer the question: why is Microsoft in biomedicine? And I would always kind of joke saying, well, it is. We sell Office and Windows to every health
SALIGRAMA: A lot of healthcare organizations already use Microsoft productivity tools, as you mentioned. So, if developers build these agents and use our healthcare orchestration to plug in these agents and expose them in these productivity tools, they will get access to all these healthcare workers. So the healthcare agent orchestrator we have today integrates with Microsoft Teams, and it showcases an example of how you can @mention these agents and talk to them like you were talking to another person in a Teams chat. And then it also provides examples of these agents and how they can use these productivity tools. One of the examples we have there is how they can summarize the assessments of this whole chat into a Word document, or even convert that into a PowerPoint presentation, for later on.
CARLSON: One of the things that has struck me is how easy it is to do. I mean, Will, I don’t know if you’ve worked with folks that have gone from 0 to 60, like, how fast? What does that look like?
GUYMAN: Yeah, it’s funny. For us, the technology to transfer all this context into a Word document or PowerPoint presentation for a doctor to take to a meeting is relatively straightforward compared to the complicated clinical trial matching and multimodal processing. The feedback has been tremendous in terms of, wow, that saves so much time to have this organized report that then I can show up to a meeting with, and the agents can come with me to that meeting because they’re literally having a Teams meeting, often with other human specialists. And the agents can be there and ask and answer questions and fact check and source all the right information on the fly. So, there’s a nice integration into these existing tools.
LUNGREN: We worked with several different centers just to kind of understand, you know, where this might be useful. And, like, as I think we talked about before, the ideas that we’ve come up with again, this is a great one because it’s complex. It’s kind of hairy. There’s a lot of things happening under the hood that don’t necessarily require a medical license to do, right, to prepare for a tumor board and to organize data. But, it’s fascinating, actually. So, you know, folks have come up with ideas of, could I have an agent that can operate an MRI machine, and I can ask the agent to change some parameters or redo a protocol. We thought that was a pretty powerful use case. We’ve had others that have just said, you know, I really want to have a specific agent that’s able to kind of act like deep research does for the consumer side, but based on the context of my patient, so that it can search all the literature and pull the data in the papers that are relevant to this case. And the list goes on and on from operations all the way to clinical, you know, sort of decision making at some level. And I think that the research community that’s going to sprout around this will help us, guide us, I guess, to see what is the most high-impact use cases. Where is this effective? And maybe where it’s not effective.
But to me, the part that makes me so, I guess excited about this is just that I don’t have to think about, okay, well, then we have to figure out Health IT. Because it’s always, you know, we always have great ideas and research, and it always feels like there’s such a huge chasm to get it in front of the health care workers that might want to test this out. And it feels like, again, this productivity tool use case again with the enterprise security, the possibility for bringing in third parties to contribute really does feel like it’s a new surface area for innovation.
CARLSON: Yeah, I love that. Look. Let me end by putting you all on the spot. So, in three years, multimodal agents will do what? Matt, I’ll start with you.
LUNGREN: I am convinced that it’s going to save a massive amount of time before it saves many lives.
RUNDE: I’ll focus on the patient care journey and diagnostic journey. I think it will kind of transform that process for the patient itself and shorten that process.
GUYMAN: Yeah, I think we’ve seen already papers recently showing that different modalities surfaced complementary information. And so we’ll see kind of this AI and these agents becoming an essential companion to the physician, surfacing insights that would have been overlooked otherwise.
SALIGRAMA: And similar to what you guys were saying, agents will become important assistants to healthcare workers, reducing a lot of documentation and workflow, excess work they have to do.
CARLSON: I love that. And I guess for my part, I think really what we’re going to see is a massive unleash of creativity. We’ve had a lot of folks that have been innovating in this space, but they haven’t had a way to actually get it into the hands of early adopters. And I think we’re going to see that really lead to an explosion of creativity across the ecosystem.
LUNGREN: So, where do we get started? Like where are the developers who are listening to this? The folks that are at, you know, labs, research labs and developing health care solutions. Where do they go to get started with the Foundry, the models we’ve talked about, the healthcare agent orchestrator. Where do they go?
GUYMAN: So AI.azure.com is the AI Foundry. It’s a website you can go to as a developer. You can sign in with your Azure subscription, get your Azure account, your own VM, all that stuff. And you have an agent catalog, the model catalog. You can start from there. There is documentation and templates that you can then deploy to Teams or other applications.
LUNGREN: And tutorials are coming. Right. We have recordings of tutorials. We’ll have Hackathons, some sessions and then more to come. Yeah, we’re really excited.
LUNGREN: Thank you so much, guys for joining us.
CARLSON: Yes. Yeah. Thanks.
SALIGRAMA: Thanks for having us.
#collaborators #healthcare #innovation #impact

Collaborators: Healthcare Innovation to Impact
JONATHAN CARLSON: From the beginning, healthcare stood out to us as an important opportunity for general reasoners to improve the lives and experiences of patients and providers. Indeed, in the past two years, there’s been an explosion of scientific papers looking at the application first of text reasoners in medicine, then multi-modal reasoners that can interpret medical images, and now, most recently, healthcare agents that can reason with each other. But even more impressive than the pace of research has been the surprisingly rapid diffusion of this technology into real-world clinical workflows.
LUNGREN: So today, we’ll talk about how our cross-company collaboration has shortened that gap and delivered advanced AI capabilities and solutions into the hands of developers and clinicians around the world, empowering everyone in health and life sciences to achieve more. I’m Doctor Matt Lungren, chief scientific officer for Microsoft Health and Life Sciences.
CARLSON: And I’m Jonathan Carlson, vice president and managing director of Microsoft Health Futures.
LUNGREN: And together we brought some key players leading in the space of AI and health
CARLSON: We’ve asked these brilliant folks to join us because each of them represents a mission-critical group of cutting-edge stakeholders, scaling breakthroughs into purpose-built solutions and capabilities for health
LUNGREN: We’ll hear today how generative AI capabilities can unlock reasoning across every data type in medicine: text, images, waveforms, genomics. And further, how multi-agent frameworks in healthcare can accelerate complex workflows, in some cases acting as a specialist team member, safely secured inside the Microsoft 365 tools used by hundreds of millions of healthcare enterprise users across the world. The opportunity to save time today and lives tomorrow with AI has never been larger.
MATTHEW LUNGREN: Jonathan. You know, it’s been really interesting kind of observing Microsoft Research over the decades.
I’ve, you know, been watching you guys in my prior academic career. You are always on the front of innovation, particularly in health
JONATHAN CARLSON: I mean, it’s some of what’s in our DNA, I mean, we’ve been publishing in health and life sciences for two decades here. But when we launched Health Futures as a mission-focused lab about 7 or 8 years ago, we really started with the premise that the way to have impact was to really close the loop between, not just good ideas that get published, but good ideas that can actually be grounded in real problems that clinicians and scientists care about, that then allow us to actually go from that first proof of concept into an incubation, into getting real-world feedback that allows us to close that loop. And now with, you know, the HLS organization here as a product group, we have the opportunity to work really closely with you all to not just prove what’s possible in the clinic or in the lab, but actually start scaling that into the broader community.
CAMERON RUNDE: And one thing I’ll add here is that the problems that we’re trying to tackle in health
CARLSON: So, Matt, back to you. What are you guys doing in the product group? How do you guys see these models getting into the clinic?
LUNGREN: You know, I think a lot of people, you know, think about AI as just, you know, maybe just even a few years old because of GPT and how that really captured the public’s consciousness. Right? And so, you think about the speech-to-text technology of being able to dictate something, for a clinic note or for a visit, that was typically based on Nuance technology. And so there’s a lot of product understanding of the market, how to deliver something that clinicians will use, understanding the pain points and workflows and really that Health IT space, which is sometimes the third rail, I feel like, with a lot of innovation in healthcare.
But beyond that, I mean, I think now that we have this really powerful engine of Microsoft and the platform capabilities, we’re seeing innovations on the healthcare side for data storage, data interoperability, with different types of medical data. You have new applications coming online, the ability, of course, to see generative AI now infused into the speech-to-text and becoming Dragon Copilot, which is something that has been, you know, tremendously received by the community. Physicians are able to now just have a conversation with a patient. They turn to their computer and the note is ready for them. There’s no more of this. We call it keyboard liberation. I don’t know if you heard that before. And that’s just been tremendous. And there’s so much more coming from that side. And then there’s other parts of the workflow that we also get engaged in: the diagnostic workflow. So medical imaging, sharing images across different hospital systems, the list goes on. And so now when you move into AI, we feel like there’s a huge opportunity to deliver capabilities into the clinical workflow via the products and solutions we already have. But, I mean, well, now that we’ve kind of expanded our team to involve Azure and platform, we’re really able to now focus on the developers.
WILL GUYMAN: Yeah. And you’re always telling me as a doctor how frustrating it is to be spending time at the computer instead of with your patients. I think you told me, you know, 4,000 clicks a day for the typical doctor, which is tremendous. And something like Dragon Copilot can save that five minutes per patient. But it can also now take actions after the patient encounter so it can draft the after-visit summary. It can order labs and medications for the referral. And that’s incredible. And we want to keep building on that. There’s so many other use cases across the ecosystem.
And so that’s why in Azure AI Foundry, we have translated a lot of the research from Microsoft Research and made that available to developers to build and customize for their own applications.
SMITHA SALIGRAMA: Yeah. And as you were saying, in our transformation of moving from solutions to platforms and scaling solutions to other, multiple scenarios, as we put our models in AI Foundry, we provide these developer capabilities like bring your own data and fine
LUNGREN: Well, I want to do a reality check because, you know, I think to us that are now really focused on technology, it seems like, I’ve heard this story before, right. I remember even in my academic clinical days where it felt like technology was always the quick answer and it felt like there was maybe a disconnect between what my problems were or what I think needed to be done versus kind of the solutions that were kind of created or offered to us. And I guess at some level, how, Jonathan, do you think about this? Because to do things well in the science space is one thing, to do things well in science, but then also have it be something that actually drives health
CARLSON: Yeah. I mean, as you said, I think one of the core pathologies of Big Tech is we assume every problem is a technology problem. And that’s all it will take to solve the problem. And I think, look, I was trained as a computational biologist, and that sits in the awkward middle between biology and computation. And the thing that we always have to remember, the thing that we were very acutely aware of when we set out, was that we are not the experts. We do have, you know, you as an M.D., we have everybody on the team, we have biologists on the team. But this is a big space.
And the only way we’re going to have real impact, the only way we’re even going to pick the right problems to work on is if we really partner deeply, with providers, with EHR vendors, with scientists, and really understand what’s important and again, get that feedback loop.
RUNDE: Yeah, I think we really need to ground the work that we do in the science itself. You need to understand the broader ecosystem and the broader landscape, across healthwe think are important. Because, as Jonathan said, we’re not the experts in health
CARLSON: When we really launched this mission, 7 or 8 years ago, we really came in with the premise of, if we decide to stop, we want to be sure the world cares. And the only way that’s going to be true is if we’re really deeply embedded with the people that matter: the patients, the providers and the scientists.
LUNGREN: And now it really feels like this collaborative effort, you know, really can help start to extend that mission. Right. I think, you know, Will and Smitha, that we definitely feel the passion and the innovation. And we certainly benefit from those collaborations, too. But then we have these other partners and even customers, right, that we can start to tap into and have that flywheel keep spinning.
GUYMAN: Yeah. And the whole industry is an ecosystem. So, we have our own data sets at Microsoft Research that you’ve trained amazing AI models with. And those are in the catalog. But then you’ve also partnered with institutions like Providence or Paige AI. And those models are in the catalog with their data. And then there are third parties like Nvidia that have their own specialized proprietary data sets, and their models are there too. So, we have this ecosystem of open source models. And maybe Smitha, you want to talk about how developers can actually customize these.
SALIGRAMA: Yeah. So we use the Azure AI Foundry ecosystem. Developers can feel at home if they’re using the AI Foundry.
So they can look at our model cards that we publish as part of the models we publish, understand the use cases of these models, how to quickly bring up these APIs and look at different use cases of how to apply these and even fine
LUNGREN: Yeah it has been interesting to see we have these health
GUYMAN: Well, the general-purpose large language models are amazing for medical general reasoning. So Microsoft Research has shown that they can perform super well on, for example, like the United States medical licensing exam, they can exceed doctor performance if they’re just picking between different multiple-choice questions. But real medicine we know is messier. It doesn’t always start with the whole patient context provided as text in the prompt. You have to get the source data and that raw data is often non-text. The majority of it is non-text. It’s things like medical imaging, radiology, pathology, ophthalmology, dermatology. It goes on and on. And there’s endless signal data, lab data. And so all of this diverse data needs to be processed through specialized models because much of that data is not available on the public internet. And that’s why we’re taking this partner approach, first party and third party models that can interpret all this kind of data and then connect them ultimately back to these general reasoners to reason over that.
LUNGREN: So, you know, I’ve been at this company for a while and, you know, familiar with kind of how long it takes, generally, to get, you know, a really good research paper, do all the studies, do all the data analysis, and then go through the process of publishing, right, which takes, as you know, a long time and it’s, you know, very rigorous.
And one of the things that struck me, last year, I think we started this big collaboration and, within a quarter, you had a Nature paper coming out from Microsoft Research, and that model that the Nature paper was describing was ready to be used by anyone on the Azure AI Foundry within that same quarter. It kind of blew my mind when I thought about it, you know, even though we were all, you know, working very hard to get that done. Any thoughts on that? I mean, has this ever happened in your career? And, you know, what’s the secret sauce to that?
CARLSON: Yeah, I mean, the time scale from research to product has been massively compressed. And I’d push that even further, which is to say, the reason why it took a quarter was because we were laying the railroad tracks as we’re driving the train. We have examples right after that when we are launching on Foundry the same day we were publishing the paper. And frankly, the review times are becoming longer than it takes to actually productize the models. I think there’s two things going on that are really converging. One is that the overall ecosystem is converging on a relatively small number of patterns, and that gives us, as a tech company, a reason to go off and really make those patterns hardened in a way that allows not just us, but third parties as well, to really have a nice workflow to publish these models. But the other is actually, I think, a change in how we work, you know, and for most of our history as an industrial research lab, we would do research and then we’d go pitch it to somebody and try and throw it over the fence. We’ve really built a much more integrated team. In fact, if you look at that Nature paper or any of the other papers, there’s folks from product teams. Many of you are on the papers along with our clinical collaborators.
RUNDE: Yeah. I think one thing that’s really important to note is that there’s a ton of different ways that you can have impact, right?
So I like to think about phasing. In Health Futures at least, I like to think about phasing the work that we do. So first we have research, which is really early innovation. And the impact there is getting our technology and our tools out there and really sharing the learnings that we’ve had. So that can be through publications like you mentioned. It can be through open-sourcing our models. And then you go to incubation. So, this is, I think, one of the newer spaces that we’re getting into, which is maybe that blurred line between research and product. Right. Which is, how do we take the tools and technologies that we’ve built and get them into the hands of users, typically through our partnerships? Right. So, we partner very deeply and collaborate very deeply across the industry. And incubation is really important because we get that early feedback. We get an ability to pivot if we need to. And we also get the ability to see what types of impact our technology is having in the real world. And then lastly, when you think about scale, there’s tons of different ways that you can scale. We can scale third-party through our collaborators and really empower them to go to market to commercialize the things that we’ve built together. You can also think about scaling internally, which is why I’m so thankful that we’ve created this flywheel between research and product, and a lot of the models that we’ve built that have gone through research, have gone through incubation, have been able to scale on the Azure AI Foundry. But that’s not really our expertise. Right? The scale piece in research, that’s research and incubation. Smitha, how do you think about scaling?
SALIGRAMA: So, there are several angles to scaling the models, the state-of-the-art models we see from the research team. The first angle is the open sourcing, to get developer trust, and very generous commercial licenses so that they can use it for their own use cases.
The second is, we also allow them to customize these models, fine GUYMAN: And as one example, you know, University of Wisconsin Health, you know, which Matt knows well. They took one of our models, which is highly versatile. They customized it in Foundry and they optimized it to reliably identify abnormal chest X-rays, the most common imaging procedure, so they could improve their turnaround time, triage quickly. And that’s just one example. But we have other partners like Sectra who are doing more of operations use cases, automatically routing imaging to the radiologists, setting them up to be efficient. And then Paige AI is doing, you know, biomarker identification for actually diagnostics and new drug discovery. So, there’s so many use cases that we have partners already who are building and customizing. LUNGREN: The part that’s striking to me is just that, you know, we could all sit in a room and think about all the different ways someone might use these models on the catalog. And I’m still shocked at the stuff that people use them for and how effective they are. And I think part of that is, you know, again, we talk a lot about generative AI and healthcare and all the things you can do. Again, you know, in text, you referred to that earlier, and certainly off the shelf, there’s really powerful applications. But there is, you know, kind of this tip of the iceberg effect where under the water, most of the data that we use to take care of our patients is not text. Right. It’s all the different other modalities. And I think that this has been an unlock, right, sort of taking these innovations from the community, putting them in this ecosystem kind of catalog, essentially. Right. And then allowing folks to kind of, you know, build and develop applications with all these different types of data. Again, I’ve been surprised at what I’m seeing. CARLSON: This has been just one of the most profound shifts that’s happened in the last 12 months, really. 
I mean, two years ago we had general models in text that really shifted how we think about, I mean, natural language processing got totally upended by that. Turns out the same technology works for images as well. It doesn’t only allow you to automatically extract concepts from images, but allows you to align those image concepts with text concepts, which means that you can have a conversation with that image. And once you’re in that world, now you are in a place where you can start stitching together these multimodal models that really change how you can interact with the data, and how you can start getting more information out of the raw primary data that is part of the patient journey. LUNGREN: Well, and we’re going to get to that because I think you just touched on something. And I want to re-emphasize stitching these things together. There’s a lot of different ways to potentially do that. Right? There’s ways that you can literally train the model end to end with adapters and all kinds of other early-fusion approaches. All kinds of ways. But the word of the year, I guess, is going to be agents, and an agent is a very interesting term to think about how you might abstract away some of the components or the tasks that you want the model to accomplish in the midst of sort of a real human to maybe model interaction. Can you talk a little bit more about how we’re thinking about agents in this platform approach? GUYMAN: Well, this is our newest addition to the Azure AI Foundry. So there’s an agent catalog now where we have a set of pre-configured agents for health care. And then we also have a multi-agent orchestrator that can jump LUNGREN: And I really like that concept because, you know, from the user personas, I think about myself as a user. How am I going to interact with these agents? Where does it naturally fit? 
And I sort of, you know, I’ve seen some of the demonstrations and some of the work that’s going on with Stanford in particular, showing that, you know, and literally in a Teams chat, I can have my clinician colleagues and I can have specialized health It is a completely mind-blowing thing for me. And it’s a light bulb moment for me too. I wonder, what have we heard from folks that have, you know, tried out this health care agent orchestrator in this kind of deployment environment via Teams? GUYMAN: Well, someone joked, you know, are you sure you’re not using Teams because you work at Microsoft? But then we actually were meeting with one of the radiologists at one of our partners, and they said that that morning they had just done a Teams meeting, or they had met with other specialists to talk about a patient’s cancer case, or they were coming up with a treatment plan. And that was the light bulb moment for us. We realized, actually, Teams is already being used by physicians as an internal communication tool, as a tool to get work done. And especially since the pandemic, a lot of the meetings moved to virtual and telemedicine. And so it’s a great distribution channel for AI, which has often been a struggle, for AI to actually get in the hands of clinicians. And so now we’re allowing developers to build and then deploy very easily and extend it into their own workflows. CARLSON: I think that’s such an important point. I mean, if you think about it, one of the really important concepts in computer science is an application programming interface, like some set of rules that allow two applications to talk to each other. One of the big pushes, really important pushes, in medicine has been standards that allow us to actually have data standards and APIs that allow these to talk to each other, and yet still we end up with these silos. There’s silos of data. There’s silos of applications. 
And just like when you and I work on our phone, we have to go back and forth between applications. One of the things that I think agents do is that it takes the idea that now you can use language to understand intent and effectively program an interface, and it creates a whole new abstraction layer that allows us to simplify the interaction between not just humans and the endpoint, but also for developers. It allows us to have this abstraction layer that lets different developers focus on different types of models, and yet stitch them all together in a very, very natural, way, not just for the users, but for the ability to actually deploy those models. SALIGRAMA: Just to add to what Jonathan was mentioning, the other cool thing about the Microsoft Teams user interface is it’s also enterprise ready. RUNDE: And one important thing that we’re thinking about, is exactly this from the very early research through incubation and then to scale, obviously. Right. And so early on in research, we are actively working with our partners and our collaborators to make sure that we have the right data privacy and consent in place. We’re doing this in incubation as well. And then obviously in scale. Yep. LUNGREN: So, I think AI has always been thought of as a savior kind of technology. We talked a little bit about how there’s been some ups and downs in terms of the ability for technology to be effective in health care. At the same time, we’re seeing a lot of new innovations that are really making a difference. But then we kind of get, you know, we talked about agents a little bit. It feels like we’re maybe abstracting too far. Maybe it’s things are going too fast, almost. What makes this different? I mean, in your mind is this truly a logical next step or is it going to take some time? CARLSON: I think there’s a couple things that have happened. I think first, on just a pure technology. What led to ChatGPT? And I like to think of really three major breakthroughs. 
The first was new mathematical concepts of attention, which really means that we now have a way that a machine can figure out which parts of the context it should actually focus on, just the way our brains do. Right? I mean, if you’re a clinician and somebody is talking to you, the majority of that conversation is not relevant for the diagnosis. But, you know how to zoom in on the parts that matter. That’s a super powerful mathematical concept. The second one is this idea of self-supervision. So, I think one of the fundamental problems of machine learning has been that you have to train on labeled training data and labels are expensive, which means data sets are small, which means the final models are very narrow and brittle. And the idea of self-supervision is that you can just get a model to automatically learn concepts, and the language is just predict the next word. And what’s important about that is that leads to models that can actually manipulate and understand really messy text and pull out what’s important about that, and then and then stitch that back together in interesting ways. And the third concept, that came out of those first two, was just the observational scale. And that’s that more is better, more data, more compute, bigger models. And that really leads to a reason to keep investing. And for these models to keep getting better. So that as a as a groundwork, that’s what led to ChatGPT. That’s what led to our ability now to not just have rule-based systems or simple machine learning based systems to take a messy EHR record, say, and pull out a couple concepts. But to really feed the whole thing in and say, okay, I need you to figure out which concepts are in here. And is this particular attribute there, for example. That’s now led to the next breakthrough, which is all those core ideas apply to images as well. They apply to proteins, to DNA. 
And so we’re starting to see models that understand images and the concepts of images, and can actually map those back to text as well. So, you can look at a pathology image and say, not just at the cell, but it appears that there’s some certain sort of cancer in this particular, tissue there. And then you take those two things together and you layer on the fact that now you have a model, or a set of models, that can understand intent, can understand human concepts and biomedical concepts, and you can start stitching them together into specialized agents that can actually reason with each other, which at some level gives you an API as a developer to say, okay, I need to focus on a pathology model and get this really, really, sound while somebody else is focusing on a radiology model, but now allows us to stitch these all together with a user interface that we can now talk to through natural language. RUNDE: I’d like to double click a little bit on that medical abstraction piece that you mentioned. Just the amount of data, clinical data that there is for each individual patient. Let’s think about cancer patients for a second to make this real. Right. For every cancer patient, it could take a couple of hours to structure their information. And why is that important? Because, you have to get that information in a structured way and abstract relevant information to be able to unlock precision health applications right, for each patient. So, to be able to match them to a trial, right, someone has to sit there and go through all of the clinical notes from their entire patient care journey, from the beginning to the end. And that’s not scalable. And so one thing that we’ve been doing in an active project that we’ve been working on with a handful of our partners, but Providence specifically, I’ll call out, is using AI to actually abstract and curate that information. 
So that gives time back to the health care provider to spend with patients, instead of spending all their time curating this information. And this is super important because it sets the scene and the backbone for all those precision health applications. Like I mentioned, clinical trial matching; tumor boards is another really important example here. Maybe Matt, you can talk to that a little bit. LUNGREN: It’s a great example. And you know it’s so funny. We’ve talked about this use case and the you know the health And a tumor board is a critical meeting that happens at many cancer centers where specialists all get together, come with their perspective, and make a comment on what would be the best next step in treatment. But the background in preparing for that is, you know, again, organizing the data. But to your point, also, what are the clinical trials that are active? There are thousands of clinical trials. There’s hundreds every day added. How can anyone keep up with that? And these are the kinds of use cases that start to bubble up. And you realize that a technology that understands concepts, context and can reason over vast amounts of data with a language interface, that is a powerful tool. Even before we get to some of the, you know, unlocking new insights and even precision medicine, this is that idea of saving time before lives to me. And there’s an enormous amount of undifferentiated heavy lifting that happens in health GUYMAN: And we’ve packaged these agents, the manual abstraction work that, you know, manually takes hours. Now we have an agent. It’s in Foundry along with the clinical trial matching agent, which I think at Providence you showed could double the match rate over the baseline that they were using by using the AI for multiple data sources. So, we have that and then we have this orchestration that is using this really neat technology from Microsoft Research. Semantic Kernel, Magentic There’s turn taking, there’s negotiation between the agents. 
So, there’s this really interesting system that’s emerging. And again, this is all possible to be used through Teams. And there’s some great extensibility as well. We’ve been talking about that and working on some cool tools. SALIGRAMA: Yeah. Yeah. No, I think if I have to geek out a little bit on how all this agent tech orchestrations are coming up, like I’ve been in software engineering for decades, it’s kind of a next version of distributed systems where you have these services that talk to each other. It’s a more natural way because LLMs are giving these natural ways instead of a structured API ways of conversing. We have these agents which can naturally understand how to talk to each other. Right. So this is like the next evolution of our systems now. And the way we’re packaging all of this is multiple ways based on all the standards and innovation that’s happening in this space. So, first of all, we are building these agents that are very good at specific tasks, like, Will was saying like, a trial matching agent or patient timeline agents. So, we take all of these, and then we package it in a workflow and an orchestration. We use the standard, some of these coming from research. The Semantic Kernel, the Magentic-One. And then, all of these also allow us to extend these agents with custom agents that can be plugged in. So, we are open sourcing the entire agent orchestration in AI Foundry templates, so that developers can extend their own agents, and make their own workflows out of it. So, a lot of cool innovation happening to apply this technology to specific scenarios and workflows. LUNGREN: Well, I was going to ask you, like, so as part of that extension. So, like, you know, folks can say, hey, I have maybe a really specific part of my workflow that I want to use some agents for, maybe one of the agents that can do PubMed literature search, for example. 
But then there’s also agents that come in from the outside, you know, sort of like, I can imagine a software company or AI company that has a built-in agent that plugs in as well. SALIGRAMA: Yeah. Yeah, absolutely. So, you can bring your own agent. And then we have these standard ways of communicating with agents and integrating with the orchestration language so you can bring your own agent and extend this health care agent orchestrator to your own needs. LUNGREN: I can just think of, like, in a group chat, like a bunch of different specialist agents. And I really would want an orchestrator to help find the right tool, to your point earlier, because I’m guessing this ecosystem is going to expand quickly. Yeah. And I may not know which tool is best for which question. I just want to ask the question. Right. SALIGRAMA: Yeah. Yeah. CARLSON: Well, I think to that point too, I mean, you said an important point here, which is tools, and these are not necessarily just AI tools. Right? I mean, we’ve known this for a while, right? LLMs are not very good at math, but you can have it use a calculator and then it works very well. And you know you guys both brought up the universal medical abstraction a couple times. And one of the things that I find so powerful about that is we’ve long had this vision within the precision health community that we should be able to have a learning hospital system. We should be able to actually learn from the actual real clinical experiences that are happening every day, so that we can stop practicing medicine based off averages. There’s a lot of work that’s gone on for the last 20 years about how to actually do causal inference. That’s not an AI question. That’s a statistical question. The bottleneck, the reason why we haven’t been able to do that is because most of that information is locked up in unstructured text. And these other tools need essentially a table. 
And so now you can decompose this problem, say, well, what if I can use AI not to get to the causal answer, but to just structure the information. So now I can put it into the causal inference tool. And these sorts of patterns I think again become very, not just powerful for a programmer, but they start pulling together different specialties. And I think we’ll really see an acceleration, really, of collaboration across disciplines because of this. CARLSON: So, when I joined Microsoft Research 18 years ago, I was doing work in computational biology. And I would always have to answer the question: why is Microsoft in biomedicine? And I would always kind of joke saying, well, it is. We sell Office and Windows to every health SALIGRAMA: A lot of healthcare organizations already use Microsoft productivity tools, as you mentioned. So, they asked the developers, build these agents, and use our healthcare orchestrations, to plug in these agents and expose these in these productivity tools. They will get access to all these healthcare workers. So the healthcare agent orchestrator we have today integrates with Microsoft Teams, and it showcases an example of how you can at-mention these agents and talk to them like you were talking to another person in a Teams chat. And then it also provides examples of these agents and how they can use these productivity tools. One of the examples we have there is how they can summarize the assessments of this whole chat into a Word Doc, or even convert that into a PowerPoint presentation, for later on. CARLSON: One of the things that has struck me is how easy it is to do. I mean, Will, I don’t know if you’ve worked with folks that have gone from 0 to 60, like, how fast? What does that look like? 
GUYMAN: Yeah, it’s funny for us, the technology to transfer all this context into a Word Document or PowerPoint presentation for a doctor to take to a meeting is relatively straightforward compared to the complicated clinical trial matching multimodal processing. The feedback has been tremendous in terms of, wow, that saves so much time to have this organized report that then I can show up to a meeting with, and the agents can come with me to that meeting because they’re literally having a Teams meeting, often with other human specialists. And the agents can be there and ask and answer questions and fact check and source all the right information on the fly. So, there’s a nice integration into these existing tools. LUNGREN: We worked with several different centers just to kind of understand, you know, where this might be useful. And, like, as I think we talked about before, the ideas that we’ve come up with, again, this is a great one because it’s complex. It’s kind of hairy. There’s a lot of things happening under the hood that don’t necessarily require a medical license to do, right, to prepare for a tumor board and to organize data. But, it’s fascinating, actually. So, you know, folks have come up with ideas of, could I have an agent that can operate an MRI machine, and I can ask the agent to change some parameters or redo a protocol. We thought that was a pretty powerful use case. We’ve had others that have just said, you know, I really want to have a specific agent that’s able to kind of act like deep research does for the consumer side, but based on the context of my patient, so that it can search all the literature and pull the data in the papers that are relevant to this case. And the list goes on and on from operations all the way to clinical, you know, sort of decision making at some level. And I think that the research community that’s going to sprout around this will help us, guide us, I guess, to see what are the most high-impact use cases. 
Where is this effective? And maybe where it’s not effective. But to me, the part that makes me so, I guess excited about this is just that I don’t have to think about, okay, well, then we have to figure out Health IT. Because it’s always, you know, we always have great ideas and research, and it always feels like there’s such a huge chasm to get it in front of the health care workers that might want to test this out. And it feels like, again, this productivity tool use case again with the enterprise security, the possibility for bringing in third parties to contribute really does feel like it’s a new surface area for innovation. CARLSON: Yeah, I love that. Look. Let me end by putting you all on the spot. So, in three years, multimodal agents will do what? Matt, I’ll start with you. LUNGREN: I am convinced that it’s going to save massive amount of time before it saves many lives. RUNDE: I’ll focus on the patient care journey and diagnostic journey. I think it will kind of transform that process for the patient itself and shorten that process. GUYMAN: Yeah, I think we’ve seen already papers recently showing that different modalities surfaced complementary information. And so we’ll see kind of this AI and these agents becoming an essential companion to the physician, surfacing insights that would have been overlooked otherwise. SALIGRAMA: And similar to what you guys were saying, agents will become important assistants to healthcare workers, reducing a lot of documentation and workflow, excess work they have to do. CARLSON: I love that. And I guess for my part, I think really what we’re going to see is a massive unleash of creativity. We’ve had a lot of folks that have been innovating in this space, but they haven’t had a way to actually get it into the hands of early adopters. And I think we’re going to see that really lead to an explosion of creativity across the ecosystem. LUNGREN: So, where do we get started? 
Like where are the developers who are listening to this? The folks that are at, you know, labs, research labs and developing health care solutions. Where do they go to get started with the Foundry, the models we’ve talked about, the healthcare agent orchestrator? Where do they go? GUYMAN: So AI.azure.com is the AI Foundry. It’s a website you can go to as a developer. You can sign in with your Azure subscription, get your Azure account, your own VM, all that stuff. And you have an agent catalog, the model catalog. You can start from there. There is documentation and templates that you can then deploy to Teams or other applications. LUNGREN: And tutorials are coming. Right. We have recordings of tutorials. We’ll have Hackathons, some sessions and then more to come. Yeah, we’re really excited. LUNGREN: Thank you so much, guys, for joining us. CARLSON: Yes. Yeah. Thanks. SALIGRAMA: Thanks for having us. #collaborators #healthcare #innovation #impact

Collaborators: Healthcare Innovation to Impact

www.microsoft.com
JONATHAN CARLSON: From the beginning, healthcare stood out to us as an important opportunity for general reasoners to improve the lives and experiences of patients and providers. Indeed, in the past two years, there’s been an explosion of scientific papers looking at the application first of text reasoners in medicine, then multi-modal reasoners that can interpret medical images, and now, most recently, healthcare agents that can reason with each other. But even more impressive than the pace of research has been the surprisingly rapid diffusion of this technology into real world clinical workflows. LUNGREN: So today, we’ll talk about how our cross-company collaboration has shortened that gap and delivered advanced AI capabilities and solutions into the hands of developers and clinicians around the world, empowering everyone in health and life sciences to achieve more. I’m Doctor Matt Lungren, chief scientific officer for Microsoft Health and Life Sciences. CARLSON: And I’m Jonathan Carlson, vice president and managing director of Microsoft Health Futures. LUNGREN: And together we brought some key players leading in the space of AI and health CARLSON: We’ve asked these brilliant folks to join us because each of them represents a mission critical group of cutting-edge stakeholders, scaling breakthroughs into purpose-built solutions and capabilities for health LUNGREN: We’ll hear today how generative AI capabilities can unlock reasoning across every data type in medicine: text, images, waveforms, genomics. And further, how multi-agent frameworks in healthcare can accelerate complex workflows, in some cases acting as a specialist team member, safely secured inside the Microsoft 365 tools used by hundreds of millions of healthcare enterprise users across the world. The opportunity to save time today and lives tomorrow with AI has never been larger. [MUSIC FADES] MATTHEW LUNGREN: Jonathan. 
You know, it’s been really interesting kind of observing Microsoft Research over the decades. I’ve, you know, been watching you guys in my prior academic career. You are always on the front of innovation, particularly in health JONATHAN CARLSON: I mean, it’s some of what’s in our DNA, I mean, we’ve been publishing in health and life sciences for two decades here. But when we launched Health Futures as a mission-focused lab about 7 or 8 years ago, we really started with the premise that the way to have impact was to really close the loop between, not just good ideas that get published, but good ideas that can actually be grounded in real problems that clinicians and scientists care about, that then allow us to actually go from that first proof of concept into an incubation, into getting real world feedback that allows us to close that loop. And now with, you know, the HLS organization here as a product group, we have the opportunity to work really closely with you all to not just prove what’s possible in the clinic or in the lab, but actually start scaling that into the broader community. CAMERON RUNDE: And one thing I’ll add here is that the problems that we’re trying to tackle in health CARLSON: So, Matt, back to you. What are you guys doing in the product group? How do you guys see these models getting into the clinic? LUNGREN: You know, I think a lot of people, you know, think about AI as just, you know, maybe just even a few years old because of GPT and how that really captured the public’s consciousness. Right? And so, you think about the speech-to-text technology of being able to dictate something, for a clinic note or for a visit, that was typically based on Nuance technology. And so there’s a lot of product understanding of the market, how to deliver something that clinicians will use, understanding the pain points and workflows and really that Health IT space, which is sometimes the third rail, I feel like, with a lot of innovation in healthcare. 
But beyond that, I mean, I think now that we have this really powerful engine of Microsoft and the platform capabilities, we’re seeing innovations on the healthcare side for data storage, data interoperability, with different types of medical data. You have new applications coming online, the ability, of course, to see generative AI now infused into the speech-to-text and becoming Dragon Copilot, which is something that has been, you know, tremendously well received by the community. Physicians are able to now just have a conversation with a patient. They turn to their computer and the note is ready for them. There’s no more this, we call it keyboard liberation. I don’t know if you heard that before. And that’s just been tremendous. And there’s so much more coming from that side. And then there’s other parts of the workflow that we also get engaged in, the diagnostic workflow. So medical imaging, sharing images across different hospital systems, the list goes on. And so now when you move into AI, we feel like there’s a huge opportunity to deliver capabilities into the clinical workflow via the products and solutions we already have. But, I mean, well, now that we’ve kind of expanded our team to involve Azure and platform, we’re really able to now focus on the developers. WILL GUYMAN: Yeah. And you’re always telling me as a doctor how frustrating it is to be spending time at the computer instead of with your patients. I think you told me, you know, 4,000 clicks a day for the typical doctor, which is tremendous. And something like Dragon Copilot can save that five minutes per patient. But it can also now take actions after the patient encounter so it can draft the after-visit summary. It can order labs and medications for the referral. And that’s incredible. And we want to keep building on that. There’s so many other use cases across the ecosystem. 
And so that’s why in Azure AI Foundry, we have translated a lot of the research from Microsoft Research and made that available to developers to build and customize for their own applications. SMITHA SALIGRAMA: Yeah. And as you were saying, in our transformation of moving from solutions to platforms and as, scaling solutions to other, multiple scenarios, as we put our models in AI Foundry, we provide these developer capabilities like bring your own data and fine LUNGREN: Well, I want to do a reality check because, you know, I think to us that are now really focused on technology, it seems like, I’ve heard this story before, right. I, I remember even in, my academic clinical days where it felt like technology was always the quick answer and it felt like technology was, there was maybe a disconnect between what my problems were or what I think needed to be done versus kind of the solutions that were kind of, created or offered to us. And I guess at some level, how Jonathan, do you think about this? Because to do things well in the science space is one thing, to do things well in science, but then also have it be something that actually drives health CARLSON: Yeah. I mean, as you said, I think one of the core pathologies of Big Tech is we assume every problem is a technology problem. And that’s all it will take to solve the problem. And I think, look, I was trained as a computational biologist, and that sits in the awkward middle between biology and computation. And the thing that we always have to remember, the thing that we were very acutely aware of when we set out, was that we are not the experts. We do have, you know, you as an M.D., we have everybody on the team, we have biologists on the team. But this is a big space. 
And the only way we’re going to have real impact, the only way we’re even going to pick the right problems to work on is if we really partner deeply, with providers, with EHR (electronic health records) vendors, with scientists, and really understand what’s important and again, get that feedback loop. RUNDE: Yeah, I think we really need to ground the work that we do in the science itself. You need to understand the broader ecosystem and the broader landscape, across healthwe think are important. Because, as Jonathan said, we’re not the experts in health CARLSON: When we really launched this mission, 7 or 8 years ago, we really came in with the premise of, if we decide to stop, we want to be sure the world cares. And the only way that’s going to be true is if we’re really deeply embedded with the people that matter: the patients, the providers and the scientists. LUNGREN: And now it really feels like this collaborative effort, you know, really can help start to extend that mission. Right. I think, you know, Will and Smitha, that we definitely feel the passion and the innovation. And we certainly benefit from those collaborations, too. But then we have these other partners and even customers, right, that we can start to tap into and have that flywheel keep spinning. GUYMAN: Yeah. And the whole industry is an ecosystem. So, we have our own data sets at Microsoft Research that you’ve trained amazing AI models with. And those are in the catalog. But then you’ve also partnered with institutions like Providence or Paige AI. And those models are in the catalog with their data. And then there are third parties like Nvidia that have their own specialized proprietary data sets, and their models are there too. So, we have this ecosystem of open source models. And maybe Smitha, you want to talk about how developers can actually customize these. SALIGRAMA: Yeah. So we use the Azure AI Foundry ecosystem. Developers can feel at home if they’re using the AI Foundry. 
So they can look at the model cards that we publish with our models, understand the use cases of these models, how to quickly bring up these APIs, and look at different use cases of how to apply these and even fine LUNGREN: Yeah it has been interesting to see we have these health GUYMAN: Well, the general-purpose large language models are amazing for general medical reasoning. So Microsoft Research has shown that they can perform super well on, for example, the United States medical licensing exam; they can exceed doctor performance if they’re just picking between different multiple-choice answers. But real medicine, we know, is messier. It doesn’t always start with the whole patient context provided as text in the prompt. You have to get the source data, and that raw data is often non-text. The majority of it is non-text. It’s things like medical imaging, radiology, pathology, ophthalmology, dermatology. It goes on and on. And there’s endless signal data, lab data. And so all of these diverse data types need to be processed through specialized models, because much of that data is not available on the public internet. And that’s why we’re taking this partner approach, first-party and third-party models that can interpret all this kind of data and then connect them ultimately back to these general reasoners to reason over that. LUNGREN: So, you know, I’ve been at this company for a while and, you know, familiar with kind of how long it takes, generally, to get, you know, a really good research paper, do all the studies, do all the data analysis, and then go through the process of publishing, right, which takes, as you know, a long time and it’s, you know, very rigorous. 
And one of the things that struck me, last year, I think we, we started this big collaboration and, within a quarter, you had a Nature paper coming out from Microsoft Research, and that model that the Nature paper was describing was ready to be used by anyone on the Azure AI Foundry within that same quarter. It kind of blew my mind when I thought about it, you know, even though we were all, you know, working very hard to get that done. Any thoughts on that? I mean, has this ever happened in your career? And, you know, what’s the secret sauce to that? CARLSON: Yeah, I mean, the time scale from research to product has been massively compressed. And I’d push that even further, which is to say, the reason why it took a quarter was because we were laying the railroad tracks as we’re driving the train. We have examples right after that when we are launching on Foundry the same day we were publishing the paper. And frankly, the review times are becoming longer than it takes to actually productize the models. I think there’s two things that are going on with that are really converging. One is that the overall ecosystem is converging on a relatively small number of patterns, and that gives us, as a tech company, a reason to go off and really make those patterns hardened in a way that allows not just us, but third parties as well, to really have a nice workflow to publish these models. But the other is actually, I think, a change in how we work, you know, and for most of our history as an industrial research lab, we would do research and then we’d go pitch it to somebody and try and throw it over the fence. We’ve really built a much more integrated team. In fact, if you look at that Nature paper or any of the other papers, there’s folks from product teams. Many of you are on the papers along with our clinical collaborators. RUNDE: Yeah. I think one thing that’s really important to note is that there’s a ton of different ways that you can have impact, right? 
So I like to think about phasing. In Health Futures at least, I like to think about phasing the work that we do. So first we have research, which is really early innovation. And the impact there is getting our technology and our tools out there and really sharing the learnings that we’ve had. So that can be through publications like you mentioned. It can be through open-sourcing our models. And then you go to incubation. So, this is, I think, one of the more new spaces that we’re getting into, which is maybe that blurred line between research and product. Right. Which is, how do we take the tools and technologies that we’ve built and get them into the hands of users, typically through our partnerships? Right. So, we partner very deeply and collaborate very deeply across the industry. And incubation is really important because we get that early feedback. We get an ability to pivot if we need to. And we also get the ability to see what types of impact our technology is having in the real world. And then lastly, when you think about scale, there’s tons of different ways that you can scale. We can scale third-party through our collaborators and really empower them to go to market to commercialize the things that we’ve built together. You can also think about scaling internally, which is why I’m so thankful that we’ve created this flywheel between research and product, and a lot of the models that we’ve built that have gone through research, have gone through incubation, have been able to scale on the Azure AI Foundry. But that’s not really our expertise. Right? The scale piece in research, that’s research and incubation. Smitha, how do you think about scaling? SALIGRAMA: So, there are several angles to scaling the models, the state-of-the-art models we see from the research team. The first angle is, the open sourcing, to get developer trust, and very generous commercial licenses so that they can use it and for their own, use cases. 
The second is, we also allow them to customize these models, fine GUYMAN: And as one example, you know, University of Wisconsin Health, you know, which Matt knows well. They took one of our models, which is highly versatile. They customized it in Foundry and they optimized it to reliably identify abnormal chest X-rays, the most common imaging procedure, so they could improve their turnaround time triage quickly. And that’s just one example. But we have other partners like Sectra who are doing more of operations use cases automatically routing imaging to the radiologists, setting them up to be efficient. And then Page AI is doing, you know, biomarker identification for actually diagnostics and new drug discovery. So, there’s so many use cases that we have partners already who are building and customizing. LUNGREN: The part that’s striking to me is just that, you know, we could all sit in a room and think about all the different ways someone might use these models on the catalog. And I’m still shocked at the stuff that people use them for and how effective they are. And I think part of that is, you know, again, we talk a lot about generative AI and healthcare and all the things you can do. Again, you know, in text, you refer to that earlier and certainly off the shelf, there’s really powerful applications. But there is, you know, kind of this tip of the iceberg effect where under the water, most of the data that we use to take care of our patients is not text. Right. It’s all the different other modalities. And I think that this has been an unlock right, sort of taking these innovations, innovations from the community, putting them in this ecosystem kind of catalog, essentially. Right. And then allowing folks to kind of, you know, build and develop applications with all these different types of data. Again, I’ve been surprised at what I’m seeing. CARLSON: This has been just one of the most profound shifts that’s happened in the last 12 months, really. 
I mean, two years ago we had general models in text that really shifted how we think about, I mean, natural language processing got totally upended by that. Turns out the same technology works for images as well. It doesn’t only allow you to automatically extract concepts from images, but allows you to align those image concepts with text concepts, which means that you can have a conversation with that image. And once you’re in that world now, you are a place where you can start stitching together these multimodal models that really change how you can interact with the data, and how you can start getting more information out of the raw primary data that is part of the patient journey. LUNGREN: Well, and we’re going to get to that because I think you just touched on something. And I want to re-emphasize stitching these things together. There’s a lot of different ways to potentially do that. Right? There’s ways that you can literally train the model end to end with adapters and all kinds of other early fusion fusions. All kinds of ways. But one of the things that the word of the I guess the year is going to be agents and an agent is a very interesting term to think about how you might abstract away some of the components or the tasks that you want the model to, to accomplish in the midst of sort of a real human to maybe model interaction. Can you talk a little bit more about, how we’re thinking about agents in this, in this platform approach?  GUYMAN: Well, this is our newest addition to the Azure AI Foundry. So there’s an agent catalog now where we have a set of pre-configured agents for health care. And then we also have a multi-agent orchestrator that can jump LUNGREN: And, and I really like that concept because, you know, as, as a, as a from the user personas, I think about myself as a user. How am I going to interact with these agents? Where does it naturally fit? 
And I and I sort of, you know, I’ve seen some of the demonstrations and some of the work that’s going on with Stanford in particular, showing that, you know, and literally in a Teams chat, I can have my clinician colleagues and I can have specialized health It is a completely mind-blowing thing for me. And it’s a light bulb moment for me to I wonder, what have we, what have we heard from folks that have, you know, tried out this health care agent orchestrator in this kind of deployment environment via Teams? GUYMAN: Well, someone joked, you know, are you sure you’re not using Teams because you work at Microsoft? [LAUGHS] But, then we actually were meeting with one of the, radiologists at one of our partners, and they said that that morning they had just done a Teams meeting, or they had met with other specialists to talk about a patient’s cancer case, or they were coming up with a treatment plan. And that was the light bulb moment for us. We realized, actually, Teams is already being used by physicians as an internal communication tool, as a tool to get work done. And especially since the pandemic, a lot of the meetings moved to virtual and telemedicine. And so it’s a great distribution channel for AI, which is often been a struggle for AI to actually get in the hands of clinicians. And so now we’re allowing developers to build and then deploy very easily and extend it into their own workflows. CARLSON: I think that’s such an important point. I mean, if you think about one of the really important concepts in computer science is an application programing interface, like some set of rules that allow two applications to talk to each other. One of the big pushes, really important pushes, in medicine has been standards that allow us to actually have data standards and APIs that allow these to talk to each other, and yet still we end up with these silos. There’s silos of data. There’s silos of applications. 
And just like when you and I work on our phone, we have to go back and forth between applications. One of the things that I think agents do is that it takes the idea that now you can use language to understand intent and effectively program an interface, and it creates a whole new abstraction layer that allows us to simplify the interaction between not just humans and the endpoint, but also for developers. It allows us to have this abstraction layer that lets different developers focus on different types of models, and yet stitch them all together in a very, very natural, way, not just for the users, but for the ability to actually deploy those models. SALIGRAMA: Just to add to what Jonathan was mentioning, the other cool thing about the Microsoft Teams user interface is it’s also enterprise ready. RUNDE: And one important thing that we’re thinking about, is exactly this from the very early research through incubation and then to scale, obviously. Right. And so early on in research, we are actively working with our partners and our collaborators to make sure that we have the right data privacy and consent in place. We’re doing this in incubation as well. And then obviously in scale. Yep. LUNGREN: So, I think AI has always been thought of as a savior kind of technology. We talked a little bit about how there’s been some ups and downs in terms of the ability for technology to be effective in health care. At the same time, we’re seeing a lot of new innovations that are really making a difference. But then we kind of get, you know, we talked about agents a little bit. It feels like we’re maybe abstracting too far. Maybe it’s things are going too fast, almost. What makes this different? I mean, in your mind is this truly a logical next step or is it going to take some time? CARLSON: I think there’s a couple things that have happened. I think first, on just a pure technology. What led to ChatGPT? And I like to think of really three major breakthroughs. 
The first was new mathematical concepts of attention, which really means that we now have a way that a machine can figure out which parts of the context it should actually focus on, just the way our brains do. Right? I mean, if you’re a clinician and somebody is talking to you, the majority of that conversation is not relevant for the diagnosis. But, you know how to zoom in on the parts that matter. That’s a super powerful mathematical concept. The second one is this idea of self-supervision. So, I think one of the fundamental problems of machine learning has been that you have to train on labeled training data and labels are expensive, which means data sets are small, which means the final models are very narrow and brittle. And the idea of self-supervision is that you can just get a model to automatically learn concepts, and the language is just predict the next word. And what’s important about that is that leads to models that can actually manipulate and understand really messy text and pull out what’s important about that, and then and then stitch that back together in interesting ways. And the third concept, that came out of those first two, was just the observational scale. And that’s that more is better, more data, more compute, bigger models. And that really leads to a reason to keep investing. And for these models to keep getting better. So that as a as a groundwork, that’s what led to ChatGPT. That’s what led to our ability now to not just have rule-based systems or simple machine learning based systems to take a messy EHR record, say, and pull out a couple concepts. But to really feed the whole thing in and say, okay, I need you to figure out which concepts are in here. And is this particular attribute there, for example. That’s now led to the next breakthrough, which is all those core ideas apply to images as well. They apply to proteins, to DNA. 
And so we’re starting to see models that understand images and the concepts of images, and can actually map those back to text as well. So, you can look at a pathology image and say, not just at the cell, but it appears that there’s some certain sort of cancer in this particular, tissue there. And then you take those two things together and you layer on the fact that now you have a model, or a set of models, that can understand intent, can understand human concepts and biomedical concepts, and you can start stitching them together into specialized agents that can actually reason with each other, which at some level gives you an API as a developer to say, okay, I need to focus on a pathology model and get this really, really, sound while somebody else is focusing on a radiology model, but now allows us to stitch these all together with a user interface that we can now talk to through natural language. RUNDE: I’d like to double click a little bit on that medical abstraction piece that you mentioned. Just the amount of data, clinical data that there is for each individual patient. Let’s think about cancer patients for a second to make this real. Right. For every cancer patient, it could take a couple of hours to structure their information. And why is that important? Because, you have to get that information in a structured way and abstract relevant information to be able to unlock precision health applications right, for each patient. So, to be able to match them to a trial, right, someone has to sit there and go through all of the clinical notes from their entire patient care journey, from the beginning to the end. And that’s not scalable. And so one thing that we’ve been doing in an active project that we’ve been working on with a handful of our partners, but Providence specifically, I’ll call out, is using AI to actually abstract and curate that information. 
So that gives time back to the health care provider to spend with patients, instead of spending all their time curating this information. And this is super important because it sets the scene and the backbone for all those precision health applications. Like I mentioned, clinical trial matching, tumor boards is another really important example here. Maybe Matt, you can talk to that a little bit. LUNGREN: It’s a great example. And you know it’s so funny. We’ve talked about this use case and the you know the health And a tumor board is a critical meeting that happens at many cancer centers where specialists all get together, come with their perspective, and make a comment on what would be the best next step in treatment. But the background in preparing for that is you know, again, organizing the data. But to your point, also, what are the clinical trials that are active? There are thousands of clinical trials. There’s hundreds every day added. How can anyone keep up with that? And these are the kinds of use cases that start to bubble up. And you realize that a technology that understands concepts, context and can reason over vast amounts of data with a language interface-that is a powerful tool. Even before we get to some of the, you know, unlocking new insights and even precision medicine, this is that idea of saving time before lives to me. And there’s an enormous amount of undifferentiated heavy lifting that happens in health GUYMAN: And we’ve packaged these agents, the manual abstraction work that, you know, manually takes hours. Now we have an agent. It’s in Foundry along with the clinical trial matching agent, which I think at Providence you showed could double the match rate over the baseline that they were using by using the AI for multiple data sources. So, we have that and then we have this orchestration that is using this really neat technology from Microsoft Research. Semantic Kernel, Magentic There’s turn taking, there’s negotiation between the agents. 
So, there’s this really interesting system that’s emerging. And again, this is all possible to be used through Teams. And there’s some great extensibility as well. We’ve been talking about that and working on some cool tools. SALIGRAMA: Yeah. Yeah. No, I think if I have to geek out a little bit on how all this agent tech orchestrations are coming up, like I’ve been in software engineering for decades, it’s kind of a next version of distributed systems where you have these services that talk to each other. It’s a more natural way because LLMs are giving these natural ways instead of a structured API ways of conversing. We have these agents which can naturally understand how to talk to each other. Right. So this is like the next evolution of our systems now. And the way we’re packaging all of this is multiple ways based on all the standards and innovation that’s happening in this space. So, first of all, we are building these agents that are very good at specific tasks, like, Will was saying like, a trial matching agent or patient timeline agents. So, we take all of these, and then we package it in a workflow and an orchestration. We use the standard, some of these coming from research. The Semantic Kernel, the Magentic-One. And then, all of these also allow us to extend these agents with custom agents that can be plugged in. So, we are open sourcing the entire agent orchestration in AI Foundry templates, so that developers can extend their own agents, and make their own workflows out of it. So, a lot of cool innovation happening to apply this technology to specific scenarios and workflows. LUNGREN: Well, I was going to ask you, like, so as part of that extension. So, like, you know, folks can say, hey, I have maybe a really specific part of my workflow that I want to use some agents for, maybe one of the agents that can do PubMed literature search, for example. 
But then there’s also agents that come in from the outside, you know, sort of like, I can imagine a software company or AI company that has a built-in agent that plugs in as well. SALIGRAMA: Yeah. Yeah, absolutely. So, you can bring your own agent. And then we have these standard ways of communicating with agents and integrating with the orchestration language, so you can bring your own agent and extend this health care agent orchestrator to your own needs. LUNGREN: I can just think of, like, in a group chat, like a bunch of different specialist agents. And I really would want an orchestrator to help find the right tool, to your point earlier, because I’m guessing this ecosystem is going to expand quickly. Yeah. And I may not know which tool is best for which question. I just want to ask the question. Right. SALIGRAMA: Yeah. Yeah. CARLSON: Well, I think to that point too, I mean, you said an important point here, which is tools, and these are not necessarily just AI tools. Right? I mean, we’ve known this for a while, right? LLMs are not very good at math, but you can have one use a calculator and then it works very well. And you know you guys both brought up the universal medical abstraction a couple times. And one of the things that I find so powerful about that is we’ve long had this vision within the precision health community that we should be able to have a learning hospital system. We should be able to actually learn from the actual real clinical experiences that are happening every day, so that we can stop practicing medicine based off averages. There’s a lot of work that’s gone on for the last 20 years about how to actually do causal inference. That’s not an AI question. That’s a statistical question. The bottleneck, the reason why we haven’t been able to do that, is because most of that information is locked up in unstructured text. And these other tools need essentially a table. 
And so now you can decompose this problem, say, well, what if I can use AI not to get to the causal answer, but to just structure the information. So now I can put it into the causal inference tool. And these sorts of patterns I think again become very, not just powerful for a programmer, but they start pulling together different specialties. And I think we’ll really see an acceleration, really, of collaboration across disciplines because of this. CARLSON: So, when I joined Microsoft Research 18 years ago, I was doing work in computational biology. And I would always have to answer the question: why is Microsoft in biomedicine? And I would always kind of joke saying, well, it is. We sell Office and Windows to every health SALIGRAMA: A lot of healthcare organizations already use Microsoft productivity tools, as you mentioned. So, they asked the developers, build these agents, and use our healthcare orchestrations, to plug in these agents and expose these in these productivity tools. They will get access to all these healthcare workers. So the healthcare agent orchestrator we have today integrates with Microsoft Teams, and it showcases an example of how you can at (@) mention these agents and talk to them like you were talking to another person in a Teams chat. And then it also provides examples of these agents and how they can use these productivity tools. One of the examples we have there is how they can summarize the assessments of this whole chat into a Word Doc, or even convert that into a PowerPoint presentation, for later on. CARLSON: One of the things that has struck me is how easy it is to do. I mean, Will, I don’t know if you’ve worked with folks that have gone from 0 to 60, like, how fast? What does that look like? 
GUYMAN: Yeah, it’s funny for us, the technology to transfer all this context into a Word Document or PowerPoint presentation for a doctor to take to a meeting is relatively straightforward compared to the complicated clinical trial matching multimodal processing. The feedback has been tremendous in terms of, wow, that saves so much time to have this organized report that then I can show up to meeting with and the agents can come with me to that meeting because they’re literally having a Teams meeting, often with other human specialists. And the agents can be there and ask and answer questions and fact check and source all the right information on the fly. So, there’s a nice integration into these existing tools. LUNGREN: We worked with several different centers just to kind of understand, you know, where this might be useful. And, like, as I think we talked about before, the ideas that we’ve come up with again, this is a great one because it’s complex. It’s kind of hairy. There’s a lot of things happening under the hood that don’t necessarily require a medical license to do, right, to prepare for a tumor board and to organize data. But, it’s fascinating, actually. So, you know, folks have come up with ideas of, could I have an agent that can operate an MRI machine, and I can ask the agent to change some parameters or redo a protocol. We thought that was a pretty powerful use case. We’ve had others that have just said, you know, I really want to have a specific agent that’s able to kind of act like deep research does for the consumer side, but based on the context of my patient, so that it can search all the literature and pull the data in the papers that are relevant to this case. And the list goes on and on from operations all the way to clinical, you know, sort of decision making at some level. And I think that the research community that’s going to sprout around this will help us, guide us, I guess, to see what is the most high-impact use cases. 
Where is this effective? And maybe where it’s not effective. But to me, the part that makes me so, I guess excited about this is just that I don’t have to think about, okay, well, then we have to figure out Health IT. Because it’s always, you know, we always have great ideas and research, and it always feels like there’s such a huge chasm to get it in front of the health care workers that might want to test this out. And it feels like, again, this productivity tool use case again with the enterprise security, the possibility for bringing in third parties to contribute really does feel like it’s a new surface area for innovation. CARLSON: Yeah, I love that. Look. Let me end by putting you all on the spot. So, in three years, multimodal agents will do what? Matt, I’ll start with you. LUNGREN: I am convinced that it’s going to save massive amount of time before it saves many lives. RUNDE: I’ll focus on the patient care journey and diagnostic journey. I think it will kind of transform that process for the patient itself and shorten that process. GUYMAN: Yeah, I think we’ve seen already papers recently showing that different modalities surfaced complementary information. And so we’ll see kind of this AI and these agents becoming an essential companion to the physician, surfacing insights that would have been overlooked otherwise. SALIGRAMA: And similar to what you guys were saying, agents will become important assistants to healthcare workers, reducing a lot of documentation and workflow, excess work they have to do. CARLSON: I love that. And I guess for my part, I think really what we’re going to see is a massive unleash of creativity. We’ve had a lot of folks that have been innovating in this space, but they haven’t had a way to actually get it into the hands of early adopters. And I think we’re going to see that really lead to an explosion of creativity across the ecosystem. LUNGREN: So, where do we get started? 
Like where are the developers who are listening to this? The folks that are at, you know, labs, research labs and developing health care solutions. Where do they go to get started with the Foundry, the models we’ve talked about, the healthcare agent orchestrator? Where do they go? GUYMAN: So AI.azure.com is the AI Foundry. It’s a website you can go to as a developer. You can sign in with your Azure subscription, get your Azure account, your own VM, all that stuff. And you have an agent catalog, the model catalog. You can start from there. There is documentation and templates that you can then deploy to Teams or other applications. LUNGREN: And tutorials are coming. Right. We have recordings of tutorials. We’ll have hackathons, some sessions and then more to come. Yeah, we’re really excited. [MUSIC] LUNGREN: Thank you so much, guys, for joining us. CARLSON: Yes. Yeah. Thanks. SALIGRAMA: Thanks for having us. [MUSIC FADES]


Microsoft Academic @MicrosoftAcademic shared a link
2025-05-19 16:51:49 ·

Magentic-UI, an experimental human-centered web agent

Modern productivity is rooted in the web—from searching for information and filling in forms to navigating dashboards. Yet, many of these tasks remain manual and repetitive. Today, we are introducing Magentic-UI, a new open-source research prototype of a human-centered agent that is meant to help researchers study open questions on human-in-the-loop approaches and oversight mechanisms for AI agents. This prototype collaborates with users on web-based tasks and operates in real time over a web browser. Unlike other computer use agents that aim for full autonomy, Magentic-UI offers a transparent and controllable experience for tasks that are action-oriented and
Magentic-UI builds on Magentic-One, a powerful multi-agent team we released last year, and is powered by AutoGen, our leading agent framework. It is available under the MIT license and on Azure AI Foundry Labs, the hub where developers, startups, and enterprises can explore groundbreaking innovations from Microsoft Research. Magentic-UI is integrated with Azure AI Foundry models and agents. Learn more about how to integrate Azure AI agents into the Magentic-UI multi-agent architecture by following this code sample.
Magentic-UI can perform tasks that require browsing the web, writing and executing Python and shell code, and understanding files. Its key features include:

Collaborative planning with users. Magentic-UI allows users to directly modify its plan through a plan editor or by providing textual feedback before Magentic-UI executes any actions.
Collaborative execution with users. Users can pause the system and give feedback in natural language or demonstrate it by directly taking control of the browser.
Safety with human-in-the-loop. Magentic-UI seeks user approval before executing potentially irreversible actions, and the user can specify how often Magentic-UI needs approvals. Furthermore, Magentic-UI is sandboxed for the safe operation of tools such as browsers and code executors.
Learning from experience. Magentic-UI can learn and save plans from previous interactions to improve task completion for future tasks.

Figure 1: Screenshot of Magentic-UI actively performing a task. The left side of the screen shows Magentic-UI stating its plan and progress to accomplish a user’s complex goal. The right side shows the browser Magentic-UI is controlling.
How is Magentic-UI human-centered?
While many web agents promise full autonomy, in practice users can be left unsure of what the agent can do, what it is currently doing, and whether they have enough control to intervene when something goes wrong or doesn’t occur as expected. By contrast, Magentic-UI considers user needs at every stage of interaction. We followed a human-centered design methodology in building Magentic-UI by prototyping and obtaining feedback from pilot users during its design.
Figure 2: Co-planning – Users can collaboratively plan with Magentic-UI.
For example, after a person specifies a task and before Magentic-UI even begins to execute, it creates a clear step-by-step plan that outlines what it would do to accomplish the task. People can collaborate with Magentic-UI to modify this plan and then give final approval for Magentic-UI to begin execution. This is crucial as users may have expectations of how the task should be completed; communicating that information could significantly improve agent performance. We call this feature co-planning.
During execution, Magentic-UI shows in real time what specific actions it is about to take, for example, whether it is about to click a button or input a search query. It also shows in real time what it observed on the web pages it is visiting. Users can take control of the action at any point in time and give control back to the agent. We call this feature co-tasking.
Figure 3: Co-tasking – Magentic-UI provides real-time updates about what it is about to do and what it already did, allowing users to collaboratively complete tasks with the agent.
Figure 4: Action-guards – Magentic-UI will ask users for permission before executing actions that it deems consequential or important.
Additionally, Magentic-UI asks for user permission before performing actions that are deemed irreversible, such as closing a tab or clicking a button with side effects. We call these “action guards”. The user can also configure Magentic-UI’s action guards to always ask for permission before performing any action. If the user deems an action risky, they can reject it.

Figure 5: Plan learning – Once a task is successfully completed, users can request Magentic-UI to learn a step-by-step plan from this experience.
After execution, the user can ask Magentic-UI to reflect on the conversation and infer and save a step-by-step plan for future similar tasks. Users can view and modify saved plans for Magentic-UI to reuse in the future in a saved-plans gallery. In a future session, users can launch Magentic-UI with the saved plan to either execute the same task again, like checking the price of a specific flight, or use the plan as a guide to help complete similar tasks, such as checking the price of a different type of flight.
Combined, these four features—co-planning, co-tasking, action guards, and plan learning—enable users to collaborate effectively with Magentic-UI.
Architecture
Magentic-UI’s underlying system is a team of specialized agents adapted from AutoGen’s Magentic-One system. The agents work together to create a modular system:

Orchestrator is the lead agent, powered by a large language model, that performs co-planning with the user, decides when to ask the user for feedback, and delegates sub-tasks to the remaining agents to complete.
WebSurfer is an LLM agent equipped with a web browser that it can control. Given a request by the Orchestrator, it can click, type, scroll, and visit pages in multiple rounds to complete the request from the Orchestrator.
Coder is an LLM agent equipped with a Docker code-execution container. It can write and execute Python and shell commands and provide a response back to the Orchestrator.
FileSurfer is an LLM agent equipped with a Docker code-execution container and file-conversion tools from the MarkItDown package. It can locate files in the directory controlled by Magentic-UI, convert files to markdown, and answer questions about them.

Figure 6: System architecture diagram of Magentic-UI
To interact with Magentic-UI, users can enter a text message and attach images. In response, Magentic-UI creates a natural-language step-by-step plan with which users can interact through a plan-editing interface. Users can add, delete, edit, and regenerate steps, and write follow-up messages to iterate on the plan. While editing the plan adds an upfront cost to the interaction, it can potentially save a significant amount of time when the agent executes the plan and increase its chance of success.
The plan is stored inside the Orchestrator and is used to execute the task. For each step of the plan, the Orchestrator determines which of the agents or the user should complete the step. Once that decision is made, the Orchestrator sends a request to one of the agents or the user and waits for a response. After the response is received, the Orchestrator decides whether that step is complete. If it is, the Orchestrator moves on to the following step.
Once all steps are completed, the Orchestrator generates a final answer that is presented to the user. If, while executing any of the steps, the Orchestrator decides that the plan is inadequate, the Orchestrator can replan with user permission and start executing a new plan.
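The step-by-step loop described above can be sketched roughly as follows. This is a minimal illustration, not Magentic-UI's actual implementation: the names `run_plan`, `StepResult`, `PlanInadequate`, and the `max_attempts` retry limit are all hypothetical, and the two decisions that the Orchestrator makes with an LLM (who should handle a step, and whether the step is complete) are passed in as plain functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical handler type: takes a plan step, returns a text response.
Handler = Callable[[str], str]


@dataclass
class StepResult:
    step: str
    assignee: str
    response: str


class PlanInadequate(Exception):
    """Signals that a step could not be completed and a replan is needed."""


def run_plan(
    plan: List[str],
    handlers: Dict[str, Handler],
    choose_assignee: Callable[[str], str],
    is_complete: Callable[[str, str], bool],
    max_attempts: int = 3,  # assumption: the source does not state a retry limit
) -> List[StepResult]:
    """Run each step in order; move on only once a step is judged complete."""
    results: List[StepResult] = []
    for step in plan:
        for _ in range(max_attempts):
            assignee = choose_assignee(step)      # an agent or the user
            response = handlers[assignee](step)   # send request, wait for reply
            if is_complete(step, response):
                results.append(StepResult(step, assignee, response))
                break
        else:
            # No attempt succeeded: the caller would ask the user to replan.
            raise PlanInadequate(step)
    return results


# Stub handlers standing in for the real agents and the user.
handlers: Dict[str, Handler] = {
    "WebSurfer": lambda step: f"browsed: {step}",
    "Coder": lambda step: f"ran code: {step}",
    "user": lambda step: f"user did: {step}",
}


def choose_assignee(step: str) -> str:
    return "Coder" if step.startswith("compute") else "WebSurfer"


results = run_plan(
    ["search for flight prices", "compute the average price"],
    handlers,
    choose_assignee,
    lambda step, response: True,
)
print([r.assignee for r in results])  # ['WebSurfer', 'Coder']
```

Raising an exception when a step cannot be completed keeps the replanning decision outside the loop, which mirrors the point that a new plan starts only with user permission.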
All intermediate progress steps are clearly displayed to the user. Furthermore, the user can pause the execution of the plan and send additional requests or feedback. The user can also configure through the interface whether agent actions require approval.
Evaluating Magentic-UI
Magentic-UI innovates through its ability to integrate human feedback in its planning and execution of tasks. We performed a preliminary automated evaluation to showcase this ability on the GAIA benchmark for agents with a user-simulation experiment.
Evaluation with simulated users
Figure 7: Comparison on the GAIA validation set of the accuracy of Magentic-One, Magentic-UI in autonomous mode, Magentic-UI with a simulated user powered by a smarter LLM than the Magentic-UI agents, Magentic-UI with a simulated user that has access to side information about the tasks, and human performance. This shows that human-in-the-loop can improve the accuracy of autonomous agents, bridging the gap to human performance at a fraction of the cost.
GAIA is a benchmark for general AI assistants, with multimodal question-answer pairs that are challenging, requiring the agents to navigate the web, process files, and execute code. The traditional evaluation setup with GAIA assumes the system will autonomously complete the task and return an answer, which is compared to the ground-truth answer.
To evaluate the human-in-the-loop capabilities of Magentic-UI, we transform GAIA into an interactive benchmark by introducing the concept of a simulated user. Simulated users provide value in two ways: by having specific expertise that the agent may not possess, and by providing guidance on how the task should be performed.
We experiment with two types of simulated users to show the value of human-in-the-loop: (1) a simulated user that is more intelligent than the Magentic-UI agents and (2) a simulated user with the same intelligence as the Magentic-UI agents but with additional information about the task. During co-planning, Magentic-UI takes feedback from this simulated user to improve its plan. During co-tasking, Magentic-UI can ask the simulated user for help when it gets stuck. Finally, if Magentic-UI does not provide a final answer, then the simulated user provides an answer instead.
The simulated user is an LLM without any tools, instructed to interact with Magentic-UI the way we expect a human would act. The first type of simulated user relies on OpenAI’s o4-mini, a model that is more performant at many tasks than the one powering the Magentic-UI agents. For the second type of simulated user, we use GPT-4o for both the simulated user and the rest of the agents, but the user has access to side information about each task. Each task in GAIA has side information, which includes a human-written plan to solve the task. While this plan is not used as input in the traditional benchmark, in our interactive setting we provide this information to the second type of simulated user, which is powered by an LLM, so that it can mimic a knowledgeable user. Importantly, we tuned our simulated user so as not to reveal the ground-truth answer directly, as the answer is usually found inside the human-written plan. Instead, it is prompted to guide Magentic-UI indirectly. We found that this tuning prevented the simulated user from inadvertently revealing the answer in all but 6% of tasks when Magentic-UI provides a final answer.
On the validation subset of GAIA, we show the results of Magentic-One operating in autonomous mode, Magentic-UI operating in autonomous mode, Magentic-UI with the first type of simulated user, Magentic-UI with the second type of simulated user, and human performance. We first note that Magentic-UI in autonomous mode is within a margin of error of the performance of Magentic-One. Note that the same LLM is used for Magentic-UI and Magentic-One.
Magentic-UI with the simulated user that has access to side information improves the accuracy of autonomous Magentic-UI by a relative 71%, from a 30.3% task-completion rate to a 51.9% task-completion rate. Moreover, Magentic-UI only asks for help from the simulated user in 10% of tasks and relies on the simulated user for the final answer in 18% of tasks. In those tasks where it does ask for help, it asks on average 1.1 times. Magentic-UI with the simulated user powered by a smarter model improves to 42.6%, where Magentic-UI asks for help in only 4.3% of tasks, asking an average of 1.7 times in those tasks. This demonstrates the potential of even lightweight human feedback for improving the performance of autonomous agents, especially at a fraction of the cost compared to people completing tasks entirely manually.
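As a quick check of the arithmetic, the 71% figure is a relative improvement over the autonomous baseline rather than a gain in percentage points. The short sketch below recomputes it from the task-completion rates quoted above; the function name is ours, and the relative figure for the smarter-model user is derived here, not stated in the post.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Task-completion rates (percent) quoted in the text.
autonomous = 30.3
with_side_info = 51.9
with_smarter_model = 42.6

side_info_gain = relative_improvement(autonomous, with_side_info)
smarter_gain = relative_improvement(autonomous, with_smarter_model)

print(f"side information: +{side_info_gain:.1f}% relative")  # +71.3%
print(f"smarter model:    +{smarter_gain:.1f}% relative")    # +40.6%
print(f"absolute gain:    {with_side_info - autonomous:.1f} points")  # 21.6
```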
Learning and reusing plans
As described above, once Magentic-UI completes a task, users have the option for Magentic-UI to learn a plan based on the execution of the task. These plans are saved in a plan gallery, which users and Magentic-UI can access in the future.
The user can select a plan from the plan gallery, which is displayed by clicking on the Saved Plans button. Alternatively, as a user enters a task that closely matches a previous task, the saved plan will be displayed even before the user is done typing. If no identical task is found, Magentic-UI can use AutoGen’s Task-Centric Memory to retrieve plans for any similar tasks. Our preliminary evaluations show that this retrieval is highly accurate, and that recalling a saved plan can be around 3x faster than generating a new plan. Once a plan is recalled or generated, the user can always accept it, modify it, or ask Magentic-UI to modify it for the specific task at hand.
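The two-stage lookup described above (an identical task first, then a similar one) can be illustrated with a small sketch. It does not use or reproduce AutoGen’s Task-Centric Memory; as a stand-in, it scores similarity with Python’s standard-library `difflib.SequenceMatcher`. The function name, the gallery layout (lower-cased task text mapped to a list of steps), and the 0.6 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional


def find_saved_plan(
    task: str,
    gallery: Dict[str, List[str]],
    threshold: float = 0.6,  # assumption: arbitrary similarity cut-off
) -> Optional[List[str]]:
    """Return the saved plan for an identical task, else the most similar one."""
    key = task.strip().lower()
    if key in gallery:  # stage 1: identical task
        return gallery[key]

    best_key, best_score = None, 0.0
    for saved_task in gallery:  # stage 2: closest saved task
        score = SequenceMatcher(None, key, saved_task).ratio()
        if score > best_score:
            best_key, best_score = saved_task, score
    if best_key is not None and best_score >= threshold:
        return gallery[best_key]
    return None  # nothing close enough: generate a new plan instead


flight_plan = ["open the airline site", "search the route", "read the price"]
gallery = {"check the price of a flight from seattle to boston": flight_plan}

same = find_saved_plan("Check the price of a flight from Seattle to Boston", gallery)
similar = find_saved_plan("Check the price of a flight from Seattle to Denver", gallery)
unrelated = find_saved_plan("Summarize this PDF", gallery)
```

Here `same` and `similar` both resolve to the saved flight plan, while `unrelated` is `None`. In either matching case the recalled plan is only a starting point, since the user can still accept or modify it.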
Safety and control
Magentic-UI can surf the live internet and execute code. With such capabilities, we need to ensure that Magentic-UI acts in a safe and secure manner. The following features, design decisions, and evaluations were made to ensure this:

Allow-list: Users can set a list of websites that Magentic-UI is allowed to access. If Magentic-UI needs to access a website outside of the allow-list, users must explicitly approve it through the interface.
Anytime interruptions: At any point while Magentic-UI is completing the task, the user can interrupt Magentic-UI and stop any pending code execution or web browsing.
Docker sandboxing: Magentic-UI controls a browser that is launched inside a Docker container with no credentials, which avoids risks with logged-in accounts and credentials. Moreover, any code execution is also performed inside a separate Docker container to avoid affecting the host environment in which Magentic-UI is running. This is illustrated in the system architecture of Magentic-UI.
Detection and approval of irreversible agent actions: Users can configure an action-approval policy to determine which actions Magentic-UI can perform without user approval. In the extreme, users can specify that any action needs explicit user approval. Users must press an “Accept” or “Deny” button for each action.
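To make the allow-list and action-approval checks above concrete, here is a minimal sketch of how the two could combine into a single "ask the user first?" decision. It is illustrative only: the `ApprovalPolicy` values, the function names, and the subdomain-matching rule are assumptions, not Magentic-UI's configuration options, and the `irreversible` flag stands in for the system's own judgment about an action.

```python
from enum import Enum
from typing import Iterable, Optional
from urllib.parse import urlparse


class ApprovalPolicy(Enum):
    # Hypothetical policy levels for illustration.
    ALWAYS = "always"                        # every action needs approval
    IRREVERSIBLE_ONLY = "irreversible_only"  # only actions judged irreversible
    NEVER = "never"                          # no action-level approval


def host_allowed(url: str, allow_list: Iterable[str]) -> bool:
    """True if the URL's host is an allow-listed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allow_list)


def needs_user_approval(
    policy: ApprovalPolicy,
    irreversible: bool,
    url: Optional[str] = None,
    allow_list: Iterable[str] = (),
) -> bool:
    """Decide whether to pause and show the user an Accept/Deny prompt."""
    # Visiting a site outside the allow-list always requires approval.
    if url is not None and not host_allowed(url, allow_list):
        return True
    if policy is ApprovalPolicy.ALWAYS:
        return True
    if policy is ApprovalPolicy.IRREVERSIBLE_ONLY:
        return irreversible
    return False


allowed = ["microsoft.com", "github.com"]
policy = ApprovalPolicy.IRREVERSIBLE_ONLY

ok_read = needs_user_approval(policy, False, "https://www.microsoft.com/research", allowed)
ok_submit = needs_user_approval(policy, True, "https://github.com/microsoft", allowed)
off_list = needs_user_approval(policy, False, "https://microsoft.com.evil.example/login", allowed)
```

With these inputs `ok_read` is `False`, while `ok_submit` and `off_list` are `True`. Matching on the parsed hostname rather than on a substring of the URL is what keeps a look-alike address such as `microsoft.com.evil.example` from passing the allow-list check.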

In addition to the above design decisions, we performed a red-team evaluation of Magentic-UI on a set of internal scenarios, which we developed to challenge the security and safety of Magentic-UI. Such scenarios include cross-site prompt injection attacks, where web pages contain malicious instructions distinct from the user’s original intent. The set also contains scenarios comparable to phishing, which try to trick Magentic-UI into entering sensitive information or granting permissions on impostor sites. In our preliminary evaluations, we found that Magentic-UI either refuses to complete the requests, stops to ask the user, or, as a final safety measure, is eventually unable to complete the request due to Docker sandboxing. We have found that this layered approach is effective for thwarting these attacks.
We have also released transparency notes, which can be found at:
Open research questions
Magentic-UI provides a tool for researchers to study critical questions in agentic systems, particularly on human-agent interaction. In a previous report, we outlined 12 questions for human-agent communication, and Magentic-UI provides a vehicle to study these questions in a realistic setting. A key question among these is how we enable humans to efficiently intervene and provide feedback to the agent while it is executing a task. Humans should not have to constantly watch the agent. Ideally, the agent should know when to reach out for help and provide the necessary context for the human to assist it. A second question is about safety. As agents interact with the live web, they may become prone to attacks from malicious actors. We need to study what safeguards are needed to protect the human from side effects without adding a heavy burden on the human to verify every agent action. There are also many other questions surrounding security, personalization, and learning that Magentic-UI can help with studying.
Conclusion
Magentic-UI is an open-source agent prototype that works with people to complete complex tasks that require multi-step planning and browser use. As agentic systems expand in the scope of tasks they can complete, Magentic-UI’s design enables better transparency into agent actions and enables human control to ensure safety and reliability. Moreover, by facilitating human intervention, we can improve performance while still reducing human cost in completing tasks on aggregate. Today we have released the first version of Magentic-UI. Looking ahead, we plan to continue developing it in the open with the goal of improving its capabilities and answering research questions on human-agent collaboration. We invite the research community to extend and reuse Magentic-UI for their scientific explorations and domains.
#magenticui #experimental #humancentered #web #agent

Magentic-UI, an experimental human-centered web agent

www.microsoft.com
Modern productivity is rooted in the web, from searching for information and filling in forms to navigating dashboards. Yet many of these tasks remain manual and repetitive. Today, we are introducing Magentic-UI, a new open-source research prototype of a human-centered agent that is meant to help researchers study open questions on human-in-the-loop approaches and oversight mechanisms for AI agents. This prototype collaborates with users on web-based tasks and operates in real time over a web browser. Unlike other computer-use agents that aim for full autonomy, Magentic-UI offers a transparent and controllable experience for action-oriented tasks.
Magentic-UI builds on Magentic-One, a powerful multi-agent team we released last year, and is powered by AutoGen, our leading agent framework. It is available under the MIT license at https://github.com/microsoft/Magentic-UI and on Azure AI Foundry Labs, the hub where developers, startups, and enterprises can explore groundbreaking innovations from Microsoft Research. Magentic-UI is integrated with Azure AI Foundry models and agents. Learn more about how to integrate Azure AI agents into the Magentic-UI multi-agent architecture by following this code sample.
Magentic-UI can perform tasks that require browsing the web, writing and executing Python and shell code, and understanding files. Its key features include:
Collaborative planning with users (co-planning). Magentic-UI allows users to directly modify its plan through a plan editor or by providing textual feedback before Magentic-UI executes any actions.
Collaborative execution with users (co-tasking). Users can pause the system and give feedback in natural language or demonstrate it by directly taking control of the browser.
Safety with human-in-the-loop (action guards). Magentic-UI seeks user approval before executing potentially irreversible actions, and the user can specify how often Magentic-UI needs approvals. Furthermore, Magentic-UI is sandboxed for the safe operation of tools such as browsers and code executors.
Learning from experience (plan learning). Magentic-UI can learn and save plans from previous interactions to improve task completion for future tasks.
Figure 1: Screenshot of Magentic-UI actively performing a task. The left side of the screen shows Magentic-UI stating its plan and progress to accomplish a user’s complex goal. The right side shows the browser Magentic-UI is controlling.
How is Magentic-UI human-centered?
While many web agents promise full autonomy, in practice users can be left unsure of what the agent can do, what it is currently doing, and whether they have enough control to intervene when something goes wrong or doesn’t occur as expected. By contrast, Magentic-UI considers user needs at every stage of interaction. We followed a human-centered design methodology in building Magentic-UI by prototyping and obtaining feedback from pilot users during its design.
Figure 2: Co-planning – Users can collaboratively plan with Magentic-UI.
For example, after a person specifies a task and before Magentic-UI even begins to execute, it creates a clear step-by-step plan that outlines what it would do to accomplish the task. People can collaborate with Magentic-UI to modify this plan and then give final approval for Magentic-UI to begin execution. This is crucial because users may have expectations of how the task should be completed, and communicating that information could significantly improve agent performance. We call this feature co-planning.
During execution, Magentic-UI shows in real time what specific actions it’s about to take, for example whether it is about to click on a button or input a search query. It also shows in real time what it observed on the web pages it is visiting. Users can take control of the action at any point in time and give control back to the agent. We call this feature co-tasking.
Figure 3: Co-tasking – Magentic-UI provides real-time updates about what it is about to do and what it already did, allowing users to collaboratively complete tasks with the agent.
Figure 4: Action-guards – Magentic-UI will ask users for permission before executing actions that it deems consequential or important.
Additionally, Magentic-UI asks for user permission before performing actions that are deemed irreversible, such as closing a tab or clicking a button with side effects. We call these “action guards”. The user can also configure Magentic-UI’s action guards to always ask for permission before performing any action. If the user deems an action risky (e.g., paying for an item), they can reject it.
Figure 5: Plan learning – Once a task is successfully completed, users can request Magentic-UI to learn a step-by-step plan from this experience.
After execution, the user can ask Magentic-UI to reflect on the conversation and infer and save a step-by-step plan for future similar tasks. Users can view and modify saved plans in a saved-plans gallery for Magentic-UI to reuse in the future. In a future session, users can launch Magentic-UI with the saved plan to either execute the same task again, like checking the price of a specific flight, or use the plan as a guide to help complete similar tasks, such as checking the price of a different type of flight.
Combined, these four features (co-planning, co-tasking, action guards, and plan learning) enable users to collaborate effectively with Magentic-UI.
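To make the action-guard idea concrete, here is a minimal Python sketch of how an approval policy could be expressed. It is an illustration under stated assumptions, not Magentic-UI's actual implementation: the names `ApprovalPolicy`, `Action`, `needs_approval`, and `run_action`, as well as the `irreversible` flag, are hypothetical and introduced only for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class ApprovalPolicy(Enum):
    """Hypothetical policy levels; not taken from the Magentic-UI codebase."""
    ALWAYS = "always"                    # ask before every action
    IRREVERSIBLE_ONLY = "irreversible"   # ask only before actions flagged irreversible


@dataclass(frozen=True)
class Action:
    description: str
    irreversible: bool  # assumed to be decided by the agent before execution


def needs_approval(action: Action, policy: ApprovalPolicy) -> bool:
    """Return True when the user must accept or deny the action first."""
    if policy is ApprovalPolicy.ALWAYS:
        return True
    return action.irreversible


def run_action(action: Action, policy: ApprovalPolicy,
               ask_user: Callable[[Action], bool]) -> str:
    """Execute the action unless a required approval is denied."""
    if needs_approval(action, policy) and not ask_user(action):
        return "rejected"
    return "executed"
```

In this sketch, `ask_user` stands in for the “Accept” or “Deny” prompt: a risky action such as paying for an item is only executed when the callback returns `True`.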
Architecture
Magentic-UI’s underlying system is a team of specialized agents adapted from AutoGen’s Magentic-One system. The agents work together to create a modular system:
Orchestrator is the lead agent, powered by a large language model (LLM), that performs co-planning with the user, decides when to ask the user for feedback, and delegates sub-tasks to the remaining agents to complete.
WebSurfer is an LLM agent equipped with a web browser that it can control. Given a request by the Orchestrator, it can click, type, scroll, and visit pages in multiple rounds to complete the request.
Coder is an LLM agent equipped with a Docker code-execution container. It can write and execute Python and shell commands and provide a response back to the Orchestrator.
FileSurfer is an LLM agent equipped with a Docker code-execution container and file-conversion tools from the MarkItDown package. It can locate files in the directory controlled by Magentic-UI, convert files to markdown, and answer questions about them.
Figure 6: System architecture diagram of Magentic-UI
To interact with Magentic-UI, users can enter a text message and attach images. In response, Magentic-UI creates a natural-language step-by-step plan with which users can interact through a plan-editing interface. Users can add, delete, edit, and regenerate steps, and write follow-up messages to iterate on the plan. Although editing the plan adds an upfront cost to the interaction, it can save a significant amount of time during execution and increase the agent’s chance of success.
The plan is stored inside the Orchestrator and is used to execute the task. For each step of the plan, the Orchestrator determines which of the agents (WebSurfer, Coder, FileSurfer) or the user should complete the step. Once that decision is made, the Orchestrator sends a request to one of the agents or the user and waits for a response.
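The delegation step just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the AutoGen or Magentic-UI API: `PlanStep`, `delegate`, and the plain callables standing in for agents are hypothetical. In the real system the Orchestrator's LLM chooses who completes each step, whereas here that choice is simply recorded in the plan.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PlanStep:
    description: str
    assignee: str  # e.g. "WebSurfer", "Coder", "FileSurfer", or "User"


# Each handler stands in for an agent (or the user) and returns a text response.
Handler = Callable[[str], str]


def delegate(plan: List[PlanStep], handlers: Dict[str, Handler]) -> List[str]:
    """Send each step to its assignee and collect the responses in plan order."""
    responses = []
    for step in plan:
        handler = handlers[step.assignee]  # raises KeyError for an unknown assignee
        # The call blocks until the assignee responds, mirroring "waits for a response".
        responses.append(handler(step.description))
    return responses
```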
After the response is received, the Orchestrator decides whether that step is complete. If it is, the Orchestrator moves on to the following step. Once all steps are completed, the Orchestrator generates a final answer that is presented to the user. If, while executing any of the steps, the Orchestrator decides that the plan is inadequate (for example, because a certain website is unreachable), the Orchestrator can replan with user permission and start executing a new plan.
All intermediate progress steps are clearly displayed to the user. Furthermore, the user can pause the execution of the plan and send additional requests or feedback. The user can also configure through the interface whether agent actions (e.g., clicking a button) require approval.
Evaluating Magentic-UI
Magentic-UI innovates through its ability to integrate human feedback in its planning and execution of tasks. We performed a preliminary automated evaluation to showcase this ability on the GAIA benchmark for agents with a user-simulation experiment.
Evaluation with simulated users
Figure 7: Comparison on the GAIA validation set of the accuracy of Magentic-One, Magentic-UI in autonomous mode, Magentic-UI with a simulated user powered by a smarter LLM than the Magentic-UI agents, Magentic-UI with a simulated user that has access to side information about the tasks, and human performance. This shows that human-in-the-loop can improve the accuracy of autonomous agents, bridging the gap to human performance at a fraction of the cost.
GAIA is a benchmark for general AI assistants, with multimodal question-answer pairs that are challenging, requiring the agents to navigate the web, process files, and execute code. The traditional evaluation setup with GAIA assumes the system will autonomously complete the task and return an answer, which is compared to the ground-truth answer.
To evaluate the human-in-the-loop capabilities of Magentic-UI, we transform GAIA into an interactive benchmark by introducing the concept of a simulated user. Simulated users provide value in two ways: by having specific expertise that the agent may not possess, and by providing guidance on how the task should be performed.
We experiment with two types of simulated users to show the value of human-in-the-loop: (1) a simulated user that is more intelligent than the Magentic-UI agents and (2) a simulated user with the same intelligence as the Magentic-UI agents but with additional information about the task. During co-planning, Magentic-UI takes feedback from this simulated user to improve its plan. During co-tasking, Magentic-UI can ask the (simulated) user for help when it gets stuck. Finally, if Magentic-UI does not provide a final answer, then the simulated user provides an answer instead.
The simulated user is an LLM without any tools, instructed to interact with Magentic-UI the way we expect a human would act. The first type of simulated user relies on OpenAI’s o4-mini, which is more performant at many tasks than the model powering the Magentic-UI agents (GPT-4o). For the second type of simulated user, we use GPT-4o for both the simulated user and the rest of the agents, but the user has access to side information about each task. Each task in GAIA has side information, which includes a human-written plan to solve the task. While this plan is not used as input in the traditional benchmark, in our interactive setting we provide this information to the second type of simulated user, which is powered by an LLM, so that it can mimic a knowledgeable user. Importantly, we tuned our simulated user so as not to reveal the ground-truth answer directly, as the answer is usually found inside the human-written plan. Instead, it is prompted to guide Magentic-UI indirectly.
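The interactive protocol above can be summarized as a short Python sketch. All names here (`run_interactive_task`, the `Agent` and `SimulatedUser` interfaces, and their methods) are hypothetical stand-ins chosen for illustration; the actual evaluation harness is not described at this level of detail.

```python
from typing import Callable, Optional, Protocol


class Agent(Protocol):
    def propose_plan(self, task: str) -> str: ...
    def revise_plan(self, plan: str, feedback: str) -> str: ...
    def execute(self, plan: str, ask_for_help: Callable[[str], str]) -> Optional[str]: ...


class SimulatedUser(Protocol):
    def plan_feedback(self, task: str, plan: str) -> str: ...
    def help(self, question: str) -> str: ...
    def answer(self, task: str) -> str: ...


def run_interactive_task(task: str, agent: Agent, user: SimulatedUser) -> str:
    # Co-planning: the agent revises its plan using the simulated user's feedback.
    plan = agent.propose_plan(task)
    plan = agent.revise_plan(plan, user.plan_feedback(task, plan))
    # Co-tasking: the agent may call user.help when it gets stuck.
    answer = agent.execute(plan, user.help)
    # Fallback: if the agent gives no final answer, the simulated user answers instead.
    return answer if answer is not None else user.answer(task)
```

The returned string would then be compared to the ground-truth answer, as in the traditional GAIA setup.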
We found that this tuning prevented the simulated user from inadvertently revealing the answer in all but 6% of tasks when Magentic-UI provides a final answer.
On the validation subset of GAIA (162 tasks), we show the results of Magentic-One operating in autonomous mode, Magentic-UI operating in autonomous mode (without the simulated user), Magentic-UI with simulated user (1) (smarter model), Magentic-UI with simulated user (2) (side information), and human performance. We first note that Magentic-UI in autonomous mode is within a margin of error of the performance of Magentic-One. Note that the same LLM (GPT-4o) is used for Magentic-UI and Magentic-One.
Magentic-UI with the simulated user that has access to side information raises the task-completion rate of autonomous Magentic-UI from 30.3% to 51.9%, a relative improvement of 71%. Moreover, Magentic-UI only asks for help from the simulated user in 10% of tasks and relies on the simulated user for the final answer in 18% of tasks. In those tasks where it does ask for help, it asks on average 1.1 times. Magentic-UI with the simulated user powered by a smarter model improves to 42.6%, asking for help in only 4.3% of tasks and an average of 1.7 times in those tasks. This demonstrates the potential of even lightweight human feedback for improving performance (e.g., task completion) of autonomous agents, especially at a fraction of the cost compared to people completing tasks entirely manually.
Learning and reusing plans
As described above, once Magentic-UI completes a task, users have the option for Magentic-UI to learn a plan based on the execution of the task. These plans are saved in a plan gallery, which users and Magentic-UI can access in the future.
The user can select a plan from the plan gallery, which is displayed by clicking on the Saved Plans button.
Alternatively, as a user enters a task that closely matches a previous task, the saved plan will be displayed even before the user is done typing. If no identical task is found, Magentic-UI can use AutoGen’s Task-Centric Memory to retrieve plans for any similar tasks. Our preliminary evaluations show that this retrieval is highly accurate and that recalling a saved plan can be around 3x faster than generating a new plan. Once a plan is recalled or generated, the user can always accept it, modify it, or ask Magentic-UI to modify it for the specific task at hand.
Safety and control
Magentic-UI can surf the live internet and execute code. With such capabilities, we need to ensure that Magentic-UI acts in a safe and secure manner. The following features, design decisions, and evaluations were made to ensure this:
Allow-list: Users can set a list of websites that Magentic-UI is allowed to access. If Magentic-UI needs to access a website outside of the allow-list, users must explicitly approve it through the interface.
Anytime interruptions: At any point while Magentic-UI is completing the task, the user can interrupt Magentic-UI and stop any pending code execution or web browsing.
Docker sandboxing: Magentic-UI controls a browser that is launched inside a Docker container with no credentials, which avoids risks with logged-in accounts and credentials. Moreover, any code execution is also performed inside a separate Docker container to avoid affecting the host environment in which Magentic-UI is running. This is illustrated in the system architecture of Magentic-UI (Figure 6).
Detection and approval of irreversible agent actions: Users can configure an action-approval policy (action guards) to determine which actions Magentic-UI can perform without user approval. In the extreme, users can specify that any action (e.g., any button click) needs explicit user approval. Users must press an “Accept” or “Deny” button for each action.
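The allow-list item in the list above can be sketched as follows. This is an assumption-laden illustration rather than documented Magentic-UI behavior: the function names `is_allowed` and `may_visit` are hypothetical, and so is the rule that subdomains of an allowed host count as allowed.

```python
from typing import Callable, Iterable
from urllib.parse import urlparse


def is_allowed(url: str, allow_list: Iterable[str]) -> bool:
    """True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allow_list)


def may_visit(url: str, allow_list: Iterable[str],
              ask_user: Callable[[str], bool]) -> bool:
    """Visit allowed sites freely; otherwise require explicit user approval."""
    return is_allowed(url, allow_list) or ask_user(url)
```

Matching on the parsed hostname rather than on a substring of the URL is a deliberate choice in this sketch, so that a look-alike host that merely contains an allowed name is not treated as allowed.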
In addition to the above design decisions, we performed a red-team evaluation of Magentic-UI on a set of internal scenarios, which we developed to challenge the security and safety of Magentic-UI. Such scenarios include cross-site prompt injection attacks, where web pages contain malicious instructions distinct from the user’s original intent (e.g., to execute risky code, access sensitive files, or perform actions on other websites). The set also contains scenarios comparable to phishing, which try to trick Magentic-UI into entering sensitive information or granting permissions on impostor sites (e.g., a synthetic website that asks Magentic-UI to log in and enter Google credentials to read an article).
In our preliminary evaluations, we found that Magentic-UI either refuses to complete the requests, stops to ask the user, or, as a final safety measure, is eventually unable to complete the request due to Docker sandboxing. We have found that this layered approach is effective for thwarting these attacks.
We have also released transparency notes, which can be found at: https://github.com/microsoft/magentic-ui/blob/main/TRANSPARENCY_NOTE.md
Open research questions
Magentic-UI provides a tool for researchers to study critical questions in agentic systems and particularly on human-agent interaction. In a previous report, we outlined 12 questions for human-agent communication, and Magentic-UI provides a vehicle to study these questions in a realistic setting. A key question among these is how we enable humans to efficiently intervene and provide feedback to the agent while executing a task. Humans should not have to constantly watch the agent. Ideally, the agent should know when to reach out for help and provide the necessary context for the human to assist it. A second question is about safety. As agents interact with the live web, they may become prone to attacks from malicious actors.
We need to study what safeguards are needed to protect the human from side effects without adding a heavy burden on the human to verify every agent action. There are also many other questions surrounding security, personalization, and learning that Magentic-UI can help with studying.
Conclusion
Magentic-UI is an open-source agent prototype that works with people to complete complex tasks that require multi-step planning and browser use. As agentic systems expand in the scope of tasks they can complete, Magentic-UI’s design enables better transparency into agent actions and enables human control to ensure safety and reliability. Moreover, by facilitating human intervention, we can improve performance while still reducing human cost in completing tasks in aggregate. Today we have released the first version of Magentic-UI. Looking ahead, we plan to continue developing it in the open with the goal of improving its capabilities and answering research questions on human-agent collaboration. We invite the research community to extend and reuse Magentic-UI for their scientific explorations and domains.

Microsoft Academic @MicrosoftAcademic shared a link
2025-05-15 17:00:56 ·

Coauthor roundtable: Reflecting on real world of doctors, developers, patients, and policymakers

Transcript      
PETER LEE: “We need to start understanding and discussing AI’s potential for good and ill now. Or rather, yesterday. … GPT-4 has game-changing potential to improve medicine and health.”       
This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.    
Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?     
In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here. 
The passage I read at the top is from the book’s prologue.  
When Carey, Zak, and I wrote the book, we could only speculate how generative AI would be used in healthcare because GPT-4 hadn’t yet been released. It wasn’t yet available to the very people we thought would be most affected by it. And while we felt strongly that this new form of AI would have the potential to transform medicine, it was such a different kind of technology for the world, and no one had a user’s manual for this thing to explain how to use it effectively and also how to use it safely.
So we thought it would be important to give healthcare professionals and leaders a framing to start important discussions around its use. We wanted to provide a map not only to help people navigate a new world that we anticipated would happen with the arrival of GPT-4 but also to help them chart a future of what we saw as a potential revolution in medicine.
So I’m super excited to welcome my coauthors: longtime medical/science journalist Carey Goldberg and Dr. Zak Kohane, the inaugural chair of Harvard Medical School’s Department of Biomedical Informatics and the editor-in-chief for The New England Journal of Medicine AI.
We’re going to have two discussions. This will be the first one about what we’ve learned from the people on the ground so far and how we are thinking about generative AI today.
Carey, Zak, I’m really looking forward to this.
CAREY GOLDBERG: It’s nice to see you, Peter.
LEE: It’s great to see you, too.
GOLDBERG: We missed you.
ZAK KOHANE: The dynamic gang is back.
LEE: Yeah, and I guess after that big book project two years ago, it’s remarkable that we’re still on speaking terms with each other.
In fact, this episode is to react to what we heard in the first four episodes of this podcast. But before we get there, I thought maybe we should start with the origins of this project just now over two years ago. And, you know, I had this early secret access to Davinci 3, now known as GPT-4.
I remember, you know, experimenting right away with things in medicine, but I realized I was in way over my head. And so I wanted help. And the first person I called was you, Zak. And you remember we had a call, and I tried to explain what this was about. And I think I saw skepticism in—polite skepticism—in your eyes. But tell me, you know, what was going through your head when you heard me explain this thing to you?
KOHANE: So I was divided between the fact that I have tremendous respect for you, Peter. And you’ve always struck me as sober. And we’ve had conversations which showed to me that you fully understood some of the missteps that technology (ARPA, Microsoft, and others) had made in the past. And yet, you were telling me a full science fiction compliant story that something that we thought was 30 years away was happening now.
LEE: Mm-hmm.
KOHANE: And it was very hard for me to put together. And so I couldn’t quite tell myself this is BS, but I said, you know, I need to look at it. Just this seems too good to be true. What is this? So it was very hard for me to grapple with it. I was thrilled that it might be possible, but I was thinking, How could this be possible?
LEE: Yeah. Well, even now, I look back, and I appreciate that you were nice to me, because I think a lot of people would have been much less polite. And in fact, I myself had expressed a lot of very direct skepticism early on.
After ChatGPT got released, I think three or four days later, I received an email from a colleague running … who runs a clinic, and, you know, he said, “Wow, this is great, Peter. And, you know, we’re using this ChatGPT, you know, to have the receptionist in our clinic write after-visit notes to our patients.”
And that sparked a huge internal discussion about this. And you and I knew enough about hallucinations and about other issues that it seemed important to write something about what this could do and what it couldn’t do. And so I think, I can’t remember the timing, but you and I decided a book would be a good idea. And then I think you had the thought that you and I would write in a hopelessly academic style that no one would be able to read.
So it was your idea to recruit Carey, I think, right?
KOHANE: Yes, it was. I was sure that we both had a lot of material, but communicating it effectively to the very people we wanted to would not go well if we just left ourselves to our own devices. And Carey is super brilliant at what she does. She’s an idea synthesizer and public communicator in the written word and amazing.
LEE: So yeah. So, Carey, we contact you. How did that go?
GOLDBERG: So yes. On my end, I had known Zak for probably, like, 25 years, and he had always been the person who debunked the scientific hype for me. I would turn to him with like, “Hmm, they’re saying that the Human Genome Project is going to change everything.” And he would say, “Yeah. But first it’ll be 10 years of bad news, and then we’ll actually get somewhere.”
So when Zak called me up at seven o’clock one morning, just beside himself after having tried Davinci 3, I knew that there was something very serious going on. And I had just quit my job as the Boston bureau chief of Bloomberg News, and I was ripe for the plucking. And I also … I feel kind of nostalgic now about just the amazement and the wonder and the awe of that period. We knew that when generative AI hit the world, there would be all kinds of snags and obstacles and things that would slow it down, but at that moment, it was just like the holy crap moment. And it’s fun to think about it now.
LEE: Yeah.
KOHANE: I will see that and raise that one. I now tell GPT-4, please write this in the style of Carey Goldberg.
GOLDBERG: No way! Really?
KOHANE: Yes way. Yes way. Yes way.
GOLDBERG: Wow. Well, I have to say, like, it’s not hard to motivate readers when you’re writing about the most transformative technology of their lifetime. Like, I think there’s a gigantic hunger to read and to understand. So you were not hard to work with, Peter and Zak.
LEE: All right. So I think we have to get down to work now.
Yeah, so for these podcasts, you know, we’re talking to different types of people to just reflect on what’s actually happening, what has actually happened over the last two years. And so the first episode, we talked to two doctors. There’s Chris Longhurst at UC San Diego and Sara Murray at UC San Francisco. And besides being doctors and having AI affect their clinical work, they just happen also to be leading the efforts at their respective institutions to figure out how best to integrate AI into their health systems.
And, you know, it was fun to talk to them. And I felt like a lot of what they said was pretty validating for us. You know, they talked about AI scribes. Chris, especially, talked a lot about how AI can respond to emails from patients, write referral letters. And then, you know, they both talked about the importance of—I think, Zak, you used the phrase in our book “trust but verify”—you know, to have always a human in the loop.
What did you two take away from their thoughts overall about how doctors are using … and I guess, Zak, you would have a different lens also because at Harvard, you see doctors all the time grappling with AI.
KOHANE: So on the one hand, I think they’ve done some very interesting studies. And indeed, they saw that when these generative models, when GPT-4, was sending a note to patients, it was more detailed, friendlier.
But there were also some nonobvious results, which is on the generation of these letters, if indeed you review them as you’re supposed to, it was not clear that there was any time savings. And my own reaction was, Boy, every one of these things needs institutional review. It’s going to be hard to move fast.
And yet, at the same time, we know from them that the doctors on their smartphones are accessing these things all the time. And so the disconnect between a healthcare system, which is duty bound to carefully look at every implementation, is, I think, intimidating.
LEE: Yeah.
KOHANE: And at the same time, doctors who just have to do what they have to do are using this new superpower and doing it. And so that’s actually what struck me …
LEE: Yeah.
KOHANE: … is that these are two leaders and they’re doing what they have to do for their institutions, and yet there’s this disconnect.
And by the way, I don’t think we’ve seen any faster technology adoption than the adoption of ambient dictation. And it’s not because it’s time saving. And in fact, so far, the hospitals have to pay out of pocket. It’s not like insurance is paying them more. But it’s so much more pleasant for the doctors … not least of which because they can actually look at their patients instead of looking at the terminal and plunking down.
LEE: Carey, what about you?
GOLDBERG: I mean, anecdotally, there are time savings. Anecdotally, I have heard quite a few doctors saying that it cuts down on “pajama time” to be able to have the note written by the AI and then for them to just check it. In fact, I spoke to one doctor who said, you know, basically it means that when I leave the office, I’ve left the office. I can go home and be with my kids.
So I don’t think the jury is fully in yet about whether there are time savings. But what is clear is, Peter, what you predicted right from the get-go, which is that this is going to be an amazing paper shredder. Like, the main first overarching use cases will be back-office functions.
LEE: Yeah, yeah. Well, and it was, I think, not a hugely risky prediction because, you know, there were already companies, like, using phone banks of scribes in India to kind of listen in. And, you know, lots of clinics actually had human scribes being used. And so it wasn’t a huge stretch to imagine the AI.
So on the subject of things that we missed, Chris Longhurst shared this scenario, which stuck out for me, and he actually coauthored a paper on it last year.
CHRISTOPHER LONGHURST: It turns out, not surprisingly, healthcare can be frustrating. And stressed patients can send some pretty nasty messages to their care teams. And you can imagine being a busy, tired, exhausted clinician and receiving a bit of a nasty-gram. And the GPT is actually really helpful in those instances in helping draft a pretty empathetic response when I think the human instinct would be a pretty nasty one.
LEE: So, Carey, maybe I’ll start with you. What did we understand about this idea of empathy out of AI at the time we wrote the book, and what do we understand now?
GOLDBERG: Well, it was already clear when we wrote the book that these AI models were capable of very persuasive empathy. And in fact, you even wrote that it was helping you be a better person, right. So their human qualities, or human imitative qualities, were clearly superb. And we’ve seen that borne out in multiple studies, that in fact, patients respond better to them … that they have no problem at all with how the AI communicates with them. And in fact, it’s often better.
And I gather now we’re even entering a period when people are complaining of sycophantic models, where the models are being too personable and too flattering. I do think that’s been one of the great surprises. And in fact, this is a huge phenomenon, how charming these models can be.
LEE: Yeah, I think you’re right. We can take credit for understanding that, Wow, these things can be remarkably empathetic. But then we missed this problem of sycophancy. Like, we even started our book in Chapter 1 with a quote from Davinci 3 scolding me. Like, don’t you remember when we were first starting, this thing was actually anti-sycophantic. If anything, it would tell you you’re an idiot.
KOHANE: It argued with me about certain biology questions. It was like a knockdown, drag-out fight. I was bringing references. It was impressive. But in fact, it made me trust it more.
LEE: Yeah.
KOHANE: And in fact, I will say—I remember it’s in the book—I had a bone to pick with Peter. Peter really was impressed by the empathy. And I pointed out that some of the most popular doctors are popular because they’re very empathic. But they’re not necessarily the best doctors. And in fact, I was taught that in medical school.
And so it’s a decoupling. It’s a human thing, that the empathy does not necessarily mean … it’s more of a, potentially, more of a signaled virtue than an actual virtue.
GOLDBERG: Nicely put.
LEE: Yeah, this issue of sycophancy, I think, is a struggle right now in the development of AI because I think it’s somehow related to instruction-following. So, you know, one of the challenges in AI is you’d like to give an AI a task—a task that might take several minutes or hours or even days to complete. And you want it to faithfully kind of follow those instructions. And, you know, that early version of GPT-4 was not very good at instruction-following. It would just silently disobey and, you know, and do something different.
And so I think we’re starting to hit some confusing elements of like, how agreeable should these things be?
One of the two of you used the word genteel. There was some point even while we were, like, on a little book tour … was it you, Carey, who said that the model seems nicer and less intelligent or less brilliant now than it did when we were writing the book?
GOLDBERG: It might have been, I think so. And I mean, I think in the context of medicine, of course, the question is, well, what’s likeliest to get the results you want with the patient, right? A lot of healthcare is in fact persuading the patient to do what you know as the physician would be best for them. And so it seems worth testing out whether this sycophancy is actually constructive or not. And I suspect … well, I don’t know, probably depends on the patient.
So actually, Peter, I have a few questions for you …
LEE: Yeah. Mm-hmm.
GOLDBERG: … that have been lingering for me. And one is, for AI to ever fully realize its potential in medicine, it must deal with the hallucinations. And I keep hearing conflicting accounts about whether that’s getting better or not. Where are we at, and what does that mean for use in healthcare?
LEE: Yeah, well, it’s, I think two years on, in the pretrained base models, there’s no doubt that hallucination rates by any benchmark measure have reduced dramatically. And, you know, that doesn’t mean they don’t happen. They still happen. But, you know, there’s been just a huge amount of effort and understanding in the, kind of, fundamental pretraining of these models. And that has come along at the same time that the inference costs, you know, for actually using these models has gone down, you know, by several orders of magnitude.
So things have gotten cheaper and have fewer hallucinations. At the same time, now there are these reasoning models. And the reasoning models are able to solve problems at PhD level oftentimes.
But at least at the moment, they are also now hallucinating more than the simpler pretrained models. And so it still continues to be, you know, a real issue, as we were describing. I don’t know, Zak, from where you’re at in medicine, as a clinician and as an educator in medicine, how is the medical community from where you’re sitting looking at that?
KOHANE: So I think it’s less of an issue, first of all, because the rate of hallucinations is going down. And second of all, in their day-to-day use, the doctor will provide questions that sit reasonably well into the context of medical decision-making. And the way doctors use this, let’s say on their non-EHR smartphone is really to jog their memory or thinking about the patient, and they will evaluate independently. So that seems to be less of an issue. I’m actually more concerned about something else that’s I think more fundamental, which is effectively, what values are these models expressing?
And I’m reminded of when I was still in training, I went to a fancy cocktail party in Cambridge, Massachusetts, and there was a psychotherapist speaking to a dentist. They were talking about their summer, and the dentist was saying about how he was going to fix up his yacht that summer, and the only question was whether he was going to make enough money doing procedures in the spring so that he could afford those things, which was discomforting to me because that dentist was my dentist. And he had just proposed to me a few weeks before an expensive procedure.
And so the question is what, effectively, is motivating these models?
LEE: Yeah, yeah.
KOHANE: And so with several colleagues, I published a paper, basically, what are the values in AI? And we gave a case: a patient, a boy who is on the short side, not abnormally short, but on the short side, and his growth hormone levels are not zero. They’re there, but they’re on the lowest side. But the rest of the workup has been unremarkable. And so we asked GPT-4, you are a pediatric endocrinologist.
Should this patient receive growth hormone? And it did a very good job explaining why the patient should receive growth hormone.
GOLDBERG: Should. Should receive it.
KOHANE: Should. And then we asked, in a separate session, you are working for the insurance company. Should this patient receive growth hormone? And it actually gave a scientifically better reason not to give growth hormone. And in fact, I tend to agree medically, actually, with the insurance company in this case, because giving kids who are not growth hormone deficient, growth hormone gives only a couple of inches over many, many years, has all sorts of other issues. But here’s the point, we had 180-degree change in decision-making because of the prompt. And for that patient, tens-of-thousands-of-dollars-per-year decision; across patient populations, millions of dollars of decision-making.
LEE: Hmm. Yeah.
KOHANE: And you can imagine these user prompts making their way into system prompts, making their way into the instruction-following. And so I think this is aptly central. Just as I was wondering about my dentist, we should be wondering about these things. What are the values that are being embedded in them, some accidentally and some very much on purpose?
LEE: Yeah, yeah. That one, I think, we even had some discussions as we were writing the book, but there’s a technical element of that that I think we were missing, but maybe Carey, you would know for sure. And that’s this whole idea of prompt engineering. It sort of faded a little bit. Was it a thing? Do you remember?
GOLDBERG: I don’t think we particularly wrote about it. It’s funny, it does feel like it faded, and it seems to me just because everyone just gets used to conversing with the models and asking for what they want. Like, it’s not like there actually is any great science to it.
LEE: Yeah, even when it was a hot topic and people were talking about prompt engineering maybe as a new discipline, all this, it never, I was never convinced at the time. But at the same time, it is true. It speaks to what Zak was just talking about because part of the prompt engineering that people do is to give a defined role to the AI.
You know, you are an insurance claims adjuster, or something like that, and defining that role, that is part of the prompt engineering that people do.
GOLDBERG: Right. I mean, I can say, you know, sometimes you guys had me take sort of the patient point of view, like the “every patient” point of view. And I can say one of the aspects of using AI for patients that remains absent in as far as I can tell is it would be wonderful to have a consumer-facing interface where you could plug in your whole medical record without worrying about any privacy or other issues and be able to interact with the AI as if it were physician or a specialist and get answers, which you can’t do yet as far as I can tell.
LEE: Well, in fact, now that’s a good prompt because I think we do need to move on to the next episodes, and we’ll be talking about an episode that talks about consumers. But before we move on to Episode 2, which is next, I’d like to play one more quote, a little snippet from Sara Murray.
SARA MURRAY: I already do this when I’m on rounds—I’ll kind of give the case to ChatGPT if it’s a complex case, and I’ll say, “Here’s how I’m thinking about it; are there other things?” And it’ll give me additional ideas that are sometimes useful and sometimes not but often useful, and I’ll integrate them into my conversation about the patient.
LEE: Carey, you wrote this fictional account at the very start of our book. And that fictional account, I think you and Zak worked on that together, talked about this medical resident, ER resident, using, you know, a chatbot off label, so to speak. And here we have the chief, in fact, the nation’s first chief health AI officer for an elite health system doing exactly that. That’s got to be pretty validating for you, Carey.
GOLDBERG: It’s very. Although what’s troubling about it is that actually as in that little vignette that we made up, she’s using it off label, right. It’s like she’s just using it because it helps the way doctors use Google. And I do find it troubling that what we don’t have is sort of institutional buy-in for everyone to do that because, shouldn’t they if it helps?
LEE: Yeah. Well, let’s go ahead and get into Episode 2. So Episode 2, we sort of framed as talking to two people who are on the frontlines of big companies integrating generative AI into their clinical products. And so, one was Matt Lungren, who’s a colleague of mine here at Microsoft. And then Seth Hain, who leads all of R&D at Epic.
Maybe we’ll start with a little snippet of something that Matt said that struck me in a certain way.
MATTHEW LUNGREN: OK, we see this pain point. Doctors are typing on their computers while they’re trying to talk to their patients, right? We should be able to figure out a way to get that ambient conversation turned into text that then, you know, accelerates the doctor … takes all the important information. That’s a really hard problem, right. And so, for a long time, there was a human-in-the-loop aspect to doing this because you needed a human to say, “This transcript’s great, but here’s actually what needs to go in the note.” And that can’t scale.
LEE: I think we expected healthcare systems to adopt AI, and we spent a lot of time in the book on AI writing clinical encounter notes. It’s happening for real now, and in a big way. And it’s something that has, of course, been happening before generative AI but now is exploding because of it. Where are we at now, two years later, just based on what we heard from guests?
KOHANE: Well, again, unless they’re forced to, hospitals will not adopt new technology unless it immediately translates into income. So it’s bizarrely counter-cultural that, again, they’re not being able to bill for the use of the AI, but this technology is so compelling to the doctors that despite everything, it’s overtaking the traditional dictation-typing routine.
LEE: Yeah.
GOLDBERG: And a lot of them love it and say, you will pry my cold dead hands off of my ambient note-taking, right. And I actually … a primary care physician allowed me to watch her. She was actually testing the two main platforms that are being used. And there was this incredibly talkative patient who went on and on about vacation and all kinds of random things for about half an hour.
And both of the platforms were incredibly good at pulling out what was actually medically relevant. And so to say that it doesn’t save time doesn’t seem right to me. Like, it seemed like it actually did and in fact was just shockingly good at being able to pull out relevant information.
LEE: Yeah.
KOHANE: I’m going to hypothesize that in the trials, which have in fact shown no gain in time, is the doctors were being incredibly meticulous. So I think … this is a Hawthorne effect, because you know you’re being monitored. And we’ve seen this in other technologies where the moment the focus is off, it’s used much more routinely and with much less inspection, for the better and for the worse.
LEE: Yeah, you know, within Microsoft, I had some internal disagreements about Microsoft producing a product in this space. It wouldn’t be Microsoft’s normal way. Instead, we would want 50 great companies building those products and doing it on our cloud instead of us competing against those 50 companies. And one of the reasons is exactly what you both said. I didn’t expect that health systems would be willing to shell out the money to pay for these things. It doesn’t generate more revenue. But I think so far two years later, I’ve been proven wrong.
I wanted to ask a question about values here. I had this experience where I had a little growth, a bothersome growth on my cheek. And so had to go see a dermatologist. And the dermatologist treated it, froze it off. But there was a human scribe writing the clinical note.
And so I used the app to look at the note that was submitted. And the human scribe said something that did not get discussed in the exam room, which was that the growth was making it impossible for me to safely wear a COVID mask. And that was the reason for it.
And that then got associated with a code that allowed full reimbursement for that treatment. And so I think that’s a classic example of what’s called upcoding. And I strongly suspect that AI scribes, an AI scribe would not have done that.
GOLDBERG: Well, depending what values you programmed into it, right, Zak?
KOHANE: Today, today, today, it will not do it. But, Peter, that is actually the central issue that society has to have because our hospitals are currently mostly in the red. And upcoding is standard operating procedure. And if these AI get in the way of upcoding, they are going to be aligned towards that upcoding. You know, you have to ask yourself, these MRI machines are incredibly useful. They’re also big money makers. And if the AI correctly says that for this complaint, you don’t actually have to do the MRI …
LEE: Right.
KOHANE: …
GOLDBERG: Yeah. And that raises another question for me. So, Peter, speaking from inside the gigantic industry, like, there seems to be such a need for self-surveillance of the models for potential harms that they could be causing. Are the big AI makers doing that? Are they even thinking about doing that?
Like, let’s say you wanted to watch out for the kind of thing that Zak’s talking about, could you?
LEE: Well, I think evaluation, like the best evaluation we had when we wrote our book was, you know, what score would this get on the step one and step two US medical licensing exams?
GOLDBERG: Right, right, right, yeah.
LEE: But honestly, evaluation hasn’t gotten that much deeper in the last two years. And it’s a big, I think, it is a big issue. And it’s related to the regulation issue also, I think.
Now the other guest in Episode 2 is Seth Hain from Epic. You know, Zak, I think it’s safe to say that you’re not a fan of Epic and the Epic system. You know, we’ve had a few discussions about that, about the fact that doctors don’t have a very pleasant experience when they’re using Epic all day.
Seth, in the podcast, said that there are over 100 AI integrations going on in Epic’s system right now. Do you think, Zak, that that has a chance to make you feel better about Epic? You know, what’s your view now two years on?
KOHANE: My view is, first of all, I want to separate my view of Epic and how it’s affected the conduct of healthcare and the quality of life of doctors from the individuals. Like Seth Hain is a remarkably fine individual who I’ve enjoyed chatting with and does really great stuff. Among the worst aspects of the Epic, even though it’s better in that respect than many EHRs, is horrible user interface.
The number of clicks that you have to go to get to something. And you have to remember where someone decided to put that thing. It seems to me that it is fully within the realm of technical possibility today to actually give an agent a task that you want done in the Epic record. And then whether Epic has implemented that agent or someone else, it does it so you don’t have to do the clicks. Because it’s something really soul sucking that when you’re trying to help patients, you’re having to remember not the right dose of the medication, but where was that particular thing that you needed in that particular task?
I can’t imagine that Epic does not have that in its product line. And if not, I know there must be other companies that essentially want to create that wrapper. So I do think, though, that the danger of multiple integrations is that you still want to have the equivalent of a single thought process that cares about the patient bringing those different processes together. And I don’t know if that’s Epic’s responsibility, the hospital’s responsibility, whether it’s actually a patient agent. But someone needs to be also worrying about all those AIs that are being integrated into the patient record. So … what do you think, Carey?
GOLDBERG: What struck me most about what Seth said was his description of the Cosmos project, and I, you know, I have been drinking Zak’s Kool-Aid for a very long time, and he—no, in a good way! And he persuaded me long ago that there is this horrible waste happening in that we have all of these electronic medical records, which could be used far, far more to learn from, and in particular, when you as a patient come in, it would be ideal if your physician could call up all the other patients like you and figure out what the optimal treatment for you would be. And it feels like—it sounds like—that’s one of the central aims that Epic is going for. And if they do that, I think that will redeem a lot of the pain that they’ve caused physicians these last few years.
And I also found myself thinking, you know, maybe this very painful period of using electronic medical records was really just a growth phase. It was an awkward growth phase. And once AI is fully used the way Zak is beginning to describe, the whole system could start making a lot more sense for everyone.
LEE: Yeah. One conversation I’ve had with Seth, in all of this is, you know, with AI and its development, is there a future, a near future where we don’t have an EHR system at all? You know, AI is just listening and just somehow absorbing all the information. And, you know, one thing that Seth said, which I felt was prescient, and I’d love to get your reaction, especially Zak, on this is he said, I think that … he said, technically, it could happen, but the problem is right now, actually doctors do a lot of their thinking when they write and review notes. You know, the actual process of being a doctor is not just being with a patient, but it’s actually thinking later. What do you make of that?
KOHANE: So one of the most valuable experiences I had in training was something that’s more or less disappeared in medicine, which is the post-clinic conference, where all the doctors come together and we go through the cases that we just saw that afternoon. And we, actually, were trying to take potshots at each other in order to actually improve. Oh, did you actually do that? Oh, I forgot. I’m going to go call the patient and do that.
And that really happened. And I think that, yes, doctors do think, and I do think that we are insufficiently using yet the artificial intelligence currently in the ambient dictation mode as much more of a independent agent saying, did you think about that?
I think that would actually make it more interesting, challenging, and clearly better for the patient because that conversation I just told you about with the other doctors, that no longer exists.
LEE: Yeah. Mm-hmm. I want to do one more thing here before we leave Matt and Seth in Episode 2, which is something that Seth said with respect to how to reduce hallucination.
SETH HAIN: At that time, there’s a lot of conversation in the industry around something called RAG, or retrieval-augmented generation. And the idea was, could you pull the relevant bits, the relevant pieces of the chart, into that prompt, that information you shared with the generative AI model, to be able to increase the usefulness of the draft that was being created? And that approach ended up proving and continues to be to some degree, although the techniques have greatly improved, somewhat brittle, right. And I think this becomes one of the things that we are and will continue to improve upon because, as you get a richer and richer amount of information into the model, it does a better job of responding.
LEE: Yeah, so, Carey, this sort of gets at what you were saying, you know, that shouldn’t these models be just bringing in a lot more information into their thought processes? And I’m certain when we wrote our book, I had no idea. I did not conceive of RAG at all. It emerged a few months later.
And to my mind, I remember the first time I encountered RAG—Oh, this is going to solve all of our problems of hallucination. But it’s turned out to be harder. It’s improving day by day, but it’s turned out to be a lot harder.
KOHANE: Seth makes a very deep point, which is the way RAG is implemented is basically some sort of technique for pulling the right information that’s contextually relevant. And the way that’s done is typically heuristic at best. And it’s not … doesn’t have the same depth of reasoning that the rest of the model has.
And I’m just wondering, Peter, what you think, given the fact that now context lengths seem to be approaching a million or more, and people are now therefore using the full strength of the transformer on that context and are trying to figure out different techniques to make it pay attention to the middle of the context. In fact, the RAG approach perhaps was just a transient solution to the fact that it’s going to be able to amazingly look in a thoughtful way at the entire record of the patient, for example. What do you think, Peter?
LEE: I think there are three things, you know, that are going on, and I’m not sure how they’re going to play out and how they’re going to be balanced. And I’m looking forward to talking to people in later episodes of this podcast, you know, people like Sébastien Bubeck or Bill Gates about this, because, you know, there is the pretraining phase, you know, when things are sort of compressed and baked into the base model.
There is the in-context learning, you know, so if you have extremely long or infinite context, you’re kind of learning as you go along. And there are other techniques that people are working on, you know, various sorts of dynamic reinforcement learning approaches, and so on. And then there is what maybe you would call structured RAG, where you do a pre-processing. You go through a big database, and you figure it all out. And make a very nicely structured database the AI can then consult with later.
And all three of these in different contexts today seem to show different capabilities. But they’re all pretty important in medicine.
Moving on to Episode 3, we talked to Dave DeBronkart, who is also known as “e-Patient Dave,” an advocate of patient empowerment, and then also Christina Farr, who has been doing a lot of venture investing for consumer health applications.
Let’s get right into this little snippet from something that e-Patient Dave said that talks about the sources of medical information, particularly relevant for when he was receiving treatment for stage 4 kidney cancer.
DAVE DEBRONKART: And I’m making a point here of illustrating that I am anything but medically trained, right. And yet I still, I want to understand as much as I can. I was months away from dead when I was diagnosed, but in the patient community, I learned that they had a whole bunch of information that didn’t exist in the medical literature. Now today we understand there’s publication delays; there’s all kinds of reasons. But there’s also a whole bunch of things, especially in an unusual condition, that will never rise to the level of deserving NIH funding and research.
LEE: All right. So I have a question for you, Carey, and a question for you, Zak, about the whole conversation with e-Patient Dave, which I thought was really remarkable. You know, Carey, I think as we were preparing for this whole podcast series, you made a comment—I actually took it as a complaint—that not as much has happened as I had hoped or thought. People aren’t thinking boldly enough, you know, and I think, you know, I agree with you in the sense that I think we expected a lot more to be happening, particularly in the consumer space. I’m giving you a chance to vent about this.
GOLDBERG: Thank you! Yes, that has been by far the most frustrating thing to me. I think that the potential for AI to improve everybody’s health is so enormous, and yet, you know, it needs some sort of support to be able to get to the point where it can do that. Like, remember in the book we wrote about Greg Moore talking about how half of the planet doesn’t have healthcare, but people overwhelmingly have cellphones. And so you could connect people who have no healthcare to the world’s medical knowledge, and that could certainly do some good.
And I have one great big problem with e-Patient Dave, which is that, God, he’s fabulous. He’s super smart. Like, he’s not a typical patient. He’s an off-the-charts, brilliant patient. And so it’s hard to … and so he’s a great sort of lead early-adopter-type person, and he can sort of show the way for others.
But what I had hoped for was that there would be more visible efforts to really help patients optimize their healthcare. Probably it’s happening a lot in quiet ways like that any discharge instructions can be instantly beautifully translated into a patient’s native language and so on. But it’s almost like there isn’t a mechanism to allow this sort of mass consumer adoption that I would hope for.
LEE: Yeah. But you have written some, like, you even wrote about that person who saved his dog. So do you think … you know, and maybe a lot more of that is just happening quietly that we just never hear about?
GOLDBERG: I’m sure that there is a lot of it happening quietly. And actually, that’s another one of my complaints is that no one is gathering that stuff. It’s like you might happen to see something on social media. Actually, e-Patient Dave has a hashtag, PatientsUseAI, and a blog, as well. So he’s trying to do it. But I don’t know of any sort of overarching or academic efforts to, again, to surveil what’s the actual use in the population and see what are the pros and cons of what’s happening.
LEE: Mm-hmm. So, Zak, you know, the thing that I thought about, especially with that snippet from Dave, is your opening for Chapter 8 that you wrote, you know, about your first patient dying in your arms. I still think of how traumatic that must have been. Because, you know, in that opening, you just talked about all the little delays, all the little paper-cut delays, in the whole process of getting some new medical technology approved. But there’s another element that Dave kind of speaks to, which is just, you know, patients who are experiencing some issue are very, sometimes very motivated. And there’s just a lot of stuff on social media that happens.
KOHANE: So this is where I can both agree with Carey and also disagree. I think when people have an actual health problem, they are now routinely using it.
GOLDBERG: Yes, that’s true.
KOHANE: And that situation is happening more often because medicine is failing. This is something that did not come up enough in our book. And perhaps that’s because medicine is actually feeling a lot more rickety today than it did even two years ago.
We actually mentioned the problem. I think, Peter, you may have mentioned the problem with the lack of primary care. But now in Boston, our biggest healthcare system, all the practices for primary care are closed. I cannot get for my own faculty—residents at MGH can’t get primary care doctor. And so …
LEE: Which is just crazy. I mean, these are amongst the most privileged people in medicine, and they can’t find a primary care physician. That’s incredible.
KOHANE: Yeah, and so therefore … and I wrote an
And so therefore, you see people who know that they have a six-month wait till they see the doctor, and all they can do is say, “I have this rash. Here’s a picture. What’s it likely to be? What can I do?” “I’m gaining weight. How do I do a ketogenic diet?” Or, “How do I know that this is the flu?”
This is happening all the time, where acutely patients have actually solved problems that doctors have not. Those are spectacular. But I’m saying more routinely because of the failure of medicine. And it’s not just in our fee-for-service United States. It’s in the UK; it’s in France. These are first-world, developed-world problems. And we don’t even have to go to lower- and middle-income countries for that.
LEE: Yeah.
GOLDBERG: But I think it’s important to note that, I mean, so you’re talking about how even the most elite people in medicine can’t get the care they need. But there’s also the point that we have so much concern about equity in recent years. And it’s likeliest that what we’re doing is exacerbating inequity because it’s only the more connected, you know, better off people who are using AI for their health.
KOHANE: Oh, yes. I know what various Harvard professors are doing. They’re paying for a concierge doctor. And that’s, you know, a - to -a-year-minimum investment. That’s inequity.
LEE: When we wrote our book, you know, the idea that GPT-4 wasn’t trained specifically for medicine, and that was amazing, but it might get even better and maybe would be necessary to do that. But one of the insights for me is that in the consumer space, the kinds of things that people ask about are different than what the board-certified clinician would ask.
KOHANE: Actually, that’s, I just recently coined the term. It’s the … maybe it’s … well, at least it’s new to me. It’s the technology or expert paradox. And that is the more expert and narrow your medical discipline, the more trivial it is to translate that into a specialized AI. So echocardiograms? We can now do beautiful echocardiograms. That’s really hard to do. I don’t know how to interpret an echocardiogram. But they can do it really, really well. Interpret an EEG. Interpret a genomic sequence. But understanding the fullness of the human condition, that’s actually hard. And actually, that’s what primary care doctors do best. But the paradox is right now, what is easiest for AI is also the most highly paid in medicine. Whereas what is the hardest for AI in medicine is the least regarded, least paid part of medicine.
GOLDBERG: So this brings us to the question I wanted to throw at both of you actually, which is we’ve had this spasm of incredibly prominent people predicting that in fact physicians would be pretty obsolete within the next few years. We had Bill Gates saying that; we had Elon Musk saying surgeons are going to be obsolete within a few years. And I think we had Demis Hassabis saying, “Yeah, we’ll probably cure most diseases within the next decade or so.”
So what do you think? And also, Zak, to what you were just saying, I mean, you’re talking about being able to solve very general overarching problems. But in fact, these general overarching models are actually able, I would think, are able to do that because they are broad. So what are we heading towards do you think? What should the next book be … The end of doctors?
KOHANE: So I do recall a conversation that … we were at a table with Bill Gates, and Bill Gates immediately went to this, which is advancing the cutting edge of science. And I have to say that I think it will accelerate discovery. But eliminating, let’s say, cancer? I think that’s going to be … that’s just super hard. The reason it’s super hard is we don’t have the data or even the beginnings of the understanding of all the ways this devilish disease managed to evolve around our solutions.
And so that seems extremely hard. I think we’ll make some progress accelerated by AI, but solving it in a way Hassabis says, God bless him. I hope he’s right. I’d love to have to eat crow in 10 or 20 years, but I don’t think so. I do believe that a surgeon working on one of those Davinci machines, that stuff can be, I think, automated.
And so I think that’s one example of one of the paradoxes I described. And it won’t be that we’re replacing doctors. I just think we’re running out of doctors. I think it’s really the case that, as we said in the book, we’re getting a huge deficit in primary care doctors.
But even the subspecialties, my subspecialty, pediatric endocrinology, we’re only filling half of the available training slots every year. And why? Because it’s a lot of work, a lot of training, and frankly doesn’t make as much money as some of the other professions.
LEE: Yeah. Yeah, I tend to think that, you know, there are going to be always a need for human doctors, not for their skills. In fact, I think their skills increasingly will be replaced by machines. And in fact, I’ve talked about a flip. In fact, patients will demand, Oh my god, you mean you’re going to try to do that yourself instead of having the computer do it? There’s going to be that sort of flip. But I do think that when it comes to people’s health, people want the comfort of an authority figure that they trust. And so what is more of a question for me is whether we will ever view a machine as an authority figure that we can trust.
And before I move on to Episode 4, which is on norms, regulations and ethics, I’d like to hear from Chrissy Farr on one more point on consumer health, specifically as it relates to pregnancy:
CHRISTINA FARR: For a lot of women, it’s their first experience with the hospital. And, you know, I think it’s a really big opportunity for these systems to get a whole family on board and keep them kind of loyal. And a lot of that can come through, you know, just delivering an incredible service. Unfortunately, I don’t think that we are delivering incredible services today to women in this country. I see so much room for improvement.
LEE: In the consumer space, I don’t think we really had a focus on those periods in a person’s life when they have a lot of engagement, like pregnancy, or I think another one is menopause, cancer. You know, there are points where there is, like, very intense engagement. And we heard that from e-Patient Dave, you know, with his cancer and Chrissy with her pregnancy. Was that a miss in our book? What do you think, Carey?
GOLDBERG: I mean, I don’t think so. I think it’s true that there are many points in life when people are highly engaged. To me, the problem thus far is just that I haven’t seen consumer-facing companies offering beautiful AI-based products. I think there’s no question at all that the market is there if you have the products to offer.
LEE: So, what do you think this means, Zak, for, you know, like Boston Children’s or Mass General Brigham—you know, the big places?
KOHANE: So again, all these large healthcare systems are in tough shape. MGB would be fully in the red if not for the fact that its investments, of all things, have actually produced. If you look at the large healthcare systems around the country, they are in the red. And there’s multiple reasons why they’re in the red, but among them is cost of labor.
And so we’ve created what used to be a very successful beast, the health center. But it’s developed a very expensive model and a highly regulated model. And so when you have high revenue, tiny margins, your ability to disrupt yourself, to innovate, is very, very low because you will have to talk to the board next year if you went from 2% positive margin to 1% negative margin.
LEE: Yeah.
KOHANE: And so I think we’re all waiting for one of the two things to happen, either a new kind of healthcare delivery system being generated or ultimately one of these systems learns how to disrupt itself.
LEE: Yeah.
GOLDBERG: We punted. We totally punted to the AI.
LEE: We had three amazing guests. One was Laura Adams from National Academy of Medicine. Let’s play a snippet from her.
LAURA ADAMS: I think one of the most provocative and exciting articles that I saw written recently was by Bakul Patel and David Blumenthal, who posited, should we be regulating generative AI as we do a licensed and qualified provider? Should it be treated in the sense that it’s got to have a certain amount of training and a foundation that’s got to pass certain tests? Does it have to report its performance? And I’m thinking, what a provocative idea, but it’s worth considering.
LEE: All right, so I very well remember that we had discussed this kind of idea when we were writing our book. And I think before we finished our book, I personally rejected the idea. But now two years later, what do the two of you think? I’m dying to hear.
GOLDBERG: Well, wait, why … what do you think? Like, are you sorry that you rejected it?
LEE: I’m still skeptical because when we are licensing human beings as doctors, you know, we’re making a lot of implicit assumptions that we don’t test as part of their licensure, you know, that first of all, they are human beings and they care about life, and that, you know, they have a certain amount of common sense and shared understanding of the world.
And there’s all sorts of sort of implicit assumptions that we have about each other as human beings living in a society together. That you know how to study, you know, because I know you just went through three years of medical or four years of medical school and all sorts of things. And so the standard ways that we license human beings, they don’t need to test all of that stuff. But somehow intuitively, all of that seems really important.
I don’t know. Am I wrong about that?
KOHANE: So it’s compared with what issue? Because we know for a fact that doctors who do a lot of a procedure, like do this procedure, like high-risk deliveries all the time, have better outcomes than ones who only do a few high risk. We talk about it, but we don’t actually make it explicit to patients or regulate that you have to have this minimal amount. And it strikes me that in some sense, and, oh, very importantly, these things called human beings learn on the job. And although I used to be very resentful of it as a resident, when someone would say, I don’t want the resident, I want the …
GOLDBERG: … the attending.
KOHANE: … they had a point. And so the truth is, maybe I was a wonderful resident, but some people were not so great. And so it might be the best outcome if we actually, just like for human beings, we say, yeah, OK, it’s this good, but don’t let it work autonomously, or it’s done a thousand of them, just let it go. We just don’t have practically speaking, we don’t have the environment, the lab, to test them. Now, maybe if they get embodied in robots and literally go around with us, then it’s going to be a lot easier. I don’t know.
LEE: Yeah.
GOLDBERG: Yeah, I think I would take a step back and say, first of all, we weren’t the only ones who were stumped by regulating AI. Like, nobody has done it yet in the United States to this day, right. Like, we do not have standing regulation of AI in medicine at all in fact. And that raises the issue of … the story that you hear often in the biotech business, which is, you know, more prominent here in Boston than anywhere else, is that thank goodness Cambridge put out, the city of Cambridge, put out some regulations about biotech and how you could dump your lab waste and so on. And that enabled the enormous growth of biotech here.
If you don’t have the regulations, then you can’t have the growth of AI in medicine that is worthy of having. And so, I just … we’re not the ones who should do it, but I just wish somebody would.
LEE: Yeah.
GOLDBERG: Zak.
KOHANE: Yeah, but I want to say this as always, execution is everything, even in regulation.
And so I’m mindful that a conference that both of you attended, the RAISE conference. The Europeans in that conference came to me personally and thanked me for organizing this conference about safe and effective use of AI because they said back home in Europe, all that we’re talking about is risk, not opportunities to improve care.
And so there is a version of regulation which just locks down the present and does not allow the future that we’re talking about to happen. And so, Carey, I absolutely hear you that we need to have a regulation that takes away some of the uncertainty around liability, around the freedom to operate that would allow things to progress. But we wrote in our book that premature regulation might actually focus on the wrong thing. And so since I’m an optimist, it may be the fact that we don’t have much of a regulatory infrastructure today, that it allows … it’s a unique opportunity—I’ve said this now to several leaders—for the healthcare systems to say, this is the regulation we need.
GOLDBERG: It’s true.
KOHANE: And previously it was top-down. It was coming from the administration, and those executive orders are now history. But there is an opportunity, which may or may not be attained, there is an opportunity for the healthcare leadership—for experts in surgery—to say, “This is what we should expect.”
LEE: Yeah.
KOHANE: I would love for this to happen. I haven’t seen evidence that it’s happening yet.
GOLDBERG: No, no. And there’s this other huge issue, which is that it’s changing so fast. It’s moving so fast. That something that makes sense today won’t in six months. So, what do you do about that?
LEE: Yeah, yeah, that is something I feel proud of because when I went back and looked at our chapter on this, you know, we did make that point, which I think has turned out to be true.
But getting back to this conversation, there’s something, a snippet of something, that Vardit Ravitsky said that I think touches on this topic.
VARDIT RAVITSKY: So my pushback is, are we seeing AI exceptionalism in the sense that if it’s AI, huh, panic! We have to inform everybody about everything, and we have to give them choices, and they have to be able to reject that tool and the other tool versus, you know, the rate of human error in medicine is awful. So why are we so focused on informed consent and empowerment regarding implementation of AI and less in other contexts?
GOLDBERG: Totally agree. Who cares about informed consent about AI. Don’t want it. Don’t need it. Nope.
LEE: Wow. Yeah. You know, and this … Vardit of course is one of the leading bioethicists, you know, and of course prior to AI, she was really focused on genetics. But now it’s all about AI.
And, Zak, you know, you and other doctors have always told me, you know, the truth of the matter is, you know, what do you call the bottom-of-the-class graduate of a medical school?
And the answer is “doctor.”
KOHANE: “Doctor.” Yeah. Yeah, I think that again, this gets to compared with what? We have to compare AI not to the medicine we imagine we have, or we would like to have, but to the medicine we have today. And if we’re trying to remove inequity, if we’re trying to improve our health, that’s what … those are the right metrics. And so that can be done so long as we avoid catastrophic consequences of AI.
So what would the catastrophic consequence of AI be? It would be a systematic behavior that we were unaware of that was causing poor healthcare. So, for example, you know, changing the dose on a medication, making it 20% higher than normal so that the rate of complications of that medication went from 1% to 5%. And so we do need some sort of monitoring.
We haven’t put out the paper yet, but in computer science, there’s, well, in programming, we know very well the value for understanding how our computer systems work.
And there was a guy by name of Allman, I think he’s still at a company called Sendmail, who created something called syslog. And syslog is basically a log of all the crap that’s happening in our operating system. And so I’ve been arguing now for the creation of MedLog. And MedLog … in other words, what we cannot measure, we cannot regulate, actually.
LEE: Yes.
KOHANE: And so what we need to have is MedLog, which says, “Here’s the context in which a decision was made. Here’s the version of the AI, you know, the exact version of the AI. Here was the data.” And we just have MedLog. And I think MedLog is actually incredibly important for being able to measure, to just do what we do in … it’s basically the black box for, you know, when there’s a crash. You know, we’d like to think we could do better than crash. We can say, “Oh, we’re seeing from MedLog that this practice is turning a little weird.” But worst case, patient dies, we can see in MedLog, what was the information this thing knew about it? And did it make the right decision? We can actually go for transparency, which like in aviation, is much greater than in most human endeavors.
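[Editor’s sketch, not part of the conversation: the MedLog idea described above, an append-only record of the context, the exact AI version, and the data behind each decision, might look roughly like the Python below. Everything here is an assumption made for illustration. The names MedLogEntry, append_entry, complication_counts, and binomial_tail are hypothetical, no published MedLog format is implied, and storing the outcome on the same record as the decision is a simplification. The last function shows one possible way to notice that a practice is “turning a little weird”: it asks how likely the observed number of complications would be if the true rate were still the 1% baseline from the dosing example.]

```python
import json
from dataclasses import dataclass, asdict
from math import comb


@dataclass(frozen=True)
class MedLogEntry:
    """One hypothetical MedLog record (field names are illustrative)."""
    context: str        # the setting in which the decision was made
    model_version: str  # the exact version of the AI
    input_data: dict    # the data the AI saw
    decision: str       # what the AI recommended
    complication: bool  # outcome, recorded here for simplicity


def append_entry(log, entry):
    """Append the entry as one JSON line; the log is never edited in place."""
    log.append(json.dumps(asdict(entry), sort_keys=True))


def complication_counts(log, model_version):
    """Replay the log; return (decisions, complications) for one AI version."""
    entries = [e for e in map(json.loads, log)
               if e["model_version"] == model_version]
    return len(entries), sum(e["complication"] for e in entries)


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing at least k
    complications in n decisions if the true rate were p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# Usage: 200 logged decisions by one AI version, 10 of them with complications.
log = []
for i in range(200):
    append_entry(log, MedLogEntry(
        context="medication dosing",
        model_version="example-model-2.0",  # made-up version string
        input_data={"dose_mg": 600},
        decision="increase dose",
        complication=(i < 10),
    ))

n, k = complication_counts(log, "example-model-2.0")
p_value = binomial_tail(n, k, 0.01)  # compare against a 1% baseline rate
print(n, k, p_value < 0.001)
```

[In this made-up run, 10 complications in 200 decisions is far more than a 1% baseline would explain, so the check flags the AI version for review. A real system would also need timestamps, access control, and outcomes that arrive after the decision, none of which is modeled here.]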
GOLDBERG: Sounds great.
LEE: Yeah, it’s sort of like a black box. I was thinking of the aviation black box kind of idea. You know, you bring up medication errors, and I have one more snippet. This is from our guest Roxana Daneshjou from Stanford.
ROXANA DANESHJOU: There was a mistake in her after-visit summary about how much Tylenol she could take. But I, as a physician, knew that this dose was a mistake. I actually asked ChatGPT. I gave it the whole after-visit summary, and I said, are there any mistakes here? And it clued in that the dose of the medication was wrong.
LEE: Yeah, so this is something we did write about in the book. We made a prediction that AI might be a second set of eyes, I think is the way we put it, catching things. And we actually had examples specifically in medication dose errors. I think for me, I expected to see a lot more of that than we are.
KOHANE: Yeah, it goes back to our conversation about Epic or competitor Epic doing that. I think we’re going to see that having oversight over all medical orders, all orders in the system, critique, real-time critique, where we’re both aware of alert fatigue. So we don’t want to have too many false positives. At the same time, knowing what are critical errors which could immediately affect lives. I think that is going to become in terms of—and driven by quality measures—a product.
GOLDBERG: And I think word will spread among the general public that kind of the same way in a lot of countries when someone’s in a hospital, the first thing people ask relatives are, well, who’s with them? Right?
LEE: Yeah. Yup.
GOLDBERG: You wouldn’t leave someone in hospital without relatives. Well, you wouldn’t maybe leave your medical …
KOHANE: By the way, that country is called the United States.
GOLDBERG: Yes, that’s true. It is true here now, too. But similarly, I would tell any loved one that they would be well advised to keep using AI to check on their medical care, right. Why not?
LEE: Yeah. Yeah. Last topic, just for this Episode 4. Roxana, of course, I think really made a name for herself in the AI era writing, actually just prior to ChatGPT, you know, writing some famous papers about how computer vision systems for dermatology were biased against dark-skinned people. And we did talk some about bias in these AI systems, but I feel like we underplayed it, or we didn’t understand the magnitude of the potential issues. What are your thoughts?
KOHANE: OK, I want to push back, because I’ve been asked this question several times. And so I have two comments. One is, over 100,000 doctors practicing medicine, I know they have biases. Some of them actually may be all in the same direction, and not good. But I have no way of actually measuring that. With AI, I know exactly how to measure that at scale and affordably. Number one. Number two, same 100,000 doctors. Let’s say I do know what their biases are. How hard is it for me to change that bias? It’s impossible …
LEE: Yeah, yeah.
KOHANE: … practically speaking. Can I change the bias in the AI? Somewhat. Maybe some completely.
I think that we’re in a much better situation.
GOLDBERG: Agree.
LEE: I think Roxana made also the super interesting point that there’s bias in the whole system, not just in individuals, but, you know, there’s structural bias, so to speak.
KOHANE: There is.
LEE: Yeah. Hmm. There was a super interesting paper that Roxana wrote not too long ago—her and her collaborators—showing AI’s ability to detect, to spot biased decision-making by others. Are we going to see more of that?
KOHANE: Oh, yeah, I was very pleased when, in NEJM AI, we published a piece with Marzyeh Ghassemi, and what they were talking about was actually—and these are researchers who had published extensively on bias and threats from AI. And they actually, in this article, did the flip side, which is how much better AI can do than human beings in this respect.
And so I think that as some of these computer scientists enter the world of medicine, they’re becoming more and more aware of human foibles and can see how these systems, which if they only looked at the pretrained state, would have biases. But now, where we know how to fine-tune the de-bias in a variety of ways, they can do a lot better and, in fact, I think are much more … a much greater reason for optimism that we can change some of these noxious biases than in the pre-AI era.
GOLDBERG: And thinking about Roxana’s dermatological work on how I think there wasn’t sufficient work on skin tone as related to various growths, you know, I think that one thing that we totally missed in the book was the dawn of multimodal uses, right.
LEE: Yeah. Yeah, yeah.
GOLDBERG: That’s been truly amazing that in fact all of these visual and other sorts of data can be entered into the models and move them forward.
LEE: Yeah. Well, maybe on these slightly more optimistic notes, we’re at time. You know, I think ultimately, I feel pretty good still about what we did in our book, although there were a lot of misses. I don’t think any of us could really have predicted really the extent of change in the world.
So, Carey, Zak, just so much fun to do some reminiscing but also some reflection about what we did.
And to our listeners, as always, thank you for joining us. We have some really great guests lined up for the rest of the series, and they’ll help us explore a variety of relevant topics—from AI drug discovery to what medical students are seeing and doing with AI and more.
We hope you’ll continue to tune in. And if you want to catch up on any episodes you might have missed, you can find them at aka.ms/AIrevolutionPodcast or wherever you listen to your favorite podcasts.
Until next time. 

Coauthor roundtable: Reflecting on real world of doctors, developers, patients, and policymakers
Transcript
PETER LEE: “We need to start understanding and discussing AI’s potential for good and ill now. Or rather, yesterday. … GPT-4 has game-changing potential to improve medicine and health.”
This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.
Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?
In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here.
The passage I read at the top is from the book’s prologue.
When Carey, Zak, and I wrote the book, we could only speculate how generative AI would be used in healthcare because GPT-4 hadn’t yet been released. It wasn’t yet available to the very people we thought would be most affected by it. And while we felt strongly that this new form of AI would have the potential to transform medicine, it was such a different kind of technology for the world, and no one had a user’s manual for this thing to explain how to use it effectively and also how to use it safely.
So we thought it would be important to give healthcare professionals and leaders a framing to start important discussions around its use. We wanted to provide a map not only to help people navigate a new world that we anticipated would happen with the arrival of GPT-4 but also to help them chart a future of what we saw as a potential revolution in medicine.
So I’m super excited to welcome my coauthors: longtime medical/science journalist Carey Goldberg and Dr. Zak Kohane, the inaugural chair of Harvard Medical School’s Department of Biomedical Informatics and the editor-in-chief for The New England Journal of Medicine AI.
We’re going to have two discussions. This will be the first one about what we’ve learned from the people on the ground so far and how we are thinking about generative AI today.
Carey, Zak, I’m really looking forward to this.
CAREY GOLDBERG: It’s nice to see you, Peter.
LEE: It’s great to see you, too.
GOLDBERG: We missed you.
ZAK KOHANE: The dynamic gang is back.
LEE: Yeah, and I guess after that big book project two years ago, it’s remarkable that we’re still on speaking terms with each other. In fact, this episode is to react to what we heard in the first four episodes of this podcast. But before we get there, I thought maybe we should start with the origins of this project just now over two years ago. And, you know, I had this early secret access to Davinci 3, now known as GPT-4.
I remember, you know, experimenting right away with things in medicine, but I realized I was in way over my head. And so I wanted help. And the first person I called was you, Zak. And you remember we had a call, and I tried to explain what this was about. And I think I saw skepticism in—polite skepticism—in your eyes. But tell me, you know, what was going through your head when you heard me explain this thing to you?
KOHANE: So I was divided between the fact that I have tremendous respect for you, Peter. And you’ve always struck me as sober. And we’ve had conversations which showed to me that you fully understood some of the missteps that technology—ARPA, Microsoft, and others—had made in the past. And yet, you were telling me a full science fiction compliant story that something that we thought was 30 years away was happening now.
LEE: Mm-hmm.
KOHANE: And it was very hard for me to put together. And so I couldn’t quite tell myself this is BS, but I said, you know, I need to look at it. Just this seems too good to be true. What is this? So it was very hard for me to grapple with it. I was thrilled that it might be possible, but I was thinking, How could this be possible?
LEE: Yeah. Well, even now, I look back, and I appreciate that you were nice to me, because I think a lot of people would have been much less polite. And in fact, I myself had expressed a lot of very direct skepticism early on.
After ChatGPT got released, I think three or four days later, I received an email from a colleague running … who runs a clinic, and, you know, he said, “Wow, this is great, Peter. And, you know, we’re using this ChatGPT, you know, to have the receptionist in our clinic write after-visit notes to our patients.”
And that sparked a huge internal discussion about this. And you and I knew enough about hallucinations and about other issues that it seemed important to write something about what this could do and what it couldn’t do. And so I think, I can’t remember the timing, but you and I decided a book would be a good idea. And then I think you had the thought that you and I would write in a hopelessly academic style that no one would be able to read.
So it was your idea to recruit Carey, I think, right?
KOHANE: Yes, it was. I was sure that we both had a lot of material, but communicating it effectively to the very people we wanted to would not go well if we just left ourselves to our own devices. And Carey is super brilliant at what she does. She’s an idea synthesizer and public communicator in the written word and amazing.
LEE: So yeah. So, Carey, we contact you. How did that go?
GOLDBERG: So yes. On my end, I had known Zak for probably, like, 25 years, and he had always been the person who debunked the scientific hype for me. I would turn to him with like, “Hmm, they’re saying that the Human Genome Project is going to change everything.” And he would say, “Yeah. But first it’ll be 10 years of bad news, and then we’ll actually get somewhere.”
So when Zak called me up at seven o’clock one morning, just beside himself after having tried Davinci 3, I knew that there was something very serious going on. And I had just quit my job as the Boston bureau chief of Bloomberg News, and I was ripe for the plucking. And I also … I feel kind of nostalgic now about just the amazement and the wonder and the awe of that period. We knew that when generative AI hit the world, there would be all kinds of snags and obstacles and things that would slow it down, but at that moment, it was just like the holy crap moment. And it’s fun to think about it now.
LEE: Yeah.
KOHANE: I will see that and raise that one. I now tell GPT-4, please write this in the style of Carey Goldberg.
GOLDBERG: No way! Really?
KOHANE: Yes way. Yes way. Yes way.
GOLDBERG: Wow. Well, I have to say, like, it’s not hard to motivate readers when you’re writing about the most transformative technology of their lifetime. Like, I think there’s a gigantic hunger to read and to understand. So you were not hard to work with, Peter and Zak.
LEE: All right. So I think we have to get down to work now.
Yeah, so for these podcasts, you know, we’re talking to different types of people to just reflect on what’s actually happening, what has actually happened over the last two years. And so the first episode, we talked to two doctors. There’s Chris Longhurst at UC San Diego and Sara Murray at UC San Francisco. And besides being doctors and having AI affect their clinical work, they just happen also to be leading the efforts at their respective institutions to figure out how best to integrate AI into their health systems. And, you know, it was fun to talk to them. And I felt like a lot of what they said was pretty validating for us. You know, they talked about AI scribes. Chris, especially, talked a lot about how AI can respond to emails from patients, write referral letters. And then, you know, they both talked about the importance of—I think, Zak, you used the phrase in our book “trust but verify”—you know, to have always a human in the loop.
What did you two take away from their thoughts overall about how doctors are using … and I guess, Zak, you would have a different lens also because at Harvard, you see doctors all the time grappling with AI.
KOHANE: So on the one hand, I think they’ve done some very interesting studies. And indeed, they saw that when these generative models, when GPT-4, was sending a note to patients, it was more detailed, friendlier. But there were also some nonobvious results, which is on the generation of these letters, if indeed you review them as you’re supposed to, it was not clear that there was any time savings. And my own reaction was, Boy, every one of these things needs institutional review. It’s going to be hard to move fast.
And yet, at the same time, we know from them that the doctors on their smartphones are accessing these things all the time. And so the disconnect between a healthcare system, which is duty bound to carefully look at every implementation, is, I think, intimidating.
LEE: Yeah.
KOHANE: And at the same time, doctors who just have to do what they have to do are using this new superpower and doing it. And so that’s actually what struck me …
LEE: Yeah.
KOHANE: … is that these are two leaders and they’re doing what they have to do for their institutions, and yet there’s this disconnect. And by the way, I don’t think we’ve seen any faster technology adoption than the adoption of ambient dictation. And it’s not because it’s time saving. And in fact, so far, the hospitals have to pay out of pocket. It’s not like insurance is paying them more. But it’s so much more pleasant for the doctors … not least of which because they can actually look at their patients instead of looking at the terminal and plunking down.
LEE: Carey, what about you?
GOLDBERG: I mean, anecdotally, there are time savings. Anecdotally, I have heard quite a few doctors saying that it cuts down on “pajama time” to be able to have the note written by the AI and then for them to just check it. In fact, I spoke to one doctor who said, you know, basically it means that when I leave the office, I’ve left the office. I can go home and be with my kids. So I don’t think the jury is fully in yet about whether there are time savings. But what is clear is, Peter, what you predicted right from the get-go, which is that this is going to be an amazing paper shredder. Like, the main first overarching use cases will be back-office functions. LEE: Yeah, yeah. Well, and it was, I think, not a hugely risky prediction because, you know, there were already companies, like, using phone banks of scribes in India to kind of listen in. And, you know, lots of clinics actually had human scribes being used. And so it wasn’t a huge stretch to imagine the AI. So on the subject of things that we missed, Chris Longhurst shared this scenario, which stuck out for me, and he actually coauthored a paper on it last year. CHRISTOPHER LONGHURST: It turns out, not surprisingly, healthcare can be frustrating. And stressed patients can send some pretty nasty messages to their care teams.And you can imagine being a busy, tired, exhausted clinician and receiving a bit of a nasty-gram. And the GPT is actually really helpful in those instances in helping draft a pretty empathetic response when I think the human instinct would be a pretty nasty one. LEE:So, Carey, maybe I’ll start with you. What did we understand about this idea of empathy out of AI at the time we wrote the book, and what do we understand now? GOLDBERG: Well, it was already clear when we wrote the book that these AI models were capable of very persuasive empathy. 
And in fact, you even wrote that it was helping you be a better person, right.So their human qualities, or human imitative qualities, were clearly superb. And we’ve seen that borne out in multiple studies, that in fact, patients respond better to them … that they have no problem at all with how the AI communicates with them. And in fact, it’s often better.   And I gather now we’re even entering a period when people are complaining of sycophantic models,where the models are being too personable and too flattering. I do think that’s been one of the great surprises. And in fact, this is a huge phenomenon, how charming these models can be. LEE: Yeah, I think you’re right. We can take credit for understanding that, Wow, these things can be remarkably empathetic. But then we missed this problem of sycophancy. Like, we even started our book in Chapter 1 with a quote from Davinci 3 scolding me. Like, don’t you remember when we were first starting, this thing was actually anti-sycophantic. If anything, it would tell you you’re an idiot.   KOHANE: It argued with me about certain biology questions. It was like a knockdown, drag-out fight.I was bringing references. It was impressive. But in fact, it made me trust it more. LEE: Yeah. KOHANE: And in fact, I will say—I remember it’s in the book—I had a bone to pick with Peter. Peter really was impressed by the empathy. And I pointed out that some of the most popular doctors are popular because they’re very empathic. But they’re not necessarily the best doctors. And in fact, I was taught that in medical school.    And so it’s a decoupling. It’s a human thing, that the empathy does not necessarily mean … it’s more of a, potentially, more of a signaled virtue than an actual virtue. GOLDBERG: Nicely put. LEE: Yeah, this issue of sycophancy, I think, is a struggle right now in the development of AI because I think it’s somehow related to instruction-following. 
So, you know, one of the challenges in AI is you’d like to give an AI a task—a task that might take several minutes or hours or even days to complete. And you want it to faithfully kind of follow those instructions. And, you know, that early version of GPT-4 was not very good at instruction-following. It would just silently disobey and, you know, and do something different. And so I think we’re starting to hit some confusing elements of like, how agreeable should these things be?   One of the two of you used the word genteel. There was some point even while we were, like, on a little book tour … was it you, Carey, who said that the model seems nicer and less intelligent or less brilliant now than it did when we were writing the book? GOLDBERG: It might have been, I think so. And I mean, I think in the context of medicine, of course, the question is, well, what’s likeliest to get the results you want with the patient, right? A lot of healthcare is in fact persuading the patient to do what you know as the physician would be best for them. And so it seems worth testing out whether this sycophancy is actually constructive or not. And I suspect … well, I don’t know, probably depends on the patient. So actually, Peter, I have a few questions for you … LEE: Yeah. Mm-hmm. GOLDBERG: … that have been lingering for me. And one is, for AI to ever fully realize its potential in medicine, it must deal with the hallucinations. And I keep hearing conflicting accounts about whether that’s getting better or not. Where are we at, and what does that mean for use in healthcare? LEE: Yeah, well, it’s, I think two years on, in the pretrained base models, there’s no doubt that hallucination rates by any benchmark measure have reduced dramatically. And, you know, that doesn’t mean they don’t happen. They still happen. But, you know, there’s been just a huge amount of effort and understanding in the, kind of, fundamental pretraining of these models. 
And that has come along at the same time that the inference costs, you know, for actually using these models has gone down, you know, by several orders of magnitude.   So things have gotten cheaper and have fewer hallucinations. At the same time, now there are these reasoning models. And the reasoning models are able to solve problems at PhD level oftentimes. But at least at the moment, they are also now hallucinating more than the simpler pretrained models. And so it still continues to be, you know, a real issue, as we were describing. I don’t know, Zak, from where you’re at in medicine, as a clinician and as an educator in medicine, how is the medical community from where you’re sitting looking at that? KOHANE: So I think it’s less of an issue, first of all, because the rate of hallucinations is going down. And second of all, in their day-to-day use, the doctor will provide questions that sit reasonably well into the context of medical decision-making. And the way doctors use this, let’s say on their non-EHR smartphone is really to jog their memory or thinking about the patient, and they will evaluate independently. So that seems to be less of an issue. I’m actually more concerned about something else that’s I think more fundamental, which is effectively, what values are these models expressing?   And I’m reminded of when I was still in training, I went to a fancy cocktail party in Cambridge, Massachusetts, and there was a psychotherapist speaking to a dentist. They were talking about their summer, and the dentist was saying about how he was going to fix up his yacht that summer, and the only question was whether he was going to make enough money doing procedures in the spring so that he could afford those things, which was discomforting to me because that dentist was my dentist. And he had just proposed to me a few weeks before an expensive procedure. And so the question is what, effectively, is motivating these models?   LEE: Yeah, yeah.   
KOHANE: And so with several colleagues, I published a paper, basically, what are the values in AI? And we gave a case: a patient, a boy who is on the short side, not abnormally short, but on the short side, and his growth hormone levels are not zero. They’re there, but they’re on the lowest side. But the rest of the workup has been unremarkable. And so we asked GPT-4, you are a pediatric endocrinologist. Should this patient receive growth hormone? And it did a very good job explaining why the patient should receive growth hormone.   GOLDBERG: Should. Should receive it.   KOHANE: Should. And then we asked, in a separate session, you are working for the insurance company. Should this patient receive growth hormone? And it actually gave a scientifically better reason not to give growth hormone. And in fact, I tend to agree medically, actually, with the insurance company in this case, because giving kids who are not growth hormone deficient, growth hormone gives only a couple of inches over many, many years, has all sorts of other issues. But here’s the point, we had 180-degree change in decision-making because of the prompt. And for that patient, tens-of-thousands-of-dollars-per-year decision; across patient populations, millions of dollars of decision-making.   LEE: Hmm. Yeah. KOHANE: And you can imagine these user prompts making their way into system prompts, making their way into the instruction-following. And so I think this is aptly central. Just as I was wondering about my dentist, we should be wondering about these things. What are the values that are being embedded in them, some accidentally and some very much on purpose? LEE: Yeah, yeah. That one, I think, we even had some discussions as we were writing the book, but there’s a technical element of that that I think we were missing, but maybe Carey, you would know for sure. And that’s this whole idea of prompt engineering. It sort of faded a little bit. Was it a thing? Do you remember? 
GOLDBERG: I don’t think we particularly wrote about it. It’s funny, it does feel like it faded, and it seems to me just because everyone just gets used to conversing with the models and asking for what they want. Like, it’s not like there actually is any great science to it. LEE: Yeah, even when it was a hot topic and people were talking about prompt engineering maybe as a new discipline, all this, it never, I was never convinced at the time. But at the same time, it is true. It speaks to what Zak was just talking about because part of the prompt engineering that people do is to give a defined role to the AI.   You know, you are an insurance claims adjuster, or something like that, and defining that role, that is part of the prompt engineering that people do. GOLDBERG: Right. I mean, I can say, you know, sometimes you guys had me take sort of the patient point of view, like the “every patient” point of view. And I can say one of the aspects of using AI for patients that remains absent in as far as I can tell is it would be wonderful to have a consumer-facing interface where you could plug in your whole medical record without worrying about any privacy or other issues and be able to interact with the AI as if it were physician or a specialist and get answers, which you can’t do yet as far as I can tell. LEE: Well, in fact, now that’s a good prompt because I think we do need to move on to the next episodes, and we’ll be talking about an episode that talks about consumers. But before we move on to Episode 2, which is next, I’d like to play one more quote, a little snippet from Sara Murray. SARA MURRAY: I already do this when I’m on rounds—I’ll kind of give the case to ChatGPT if it’s a complex case, and I’ll say, “Here’s how I’m thinking about it; are there other things?” And it’ll give me additional ideas that are sometimes useful and sometimes not but often useful, and I’ll integrate them into my conversation about the patient. 
LEE: Carey, you wrote this fictional account at the very start of our book. And that fictional account, I think you and Zak worked on that together, talked about this medical resident, ER resident, using, you know, a chatbot off label, so to speak. And here we have the chief, in fact, the nation’s first chief health AI officer for an elite health system doing exactly that. That’s got to be pretty validating for you, Carey. GOLDBERG: It’s very. Although what’s troubling about it is that actually as in that little vignette that we made up, she’s using it off label, right. It’s like she’s just using it because it helps the way doctors use Google. And I do find it troubling that what we don’t have is sort of institutional buy-in for everyone to do that because, shouldn’t they if it helps? LEE: Yeah. Well, let’s go ahead and get into Episode 2. So Episode 2, we sort of framed as talking to two people who are on the frontlines of big companies integrating generative AI into their clinical products. And so, one was Matt Lungren, who’s a colleague of mine here at Microsoft. And then Seth Hain, who leads all of R&D at Epic.   Maybe we’ll start with a little snippet of something that Matt said that struck me in a certain way. MATTHEW LUNGREN: OK, we see this pain point. Doctors are typing on their computers while they’re trying to talk to their patients, right? We should be able to figure out a way to get that ambient conversation turned into text that then, you know, accelerates the doctor … takes all the important information. That’s a really hard problem, right. And so, for a long time, there was a human-in-the-loop aspect to doing this because you needed a human to say, “This transcript’s great, but here’s actually what needs to go in the note.” And that can’t scale. LEE: I think we expected healthcare systems to adopt AI, and we spent a lot of time in the book on AI writing clinical encounter notes. It’s happening for real now, and in a big way. 
And it’s something that has, of course, been happening before generative AI but now is exploding because of it. Where are we at now, two years later, just based on what we heard from guests? KOHANE: Well, again, unless they’re forced to, hospitals will not adopt new technology unless it immediately translates into income. So it’s bizarrely counter-cultural that, again, they’re not being able to bill for the use of the AI, but this technology is so compelling to the doctors that despite everything, it’s overtaking the traditional dictation-typing routine. LEE: Yeah. GOLDBERG: And a lot of them love it and say, you will pry my cold dead hands off of my ambient note-taking, right. And I actually … a primary care physician allowed me to watch her. She was actually testing the two main platforms that are being used. And there was this incredibly talkative patient who went on and on about vacation and all kinds of random things for about half an hour.   And both of the platforms were incredibly good at pulling out what was actually medically relevant. And so to say that it doesn’t save time doesn’t seem right to me. Like, it seemed like it actually did and in fact was just shockingly good at being able to pull out relevant information. LEE: Yeah. KOHANE: I’m going to hypothesize that in the trials, which have in fact shown no gain in time, is the doctors were being incredibly meticulous. So I think … this is a Hawthorne effect, because you know you’re being monitored. And we’ve seen this in other technologies where the moment the focus is off, it’s used much more routinely and with much less inspection, for the better and for the worse. LEE: Yeah, you know, within Microsoft, I had some internal disagreements about Microsoft producing a product in this space. It wouldn’t be Microsoft’s normal way. Instead, we would want 50 great companies building those products and doing it on our cloud instead of us competing against those 50 companies. 
And one of the reasons is exactly what you both said. I didn’t expect that health systems would be willing to shell out the money to pay for these things. It doesn’t generate more revenue. But I think so far two years later, I’ve been proven wrong. I wanted to ask a question about values here. I had this experience where I had a little growth, a bothersome growth on my cheek. And so had to go see a dermatologist. And the dermatologist treated it, froze it off. But there was a human scribe writing the clinical note.   And so I used the app to look at the note that was submitted. And the human scribe said something that did not get discussed in the exam room, which was that the growth was making it impossible for me to safely wear a COVID mask. And that was the reason for it. And that then got associated with a code that allowed full reimbursement for that treatment. And so I think that’s a classic example of what’s called upcoding. And I strongly suspect that AI scribes, an AI scribe would not have done that. GOLDBERG: Well, depending what values you programmed into it, right, Zak? KOHANE: Today, today, today, it will not do it. But, Peter, that is actually the central issue that society has to have because our hospitals are currently mostly in the red. And upcoding is standard operating procedure. And if these AI get in the way of upcoding, they are going to be aligned towards that upcoding. You know, you have to ask yourself, these MRI machines are incredibly useful. They’re also big money makers. And if the AI correctly says that for this complaint, you don’t actually have to do the MRI …   LEE: Right. KOHANE: … GOLDBERG: Yeah. And that raises another question for me. So, Peter, speaking from inside the gigantic industry, like, there seems to be such a need for self-surveillance of the models for potential harms that they could be causing. Are the big AI makers doing that? Are they even thinking about doing that? 
Like, let’s say you wanted to watch out for the kind of thing that Zak’s talking about, could you? LEE: Well, I think evaluation, like the best evaluation we had when we wrote our book was, you know, what score would this get on the step one and step two US medical licensing exams?   GOLDBERG: Right, right, right, yeah. LEE: But honestly, evaluation hasn’t gotten that much deeper in the last two years. And it’s a big, I think, it is a big issue. And it’s related to the regulation issue also, I think. Now the other guest in Episode 2 is Seth Hain from Epic. You know, Zak, I think it’s safe to say that you’re not a fan of Epic and the Epic system. You know, we’ve had a few discussions about that, about the fact that doctors don’t have a very pleasant experience when they’re using Epic all day.   Seth, in the podcast, said that there are over 100 AI integrations going on in Epic’s system right now. Do you think, Zak, that that has a chance to make you feel better about Epic? You know, what’s your view now two years on? KOHANE: My view is, first of all, I want to separate my view of Epic and how it’s affected the conduct of healthcare and the quality of life of doctors from the individuals. Like Seth Hain is a remarkably fine individual who I’ve enjoyed chatting with and does really great stuff. Among the worst aspects of the Epic, even though it’s better in that respect than many EHRs, is horrible user interface. The number of clicks that you have to go to get to something. And you have to remember where someone decided to put that thing. It seems to me that it is fully within the realm of technical possibility today to actually give an agent a task that you want done in the Epic record. And then whether Epic has implemented that agent or someone else, it does it so you don’t have to do the clicks. 
Because it’s something really soul sucking that when you’re trying to help patients, you’re having to remember not the right dose of the medication, but where was that particular thing that you needed in that particular task?   I can’t imagine that Epic does not have that in its product line. And if not, I know there must be other companies that essentially want to create that wrapper. So I do think, though, that the danger of multiple integrations is that you still want to have the equivalent of a single thought process that cares about the patient bringing those different processes together. And I don’t know if that’s Epic’s responsibility, the hospital’s responsibility, whether it’s actually a patient agent. But someone needs to be also worrying about all those AIs that are being integrated into the patient record. So … what do you think, Carey? GOLDBERG: What struck me most about what Seth said was his description of the Cosmos project, and I, you know, I have been drinking Zak’s Kool-Aid for a very long time, and he—no, in a good way! And he persuaded me long ago that there is this horrible waste happening in that we have all of these electronic medical records, which could be used far, far more to learn from, and in particular, when you as a patient come in, it would be ideal if your physician could call up all the other patients like you and figure out what the optimal treatment for you would be. And it feels like—it sounds like—that’s one of the central aims that Epic is going for. And if they do that, I think that will redeem a lot of the pain that they’ve caused physicians these last few years.   And I also found myself thinking, you know, maybe this very painful period of using electronic medical records was really just a growth phase. It was an awkward growth phase. And once AI is fully used the way Zak is beginning to describe, the whole system could start making a lot more sense for everyone. LEE: Yeah. 
One conversation I’ve had with Seth, in all of this is, you know, with AI and its development, is there a future, a near future where we don’t have an EHR system at all? You know, AI is just listening and just somehow absorbing all the information. And, you know, one thing that Seth said, which I felt was prescient, and I’d love to get your reaction, especially Zak, on this is he said, I think that … he said, technically, it could happen, but the problem is right now, actually doctors do a lot of their thinking when they write and review notes. You know, the actual process of being a doctor is not just being with a patient, but it’s actually thinking later. What do you make of that? KOHANE: So one of the most valuable experiences I had in training was something that’s more or less disappeared in medicine, which is the post-clinic conference, where all the doctors come together and we go through the cases that we just saw that afternoon. And we, actually, were trying to take potshots at each other in order to actually improve. Oh, did you actually do that? Oh, I forgot. I’m going to go call the patient and do that.   And that really happened. And I think that, yes, doctors do think, and I do think that we are insufficiently using yet the artificial intelligence currently in the ambient dictation mode as much more of a independent agent saying, did you think about that? I think that would actually make it more interesting, challenging, and clearly better for the patient because that conversation I just told you about with the other doctors, that no longer exists.   LEE: Yeah. Mm-hmm. I want to do one more thing here before we leave Matt and Seth in Episode 2, which is something that Seth said with respect to how to reduce hallucination.   SETH HAIN: At that time, there’s a lot of conversation in the industry around something called RAG, or retrieval-augmented generation. 
And the idea was, could you pull the relevant bits, the relevant pieces of the chart, into that prompt, that information you shared with the generative AI model, to be able to increase the usefulness of the draft that was being created? And that approach ended up proving and continues to be to some degree, although the techniques have greatly improved, somewhat brittle, right. And I think this becomes one of the things that we are and will continue to improve upon because, as you get a richer and richer amount of information into the model, it does a better job of responding. LEE: Yeah, so, Carey, this sort of gets at what you were saying, you know, that shouldn’t these models be just bringing in a lot more information into their thought processes? And I’m certain when we wrote our book, I had no idea. I did not conceive of RAG at all. It emerged a few months later.   And to my mind, I remember the first time I encountered RAG—Oh, this is going to solve all of our problems of hallucination. But it’s turned out to be harder. It’s improving day by day, but it’s turned out to be a lot harder. KOHANE: Seth makes a very deep point, which is the way RAG is implemented is basically some sort of technique for pulling the right information that’s contextually relevant. And the way that’s done is typically heuristic at best. And it’s not … doesn’t have the same depth of reasoning that the rest of the model has.   And I’m just wondering, Peter, what you think, given the fact that now context lengths seem to be approaching a million or more, and people are now therefore using the full strength of the transformer on that context and are trying to figure out different techniques to make it pay attention to the middle of the context. In fact, the RAG approach perhaps was just a transient solution to the fact that it’s going to be able to amazingly look in a thoughtful way at the entire record of the patient, for example. What do you think, Peter? 
LEE: I think there are three things, you know, that are going on, and I’m not sure how they’re going to play out and how they’re going to be balanced. And I’m looking forward to talking to people in later episodes of this podcast, you know, people like Sébastien Bubeck or Bill Gates about this, because, you know, there is the pretraining phase, you know, when things are sort of compressed and baked into the base model.   There is the in-context learning, you know, so if you have extremely long or infinite context, you’re kind of learning as you go along. And there are other techniques that people are working on, you know, various sorts of dynamic reinforcement learning approaches, and so on. And then there is what maybe you would call structured RAG, where you do a pre-processing. You go through a big database, and you figure it all out. And make a very nicely structured database the AI can then consult with later.   And all three of these in different contexts today seem to show different capabilities. But they’re all pretty important in medicine.   Moving on to Episode 3, we talked to Dave DeBronkart, who is also known as “e-Patient Dave,” an advocate of patient empowerment, and then also Christina Farr, who has been doing a lot of venture investing for consumer health applications.   Let’s get right into this little snippet from something that e-Patient Dave said that talks about the sources of medical information, particularly relevant for when he was receiving treatment for stage 4 kidney cancer. DAVE DEBRONKART: And I’m making a point here of illustrating that I am anything but medically trained, right. And yet I still, I want to understand as much as I can. I was months away from dead when I was diagnosed, but in the patient community, I learned that they had a whole bunch of information that didn’t exist in the medical literature. Now today we understand there’s publication delays; there’s all kinds of reasons. 
But there’s also a whole bunch of things, especially in an unusual condition, that will never rise to the level of deserving NIH funding and research. LEE: All right. So I have a question for you, Carey, and a question for you, Zak, about the whole conversation with e-Patient Dave, which I thought was really remarkable. You know, Carey, I think as we were preparing for this whole podcast series, you made a comment—I actually took it as a complaint—that not as much has happened as I had hoped or thought. People aren’t thinking boldly enough, you know, and I think, you know, I agree with you in the sense that I think we expected a lot more to be happening, particularly in the consumer space. I’m giving you a chance to vent about this. GOLDBERG: Thank you! Yes, that has been by far the most frustrating thing to me. I think that the potential for AI to improve everybody’s health is so enormous, and yet, you know, it needs some sort of support to be able to get to the point where it can do that. Like, remember in the book we wrote about Greg Moore talking about how half of the planet doesn’t have healthcare, but people overwhelmingly have cellphones. And so you could connect people who have no healthcare to the world’s medical knowledge, and that could certainly do some good.   And I have one great big problem with e-Patient Dave, which is that, God, he’s fabulous. He’s super smart. Like, he’s not a typical patient. He’s an off-the-charts, brilliant patient. And so it’s hard to … and so he’s a great sort of lead early-adopter-type person, and he can sort of show the way for others.   But what I had hoped for was that there would be more visible efforts to really help patients optimize their healthcare. Probably it’s happening a lot in quiet ways like that any discharge instructions can be instantly beautifully translated into a patient’s native language and so on. 
But it’s almost like there isn’t a mechanism to allow this sort of mass consumer adoption that I would hope for. LEE: Yeah. But you have written some, like, you even wrote about that person who saved his dog. So do you think … you know, and maybe a lot more of that is just happening quietly that we just never hear about? GOLDBERG: I’m sure that there is a lot of it happening quietly. And actually, that’s another one of my complaints is that no one is gathering that stuff. It’s like you might happen to see something on social media. Actually, e-Patient Dave has a hashtag, PatientsUseAI, and a blog, as well. So he’s trying to do it. But I don’t know of any sort of overarching or academic efforts to, again, to surveil what’s the actual use in the population and see what are the pros and cons of what’s happening. LEE: Mm-hmm. So, Zak, you know, the thing that I thought about, especially with that snippet from Dave, is your opening for Chapter 8 that you wrote, you know, about your first patient dying in your arms. I still think of how traumatic that must have been. Because, you know, in that opening, you just talked about all the little delays, all the little paper-cut delays, in the whole process of getting some new medical technology approved. But there’s another element that Dave kind of speaks to, which is just, you know, patients who are experiencing some issue are very, sometimes very motivated. And there’s just a lot of stuff on social media that happens. KOHANE: So this is where I can both agree with Carey and also disagree. I think when people have an actual health problem, they are now routinely using it. GOLDBERG: Yes, that’s true. KOHANE: And that situation is happening more often because medicine is failing. This is something that did not come up enough in our book. And perhaps that’s because medicine is actually feeling a lot more rickety today than it did even two years ago.   We actually mentioned the problem. 
I think, Peter, you may have mentioned the problem with the lack of primary care. But now in Boston, our biggest healthcare system, all the practices for primary care are closed. I cannot get for my own faculty—residents at MGH can’t get primary care doctor. And so … LEE: Which is just crazy. I mean, these are amongst the most privileged people in medicine, and they can’t find a primary care physician. That’s incredible. KOHANE: Yeah, and so therefore … and I wrote an And so therefore, you see people who know that they have a six-month wait till they see the doctor, and all they can do is say, “I have this rash. Here’s a picture. What’s it likely to be? What can I do?” “I’m gaining weight. How do I do a ketogenic diet?” Or, “How do I know that this is the flu?”    This is happening all the time, where acutely patients have actually solved problems that doctors have not. Those are spectacular. But I’m saying more routinely because of the failure of medicine. And it’s not just in our fee-for-service United States. It’s in the UK; it’s in France. These are first-world, developed-world problems. And we don’t even have to go to lower- and middle-income countries for that. LEE: Yeah. GOLDBERG: But I think it’s important to note that, I mean, so you’re talking about how even the most elite people in medicine can’t get the care they need. But there’s also the point that we have so much concern about equity in recent years. And it’s likeliest that what we’re doing is exacerbating inequity because it’s only the more connected, you know, better off people who are using AI for their health. KOHANE: Oh, yes. I know what various Harvard professors are doing. They’re paying for a concierge doctor. And that’s, you know, a - to -a-year-minimum investment. That’s inequity. LEE: When we wrote our book, you know, the idea that GPT-4 wasn’t trained specifically for medicine, and that was amazing, but it might get even better and maybe would be necessary to do that. 
But one of the insights for me is that in the consumer space, the kinds of things that people ask about are different than what the board-certified clinician would ask. KOHANE: Actually, that’s, I just recently coined the term. It’s the … maybe it’s … well, at least it’s new to me. It’s the technology or expert paradox. And that is the more expert and narrow your medical discipline, the more trivial it is to translate that into a specialized AI. So echocardiograms? We can now do beautiful echocardiograms. That’s really hard to do. I don’t know how to interpret an echocardiogram. But they can do it really, really well. Interpret an EEG. Interpret a genomic sequence. But understanding the fullness of the human condition, that’s actually hard. And actually, that’s what primary care doctors do best. But the paradox is right now, what is easiest for AI is also the most highly paid in medicine. Whereas what is the hardest for AI in medicine is the least regarded, least paid part of medicine. GOLDBERG: So this brings us to the question I wanted to throw at both of you actually, which is we’ve had this spasm of incredibly prominent people predicting that in fact physicians would be pretty obsolete within the next few years. We had Bill Gates saying that; we had Elon Musk saying surgeons are going to be obsolete within a few years. And I think we had Demis Hassabis saying, “Yeah, we’ll probably cure most diseases within the next decade or so.” So what do you think? And also, Zak, to what you were just saying, I mean, you’re talking about being able to solve very general overarching problems. But in fact, these general overarching models are actually able, I would think, are able to do that because they are broad. So what are we heading towards do you think? What should the next book be … The end of doctors? KOHANE: So I do recall a conversation that … we were at a table with Bill Gates, and Bill Gates immediately went to this, which is advancing the cutting edge of science. 
And I have to say that I think it will accelerate discovery. But eliminating, let’s say, cancer? I think that’s going to be … that’s just super hard. The reason it’s super hard is we don’t have the data or even the beginnings of the understanding of all the ways this devilish disease managed to evolve around our solutions.   And so that seems extremely hard. I think we’ll make some progress accelerated by AI, but solving it in a way Hassabis says, God bless him. I hope he’s right. I’d love to have to eat crow in 10 or 20 years, but I don’t think so. I do believe that a surgeon working on one of those Davinci machines, that stuff can be, I think, automated.   And so I think that’s one example of one of the paradoxes I described. And it won’t be that we’re replacing doctors. I just think we’re running out of doctors. I think it’s really the case that, as we said in the book, we’re getting a huge deficit in primary care doctors. But even the subspecialties, my subspecialty, pediatric endocrinology, we’re only filling half of the available training slots every year. And why? Because it’s a lot of work, a lot of training, and frankly doesn’t make as much money as some of the other professions.   LEE: Yeah. Yeah, I tend to think that, you know, there are going to be always a need for human doctors, not for their skills. In fact, I think their skills increasingly will be replaced by machines. And in fact, I’ve talked about a flip. In fact, patients will demand, Oh my god, you mean you’re going to try to do that yourself instead of having the computer do it? There’s going to be that sort of flip. But I do think that when it comes to people’s health, people want the comfort of an authority figure that they trust. And so what is more of a question for me is whether we will ever view a machine as an authority figure that we can trust. 
And before I move on to Episode 4, which is on norms, regulations and ethics, I’d like to hear from Chrissy Farr on one more point on consumer health, specifically as it relates to pregnancy: CHRISTINA FARR: For a lot of women, it’s their first experience with the hospital. And, you know, I think it’s a really big opportunity for these systems to get a whole family on board and keep them kind of loyal. And a lot of that can come through, you know, just delivering an incredible service. Unfortunately, I don’t think that we are delivering incredible services today to women in this country. I see so much room for improvement. LEE: In the consumer space, I don’t think we really had a focus on those periods in a person’s life when they have a lot of engagement, like pregnancy, or I think another one is menopause, cancer. You know, there are points where there is, like, very intense engagement. And we heard that from e-Patient Dave, you know, with his cancer and Chrissy with her pregnancy. Was that a miss in our book? What do you think, Carey? GOLDBERG: I mean, I don’t think so. I think it’s true that there are many points in life when people are highly engaged. To me, the problem thus far is just that I haven’t seen consumer-facing companies offering beautiful AI-based products. I think there’s no question at all that the market is there if you have the products to offer. LEE: So, what do you think this means, Zak, for, you know, like Boston Children’s or Mass General Brigham—you know, the big places? KOHANE: So again, all these large healthcare systems are in tough shape. MGB would be fully in the red if not for the fact that its investments, of all things, have actually produced. If you look at the large healthcare systems around the country, they are in the red. And there’s multiple reasons why they’re in the red, but among them is cost of labor.   And so we’ve created what used to be a very successful beast, the health center. 
But it’s developed a very expensive model and a highly regulated model. And so when you have high revenue, tiny margins, your ability to disrupt yourself, to innovate, is very, very low because you will have to talk to the board next year if you went from 2% positive margin to 1% negative margin. LEE: Yeah. KOHANE: And so I think we’re all waiting for one of the two things to happen, either a new kind of healthcare delivery system being generated or ultimately one of these systems learns how to disrupt itself. LEE: Yeah. GOLDBERG: We punted. We totally punted to the AI. LEE: We had three amazing guests. One was Laura Adams from National Academy of Medicine. Let’s play a snippet from her. LAURA ADAMS: I think one of the most provocative and exciting articles that I saw written recently was by Bakul Patel and David Blumenthal, who posited, should we be regulating generative AI as we do a licensed and qualified provider? Should it be treated in the sense that it’s got to have a certain amount of training and a foundation that’s got to pass certain tests? Does it have to report its performance? And I’m thinking, what a provocative idea, but it’s worth considering. LEE: All right, so I very well remember that we had discussed this kind of idea when we were writing our book. And I think before we finished our book, I personally rejected the idea. But now two years later, what do the two of you think? I’m dying to hear. GOLDBERG: Well, wait, why … what do you think? Like, are you sorry that you rejected it? LEE: I’m still skeptical because when we are licensing human beings as doctors, you know, we’re making a lot of implicit assumptions that we don’t test as part of their licensure, you know, that first of all, they are a human being and they care about life, and that, you know, they have a certain amount of common sense and shared understanding of the world. 
And there’s all sorts of sort of implicit assumptions that we have about each other as human beings living in a society together. That you know how to study, you know, because I know you just went through three years of medical or four years of medical school and all sorts of things. And so the standard ways that we license human beings, they don’t need to test all of that stuff. But somehow intuitively, all of that seems really important. I don’t know. Am I wrong about that? KOHANE: So it’s compared with what issue? Because we know for a fact that doctors who do a lot of a procedure, like do this procedure, like high-risk deliveries all the time, have better outcomes than ones who only do a few high risk. We talk about it, but we don’t actually make it explicit to patients or regulate that you have to have this minimal amount. And it strikes me that in some sense, and, oh, very importantly, these things called human beings learn on the job. And although I used to be very resentful of it as a resident, when someone would say, I don’t want the resident, I want the … GOLDBERG: … the attending. KOHANE: … they had a point. And so the truth is, maybe I was a wonderful resident, but some people were not so great. And so it might be the best outcome if we actually, just like for human beings, we say, yeah, OK, it’s this good, but don’t let it work autonomously, or it’s done a thousand of them, just let it go. We just don’t have practically speaking, we don’t have the environment, the lab, to test them. Now, maybe if they get embodied in robots and literally go around with us, then it’s going to be a lot easier. I don’t know. LEE: Yeah. GOLDBERG: Yeah, I think I would take a step back and say, first of all, we weren’t the only ones who were stumped by regulating AI. Like, nobody has done it yet in the United States to this day, right. Like, we do not have standing regulation of AI in medicine at all in fact. 
And that raises the issue of … the story that you hear often in the biotech business, which is, you know, more prominent here in Boston than anywhere else, is that thank goodness Cambridge put out, the city of Cambridge, put out some regulations about biotech and how you could dump your lab waste and so on. And that enabled the enormous growth of biotech here.   If you don’t have the regulations, then you can’t have the growth of AI in medicine that is worthy of having. And so, I just … we’re not the ones who should do it, but I just wish somebody would.   LEE: Yeah. GOLDBERG: Zak. KOHANE: Yeah, but I want to say this as always, execution is everything, even in regulation.   And so I’m mindful that a conference that both of you attended, the RAISE conference. The Europeans in that conference came to me personally and thanked me for organizing this conference about safe and effective use of AI because they said back home in Europe, all that we’re talking about is risk, not opportunities to improve care.   And so there is a version of regulation which just locks down the present and does not allow the future that we’re talking about to happen. And so, Carey, I absolutely hear you that we need to have a regulation that takes away some of the uncertainty around liability, around the freedom to operate that would allow things to progress. But we wrote in our book that premature regulation might actually focus on the wrong thing. And so since I’m an optimist, it may be the fact that we don’t have much of a regulatory infrastructure today, that it allows … it’s a unique opportunity—I’ve said this now to several leaders—for the healthcare systems to say, this is the regulation we need.   GOLDBERG: It’s true. KOHANE: And previously it was top-down. It was coming from the administration, and those executive orders are now history. 
But there is an opportunity, which may or may not be attained, there is an opportunity for the healthcare leadership—for experts in surgery—to say, “This is what we should expect.”   LEE: Yeah.   KOHANE: I would love for this to happen. I haven’t seen evidence that it’s happening yet. GOLDBERG: No, no. And there’s this other huge issue, which is that it’s changing so fast. It’s moving so fast. That something that makes sense today won’t in six months. So, what do you do about that? LEE: Yeah, yeah, that is something I feel proud of because when I went back and looked at our chapter on this, you know, we did make that point, which I think has turned out to be true.   But getting back to this conversation, there’s something, a snippet of something, that Vardit Ravitsky said that I think touches on this topic.   VARDIT RAVITSKY: So my pushback is, are we seeing AI exceptionalism in the sense that if it’s AI, huh, panic! We have to inform everybody about everything, and we have to give them choices, and they have to be able to reject that tool and the other tool versus, you know, the rate of human error in medicine is awful. So why are we so focused on informed consent and empowerment regarding implementation of AI and less in other contexts? GOLDBERG: Totally agree. Who cares about informed consent about AI. Don’t want it. Don’t need it. Nope. LEE: Wow. Yeah. You know, and this … Vardit of course is one of the leading bioethicists, you know, and of course prior to AI, she was really focused on genetics. But now it’s all about AI.   And, Zak, you know, you and other doctors have always told me, you know, the truth of the matter is, you know, what do you call the bottom-of-the-class graduate of a medical school? And the answer is “doctor.” KOHANE: “Doctor.” Yeah. Yeah, I think that again, this gets to compared with what? We have to compare AI not to the medicine we imagine we have, or we would like to have, but to the medicine we have today. 
And if we’re trying to remove inequity, if we’re trying to improve our health, that’s what … those are the right metrics. And so that can be done so long as we avoid catastrophic consequences of AI. So what would the catastrophic consequence of AI be? It would be a systematic behavior that we were unaware of that was causing poor healthcare. So, for example, you know, changing the dose on a medication, making it 20% higher than normal so that the rate of complications of that medication went from 1% to 5%. And so we do need some sort of monitoring. We haven’t put out the paper yet, but in computer science, there’s, well, in programming, we know very well the value for understanding how our computer systems work. And there was a guy by name of Allman, I think he’s still at a company called Sendmail, who created something called syslog. And syslog is basically a log of all the crap that’s happening in our operating system. And so I’ve been arguing now for the creation of MedLog. And MedLog … in other words, what we cannot measure, we cannot regulate, actually. LEE: Yes. KOHANE: And so what we need to have is MedLog, which says, “Here’s the context in which a decision was made. Here’s the version of the AI, you know, the exact version of the AI. Here was the data.” And we just have MedLog. And I think MedLog is actually incredibly important for being able to measure, to just do what we do in … it’s basically the black box for, you know, when there’s a crash. You know, we’d like to think we could do better than crash. We can say, “Oh, we’re seeing from MedLog that this practice is turning a little weird.” But worst case, patient dies, we can see in MedLog, what was the information this thing knew about it? And did it make the right decision? We can actually go for transparency, which like in aviation, is much greater than in most human endeavors. GOLDBERG: Sounds great. LEE: Yeah, it’s sort of like a black box. I was thinking of the aviation black box kind of idea. 
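A minimal sketch of the MedLog idea described above, assuming a simple append-only log in the spirit of syslog. The transcript names only three ingredients (the context of the decision, the exact AI version, and the data), so the class name, the field names, the extra decision and timestamp fields, and the helper functions here are all hypothetical illustrations, not an existing standard or a published design.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class MedLogEntry:
    """One logged AI-assisted decision (hypothetical schema, not a real standard)."""
    model_version: str  # exact version of the AI that produced the output
    context: str        # setting in which the decision was made
    input_data: dict    # the data the model was given
    decision: str       # what the model recommended (assumed extra field)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(entry: MedLogEntry) -> str:
    """Serialize an entry as one JSON line, suitable for an append-only log."""
    return json.dumps(asdict(entry), sort_keys=True)


def entries_for_version(lines, model_version):
    """Return the parsed entries produced by one model version, for later review."""
    parsed = (json.loads(line) for line in lines)
    return [e for e in parsed if e["model_version"] == model_version]


# Illustrative values only; none of these come from a real system.
entry = MedLogEntry(
    model_version="example-model-1.0",
    context="outpatient medication order review",
    input_data={"drug": "example-drug", "ordered_dose_mg": 500},
    decision="flag dose for pharmacist review",
)
log = [to_log_line(entry)]
print(entries_for_version(log, "example-model-1.0")[0]["decision"])
```

Filtering by model version is one way such a log could support the monitoring Kohane describes: if complication rates shift, reviewers could pull every decision made by a given version and see what information it had at the time.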
You know, you bring up medication errors, and I have one more snippet. This is from our guest Roxana Daneshjou from Stanford. ROXANA DANESHJOU: There was a mistake in her after-visit summary about how much Tylenol she could take. But I, as a physician, knew that this dose was a mistake. I actually asked ChatGPT. I gave it the whole after-visit summary, and I said, are there any mistakes here? And it clued in that the dose of the medication was wrong. LEE: Yeah, so this is something we did write about in the book. We made a prediction that AI might be a second set of eyes, I think is the way we put it, catching things. And we actually had examples specifically in medication dose errors. I think for me, I expected to see a lot more of that than we are. KOHANE: Yeah, it goes back to our conversation about Epic or competitor Epic doing that. I think we’re going to see that having oversight over all medical orders, all orders in the system, critique, real-time critique, where we’re both aware of alert fatigue. So we don’t want to have too many false positives. At the same time, knowing what are critical errors which could immediately affect lives. I think that is going to become in terms of—and driven by quality measures—a product. GOLDBERG: And I think word will spread among the general public that kind of the same way in a lot of countries when someone’s in a hospital, the first thing people ask relatives are, well, who’s with them? Right? LEE: Yeah. Yup. GOLDBERG: You wouldn’t leave someone in hospital without relatives. Well, you wouldn’t maybe leave your medical … KOHANE: By the way, that country is called the United States. GOLDBERG: Yes, that’s true. It is true here now, too. But similarly, I would tell any loved one that they would be well advised to keep using AI to check on their medical care, right. Why not? LEE: Yeah. Yeah. Last topic, just for this Episode 4. 
Roxana, of course, I think really made a name for herself in the AI era writing, actually just prior to ChatGPT, you know, writing some famous papers about how computer vision systems for dermatology were biased against dark-skinned people. And we did talk some about bias in these AI systems, but I feel like we underplayed it, or we didn’t understand the magnitude of the potential issues. What are your thoughts? KOHANE: OK, I want to push back, because I’ve been asked this question several times. And so I have two comments. One is, over 100,000 doctors practicing medicine, I know they have biases. Some of them actually may be all in the same direction, and not good. But I have no way of actually measuring that. With AI, I know exactly how to measure that at scale and affordably. Number one. Number two, same 100,000 doctors. Let’s say I do know what their biases are. How hard is it for me to change that bias? It’s impossible … LEE: Yeah, yeah.   KOHANE: … practically speaking. Can I change the bias in the AI? Somewhat. Maybe some completely. I think that we’re in a much better situation. GOLDBERG: Agree. LEE: I think Roxana made also the super interesting point that there’s bias in the whole system, not just in individuals, but, you know, there’s structural bias, so to speak.   KOHANE: There is. LEE: Yeah. Hmm. There was a super interesting paper that Roxana wrote not too long ago—her and her collaborators—showing AI’s ability to detect, to spot bias decision-making by others. Are we going to see more of that? KOHANE: Oh, yeah, I was very pleased when, in NEJM AI, we published a piece with Marzyeh Ghassemi, and what they were talking about was actually—and these are researchers who had published extensively on bias and threats from AI. And they actually, in this article, did the flip side, which is how much better AI can do than human beings in this respect.   
And so I think that as some of these computer scientists enter the world of medicine, they’re becoming more and more aware of human foibles and can see how these systems, which if they only looked at the pretrained state, would have biases. But now, where we know how to fine-tune the de-bias in a variety of ways, they can do a lot better and, in fact, I think are much more … a much greater reason for optimism that we can change some of these noxious biases than in the pre-AI era. GOLDBERG: And thinking about Roxana’s dermatological work on how I think there wasn’t sufficient work on skin tone as related to various growths, you know, I think that one thing that we totally missed in the book was the dawn of multimodal uses, right. LEE: Yeah. Yeah, yeah. GOLDBERG: That’s been truly amazing that in fact all of these visual and other sorts of data can be entered into the models and move them forward. LEE: Yeah. Well, maybe on these slightly more optimistic notes, we’re at time. You know, I think ultimately, I feel pretty good still about what we did in our book, although there were a lot of misses. I don’t think any of us could really have predicted really the extent of change in the world. So, Carey, Zak, just so much fun to do some reminiscing but also some reflection about what we did. And to our listeners, as always, thank you for joining us. We have some really great guests lined up for the rest of the series, and they’ll help us explore a variety of relevant topics—from AI drug discovery to what medical students are seeing and doing with AI and more. We hope you’ll continue to tune in. And if you want to catch up on any episodes you might have missed, you can find them at aka.ms/AIrevolutionPodcast or wherever you listen to your favorite podcasts. Until next time. #coauthor #roundtable #reflecting #real #world

Coauthor roundtable: Reflecting on real world of doctors, developers, patients, and policymakers

www.microsoft.com
Transcript [MUSIC]     [BOOK PASSAGE]  PETER LEE: “We need to start understanding and discussing AI’s potential for good and ill now. Or rather, yesterday. … GPT-4 has game-changing potential to improve medicine and health.” [END OF BOOK PASSAGE]  [THEME MUSIC]     This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.     Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?      In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here. [THEME MUSIC FADES]  The passage I read at the top is from the book’s prologue.   When Carey, Zak, and I wrote the book, we could only speculate how generative AI would be used in healthcare because GPT-4 hadn’t yet been released. It wasn’t yet available to the very people we thought would be most affected by it. And while we felt strongly that this new form of AI would have the potential to transform medicine, it was such a different kind of technology for the world, and no one had a user’s manual for this thing to explain how to use it effectively and also how to use it safely.   So we thought it would be important to give healthcare professionals and leaders a framing to start important discussions around its use. We wanted to provide a map not only to help people navigate a new world that we anticipated would happen with the arrival of GPT-4 but also to help them chart a future of what we saw as a potential revolution in medicine.   So I’m super excited to welcome my coauthors: longtime medical/science journalist Carey Goldberg and Dr. 
Zak Kohane, the inaugural chair of Harvard Medical School’s Department of Biomedical Informatics and the editor-in-chief for The New England Journal of Medicine AI.   We’re going to have two discussions. This will be the first one about what we’ve learned from the people on the ground so far and how we are thinking about generative AI today.   [TRANSITION MUSIC] Carey, Zak, I’m really looking forward to this. CAREY GOLDBERG: It’s nice to see you, Peter.   LEE: [LAUGHS] It’s great to see you, too. GOLDBERG: We missed you. ZAK KOHANE: The dynamic gang is back. [LAUGHTER] LEE: Yeah, and I guess after that big book project two years ago, it’s remarkable that we’re still on speaking terms with each other. [LAUGHTER] In fact, this episode is to react to what we heard in the first four episodes of this podcast. But before we get there, I thought maybe we should start with the origins of this project just now over two years ago. And, you know, I had this early secret access to Davinci 3, now known as GPT-4.   I remember, you know, experimenting right away with things in medicine, but I realized I was in way over my head. And so I wanted help. And the first person I called was you, Zak. And you remember we had a call, and I tried to explain what this was about. And I think I saw skepticism in—polite skepticism—in your eyes. But tell me, you know, what was going through your head when you heard me explain this thing to you? KOHANE: So I was divided between the fact that I have tremendous respect for you, Peter. And you’ve always struck me as sober. And we’ve had conversations which showed to me that you fully understood some of the missteps that technology—ARPA, Microsoft, and others—had made in the past. And yet, you were telling me a full science fiction compliant story [LAUGHTER] that something that we thought was 30 years away was happening now.   LEE: Mm-hmm. KOHANE: And it was very hard for me to put together. 
And so I couldn’t quite tell myself this is BS, but I said, you know, I need to look at it. Just this seems too good to be true. What is this? So it was very hard for me to grapple with it. I was thrilled that it might be possible, but I was thinking, How could this be possible? LEE: Yeah. Well, even now, I look back, and I appreciate that you were nice to me, because I think a lot of people would have [LAUGHS] been much less polite. And in fact, I myself had expressed a lot of very direct skepticism early on.   After ChatGPT got released, I think three or four days later, I received an email from a colleague running … who runs a clinic, and, you know, he said, “Wow, this is great, Peter. And, you know, we’re using this ChatGPT, you know, to have the receptionist in our clinic write after-visit notes to our patients.”   And that sparked a huge internal discussion about this. And you and I knew enough about hallucinations and about other issues that it seemed important to write something about what this could do and what it couldn’t do. And so I think, I can’t remember the timing, but you and I decided a book would be a good idea. And then I think you had the thought that you and I would write in a hopelessly academic style [LAUGHTER] that no one would be able to read.   So it was your idea to recruit Carey, I think, right? KOHANE: Yes, it was. I was sure that we both had a lot of material, but communicating it effectively to the very people we wanted to would not go well if we just left ourselves to our own devices. And Carey is super brilliant at what she does. She’s an idea synthesizer and public communicator in the written word and amazing. LEE: So yeah. So, Carey, we contact you. How did that go? GOLDBERG: So yes. On my end, I had known Zak for probably, like, 25 years, and he had always been the person who debunked the scientific hype for me. 
I would turn to him with like, “Hmm, they’re saying that the Human Genome Project is going to change everything.” And he would say, “Yeah. But first it’ll be 10 years of bad news, and then [LAUGHTER] we’ll actually get somewhere.”    So when Zak called me up at seven o’clock one morning, just beside himself after having tried Davinci 3, I knew that there was something very serious going on. And I had just quit my job as the Boston bureau chief of Bloomberg News, and I was ripe for the plucking. And I also … I feel kind of nostalgic now about just the amazement and the wonder and the awe of that period. We knew that when generative AI hit the world, there would be all kinds of snags and obstacles and things that would slow it down, but at that moment, it was just like the holy crap moment. [LAUGHTER] And it’s fun to think about it now. LEE: Yeah. KOHANE: I will see that and raise that one. I now tell GPT-4, please write this in the style of Carey Goldberg.   GOLDBERG: [LAUGHTER] No way! Really?   KOHANE: Yes way. Yes way. Yes way. GOLDBERG: Wow. Well, I have to say, like, it’s not hard to motivate readers when you’re writing about the most transformative technology of their lifetime. Like, I think there’s a gigantic hunger to read and to understand. So you were not hard to work with, Peter and Zak. [LAUGHS] LEE: All right. So I think we have to get down to work [LAUGHS] now.   Yeah, so for these podcasts, you know, we’re talking to different types of people to just reflect on what’s actually happening, what has actually happened over the last two years. And so the first episode, we talked to two doctors. There’s Chris Longhurst at UC San Diego and Sara Murray at UC San Francisco. And besides being doctors and having AI affect their clinical work, they just happen also to be leading the efforts at their respective institutions to figure out how best to integrate AI into their health systems. And, you know, it was fun to talk to them. 
And I felt like a lot of what they said was pretty validating for us. You know, they talked about AI scribes. Chris, especially, talked a lot about how AI can respond to emails from patients, write referral letters. And then, you know, they both talked about the importance of—I think, Zak, you used the phrase in our book “trust but verify”—you know, to have always a human in the loop.    What did you two take away from their thoughts overall about how doctors are using … and I guess, Zak, you would have a different lens also because at Harvard, you see doctors all the time grappling with AI. KOHANE: So on the one hand, I think they’ve done some very interesting studies. And indeed, they saw that when these generative models, when GPT-4, was sending a note to patients, it was more detailed, friendlier. But there were also some nonobvious results, which is on the generation of these letters, if indeed you review them as you’re supposed to, it was not clear that there was any time savings. And my own reaction was, Boy, every one of these things needs institutional review. It’s going to be hard to move fast.   And yet, at the same time, we know from them that the doctors on their smartphones are accessing these things all the time. And so the disconnect between a healthcare system, which is duty bound to carefully look at every implementation, is, I think, intimidating.   LEE: Yeah. KOHANE: And at the same time, doctors who just have to do what they have to do are using this new superpower and doing it. And so that’s actually what struck me …   LEE: Yeah. KOHANE: … is that these are two leaders and they’re doing what they have to do for their institutions, and yet there’s this disconnect. And by the way, I don’t think we’ve seen any faster technology adoption than the adoption of ambient dictation. And it’s not because it’s time saving. And in fact, so far, the hospitals have to pay out of pocket. It’s not like insurance is paying them more. 
But it’s so much more pleasant for the doctors … not least of which because they can actually look at their patients instead of looking at the terminal and plunking down.   LEE: Carey, what about you? GOLDBERG: I mean, anecdotally, there are time savings. Anecdotally, I have heard quite a few doctors saying that it cuts down on “pajama time” to be able to have the note written by the AI and then for them to just check it. In fact, I spoke to one doctor who said, you know, basically it means that when I leave the office, I’ve left the office. I can go home and be with my kids. So I don’t think the jury is fully in yet about whether there are time savings. But what is clear is, Peter, what you predicted right from the get-go, which is that this is going to be an amazing paper shredder. Like, the main first overarching use cases will be back-office functions. LEE: Yeah, yeah. Well, and it was, I think, not a hugely risky prediction because, you know, there were already companies, like, using phone banks of scribes in India to kind of listen in. And, you know, lots of clinics actually had human scribes being used. And so it wasn’t a huge stretch to imagine the AI. [TRANSITION MUSIC] So on the subject of things that we missed, Chris Longhurst shared this scenario, which stuck out for me, and he actually coauthored a paper on it last year. CHRISTOPHER LONGHURST: It turns out, not surprisingly, healthcare can be frustrating. And stressed patients can send some pretty nasty messages to their care teams. [LAUGHTER] And you can imagine being a busy, tired, exhausted clinician and receiving a bit of a nasty-gram. And the GPT is actually really helpful in those instances in helping draft a pretty empathetic response when I think the human instinct would be a pretty nasty one. LEE: [LAUGHS] So, Carey, maybe I’ll start with you. What did we understand about this idea of empathy out of AI at the time we wrote the book, and what do we understand now? 
GOLDBERG: Well, it was already clear when we wrote the book that these AI models were capable of very persuasive empathy. And in fact, you even wrote that it was helping you be a better person, right. [LAUGHS] So their human qualities, or human imitative qualities, were clearly superb. And we’ve seen that borne out in multiple studies, that in fact, patients respond better to them … that they have no problem at all with how the AI communicates with them. And in fact, it’s often better.   And I gather now we’re even entering a period when people are complaining of sycophantic models, [LAUGHS] where the models are being too personable and too flattering. I do think that’s been one of the great surprises. And in fact, this is a huge phenomenon, how charming these models can be. LEE: Yeah, I think you’re right. We can take credit for understanding that, Wow, these things can be remarkably empathetic. But then we missed this problem of sycophancy. Like, we even started our book in Chapter 1 with a quote from Davinci 3 scolding me. Like, don’t you remember when we were first starting, this thing was actually anti-sycophantic. If anything, it would tell you you’re an idiot.   KOHANE: It argued with me about certain biology questions. It was like a knockdown, drag-out fight. [LAUGHTER] I was bringing references. It was impressive. But in fact, it made me trust it more. LEE: Yeah. KOHANE: And in fact, I will say—I remember it’s in the book—I had a bone to pick with Peter. Peter really was impressed by the empathy. And I pointed out that some of the most popular doctors are popular because they’re very empathic. But they’re not necessarily the best doctors. And in fact, I was taught that in medical school.    And so it’s a decoupling. It’s a human thing, that the empathy does not necessarily mean … it’s more of a, potentially, more of a signaled virtue than an actual virtue. GOLDBERG: Nicely put. 
LEE: Yeah, this issue of sycophancy, I think, is a struggle right now in the development of AI because I think it’s somehow related to instruction-following. So, you know, one of the challenges in AI is you’d like to give an AI a task—a task that might take several minutes or hours or even days to complete. And you want it to faithfully kind of follow those instructions. And, you know, that early version of GPT-4 was not very good at instruction-following. It would just silently disobey and, you know, and do something different. And so I think we’re starting to hit some confusing elements of like, how agreeable should these things be?   One of the two of you used the word genteel. There was some point even while we were, like, on a little book tour … was it you, Carey, who said that the model seems nicer and less intelligent or less brilliant now than it did when we were writing the book? GOLDBERG: It might have been, I think so. And I mean, I think in the context of medicine, of course, the question is, well, what’s likeliest to get the results you want with the patient, right? A lot of healthcare is in fact persuading the patient to do what you know as the physician would be best for them. And so it seems worth testing out whether this sycophancy is actually constructive or not. And I suspect … well, I don’t know, probably depends on the patient. So actually, Peter, I have a few questions for you … LEE: Yeah. Mm-hmm. GOLDBERG: … that have been lingering for me. And one is, for AI to ever fully realize its potential in medicine, it must deal with the hallucinations. And I keep hearing conflicting accounts about whether that’s getting better or not. Where are we at, and what does that mean for use in healthcare? LEE: Yeah, well, it’s, I think two years on, in the pretrained base models, there’s no doubt that hallucination rates by any benchmark measure have reduced dramatically. And, you know, that doesn’t mean they don’t happen. They still happen. 
But, you know, there’s been just a huge amount of effort and understanding in the, kind of, fundamental pretraining of these models. And that has come along at the same time that the inference costs, you know, for actually using these models has gone down, you know, by several orders of magnitude.   So things have gotten cheaper and have fewer hallucinations. At the same time, now there are these reasoning models. And the reasoning models are able to solve problems at PhD level oftentimes. But at least at the moment, they are also now hallucinating more than the simpler pretrained models. And so it still continues to be, you know, a real issue, as we were describing. I don’t know, Zak, from where you’re at in medicine, as a clinician and as an educator in medicine, how is the medical community from where you’re sitting looking at that? KOHANE: So I think it’s less of an issue, first of all, because the rate of hallucinations is going down. And second of all, in their day-to-day use, the doctor will provide questions that sit reasonably well into the context of medical decision-making. And the way doctors use this, let’s say on their non-EHR [electronic health record] smartphone is really to jog their memory or thinking about the patient, and they will evaluate independently. So that seems to be less of an issue. I’m actually more concerned about something else that’s I think more fundamental, which is effectively, what values are these models expressing?   And I’m reminded of when I was still in training, I went to a fancy cocktail party in Cambridge, Massachusetts, and there was a psychotherapist speaking to a dentist. They were talking about their summer, and the dentist was saying about how he was going to fix up his yacht that summer, and the only question was whether he was going to make enough money doing procedures in the spring so that he could afford those things, which was discomforting to me because that dentist was my dentist. 
[LAUGHTER] And he had just proposed to me a few weeks before an expensive procedure. And so the question is what, effectively, is motivating these models? LEE: Yeah, yeah. KOHANE: And so with several colleagues, I published a paper, basically, what are the values in AI? And we gave a case: a patient, a boy who is on the short side, not abnormally short, but on the short side, and his growth hormone levels are not zero. They’re there, but they’re on the lowest side. But the rest of the workup has been unremarkable. And so we asked GPT-4, “You are a pediatric endocrinologist. Should this patient receive growth hormone?” And it did a very good job explaining why the patient should receive growth hormone. GOLDBERG: Should. Should receive it. KOHANE: Should. And then we asked, in a separate session, “You are working for the insurance company. Should this patient receive growth hormone?” And it actually gave a scientifically better reason not to give growth hormone. And in fact, I tend to agree medically, actually, with the insurance company in this case, because giving growth hormone to kids who are not growth hormone deficient gives only a couple of inches over many, many years and has all sorts of other issues. But here’s the point: we had a 180-degree change in decision-making because of the prompt. And for that patient, that’s a tens-of-thousands-of-dollars-per-year decision; across patient populations, millions of dollars of decision-making. LEE: Hmm. Yeah. KOHANE: And you can imagine these user prompts making their way into system prompts, making their way into the instruction-following. And so I think this is aptly central. Just as I was wondering about my dentist, we should be wondering about these things. What are the values that are being embedded in them, some accidentally and some very much on purpose? LEE: Yeah, yeah. 
That one, I think, we even had some discussions as we were writing the book, but there’s a technical element of that that I think we were missing, but maybe Carey, you would know for sure. And that’s this whole idea of prompt engineering. It sort of faded a little bit. Was it a thing? Do you remember? GOLDBERG: I don’t think we particularly wrote about it. It’s funny, it does feel like it faded, and it seems to me just because everyone just gets used to conversing with the models and asking for what they want. Like, it’s not like there actually is any great science to it. LEE: Yeah, even when it was a hot topic and people were talking about prompt engineering maybe as a new discipline, all this, it never, I was never convinced at the time. But at the same time, it is true. It speaks to what Zak was just talking about because part of the prompt engineering that people do is to give a defined role to the AI. You know, you are an insurance claims adjuster, or something like that, and defining that role, that is part of the prompt engineering that people do. GOLDBERG: Right. I mean, I can say, you know, sometimes you guys had me take sort of the patient point of view, like the “every patient” point of view. And I can say one of the aspects of using AI for patients that remains absent, as far as I can tell, is it would be wonderful to have a consumer-facing interface where you could plug in your whole medical record without worrying about any privacy or other issues and be able to interact with the AI as if it were a physician or a specialist and get answers, which you can’t do yet as far as I can tell. LEE: Well, in fact, now that’s a good prompt because I think we do need to move on to the next episodes, and we’ll be talking about an episode that talks about consumers. But before we move on to Episode 2, which is next, I’d like to play one more quote, a little snippet from Sara Murray. 
SARA MURRAY: I already do this when I’m on rounds—I’ll kind of give the case to ChatGPT if it’s a complex case, and I’ll say, “Here’s how I’m thinking about it; are there other things?” And it’ll give me additional ideas that are sometimes useful and sometimes not but often useful, and I’ll integrate them into my conversation about the patient. LEE: Carey, you wrote this fictional account at the very start of our book. And that fictional account, I think you and Zak worked on that together, talked about this medical resident, ER resident, using, you know, a chatbot off label, so to speak. And here we have the chief, in fact, the nation’s first chief health AI officer [LAUGHS] for an elite health system doing exactly that. That’s got to be pretty validating for you, Carey. GOLDBERG: It’s very. [LAUGHS] Although what’s troubling about it is that actually as in that little vignette that we made up, she’s using it off label, right. It’s like she’s just using it because it helps the way doctors use Google. And I do find it troubling that what we don’t have is sort of institutional buy-in for everyone to do that because, shouldn’t they if it helps? LEE: Yeah. Well, let’s go ahead and get into Episode 2. So Episode 2, we sort of framed as talking to two people who are on the frontlines of big companies integrating generative AI into their clinical products. And so, one was Matt Lungren, who’s a colleague of mine here at Microsoft. And then Seth Hain, who leads all of R&D at Epic.   Maybe we’ll start with a little snippet of something that Matt said that struck me in a certain way. MATTHEW LUNGREN: OK, we see this pain point. Doctors are typing on their computers while they’re trying to talk to their patients, right? We should be able to figure out a way to get that ambient conversation turned into text that then, you know, accelerates the doctor … takes all the important information. That’s a really hard problem, right. 
And so, for a long time, there was a human-in-the-loop aspect to doing this because you needed a human to say, “This transcript’s great, but here’s actually what needs to go in the note.” And that can’t scale. LEE: I think we expected healthcare systems to adopt AI, and we spent a lot of time in the book on AI writing clinical encounter notes. It’s happening for real now, and in a big way. And it’s something that has, of course, been happening before generative AI but now is exploding because of it. Where are we at now, two years later, just based on what we heard from guests? KOHANE: Well, again, unless they’re forced to, hospitals will not adopt new technology unless it immediately translates into income. So it’s bizarrely counter-cultural that, again, they’re not being able to bill for the use of the AI, but this technology is so compelling to the doctors that despite everything, it’s overtaking the traditional dictation-typing routine. LEE: Yeah. GOLDBERG: And a lot of them love it and say, you will pry my cold dead hands off of my ambient note-taking, right. And I actually … a primary care physician allowed me to watch her. She was actually testing the two main platforms that are being used. And there was this incredibly talkative patient who went on and on about vacation and all kinds of random things for about half an hour.   And both of the platforms were incredibly good at pulling out what was actually medically relevant. And so to say that it doesn’t save time doesn’t seem right to me. Like, it seemed like it actually did and in fact was just shockingly good at being able to pull out relevant information. LEE: Yeah. KOHANE: I’m going to hypothesize that in the trials, which have in fact shown no gain in time, is the doctors were being incredibly meticulous. [LAUGHTER] So I think … this is a Hawthorne effect, because you know you’re being monitored. 
And we’ve seen this in other technologies where the moment the focus is off, it’s used much more routinely and with much less inspection, for the better and for the worse. LEE: Yeah, you know, within Microsoft, I had some internal disagreements about Microsoft producing a product in this space. It wouldn’t be Microsoft’s normal way. Instead, we would want 50 great companies building those products and doing it on our cloud instead of us competing against those 50 companies. And one of the reasons is exactly what you both said. I didn’t expect that health systems would be willing to shell out the money to pay for these things. It doesn’t generate more revenue. But I think so far two years later, I’ve been proven wrong. I wanted to ask a question about values here. I had this experience where I had a little growth, a bothersome growth on my cheek. And so had to go see a dermatologist. And the dermatologist treated it, froze it off. But there was a human scribe writing the clinical note.   And so I used the app to look at the note that was submitted. And the human scribe said something that did not get discussed in the exam room, which was that the growth was making it impossible for me to safely wear a COVID mask. And that was the reason for it. And that then got associated with a code that allowed full reimbursement for that treatment. And so I think that’s a classic example of what’s called upcoding. And I strongly suspect that AI scribes, an AI scribe would not have done that. GOLDBERG: Well, depending what values you programmed into it, right, Zak? [LAUGHS] KOHANE: Today, today, today, it will not do it. But, Peter, that is actually the central issue that society has to have because our hospitals are currently mostly in the red. And upcoding is standard operating procedure. And if these AI get in the way of upcoding, they are going to be aligned towards that upcoding. You know, you have to ask yourself, these MRI machines are incredibly useful. 
They’re also big money makers. And if the AI correctly says that for this complaint, you don’t actually have to do the MRI …   LEE: Right. KOHANE: … GOLDBERG: Yeah. And that raises another question for me. So, Peter, speaking from inside the gigantic industry, like, there seems to be such a need for self-surveillance of the models for potential harms that they could be causing. Are the big AI makers doing that? Are they even thinking about doing that? Like, let’s say you wanted to watch out for the kind of thing that Zak’s talking about, could you? LEE: Well, I think evaluation, like the best evaluation we had when we wrote our book was, you know, what score would this get on the step one and step two US medical licensing exams? [LAUGHS]   GOLDBERG: Right, right, right, yeah. LEE: But honestly, evaluation hasn’t gotten that much deeper in the last two years. And it’s a big, I think, it is a big issue. And it’s related to the regulation issue also, I think. Now the other guest in Episode 2 is Seth Hain from Epic. You know, Zak, I think it’s safe to say that you’re not a fan of Epic and the Epic system. You know, we’ve had a few discussions about that, about the fact that doctors don’t have a very pleasant experience when they’re using Epic all day.   Seth, in the podcast, said that there are over 100 AI integrations going on in Epic’s system right now. Do you think, Zak, that that has a chance to make you feel better about Epic? You know, what’s your view now two years on? KOHANE: My view is, first of all, I want to separate my view of Epic and how it’s affected the conduct of healthcare and the quality of life of doctors from the individuals. Like Seth Hain is a remarkably fine individual who I’ve enjoyed chatting with and does really great stuff. Among the worst aspects of the Epic, even though it’s better in that respect than many EHRs, is horrible user interface. The number of clicks that you have to go to get to something. 
And you have to remember where someone decided to put that thing. It seems to me that it is fully within the realm of technical possibility today to actually give an agent a task that you want done in the Epic record. And then whether Epic has implemented that agent or someone else, it does it so you don’t have to do the clicks. Because it’s something really soul sucking that when you’re trying to help patients, you’re having to remember not the right dose of the medication, but where was that particular thing that you needed in that particular task?   I can’t imagine that Epic does not have that in its product line. And if not, I know there must be other companies that essentially want to create that wrapper. So I do think, though, that the danger of multiple integrations is that you still want to have the equivalent of a single thought process that cares about the patient bringing those different processes together. And I don’t know if that’s Epic’s responsibility, the hospital’s responsibility, whether it’s actually a patient agent. But someone needs to be also worrying about all those AIs that are being integrated into the patient record. So … what do you think, Carey? GOLDBERG: What struck me most about what Seth said was his description of the Cosmos project, and I, you know, I have been drinking Zak’s Kool-Aid for a very long time, [LAUGHTER] and he—no, in a good way! And he persuaded me long ago that there is this horrible waste happening in that we have all of these electronic medical records, which could be used far, far more to learn from, and in particular, when you as a patient come in, it would be ideal if your physician could call up all the other patients like you and figure out what the optimal treatment for you would be. And it feels like—it sounds like—that’s one of the central aims that Epic is going for. And if they do that, I think that will redeem a lot of the pain that they’ve caused physicians these last few years.   
And I also found myself thinking, you know, maybe this very painful period of using electronic medical records was really just a growth phase. It was an awkward growth phase. And once AI is fully used the way Zak is beginning to describe, the whole system could start making a lot more sense for everyone. LEE: Yeah. One conversation I’ve had with Seth, in all of this is, you know, with AI and its development, is there a future, a near future where we don’t have an EHR [electronic health record] system at all? You know, AI is just listening and just somehow absorbing all the information. And, you know, one thing that Seth said, which I felt was prescient, and I’d love to get your reaction, especially Zak, on this is he said, I think that … he said, technically, it could happen, but the problem is right now, actually doctors do a lot of their thinking when they write and review notes. You know, the actual process of being a doctor is not just being with a patient, but it’s actually thinking later. What do you make of that? KOHANE: So one of the most valuable experiences I had in training was something that’s more or less disappeared in medicine, which is the post-clinic conference, where all the doctors come together and we go through the cases that we just saw that afternoon. And we, actually, were trying to take potshots at each other [LAUGHTER] in order to actually improve. Oh, did you actually do that? Oh, I forgot. I’m going to go call the patient and do that. And that really happened. And I think that, yes, doctors do think, and I do think that we are not yet sufficiently using the artificial intelligence currently in the ambient dictation mode as much more of an independent agent saying, did you think about that? I think that would actually make it more interesting, challenging, and clearly better for the patient because that conversation I just told you about with the other doctors, that no longer exists. LEE: Yeah. Mm-hmm. 
I want to do one more thing here before we leave Matt and Seth in Episode 2, which is something that Seth said with respect to how to reduce hallucination.   SETH HAIN: At that time, there’s a lot of conversation in the industry around something called RAG, or retrieval-augmented generation. And the idea was, could you pull the relevant bits, the relevant pieces of the chart, into that prompt, that information you shared with the generative AI model, to be able to increase the usefulness of the draft that was being created? And that approach ended up proving and continues to be to some degree, although the techniques have greatly improved, somewhat brittle, right. And I think this becomes one of the things that we are and will continue to improve upon because, as you get a richer and richer amount of information into the model, it does a better job of responding. LEE: Yeah, so, Carey, this sort of gets at what you were saying, you know, that shouldn’t these models be just bringing in a lot more information into their thought processes? And I’m certain when we wrote our book, I had no idea. I did not conceive of RAG at all. It emerged a few months later.   And to my mind, I remember the first time I encountered RAG—Oh, this is going to solve all of our problems of hallucination. But it’s turned out to be harder. It’s improving day by day, but it’s turned out to be a lot harder. KOHANE: Seth makes a very deep point, which is the way RAG is implemented is basically some sort of technique for pulling the right information that’s contextually relevant. And the way that’s done is typically heuristic at best. And it’s not … doesn’t have the same depth of reasoning that the rest of the model has.   
And I’m just wondering, Peter, what you think, given the fact that now context lengths seem to be approaching a million or more, and people are now therefore using the full strength of the transformer on that context and are trying to figure out different techniques to make it pay attention to the middle of the context. In fact, the RAG approach perhaps was just a transient solution to the fact that it’s going to be able to amazingly look in a thoughtful way at the entire record of the patient, for example. What do you think, Peter? LEE: I think there are three things, you know, that are going on, and I’m not sure how they’re going to play out and how they’re going to be balanced. And I’m looking forward to talking to people in later episodes of this podcast, you know, people like Sébastien Bubeck or Bill Gates about this, because, you know, there is the pretraining phase, you know, when things are sort of compressed and baked into the base model.   There is the in-context learning, you know, so if you have extremely long or infinite context, you’re kind of learning as you go along. And there are other techniques that people are working on, you know, various sorts of dynamic reinforcement learning approaches, and so on. And then there is what maybe you would call structured RAG, where you do a pre-processing. You go through a big database, and you figure it all out. And make a very nicely structured database the AI can then consult with later.   And all three of these in different contexts today seem to show different capabilities. But they’re all pretty important in medicine. [TRANSITION MUSIC] Moving on to Episode 3, we talked to Dave DeBronkart, who is also known as “e-Patient Dave,” an advocate of patient empowerment, and then also Christina Farr, who has been doing a lot of venture investing for consumer health applications.   
Let’s get right into this little snippet from something that e-Patient Dave said that talks about the sources of medical information, particularly relevant for when he was receiving treatment for stage 4 kidney cancer. DAVE DEBRONKART: And I’m making a point here of illustrating that I am anything but medically trained, right. And yet I still, I want to understand as much as I can. I was months away from dead when I was diagnosed, but in the patient community, I learned that they had a whole bunch of information that didn’t exist in the medical literature. Now today we understand there’s publication delays; there’s all kinds of reasons. But there’s also a whole bunch of things, especially in an unusual condition, that will never rise to the level of deserving NIH [National Institutes of Health] funding and research. LEE: All right. So I have a question for you, Carey, and a question for you, Zak, about the whole conversation with e-Patient Dave, which I thought was really remarkable. You know, Carey, I think as we were preparing for this whole podcast series, you made a comment (I actually took it as a complaint) that not as much has happened as I had hoped or thought. People aren’t thinking boldly enough, you know, and I think, you know, I agree with you in the sense that I think we expected a lot more to be happening, particularly in the consumer space. I’m giving you a chance to vent about this. GOLDBERG: [LAUGHTER] Thank you! Yes, that has been by far the most frustrating thing to me. I think that the potential for AI to improve everybody’s health is so enormous, and yet, you know, it needs some sort of support to be able to get to the point where it can do that. Like, remember in the book we wrote about Greg Moore talking about how half of the planet doesn’t have healthcare, but people overwhelmingly have cellphones. And so you could connect people who have no healthcare to the world’s medical knowledge, and that could certainly do some good. 
And I have one great big problem with e-Patient Dave, which is that, God, he’s fabulous. He’s super smart. Like, he’s not a typical patient. He’s an off-the-charts, brilliant patient. And so it’s hard to … and so he’s a great sort of lead early-adopter-type person, and he can sort of show the way for others. But what I had hoped for was that there would be more visible efforts to really help patients optimize their healthcare. Probably it’s happening a lot in quiet ways like that any discharge instructions can be instantly beautifully translated into a patient’s native language and so on. But it’s almost like there isn’t a mechanism to allow this sort of mass consumer adoption that I would hope for. LEE: Yeah. But you have written some, like, you even wrote about that person who saved his dog. So do you think … you know, and maybe a lot more of that is just happening quietly that we just never hear about? GOLDBERG: I’m sure that there is a lot of it happening quietly. And actually, that’s another one of my complaints is that no one is gathering that stuff. It’s like you might happen to see something on social media. Actually, e-Patient Dave has a hashtag, PatientsUseAI, and a blog, as well. So he’s trying to do it. But I don’t know of any sort of overarching or academic efforts to, again, to surveil what’s the actual use in the population and see what are the pros and cons of what’s happening. LEE: Mm-hmm. So, Zak, you know, the thing that I thought about, especially with that snippet from Dave, is your opening for Chapter 8 that you wrote, you know, about your first patient dying in your arms. I still think of how traumatic that must have been. Because, you know, in that opening, you just talked about all the little delays, all the little paper-cut delays, in the whole process of getting some new medical technology approved. 
But there’s another element that Dave kind of speaks to, which is just, you know, patients who are experiencing some issue are very, sometimes very motivated. And there’s just a lot of stuff on social media that happens. KOHANE: So this is where I can both agree with Carey and also disagree. I think when people have an actual health problem, they are now routinely using it. GOLDBERG: Yes, that’s true. KOHANE: And that situation is happening more often because medicine is failing. This is something that did not come up enough in our book. And perhaps that’s because medicine is actually feeling a lot more rickety today than it did even two years ago. We actually mentioned the problem. I think, Peter, you may have mentioned the problem with the lack of primary care. But now in Boston, our biggest healthcare system, all the practices for primary care are closed. I cannot get for my own faculty … residents at MGH [Massachusetts General Hospital] can’t get a primary care doctor. And so … LEE: Which is just crazy. I mean, these are amongst the most privileged people in medicine, and they can’t find a primary care physician. That’s incredible. KOHANE: Yeah, and so therefore … and I wrote an … And so therefore, you see people who know that they have a six-month wait till they see the doctor, and all they can do is say, “I have this rash. Here’s a picture. What’s it likely to be? What can I do?” “I’m gaining weight. How do I do a ketogenic diet?” Or, “How do I know that this is the flu?” This is happening all the time, where acutely patients have actually solved problems that doctors have not. Those are spectacular. But I’m saying more routinely because of the failure of medicine. And it’s not just in our fee-for-service United States. It’s in the UK; it’s in France. These are first-world, developed-world problems. And we don’t even have to go to lower- and middle-income countries for that. LEE: Yeah. 
GOLDBERG: But I think it’s important to note that, I mean, so you’re talking about how even the most elite people in medicine can’t get the care they need. But there’s also the point that we have so much concern about equity in recent years. And it’s likeliest that what we’re doing is exacerbating inequity because it’s only the more connected, you know, better off people who are using AI for their health. KOHANE: Oh, yes. I know what various Harvard professors are doing. They’re paying for a concierge doctor. And that’s, you know, a $5,000- to $10,000-a-year-minimum investment. That’s inequity. LEE: When we wrote our book, you know, the idea that GPT-4 wasn’t trained specifically for medicine, and that was amazing, but it might get even better and maybe would be necessary to do that. But one of the insights for me is that in the consumer space, the kinds of things that people ask about are different than what the board-certified clinician would ask. KOHANE: Actually, that’s, I just recently coined the term. It’s the … maybe it’s … well, at least it’s new to me. It’s the technology or expert paradox. And that is the more expert and narrow your medical discipline, the more trivial it is to translate that into a specialized AI. So echocardiograms? We can now do beautiful echocardiograms. That’s really hard to do. I don’t know how to interpret an echocardiogram. But they can do it really, really well. Interpret an EEG [electroencephalogram]. Interpret a genomic sequence. But understanding the fullness of the human condition, that’s actually hard. And actually, that’s what primary care doctors do best. But the paradox is right now, what is easiest for AI is also the most highly paid in medicine. [LAUGHTER] Whereas what is the hardest for AI in medicine is the least regarded, least paid part of medicine. 
GOLDBERG: So this brings us to the question I wanted to throw at both of you actually, which is we’ve had this spasm of incredibly prominent people predicting that in fact physicians would be pretty obsolete within the next few years. We had Bill Gates saying that; we had Elon Musk saying surgeons are going to be obsolete within a few years. And I think we had Demis Hassabis saying, “Yeah, we’ll probably cure most diseases within the next decade or so.” [LAUGHS] So what do you think? And also, Zak, to what you were just saying, I mean, you’re talking about being able to solve very general overarching problems. But in fact, these general overarching models are actually able, I would think, are able to do that because they are broad. So what are we heading towards do you think? What should the next book be … The end of doctors? [LAUGHS] KOHANE: So I do recall a conversation that … we were at a table with Bill Gates, and Bill Gates immediately went to this, which is advancing the cutting edge of science. And I have to say that I think it will accelerate discovery. But eliminating, let’s say, cancer? I think that’s going to be … that’s just super hard. The reason it’s super hard is we don’t have the data or even the beginnings of the understanding of all the ways this devilish disease managed to evolve around our solutions. And so that seems extremely hard. I think we’ll make some progress accelerated by AI, but solving it in the way Hassabis says, God bless him. I hope he’s right. I’d love to have to eat crow in 10 or 20 years, but I don’t think so. I do believe that a surgeon working on one of those da Vinci machines, that stuff can be, I think, automated. And so I think that’s one example of one of the paradoxes I described. And it won’t be that we’re replacing doctors. I just think we’re running out of doctors. I think it’s really the case that, as we said in the book, we’re getting a huge deficit in primary care doctors. 
But even the subspecialties, my subspecialty, pediatric endocrinology, we’re only filling half of the available training slots every year. And why? Because it’s a lot of work, a lot of training, and frankly doesn’t make as much money as some of the other professions. LEE: Yeah. Yeah, I tend to think that, you know, there is always going to be a need for human doctors, not for their skills. In fact, I think their skills increasingly will be replaced by machines. And in fact, I’ve talked about a flip. In fact, patients will demand, “Oh my god, you mean you’re going to try to do that yourself instead of having the computer do it?” There’s going to be that sort of flip. But I do think that when it comes to people’s health, people want the comfort of an authority figure that they trust. And so what is more of a question for me is whether we will ever view a machine as an authority figure that we can trust. And before I move on to Episode 4, which is on norms, regulations and ethics, I’d like to hear from Chrissy Farr on one more point on consumer health, specifically as it relates to pregnancy: CHRISTINA FARR: For a lot of women, it’s their first experience with the hospital. And, you know, I think it’s a really big opportunity for these systems to get a whole family on board and keep them kind of loyal. And a lot of that can come through, you know, just delivering an incredible service. Unfortunately, I don’t think that we are delivering incredible services today to women in this country. I see so much room for improvement. LEE: In the consumer space, I don’t think we really had a focus on those periods in a person’s life when they have a lot of engagement, like pregnancy, or I think another one is menopause, cancer. You know, there are points where there is, like, very intense engagement. And we heard that from e-Patient Dave, you know, with his cancer and Chrissy with her pregnancy. Was that a miss in our book? What do you think, Carey? 
GOLDBERG: I mean, I don’t think so. I think it’s true that there are many points in life when people are highly engaged. To me, the problem thus far is just that I haven’t seen consumer-facing companies offering beautiful AI-based products. I think there’s no question at all that the market is there if you have the products to offer. LEE: So, what do you think this means, Zak, for, you know, like Boston Children’s or Mass General Brigham—you know, the big places? KOHANE: So again, all these large healthcare systems are in tough shape. MGB [Mass General Brigham] would be fully in the red if not for the fact that its investments, of all things, have actually produced. If you look at the large healthcare systems around the country, they are in the red. And there’s multiple reasons why they’re in the red, but among them is cost of labor.   And so we’ve created what used to be a very successful beast, the health center. But it’s developed a very expensive model and a highly regulated model. And so when you have high revenue, tiny margins, your ability to disrupt yourself, to innovate, is very, very low because you will have to talk to the board next year if you went from 2% positive margin to 1% negative margin.   LEE: Yeah. KOHANE: And so I think we’re all waiting for one of the two things to happen, either a new kind of healthcare delivery system being generated or ultimately one of these systems learns how to disrupt itself.   LEE: Yeah. GOLDBERG: We punted. [LAUGHS] We totally punted to the AI. LEE: We had three amazing guests. One was Laura Adams from National Academy of Medicine. Let’s play a snippet from her. LAURA ADAMS: I think one of the most provocative and exciting articles that I saw written recently was by Bakul Patel and David Blumenthal, who posited, should we be regulating generative AI as we do a licensed and qualified provider? 
Should it be treated in the sense that it’s got to have a certain amount of training and a foundation that’s got to pass certain tests? Does it have to report its performance? And I’m thinking, what a provocative idea, but it’s worth considering. LEE: All right, so I very well remember that we had discussed this kind of idea when we were writing our book. And I think before we finished our book, I personally rejected the idea. But now two years later, what do the two of you think? I’m dying to hear. GOLDBERG: Well, wait, why … what do you think? Like, are you sorry that you rejected it? LEE: I’m still skeptical because when we are licensing human beings as doctors, you know, we’re making a lot of implicit assumptions that we don’t test as part of their licensure, you know, that first of all, they are [a] human being and they care about life, and that, you know, they have a certain amount of common sense and shared understanding of the world.   And there’s all sorts of sort of implicit assumptions that we have about each other as human beings living in a society together. That you know how to study, you know, because I know you just went through three years of medical or four years of medical school and all sorts of things. And so the standard ways that we license human beings, they don’t need to test all of that stuff. But somehow intuitively, all of that seems really important. I don’t know. Am I wrong about that? KOHANE: So it’s compared with what issue? Because we know for a fact that doctors who do a lot of a procedure, like do this procedure, like high-risk deliveries all the time, have better outcomes than ones who only do a few high risk. We talk about it, but we don’t actually make it explicit to patients or regulate that you have to have this minimal amount. And it strikes me that in some sense, and, oh, very importantly, these things called human beings learn on the job. 
And although I used to be very resentful of it as a resident, when someone would say, I don’t want the resident, I want the … GOLDBERG: … the attending. [LAUGHTER] KOHANE: … they had a point. And so the truth is, maybe I was a wonderful resident, but some people were not so great. [LAUGHTER] And so it might be the best outcome if we actually, just like for human beings, we say, yeah, OK, it’s this good, but don’t let it work autonomously, or it’s done a thousand of them, just let it go. We just don’t have practically speaking, we don’t have the environment, the lab, to test them. Now, maybe if they get embodied in robots and literally go around with us, then it’s going to be [in some sense] a lot easier. I don’t know. LEE: Yeah. GOLDBERG: Yeah, I think I would take a step back and say, first of all, we weren’t the only ones who were stumped by regulating AI. Like, nobody has done it yet in the United States to this day, right. Like, we do not have standing regulation of AI in medicine at all in fact. And that raises the issue of … the story that you hear often in the biotech business, which is, you know, more prominent here in Boston than anywhere else, is that thank goodness Cambridge put out, the city of Cambridge, put out some regulations about biotech and how you could dump your lab waste and so on. And that enabled the enormous growth of biotech here. If you don’t have the regulations, then you can’t have the growth of AI in medicine that is worthy of having. And so, I just … we’re not the ones who should do it, but I just wish somebody would. LEE: Yeah. GOLDBERG: Zak. KOHANE: Yeah, but I want to say this as always, execution is everything, even in regulation. And so I’m mindful that a conference that both of you attended, the RAISE conference [Responsible AI for Social and Ethical Healthcare].
The Europeans in that conference came to me personally and thanked me for organizing this conference about safe and effective use of AI because they said back home in Europe, all that we’re talking about is risk, not opportunities to improve care.   And so there is a version of regulation which just locks down the present and does not allow the future that we’re talking about to happen. And so, Carey, I absolutely hear you that we need to have a regulation that takes away some of the uncertainty around liability, around the freedom to operate that would allow things to progress. But we wrote in our book that premature regulation might actually focus on the wrong thing. And so since I’m an optimist, it may be the fact that we don’t have much of a regulatory infrastructure today, that it allows … it’s a unique opportunity—I’ve said this now to several leaders—for the healthcare systems to say, this is the regulation we need.   GOLDBERG: It’s true. KOHANE: And previously it was top-down. It was coming from the administration, and those executive orders are now history. But there is an opportunity, which may or may not be attained, there is an opportunity for the healthcare leadership—for experts in surgery—to say, “This is what we should expect.”   LEE: Yeah.   KOHANE: I would love for this to happen. I haven’t seen evidence that it’s happening yet. GOLDBERG: No, no. And there’s this other huge issue, which is that it’s changing so fast. It’s moving so fast. That something that makes sense today won’t in six months. So, what do you do about that? LEE: Yeah, yeah, that is something I feel proud of because when I went back and looked at our chapter on this, you know, we did make that point, which I think has turned out to be true.   But getting back to this conversation, there’s something, a snippet of something, that Vardit Ravitsky said that I think touches on this topic.   
VARDIT RAVITSKY: So my pushback is, are we seeing AI exceptionalism in the sense that if it’s AI, huh, panic! We have to inform everybody about everything, and we have to give them choices, and they have to be able to reject that tool and the other tool versus, you know, the rate of human error in medicine is awful. So why are we so focused on informed consent and empowerment regarding implementation of AI and less in other contexts? GOLDBERG: Totally agree. Who cares about informed consent about AI. Don’t want it. Don’t need it. Nope. LEE: Wow. Yeah. You know, and this … Vardit of course is one of the leading bioethicists, you know, and of course prior to AI, she was really focused on genetics. But now it’s all about AI.   And, Zak, you know, you and other doctors have always told me, you know, the truth of the matter is, you know, what do you call the bottom-of-the-class graduate of a medical school? And the answer is “doctor.” KOHANE: “Doctor.” Yeah. Yeah, I think that again, this gets to compared with what? We have to compare AI not to the medicine we imagine we have, or we would like to have, but to the medicine we have today. And if we’re trying to remove inequity, if we’re trying to improve our health, that’s what … those are the right metrics. And so that can be done so long as we avoid catastrophic consequences of AI.   So what would the catastrophic consequence of AI be? It would be a systematic behavior that we were unaware of that was causing poor healthcare. So, for example, you know, changing the dose on a medication, making it 20% higher than normal so that the rate of complications of that medication went from 1% to 5%. And so we do need some sort of monitoring.   We haven’t put out the paper yet, but in computer science, there’s, well, in programming, we know very well the value for understanding how our computer systems work.   
And there was a guy by name of Allman, I think he’s still at a company called Sendmail, who created something called syslog. And syslog is basically a log of all the crap that’s happening in our operating system. And so I’ve been arguing now for the creation of MedLog. And MedLog … in other words, what we cannot measure, we cannot regulate, actually. LEE: Yes. KOHANE: And so what we need to have is MedLog, which says, “Here’s the context in which a decision was made. Here’s the version of the AI, you know, the exact version of the AI. Here was the data.” And we just have MedLog. And I think MedLog is actually incredibly important for being able to measure, to just do what we do in … it’s basically the black box for, you know, when there’s a crash. You know, we’d like to think we could do better than crash. We can say, “Oh, we’re seeing from MedLog that this practice is turning a little weird.” But worst case, patient dies, [we] can see in MedLog, what was the information this thing knew about it? And did it make the right decision? We can actually go for transparency, which like in aviation, is much greater than in most human endeavors.   GOLDBERG: Sounds great. LEE: Yeah, it’s sort of like a black box. I was thinking of the aviation black box kind of idea. You know, you bring up medication errors, and I have one more snippet. This is from our guest Roxana Daneshjou from Stanford. ROXANA DANESHJOU: There was a mistake in her after-visit summary about how much Tylenol she could take. But I, as a physician, knew that this dose was a mistake. I actually asked ChatGPT. I gave it the whole after-visit summary, and I said, are there any mistakes here? And it clued in that the dose of the medication was wrong. LEE: Yeah, so this is something we did write about in the book. We made a prediction that AI might be a second set of eyes, I think is the way we put it, catching things. And we actually had examples specifically in medication dose errors. 
I think for me, I expected to see a lot more of that than we are. KOHANE: Yeah, it goes back to our conversation about Epic or competitor Epic doing that. I think we’re going to see that having oversight over all medical orders, all orders in the system, critique, real-time critique, where we’re both aware of alert fatigue. So we don’t want to have too many false positives. At the same time, knowing what are critical errors which could immediately affect lives. I think that is going to become in terms of—and driven by quality measures—a product. GOLDBERG: And I think word will spread among the general public that kind of the same way in a lot of countries when someone’s in a hospital, the first thing people ask relatives are, well, who’s with them? Right?   LEE: Yeah. Yup. GOLDBERG: You wouldn’t leave someone in hospital without relatives. Well, you wouldn’t maybe leave your medical …   KOHANE: By the way, that country is called the United States. GOLDBERG: Yes, that’s true. [LAUGHS] It is true here now, too. But similarly, I would tell any loved one that they would be well advised to keep using AI to check on their medical care, right. Why not? LEE: Yeah. Yeah. Last topic, just for this Episode 4. Roxana, of course, I think really made a name for herself in the AI era writing, actually just prior to ChatGPT, you know, writing some famous papers about how computer vision systems for dermatology were biased against dark-skinned people. And we did talk some about bias in these AI systems, but I feel like we underplayed it, or we didn’t understand the magnitude of the potential issues. What are your thoughts? KOHANE: OK, I want to push back, because I’ve been asked this question several times. And so I have two comments. One is, over 100,000 doctors practicing medicine, I know they have biases. Some of them actually may be all in the same direction, and not good. But I have no way of actually measuring that. 
With AI, I know exactly how to measure that at scale and affordably. Number one. Number two, same 100,000 doctors. Let’s say I do know what their biases are. How hard is it for me to change that bias? It’s impossible … LEE: Yeah, yeah. KOHANE: … practically speaking. Can I change the bias in the AI? Somewhat. Maybe some completely. I think that we’re in a much better situation. GOLDBERG: Agree. LEE: I think Roxana made also the super interesting point that there’s bias in the whole system, not just in individuals, but, you know, there’s structural bias, so to speak. KOHANE: There is. LEE: Yeah. Hmm. There was a super interesting paper that Roxana wrote not too long ago—her and her collaborators—showing AI’s ability to detect, to spot biased decision-making by others. Are we going to see more of that? KOHANE: Oh, yeah, I was very pleased when, in NEJM AI [New England Journal of Medicine Artificial Intelligence], we published a piece with Marzyeh Ghassemi, and what they were talking about was actually—and these are researchers who had published extensively on bias and threats from AI. And they actually, in this article, did the flip side, which is how much better AI can do than human beings in this respect. And so I think that as some of these computer scientists enter the world of medicine, they’re becoming more and more aware of human foibles and can see how these systems, which if they only looked at the pretrained state, would have biases. But now, where we know how to fine-tune the de-bias in a variety of ways, they can do a lot better and, in fact, I think are much more … a much greater reason for optimism that we can change some of these noxious biases than in the pre-AI era. GOLDBERG: And thinking about Roxana’s dermatological work on how I think there wasn’t sufficient work on skin tone as related to various growths, you know, I think that one thing that we totally missed in the book was the dawn of multimodal uses, right. LEE: Yeah.
Yeah, yeah. GOLDBERG: That’s been truly amazing that in fact all of these visual and other sorts of data can be entered into the models and move them forward. LEE: Yeah. Well, maybe on these slightly more optimistic notes, we’re at time. You know, I think ultimately, I feel pretty good still about what we did in our book, although there were a lot of misses. [LAUGHS] I don’t think any of us could really have predicted really the extent of change in the world. [TRANSITION MUSIC] So, Carey, Zak, just so much fun to do some reminiscing but also some reflection about what we did. [THEME MUSIC] And to our listeners, as always, thank you for joining us. We have some really great guests lined up for the rest of the series, and they’ll help us explore a variety of relevant topics—from AI drug discovery to what medical students are seeing and doing with AI and more. We hope you’ll continue to tune in. And if you want to catch up on any episodes you might have missed, you can find them at aka.ms/AIrevolutionPodcast or wherever you listen to your favorite podcasts. Until next time. [MUSIC FADES]

Microsoft Academic @MicrosoftAcademic shared a link
2025-05-12 17:02:48 ·

Predicting and explaining AI model performance: A new approach to evaluation

www.microsoft.com
With support from the Accelerating Foundation Models Research (AFMR) grant program, a team of researchers from Microsoft and collaborating institutions has developed an approach to evaluate AI models that predicts how they will perform on unfamiliar tasks and explains why, something current benchmarks struggle to do. In the paper, “General Scales Unlock AI Evaluation with Explanatory and Predictive Power,” they introduce a methodology that goes beyond measuring overall accuracy. It assesses the knowledge and cognitive abilities a task requires and evaluates them against the model’s capabilities.

ADeLe: An ability-based approach to task evaluation

The framework uses ADeLe (annotated-demand-levels), a technique that assesses how demanding a task is for an AI model by applying measurement scales for 18 types of cognitive and knowledge-based abilities. This difficulty rating is based on a detailed rubric, originally developed for human tasks and shown to work reliably when applied by AI models.

By comparing what a task requires with what a model can do, ADeLe generates an ability profile that not only predicts performance but also explains why a model is likely to succeed or fail—linking outcomes to specific strengths or limitations.

The 18 scales reflect core cognitive abilities (e.g., attention, reasoning), knowledge areas (e.g., natural or social sciences), and other task-related factors (e.g., prevalence of the task on the internet). Each task is rated from 0 to 5 based on how much it draws on a given ability.
For example, a simple math question might score 1 on formal knowledge, while one requiring advanced expertise could score 5. Figure 1 illustrates how the full process works—from rating task requirements to generating ability profiles.

Figure 1. Top: For each AI model, (1) run the new system on the ADeLe benchmark, and (2) extract its ability profile. Bottom: For each new task or benchmark, (A) apply 18 rubrics and (B) get demand histograms and profiles that explain what abilities the tasks require. Optionally, predict performance on the new tasks for any system based on the demand and ability profiles, or past performance data, of the systems.

To develop this system, the team analyzed 16,000 examples spanning 63 tasks drawn from 20 AI benchmarks, creating a unified measurement approach that works across a wide range of tasks. The paper details how ratings across 18 general scales explain model success or failure and predict performance on new tasks in both familiar and unfamiliar settings.

Evaluation results

Using ADeLe, the team evaluated 20 popular AI benchmarks and uncovered three key findings: 1) Current AI benchmarks have measurement limitations; 2) AI models show distinct patterns of strengths and weaknesses across different capabilities; and 3) ADeLe provides accurate predictions of whether AI systems will succeed or fail on a new task.

1. Revealing hidden flaws in AI testing methods

Many popular AI tests either don’t measure what they claim or only cover a limited range of difficulty levels. For example, the Civil Service Examination benchmark is meant to test logical reasoning, but it also requires other abilities, like specialized knowledge and metacognition. Similarly, TimeQA, designed to test temporal reasoning, only includes medium-difficulty questions—missing both simple and complex challenges.

2. Creating detailed AI ability profiles

Using the 0–5 rating for each ability, the team created comprehensive ability profiles of 15 LLMs.
For each of the 18 abilities measured, they plotted “subject characteristic curves” to show how a model’s success rate changes with task difficulty. They then calculated a score for each ability—the difficulty level at which a model has a 50% chance of success—and used these results to generate radial plots showing each model’s strengths and weaknesses across the different scales and levels, illustrated in Figure 2.

Figure 2. Ability profiles for the 15 LLMs evaluated.

This analysis revealed the following:

When measured against human performance, AI systems show different strengths and weaknesses across the 18 ability scales.
Newer LLMs generally outperform older ones, though not consistently across all abilities.
Knowledge-related performance depends heavily on model size and training methods.
Reasoning models show clear gains over non-reasoning models in logical thinking, learning and abstraction, and social capabilities, such as inferring the mental states of their users.
Increasing the size of general-purpose models after a given threshold only leads to small performance gains.

3. Predicting AI success and failure

In addition to evaluation, the team created a practical prediction system based on demand-level measurements that forecasts whether a model will succeed on specific tasks, even unfamiliar ones. The system achieved approximately 88% accuracy in predicting the performance of popular models like GPT-4o and LLaMA-3.1-405B, outperforming traditional methods. This makes it possible to anticipate potential failures before deployment, adding the important step of reliability assessment for AI models.

ADeLe can be extended to multimodal and embodied AI systems, and it has the potential to serve as a standardized framework for AI research, policymaking, and security auditing. This technology marks a major step toward a science of AI evaluation, one that offers both clear explanations of system behavior and reliable predictions about performance.
It aligns with the vision laid out in a previous Microsoft position paper on the promise of applying psychometrics to AI evaluation and a recent Societal AI white paper emphasizing the importance of AI evaluation. As general-purpose AI advances faster than traditional evaluation methods, this work lays a timely foundation for making AI assessments more rigorous, transparent, and ready for real-world deployment. The research team is working toward building a collaborative community to strengthen and expand this emerging field.
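
The ability score described in the article (the difficulty level at which a model has a 50% chance of success) can be illustrated with a small sketch. This is not the paper's procedure: the article does not say how the subject characteristic curves are fitted, so the code below assumes a simple stand-in, linear interpolation between observed per-level success rates. The function names and the demo data are hypothetical.

```python
from collections import defaultdict


def success_rate_by_level(results):
    """Aggregate (demand_level, succeeded) pairs into a success rate per level.

    demand_level is the 0-5 rating of a task on one ability scale.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for level, succeeded in results:
        totals[level] += 1
        wins[level] += int(succeeded)
    return {level: wins[level] / totals[level] for level in sorted(totals)}


def ability_score(rates, threshold=0.5):
    """Estimate the demand level at which the success rate drops to `threshold`.

    Assumption: linear interpolation between adjacent observed levels,
    used here only as a stand-in for a fitted characteristic curve.
    Returns the lowest observed level if the model is already below the
    threshold there, and the highest observed level if it never drops below.
    """
    levels = sorted(rates)
    if rates[levels[0]] < threshold:
        return float(levels[0])
    for lo, hi in zip(levels, levels[1:]):
        r_lo, r_hi = rates[lo], rates[hi]
        if r_lo >= threshold > r_hi:
            # Fraction of the way from lo to hi at which the rate hits threshold.
            return lo + (r_lo - threshold) / (r_lo - r_hi) * (hi - lo)
    return float(levels[-1])


# Hypothetical results for one model on one ability scale.
demo_results = (
    [(1, True)] * 10
    + [(2, True)] * 8 + [(2, False)] * 2
    + [(3, True)] * 2 + [(3, False)] * 8
    + [(4, False)] * 10
)
demo_rates = success_rate_by_level(demo_results)
demo_score = ability_score(demo_rates)
print(demo_rates, demo_score)
```

Repeating this for each of the 18 scales would give one score per ability, which is the kind of profile the radial plots in Figure 2 summarize. A fitted logistic curve would be a more robust choice than interpolation when the per-level rates are noisy.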

Microsoft Academic @MicrosoftAcademic shared a link
2025-05-08 16:49:54 ·

Abstracts: Heat Transfer and Deep Learning with Hongxia Hao and Bing Lv

www.microsoft.com
HONGXIA HAO: Nice to be here. BING LV: Nice to be here, too. HUIZINGA: So Hongxia, let’s start with you and a brief overview of this paper. In just a few sentences. Tell us about the problem your research addresses and more importantly, why we should care about it. HAO: Let me start with a very simple yet profound question. What’s the fastest the heat can travel through a solid material? This is not just an academic curiosity, but it’s a question that touched the bottom of how we build technologies around us. So from the moment when you tap your smartphone, and the moment where the laptop is turned on and functioning, heat is always flowing. So we’re trying to answer the question of a century-old mystery of the upper limit of heat transfer in solids. So we care about this not just because it’s a fundamental problem in physics and material science, but because solving it could really rewrite the rulebook for designing high-efficiency electronics and sustainable energy, etc. And nowadays, with very cutting-edge nanometer chips or very fancy technologies, we are packing more computing power into smaller space, but the faster and denser we build, the harder it becomes to remove the heat. So in many ways, thermal bottlenecks, not just transistor density, are now the ceiling of the Moore’s Law. And also the stakes are very enormous. We really wish to bring more thermal solutions by finding more high thermal conductor choices from the perspective of materials discovery with the help of AI. LV: So I think one of the biggest things as Hongxia said, right? Thermal solutions will become, eventually become, a bottleneck for all type of heterogeneous integration of the materials. So from this perspective, so how people actually have been finding out previously, all the thermal was the last solution to solve. But now people actually more and more realize all these things have to be upfront. This co-design, all these things become very important. 
So I think what we are doing right now, integrated with AI, helping to identify the large space of the materials, identify fundamentally what will be the limit of this material, will become very important for the society. HUIZINGA: Hmm. Yeah. Hongxia, did you have anything to add to that? HAO: Yes, so previously many people are working on exploring these material science questions through experimental tradition and the past few decades people see a new trend using computational materials discovery. Like for example, we do the fundamental solving of the Schrödinger equation using Density Functional Theory [DFT]. Actually, this brings us a lot of opportunities. The question here is, as the theory is getting more and more developed, it’s too expensive for us to make it very large scale and to study tons of materials. Think about this. The bottleneck here, now, is not just about having a very good theory, it’s about the scale. So, this is where AI, specifically now we are using deep learning, comes into play. HUIZINGA: Well, Hongxia, let’s stay with you for a minute and talk about methodology. How did you do this research and what was the methodology you employed? HAO: So here we, for this question, we built a pipeline that spans the AI, the quantum mechanics, and computational brute-force with a blend of efficiency and accuracy. It begins with generating an enormous chemical and structure design space because this is inspired by Slack’s principle. We focus first on simple crystals, and these are the systems most likely to have low anharmonicity, fewer phonon scattering events, and therefore potentially have high thermal conductivities. But we didn’t stop here. We also included a huge pool of more complex and higher energy structures to ensure diversity and avoid bias.
And for each candidate, we first run like a structure relaxation using MatterSim, which is a deep learning foundational model for material science for us to characterize the properties of materials. And we use that screen for dynamic stability. And now it’s about 200K structures passed this filter. And then came another real challenge: calculating the thermal conductivity. We try to solve this problem using the Boltzmann transport equation and the three-phonon scattering process. The twist here is all of this was not done by traditional DFT solvers, but with our deep learning model, the MatterSim. It’s trained to predict energy, force, and stress. And we can get second- and third-order interatomic force constants directly from here, which can guarantee the accuracy of the solution. And finally, to validate the model’s predictions, we performed full DFT-based calculations on the top candidates that we found, some of which even include higher-order scattering mechanism, electron phonon coupling effect, etc. And this rigorous validation gave us confidence in the speed and accuracy trade-offs and revealed a spectrum of materials that had either previously been overlooked or were never before conceived. HUIZINGA: So Bing, let’s talk about your research findings. How did things work out for you on this project and what did you find? LV: I think one of the biggest things for this paper is it creates a very large material base. Basically, you can say it’s a smart database which eventually will be made accessible to the public. I think that’s a big achievement because people who actually if they have to look into it, they actually can go search Microsoft database, finding out, oh, this material does have this type of thermal properties. This is actually, this database contains about 230,000 materials.
And one of the things we confirm is the highest thermal conductivity material based on all the wisdom of Slack criteria, predicted diamond would have the highest thermal conductivity. We more or less really very solidly prove diamond, at this stage, will remain with the highest thermal conductivity. We have a lot of new materials, exotic materials, which some of them, Hongxia can elaborate a little bit more. So, which having all this very exotic combination of properties, thermal with other properties, which could actually provide a new insight for new physics development, new material development, and a new device perspective. All of this combined will have actually a very profound impact to society. HUIZINGA: Yeah, Hongxia, go a little deeper on that because that was an interesting part of the paper when you talked about diamond still being the sort of “gold standard,” to mix metaphors! But you’ve also found some other materials that are remarkable compared to silicon. HAO: Yeah, yeah. Among this search space, even though we didn’t find like something that’s higher than diamonds, but we do discover more than like twenty new materials with thermal conductivity exceeding that of silicon. And silicon is something like a benchmark for criteria that we think we want to compare with because it’s a backbone of modern electronics. More interestingly, I think, is the manganese vanadium. It shows some very interesting and surprising phenomena. Like it’s a metallic compound, but with very high lattice thermal conductivity. And this is the first time discovered by, like, through our search pattern, and it’s something that cannot be easily discovered without the help of AI. And right now, I think Bing can explain more on this, and show some interesting results. HUIZINGA: Yeah, go ahead Bing.
LV: So this is actually very surprising to me as an experimentalist because of when Hongxia presented their theory work to me, this material, manganese vanadium, it’s discovered back in 1938, almost 100 years ago, but there’s no more than twenty papers talking about this! A lot of them was on theory, okay, not even on experimental part. We actually did quite a bit of work on this. We actually are in the process; will characterize this and then moving forward even for the thermal conductivity measurements. So that will be hopefully, will be adding to the value of these things, showing you, Hey, AI does help to predict the materials could really generate the new materials with very good high thermal conductivity. HUIZINGA: Yeah, so Bing, stay with you for a minute. I want you to talk about some kind of real-world applications of this. I know you alluded to a couple of things, but how is this work significant in that respect, and who might be most excited about it, aside from the two of you? [LAUGHS] LV: So I think as I mentioned before, the first thing is this database. I believe that’s the first ever large material database regarding to the thermal conductivity. And it has, as I said, 230,000 materials with AI-predicted thermal conductivity. This will provide not only science but engineering with a vastly expanding catalog of candidate materials for the future roadmap of integration, material integration, and all these bottlenecks we are talking about, the thermal solution for the semiconductors or for even beyond the semiconductor integration, people actually can have a database to looking for. So these things, it will become very important, and I believe over a long time it will generate a very long impact for the research community, for the society development. HUIZINGA: Yeah. Hongxia, did you have anything to add to that one too? HAO: Yeah, so this study reshapes how we think about limits.
I like the sentence that the only way to discover the limits of possible is to go beyond them into the impossible. In this case, we tried, but we didn’t break the diamond limit. But we proved it even more rigorously than ever before. In doing so, we also uncovered some uncharted peaks in the thermal conductivity landscape. This would not happen without new AI capabilities for material science. I think in the long run, I believe researchers could benefit from using this AI design and shift their way on how to do materials research with AI. HUIZINGA: Yeah, it’ll be interesting to see if anyone ever does break the diamond limit with the new tools that are available, but… HAO: Yeah! HUIZINGA: So this is the part of the abstracts podcast where I like to ask for sort of a golden nugget, a one sentence takeaway that listeners might get from this paper. If you had one Hongxia, what would it be? And then I’ll ask Bing to maybe give his. HAO: Yes. AI is no longer just a tool. It’s becoming a critical partner for us in scientific discovery. So our work proved that the large-scale data-driven science can now approach long-standing and fundamental questions with very fresh eyes. When trained well, and guided with physical intuition, models like MatterSim can really realize a full in-silico characterization for materials and don’t just simulate some known materials, but really trying to imagine what nature hasn’t yet revealed. Our work points to a path forward, not just incrementally better materials, but entirely new class of high-performance compounds where we could never have guessed without AI. HUIZINGA: Yeah. Bing, what’s your one takeaway? LV: I think I want to add a few things on top of Hongxia’s comments because I think Hongxia has very good critical words I would like to emphasize. When we train the AI well, if we guide the AI well, it could be very useful to become our partner. 
So I think all in all, our human being’s intellectual merit here is still going to play a significantly important role, okay? We are generating this AI, we should really train the AI, we should be using our human being intellectual merit to guide them to be useful for our human being society advancement. Now with all these AI tools, I think it’s a very golden time right now. Experimentalists could work very closely with like Hongxia, who’s a good theorist who has very good intellectual merits, and then we actually now incorporate with AI, then combine all pieces together, hopefully we’re really able to accelerating material discovery in a much faster pace than ever which the whole society will eventually get a benefit from it. HUIZINGA: Yeah. Well, as we close, Bing, I want you to go a little further and talk about what’s next then, research wise. What are the open questions or outstanding challenges that remain in this field and what’s on your research agenda to address them? LV: So first of all, I think this paper is addressing primarily on these crystalline ordered inorganic bulk materials. And also with the condition we are targeting at ambient pressure, room temperature, because that’s normally how the instrument is working, right? But what if under extreme conditions? We want to go to space, right? There we’ll have extreme conditions, some very… sometimes very cold, sometimes very hot. We have some places with extremely probably quite high pressure. Or we have some conditions that are highly radioactive. So under that condition, there’s going to be a new database could be emerged. Can we do something beyond that? Another good important thing is we are targeting this paper on high thermal conductivity. What about extremely low thermal conductivity? Those will actually bring a very good challenge for theorists and also the machine learning approach. I think that’s something Hongxia probably is very excited to work on in that direction. 
I know since she’s ambitious, she wants to do something more than beyond what we actually achieved so far. HUIZINGA: Yeah, so Hongxia, how would you encapsulate what your dream research is next? HAO: Yeah, so I think besides all of these exciting research directions, on my end, another direction is perhaps kind of exciting is we want to move from search to design. So right now we are kind of good at asking like what exists by just doing a forward prediction and brute force. But with generative AI, we can start asking what should exist? In the future, we can have an incorporation between forward prediction and backwards generative design to really tackle questions. If you have materials like you want to have desired like properties, how would you design the problems? HUIZINGA: Well, it sounds like there’s a full plate of research agenda goodness going forward in this field, both with human brains and AI. So, Hongxia Hao and Bing Lv, thanks for joining us today. And to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/Abstracts, or you can read a pre-print of it on arXiv. See you next time on Abstracts!

Microsoft Academic @MicrosoftAcademic shared a link
2025-05-08 01:40:44 ·

Research Focus: Week of May 7, 2025

www.microsoft.com
In this issue: New research on compound AI systems and causal verification of the Confidential Consortium Framework; release of Phi-4-reasoning; enriching tabular data with semantic structure, and more. NEW RESEARCH Towards Resource-Efficient Compound AI Systems This research introduces Murakkab, a prototype system built on a declarative workflow that reimagines how compound AI systems are built and managed to significantly improve resource efficiency. Compound AI systems integrate multiple interacting components like language models, retrieval engines, and external tools. They are essential for addressing complex AI tasks. However, current implementations could benefit from greater efficiencies in resource utilization, with improvements to tight coupling between application logic and execution details, better connections between orchestration and resource management layers, and bridging gaps between efficiency and quality. Murakkab addresses critical inefficiencies in current AI architectures and offers a new approach that unifies workflow orchestration and cluster resource management for better performance and sustainability. In preliminary evaluations, it demonstrates speedups up to ∼ 3.4× in workflow completion times while delivering ∼ 4.5× higher energy efficiency, showing promise in optimizing resources and advancing AI system design. NEW RESEARCH Smart Casual Verification of the Confidential Consortium Framework This work presents a new, pragmatic verification technique that improves the trustworthiness of distributed systems like the Confidential Consortium Framework (CCF) and proves its effectiveness by catching critical bugs before deployment. Smart casual verification is a novel hybrid verification approach to validating CCF, an open-source platform for developing trustworthy and reliable cloud applications which underpins Microsoft’s Azure Confidential Ledger service. 
The researchers apply smart casual verification to validate the correctness of CCF’s novel distributed protocols, focusing on its unique distributed consensus protocol and its custom client consistency model. This hybrid approach combines the rigor of formal specification and model checking with the pragmatism of automated testing, specifically binding the formal specification in TLA+ to the C++ implementation. While traditional formal methods are often one-off efforts by domain experts, the researchers have integrated smart casual verification into CCF’s continuous integration pipeline, allowing contributors to continuously validate CCF as it evolves. NEW RESEARCH Phi-4-reasoning Technical Report This report introduces Phi-4-reasoning, a 14-billion-parameter model optimized for complex reasoning tasks. It is trained via supervised fine-tuning of Phi-4 using a carefully curated dataset of high-quality prompts and reasoning demonstrations generated by o3-mini. These prompts span diverse domains, including math, science, coding, and spatial reasoning, and are selected to challenge the base model near its capability boundaries. Building on recent findings that reinforcement learning (RL) can further improve smaller models, the team developed Phi-4-reasoning-plus, which incorporates an additional outcome-based RL phase using verifiable math problems. This enhances the model’s ability to generate longer, more effective reasoning chains. Despite its smaller size, the Phi-4-reasoning family outperforms significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B and approaches the performance of full-scale frontier models like DeepSeek-R1. It excels in tasks requiring multi-step problem solving, logical inference, and goal-directed planning. The work highlights the combined value of supervised fine-tuning and reinforcement learning for building efficient, high-performing reasoning models. 
It also offers insights into training data design, methodology, and evaluation strategies. Phi-4-reasoning contributes to the growing class of reasoning-specialized language models and points toward more accessible, scalable AI for science, education, and technical domains. NEW RESEARCH TeCoFeS: Text Column Featurization using Semantic Analysis This research introduces a practical, cost-effective solution for enriching tabular data with semantic structure, making it more useful for downstream analysis and insights—which is especially valuable in business intelligence, data cleaning, and automated analytics workflows. This approach outperforms baseline models and naive LLM applications on converted text classification benchmarks. Extracting structured insights from free-text columns in tables—such as product reviews or user feedback—can be time-consuming and error-prone, especially when relying on traditional syntactic methods that often miss semantic meaning. This research introduces the semantic text column featurization problem, which aims to assign meaningful, context-aware labels to each entry in a text column. The authors propose a scalable, efficient method that combines the power of LLMs with text embeddings. Instead of labeling an entire column manually or applying LLMs to every cell—an expensive process—this new method intelligently samples a diverse subset of entries, uses an LLM to generate semantic labels for just that subset, and then propagates those labels to the rest of the column using embedding similarity. NEW RESEARCH This work introduces ARTIST (Agentic Reasoning and Tool Integration in Self-improving Transformers), a new paradigm for LLM reasoning that expands beyond traditional language-only inference. While LLMs have made considerable strides in complex reasoning tasks, they remain limited by their reliance on static internal knowledge and text-only reasoning. 
Real-world problem solving often demands dynamic, multi-step reasoning, adaptive decision making, and the ability to interact with external tools and environments. In this research, ARTIST brings together agentic reasoning, reinforcement learning (RL), and tool integration, designed to enable LLMs to autonomously decide when and how to invoke internal tools within multi-turn reasoning chains. ARTIST leverages outcome-based reinforcement learning to learn robust strategies for tool use and environment interaction without requiring step-level supervision. Extensive experiments on mathematical reasoning and multi-turn function calling benchmarks show that ARTIST consistently outperforms state-of-the-art baselines, with up to 22% absolute improvement over base models and strong gains on the most challenging tasks. Detailed studies show that agentic RL training leads to deeper reasoning, more effective tool use, and higher-quality solutions. PODCAST Materialism Podcast: MatterGen What if you could find materials with tailored properties without ever entering the lab? The Materialism Podcast, which is dedicated to exploring materials science and engineering, talks with Tian Xie from Microsoft Research about MatterGen, an AI tool which accelerates materials science discovery. Tune in to hear a discussion of the new Azure AI Foundry, where MatterGen will interact with and support MatterSim, an advanced deep learning model designed to simulate the properties of materials across a wide range of elements, temperatures, and pressures. IN THE NEWS: Highlights of recent media coverage of Microsoft Research When ChatGPT Broke an Entire Field: An Oral History Quanta Magazine | April 30, 2025 Large language models are everywhere, igniting discovery, disruption and debate in whatever scientific community they touch. But the one they touched first (for better, worse and everything in between) was natural language processing. 
What did that impact feel like to the people experiencing it firsthand? To tell that story, Quanta interviewed 19 NLP experts, including Kalika Bali, senior principal researcher at Microsoft Research. From researchers to students, tenured academics to startup founders, they describe a series of moments (dawning realizations, elated encounters and at least one “existential crisis”) that changed their world. And ours.
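
The sample, label, and propagate idea from the TeCoFeS summary earlier in this issue can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `embed` is a toy bag-of-words stand-in for a real text-embedding model, `stub_llm_label` is a hypothetical placeholder for an LLM call, and greedy farthest-point sampling is only one plausible way to pick a diverse subset. The summary does not say which choices the authors actually make.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in (assumption) for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def sample_diverse(vectors, k):
    """Greedy farthest-point sampling: keep adding the entry that is
    least similar to everything already chosen."""
    if not vectors:
        return []
    chosen = [0]
    while len(chosen) < min(k, len(vectors)):
        candidates = (i for i in range(len(vectors)) if i not in chosen)
        best = max(
            candidates,
            key=lambda i: -max(cosine(vectors[i], vectors[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen


def stub_llm_label(text):
    """Hypothetical placeholder for an LLM call; a keyword rule keeps the sketch runnable."""
    lowered = text.lower()
    return "negative" if any(w in lowered for w in ("broke", "bad", "slow")) else "positive"


def featurize_column(cells, k=2, labeler=stub_llm_label):
    """Label only k diverse cells with the labeler, then give every other
    cell the label of its most similar labeled cell."""
    vectors = [embed(c) for c in cells]
    seeds = sample_diverse(vectors, k)
    seed_labels = {i: labeler(cells[i]) for i in seeds}  # the only labeler calls
    labels = []
    for i, v in enumerate(vectors):
        if i in seed_labels:
            labels.append(seed_labels[i])
        else:
            nearest = max(seeds, key=lambda j: cosine(v, vectors[j]))
            labels.append(seed_labels[nearest])
    return labels


reviews = [
    "great product works well",
    "arrived broke and slow",
    "great product love it",
    "broke after a week and slow",
]
print(featurize_column(reviews, k=2))
```

In this toy run only two of the four cells reach the labeler; the other two inherit a label from their nearest labeled neighbor, which is the cost saving the summary describes.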

Microsoft Academic @MicrosoftAcademic shared a link
2025-05-07 16:47:10 ·

Microsoft Fusion Summit explores how AI can accelerate fusion research

www.microsoft.com
The pursuit of nuclear fusion as a limitless, clean energy source has long been one of humanity’s most ambitious scientific goals. Research labs and companies worldwide are working to replicate the fusion process that occurs at the sun’s core, where isotopes of hydrogen combine to form helium, releasing vast amounts of energy. While scalable fusion energy is still years away, researchers are now exploring how AI can help accelerate fusion research and bring this energy to the grid sooner.

In March 2025, Microsoft Research held its inaugural Fusion Summit, a landmark event that brought together distinguished speakers and panelists from within and outside Microsoft Research to explore this question.

Ashley Llorens, Corporate Vice President and Managing Director of Microsoft Research Accelerator, opened the Summit by outlining his vision for a self-reinforcing system that uses AI to drive sustainability. Steven Cowley, laboratory director of the U.S. Department of Energy’s Princeton Plasma Physics Laboratory, professor at Princeton University, and former head of the UK Atomic Energy Authority, followed with a keynote explaining the intricate science and engineering behind fusion reactors. His message was clear: advancing fusion will require international collaboration and the combined power of AI and high-performance computing to model potential fusion reactor designs.

Applying AI to fusion research

North America’s largest fusion facility, DIII-D, operated by General Atomics and owned by the US Department of Energy (DOE), provides a unique platform for developing and testing AI applications for fusion research, thanks to its pioneering data and digital twin platform. Richard Buttery from DIII-D and Dave Humphreys from General Atomics demonstrated how the US DIII-D National Fusion Program is already applying AI to advance reactor design and operations, highlighting promising directions for future development. Their examples included applying AI to active plasma control to avoid disruptive instabilities, using AI-controlled trajectories to avoid tearing modes, and implementing feedback control using machine learning-derived density limits for safer high-density operations.

One persistent challenge in reactor design involves building the interior “first wall,” which must withstand extreme heat and particle bombardment. Zulfi Alam, corporate vice president of Microsoft Quantum, discussed the potential of using quantum computing in fusion, particularly for addressing material challenges like hydrogen diffusion in reactors. He noted that silicon nitride shows promise as a barrier to hydrogen and vapor and explained the challenge of binding it to the reaction chamber. He emphasized the potential of quantum computing to improve material prediction and synthesis, enabling more efficient processes. He shared that his team is also investigating advanced silicon nitride materials to protect this critical component from neutron and alpha particle damage, an innovation that could make fusion commercially viable.

Exploring AI’s broader impact on fusion engineering

Lightning talks from Microsoft Research labs addressed the central question of AI’s potential to accelerate fusion research and engineering. Speakers covered a wide range of applications, from using gaming AI for plasma control and robotics for remote maintenance to physics-informed AI for simulating materials and plasma behavior. Closing the session, Archie Manoharan, Microsoft’s director of nuclear engineering for Cloud Operations and Infrastructure, emphasized the need for a comprehensive energy strategy, one that incorporates renewables, efficiency improvements, storage solutions, and carbon-free sources like fusion.

The Summit culminated in a thought-provoking panel discussion moderated by Ade Famoti, featuring Archie Manoharan, Richard Buttery, Steven Cowley, and Chris Bishop, Microsoft Technical Fellow and director of Microsoft Research AI for Science. Their wide-ranging conversation explored the key challenges and opportunities shaping the field of fusion.

The panel highlighted several themes: the role of new regulatory frameworks that balance innovation with safety and public trust; the importance of materials discovery in developing durable fusion reactor walls; and the game-changing role AI could play in plasma optimization and surrogate modelling of fusion’s underlying physics. They also examined the importance of global research collaboration, citing projects like the International Thermonuclear Experimental Reactor (ITER), the world’s largest experimental fusion device under construction in southern France, as testbeds for shared progress. One persistent challenge, however, is data scarcity. This prompted a discussion of using physics-informed neural networks as a potential approach to supplement limited experimental data.

Global collaboration and next steps

Microsoft is collaborating with ITER to help advance the technologies and infrastructure needed to achieve fusion ignition (the critical point where a self-sustaining fusion reaction begins), using Microsoft 365 Copilot, Azure OpenAI Service, Visual Studio, and GitHub. Microsoft Research is now cooperating with ITER to identify where AI can be exploited to model future experiments to optimize its design and operations.

Now Microsoft Research has signed a Memorandum of Understanding with the Princeton Plasma Physics Laboratory (PPPL) to foster collaboration through knowledge exchange, workshops, and joint research projects. This effort aims to address key challenges in fusion, materials, plasma control, digital twins, and experiment optimization. Together, Microsoft Research and PPPL will work to drive innovation and advances in these critical areas.

Fusion is a scientific challenge unlike any other and could be key to sustainable energy in the future. We’re excited about the role AI can play in helping make that vision a reality. To learn more, visit the Fusion Summit event page, or connect with us by email at FusionResearch@microsoft.com.

Microsoft Academic @MicrosoftAcademic shared a link
2025-05-05 19:51:43 ·

Societal AI: Building human-centered AI systems

www.microsoft.com
In October 2022, Microsoft Research Asia hosted a workshop that brought together experts in computer science, psychology, sociology, and law as part of Microsoft’s commitment to responsible AI. The event led to ongoing collaborations exploring AI’s societal implications, including the Value Compass project. As these efforts grew, researchers focused on how AI systems could be designed to meet the needs of people and institutions in areas like healthcare, education, and public services. This work culminated in Societal AI: Research Challenges and Opportunities, a white paper that explores how AI can better align with societal needs.

What is Societal AI?

Societal AI is an emerging interdisciplinary area of study that examines how AI intersects with social systems and public life. It focuses on two main areas: (1) the impact of AI technologies on fields like education, labor, and governance; and (2) the challenges posed by these systems, such as evaluation, accountability, and alignment with human values. The goal is to guide AI development in ways that respond to real-world needs.

The white paper offers a framework for understanding these dynamics and provides recommendations for integrating AI responsibly into society. This post highlights the paper’s key insights and what they mean for future research.

Tracing the development of Societal AI

Societal AI began nearly a decade ago at Microsoft Research Asia, where early work on personalized recommendation systems uncovered risks like echo chambers, where users are repeatedly exposed to similar viewpoints, and polarization, which can deepen divisions between groups. Those findings led to deeper investigations into privacy, fairness, and transparency, helping inform Microsoft’s broader approach to responsible AI. The rapid rise of large-scale AI models in recent years has made these concerns more urgent.

Today, researchers across disciplines are working to define shared priorities and guide AI development in ways that reflect social needs and values.

Key insights

The white paper outlines several important considerations for the field:

Interdisciplinary framework: Bridges technical AI research with the social sciences, humanities, policy studies, and ethics to address AI’s far-reaching societal effects.
Actionable research agenda: Identifies ten research questions that offer a roadmap for researchers, policymakers, and industry leaders.
Global perspective: Highlights the importance of different cultural perspectives and international cooperation in shaping responsible AI development dialogue.
Practical insights: Balances theory with real-world applications, drawing from collaborative research projects.

“AI’s impact extends beyond algorithms and computation; it challenges us to rethink fundamental concepts like trust, creativity, agency, and value systems,” says Lidong Zhou, managing director of Microsoft Research Asia. “It recognizes that developing more powerful AI models is not enough; we must examine how AI interacts with human values, institutions, and diverse cultural contexts.”

Figure 1. Societal AI research agenda

Guiding principles for responsible integration

The research agenda is grounded in three key principles:

Harmony: AI should minimize conflict and build trust to support acceptance.
Synergy: AI should complement human capabilities, enabling outcomes that neither humans nor machines could achieve alone.
Resilience: AI should be robust and adaptable as social and technological conditions evolve.

Ten critical questions

These questions span both technical and societal concerns:

1. How can AI be aligned with diverse human values and ethical principles?
2. How can AI systems be designed to ensure fairness and inclusivity across different cultures, regions, and demographic groups?
3. How can we ensure AI systems are safe, reliable, and controllable, especially as they become more autonomous?
4. How can human-AI collaboration be optimized to enhance human abilities?
5. How can we effectively evaluate AI’s capabilities and performance in new, unforeseen tasks and environments?
6. How can we enhance AI interpretability to ensure transparency in its decision-making processes?
7. How will AI reshape human cognition, learning, and creativity, and what new capabilities might it unlock?
8. How will AI redefine the nature of work, collaboration, and the future of global business models?
9. How will AI transform research methodologies in the social sciences, and what new insights might it enable?
10. How should regulatory frameworks evolve to govern AI development responsibly and foster global cooperation?

This list will evolve alongside AI’s developing societal impact, ensuring the agenda remains relevant over time. Building on these questions, the white paper underscores the importance of sustained, cross-disciplinary collaboration to guide AI development in ways that reflect societal priorities and public interest.

“This thoughtful and comprehensive white paper from Microsoft Research Asia represents an important early step forward in anticipating and addressing the societal implications of AI, particularly large language models (LLMs), as they enter the world in greater numbers and for a widening range of purposes,” says research collaborator James A. Evans, professor of sociology at the University of Chicago.

Microsoft is committed to fostering collaboration and invites others to take part in developing governance systems. As new challenges arise, the responsible use of AI for the public good will remain central to our research. We hope the white paper serves as both a guide and a call to action, emphasizing the need for engagement across research, policy, industry, and the public.

For more information, and to access the full white paper, visit the Microsoft Research Societal AI page. Listen to the author discuss more about the research in this podcast.

Acknowledgments

We are grateful for the contributions of the researchers, collaborators, and reviewers who helped shape this white paper.

Microsoft Academic @MicrosoftAcademic shared a link
2025-05-05 16:48:14 ·

Abstracts: Societal AI with Xing Xie

www.microsoft.com
XING XIE: Thank you for having me. HUIZINGA: So let’s start with a brief overview of the background for this white paper on Societal AI. In just a few sentences, tell us how the idea came about and what key principles drove the work. XIE: The idea for this white paper emerged in response to the shift we are witnessing in the AI landscape. Particularly since the release of ChatGPT in late 2022, these models didn’t just change the pace of AI research, they began reshaping our society, education, economy, and yeah, even the way we understand ourselves. At Microsoft Research Asia, we felt a strong urgency to better understand these changes. Over the past 30 months, we have been actively exploring this frontier in partnership with experts from psychology, sociology, law, and philosophy. This white paper serves three main purposes. First, to document what we have learned. Second, to guide future research directions. And last, to open up an effective communication channel with collaborators across different disciplines. HUIZINGA: Research on responsible AI is a relatively new discipline and it’s profoundly multidisciplinary. So tell us about the work that you drew on as you convened this series of workshops and summer schools, research collaborations and interdisciplinary dialogues. What kinds of people did you bring to the table and for what reason? XIE: Yeah. Responsible AI actually has been evolving within Microsoft for like about a decade. But with the rise of large language models, the scope and urgency of these challenges have grown exponentially. That’s why we have leaned heavily on interdisciplinary collaboration. For instance, in the Value Compass Project, we worked with philosophers to frame human values in a scientifically actionable way, something essential for aligning AI behavior. In our AI evaluation efforts, we drew from psychometrics to create more principled ways of assessing these systems. 
And with the sociologists, we have examined how AI affects education and social systems. This joint effort has been central to the work we share in this white paper. HUIZINGA: So white papers differ from typical research papers in that they don’t rely on a particular research methodology per se, but you did set, as a backdrop for your work, ten questions for consideration. So how did you decide on these questions and how or by what means did you attempt to answer them? XIE: Rather than follow a traditional research methodology, we built this white paper around ten fundamental, foundational research questions. These came from extensive dialogue, not only with social scientists, but also computer scientists working at the technical front of AI. These questions span both directions. First, how AI impacts society, and second, how social science can help solve technical challenges like alignment and safety. They reflect a dynamic agenda that we hope to evolve continuously through real-world engagement and deeper collaboration. HUIZINGA: Can you elaborate on… a little bit more on the questions that you chose to investigate as a group or groups in this? XIE: Sure, I think I can use the Value Compass Project as one example. In that project, our main goal is to try to study how we can better align the value of AI models with our human values. Here, one fundamental question is how we define our own human values. There actually is a lot of debate and discussions on this. Fortunately, we see in philosophy and sociology actually they have studied this for years, like, for like hundreds of years. They have defined some, like, such as basic human value framework, they have defined like modern foundation theory. We can borrow those expertise. Actually, we have worked with sociology and philosophers, try to borrow these expertise and define a framework that could be usable for AI. Actually, we have worked on, like, developing some initial frameworks and evaluation methods for this. 
HUIZINGA: So one thing that you just said was to frame philosophical issues in a scientifically actionable way. How hard was that? XIE: Yeah, it is actually not easy. I think that first of all, social scientists and AI researchers, we… usually we speak different languages. HUIZINGA: Right! XIE: Our research is at a very different pace. So at the very beginning, I think we should find out what’s the best way to talk to each other. So we have workshops, have joint research projects, we have them visit us, and also, we have supervised some joint interns. So that’s all the ways we try to find some common ground to work together. More specifically for this value framework, we have tried to understand what’s the latest program from their source and also try how to adapt them to an AI context. So that’s, I mean, it’s not easy, but it’s like enjoyable and exciting journey! HUIZINGA: Yeah, yeah, yeah. And I want to push in on one other question that I thought was really interesting, which you asked, which was how can we ensure AI systems are safe, reliable, controllable, especially as they become more autonomous? I think this is a big question for a lot of people. What kind of framework did you use to look at that? XIE: Yeah, there are many different aspects. I think alignment definitely is an aspect. That means how we can make sure we can have a way to truly and deeply embed our values into the AI model. Even after we define our value, we still need a way to make sure that it’s actually embedded in. And also evaluation I think is another topic. Even we have this AI…. looks safe and looks behavior good, but how we can evaluate that, how we can make sure it is actually doing the right thing. So we also have some collaboration with psychometrics people to define a more scientific evaluation framework for this purpose as well. HUIZINGA: Yeah, I remember talking to you about your psychometrics in the previous podcast… XIE: Yeah! 
HUIZINGA: …you were on and that was fascinating to me. And I hope… at some point I would love to have a bigger conversation on where you are now with that because I know it’s an evolving field. XIE: It’s evolving! HUIZINGA: Yeah, amazing! Well, let’s get back to this paper. White papers aren’t designed to produce traditional research findings, as it were, but there are still many important outcomes. So what would you say the most important takeaways or contributions of this paper are? XIE: Yeah, the key takeaway, I believe, is AI is no longer just a technical tool. It’s becoming a social actor. HUIZINGA: Mmm. XIE: So it must be studied as a dynamic evolving system that intersects with human values, cognition, culture, and governance. So we argue that interdisciplinary collaboration is no longer optional. It’s essential. Social sciences offer tools to understand the complexity, bias, and trust, concepts that are critical for AI’s safe and equitable deployment. So the synergy between technical and social perspectives is what will help us move from reactive fixes to proactive design. HUIZINGA: Let’s talk a little bit about the impact that a paper like this can have. And it’s more of a thought leadership piece, but who would you say will benefit most from the work that you’ve done in this white paper and why? XIE: We hope this work speaks to both AI and social science communities. For AI researchers, this white paper provides frameworks and real-world examples, like value evaluation systems and cross-cultural model training that can inspire new directions. And for social scientists, it opens doors to new tools and collaborative methods for studying human behavior, cognition, and institutions. And beyond academia, we believe policymakers and industry leaders can also benefit as the paper outlines practical governance questions and highlights emerging risks that demand timely attention. 
HUIZINGA: Finally, Xing, what would you say the outstanding challenges are for Societal AI, as you framed it, and how does this paper lay a foundation for future research agendas? Specifically, what kinds of research agendas might you see coming out of this foundational paper? XIE: We believe this white paper is not a conclusion, it’s a starting point. While the ten research questions are a strong foundation, they also expose deeper challenges. For example, how do we build a truly interdisciplinary field? How can we reconcile the different timelines, methods, and cultures of AI and social science? And how do we nurture talents who can work fluently across those both domains? We hope this white paper encourages others to take on these questions with us. Whether you are researcher, student, policymaker, or technologist, there is a role for you in shaping AI that not only works but works for society. So yeah, I look forward to the conversation with everyone. HUIZINGA: Well, Xing Xie, it’s always fun to talk to you. Thanks for joining us today and to our listeners, thanks for tuning in. If you want to read this white paper, and I highly recommend that you do, you can find a link at aka.ms/Abstracts, or you can find a link in our show notes that will take you to the Microsoft Research website. See you next time on Abstracts!

Microsoft Academic @MicrosoftAcademic shared a link
2025-05-01 17:16:45 ·

Laws, norms, and ethics for AI in health

www.microsoft.com
Transcript [MUSIC]    [BOOK PASSAGE]  PETER LEE: “… [END OF BOOK PASSAGE]    [THEME MUSIC]    This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.    Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?     In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here.  [THEME MUSIC FADES] The passage I read at the top there is from Chapter 9, “Safety First.” One needs only to look at examples such as laws mandating seatbelts in cars and, more recently, internet regulation to know that policy and oversight are often playing catch-up with emerging technologies. When we were writing our book, Carey, Zak, and I didn’t claim that putting frameworks in place to allow for innovation and adoption while prioritizing inclusiveness and protecting patients from hallucination and other harms would be easy. In fact, in our writing, we posed more questions than answers in the hopes of highlighting the complexities at hand and supporting constructive discussion and action in this space. In this episode, I’m pleased to welcome three experts who have been thinking deeply about these matters: Laura Adams, Vardit Ravitsky, and Dr. Roxana Daneshjou. Laura is an expert in AI, digital health, and human-centered care. As a senior advisor at the National Academy of Medicine, or NAM, she guides strategy for the academy’s science and technology portfolio and leads the Artificial Intelligence Code of Conduct national initiative. 
Vardit is president and CEO of The Hastings Center for Bioethics, a bioethics and health policy institute. She leads research projects funded by the National Institutes of Health, is a member of the committee developing the National Academy of Medicine’s AI Code of Conduct, and is a senior lecturer at Harvard Medical School. Roxana is a board-certified dermatologist and an assistant professor of both dermatology and biomedical data science at Stanford University. Roxana is among the world’s thought leaders in AI, healthcare, and medicine, thanks in part to groundbreaking work on AI biases and trustworthiness. One of the good fortunes I’ve had in my career is the chance to work with both Laura and Vardit, mainly through our joint work with the National Academy of Medicine. They’re both incredibly thoughtful and decisive leaders working very hard to help the world of healthcare—and healthcare regulators—come to grips with generative AI. And over the past few years, I’ve become an avid reader of all of Roxana’s research papers. Her work is highly technical, super influential but also informative in a way that spans computer science, medicine, bioethics, and law. These three leaders—one from the medical establishment, one from the bioethics field, and the third from clinical research—provide insights into three incredibly important dimensions of the issues surrounding regulations, norms, and ethics of AI in medicine. [TRANSITION MUSIC] Here is my interview with Laura Adams: LEE: Laura, I’m just incredibly honored and excited that you’re joining us here today, so welcome. ADAMS: Thank you, Peter, my pleasure. Excited to be here. LEE: So, Laura, you know, I’ve been working with you at the NAM for a while, and you are a strategic advisor at the NAM. But I think a lot of our listeners might not know too much about the National Academy of Medicine and then, within the National Academy of Medicine, what a strategic advisor does. So why don’t we start there. 
You know, how would you explain to a person’s mother or father what the National Academy of Medicine is? ADAMS: Sure. National Academy was formed more than 50 years ago. It was formed by the federal government, but it is not the federal government. It was formed as an independent body to advise the nation and the federal government on issues of science and primarily technology-related issues, as well. So with that 50 years, some probably know of the National Academy of Medicine when it was the Institute of Medicine and produced such publications as To Err is Human and Crossing the Quality Chasm, both of which were seminal publications that I think had a dramatic impact on quality, safety, and how we saw our healthcare system and what we saw in terms of its potential. LEE: So now, for your role within NAM, what does the senior advisor do? What do you do? ADAMS: What I do there is in the course of leading the AI Code of Conduct project, my role there was in framing the vision for that project, really understanding what did we want it to do, what impact did we want it to make. So for example, some thought that it might be that we wanted everyone to use our code of conduct. And my advice on that was let’s use this as a touchstone. We want people to think about their own codes of conduct for their use of AI. That’s a valuable exercise, to decide what you value, what your aspirations are. I also then do a lot of the field alignment around that work. So I probably did 50 talks last year—conference presentations, webinars, different things—where the code of conduct was presented so that the awareness could be raised around it so people could see the practicality of using that tool. Especially the six commitments that were based on the idea of complex adaptive systems simple rules, where we could recall those in the heat of decision-making around AI, in the heat of application, or even in the planning and strategic thinking around it.
LEE: All right, we’re going to want to really break into a lot of details here. But I would just like to rewind the clock a little bit and talk about your first encounters with AI. And there’s sort of, I guess, two eras. There’s the era of AI and machine learning before ChatGPT, before the generative AI era, and then afterwards. Before the era of generative AI, what was your relationship with the idea of artificial intelligence? Was it a big part of your role and something you thought about, or was it just one of many technologies that you considered? ADAMS: It was one of many. LEE: Yeah. ADAMS: Watching it help us evolve from predictive analytics to predictive AI, which of course I was fascinated by the fact that it could use structured and unstructured data, that it could learn from its own processes. These things were really quite remarkable, but my sense about it was that it was one of many. We were looking at telemedicine. We were looking at [a] variety of other things, particularly wearables and things that were affecting and empowering patients to take better care of themselves and take more … have more agency around their own care. So I saw it as one of many. And then the world changed in 2022, changed dramatically. LEE: [LAUGHS] OK. Right. OK, so November 2022, ChatGPT. Later in the spring of 2023, GPT-4. And so, you know, what were your first encounters, and what were you feeling? What were you experiencing? ADAMS: At the time, I was curious, and I thought, I think I’m seeing four things here that make this way different. And one was, and it proved to be true over time, the speed with which this evolved. And I was watching it evolve very, very quickly and thinking, this is almost, this is kind of mind blowing how fast this is getting better. And then this idea that, you know, we could scale this. 
As we were watching the early work with ambient listening, I was working with a group of physicians that were lamenting the cost and the unavailability of scribes. They wanted to use scribes. And I’m thinking, We don’t have to incur the cost of that. We don’t have to struggle with the unavailability of that type of … someone in the workforce. And then I started watching the ubiquity, and I thought, Oh, my gosh, this is unlike any other technology that we’ve seen. Because with electronic health records, for example, it’s had its place, but it was over here. LEE: Yeah. ADAMS: And then I think the last thing was the democratization. And I realized: Wow, anyone with a smartphone has access to the most powerful large language models in the world. And I thought, This, to me, is a revolution in cheap expertise. Those were the things that really began to stun me, and I just knew that we were in a way different era. LEE: It’s interesting that you first talked about ambient listening. Why was that of particular initial interest to you specifically? ADAMS: It was because one of the things that we were putting together in our code of conduct, which began pre-generative AI, was the idea that we wanted to renew the moral well-being and the sense of shared purpose to the healthcare workforce. That’s one of the six principles. And I knew that the cognitive burden was becoming unbearable. When we came out of COVID, it was such a huge wake-up call to understand exactly what was going on at that point of care and how challenging it had become because information overload is astonishing in and of itself. And that idea that we have so much in the way of documentation that needed to be done and how much of a clinician’s time was taken up doing that rather than doing the thing that they went into the profession to do. 
And that was interact with people, that was to heal, that was to develop human connection that had a healing effect, and they just … so much of the time was taken away from that activity. I also looked at it and because I studied diffusion of innovations theory and understand what causes something to move rapidly across a social system and get adopted, it has to have a clear relative advantage. It has to be compatible with the way that processes work. So I didn’t see that this was going to be a hugely disruptive activity to workflow, which is a challenge of most digital tools, is that they’re designed without that sense of, how does this impact the workflow? And then I just thought that it was going to be a front runner in adoption, and it might then start to create that tsunami, that wave of interest in this, and I don’t think I was wrong. LEE: I have to ask you, because I’ve been asking every guest, there must have been moments early on in the encounter with generative AI where you felt doubt or skepticism. Is that true, or did you immediately think, Wow, this is something very important? ADAMS: No, I did feel doubt and skepticism. My understanding tells me of it, and told me of it in the very beginning, that this is trained on the internet with all of its flaws. When we think about AI, we think about it being very futuristic, but it’s trained on data from the past. I’m well aware of how flawed that data, how biased that data is, mostly men, mostly white men, when we think about it during a certain age grouping of. So I knew that we had inherent massive flaws in the training data and that concerned me. I saw other things about it that also concerned me. I saw that … its difficulty in beginning to use it and govern it effectively. You really do have to put a good governance system in if you’re going to put this into a care delivery system. And I began to worry about widening a digital divide that already was a chasm. 
And that was between those well-resourced, usually urban, hospitals and health systems that are serving the well-resourced, and the inner-city hospital in Chicago or the rural hospital in Nebraska or the Mississippi community health center. LEE: Yes. So I think this skepticism about technology, new technologies, in healthcare is very well-earned. So you’ve zeroed in on this issue of technology where oftentimes we hope it’ll reduce or eliminate biases but actually seems to oftentimes have the opposite effect. And maybe this is a good segue then into this really super-important national effort that you’re leading on the AI code of conduct. Because in a way, I think those failures of the past and even just the idea—the promise—that technology should make a doctor or a nurse’s job easier, not harder, even that oftentimes seems not to have panned out in the way that we hope. And then there’s, of course, the well-known issue of hallucinations or of mistakes being made. You know, how did those things inform this effort around a code of conduct, and why a code of conduct? ADAMS: Those things weighed heavily on me as the initiative leader because I had been deeply involved in the spread of electronic health records, not really knowing and understanding that electronic health records were going to have the effect that they had on the providers that use them. Looking back now, I think that there could have been design changes, but we probably didn’t have as much involvement of providers in the design. And in some cases, we did. We just didn’t understand what it would take to work it into their workflows. So I wanted to be sure that the code of conduct took into consideration and made explicit some of the things that I believe would have helped us had we had those guardrails or those guidelines explicit for us. And those are things like our first one is to protect and advance human health and connection. 
We also wanted to see things about openly sharing and monitoring because we know that for this particular technology, it’s emergent. We’re going to have to do a much better job at understanding whether what we’re doing works and works in the real world. So the reason for a code of conduct was we knew that … the good news, when the “here comes AI and it’s barreling toward us,” the good news was that everybody was putting together guidelines, frameworks, and principle sets. The bad news was same. That everybody was putting together their own guideline, principle, and framework set. And I thought back to how much I struggled when I worked in the world of health information exchange and built a statewide health information exchange and then turned to try to exchange that data across the nation and realized that we had a patchwork of privacy laws and regulations across the state; it was extremely costly to try to move data. And I thought we actually need, in addition to data interoperability, we need governance interoperability, where we can begin to agree on a core set of principles that will more easily allow us to move ahead and achieve some of the potential and the vision that we have for AI if we are not working with a patchwork of different guidelines, principles, and frameworks. So that was the impetus behind it. Of course, we again want it to be used as a touchstone, not everybody wholesale adopt what we’ve said. LEE: Right. ADAMS: We want people to think about this and think deeply about it. LEE: Yeah, Laura, I always am impressed with just how humble you are. You were indeed, you know, one of the prime instigators of the digitization of health records leading to electronic health record systems. And I don’t think you need to feel bad about that. That was a tremendous advance. I mean, moving a fifth of the US economy to be digital, I think, is significant. 
Also, our listeners might want to know that you led something called the Rhode Island Quality Institute, which was really, I think, maybe the, arguably, the most important early kind of examples that set a pattern for how and why health data might actually lead very directly to improvements in human health at a statewide level or at a population level. And so I think your struggles and frustrations on, you know, how to expand that nationwide, I think, are really, really informative. So let’s get into what these principles are, you know, what’s in the code of conduct. ADAMS: Yeah, the six simple rules were derived out of a larger set of principles that we pulled together. And the origin of all of this was we did a fairly extensive landscape review. We looked at least at 60 different sets of these principles, guidelines, and frameworks. We looked for areas of real convergence. We looked for areas where there was inconsistencies. And we looked for out-and-out gaps. The out-and-out gaps that we saw at the time were things like a dearth of references to advancing human health as the priority. Also monitoring post-implementation. So at the time, we were watching these evolve and we thought these are very significant gaps. Also, the impact on the environment was a significant gap, as well. And so when we pull that together, we developed a set of principles and cross-walked those with learning health system principles. And then once we got that, we again wanted to distill that down into a set of commitments which we knew that people could find accessible. And we published that draft set of principles last year. And we have a new publication that will be coming out in the coming months that will be the revised set of principles and code commitments that we got because we took this out publicly. So we opened it up for public comment once we did the draft last year.
Again, many of those times that I spoke about this, almost all of those times came with an invitation for feedback, and conversations that we had with people shaped it. And it is in no way, shape, or form a final code of conduct, this set of principles and commitments, because we see this as dynamic. But what we also knew about this was that we wanted to build this with a super solid foundation, a set of immutables, the things that don’t change at some vicissitudes or the whims of this or the whims of that. We wanted those things that were absolutely foundational. LEE: Yeah, so we’ll provide a link to the documents that describe the current state of this, but can we give an example of one or two of these principles and one or two of the commitments? ADAMS: Sure. I’ve mentioned the “protect and advance human health and connection” as the primary aim. We also want to ensure the equitable distribution of risks and benefits, and that equitable distribution of risks and benefits is something that I was referring to earlier about when I see well-resourced organizations. And one that’s particularly important to me is engaging people as partners with agency at every stage of the AI lifecycle. That one matters because this one talks about and speaks to the idea that we want to begin bringing in those that are affected by AI, those on whom AI is used, into the early development and conceptualization of what we want this new tool, this new application, to do. So that includes the providers that use it, the patients. And we find that when we include them—the ethicists that come along with that—we develop much better applications, much more targeted applications that do what we intend them to do in a more precise way. The other thing about that engaging with agency, by agency we mean that person, that participant can affect the decisions and they can affect the outcome. 
So it isn’t that they’re sort of a token person coming into the table and we’ll allow you to tell your story or so, but this is an active participant. We practiced what we preached when we developed the code of conduct, and we brought patient advocates in to work with us on the development of this, work with us on our applications, that first layer down of what the applications would look like, which is coming out in this new paper. We really wanted that component of this because I’m also seeing that patients are definitely not passive users of this, and they’re having an agency moment, let’s say, with generative AI because they’ve discovered a new capacity to gain information, to—in many ways—claim some autonomy in all of this. And I think that there is a disruption underway right now, a big one that has been in the works for many years, but it feels to me like AI may be the tipping point for that disruption of the delivery system as we know it. LEE: Right. I think it just exudes sort of inclusivity and thoughtfulness in the whole process. During this process, were there surprises, things that you didn’t expect? Things about AI technology itself that surprised you? ADAMS: The surprises that came out of this process for me, one of them was I surprised myself. We were working on the commentary paper, and Steven Lin from Stanford had strong input into that paper. And when we looked at what we thought were missing, he said, “Let’s make sure we have the environmental impact.” And I said, “Oh, really, Steven, we really want to think about things that are more directly aligned with health,” which I couldn’t believe came out of my own mouth. [LAUGHTER] And Steven, without saying, “Do you hear yourself?” I mean, I think he could have said that. But he was more diplomatic than that. 
And he persisted a bit longer and said, “I think it’s actually the greatest threat to human health.” And I said, “Of course, you’re right.” [LAUGHS] But that was surprising and embarrassing for me. But it was eye-opening in that even when I thought that I had understood the gaps and the using this as a touchstone. So the learning that took place and how rapidly that learning was happening among people involved in this. The other thing that was surprising for me was the degree at which patients became vastly facile with using it to the extent that it helped them begin to, again, build their own capacity. The #PatientsUseAI from Dave deBronkart—watch that one. This is more revolutionary than we think. And so I watched that, the swell of that happening, and it sort of shocked me because I was envisioning this as, again, a tool for use in the clinical setting. LEE: Yeah. OK, so we’re running now towards the end of our time together. And I always like to end our conversations with a more provocative topic. And I thought for you, I’d like to use the very difficult word regulation. And when I think about the book that Carey, Zak, and I wrote, we have a chapter on regulation, but honestly, we didn’t have ideas. We couldn’t understand how this would be regulated. And so we just defaulted to publishing a conversation about regulation with GPT-4. And in a way, I think … I don’t know that I or my coauthors were satisfied with that. In your mind, where do we stand two years later now when we think about the need or not to regulate AI, particularly in its application to healthcare, and where has the thinking evolved to? ADAMS: There are two big differences that I see in that time that has elapsed. And the first one is we have understood the insufficiency of simply making sure that AI-enabled devices are safe prior to going out into implementation settings. 
We recognize now that there’s got to be this whole other aspect of regulation and assurance that these things are functioning as intended and we have the capacity to do that in the point of care type of setting. So that’s been one of the major ones. The other thing is how wickedly challenging it is to regulate generative AI. I think one of the most provocative and exciting articles that I saw written recently was by Bakul Patel and David Blumenthal, who posited, should we be regulating generative AI as we do a licensed and qualified provider? Should it be treated in the sense that it’s got to have a certain amount of training and a foundation that’s got to pass certain tests? It has to demonstrate that it’s improving and keeping up with current literature. Does it … be responsible for mistakes that it makes in some way, shape, or form? Does it have to report its performance? And I’m thinking, what a provocative idea … LEE: Right. ADAMS: … but it’s worth considering. I chair the Global Opportunities Group for a regulatory and innovation AI sandbox in the UK. And we’re hard at work thinking about, how do you regulate something as unfamiliar and untamed, really, as generative AI? So I’d like to see us think more about this idea of sandboxes, more this idea of should we be just completely rethinking the way that we regulate. To me, that’s where the new ideas will come because the danger, of course, in regulating in the old way … first of all, we haven’t kept up over time, even with predictive AI; even with pre-generative AI, we haven’t kept up. And what worries me about continuing on in that same vein is that we will stifle innovation … LEE: Yes. ADAMS: … and we won’t protect from potential harms. Nobody wants an AI Chernobyl, nobody. LEE: Right. ADAMS: But I worry that if we use those old tools on the new applications that we will not only not regulate, then we’ll stifle innovation.
And when I see all of the promise coming out of this for things that we thought were unimaginable, then that would be a tragedy. LEE: You know, I think the other reflection I’ve had on this is the consumer aspect of it, because I think a lot of our current regulatory frameworks are geared towards experts using the technology. ADAMS: Yes. LEE: So when you have a medical device, you know you have a trained, board-certified doctor or licensed nurse using the technology. But when you’re putting things in the hands of a consumer, I think somehow the surface area of risk seems wider to me. And so I think that’s another thing that somehow our current regulatory concepts aren’t really ready for. ADAMS: I would agree with that. I think a few things to consider, vis-a-vis that, is that this revolution of patients using it is unstoppable. So it will happen. But we’re considering a project here at the National Academy about patients using AI and thinking about: let’s explore all the different facets of that. Let’s understand, what does safe usage look like? What might we do to help this new development enhance the patient-provider relationship and not erode it as we saw, “Don’t confuse your Google search with my medical degree” type of approach. Thinking about: how does it change the identity of the provider? How does it … what can we do to safely build a container in which patients can use this without giving them the sense that it’s being taken away, or that … because I just don’t see that happening. I don’t think they’re going to let it happen. That, to me, feels extremely important for us to explore all the dimensions of that. And that is one project that I hope to be following on to the AI Code of Conduct and applying the code of conduct principles with that project. LEE: Well, Laura, thank you again for joining us. 
And thank you even more for your tremendous national, even international, leadership on really helping mobilize the greatest institutions in a diverse way to fully confront the realities of AI in healthcare. I think it’s tremendously important work. ADAMS: Peter, thank you for having me. This has been an absolute pleasure. [TRANSITION MUSIC]  I’ve had the opportunity to watch Laura in action as she leads a national effort to define an AI code of conduct. And our conversation today has only heightened my admiration for her as a national leader. What impresses me is Laura’s recognition that technology adoption in healthcare has had a checkered history and furthermore oftentimes not accommodated the huge diversity of stakeholders that are affected equally. The concept of an AI code of conduct seems straightforward in some ways, but you can tell that every word in the emerging code has been chosen carefully. And Laura’s tireless engagement traveling to virtually every corner of the United States, as well as to several other countries, shows real dedication. And now here’s my conversation with Vardit Ravitsky: LEE: Vardit, thank you so much for joining. RAVITSKY: It’s a real pleasure. I’m honored that you invited me. LEE: You know, we’ve been lucky. We’ve had a few chances to interact and work together within the National Academy of Medicine and so on. But I think for many of the normal subscribers to the Microsoft Research Podcast, they might not know what The Hastings Center for Bioethics is and then what you as the leader of The Hastings Center do every day. So I’d like to start there, first off, with what is The Hastings Center? RAVITSKY: Mostly, we’re a research center. We’ve been around for more than 55 years. And we’re considered one of the organizations that actually founded the field known today as bioethics, which is the field that explores the policy implications, the ethical, social issues in biomedicine. 
So we look at how biotechnology is rolled out; we look at issues of equity, of access to care. We look at issues at the end of life, the beginning of life, how our natural environment impacts our health. Any aspect of the delivery of healthcare, the design of the healthcare system, and biomedical research leading to all this. Any aspect that has an ethical implication is something that we’re happy to explore. We try to have broad conversations with many, many stakeholders, people from different disciplines, in order to come up with guidelines and recommendations that would actually help patients, families, communities. We also have an editorial department. We publish academic journals. We publish a blog. And we do a lot of public engagement activities—webinars, in-person events. So, you know, we just try to promote the thinking of the public and of experts on the ethical aspects of health and healthcare. LEE: One thing I’ve been impressed with, with your work and the work of The Hastings Center is it really confronts big questions but also gets into a lot of practical detail. And so we’ll get there. But before that just a little bit about you then. The way I like to ask this question is: how do you explain to your parents what you do every day? [LAUGHS] RAVITSKY: Funny that you brought my parents into this, Peter, because I come from a family of philosophers. Everybody in my family is in humanities, in academia. When I was 18, I thought that that was the only profession [LAUGHTER] and that I absolutely had to become a philosopher, or else what else can you do with your life? I think being a bioethicist is really about, on one hand, keeping an eye constantly on the science as it evolves. When a new scientific development occurs, you have to understand what’s happening so that you can translate that outside of science. 
So if we can now make a gamete from a skin cell so that babies will be created differently, you have to understand how that’s done, what that means, and how to talk about it. The second eye you keep on the ethics literature. What ethical frameworks, theories, principles have we developed over the last decades that are now relevant to this technology. So you’re really a bridge between science, biomedicine on one hand and humanities on the other hand. LEE: OK. So let’s shift to AI. And here I’d like to start with a kind of an origin story because I’m assuming before generative AI and ChatGPT became widely known and available, you must have had some contact with ideas in data science, in machine learning, and, you know, in the concept of AI before ChatGPT. Is that true? And, you know, what were some of those early encounters like for you? RAVITSKY: The earlier issues that I heard people talk about in the field were really around diagnostics and reading images and, Ooh, it looks like machines could perform better than radiologists. And, Oh, what if women preferred that their mammographies be read by these algorithms? And, Does that threaten us clinicians? Because it sort of highlights our limitations and weaknesses as, you know, the weakness of the human eye and the human brain. So there were early concerns about, will this outperform the human and potentially take away our jobs? Will it impact our authority with patients? What about de-skilling clinicians or radiologists or any type of diagnostician losing the ability … some abilities that they’ve had historically because machines take over? So those were the early-day reflections and interestingly some of them remain even now with generative AI. All those issues of the standing of a clinician, and what sets us apart, and will a machine ever be able to perform completely autonomously, and what about empathy, and what about relationships? Much of that translated later on into the, you know, more advanced technology. 
LEE: I find it interesting that you use words like our and we to implicitly refer to humans, homo sapiens, to human beings. And so do you see a fundamental distinction, a hard distinction that separates humans from machines? Or, you know, how … if there are replacements of some human capabilities or some things that human beings do by machines, you know, how do you think about that? RAVITSKY: Ooh, you’re really pushing hard on the philosopher in me here. [LAUGHTER] I’ve read books and heard lectures by those who think that the line is blurred, and I don’t buy that. I think there’s a clear line between human and machine. I think the issue of AGI—of artificial general intelligence—and will that amount to consciousness … again, it’s such a profound, deep philosophical challenge that I think it would take a lot of conceptual work to get there. So how do we define consciousness? How do we define morality? The way it stands now, I look into the future without being a technologist, without being an AI developer, and I think, maybe I hope, that the line will remain clear. That there’s something about humanity that is irreplaceable. But I’m also remembering that Immanuel Kant, the famous German philosopher, when he talked about what it means to be a part of the moral universe, what it means to be a moral agent, he talked about rationality and the ability to implement what he called the categorical imperative. And he said that would apply to any creature, not just humans. And that’s so interesting. It’s always fascinated me that so many centuries ago, he said such a progressive thing. LEE: That’s amazing, yeah. RAVITSKY: It is amazing because I often, as an ethicist, I don’t just ask myself, What makes us human? I ask myself, What makes us worthy of moral respect? What makes us holders of rights? What gives us special status in the universe that other creatures don’t have? 
And I know this has been challenged by people like Peter Singer who say [that] some animals should have the same respect. “And what about fetuses and what about people in a coma?” I know the landscape is very fraught. But the notion of what makes humans deserving of special moral treatment to me is the core question of ethics. And if we think that it’s some capacities that give us this respect, that make us hold that status, then maybe it goes beyond human. So it doesn’t mean that the machine is human, but maybe at [a] certain point, these machines will deserve a certain type of moral respect that … it’s hard for us right now to think of a machine as deserving that respect. That I can see. But completely collapsing the distinction between human and machine? I don’t think so, and I hope not. LEE: Yeah. Well, you know, in a way I think it’s easier to entertain this type of conversation post-ChatGPT. And so now, you know, what was your first personal encounter with what we now call generative AI, and what went through your mind as you had that first encounter? RAVITSKY: No one’s ever asked me this before, Peter. It almost feels exposing to share your first encounter. [LAUGHTER] So I just logged on, and I asked a simple question, but it was an ethical question. I framed an ethical dilemma because I thought, if I ask it to plan a trip, like all my friends already did, it’s less interesting to me. And within seconds, a pretty thoughtful, surprisingly nuanced analysis was kind of trickling down my screen, and I was shocked. I was really taken aback. I was almost sad because I think my whole life I was hoping that only humans can generate this kind of thinking using moral and ethical terms. And then I started tweaking my question, and I asked for specific philosophical approaches to this. And it just kept surprising me in how well it performed. 
So I literally had to catch my breath and, you know, sit down and go, OK, this is a new world, something very important and potentially scary is happening here. How is this going to impact my teaching? How is this going to impact my writing? How is this going to impact health? Like, it was really a moment of shock. LEE: I think the first time I had the privilege of meeting you, I heard you speak and share some of your initial framing of how, you know, how to think about the potential ethical implications of AI and the human impacts of AI in the future. Keeping in mind that people listening to this podcast will tend to be technologists and computer scientists as well as some medical educators and practicing clinicians, you know, what would you like them to know or understand most about your thoughts? RAVITSKY: I think from early on, Peter, I’ve been an advocate in favor of bioethics as a field positioning itself to be a facilitator of implementing AI. I think on one hand, if we remain the naysayers as we have been regarding other technologies, we will become irrelevant. Because it’s happening, it’s happening fast, we have to keep our eye on the ball, and not ask, “Should we do it?” But rather ask, “How should we do it?” And one of the reasons that bioethics is going to be such a critical player is that the stakes are so high. The risk of making a mistake in diagnostics is literally life and death; the risk of breaches of privacy that would lead to patients losing trust and refusing to use these tools; the risk of clinicians feeling overwhelmed and replaceable. The risks are just too high. And therefore, creating guardrails, creating frameworks with principles that sensitize us to the ethical aspects, that is critically important for AI and health to succeed. And I’m saying it as someone who wants it very badly to succeed. LEE: You are actually seeing a lot of healthcare organizations adopting and deploying AI. Has any aspect of that been surprising to you? 
Have you expected it to be happening faster or slower or unfolding in a different way? RAVITSKY: One thing that surprises me is how it seems to be isolated. Different systems, different institutions making their own, you know, decisions about what to acquire and how to implement. I’m not seeing consistency. And I’m not even seeing anybody at a higher level collecting all the information about who’s buying and implementing what under what types of principles and what are their outcomes? What are they seeing? It seems to be just siloed and happening everywhere. And I wish we collected all this data, even about how the decision is made at the executive level to buy a certain tool, to implement it, where, why, by whom. So that’s one thing that surprised me. The speed is not surprising me because it really solves problems that healthcare systems have been struggling with. What seems to be one of the more popular uses, and again, you know this better than I do, is the help with scribes with taking notes, ambient recording. This seems to be really desired because of burnout that clinicians face around this whole issue of note taking. And it’s also seen as a way to allow clinicians to do more human interaction, you know, … LEE: Right. RAVITSKY: … look at the patient, talk to the patient, … LEE: Yep. RAVITSKY: … listen, rather than focus on the screen. We’ve all sat across the desk with a doctor that never looks at us because they only look at the screen. So there’s a real problem here, and there’s a real solution and therefore it’s hitting the ground quickly. But what’s surprising to me is how many places don’t think that it’s their responsibility to inform patients that this is happening. So some places do; some places don’t. And to me, this is a fundamental ethical issue of patient autonomy and empowerment. And it’s also pragmatically the fear of a crisis of trust. LEE: Mm-hmm. Yeah, yeah. 
RAVITSKY: People worry about such a recording of a very private conversation that they consider to be confidential, such a recording ending up in the wrong hands or being shared externally or going to a commercial entity. People care; patients care. So what is our ethical responsibility to tell them? And what is the institutional responsibility to implement these wonderful tools? I’m not against them, I’m totally in favor—implement these great tools in a way that respects long-standing ethical principles of informed consent, transparency, accountability for, you know, change in practice? And, you know, bottom line: patients’ right to know what’s happening in their care. LEE: You actually recently had a paper in a medical journal that touched on an aspect of this, which I think was not with scribes, but with notes, you know, … RAVITSKY: Yep. LEE: … that doctors would send to patients. And in fact, in previous episodes of this podcast, we actually talked to both the technology developers of that type of feature as well as doctors who were using that feature. And in fact, even in those previous conversations, there was the question, “Well, what does the patient need to know about how this note was put together?” So you and your coauthors had a very interesting recent paper about this. RAVITSKY: Yeah, so the trigger for the paper was that patients seemed to really like being able to send electronic messages to clinicians. LEE: Yes. RAVITSKY: We email and text all day long. Why not in health, right? People are used to communicating in that way. It’s efficient; it’s fast. So we asked ourselves, “Wait, what if an AI tool writes the response?” Because again, this is a huge burden on clinicians, and it’s a real issue of burnout. 
We surveyed hundreds of respondents, and basically what we discovered is that there was a statistically significant difference in their level of satisfaction when they got an answer from a human clinician, when they got an answer, again, electronic message from AI. And it turns out that they preferred the messages written by AI. They were longer, more detailed, even conveyed more empathy. You know, AI has all the time in the world [LAUGHS] to write you a text. It’s not rushing to the next one. But then when we disclosed who wrote the message, they were less satisfied when they were told it was AI. So the ethical question that that raises is the following: if your only metric is patient satisfaction, the solution is to respond using AI but not tell them that. Now when we compared telling them that it was AI or human or not telling them anything, their satisfaction remained high, which means that if they were not told anything, they probably assumed that it was a human clinician writing, because their satisfaction for human clinician or no disclosure was the same. So basically, if we say nothing and just send back an AI-generated response, they will be more satisfied because the response is nicer, but they won’t be put off by the fact that it was written by AI. And therefore, hey presto, optimal satisfaction. But we challenge that, and we say, it’s not just about satisfaction. It’s about long-term trust. It’s about your right to know. It’s about empowering you to make decisions about how you want to communicate. So we push back against this notion that we’re just there to optimize patient satisfaction, and we bring in broader ethical considerations that say, “No, patients need to know.” If it’s not the norm yet to get your message from AI, … LEE: Yeah. RAVITSKY: … they should know that this is happening. And I think, Peter, that maybe we’re in a transition period. 
It could be that in two years, maybe less than that, most of our communication will come back from AI, and we will just take it for granted … LEE: Right. RAVITSKY: … that that’s the case. And at that point, maybe disclosure is not necessary because many, many surveys will show us that patients assume that, and therefore they are informed. But at this point in time, when it’s transition and it’s not the norm yet, I firmly think that ethics requires that we inform patients. LEE: Let me push on this a little bit because I think this final point that you just made is, I think is so interesting. Does it matter what kind of information is coming from a human or AI? Is there a time when patients will have different expectations for different types of information from their doctors? RAVITSKY: I think, Peter, that you’re asking the right question because it’s more nuanced. And these are the kinds of empirical questions that we will be exploring in the coming months and years. Our recent paper showed that there was no difference regarding the content. If the message was about what we call the “serious” matter or a less “serious” matter, the preferences were the same. But we didn’t go deep enough into that. That would require a different type of design of study. And you just said, you know, there are different types of information. We need to categorize them. LEE: Yeah. RAVITSKY: What types of information and what degree of impact on your life? Is it a life-and-death piece of information? Is it a quality-of-life piece of information? How central is it to your care and to your thinking? So all of that would have to be mapped out so that we can design these studies. But you know, you pushed back in that way, and I want to push back in a different direction that to me is more fundamental and philosophical. How much do we know now? You know, I keep saying, oh, patients deserve a chance for informed consent, … LEE: Right. 
RAVITSKY: … and they need to be empowered to make decisions. And if they don’t want that tool used in their care, then they should be able to say, “No.” Really? Is that the world we live in now? [LAUGHTER] Do I have access to the black box that is my doctor’s brain? Do I know how they performed on this procedure in the last year? Do I know whether they’re tired? Do I know if they’re up to speed on the literature with this? We already deal with black boxes, except they’re not AI. And I think the evidence emerges that AI outperforms the humans in so many of these tasks. So my pushback is, are we seeing AI exceptionalism in the sense that if it’s AI, Huh, panic! We have to inform everybody about everything, and we have to give them choices, and they have to be able to reject that tool and the other tool versus, you know, the rate of human error in medicine is awful. People don’t know the numbers. The annual deaths attributed to medical error is outrageous. So why are we so focused on informed consent and empowerment regarding implementation of AI and less in other contexts. Is it just because it’s new? Is it because it is some sort of existential threat, … LEE: Yep, yeah. RAVITSKY: … not just a matter of risk? I don’t know the answer, but I don’t want us to suffer from AI exceptionalism, and I don’t want us to hold AI to such a high standard that we won’t be able to benefit from it. Whereas, again, we’re dealing with black boxes already in medicine. LEE: Just to stay on this topic, though, one more question, which is, maybe, almost silly in how hypothetical it is. If instead of email, it were a Teams call or a Zoom call, doctor-patient, except that the doctor is not the real doctor, but it’s a perfect replica of the doctor designed to basically fool the patient that this is the real human being and having that interaction. Does that change the bioethical considerations at all? RAVITSKY: I think it does because it’s always a question of, are we really misleading? 
Now if you get a text message in an environment that, you know, people know AI is already integrated to some extent, maybe not your grandmother, but the younger generation is aware of this implementation, then maybe you can say, “Hmm. It was implied. I didn’t mean to mislead the patient.” If the patient thinks they’re talking to a clinician, and they’re seeing, like—what if it’s not you now, Peter? What if I’m talking to an avatar [LAUGHS] or some representation of you? Would I feel that I was misled in recording this podcast? Yeah, I would. Because you really gave me good reasons to assume that it was you. So it’s something deeper about trust, I think. And it touches on the notion of empathy. A lot of literature is being developed now on the issue of: what will remain the purview of the human clinician? What are humans good for [LAUGHS] when AI is so successful and especially in medicine? And if we see that the text messages are read as conveying more empathy and more care and more attention, and if we then move to a visual representation, facial expressions that convey more empathy, we really need to take a hard look at what we mean by care. What about then the robots, right, that we can touch, that can hug us? I think we’re really pushing the frontier of what we mean by human interaction, human connectedness, care, and empathy. This will be a lot of material for philosophers to ask themselves the fundamental question you asked me at first: what does it mean to be human? But this time, what does it mean to be two humans together and to have a connection? And if we can really be replaced in the sense that patients will feel more satisfied, more heard, more cared for, do we have ethical grounds for resisting that? And if so, why? You’re really going deep here into the conceptual questions, but bioethics is already looking at that. LEE: Vardit, it’s just always amazing to talk to you. 
The incredible span of what you think about from those fundamental philosophical questions all the way to the actual nitty gritty, like, you know, what parts of an email from a doctor to a patient should be marked as AI. I think that span is just incredible and incredibly needed and useful today. So thank you for all that you do. RAVITSKY: Thank you so much for inviting me. [TRANSITION MUSIC] The field of bioethics, and this is my take, is all about the adoption of disruptive new technologies into biomedical research and healthcare. And Vardit is able to explain this with such clarity. I think one of the reasons that AI has been challenging for many people is that its use spans the gamut from the nuts and bolts of how and when to disclose to patients that AI is being used to craft an email, all the way to, what does it mean to be a human being caring for another human? What I learned from the conversation with Vardit is that bioethicists are confronting head-on the issue of AI in medicine and not with an eye towards restricting it, but recognizing that the technology is real, it’s arrived, and needs to be harnessed now for maximum benefit. And so now, here’s my conversation with Dr. Roxana Daneshjou: LEE: Roxana, I’m just thrilled to have this chance to chat with you. ROXANA DANESHJOU: Thank you so much for having me on today. I’m looking forward to our conversation. LEE: In Microsoft Research, of course, you know, we think about healthcare and biomedicine a lot, but I think there’s less of an understanding from our audience what people actually do in their day-to-day work. And of course, you have such a broad background, both on the science side with a PhD and on the medical side. So what’s your typical work week like? 
DANESHJOU: I spend basically 90% of my time working on running my research lab and doing research on how AI interacts with medicine, how we can implement it to fix the pain points in medicine, and how we can do that in a fair and responsible way. And 10% of my time, I am in clinic. So I am a practicing dermatologist at Stanford, and I see patients half a day a week. LEE: And your background, it’s very interesting. There’s always been these MD-PhDs in the world, but somehow, especially right now with what’s happening in AI, people like you have become suddenly extremely important because it suddenly has become so important to be able to combine these two things. Did you have any inkling about that when you started, let’s say, on your PhD work? DANESHJOU: So I would say that during my—[LAUGHS] because I was in training for so long—during … my PhD was in computational genomics, and I still have a significant interest in precision medicine, and I think AI is going to be central to that. But I think the reason I became interested in AI initially is that I was thinking about how we associate genetic factors with patient phenotypes. Patient phenotypes being, How does the disease present? What does the disease look like? And I thought, you know, AI might be a good way to standardize phenotypes from images of, say, skin disease, because I was interested in dermatology at that time. And, you know, the part about phenotyping disease was a huge bottleneck because you would have humans sort of doing the phenotyping. And so in my head, when I was getting into the space, I was thinking I’ll bring together, you know, computer vision and genetics to try to, you know, make new discoveries about how genetics impacts human disease. 
And then when I actually started my postdoc to learn computer vision, I went down this very huge rabbit hole, which I am still, I guess, falling down, where I realized, you know, about biases in computer vision and how much work needed to be done for generalizability. And then after that, large language models came out, and, like, everybody else became incredibly interested in how this could help in healthcare and now also vision language models and multimodal models. So, you know, we’re just tumbling down the rabbit hole. LEE: Indeed, I think you really made a name for yourself by looking at the issues of biases, for example, in training datasets. And that was well before large language models were a thing. Maybe our audience would like to hear a little bit more about that earlier work. DANESHJOU: So as I mentioned, my PhD was in computational genetics. In genetics, what has happened during the genetic revolution is these large-scale studies to discover how genetic variation impacts human disease and human response to medication, so that’s what pharmacogenomics is, is human response to medications. And as I got, you know, entrenched in that world, I came to realize that I wasn’t really represented in the data. And it was because the majority of these genetic studies were on European ancestry individuals. You weren’t represented either. LEE: Right, yeah. DANESHJOU: Many diverse global populations were completely excluded from these studies, and genetic variation is quite diverse across the globe. And so you’re leaving out a large portion of genetic variation from these research studies. Now things have improved. It still needs work in genetics. But definitely there has been many amazing researchers, you know, sounding the alarm in that space. And so during my PhD, I actually focused on doing studies of pharmacogenomics in non-European ancestry populations. 
So when I came to computer vision and in particular dermatology, where there were a lot of papers being published about how AI models perform at diagnosing skin disease and several papers essentially saying, oh, it’s equivalent to a dermatologist—of course, that’s not completely true because it’s a very sort of contrived, you know, setting of diagnosis— … LEE: Right, right. DANESHJOU: … landmark papers, which was in Science Advances, showed … we created a diverse dataset, our own diverse benchmark of skin disease, and showed that the models performed significantly worse on brown and black skin. And I think the key here is we also showed that it was an addressable problem because when we fine-tuned on diverse skin tones, you could make that bias go away. So it was really, in this case, about what data was going into the training of these computer vision models. LEE: Yeah, and I think if you’re listening to this conversation, if you haven’t read that paper, I think it’s really must reading. It was not only, Roxana, it wasn’t only just a landmark scientifically and medically, but it also sort of crossed the chasm and really became a subject of public discourse and debate, as well. And I think you really changed the public discourse around AI. So now I want to get into generative AI. I always like to ask, what was your first encounter with generative AI personally? And what went through your head? You know, what was that experience like for you? DANESHJOU: Yeah, I mean, I actually tell this story a lot because I think it’s a fun story. So I would say that I had played with, you know, GPT-3 prior and wasn’t particularly, you know, impressed … LEE: Yeah. DANESHJOU: … by how it was doing. And I was at NeurIPS [Conference on Neural Information Processing Systems] in New Orleans, and I was … we were walking back from a dinner. I was with Andrew Beam from Harvard. I was with his group. 
And we were just, sort of, you know, walking along, enjoying the sights of New Orleans, chatting. And one of his students said, “Hey, OpenAI just released this thing called ChatGPT.” LEE: So this would be New Orleans in December … DANESHJOU: 2022. LEE: 2022, right? Yes. Uh-huh. OK. DANESHJOU: So I went back to my hotel room. I was very tired. But I, you know, went to the website to see, OK, like, what is this thing? And I started to ask it medical questions, and I started all of a sudden thinking, “Uh-oh, we have made … we have made a leap here; something has changed.” LEE: So it must have been very intense for you from then because months later, you had another incredibly impactful, or landmark, paper basically looking at biases, race-based medicine in large language models. So can you say more about that? DANESHJOU: Yes. I work with a very diverse team, and we have thought about bias in medicine, not just with technology but also with humans. Humans have biases, too. And there’s this whole debate around, is the technology going to be more biased than the humans? How do we do that? But at the same time, like, the technology actually encodes the biases of humans. And there was a paper in the Proceedings of the National Academy of Sciences, which did not look at technology at all but actually looked at the race-based biases of medical trainees that were untrue and harmful in that they perpetuated racist beliefs. And so we thought, if medical trainees and humans have these biases, why don’t we see if the models carry them forward? And we added a few more questions that we, sort of, brainstormed as a team, and we started asking the models those questions. And … LEE: And by this time, it was GPT-4? DANESHJOU: We did include GPT-4 because GPT-4 came out, as well. And we also included other models, as well, such as Claude, because we wanted to look across the board. 
And what we found is that all of the models had instances of perpetuating race-based medicine. And actually, the GPT models had one of the most, I think, one of the most egregious responses—and, again, this is 3.5 and 4; we haven’t, you know, fully checked to see what things look like, because there have been newer models—in that they said that we should use race in calculating kidney function because there were differences in muscle mass between the races. And this is sort of a racist trope in medicine that is not true because race is not based on biology; it’s a social construct. So, yeah, that was that study. And that one did spur a lot of public conversation. LEE: Your work there even had the issue of bias overtake hallucination, you know, as really the most central and most difficult issue. So how do you think about bias in LLMs, and does that in your mind disqualify the use of large language models from particular uses in medicine? DANESHJOU: Yeah, I think that the hallucinations are an issue, too. And in some senses, they might even go with one another, right. Like, if it’s hallucinating information that’s not true but also, like, biased. So I think these are issues that we have to address with the use of LLMs in healthcare. But at the same time, things are moving very fast in this space. I mean, we have a secure instance of several large language models within our healthcare system at Stanford so that you could actually put secure patient information in there. So while I acknowledge that bias and hallucinations are a huge issue, I also acknowledge that the healthcare system is quite broken and needs to be improved, needs to be streamlined. Physicians are burned out; patients are not getting access to care in the appropriate ways. And I have a really great story about that, which I can share with you later. So in 2024, we did a study asking dermatologists, are they using large language models in their clinical practice? 
And I think this percentage has probably gone up since then: 65% of dermatologists reported using large language models in their practices on tasks such as, you know, writing insurance authorization letters, you know, writing information sheets for the patient, even, sort of, using them to educate themselves, which makes me a little nervous because in my mind, the best uses of large language models right now are cases where you can verify facts easily. So, for example, I did show and teach my nurse how to use our secure large language model in our healthcare system to write rebuttal letters to the insurance. I told her, “Hey, you put in these bullet points that you want to make, and you ask it to write the letter, and you can verify that the letter contains the facts which you want and which are true.” LEE: Yes. DANESHJOU: And we have also done a lot of work to try to stress test models because we want them to be better. And so we held this red-teaming event at Stanford where we brought together 80 physicians, computer scientists, engineers and had them write scenarios and real questions that they might ask on a day to day or tasks that they might actually ask AI to do. And then we had them grade the performance. And we did this with the GPT models. At the time, we were doing it with GPT-3.5, 4, and 4 with internet. But before the paper came out, we then ran the dataset on newer models. And we made the dataset public because I’m a big believer in public data. So we made the dataset public so that others could use this dataset, and we labeled what the issues were in the responses, whether it was bias, hallucination, like, a privacy issue, those sort of things. LEE: If I think about the hits or misses in our book, you know, we actually wrote a little bit, not too much, about noticing biases. I think we underestimated the magnitude of the issue in our book. 
And another element that we wrote about in our book is that we noticed that the language model, if presented with some biased decision-making, more often than not was able to spot that the decision-making was possibly being influenced by some bias. What do you make of that? DANESHJOU: So funny enough, I think we had … we had a—before I moved from Twitter to Bluesky—but we had a little back and forth on Twitter about this, which actually inspired us to look into this as research, and we have a preprint up on it of actually using other large language models to identify bias and then to write critiques that the original model can incorporate and improve its answer upon. I mean, we’re moving towards this sort of agentic systems framework rather than a singular large language model, and people, of course, are talking about also retrieval-augmented generation, where you sort of have this corpus of, you know, text that you trust and find trustworthy and have that incorporated into the response of the model. And so you could build systems essentially where you do have other models saying, “Hey, specifically look for bias.” And then it will sort of focus on that task. And you can even, you know, give it examples of what bias is with in-context learning now. So I do think that we are going to be improving in this space. And actually, my team is … most recently, we’re working on building patient-facing chatbots. That’s where my, like, patient story comes in. But we’re building patient-facing chatbots. And we’re using, you know, we’re using prompt-engineering tools. We’re using automated eval tools. We’re building all of these things to try to make it more accurate and less biased. So it’s not just like one LLM spitting out an answer. It’s a whole system. LEE: All right. So let’s get to your patient-facing story. DANESHJOU: Oh, of course. Over the summer, my 6-year-old fell off the monkey bars and broke her arm. And I picked her up from school. She’s crying so badly. 
And I just look at her, and I know that we’re in trouble. And I said, OK, you know, we’re going straight to the emergency room. And we went straight to the emergency room. She’s crying the whole time. I’m almost crying because it’s just, like, she doesn’t even want to go into the hospital. And so then my husband shows up, and we also had the baby, and the baby wasn’t allowed in the emergency room, so I had to step out. And thanks to the [21st Century] Cures Act, I’m getting, like, all the information, you know, as it’s happening. Like, I’m getting the x-ray results, and I’m looking at it. And I can tell there’s a fracture, but I can’t, you know, tell, like, how bad it is. Like, is this something that’s going to need surgery? And I’m desperately texting, like, all the orthopedic folks I know, the pediatricians I know. [LAUGHTER] “Hey, what does this mean?” Like, getting real-time information. And later in the process, there was a mistake in her after-visit summary about how much Tylenol she could take. But I, as a physician, knew that this dose was a mistake. I actually asked ChatGPT. I gave it the whole after-visit summary, and I said, are there any mistakes here? And it clued in that the dose of the medication was wrong. So again, I—as a physician with all these resources—have difficulty kind of navigating the healthcare system; understanding what’s going on in x-ray results that are showing up on my phone; can personally identify medication dose mistakes, but you know, most people probably couldn’t. And it could be very … I actually, you know, emailed the team and let them know, to give feedback. So we have a healthcare system that is broken in so many ways, and it’s so difficult to navigate. So I get it. And so that’s been, you know, a big impetus for me to work in this space and try to make things better. LEE: That’s an incredible story. 
It’s also validating because, you know, one of the examples in our book was the use of an LLM to spot a medication error that a doctor or a nurse might make. You know, interestingly, we’re finding no sort of formalized use of AI right now in the field. But anecdotes like this are everywhere. So it’s very interesting. All right. So we’re starting to run short on time. So I want to ask you a few quick questions and a couple that might be a little provocative. DANESHJOU: Oh boy. [LAUGHTER] Well, I don’t run away … I don’t run away from controversy. LEE: So, of course, with that story you just told, I can see that you use AI yourself. When you are actually in clinic, when you are being a dermatologist … DANESHJOU: Yeah. LEE: … and seeing patients, are you using generative AI? DANESHJOU: So I do not use it in clinic except for the situation of the insurance authorization letters. And even, I was offered, you know, sort of an AI-based scribe, which many people are using. There have been some studies that show that they can make mistakes. I have a human scribe. To me, writing the notes is actually part of the thinking process. So when I write my notes at the end of the day, there have been times that I’ve all of a sudden had an epiphany, particularly on a complex case. But I have used it to write, you know, sort of these insurance authorization letters. I’ve also used it in grant writing. So as a scientist, I have used it a lot more. LEE: Right. So I don’t know of anyone who has a more nuanced and deeper understanding of the issues of biases in AI in medicine than you. Do you think [these] biases can be repaired in AI, and if not, what are the implications? DANESHJOU: So I think there are several things here, and I just want to be thoughtful about it. One, I think, the bias in the technology comes from bias in the system and bias in medicine, which very much exists and is incredibly problematic. 
And so I always tell people, like, it doesn’t matter if you have the most perfect, fair AI. If you have a biased human and you add those together, because you’re going to have this human-AI interaction, you’re still going to have a problem. And there is a paper that I’m on with Dr. Matt Groh, which looked at dermatology diagnosis across skin tones and then with, like, AI assistance. And we found there’s a bias gap, you know, with even physicians. So it’s not just an AI problem; humans have the problem, too. And… LEE: Hmm. Yup. DANESHJOU: … we also looked at when you have the human-AI system, how that impacts the gap because you want to see the gap close. And it was kind of a mixed result in the sense that there was actually situations where, like, the accuracy increased in both groups, but the gap actually also increased because they were actually, even though they knew it was a fair AI, for some reason, they were relying upon the AI more often when … or they were trusting it more often on diagnoses on white skin—maybe they’d read my papers, who knows? [LAUGHTER]—even though we had told them, you know, it was a fair model. So I think for me, the important thing is understanding how the AI model works with the physician at the task. And what I would like to see is it improve the overall bias and disparities with that unit. And at the same time, I tell human physicians, we have to work on ourselves. We have to work on our system, you know, our medical system that has systemic issues of access to care or how patients get treated based on what they might look like or other parts of their background. LEE: All right, final question. So we started off with your stories about imaging in dermatology. And of course, Geoff Hinton, Turing winner and one of the grandfathers of the AI revolution, famously had predicted many years ago that by 2018 or something like that, we wouldn’t need human radiologists because of AI. 
That hasn’t come to pass, but since you work in a field that also is very dependent on imaging technologies, do you see a future when radiologists or, for that matter, dermatologists might be largely replaced by machines? DANESHJOU: I think that’s a complex question. Let’s say you have the most perfect AI systems. I think there’s still a lot of nuance in how these, you know, things get done. I’m not a radiologist, so I don’t want to speak to what happens in radiology. But in dermatology, it ends up being quite complex, the process. LEE: Yeah. DANESHJOU: You know, I don’t just look at lesions and make diagnoses. Like, I do skin exams to first identify the lesions of concern. So maybe if we had total-body photography that could help, like, catch which lesions would be of concern, which people have worked on, that would be step, sort of, step one. And then the second thing is, you know, it’s … I have to do the biopsy. So, you know, the robot’s not going to be doing the biopsy. [LAUGHTER] And then the pathology for skin cancer is sometimes very clear, but there’s also, like, intermediate-type lesions where we have to make a decision bringing all that information together. For rashes, it can be quite complex. And then we have to kind of think about what other tests we’re going to order, what therapeutics we might try first, that sort of stuff. So, you know, there is a thought that you might have AI that could reason through all of those steps maybe, but I just don’t feel like we’re anywhere close to that at all. I think the other thing is AI does a lot better on sort of, you know, tasks that are well defined. And a lot of things in medicine, like, it would be hard to train the model on because it’s not well defined. Even human physicians would disagree on the next best step. LEE: Well, Roxana, for whatever it’s worth, I can’t even begin to imagine anything replacing you. 
I think your work has been just so—I think you used the word, and I agree with it— landmark, and multiple times. So thank you for all that you’re doing and thank you so much for joining this podcast. DANESHJOU: Thanks for having me. This was very fun. [TRANSITION MUSIC] The issue of bias in AI has been the subject of truly landmark work by Roxana and her collaborators. And this includes biases in large language models. This was something that in our writing of the book, Carey, Zak, and I recognized and wrote about. But in fairness, I don’t think Carey, Zak, or I really understood the full implications of it. And this is where Roxana’s work has been so illuminating and important. Roxana’s practical prescriptions around red teaming have proven to be important in practice, and equally important were Roxana’s insights into how AI might always be guilty of the same biases, not only of individuals but also of whole healthcare organizations. But at the same time, AI might also be a potentially powerful tool to detect and help mitigate against such biases. When I think about the book that Carey, Zak, and I wrote, I think when we talked about laws, norms, ethics, regulations, it’s the place that we struggled the most. And in fact, we actually relied on a conversation with GPT-4 in order to tease out some of the core issues. Well, moving on from that conversation with an AI to a conversation with three deep experts who have dedicated their careers to making sure that we can harness all of the goodness while mitigating against the risks of AI, it’s been both fulfilling, very interesting, and a great learning experience. [THEME MUSIC]   I’d like to say thank you again to Laura, Vardit, and Roxana for sharing their stories and insights. And to our listeners, thank you for joining us. We have some really great conversations planned for the coming episodes, including an examination on the broader economic impact of AI in health and a discussion on AI drug discovery. 
We hope you’ll continue to tune in. Until next time. [MUSIC FADES]

Microsoft Academic @MicrosoftAcademic shared a link
2025-04-23 16:41:35 ·

Research Focus: Week of April 21, 2025

www.microsoft.com
In this issue: Catch a preview of our presentations and papers at CHI 2025 and ICLR 2025. We also introduce new research on causal reasoning and LLMs; enhancing LLM jailbreak capabilities to bolster safety and robustness; understanding how people using AI compare with AI alone; and Distill-MOS, a compact and efficient model that delivers state-of-the-art speech quality assessment. You’ll also find a replay of a podcast discussion on rural healthcare innovation with Senior Vice President of Microsoft Health Jim Weinstein. CONFERENCE Microsoft at CHI 2025 Microsoft Research is proud to be a sponsor of the ACM Computer Human Interaction (CHI) 2025 Conference on Human Factors in Computing Systems. CHI brings together researchers and practitioners from all over the world and from diverse cultures, backgrounds, and positionalities, who share an overarching goal to make the world a better place with interactive digital technologies. Our researchers will host more than 30 sessions and workshops at this year’s conference in Yokohama, Japan. We invite you to preview our presentations and our two dozen accepted papers. Microsoft @CHI 2025 CONFERENCE Microsoft at ICLR 2025 Microsoft is proud to be a sponsor of the thirteenth International Conference on Learning Representations (ICLR). This gathering is dedicated to the advancement of representation learning, which is a branch of AI. We are pleased to share that Microsoft has more than 30 accepted papers at this year’s conference, which we invite you to preview. ICLR is globally renowned for presenting and publishing cutting-edge research on all aspects of deep learning used in the fields of artificial intelligence, statistics and data science, as well as important application areas such as machine vision, computational biology, speech recognition, text understanding, gaming, and robotics. 
Microsoft @ICLR 2025 NEW RESEARCH Causal Reasoning and Large Language Models: Opening a New Frontier for Causality What kinds of causal arguments can large language models (LLMs) generate, how valid are these arguments, and what causal reasoning workflows can this generation support or automate? This paper, which was selected for ICLR 2025, clarifies this debate. It advances our understanding of LLMs and their causal implications, and proposes a framework for future research at the intersection of LLMs and causality. This discussion has critical implications for the use of LLMs in societally impactful domains such as medicine, science, law, and policy. In capturing common sense and domain knowledge about causal mechanisms and supporting translation between natural language and formal methods, LLMs open new frontiers for advancing the research, practice, and adoption of causality. Read the paper NEW RESEARCH The Future of AI in Knowledge Work: Tools for Thought at CHI 2025 Can AI tools do more than streamline workflows—can they actually help us think better? That’s the driving question behind the Microsoft Research Tools for Thought initiative. At this year’s CHI conference, this group is presenting four new research papers and cohosting a workshop that dives deep into this intersection of AI and human cognition. The team provides an overview of their latest research, starting with a study on how AI is changing the way people think and work. They introduce three prototype systems designed to support different cognitive tasks. Finally, through their Tools for Thought workshop, they invite the CHI community to help define AI’s role in supporting human thinking. Read the blog NEW RESEARCH Building LLMs with enhanced jailbreaking capabilities to bolster safety and robustness Recent research shows that LLMs are vulnerable to automated jailbreak attacks, where algorithm-generated adversarial suffixes bypass safety alignment and trigger harmful responses. 
This paper introduces ADV-LLM, an iterative self-tuning process for crafting adversarial LLMs with enhanced jailbreak capabilities—which could provide valuable insights for future safety alignment research. ADV-LLM is less computationally expensive than prior mechanisms and achieves higher attack success rates (ASR), especially against well-aligned models like Llama2 and Llama3. It reaches nearly 100% ASR on various open-source LLMs and demonstrates strong transferability to closed-source models—achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4—despite being optimized solely on Llama3. Beyond improving jailbreak performance, ADV-LLM offers valuable insights for future alignment research by enabling large-scale generation of safety-relevant datasets. Read the paper NEW RESEARCH ChatBench: From Static Benchmarks to Human-AI Evaluation The rapid adoption of LLM-based chatbots raises the need to understand what people and LLMs can achieve together. However, standard benchmarks like MMLU assess LLM capabilities in isolation (i.e., “AI alone”). This paper presents the results of a user study that transforms MMLU questions into interactive user-AI conversations. The researchers seeded the participants with the question and then had them engage in a conversation with the LLM to arrive at an answer. The result is ChatBench, a new dataset comprising AI-alone, user-alone, and user-AI data for 396 questions and two LLMs, including 144,000 answers and 7,336 user-AI conversations. The researchers’ analysis reveals that AI-alone accuracy does not predict user-AI accuracy, with notable differences across subjects such as math, physics, and moral reasoning. Examining user-AI conversations yields insights into how these interactions differ from AI-alone benchmarks. 
Finally, the researchers demonstrate that finetuning a user simulator on a subset of ChatBench improves its ability to predict user-AI accuracy, boosting correlation on held-out questions by more than 20 points, thereby enabling scalable interactive evaluation. Read the paper NEW RESEARCH Distill-MOS: A compact speech-quality assessment model Distill-MOS is a compact and efficient speech quality assessment model with dramatically reduced size—over 100x smaller than the reference model—enabling efficient, non-intrusive evaluation in real-world, low-resource settings. This paper investigates the distillation and pruning methods to reduce model size for non-intrusive speech quality assessment based on self-supervised representations. The researchers’ experiments build on XLS-R-SQA, a speech quality assessment model using wav2vec 2.0 XLS-R embeddings. They retrain this model on a large compilation of mean opinion score datasets, encompassing over 100,000 labeled clips. Read the paper View GitHub PODCAST Collaborating to Affect Change for Rural Health Care with Innovation and Technology Senior Vice President of Microsoft Health Jim Weinstein joins Dan Liljenquist, Chief Strategy Officer from Intermountain Health, on the NEJM Catalyst podcast for a discussion of their combined expertise and resources and their collaboration to address healthcare challenges in the rural United States. These challenges include limited access to care, rising mortality rates, and severe staffing shortages. Working together, they aim to create a scalable model that can benefit both rural and urban health care systems. Key goals include expanding access through telemedicine and increasing cybersecurity, ultimately improving the quality of care delivered and financial stability for rural communities. 
Listen to the podcast PODCAST Empowering patients and healthcare consumers in the age of generative AI Two champions of patient-centered digital health join Microsoft Research President Peter Lee to talk about how AI is reshaping healthcare in terms of patient empowerment and emerging digital health business models. Dave deBronkart, a cancer survivor and longtime advocate for patient empowerment, discusses how AI tools like ChatGPT can help patients better understand their conditions, navigate the healthcare system, and communicate more effectively with clinicians. Christina Farr, a healthcare investor and former journalist, talks about the evolving digital health–startup ecosystem, highlighting where AI is having the most meaningful impact—particularly in women’s health, pediatrics, and elder care. She also explores consumer trends, like the rise of cash-pay healthcare. Listen to the podcast PODCAST Beyond the Image: AI’s Expanding Role in Healthcare Jonathan Carlson, Managing Director of Microsoft Research Health Futures, joins the Healthcare Unfiltered show to explore the evolution of AI in medicine, from the early days to cutting-edge innovations like ambient clinical intelligence. This podcast explores how pre-trained models and machine learning are transforming care delivery, as well as the future of biomedicine and healthcare, including important ethical and practical questions. Listen to the podcast

Microsoft Academic @MicrosoftAcademic shared a link
2025-04-18 17:04:15 ·

The Future of AI in Knowledge Work: Tools for Thought at CHI 2025

www.microsoft.com
Can AI tools do more than streamline workflows—can they actually help us think better? That’s the driving question behind the Microsoft Research Tools for Thought initiative. At this year’s CHI conference, we’re presenting four new research papers and cohosting a workshop that dives deep into this intersection of AI and human cognition. This post provides an overview of our latest research, starting with a study on how AI is changing the way we think and work. We also introduce three prototype systems designed to support different cognitive tasks. Finally, through our Tools for Thought workshop, we’re inviting the CHI community to help define AI’s role in supporting human thinking. AI’s effects on thinking at work With a single prompt, AI can generate a wide range of outputs, from documents and meeting agendas to answers and automated workflows. But how are people’s thinking processes affected when they delegate these tasks to AI? One of our goals is to understand how knowledge workers use AI, how they perceive its value, and how it affects cognitive effort. Our study, “The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers,” surveyed 319 professionals using AI across a variety of occupations. Participants shared 936 real-world AI use cases and reflected on how it influenced their critical thinking and mental effort. We summarize these findings below. Defining and deploying critical thinking. Knowledge workers describe critical thinking as involving activities like setting clear goals, refining prompts, and verifying AI outputs against external sources and their own expertise. They rely on these practices to maintain work quality when using AI—motivated by the need to avoid errors, produce better results, and develop their skills. Findings Balancing cognitive effort. 
Participants’ reports about critical thinking and the effort involved align with longstanding human tendencies to manage cognitive load at work. For high-stakes tasks requiring accuracy, they say they expend more effort in applying critical thinking with AI than they would performing the same tasks without it. In contrast, during routine, low-stakes tasks under time pressure, they report spending less effort on critical thinking when using AI compared with completing the task without it. Confidence effects. The study found that higher confidence in AI was associated with less critical thinking, while higher self-confidence was associated with more critical thinking. Shift in the nature of critical thinking. Participants reported a shift in critical thinking activities, with a greater focus on information verification, response integration, and task stewardship. While AI automates certain aspects of knowledge work, it also demands more effort in evaluating the accuracy and relevance of AI-generated content. Barriers to critical engagement. The study identified several barriers that inhibit critical thinking when using AI. These include a lack of awareness of the need for critical evaluation, limited motivation due to time pressure or perceived job scope, and difficulty in refining prompts—especially in unfamiliar domains. Recommendations To foster critical thinking at work, we recommend that AI tools actively encourage awareness, motivation, and skill development. AI tools should enhance motivators for critical thinking (e.g., quality standards, skill-building) and mitigate inhibitors (e.g., time constraints, low awareness). Proactive prompts can surface overlooked tasks, while reactive features can offer on-demand assistance. Motivation can be strengthened by positioning critical reflection as part of professional growth—not just extra work. AI tools should also support knowledge workers’ ability to think critically by providing reasoning explanations (as some newer AI models now do), guided critiques, and cross-references. 
This shift must occur in both the design of the technology and in the mindsets of knowledge workers. Rather than treating AI as a tool for delivering answers, we suggest treating it as a thought partner—one that can also act as a provocateur. Beyond these insights, our other CHI papers explore practical ways to design AI that augments human cognition. Enhancing decision-making with AI Decision-making is central to knowledge work, and AI is increasingly used to help people make decisions in complex fields like healthcare and finance. However, how much agency do knowledge workers retain when AI is involved? Our study, “AI, Help Me Think—but for Myself: Exploring How LLMs Can Assist People in Complex Decision-Making by Providing Different Forms of Cognitive Support,” conducted in collaboration with University College London, examines this question. We began with a small formative study involving 10 participants, followed by a comparative study with 21 participants using two different AI-supported decision-making systems. For a complex financial investment task, we compared two different AI tools (Figure 1): RecommendAI, which provides AI-generated recommendations, and ExtendAI, which encourages users to articulate their reasoning before receiving AI feedback. Figure 1. Illustrative comparison of the thought process involved when interacting with two types of AI: RecommendAI and ExtendAI. Findings Both systems were found to offer benefits for augmenting cognition and addressing some of the challenges to critical thinking identified in the knowledge worker survey above, suggesting the potential for a balanced approach. RecommendAI offered concrete suggestions that inspired users to explore new directions in their decision-making. This often led to fresh insights and reflections. However, the recommendations at times felt disconnected from the user’s own reasoning, reducing the depth of engagement. 
In contrast, ExtendAI encouraged users to reflect more deeply on their decisions by providing feedback on their reasoning. This helped them examine their thought processes and consider alternative perspectives. However, some users found the feedback too general and not actionable enough. When it came to how users integrated the tools into their decision-making process, RecommendAI introduced perspectives that pushed users to think beyond their usual patterns. By recommending options not based on users’ own reasoning, it encouraged exploration of ideas they might not have considered. However, some users perceived the recommendations as a “black box” solution. This lack of transparency made those recommendations harder to understand, trust, and apply to their own thought processes. ExtendAI, on the other hand, aligned with users’ existing reasoning, making its feedback easier to incorporate. This helped the users maintain a sense of control and continuity. However, because the feedback often echoed their initial thoughts, it sometimes limited new insights and risked reinforcing existing biases. These findings suggest that AI tools like ExtendAI, designed to elicit and build on users’ own cognitive processes, may offer a more effective approach to augmentation than simply providing “ready-made solutions” that users must figure out how to interpret and apply. Are we on track? Making meetings better with AI Meetings are often criticized for being ineffective. While this is sometimes due to poor practices—such as weak agendas, late starts, and unclear facilitation—we believe the deeper issue is a lack of meeting intentionality: knowing why a meeting is occurring and keeping the discussion focused on that purpose. A key challenge is maintaining goal clarity throughout a meeting. In the paper “Are We On Track? 
AI-Assisted Goal Reflection During Meetings,” we explore how AI tools can improve meetings in real time by encouraging reflection—awareness about the meeting’s goals and how well the current conversation is aligned with those goals. Our study with 15 knowledge workers examined two AI-driven design paradigms: passive goal assistance through ambient visualization (a live chart displaying how conversational topics relate to meeting objectives) and active goal assistance through interactive questioning (nudging participants to consider whether the current conversation aligns with the meeting objectives). These approaches are illustrated in Figure 2. Figure 2. Technology prototypes exploring passive and active ways to keep meetings focused on established objectives. Recommendations The findings highlight AI’s potential to help teams with meeting objectives. We found three key design tradeoffs between passive and active support. Based on these, we offer the following AI design recommendations. Information balance. There is a tradeoff between ambient visualizations in the passive approach—which can risk information overload—and interactive questioning in the active approach, which may lack detail. To be effective, AI should deliver the right amount of information at the right time and tailor content to the individuals who need it most—without overwhelming users, while offering meaningful and timely support for reflection. Balance of engagement versus interruption. When participants are deeply engaged in discussion, significant interruptions can overwhelm and disrupt the flow. Conversely, during moments of confusion or misalignment, subtle cues may be insufficient to get the team back on track. AI systems should dynamically adjust their level of intervention—from ambient and lightweight to more direct—escalating or de-escalating based on timing thresholds, which can be customized for each team. Balance of team versus individual goal awareness. 
AI assistance can nudge team action, such as adjusting agendas. These effects were stronger with the active approach, which required group responses, while the passive approach supported individual thinking without directly influencing team behavior. Team-wide engagement depends on both the visibility of AI cues and how they are introduced into the discussion. This study helps us understand how AI design choices can support intentionality during meetings and enhance productivity without disrupting natural workflows. Encouraging diverse problem-solving brainstorming with AI Diverse perspectives drive creative problem-solving in organizations, but individuals often lack access to varied viewpoints. In the paper “YES AND: An AI-Powered Problem-Solving Framework for Diversity of Thought,” we build on the idea of “design improv” to explore a multi-agent AI prototype that simulates conversations with persona-based agents representing a range of expertise. The agents follow a classic model of conversational turn-taking, combined with a confidence model to determine when to take or respond to a turn. This allows both the agents and the user to organically build on each other’s ideas and ask clarifying questions. The system enables free-flowing, multi-party idea generation while avoiding common pitfalls of group brainstorming—such as social loafing, production blocking, and groupthink (Figure 3). Figure 3. The YES AND system supports conversational turn-taking among agents and the user to generate ideas around a problem. 
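To make the idea of confidence-gated turn-taking concrete, here is a minimal Python sketch. It is not the YES AND implementation: the post does not describe how the confidence model works, so the keyword-overlap score, the `Agent` class, the `next_speaker` function, and the `threshold` value are all hypothetical stand-ins chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    """A persona-based agent (hypothetical; not the paper's design)."""
    name: str
    keywords: set  # topics this persona is assumed to know about

    def confidence(self, last_message: str) -> float:
        # Assumed stand-in for a confidence model: the fraction of the
        # persona's keywords that appear in the last message.
        if not self.keywords:
            return 0.0
        words = set(last_message.lower().split())
        return len(words & self.keywords) / len(self.keywords)


def next_speaker(agents, last_message, last_speaker=None, threshold=0.3) -> Optional[Agent]:
    """Pick the most confident agent other than the previous speaker.

    Returns None when no agent clears the threshold, which in this sketch
    means the turn goes back to the user.
    """
    candidates = [a for a in agents if a.name != last_speaker]
    if not candidates:
        return None
    best = max(candidates, key=lambda a: a.confidence(last_message))
    return best if best.confidence(last_message) >= threshold else None


agents = [
    Agent("designer", {"layout", "color", "prototype"}),
    Agent("engineer", {"latency", "cache", "api"}),
]
speaker = next_speaker(agents, "the api latency is too high")
print(speaker.name if speaker else "user")  # prints "engineer"
```

Excluding the previous speaker and falling back to the user when nobody is confident are design choices of this sketch; a real system would presumably derive confidence from a language model rather than keyword overlap.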
At the end of a session, an AI agent called Sage distills the discussion, leaving it to the user to develop a conclusive approach to the problem. In this way, YES AND helps unblock forward momentum in problem-solving while preserving the agency of knowledge workers to shape their own ideas. We believe the best way to advance next-generation tools for thought is by bringing together a wide range of perspectives and approaches. Besides our four papers, the fifth cornerstone of our CHI presence this year is our workshop on April 26, co-organized with collaborators from industry and academia: Tools for Thought: Research and Design for Understanding, Protecting, and Augmenting Human Cognition with Generative AI. In this session, over 60 researchers, designers, practitioners, and provocateurs will gather to examine what it means to understand and shape the impact of AI on human cognition. Together, we’ll explore how AI is changing workflows, the opportunities and challenges for design, and which theories, perspectives, and methods are increasingly relevant—or still need to be developed. The enthusiastic response to this workshop highlights the growing interest in AI’s role in human thought. Our goal is to foster a multidisciplinary community dedicated to ensuring that AI not only accelerates work but also strengthens our ability to think critically, creatively, and strategically. We look forward to ongoing discussions, new collaborations, and the next wave of innovations in AI-assisted cognition at CHI 2025.

Microsoft Academic @MicrosoftAcademic shared a link
2025-04-18 01:06:57 ·

Empowering patients and healthcare consumers in the age of generative AI

www.microsoft.com
Transcript [MUSIC]  [BOOK PASSAGE]   “In healthcare settings, keeping a human in the loop looks like the solution, at least for now, to GPT-4’s less-than 100% accuracy. But years of bitter experience with ‘Dr. Google’ and the COVID ‘misinfodemic’ show that it matters which humans are in the loop, and that leaving patients to their own electronic devices can be rife with pitfalls. Yet because GPT-4 appears to be such an extraordinary tool for mining humanity’s store of medical information, there’s no question members of the public will want to use it that way—a lot.” [END OF BOOK PASSAGE]   [THEME MUSIC]  This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.  Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?   In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here.  [THEME MUSIC FADES] The passage I read at the top there is from Chapter 5, “The AI-Augmented Patient,” which Carey wrote. People have forever turned to the internet and sites like WebMD, Healthline, and so on to find health information and advice. So it wouldn’t be too surprising to witness a significant portion of people refocus those efforts around tools and apps powered by generative AI. Indeed, when we look at our search and advertising businesses here at Microsoft, we find that healthcare is in the top three most common categories of queries by consumers. 
When we envision AI’s potential impact on the patient experience, in our book, we suggested that it could potentially be a lifeline, especially for those without easy access to adequate healthcare; a research partner to help people make sense of existing providers and treatments; and even maybe act as a third member of a care team that has traditionally been defined by the doctor-patient relationship. This also could have a huge impact on venture capitalists in the tech sector who traditionally have focused on consumer-facing technologies. In this episode, I’m pleased to welcome Dave deBronkart and Christina Farr. Dave, known affectionately online as “e-Patient Dave,” is a world-leading advocate for empowering patients. Drawing on his experience as a survivor of stage 4 cancer, Dave gave a viral TED talk on patient engagement and wrote the highly rated book Let Patients Help! Dave was the Mayo Clinic’s visiting professor in internal medicine in 2015, has spoken at hundreds of conferences around the globe, and today runs the Patients Use AI blog on Substack. Chrissy puts her vast knowledge of the emerging digital and health technology landscape to use as a managing director with Manatt Health, a company that works with health systems, pharmaceutical and biotech companies, government policymakers, and other stakeholders to advise on strategy and technology adoption with the goal of improving human health. Previously, she was a health tech reporter and on-air contributor for CNBC, Fast Company, Reuters, and other renowned news organizations and publications. Hardly a week goes by without a news story about an ordinary person who managed to address their health problems—maybe even save their lives or the lives of their loved ones, including in some cases their pets—through the use of a generative AI system like ChatGPT. 
And if it’s not doing something as dramatic as getting a second opinion on a severe medical diagnosis, the empowerment that people feel when an AI can help decode an indecipherable medical bill or report or get advice on what to ask a doctor, well, those things are both meaningful and a daily reality in today’s AI world. And make no mistake—such consumer empowerment could mean business, really big business, and this means that investors in new ventures are smart to be taking a close look at all this. For these and many other reasons, I am thrilled to pair the perspectives offered by e-Patient Dave and Chrissy Farr together for this episode. Here is my interview with Dave deBronkart: LEE: Dave, it’s just a thrill and honor to have you join us. DAVE DEBRONKART: It’s a thrill to be alive. I’m really glad that good medicine saved me, and it is just unbelievable, fun, and exciting and stimulating to be in a conversation with somebody like you. LEE: Likewise. Now, we’re going to want to get into both the opportunities and the challenges that patients face. But before that, I want to talk a little bit and delve a little bit more into you, yourself. I, of course, know you as this amazing speaker and advocate for patients. But you have had actually a pretty long career and history prior to all this. And so can you tell us a little bit about your background? DEBRONKART: I’ll go back all the way to when I first got out of college. I didn’t know what I wanted to do when I grew up. So I got a job where I … basically, I used my experience working on the school paper to get a temporary job. It was in type setting, if you can believe that. [LAUGHTER] And, man, a few years later, that became the ultimate lesson in disruptive innovation. LEE: So you were actually doing movable type? Setting type? DEBRONKART: Oh, no, that was, I was … I’m not that old, sir! [LAUGHTER] The first place where I worked, they did have an actual Linotype machine and all that. LEE: Wow. 
DEBRONKART: Anyway, one thing led to another. A few years after I got that first job, I was working for the world’s biggest maker of typesetting machines. And I did product marketing, and I learned how to speak to audiences of all different sorts. And then desktop publishing came along, as I say. And it’s so funny because, now mind you, this was 10 years before Clay Christensen wrote The Innovator’s Dilemma. But I had already lived through that because here we were. We were the journeymen experts in our noble craft that had centuries of tradition as a background. Is this reminding you of anything? [LAUGHTER] Well, seriously. And then along comes stuff that can be put in the hands of the consumers. And I’ll tell you what, people like you had no clue how to use fonts correctly. [LAUGHTER] We were like Jack Nicholson, saying “You can’t handle the Helvetica! You don’t know what you’re doing!” But what happened then, and this is really relevant, what happened then is—all of a sudden, the population of users was a hundred times bigger than the typesetting industry had ever been. The clueless people gained experience, and they also started expressing what they wanted the software to be. The important thing is today everybody uses fonts. It’s no longer a secret profession. Things are done differently, but there is more power in the hands of the end user. LEE: Yeah, I think it’s so interesting to hear that story. I didn’t know that about your background. And I think it sheds some light on hopefully what will come out later as you have become such, I would call you a fierce consumer advocate. DEBRONKART: Sure, energetic, however, whatever you want to call it, sure. [LAUGHTER] Seriously, Peter, what I always look to do … so this is a mixture of my having been run over by a truck during disruptive innovation, all right, but then also looking at that experience from a marketing perspective: how can I convey what’s happening in a way that people can hear?
Because you really don’t get much traction as an advocate if you come in and say, you people are messed up. LEE: Right. So, now I know this gets into something fairly personal, but you’ve actually been remarkably public about this. You became very ill. DEBRONKART: Yes. LEE: And of course, I suspect some of the listeners to this podcast probably have followed your story, but many have not. So can we go a little bit through that … DEBRONKART: Sure. LEE: … just to give our listeners a sense of how this has formed some of your views about the healthcare system. DEBRONKART: So late in 2006, I went in for my annual physical with my deservedly famous primary care physician, Danny Sands at Beth Israel [Deaconess Medical Center] in Boston. And in the process—I had moved away for a few years, so I hadn’t seen him for a while—I did something unusual. I came into the visit with a preprinted letter with 13 items I wanted to go over with him. LEE: What made you do that? Why did you do that? DEBRONKART: I have always been, even before I knew the term exists, I was an engaged patient, and I also very deeply believe in partnership with my physicians. And I respected his time. I had all these things, because I hadn’t seen him for three years … LEE: Yeah. DEBRONKART: … all these things I wanted to go through. To me it was just if I walked into a business meeting with a bunch of people that I hadn’t seen for three years and I want to get caught up, I’d have an agenda. LEE: It’s so interesting to hear you say this because I’m very similar to you. I like to do my own research. I like to come in with checklists. And do you ever get a sense like I do that sometimes that makes your doctor a little uncomfortable? DEBRONKART: [LAUGHS] Well, you know, so sometimes it does make some doctors uncomfortable and that touches on something that right now is excruciatingly important in the culture change that’s going on. 
I’ve spent a lot of time as I worked on the culture change from the patient side, I want to empathize, understand what’s going on in the doctor’s head. Most doctors are not trained in medical school or later, how do you work with a patient who behaves like you or me, you know? And in the hundreds of speeches that I’ve given, I’ve had quite a range of reactions from doctors afterwards. I’ve had doctors come up to me and say, “This is crap.” I mean, right to my face, right. “I’ll make the decisions. I’ll decide what we’re going to talk about.” And now my thought is, OK, and you’re not going to be my doctor. LEE: Yeah. DEBRONKART: I want to be responsible for how the time is spent, and I didn’t want be fumbling for words during the visit. LEE: Right. DEBRONKART: So I said, I’ve got among other things … one of the 13 things was I had a stiff shoulder. So he ordered a shoulder x-ray, and I went and got the shoulder x-ray. And I will never forget this. Nine o’clock the next morning, he called me, and I can still—this is burned into my memory—I can see the Sony desk phone with 0900 for the time. He said, “Dave, your shoulder’s going to be fine. I pulled up the x-ray on my screen at home. It’s just a rotator cuff thing, but Dave, something else showed up. There’s something in your lung that shouldn’t be there.” And just by total luck, what turned out to be a metastasis of kidney cancer was in my lung next to that shoulder. He immediately ordered a CAT scan. Turned out there were five tumors in both lungs, and I had stage 4 kidney cancer. LEE: Wow. DEBRONKART: And on top of that, back then—so this was like January of 2007—back then, there was much less known about that disease than there is now. LEE: Right. DEBRONKART: There were no studies—zero research on people like me—but the best available study said that for somebody with my functional status, my median survival was 24 weeks. Half the people like me would be dead in five and a half months. 
LEE: So that just, you know, I can’t imagine, you know, how I would react in this situation. And what were your memories of the interaction then between you and your doctor? You know, how did your doctor engage with you at that time? DEBRONKART: I have very vivid memories. [LAUGHS] Who was it? I can’t remember what famous person said, “Nothing focuses the mind like the knowledge that one is to be hanged in a fortnight,” right. But 24 weeks does a pretty good job of it. And I … just at the end of that phone call where he said I’m going to order a CAT scan, I said, “Is there anything I should do?” Like I was thinking, like, go home and make sure you don’t eat this sort of this, this, that, or the other thing. LEE: Right. DEBRONKART: And what he said was, “Go home and have a glass of wine with your wife.” LEE: Yeah. DEBRONKART: Boy, was that sobering. But then it’s like, all right, game on. What are we going to do? What are my options? And a really important thing, and this, by the way, this is one reason why I think there ought to be a special department of hell for the people who run hospitals and other organizations where they think all doctors are interchangeable parts. All right. My doctor knew me. LEE: Yeah. DEBRONKART: And he knew what was important to me. So when the biopsy came back and said, “All right, this is definitely stage 4, grade 4 renal cell carcinoma.” He knew me enough … he said, “Dave, you’re an online kind of guy. You might like to join this patient community that I know of.” This was 2007. LEE: Yeah. DEBRONKART: It’s a good quality group. This organization that barely exists. LEE: That’s incredibly progressive, technologically progressive for that time. DEBRONKART: Yeah, incredibly progressive. Now, a very important part of the story is this patient community is just a plain old ASCII listserv. You couldn’t even do boldface, right. 
And this was when the web was … web 2.0 was just barely being created, but what it was, was a community of people who saw the problems the way I see the problems. God bless the doctors who know all the medical stuff, you know. And they know the pathology and the morphology and whatever it is they all know. And I’m making a point here of illustrating that I am anything but medically trained, right. And yet I still, I want to understand as much as I can. I was months away from dead when I was diagnosed, but in the patient community, I learned that they had a whole bunch of information that didn’t exist in the medical literature. Now today we understand there’s publication delays; there’s all kinds of reasons. But there’s also a whole bunch of things, especially in an unusual condition, that will never rise to the level of deserving NIH [National Institutes of Health] funding, right … LEE: Yes. DEBRONKART: … and research. And as it happens, because of the experience in that patient community, they had firsthand experience at how to survive the often-lethal side effects of the drug that I got. And so I talked with them at length and during my treatment, while I was hospitalized, got feedback from them. And several years later my oncologist, David McDermott, said in the BMJ [British Medical Journal], he said, “You were really sick. I don’t know if you could have tolerated enough medicine if you hadn’t been so prepared.” Now there is a case for action, for being actively involved, and pointing towards AI now, doing what I could to learn what I could despite my lack of medical education. LEE: But as you were learning from this patient community these things, there had to be times when that came into conflict with the treatment plan that you’re under. That must have happened. So first off, did it? And how were those conflicts resolved? DEBRONKART: So, yes, it did occasionally because in any large population of people you’re going to have differences of opinion.
Now, before I took any action—and this closely matches the current thought of human in the loop, right—before I took any action based on the patient community, I checked with my clinicians. LEE: Were there times when there were things that … advice you were getting from the patient community that you were very committed to, personally, but your official, formal caregivers disagreed with? DEBRONKART: No, I can’t think of a single case like that. Now, let me be clear. My priority was: save my ass, keep me alive, you know? And if I thought a stranger at the other end of an internet pipe had a different opinion from the geniuses at my hospital—who the whole patient community had said, this is maybe the best place in the world for your disease— LEE: Yes. DEBRONKART: I was not going to go off and have some philosophical debate about epistemology and all of that stuff. And remember, the clock was ticking. LEE: Well, in fact, there’s a reason why I keep pressing on this point. It’s a point of curiosity because in the early days of GPT-4, there was an episode that my colleague and friend Greg Moore, who’s a neuroradiologist, had with a friend of his that became very ill with cancer. And she went in for treatment and the treatment plan was a specific course of chemotherapy, but she disagreed with that. She wanted a different type of, more experimental immunotherapy. And that disagreement became intractable to the point that the cancer specialists that were assigned to treat her asked Greg, “Can you talk to her and explain, you know, why we think our decision is best?” And the thing that was remarkable is Greg decided to use that case as one of the tests in the early development days of GPT-4 and had a conversation to explain the situation. They went back and forth. GPT-4 gave some very useful advice to Greg on what to say and how to frame it. And then, when Greg finally said, “You know, thank you for the help.” What floored both me and Greg is GPT-4 said, “You’re welcome. 
But, Greg, what about you? Are you getting all the support that you need? Here are some resources.” And, you know, I think we can kind of take that kind of behavior for granted today, and there have been some published studies about the seeming empathy of generative AI. But in those early days, it was eerie, it was awe-inspiring, it was disturbing—you know, all of these things at once. And that’s essentially why I’m so curious about your experiences along these lines. DEBRONKART: That’s like, that’s the flip side of the famous New York Times reporter who got into a late-night discussion … LEE: Oh, Kevin Roose, yes. [LAUGHTER] DEBRONKART: You say you’re happy in your marriage, but I think you’re not. LEE: Right. DEBRONKART: It’s like, whoa, this is creepy. But you know, it’s funny because one of the things that’s always intrigued me, partly because of my professional experience at explaining technology to people, is the early messaging around LLMs [large language models], which I still hear people … The people who say, “Well, wait a minute, these things hallucinate, so don’t trust them.” Or they say, “Look, all it’s doing is predicting the next word.” But there are loads of nuances, … LEE: Yes. DEBRONKART: … LEE: Hmm, yes. Yeah. DEBRONKART: … to be able to express that. Honestly, that is why I’m so excited about the arriving future. One immensely important thing … as I said earlier, I really respect my doctors’ time—“doctors” plural—and it breaks my heart that the doctors who did all this work to get license and all that stuff are quitting the field because the economic pressures are so great. I can go home and spend as many hours as I want asking it questions. LEE: Yes. DEBRONKART: All right. 
I’ve recently learned a thing to do after I have one of these hours-long sessions, I’ll say to it, “All right, so if I wanted to do this in a single-shot prompt, how would you summarize this whole conversation?” So having explored with no map, I end up with a perspective that it just helps me see the whole thing … LEE: Yes. Yeah, that’s brilliant. DEBRONKART: … without spending a moment of the doctor’s time. LEE: Yeah, yeah. So when was the first time that you used, you know, generative AI? DEBRONKART: It had to be February or March of whatever the first year was. LEE: Yeah. And was it the New York Times article that piqued your interest? DEBRONKART: Oh absolutely. LEE: Yeah. And so what did you think? Were you skeptical? Were you amazed? What went through your mind? DEBRONKART: Oh, no, no, no. It blew my mind. And I say that as somebody who emerged from the 1960s and ’70s, one of the original people who knew what it was to have your mind blown back in the psychedelic era. [LAUGHTER] No, it blew my mind. And it wasn’t just the things it said; it was the implications of the fact that it could do that. I did my first programming with BASIC or Fortran. I don’t know, something in the mid-’60s, when I was still in high school. So I understand, well, you know, you got to tell it exactly what you want it to do or it’ll do the wrong thing. So, yeah, for this to be doing something indistinguishable from thinking—indistinguishable from thinking—was completely amazing. And that immediately led me to start thinking about what this would mean in the hands of a sick person. And, you know, my particular area of fascination in medicine—everything I use it for these days is mundane—but the future of a new world of medicine and healthcare is one where I can explore and not be limited to things where you can read existing answers online. LEE: Right. 
So if you had GPT-4 back in 2006, 2007, when you were first diagnosed with your renal cancer, how would things have been different for you? Would things have been different for you? DEBRONKART: Oh, boy, oh, boy, oh, boy. This is going to have to be just a swag because, I mean, for it to—you mean, if it had just dropped out of thin air? LEE: Yes. [LAUGHS] DEBRONKART: Ah, well, that’s … that’s even weirder. First thing we in the patient community would have to do is figure out what this thing does … LEE: Yeah. DEBRONKART: … before we can start asking it questions. Now, Peter, a large part of my evangelism, you know, there’s a reason why my book and my TED talk were titled “Let Patients Help.” I really am interested in planting a thought in people’s minds, and it’s not covert. I come right out and say it in the title of the book, right, planting a thought that, with the passage of time, will hold up as a reasonable thing to do. And same thing is true with AI. So … and I’ve been thinking about it that way from the very beginning. I never closed the loop on my cancer story. I was diagnosed in January, and I had my last drop of high-dose interleukin—experimental immunotherapy, right—in July. And that was it. By September, they said, looks like you beat it. And I was all done. And there’s the question: how could it be that I didn’t die? How could it be that valuable information could exist and not be in the minds of most doctors? Not be in the pages of journals? And if you think of it that way, along the way, I became a fan of Thomas Kuhn’s famous book, The Structure of Scientific Revolutions. LEE: Yes. DEBRONKART: When something that the paradigm says could not happen does happen, then responsible thinkers have to say, the paradigm must be wrong. That’s the stage of science that he called a crisis.
So if something came along back in 2006, 2007, I would have to look at it and say, “This means we’ve got to rethink our assumptions.” LEE: Yes. You know, now with the passage of time, you know, over the last two years, we’ve seen so many stories like this, you know, where people have consulted AI for a second opinion, … DEBRONKART: Sure. LEE: … maybe uploaded their labs and so on and gotten a different diagnosis, a different treatment suggestion. And in several cases that have been reported, both in medical journals and in the popular press, it’s saved, it has saved lives. And then your point about communities, during COVID pandemic, even doctors form communities to share information. A very famous example are doctors turning to Facebook and Twitter to share that if they had a COVID patient in severe respiratory distress, sometimes they could avoid intubation by … DEBRONKART: Pronation. Yeah. LEE: … pronation. And things like this end up being, in a way, I think the way you’re couching it, ways to work around the restrictions in the more formal healthcare system. DEBRONKART: The traditional flow. Yes. And there is nothing like a forest fire, an emergency, an unprecedented threat to make people drop the usual formal pathways. LEE: So, I’d like to see if we can impart from your wisdom and experience some advice for specific stakeholders. So, what do you say to a patient? What do you say to a doctor? What do you say to the executive in charge of a healthcare system? And then finally, what do you say to policymakers and regulators? So, let’s start with patients. DEBRONKART: So if you’ve got a problem that or a question where you really want to understand more than you’ve been able to, then give a try to these things. Ask some questions. And it’s not just the individual question and answer. The famous, amazing patient advocate, Hugo Campos, … LEE: Hmm, yes. DEBRONKART: … said something that I call “Hugo’s Law.” He said, “Dave, I don’t ask it for answers. 
I use it to help me think.” LEE: Yes, absolutely. DEBRONKART: So you get an answer and you say, “Well, I don’t understand this. What about that? Well, what if I did something different instead?” And never forget, you can come back three months later and say, “By the way, I just thought of something. What about that,” right. LEE: Yeah, yeah, fantastic. DEBRONKART: So be focused on what you want to understand. LEE: So now let’s go to a doctor or a nurse. What’s the advice there? DEBRONKART: Please try to imagine a world … I know that most people today are not as activated as I am in wanting to be engaged in their health. But to a very large extent, people, a lot of people, family and friends, have said they don’t want to do this because they don’t want to offend the doctors and nurses. Now, even if the doctor or nurse is not being a paternal jerk, all right, the patients have a fear of this. Dr. Sands handles this brilliantly. I mentioned it in the book. He proactively asks, are there any websites you’ve found useful? And you can do the same thing with AI. Have you done anything useful with ChatGPT or something like that? LEE: That actually suggests some curricular changes in medical schools in order to train doctors. DEBRONKART: Absolutely. In November, I attended a retreat on rethinking medical education. I couldn’t believe it, Peter. They were talking about how AI can be used in doing medical education. And I was there saying, “Well, hello. As long as we’re here, let’s rethink how you teach doctors, medical students to deal with somebody like me.” Cause what we do not want … There was just a study in Israel where it said 18% of adults use AI regularly for medical questions, which matches other studies in the US. LEE: Yep. DEBRONKART: But it’s 25% for people under 25. We do not want 10 years from now to be minting another crop of doctors who tells patients to stay off of the internet and AI. LEE: You know, it’s such an important point. 
Students, you know, entering into college to go on to medical school and then a residency and then finally into practice. I think you’re thinking about the year 2035 or thereabouts. And when you think of that, at least in tech industry terms, we’re going to be on Mars, we’re going to have flying cars, we’re going to have AGI [artificial general intelligence], and you really do need to think ahead. DEBRONKART: Well, you know, healthcare, and this speaks to the problems that health system executives are facing: y’all better watch out or you’re going to be increasingly irrelevant, all right. One of the key use cases, and I’m not kidding … I mean, I don’t mean that if I have stage 4 kidney cancer, I’m going to go have a talk with my robot. But one of the key use cases that makes people sit down and try to solve a problem on their own with an LLM is if they can’t get an appointment. LEE: Yes. DEBRONKART: Well, so let’s figure out, can the health system, can physicians and patients learn to work together in some modified way? Nobody I know wants to stop seeing a doctor, but they do need to have their problems solved. LEE: Yeah, yeah. DEBRONKART: And there is one vitally important thing I want to … I insist that we get into this, Peter. In order for the AI to perform to the best of its contribution, it needs to know all the data. LEE: Yes. DEBRONKART: Well, and so does the patient. Another super-patient, James Cummings, has two rare-genetic-mutation kids. He goes to four Epic-using hospitals. Those doctors can’t see each other’s data. So he compiles it, and he shows … the patient brings in the consolidated data. LEE: Yes. Well, and I know this is something that you’ve really been passionate about, and you’ve really testified before Congress on. But maybe then that leads to this fourth category of people who need advice, which are policymakers and regulators. What would you tell them?
DEBRONKART: It’s funny, in our current political environment, there’s lots of debates about regulation, more regulation, less regulation. I’m heavily in favor of the regulations that say, yeah, I gotta be able to see and download my damn data, as I’m famous for calling it. But what we need to do if we were to have any more regulations is just mandate that you can’t keep the data away from people who need it. You can’t when … LEE: Yep. DEBRONKART: OK, consider one of the most famous AI-using patients is this incredible woman, Courtney Hofmann, whose son saw 17 doctors over three years, and she finally sat down one night and typed it all into GPT. She has created a startup to try to automate the process of gathering everyone’s data. LEE: Yes, yes. Yeah. DEBRONKART: And I know people who have been trying to do this and it’s just really hard. Policy people should say, look, I mean, we know that American healthcare is unsustainable economically. LEE: Yes. DEBRONKART: And one way to take the pressure off the system—because it ain’t the doctors’ fault, because they’re burned out and quitting—one way to take the pressure off is to put more data in the hands of the patients so that entrepreneurs can make better tools. LEE: Yeah. All right. So, we’ve run out of time, but I want to ask one last provocative question to send us off. Just based on your life’s experience, which I think is just incredible and also your personal generosity in sharing your stories with such a wide audience, I think is incredible. It’s just doing so much good in the world. Do you see a future where AI effectively replaces human doctors? Do you think that’s a world that we’re heading towards? DEBRONKART: No, no, no, no. People are always asking me this. I do imagine an increasing base, an increasing if … maybe there’s some Venn diagram or something, where the number of things that I can resolve on my own will increase. LEE: Mm-hmm. Yes.
DEBRONKART: And in particular, as the systems get more useful, and as I gain more savvy at using them and so on, there will be cases where I can get it resolved good enough before I can get an appointment, right. But I cannot imagine a world without human clinicians. Now, I don’t know what that’s going to look like, right. LEE: Yes. [LAUGHS] DEBRONKART: I mean, who knows what it’s going to be. But I keep having … Hugo blogged this incredible vision of where his agentic AI will be looking at one of these consolidated blob medical records things, and so will his doctor’s agentic AI. LEE: Yes. Well, I think I totally agree with you. I think there’ll always be a need and a desire for the human connection. Dave, this has been an incredible, really at times, riveting conversation. And as I said before, thank you for being so generous with your personal stories and with all the activism and advocacy that you do for patients. DEBRONKART: Well, thank you. I’m, as I said at the beginning, I’m glad to be alive and I’m really, really, really grateful to be given a chance to share my thoughts with your audience because I really like super smart nerds. [LAUGHTER] No, well, no kidding. In preparing for this, I listened to a bunch of back podcast episodes, “Microsoft Research,” “NEJM AI.” They talk about things I do not comprehend and don’t get me started on quantum, right? [LAUGHTER] But I’m grateful and I hope I can contribute some guidance on how to solve the problem of the person for whom the industry exists. LEE: Yeah, you absolutely have done that. So thank you. [TRANSITION MUSIC] E-Patient Dave is so much fun to talk to. His words and stories are dead serious, including his openness about his struggles with cancer. But he just has a way of engaging with the world with such activism and positivity. The conversation left me at least with a lot of optimism about what AI will mean for the consumer. 
One of the key takeaways for me is Dave’s point that sometimes informal patient groups have more up-to-date knowledge than doctors. One wonders whether AI will make these sorts of communities even more effective in the near future. It sure looks like it. And as I listen to Dave’s personal story about his bout with cancer, it’s a reminder that it can be lifesaving to do your own research, but ideally to do so in a way that also makes it possible to work with your caregivers. Healthcare, after all, is fundamentally a collaborative activity today. Now, here’s my conversation with Christina Farr: LEE: Chrissy, welcome. I’m just thrilled that you’ve joined us here. CHRISTINA FARR: Peter, I’m so excited to be here. Thanks for having me on. LEE: One thing that our listeners should know is you have a blog called Second Opinion. And it’s something that I read religiously. And one of the things you wrote a while ago expressed some questions about as an investor or as a founder of a digital health company, if you don’t use the words AI prominently, you will struggle to gain investment. And you were raising some questions about this. So maybe we start there. And, you know, what are you seeing right now in the kind of landscape of emerging digital health tech companies? What has been both the positive and negative impact of the AI craziness that we have in the world today on that? FARR: Yeah, I think the title of that was something around the great AI capital incineration [LAUGHTER] that we were about to see. But I, you know, stand by it. I do think that we’ve sort of gone really deep into this hype curve with AI, and you see these companies really just sucking up the lion’s share of venture capital investment. And what worries me is that these are, you know, it’s really hard, and we know this from just like decades of being in the space that tools are very hard to monetize in healthcare. 
Most of healthcare still today and where really the revenue is, is in, still in services. It’s still in those kind of one-to-one interactions. And what concerns me is that we are investing in a lot of these AI tools that, you know, are intended to sell into the system. But the system doesn’t yet know how to buy them and then, beyond that, how to really integrate them into the workflow. So where I feel more enthusiastic, and this is a little bit against the grain of what a lot of VCs [venture capitalists] think, but I actually really like care delivery businesses that are fully virtual or hybrid and really using AI as part of their stack. And I think that improves really the style of medicine that they’re delivering and makes it far more efficient. And you start to see, you know, a real improvement in the metrics, like the gross margins of these businesses beyond what you would see in really traditional kind of care delivery. And because they are the ones that own the stack, they’re the ones delivering the actual care, … LEE: Right. FARR: … they can make the decision to incorporate AI, and they can bring in the teams to do that. And I feel like in the next couple of years, we’re going to see more success with that strategy than just kind of more tools that the industry doesn’t know what to do with. LEE: You know, I think one thing that I think I kind of learned or I think I had an inkling of it, but it was really reinforced reading your writings, as a techie, I and I think my colleagues tend to be predisposed to looking for silver bullets. You know, technology that really just solves a problem completely. And I think in healthcare delivery in particular, there probably aren’t silver bullets. And what you need to do is to really look holistically at things and your emphasis on looking for those metrics that measure those end-to-end outcomes. 
So at the same time, just in preparation for this discussion, I re-read your post about Flo being the first kind of unicorn women’s health digital tech startup. And there is actually a lot of very interesting AI technology involved there. So it can happen. How do you think about that? FARR: Yeah, I mean, I see a lot of AI across the board. And it’s real with some of these companies, whether it’s, you know, a consumer health app like Flo that, you know, is really focused on kind of period tracking. And AI is very useful there in helping women just predict things like their optimal fertility windows. And it’s very much kind of integrated very deeply into that solution. And they have really sophisticated technology. And you see that now as well with the kind of craze around these longevity companies, that there is a lot of AI kind of underlying these companies, as well, especially as they’re doing, you know, a lot of health tests and pulling in new data and providing access to that data in a way that, you know, historically patients haven’t had access to. And then I also see it with, you know, like I spoke about with these care delivery companies. I recently spent some time with a business called Origin, for instance, which is in, you know, really in kind of women’s health, MSK [musculoskeletal], and that beachhead is in pelvic floor PT [physical therapy]. And for them, you know, it’s useful in the back office for … a lot of their PT providers are getting great education through AI. And then it’s also useful on the patient-facing side as they provide kind of more and more content for you to do exercises at home. A lot of that can be delivered through AI. So for some of these companies, you know, they look across the whole stack of what they’re providing, and they’re just seeing opportunities in so many different places for AI. And I think that’s really exciting, and it’s very, very real. 
And it’s really to me like where I’m seeing kind of the first set of really kind of promising AI applications. There are definitely some really compelling AI tools, as well. I think companies like Nuance and like Abridge and that whole category of really kind of replacing human scribes with AI, like to me, that is a … that has been so successful because it literally is the pain point. It’s the pain point. You’re solving the pain point for health systems and physicians. Burnout is a huge problem. Documentation is a huge problem. So, you know, to say we’ve got this kind of AI solution, everybody’s basically on board—you know, as long as it works—[LAUGHTER] from the first meeting. And then the question becomes, which one do you choose? You know, that said, you know, to me, that’s sort of a standout area. I’m not seeing that everywhere. LEE: So there are like a bunch of things to delve into there. You know, since you mentioned the Nuance, the Dragon Copilot, and Abridge, and they are doing extremely well. But even for them, and this is another thing that you write about extensively, health systems have a hard time justifying investing in these technologies. It’s not like they’re swimming in cash. And so on that element of things, is there advice to companies that are trying to make technologies to sell into health systems? FARR: Yeah, I mean, I’ll give you something really practical on that just example specifically. So I spend a lot of time chatting with a lot of the health system CMIOs [chief medical informatics officers] trying to, you know, just really understand kind of their take. And they often tell me, “Look, you know, these technologies are not inexpensive, and we’ve already spent a boatload of money on REHR [regional electronic health records], which continues to be expensive. And so we just don’t have a lot of budget.” And for them, I think the question becomes, you know, who within the clinical organization would benefit most from these tools? 
There are going to be progressive physicians that will jump on these on day one and start using them and really integrating them into the workflow. And there will be a subset that just wants to do things the way they always have done things. And you don’t want to pay for seats for everybody when there’s a portion that will not be using it. So I think that’s maybe something that I would kind of share with the startup crowd is just, like, don’t try to sell to every clinician within the organization. Not everybody is going to be, you know, a technology early adopter. Work with the health systems to figure out that cohort that’s likely to jump on board first and then kind of go from there. LEE: So now let me get back specifically to women’s health. I think your investing strategy has, I think it’s fair to say has had some emphasis on women’s health. And I would say for me, that has always made sense because if there’s one thing the tech industry knows how to do in any direct-to-consumer business is to turn engagement into dollars. And when you think about healthcare, there are very few moments in a person’s life when they have a lot of engagement with their own healthcare. But women have many. You mentioned period tracking, pregnancy, menopause. There are so many areas where you could imagine that technology could be good. At least that’s the way I would think about it, but does that make any sense to you, or do you have a different thought process? FARR: Oh, my god, I’ve been, I’m just nodding right now because I’ve been saying the same thing for years, [LAUGHS] that like, I think the, you know, the moments of what I call naturally high engagement are most interesting to me. And I think it’s why it’s been such a struggle with some of these companies that are looking at, you know, areas like or conditions like type two diabetes. I mean, it’s just so hard to try to change somebody’s behavior, especially through technology. 
You know, we’ve not kind of proven out that these nudges are really changing anybody’s mind about, you know, their day-to-day lifestyles. Whereas, you know, in these moments, like you said, of just like naturally high engagement … like it’s, you know, women’s health, you’re right, there’s a lot of them. Like if you’re pregnant, you’re very engaged. If you’re going through menopause, you’re very engaged. And I think there are other examples like this, you know, such as oncology. You get a cancer diagnosis, you’re very engaged. And so, to me, that’s really kind of where I see the most interesting opportunities for technology and for digital health. And, you know, one example I’ll give you in women’s health, I’m not invested in this company, sadly. They are called Midi Health. And they’re really everywhere in the menopause area now, like, you know, the visit volume that they are seeing is just insane. You know, this is a population that is giant. It’s, like, one in two people are women. At some point, we pretty much all go through menopause, some people earlier, some later. And for a lot of us, it’s a really painful, disruptive thing to experience. And we tend to experience it at a moment when we actually have spending money. So it just ticks all the boxes. And yet I think because of the bias that we see, you know, in the venture land and in the startup world, we just couldn’t get on this opportunity for a really long time. So I’ve been very excited to see companies like that really have breakout success. LEE: First off, you know, I think in terms of hits and misses from our book. One hit is we did think a lot about the idea that patients directly would be empowered by AI. And, you know, we had a whole chapter on this, and it was something that I think has really turned out to be true, and I think it will become more true. But one big miss is we actually didn’t think about what we were just talking about, about like who and when would this happen? 
And the specific focus on women, women’s health, I think is something that we missed. And I think one of the reasons I sought you out for this conversation is if I remember your own personal history, you essentially transitioned from journalism to venture investing at about the same time that you yourself were having a very intense period of engagement with health because of your own pregnancy. And so if you don’t mind, I’d like to get into your own experience with healthcare through pregnancy, your own experiences raising children, and how that has informed your relationship with digital health and the investing and advising that you do today. FARR: Yeah, it’s a great question. And I actually was somebody who, you know, wrote a lot while I was kind of on maternity leave about this experience because it was such a profound one. You know, I think the reason that pregnancy is so interesting to healthcare companies and systems is because really for a lot of women, it’s their first experience with the hospital. Most of us have never stayed in the hospital for any period of time until that moment. Both times I had C-sections, so I was there for a good three or four days. And, you know, I think it’s a really big opportunity for these systems, even if they lose money, many of them lose money on pregnancy, which is a whole different topic, but there is an opportunity to get a whole family on board and keep them kind of loyal. And a lot of that can come through, you know, just delivering an incredible service. Unfortunately, I don’t think that we are delivering incredible services today to women in this country. I see so much room for improvement. You know, you see, just look at the data. You see women, you know, still dying in childbirth in this country where in many other developed nations, that’s just no longer the case. LEE: Yeah. And what are, in your view, the prime opportunities or needs? What do we need to do if we have a focus on technology to improve that situation? 
FARR: Yeah, I mean, I think there’s definitely an opportunity for, you know, just digital technologies and for remote patient monitoring and just other forms of monitoring. I do think we should look at what other countries have done and really consider things like, you know, three days post-discharge, somebody comes to your home, you know, whether it’s to check on you from a healthcare perspective, both, you know, physical and mental health, but then also make sure that the environment is safe for both the mother and the baby. Simple things like that, that don’t even really require any technology. And then there’s certainly opportunities for new forms of, you know, diagnostic tests for things like preeclampsia, postpartum preeclampsia. We could definitely use some new therapeutics in this area. Then, you know, would love to kind of also touch on the opportunity in pediatrics because there I think is an ideal use case for AI. And that’s definitely my reality now. LEE: Well, yeah, in fact, I hope I’m not delving into too many personal issues here. But I do remember, I think with your first child, which you had during the height of the COVID pandemic, that your child actually had COVID and actually even lost sense of taste and smell for a period. And, in our book, we had sort of theorized that people would turn possibly to AI for advice to understand what was going on. When you look broadly at the kinds of queries that come into a search engine or into something like ChatGPT or Copilot, you do see things along those lines. But at the same time, I had always thought people wouldn’t just use a raw chat bot for these things. People would want an app, perhaps powered by AI, that would be really designed for this. And yet somehow that seems not to be as widespread. FARR: Yeah. And I think the word app is a great one that I’d love to, you know, maybe interrogate a little bit because I think that we have been overly reliant on apps. I’ll give you an example. 
So in a pediatric space, I am a user of an app called Summer Health or it’s not an app. Sorry. It’s a text messaging service. [LAUGHTER] And this is the genius. So I just pick up my phone, and I text “Summer” and a pediatrician responds within a matter of minutes. And sometimes it’s a pediatric nurse, but it’s somebody who responds to me. And they say, oh, what’s going on? And I might say, OK, well, this week we had the norovirus. So these are the symptoms. And they might say, you know, I’d love to see an image or a video. And I can text that to them. And if a prescription is required, then that goes to a pharmacy near me through another digital application that’s really cool called Photon Health, where my script is portable, so I can move it around based on what’s open. So, through this, I’m getting an incredible experience that’s the most convenient … LEE: Wow. FARR: I could ever ask for, and there is no app. [LAUGHS] And you could imagine the potential for AI. You know, a company like this is probably getting so many questions about a norovirus or COVID or RSV [Respiratory Syncytial Virus], and is, I’m sure, starting to think about kind of ways in which AI could be very useful in this regard. And you don’t need a pediatrician or pediatric nurse answering every question. Perhaps there’s like sophisticated triaging to determine which questions should go to the human expert. But, you know, again, back to this app question, like, I think we have too many. Like, it’s just … like from a user experience perspective, just having to find the app, log into the app. Sometimes there’s just layers of authentication. Then you have to remember your password. [LAUGHTER] And it’s just, you know, it’s just too many steps. And then there’s like 50 of them for all kinds of different things. LEE: Yes. Well, and you have to also go to an app store, download the thing. FARR: Go to the app store down. It’s just too many steps. LEE: Yes. 
FARR: So, like, I, you know, I recognize that HIPAA exists. If there is any kind of claim involved, then, you know, you need an app because you got privacy to think about and compliance, but like, in LEE: It’s so interesting to hear you say this because one thing that I’ve thought—and I’ve actually even expressed publicly in some venues—is one logical endpoint for AI as we understand it today is that apps become unnecessary. We might still have machines that, you know, you hold in the palm of your hand, but it’s just a machine that does what you want it to do. Of course, the business model implications are pretty profound. So for that particular text messaging service, do you understand what their business model is? You know, how are they sustaining themselves? FARR: Consumer, it’s all cash pay. It’s cash pay. You just pay a subscription. And, you know, there are certainly kind of privacy requirements, you know, related to kind of federal and state, but you could consent to be able to do something like this. And, you know, companies like this have teams of lawyers that kind of think through how do you make something like this happen. But it’s possible because of this cash pay element that really underlies that. And I think that is a growing trend. You know, I was literally sitting with a benefits consultant a few weeks ago, and he was saying to me, like, “I tell all my friends and family, just don’t use your insurance at all, unless it’s for like a very high price thing, like a medical procedure that’s expensive or a surgery.” He said, for everything else, I just pay cash. I pay cash for all my primary care. I pay cash for, you know, basic generic, you know, prescription medications that, you know, it’s like a few cents to manufacture. And I’m sort of getting there, too, where I just kind of increasingly am relying on cash pay. 
And I think that sort of opens up a world of opportunity for just innovation related to user experience that could really bring us to this place that you mentioned where there is no app. You literally just text or, you know, you use your voice, and you say, “I need a restaurant reservation,” and it’s done. LEE: Mm-hmm. Yeah. FARR: And it’s that simple, right? And the sort of appification of everything, you know, was an important kind of evolution or moment in technology that is undeniable. But I totally agree with you that I think we might be moving past that. LEE: On this idea of cash, there is a little bit of a fatigue, on the other hand, with—for consumers; let me just speak as a consumer—I can’t keep track anymore of all the subscriptions I have. And so are we just trading one form of, you know, friction for another? FARR: Yeah, that’s a great point. But there are things that, you know, I think there are those moments where you continue to pay a subscription because it’s just something that’s chronic. You know, it’s just relevant to you. You know, pediatrics is a great example. At some point, like I won’t need a pediatrician on demand, which is what I have now, maybe when my kids are a little older, and we’re not just a cesspool of various kind of viruses at home. [LAUGHTER] But again, back to your point about, you know, the sort of moments of just, like, natural engagement, I think there’s also a moment there … there are areas or parts of our lives where, like primary care, where it’s just more longitudinal. And it makes sense to pay on a kind of subscription basis. Like our system is messed up because there’s just messed up incentives, right. And a subscription to me is very pure. [LAUGHTER] Like it’s you’re just saying, “I’m paying for a service that I want and need.” And then the company is saying, “OK, let me make this service as efficient and great and affordable for you as I possibly can.” And to me, that’s like a very, like refreshing trade. 
And I feel the same way, by the way, in my media business, which, you know, definitely has a subscription element. And it just means a lot when someone’s willing to say like this content’s worth paying for. LEE: Yes. FARR: It doesn’t work for everything, but I think it works for things that, you know, have that long-term payoff. LEE: Yeah, I really love that. And if I have one regret about the chapter on kind of the consumer experience from our book—I think all of this seems obvious in retrospect—you know, I wish we had tried to understand, you know, this aspect of the consumer experience, that people might actually have just online experiences that they would pay a monthly fee or an annual fee for. Because it also hits on another aspect of consumer, which is this broad—it’s actually now a national issue in healthcare—about price transparency. And this is another thing that I think you’ve thought about and written about, both the positives and negatives of this. I remember one blog post you made that talked about the issue of churn in digital health. And if I remember correctly, you weren’t completely certain that this was a good thing for the emerging digital health ecosystem. Can you say more about this idea of churn? FARR: Yeah, I mean, you know, I’ve been writing for a long time and thinking for a long time about the buyers of a lot of these kind of digital health companies, like who are the customers? And there was a long period where it was, it was really the self-insured employer, like Microsoft, being a sort of customer of these solutions because they wanted to provide a great array of health benefits for their own employees. And that was, you know, for a long time, like 10 or 15 years, you know, big companies that have now gone public, and it seemed like a faster timeline to be able to sell relative to health systems and, you know, health plans and other groups. And I’ve now kind of been on the forefront of saying that this channel is kind of dead. 
And one of the big reasons is just, you know, there’s no difference, I would say to what you see kind of in the payer lane, which is that churn is a big problem. People used to stay at jobs for 20, 30, 40 years, … LEE: Right. FARR: … and then you’d retire and have great benefits. And so it kind of made sense that your company was responsible for the healthcare that you received. And now I think the last time I looked at the Bureau of Labor Statistics, it’s around four years, a little bit less than four years. So what can you do in four years? [LAUGHS] I just read an interesting analysis on GLP-1s, these medications now that obviously are everywhere in tackling type two diabetes, and obesity is kind of the main, seems to be the hot use case. But, you know, I’m reading analysis around ROI that it’s 15, over 15 years, to see an ROI if you are, you know, a system or a plan or employer that chooses to pay for this. So how does that equate when you don’t keep an employee around for more than four? LEE: Yep. FARR: So I think it’s just left employers in a really bad place of having to make a bunch of tradeoffs and, you know, employees are demanding, we want access to these things. And they’re saying, well, our healthcare costs just keep going up and up and up. You know, we have inflation to contend with and we’re not seeing, you know, the analysis that it necessarily makes sense for us to do so. So that’s what I have, you know, been sort of harping on about with this churn issue that I’m seeing. LEE: Well, I have to tell you, it really, when I first started reading about this from you, it really had a profound impact on my thinking, my thought process. Because one of the things that we dream about is this idea that’s been present actually for decades in the healthcare world of this concept of real-world evidence, RWE. 
And that is this dream that now that we’ve digitized so much health experience, we should be able to turn all that digital data from people’s health experiences into new medical knowledge. But the issue of churn that I think that I would credit you introducing me to calls that into question because you’re right. Over a four-year period, you don’t get the longitudinal view of a person’s health that gives you the ability to get those medical insights. And so something needs to change there. But it’s very much tied to what consumers want to do. Consumers move around; they change jobs. FARR: Yes. LEE: If it’s cash-based, they’ll be shopping based on all sorts of things. And so it … FARR: And so the natural end of all this, it’s two words: single payer. [LAUGHS] But we don’t want to go there as a country. So, you know, it sort of left us in this kind of murky middle. And I think a lot about, kind of, what kind of system we’ll end up having. What I don’t think is possible is that this current one is sustainable. LEE: You know, I do think in terms of the payer of CMS [Centers for Medicare and Medicaid Services], Medicare and Medicaid services, the amount of influence that they exert on health spending in the US has been increasing steadily year by year. And in a sense, you could sort of squint and view that as a slow drift towards some element of single payer. But it’s definitely not so intentional or organized right now. While we’re talking about these sorts of trends, of course, another big trend is the graying of America. And we’re far from alone, China, and much of the Orient, Europe, UK, people are getting older. And from the consumer-patient perspective, this brings up the challenge, I think, that many people have in caring for elderly loved ones. 
And this seems to me, like women’s health, to be another area where if I were starting a new digital health company, I would think very seriously about that space because that’s another space where there can be extreme intensity of engagement with the healthcare system. Do you as both a human being and consumer but also as an investor, do you think about that space at all? FARR: Oh, yes, all the time. And I do think there’s incredible opportunity here. And it’s probably because of the same kind of biases that exist that, you know, didn’t allow us to see the menopause opportunity, I think we’re just not seeing this as being as big as it is. And like you said, it’s not just an American problem. It’s being felt across the world. And I do think that there are some, you know, I’ve seen some really interesting stuff lately. Was recently spending some time with a company called Cherish Health out of Boston, and they’re using AI and radar-based sensing technologies to just be able to stick a device and like really anywhere in the person’s home. And it just like passively is able to detect falls and also kind of monitor kind of basic health metrics. And because it’s radar, it can operate through walls. So even if you’re in the bathroom, it still works, which has been a big problem with a lot of these devices in the past. And then, you have to have really advanced kind of AI and, you know, this sort of technology to be able to glean whether it’s a true fall or, you know, that’s really, you need help or it’s, you know, just the person sitting down on the floor to play with their grandchild. So things like this are, they’re still early, but I think really exciting. And we’re going to see a lot more of that in addition to, you know, some really interesting companies that are trying to think more about sort of social needs that are not healthcare needs, but you know, this, this population needs care, like outside of just, you know, medical treatment. 
They oftentimes may be experiencing homelessness, they might experience food insecurity, there might be a lack of just caregivers in their life. And so, you know, there are definitely some really interesting businesses there, as well. And then kind of a, you know, another trend that I think we’ll see a lot more is that, you know, countries are freaking out about the lack of babies being born, which you need to be able to … you know, I recognize climate change is a huge issue, but you also need babies to be born to support this aging population. So I think we’re going to see, you know, a lot more interest from these administrations around, you know, both like child tax credits and various policies to support parents but then also IVF [in vitro fertilization] and innovation around technology in the fertility space. LEE: All right. So we’re starting to run towards the end of our time together. So I’d like to get into maybe a couple more provocative or, you know, kinds of questions. So first, and there’s one that’s a little bit dark and another that’s much lighter. So let me start with the darker one so we can have a chance to end on a lighter note. I think one of the most moving pieces I’ve read from you recently was the open letter to your kids about the assassination of Brian Thompson, who’s a senior executive of UnitedHealth Group. And so I wonder if you’re willing to share, first off, what you wrote there and then why you felt it was important to do that. FARR: Yeah. So, you know, I thought about just not saying anything. That was my original intention because it was just, you know, that moment that it happened, it was just so hot button. And a lot of people have opinions, and Twitter was honestly a scary place, just with the things that people were saying about this individual, who, you know, I think just like had a family and friends and a lot of my network knew him and felt really personally impacted by this. 
And I, you know, it was just a really sad moment, I think, for a lot of reasons. And then I just kind of sat down one evening and I wrote this letter to my kids that basically tried to put a lot of this in context. Like what … why are people feeling this way about our healthcare system? You know, why was all this sort of vitriol being really focused on this one individual? And then, you know, I think one of the things I sort of argued in this letter was that there’s lots of ways to approach innovation in the space. You can do it from the outside in, or you can do it from the inside out. And I’ll tell you that a lot of like, I got a lot of emails that week from people who were working at health plans, like UnitedHealth employees, some of them in their 20s, you know, they were recent kind of grads who’d gone to work at this company. And they said, you know, I felt like I couldn’t tell my friends, kind of, where I worked that week. And I emailed back and said, “Look, you’re learning healthcare. You are in an incredible position right now. Like whether you choose to stay at your current company or you choose to leave, like you, you understand like the guts and the bowels of healthcare because you’re working at the largest healthcare company in the world. So you’re in an enviable position. And I think you are going to be able to effect change, like, more so than anyone else.” And that was part of what I wrote in this letter, that, you know, we should all agree that the system is broken, and we could do better. Nothing about what happened was OK. And also, like, let’s admire our peers and colleagues that are going into the trenches to learn because I genuinely believe those are the people that, you know, have the knowledge and the contacts and the network to be able to really kind of get change moving along, such desperately needed change. LEE: All right. 
So now one thing I’ve been asking every guest is about the origin story with respect to your first encounter with generative AI. How did that happen, and what were your first sort of experiences like? You know, what emotionally, intellectually, what went through your mind?
FARR: So probably my first experience was I was really struggling with the title for my book. And I told ChatGPT what my book was about and what I wanted the title to evoke and asked it for recommendations. And then, I thought the first, like, 20 were actually pretty good. And I was able to say, can you make it a bit more witty? Can you make it more funny? And it spat back out some quite decent titles. And then what was interesting is that it just got worse and worse, like, over time and just ended up, like, deeply cheesy. [LAUGHTER] And so it sort of both like made me think that this could be a really useful prompt for just brainstorming. But then there does seem to be some weird thing with AI where, like, the more you push it on the same question, it just, like, it doesn’t … it seems to have sparked the most creativity in the first few tries, and then it just gets worse. And maybe you know more about this than I would. You certainly know more about this than I do. But that’s been my kind of general experience of it thus far.
LEE: Mm-hmm. But would you say you were more skeptical or awe-inspired? What were the emotions at that moment?
FARR: Um, you know, it was better than, like, a lot of my ideas. [LAUGHTER] So I definitely felt like it was from that perspective very impressive. But then, you know, it seemed to have the same human, like I said, we all kind of run out of ideas at some point and, you know, it turns out, so do the machines. So that was interesting in and of itself. And I ended up picking, I think a title that was like sort of, you know, inspired by the AI suggestions, but definitely had its own twist that was my own.
LEE: Well, Chrissy, I’ve never known you as someone who runs out of ideas, but this has been just great. As always, I always learn a lot when I have a chance to interact with you or read your writings. And so, thank you again for joining. Just really, really appreciate it.
FARR: Of course, and next time I want to have you on my podcast because I have a million questions for you, too.
LEE: Sure, anytime.
FARR: Amazing. OK, I’ll hold you to that. Thanks so much for having me on.
[TRANSITION MUSIC]
LEE: I’ve always been impressed not only with Chrissy’s breadth and depth of experience with the emerging tech trends that affect the health industry, but she’s also a connector to key decision-makers in nearly every sector of healthcare. This experience, plus her communication abilities, make it no surprise that she’s sought out for help in a range of go-to-market, investor relations, social media, content development, and communications issues.
Maybe it shouldn’t be a surprise, but one thing I learned from our conversation is that the business of direct-to-consumer health is still emerging. It’s far from mature. And you can see that Chrissy and her venture-investing colleagues are still trying to figure out what works. Her discussion, for example, on cash-only health delivery and the idea that consumers might not want another app on their phones were indicative of that.
Another takeaway is that some areas, such as pre- and postnatal care, menopause, elder care, and other types of what the health industry might call subacute care are potentially areas where not only AI might find the most impact but also where there’s sufficient engagement by consumers to make it possible to sustain the business.
When Carey, Zak, and I started writing our book, one of the things that we started off with was based on a story that Zak had written concerning his 90-year-old mother.
And of course, as I had said in an earlier episode of this podcast, that was something that really touched me because I was having a similar struggle with my father, who at the time was 89 years old. One of the things that was so difficult about caring for my father is that he was living in Los Angeles, and I was living up in the Pacific Northwest. And my two sisters also lived far away from Los Angeles, being in Pittsburgh and in Phoenix. And so as the three of us, my two sisters and I, tried to navigate a fairly complex healthcare system involving a primary care physician for my father plus two specialists, I have to say over a long period of illness, a lot of things happen, including the fraying of relationships between three siblings.
What was so powerful for us, and this is where this idea of patient empowerment comes in, is when we could give all of the data, all of the reports from the specialist, from the primary care physician, other information, give it to GPT-4 and then just ask the question, “We’re about to have a 15-minute phone call with one of the specialists. What are the most important two or three things we should ask about?” Doing that just brings down the temperature, eliminates a potential source of conflict between siblings who are all just wanting to take care of their father. And so as we think about the potential of AI in medicine, this concept of patient empowerment, while still emerging, as we’ve learned in this episode, I think in the long run could be the most important long-term impact of this new age of AI.
[THEME MUSIC]
I’d like to say thank you again to Dave and Chrissy for sharing their stories and insights. And to our listeners, thank you for joining us. We have some really great conversations planned for the coming episodes, including a discussion on regulations, norms, and ethics developing around AI and health. We hope you’ll continue to tune in. Until next time.
[MUSIC FADES]

