AI Takes the Wheel: New Advances in Autonomous Driving
Can generative AI change how we drive?
Artificial Intelligence is on every business leader’s agenda. How do we make sense of the fast-moving new developments in AI over the past year? Azeem Azhar returns to bring clarity to leaders who face a complicated information landscape.
This week, Azeem joins Alex Kendall, co-founder and CEO of autonomous driving start-up Wayve, to uncover how the AI revolution is enabling new strides in self-driving. They delve into the implications of these advancements for urban mobility and the transformation of cities in the future.
They discuss:
- How business models in the automotive industry are shifting towards AI integration and subscription-based services.
- The role “embodied AI” is playing in shaping everyday assistance, beyond just digital interactions, in the future.
- The challenges and breakthroughs of applying AI in complex, unpredictable environments, like road traffic.
Further resources:
- Ride the Wayve: Azeem Azhar Goes for an Autonomous Drive on London’s Toughest Roads (Wayve, 2023)
- UK Start-up Wayve Unveils Self-Driving System that Explains Its Actions (Financial Times, 2023)
AZEEM AZHAR: Hi, I’m Azeem Azhar, founder of Exponential View and your host on the Exponential View podcast. When ChatGPT launched back in November 2022, it became the fastest-growing consumer product ever and it catapulted artificial intelligence to the top of business priorities. It’s a vivid reminder of the transformative potential of the technology. And like many of you, I’ve woven generative AI into the fabric of my daily work. It’s indispensable for my research and analysis. And I know there’s a sense of urgency out there. In my conversations with industry leaders, the common thread is that urgency. How do they bring clarity to this fast-moving, noisy arena? What is real and what isn’t? What, in short, matters? If you follow my newsletter, Exponential View, you’ll know that we’ve done a lot of work in the past year equipping our members to understand the strengths and limitations of this technology and how it might progress. We’ve helped them understand how they can apply it to their careers and to their teams and what it means for their organizations. And that’s what we’re going to do here on this podcast. Once a week, I’ll bring you a conversation from the frontiers of AI to help you cut through that noise. We record each conversation in depth for 60 to 90 minutes, but you’ll hear the most vital parts distilled for clarity and impact on this podcast. If you want to listen to the full unedited conversations as soon as they’re available, head to exponentialview.co. My next conversation started in the front seat of a self-driving car. I joined Alex Kendall, the co-founder and CEO of Wayve. It’s an autonomous driving startup, and we went for a test drive on some really unpleasant roads in North London. Alex’s team is behind some impressive work using generative AI and synthetic data to train autonomous systems that can safely navigate roads and human behavior in complex environments. I was really impressed by the experience.
This car driving itself navigated some quite difficult roads, cyclists all over the place, people crossing without looking, a huge bevy of police and ambulances and roadworks. The types of things you often see in London. You’ve got to watch the video for yourself. We had a couple of GoPros recording us and our reactions. The link to that video is in the show notes. Now, Alex and I go under the hood of Wayve’s self-driving systems. We discuss the evolving business model in the car industry and what needs to happen to bring improved safety through AI to cars on our streets, and potentially one day self-driving vehicles as well. Enjoy. Alex, we’re now back in your office and I have to say I am still buzzing and reflecting on that amazing experience, that 15, 20 minute drive we did around some pretty tough streets in London in one of your vehicles powered by your AI system. Is that the normal experience? Is that how you find people responding to their first drive in a Wayve?
ALEX KENDALL: It’s magic every time. I try and get out in the car every week and stepping out of it… Every time you go for a drive you see something new, whether it’s different weather, different road layouts. We had some crazy interesting scenarios today and it’s always a treat to see how the AI learns and grows over time. And when you see a new behavior for the first time or something like that. I mean, most people have had their ChatGPT moment with AI, but for me getting in a physical car and seeing it interact in the real world, there’s nothing like that. It’s really special.
AZEEM AZHAR: I like the way you’ve used that actually, the ChatGPT moment, because when I first used ChatGPT, and we’re recording this nearly a year to the day from the launch of ChatGPT, it really was a moment and I came away, as scientists say, rethinking my priors. And I would say that having been in the car for 20 minutes, sitting next to the safety driver that you have to have for legal reasons, watching where his hands were, near the wheel but off the wheel, while we drove through those really difficult streets and ran across situations which I could imagine someone who had been driving for only three or four years would’ve really, really struggled with, and the car handled them with real aplomb. So does that feel… Am I rethinking my assumptions about this technology after that? Well, right now I am, but it’s really fresh in my mind, so yes. So maybe it is a bit like a ChatGPT moment. Maybe it is. But let me ask you, what was the journey that took you from a researcher to deciding that this was something you wanted to build, that it even had potential, that the time frames were right for it?
ALEX KENDALL: Well, going back to the ChatGPT comment, if you ask people what AI is today, I think that’s what people gravitate to. And don’t get me wrong, that technology is mind-blowing and incredible. The first thing I asked ChatGPT when I was playing with it was the trolley problem: how do you navigate risk when you’re self-driving a car? It was interesting to see how it answered. But I think in 10, 20 years, if you ask people what AI is, they’re going to be referring to embodied AI, the system that, whether it’s the bipedal robot in their living room helping out with their domestic tasks or the autonomous self-driving car that just dropped them off or delivered their groceries or what have you, is the AI that’s going to be around us, able to support the lives we live, accelerate what we do, free up our time and make our lives safer. This is where I think we’re going to go. And when we founded Wayve, the reason why we started was that in order to build that future, it’s not going to be built by a robot that’s hand-programmed to drive, or to operate with a set number of rules in a set given way, because the world is just too unstructured, too unpredictable. You need to have a system that has the intelligence to understand that and make its own decisions. And there’s no better way of doing that than end-to-end deep learning, big machine learning models that can learn through data, learn things that are more complex than we can hand-program as engineers. And I’d seen that during my research in computer vision. I’d been inspired by similar breakthroughs at the time, whether it was AlphaGo being able to solve the hardest board game in the world and learn how to beat the world champion at it, the work that Google DeepMind did.
These kinds of things at the time made me think, look, now’s the right time to go and build this in an embodied physical system and we can actually go and make this technology safe to be deployed in the physical world.
AZEEM AZHAR: Right. So that’s actually very helpful to understand, because deep learning has been with us since a moment of inflection probably around 2010, 2011, although the theory and the first prototypes are somewhat older than that. And what it did was challenge the assumption that in order to build AI systems, you would need lots and lots of rules about the world. That approach is now called GOFAI, good old-fashioned AI, and the joke was that it would construct millions and millions of if-then-else statements, right? If you see a child, then brake, else continue as you were driving. And you have to construct so many rules, it becomes very complex. So was that the approach that the autonomous vehicle industry started with when they used to have those DARPA challenges across the desert in the US? Is that how those systems were built 15 years ago?
ALEX KENDALL: Well, you make a good observation there. I mean, this has been the same pattern that’s played out, whether it’s the very first systems that could beat humans at chess or image recognition or even ChatGPT, as we talked about before. People used to build those technologies with rule-based systems, and it was going to an end-to-end neural network that enabled them to hit this inflection point. But actually one of the very first autonomous vehicles, in, I think, 1989 or around that time, was an end-to-end learning approach. It was built by Dean Pomerleau and colleagues at Carnegie Mellon University in the US. And he was inspired because, from what I’ve read, he wanted to build a system that could drive on the east and west coasts of the US. It could generalize, it could scale. And so he built a very, very small neural network, I think hundreds of parameters, but it could get a vehicle at the time to do lane following. But after that, we had this AI winter where, for many factors, compute resources, algorithmic maturity, things like this, it wasn’t possible to scale these systems. In the meantime, rules-based systems were the fashionable thing. And in the 2005 to 2007 era, the US government poured a lot of funding into those DARPA grand challenges, as you described. They allowed academic research labs to scale up rules-based systems built on mapping and LiDAR technologies. And ultimately, when Google ended up acquiring and funding one of them commercially, that’s what led to the first commercial self-driving car effort. And since then we’ve seen a lot of offshoots from that project that have formed many of the existing commercial efforts today. But the result has been that the prevailing commercial technology is that traditional rules-based stack. So we’re not the first, and it’s not a new idea to do end-to-end learning for self-driving.
Many have tried along the way, but when we started in 2017, I think there were a number of reasons around timing that meant it’s been possible to build and scale it now. And then all of the breakthroughs in foundation models and generative AI have simply accelerated this. But certainly in 2017 when we started, there were multi-billion dollar efforts behind these rules-based approaches, and everyone thought, look, it’s going to scale and commercialize, it’s going to do that in a year, it’s all done, the market has won. I think there was a certain sense of contrarianism and bravery that we had to grasp in order to set off on this alternate path, which has been a very contrarian approach for the last few years.
AZEEM AZHAR: Well, I mean, you and I have met a few times over the past few years, and I think you may remember that I was very skeptical about your contrarian approach initially back then. I mean, as I was skeptical about the self-driving efforts that were happening elsewhere, because they kept missing their goals and it was becoming clear that the problem was more and more complex. So I’m very happy to be here, having had the drive in the car, and I’ve had an opportunity over the last couple of years to say, you’re doing really interesting things and so on. But I was, myself, skeptical about whether you could do this with end-to-end learning and without using additional sensor packages. And I think it’s worth saying that your current system works using just a series of cameras, whereas a large part of the autonomous vehicle industry had assumed that you would need a wide variety of sensors, including LiDAR, which is, I suppose, laser-style radar, maybe other types of radar, maybe ultrasonic sensors, many different types of cameras. So I was quite skeptical. I’m happy to admit that and happy to see the progress that you’ve made.
ALEX KENDALL: Yeah, the sensing question is an important one, but the first point I’d make is that you can’t consider the sensor packages without also considering the AI or the software that’s running behind them. You could have all the best sensors in the world, but the worst AI, and your system would be terrible. Or you could have the worst sensors and the best AI, and it can actually work. You can see examples in the animal kingdom, like the mantis shrimp. It’s got the best eyes in the animal kingdom, far better than human eyes, but its intelligence is very poor and you wouldn’t trust a mantis shrimp to drive your car. So you need to look at both together. And from our perspective with our AI driver, we’ve built it in a way that makes it agnostic to the sensors it uses. So it can learn to drive with cameras, radar, LiDAR, whatever future sensors people invent. And we want to stay on the bleeding edge there. But the point is that we need to be able to learn to adapt to those different signals and learn to generalize across them. To make a safe system, you want to have some redundant sensing modality. You want to have two different types that give you protection against orthogonal failure modes and things like that. And you also want to have a system that’s affordable and scalable, and an AI that’s good enough to handle that. And so bringing those factors together, I think camera-radar solutions make a lot of sense today, because cameras and radars are in most production vehicles today.
AZEEM AZHAR: They’re cheap.
ALEX KENDALL: We know how to manufacture them, they’re cheap, hundreds of dollars each, but I am convinced that people will invent amazing new sensors in the future and we need to make sure our AI is capable of learning to adopt them and improving the safety of the system.
AZEEM AZHAR: So let’s do this. Let’s step back slightly into the old paradigm and describe it, so we understand what it is and what your new paradigm looks like. As I understand it, the old paradigm was that the brain of the system would be largely, but not exclusively, driven by a complex set of rules. And you might have modules that use learning for things like machine vision but not for the decision-making. So you’ve got a machine vision module, which might be deep learning, that is helpful for identifying whether you’ve got a fox or a human or a bicycle, but then the decision-making goes into the rules-based system. You would then have a set of sensors, quite likely with the assumption that you would need this expensive LiDAR sensor as well as cameras. And then you would have the physical hardware that’s doing all of this processing, and you might have many different subsystems that all come together at one point. And so that becomes quite expensive. Is that a fair reflection, first of all, of generation one, of what the industry thought?
ALEX KENDALL: I think that’s reasonable for the AV1.0 solutions out there. So if you contrast that with what we’ve set out, we replace that entire software stack with one giant neural network: a large transformer-based deep learning model that takes the sensor data as input and outputs a motion plan. And that allows us to simplify the whole system. It allows us to learn holistically and optimize it together, and it allows us to operate with much more compute efficiency. So it allows us to operate with just cameras and radar, operate on a single GPU and simplify the entire software stack.
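To make the contrast concrete for readers: an end-to-end driver collapses the perception-then-rules pipeline into a single learned function from sensor data to a motion plan. The sketch below is purely illustrative, with invented shapes and random, untrained weights; it is not Wayve's architecture, only the general shape of the idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "end-to-end" driver: flatten camera/radar features, pass them
# through one hidden layer, and emit a short motion plan of (x, y)
# waypoints. Weights are random here; in practice they are learned.
W1 = rng.standard_normal((64, 32)) * 0.1   # input features -> hidden
W2 = rng.standard_normal((32, 10)) * 0.1   # hidden -> 5 (x, y) waypoints

def drive(sensor_features):
    hidden = np.maximum(0.0, sensor_features @ W1)  # ReLU activation
    return (hidden @ W2).reshape(5, 2)              # 5 waypoints, (x, y)

frame = rng.standard_normal(64)   # stand-in for one frame of sensor data
plan = drive(frame)
print(plan.shape)  # (5, 2): a motion plan, not hand-written rules
```

The point of the sketch is the interface, not the internals: nothing in the middle is an "if you see a child, then brake" rule, so the whole stack can be trained and optimized as one piece.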
AZEEM AZHAR: So this is what you would call AV2.0. It’s the Wayve approach. And again, drawing parallels, as we have, to large language models that are out there: large language models learn on trillions of tokens, which is effectively trillions of words of text. And what we understand from that process is that they learn from the text on a large rig of GPUs with some very, very dedicated processors, and that might take days or weeks of training, it might cost millions of dollars. Once that’s done, you then go in and fine-tune that language model for your particular use case, and that might be to make the language model very, very good at answering questions. And then you might further fine-tune it using some kind of reinforcement learning with human feedback, which is a Pavlovian mechanism of giving the dog a treat when it does well and a stiff word when it doesn’t, in order to get it behaving within a set of parameters that are acceptable. So that’s a process, I think, that many people will understand, because we’ve said a lot about that with large language models over the last couple of years. How similar is that to your training process? I mean, what is your equivalent of tokens? What does the inference, sorry, the training stage look like, and is there a fine-tuning stage?
ALEX KENDALL: It’s very analogous, and one of the interesting observations from the last couple of years is that when I was doing my PhD, if you were building a computer vision or a speech or a game-playing agent or whatever AI problem you were working on, it had a very different architecture and it was customized for that problem. But over the last few years, we’ve seen everything converge to a single transformer architecture, or whatever the new architecture is. It is almost converging to the same types of models. And I think we’re seeing the same thing here, where these models are so powerful and can be trained on such diverse sets of data, even multimodal data. We were talking about how the AI is not just trained on driving data, but now also text data and other sources: synthetic, generative AI, simulation data. And we essentially want to have as much data in our model as possible, to give it the richest understanding of the world. But similar to large language models, there is that pretraining step of giving it that base level of knowledge. And then there’s a fine-tuning, RLHF-style step where you’re giving it feedback to drive in the specific way that our customers will want. Whether the AI is supporting a logistics fleet or a consumer passenger vehicle, whether it’s driving in London, in San Francisco, or whichever country around the world, there are local behaviors, rules, customs that it needs to be able to respond to. But the base foundation model is going to be shared across all of those jurisdictions.
AZEEM AZHAR: Roughly how much training data do you need? What is your equivalent of the 3 trillion tokens? Is it 3 trillion hours? Is it 300 hours of driving? Is it lots of camera data? What does that quantum look like?
ALEX KENDALL: We’ve started off with a very modest amount of data. We’ve got a small fleet. We’ve partnered with some of the biggest grocery delivery companies here in the UK, and that’s given us a small amount of data to get started, but we are looking to grow this substantially. I think that hundreds of thousands, millions of hours is probably the level of data that you are going to want to train and verify a level four autonomous driving system. But I think there’s a chance to offset a lot of that through other sources of information, whether it’s internet text and video, whether it’s synthetic data that’s generated through a generative AI system. Ultimately, we should be learning from the most efficient information sources possible.
AZEEM AZHAR: For the average person, the real large language model moment was ChatGPT back in November 2022. But for people in the AI world, it was really when OpenAI delayed the release of GPT-2 because they were concerned with how it could be used. And anyone using GPT-2 would be slightly disappointed with the experience, unless you were an AI researcher and could understand what had happened. And what we’ve seen since GPT-2 was GPT-3, 3.5, and then competitors, and there’s been a substantial performance improvement: quality, the context window, the level of hallucinations. And there’s also been an understanding of what’s known as scaling laws within transformer models. In other words, how much better do they get for the additional training data and computation time you put into them? So I’m curious about where you think you are today with Wayve?
ALEX KENDALL: The exciting and perhaps daunting thing is that I think our AI model today is at the equivalent of a GPT-1 level of maturity. It is very small in parameter count and the amount of data it’s trained on. The roadmap that we have in front of us over the next one to two years is going to scale the size of… follow those scaling laws and bring us up to a GPT-3-level size of model. Now, you’ve seen the performance that we can achieve at the small scale. The interesting thing that we’ve observed is that many of the dynamics that have played out in large language models are paralleled in our model. For me, there have really been three big trends this year in the LLM AI space. The first one, as you say, scaling laws. Second, multimodality. And third, synthetic data. Just quickly to talk through those three. The first one, on scaling laws: there is such a strong response of performance of these systems against data and compute, and we see the same thing with our embodied AI systems. As you increase the amount of data, as you train on more compute, the performance goes up with a very clean correlation. Now, that’s what I think is going to allow us to solve edge cases at scale. That’s the big problem in self-driving. How do you deal with the weird and unusual and unlikely scenarios and make sure that you’re safe and robust to them? The traditional approach to self-driving required armies of engineers to identify and solve all of those scenarios, and that becomes exponentially hard as you grow. Scaling laws allow you to just get better the more data and compute you put into the system, and that growth curve is something that we can ride and use to address the long-tail problem of edge cases.
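The "very clean correlation" Alex describes is usually summarized as a power law: loss falls smoothly as data or compute grows, which looks like a straight line on log-log axes. A minimal sketch with made-up numbers, not Wayve's or anyone's real measurements:

```python
import numpy as np

# Hypothetical scaling-law data: loss falling as a power law in compute,
# loss = a * C**(-alpha). All numbers here are invented for illustration.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
loss = 3.0 * compute ** -0.05

# On log-log axes a power law is a straight line, so fit one.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
alpha = -slope          # recovered exponent
a = np.exp(intercept)   # recovered constant
print(alpha)  # ~0.05: each 10x of compute buys a predictable loss drop
```

This predictability is what makes scaling a strategy: once the exponent is measured at small scale, you can extrapolate how much data and compute the next level of performance should cost.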
AZEEM AZHAR: Can I ask about one particular edge case? So there was a horrible example with Uber’s autonomous vehicle and it knocked over the woman pushing the bike across the road. And I’m going to ask, did you take that data and construct a simulation to test your vehicles against in silico as opposed to on the road against something like that? And if you did, what did you learn from it?
ALEX KENDALL: Well, I mentioned the first one on scaling laws, but let me jump to the third one, on synthetic data. We’ve seen this big movement in large language models from being trained just on internet data to synthetic data, where you can control the bias, you can control the distribution of that training data. And we’re seeing the same thing here as well. And I raise this because when you have scenarios like the one you describe, you want to make sure that you can understand all the different permutations of how they might play out and ensure that you’re robust to them, to make sure that never happens again. Our generative AI technology, such as Gaia, the model we released earlier this year, allows us to take experiences that we’ve either had or observed and replay them in new ways and re-simulate them through a video generative AI model. In fact, it’s a full world model that allows us to drive and re-experience that.
AZEEM AZHAR: Yeah, Gaia created a real splash in the AI community. I certainly saw some of the big, big luminaries talking about it. So help me understand that. Is it constructing, effectively, a synthetic world against which you can train your AIs?
ALEX KENDALL: Well, this goes back to one of the differences from large language models in self-driving. It’s a safety-critical application. You need to be really sure you’re making the right decision so that you don’t cause any harm, because you’re interacting with the physical world. If you get a text prompt wrong, I mean, it’s not usually safety-critical. And so that’s led us for many years to pursue a line of research on enabling a world model, enabling the AI model to understand the implications of its decisions, allowing it to simulate forward. If I take this certain action, what is going to happen in the future? And so Gaia is the culmination of six years of work there. It’s a large ten-billion-parameter world model that’s able to take your current situation and your current action and predict how things might unfold in a multimodal way. It can look at the different scenarios that might play out and use that to ensure that when you make a decision, you’re aware of the consequences, and the AI is able to make a decision that’s safe and robust.
AZEEM AZHAR: And the key thing there is you said multimodal. So in a way, Gaia can effectively come out with a forecast, a transcript of what the implications of particular actions would be, which in certain driving-
ALEX KENDALL: Yeah, when you’re in a really busy intersection, that’s essential, right? Might a car speed and cut you off, or are people going to obey the road rules or are the lights going to change and do you need to be ready to adapt?
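The world-model idea in this exchange, simulating a candidate action forward before committing to it, can be caricatured in a few lines. The dynamics below are invented one-dimensional physics, nothing from Gaia, which is a learned video model rather than hand-written equations; the sketch only shows the planning pattern a world model enables.

```python
# Toy world model: the car's state is (position, speed) along a lane,
# and an "action" is an acceleration. Rolling the model forward lets a
# planner compare candidate actions before acting. Dynamics invented.
def step(state, action, dt=0.1):
    pos, speed = state
    speed = max(0.0, speed + action * dt)
    return (pos + speed * dt, speed)

def rollout(state, action, horizon=20):
    for _ in range(horizon):
        state = step(state, action)
    return state

start = (0.0, 10.0)        # 10 m/s down the lane
obstacle_at = 15.0         # simulated hazard ahead

# Compare hard braking vs. coasting by simulating each future.
for accel in (-8.0, 0.0):
    pos, speed = rollout(start, accel)
    print(accel, round(pos, 1), pos < obstacle_at)
    # braking stops short of the hazard; coasting does not
```

A learned world model plays the role of `step` here, but predicts rich sensor futures rather than two numbers, which is what lets it evaluate the busy-intersection scenarios Alex mentions.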
AZEEM AZHAR: That feels like, though, it’s a breakthrough in interpretability, which is one of the challenges that we have with deep learning models, which is that we were always struggling, particularly with the vision-based ones to interpret why they made the decisions they did. And when I saw Gaia come out and looking at its ability to generate explanations, I was thinking, wait, we are now able to explain the decision-making that’s going on. And that felt like it was a milestone of some sort.
ALEX KENDALL: There’s been a paradigm shift this year, even particularly in the automotive industry. A couple of years ago, everyone was just very dismissive of an end-to-end AI approach saying, how would this ever work? How is it safe? How can we trust it? This year, everyone has changed their tune.
AZEEM AZHAR: So why have they changed their tune?
ALEX KENDALL: I think there’ve been a few reasons. ChatGPT has a lot to answer for, people just being inspired about what is possible with AI. Secondly, the struggles and missed deadlines that we’ve seen with the traditional self-driving approach. People have realized how hard a problem it is and why you need to take an AI approach. And then thirdly, I think the inflection we’ve seen in performance and capabilities, I’d like to think, has played a small part. But I want to talk about what you said around interpretability, because that comes back to the second big trend that we’ve seen, on multimodality. And one of the exciting things there is that we’ve been able to build foundation models that cover many different modes of data. And so for us, that’s enabled us to train one which is an industry-first vision-language-action model. This means that it can not only see and act, drive a car, but also ground it in language. This opens up a ton of possibilities, whether it’s being able to interact and talk to our car, literally ask it questions and ask why and what it’s doing, to prompt it to drive in a certain way. And we put out a model, Lingo, which shows that capability. Or being able to understand why it made decisions, debug and interpret the system, and get it to explain its actions in text. And that just democratizes access to understanding it and opens up a whole new range of possibilities in trust and interpretability.
AZEEM AZHAR: I mean, there are so many little places you could take that: drive more like KITT, who’s been my favorite autonomous vehicle so far. Or, I’m feeling a little bit hungover, go easy on me. And just being able to tune the car in that way. I guess one of the things the automotive industry has been obsessed about is their score from zero to five, right? The so-called levels: level zero being where we are, no automation, all the way to level five, which is full automation. What David Hasselhoff had in the TV series Knight Rider. And we were promised full automation years ago by a number of the players in the industry. I guess what I’m seeing in the industry is a sense that level five, full automation, is quite far away. There’s even been a bit of a pullback from level three, which was, for me, always a weird one, where the car does everything but the human’s got to be ready to take control and not have drifted off to sleep or be daydreaming, because for the last hour, the car’s been doing everything for them. Back towards this idea of level two, partial automation, and that seems to be what is in the roadmaps of the big OEMs, the vehicle manufacturers, as a premium product that they can offer. But when you look at that zero-to-five score, how do you now think about it? Do you think it’s helpful? Do you think it’s become a distraction? Do you think it’s actually the way in which car companies will start to articulate what their offerings are?
ALEX KENDALL: So the SAE levels zero to five are interesting. I mean, it’s a helpful framework, but actually we’ve seen a couple of L3 products come on the market, though in a very limited way, under 50 kilometers an hour on a highway, for example, and only on certain highways. And so you want to look both at advancing on that score and at the breadth of capabilities you can offer. We’ve also seen L4 systems, for example, in San Francisco, but only in certain regions and times of day. And so there’s a question of not just advancing the score, but how generalized and scalable you can be. So look, I think what’s becoming clear is that taking an approach where, in a very limited environment, you look to retrofit a solution and trial it there is extraordinarily expensive and is going to be very hard to sustain. In contrast, I think what the industry is realizing is that self-driving is an AI problem. It’s a problem where you do need to have a learning loop that can expose your system to data, to experience, and allow it to learn and scale in performance, and for you to release L2, L3, L4 capabilities as you can both achieve and verify that level of performance.
AZEEM AZHAR: So in a sense, it’s a useful set of benchmarks because everybody understands them, but even those level three systems that you refer to, as you say, are very, very constrained, right? They’re not what we imagined level three to be four or five years ago. That’s why I think you have people referring to them as level two plus. So when you think about where you are in the market and you go off to talk to car companies with your technology, what are they buying from you?
ALEX KENDALL: There’s a perfect storm of so many different trends that are just catalyzing this year. It’s the ability for AI to scale and generalize, as we’ve talked about. Secondly, automotive manufacturers have made the investments to bring in everything from redundant actuation, surround cameras and a forward-facing radar, the sensing you need, plus a GPU platform like the NVIDIA Orin system. These are now coming on high-end vehicles as of this year and next year. And so the culmination of AI maturity, plus the fact that we are now seeing vehicles with an open embedded software platform and the right sensing and actuation to run these kinds of systems, creates a perfect storm that allows an AV2.0 technology like ours to be deployed in production vehicles. Now, this is notably far from the requirements of the AV1.0 architectures, which have significantly greater hardware, compute and sensing requirements. But with our approach, the AI capabilities and what is currently about to be manufactured and launched from automotive are colliding. And that is a really powerful prospect.
AZEEM AZHAR: Well, thanks for listening. What you heard was an excerpt of a much longer conversation. To hear the rest of it, go to exponentialview.co. Members of the Exponential View community get access to the full recording as soon as it is available, and they’re invited to continue the conversation with me and other experts. I do hope you join us. In the meantime, you can follow me on LinkedIn, Threads, and Substack for daily updates. Just search for Azeem, A-Z-E-E-M. Thanks.