Autonomous vehicles rely on very complex artificial intelligence, so the industry needs to build safe, reliable, and trustworthy AI systems. Swiss company Lakera can bring AI to a mission-critical level of safety.
Henry Ford once said that “quality is doing the right thing when no one is looking”. As our world becomes increasingly connected, we are now expected to do the right thing while everyone but us is looking. Connected and autonomous vehicles promise to make the world a better place: from optimized electric cars to autonomous transport, transparent supply chains, and independence for people of all abilities. We are seeing benefits to society beyond the Herbies, Javas, and Knight Riders of Hollywood.
However, these autonomous vehicles (AVs) rely on complex artificial intelligence (AI), and while AVs are mission-critical systems, AI is still being developed like consumer technology. The Silicon Valley saying “move fast and break things” does not and should not apply to human life. Nobody ever said, “Move fast and break humans.” So how do we bring AI to a mission-critical level of safety? We can take inspiration from traditional software systems.
What can we learn from software?
As C. A. R. Hoare’s classic 1996 article “How did software get so reliable without proof?” points out, software’s reliability comes from rigorous development processes, continuous improvement of existing software, and extensive testing. There is a process. Software engineers are more than familiar with concepts such as:
- Test-driven development
- Unit tests
- Regression tests
- Integration tests
Tests are a part of CI/CD (continuous integration/continuous delivery) pipelines. Engineers don’t merge code unless all tests have passed. By the time the code goes to production, engineers are confident that the software works as expected. They follow a “test-to-ship” strategy.
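To make the idea concrete, here is a minimal sketch of such a gate: a pytest suite that a CI pipeline runs on every merge request, blocking the merge if anything fails. The function under test, compute_braking_distance, is a hypothetical stand-in for any safety-relevant component.

```python
# test_braking.py: a minimal "test-to-ship" gate, run by CI on every merge request.
# compute_braking_distance is a hypothetical stand-in for real vehicle code.

import pytest


def compute_braking_distance(speed_mps: float, deceleration_mps2: float = 8.0) -> float:
    """Idealized stopping distance v^2 / (2a), in metres."""
    if speed_mps < 0 or deceleration_mps2 <= 0:
        raise ValueError("speed must be >= 0 and deceleration must be > 0")
    return speed_mps ** 2 / (2 * deceleration_mps2)


def test_stationary_vehicle_needs_no_distance():
    assert compute_braking_distance(0.0) == 0.0


def test_distance_grows_with_speed():
    assert compute_braking_distance(30.0) > compute_braking_distance(15.0)


def test_invalid_input_is_rejected():
    with pytest.raises(ValueError):
        compute_braking_distance(-1.0)
```

Running `pytest` is then a single step in the pipeline; a red test means the change simply does not ship.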
How do we currently test AI?
To evaluate the performance of ML (machine learning) systems, it’s common practice to split our dataset into training, validation, and testing subsets. The first two become part of the model training loop, whereas the testing subset is used separately to assess performance on unseen data. A typical evaluation would include calculating various metrics over these data subsets and using them as an indication of real-world system performance.
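As a toy illustration of that workflow, the scikit-learn sketch below splits a synthetic dataset into the three subsets, trains a classifier, and reports validation and test metrics. The dataset and model are generic placeholders, not anything AV-specific.

```python
# A sketch of the conventional train/validation/test evaluation using scikit-learn.
# The synthetic dataset and random-forest model are placeholders for illustration only.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# Hold out 20% as the test set, untouched during training and model selection.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# Split the remainder into training and validation sets for the training loop.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Validation metrics guide model choices; test metrics serve as the proxy
# for real-world performance.
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
print("test accuracy      :", accuracy_score(y_test, model.predict(X_test)))
print("test F1            :", f1_score(y_test, model.predict(X_test)))
```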
This strategy is insufficient. Many teams find that their ML systems end up performing “well enough” on their carefully selected datasets but are too brittle to be used in the real world. At the same time, creating complete quantitative testing and release processes is often seen as time-consuming, especially within smaller teams.
We have observed many teams that instead spend a lot of time on qualitative testing, which tends to fall short of building a thorough understanding of performance. As a result, computer vision development follows a “ship-to-test” strategy.
The fact that ML systems are only truly tested during operation has obvious and major implications for AV development. These systems carry significant risk because vulnerabilities tend to surface only in the field: pedestrians that go undetected at night, for example. At best, this leads to low customer satisfaction or products that never make it to market; at worst, it puts people in danger.
Combining the best of both worlds
The good news is that we can bring these concepts from traditional software development into ML development. We need to ensure that vulnerabilities are found during development, not during operation. In other words, we need to bring back ‘test-to-ship.’
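What might that look like in practice? One option, sketched below, is to express release criteria as per-slice metric thresholds and enforce them with the same pytest-and-CI machinery used for ordinary unit tests. The module my_perception, its loaders, and the thresholds are hypothetical placeholders for a team’s own data, model, and requirements.

```python
# test_perception_slices.py: a sketch of "test-to-ship" for an ML component.
# Per-slice recall thresholds are enforced in CI like any other unit test.
# `my_perception`, `load_labeled_slice`, and `load_latest_detector` are
# hypothetical placeholders for a team's own code.

import pytest
from sklearn.metrics import recall_score

from my_perception import load_labeled_slice, load_latest_detector  # hypothetical

# Minimum pedestrian recall we are willing to ship, per operating condition.
SLICE_REQUIREMENTS = {
    "daytime": 0.95,
    "night": 0.90,
    "heavy_rain": 0.90,
}


@pytest.mark.parametrize("slice_name,min_recall", SLICE_REQUIREMENTS.items())
def test_pedestrian_recall_per_slice(slice_name, min_recall):
    images, labels = load_labeled_slice(slice_name)
    detector = load_latest_detector()
    predictions = detector.predict(images)

    recall = recall_score(labels, predictions)
    assert recall >= min_recall, (
        f"Pedestrian recall on '{slice_name}' is {recall:.3f}, "
        f"below the release threshold of {min_recall:.2f}"
    )
```

If night-time recall regresses below its threshold, the merge is blocked, exactly like a failing unit test, and the vulnerability is found during development rather than on the road.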
We can apply best practices from the software industry to create reliable machine learning systems. Operationalizing AI and putting systematic testing at the core of machine learning development will be the driving force behind bringing autonomous vehicles to market – while providing uncompromising safety for everyone.