A man sits at the wheel of a modern car while driving along a road.

How autonomous vehicles learn to drive

  • 30 July 2025
  • 8 minutes
  • Blog

Companies like Tesla, Waymo and Mercedes promise us a future in which it will not be necessary to go to driving school.

A car is driving along a street in the city centre. There is no one at the wheel. In front of a zebra crossing, it notices a pedestrian approaching the road. Without being told to do so, it slows down, stops and waits. Only when the pedestrian has completely crossed the roadway does it resume its journey... to the astonishment of the pedestrian himself, who has just seen a "ghost".

What represented "the future" just a decade ago is now being made possible by companies such as Waymo, Tesla and Mercedes-Benz, thanks to computer vision systems powered by cameras, depth sensors, LiDAR scanners and algorithms capable of analysing each image, identifying patterns and acting accordingly. Hence the promise: we may never need to set foot in a driving school again.

But what exactly is this technology, and why is it so important? As the students of the Master's in Artificial Intelligence at UDIT, University of Design, Innovation and Technology, are quick to discover, we are talking about a branch of artificial intelligence that, in addition to powering increasingly intelligent cars, is present in facial recognition cameras, in production lines that identify defects in milliseconds and in medical applications.

From pixels to understanding 

What exactly do machines understand? When we say that a machine "can see", what kind of "sight" are we talking about? What it actually does is interpret the world in a very different way to a human being. Instead of seeing objects or scenes, what it receives is a matrix of pixels: a succession of numerical values representing colours and intensities. From there, a complex process of visual interpretation begins that converts this data into knowledge.
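To make that concrete, here is a minimal sketch, in plain Python with NumPy, of what such a pixel matrix looks like; the values are invented for illustration, not captured by any real camera:

```python
import numpy as np

# A tiny 4x4 greyscale "image": each entry is a pixel intensity (0 = black, 255 = white).
image = np.array([
    [  0,  50, 120, 255],
    [ 10,  60, 130, 240],
    [ 20,  70, 140, 230],
    [ 30,  80, 150, 220],
], dtype=np.uint8)

print(image.shape)   # (4, 4): height x width
print(image[0, 3])   # 255: the brightest pixel, top-right corner

# A colour image simply adds a third axis: height x width x 3 channels.
colour = np.zeros((4, 4, 3), dtype=np.uint8)
colour[:, :, 0] = 255  # make every pixel fully red in an RGB layout
```

This is all the machine receives: numbers. Everything that follows is about turning those numbers into meaning.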

This whole process begins with a stage called pre-processing, in which the image captured by the camera is adjusted to facilitate its analysis. This can include tasks such as correcting the lighting if areas are too dark or too bright, removing visual noise (small imperfections or distortions that can confuse the system) and highlighting the edges of objects to make them easier to identify.
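Each of those three tasks maps onto standard image-processing calls. A minimal sketch with OpenCV, where the file name and the threshold values are illustrative rather than taken from any real vehicle pipeline:

```python
import cv2

# Load a frame; "frame.jpg" is a placeholder path, not a file from any real system.
img = cv2.imread("frame.jpg")

# 1. Correct lighting: histogram equalisation on the luminance channel only.
ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)
ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
img = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

# 2. Remove visual noise with a Gaussian blur.
img = cv2.GaussianBlur(img, (5, 5), 0)

# 3. Highlight the edges of objects with the Canny detector.
edges = cv2.Canny(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY), 100, 200)
```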

The artificial intelligence then begins to look for patterns: shapes, contours or structures that match what it has learned in its training. For example, it can identify that a round shape with certain contrasts corresponds to a face, or that an elongated, rectangular shape with specific colours is a road sign.
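A classic, if dated, example of this kind of learned pattern matching is the Haar cascade face detector that ships with OpenCV, trained precisely on the contrast structure typical of a frontal face. A minimal sketch (the image path is a placeholder):

```python
import cv2

# Load the pretrained frontal-face cascade bundled with opencv-python.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

frame = cv2.imread("street.jpg")  # illustrative file name
grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Returns one (x, y, w, h) rectangle per pattern match found in the image.
faces = cascade.detectMultiScale(grey, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```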

Once these key elements are located, the system can apply more sophisticated models. Some classify what it sees (e.g. distinguishing between a car, a bicycle or a pedestrian), others perform semantic segmentation, dividing the image into zones and assigning a meaning to each (pavement, road, sky, person, etc.), and in the case of video sequences, algorithms come into play to track the movement of an object over time, such as a ball in a game or a person crossing a street.
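As an illustration of the segmentation step, here is a minimal sketch using a pretrained DeepLabV3 model from torchvision; the random tensor merely stands in for a real camera frame, and the model's classes come from Pascal VOC (which does include "person", "car" and "bicycle") rather than from a driving dataset:

```python
import torch
from torchvision import models
from torchvision.models.segmentation import DeepLabV3_ResNet50_Weights

# Pretrained semantic segmentation model and its matching preprocessing.
weights = DeepLabV3_ResNet50_Weights.DEFAULT
model = models.segmentation.deeplabv3_resnet50(weights=weights).eval()
preprocess = weights.transforms()

# "batch" would be a preprocessed camera frame; a random tensor stands in here.
batch = preprocess(torch.rand(3, 520, 520)).unsqueeze(0)

with torch.no_grad():
    out = model(batch)["out"]  # shape: (1, num_classes, H, W)

labels = out.argmax(dim=1)     # one class index per pixel: road, person, sky...
```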

In an autonomous car, this whole process is repeated dozens of times per second. For safe driving, each frame must be analysed in less than 100 milliseconds (ideally between 30 and 50 milliseconds), allowing the system to work at a rate comparable to real-time video (20-30 frames per second).
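A toy sketch of that timing constraint: the processing function and its 20 ms sleep are placeholders, but the budget check is the kind of guard a real-time pipeline needs on every frame:

```python
import time

FRAME_BUDGET_MS = 50  # target from the text: ideally 30-50 ms per frame

def process_frame(frame):
    # Placeholder for the full perception pipeline described above.
    time.sleep(0.02)  # pretend inference takes ~20 ms

start = time.perf_counter()
process_frame(None)
elapsed_ms = (time.perf_counter() - start) * 1000

if elapsed_ms > FRAME_BUDGET_MS:
    print(f"Frame overran its budget: {elapsed_ms:.1f} ms > {FRAME_BUDGET_MS} ms")
else:
    print(f"Frame processed in {elapsed_ms:.1f} ms, within budget")
```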

When the system identifies the scene (e.g. a pedestrian crossing, a red light or an unexpected obstacle), it moves on to decision-making. The AI uses this visual information to trigger automatic responses: braking, turning, maintaining distance or simply continuing to move forward. This phase interfaces with other vehicle systems (such as cruise control or steering) and, in some cases, with predictive systems that anticipate what might happen in the next few seconds.
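A deliberately oversimplified sketch of that hand-off from perception to action; the labels, distances and rules are invented for illustration and bear no relation to any production planner:

```python
def decide(detections: list[dict]) -> str:
    """Map what the vision system saw in one frame to a driving action."""
    for obj in detections:
        if obj["label"] == "pedestrian" and obj["distance_m"] < 15:
            return "brake"
        if obj["label"] == "red_light":
            return "stop"
    return "continue"

# Example frame: a pedestrian detected 10 metres ahead.
print(decide([{"label": "pedestrian", "distance_m": 10}]))  # -> "brake"
```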


Passing the driving test

For an autonomous car to recognise a bicycle, anticipate a pedestrian crossing or detect a stop sign partially covered by a tree, it first needs to learn.

Its training begins by "studying" gigantic databases made up of images and videos collected by other vehicles, city cameras, digital simulations or test fleets. Each image is labelled manually or by algorithms with information about what appears in it: cars, people, signs, traffic lights, curbs, etc.
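What does one of those labelled samples look like? A minimal, hypothetical example in the spirit of COCO-style annotations; the path, labels and box coordinates are invented:

```python
# One labelled training sample: the image file plus its annotated objects.
sample = {
    "image": "frames/cam_front_000123.jpg",  # illustrative path
    "annotations": [
        {"label": "car",           "bbox": [412, 230, 180,  95]},  # x, y, w, h
        {"label": "pedestrian",    "bbox": [655, 210,  40, 110]},
        {"label": "traffic_light", "bbox": [300,  80,  25,  60]},
    ],
}

for ann in sample["annotations"]:
    print(ann["label"], "at", ann["bbox"])
```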

From there, deep learning comes into play: neural networks are used which, layer by layer, learn to recognise increasingly complex visual patterns, from the contours and shapes they incorporate into their "knowledge" in the early stages to whole objects and behaviours.
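A minimal sketch of such a network in PyTorch; the layer sizes and the four-class output are illustrative, not a real perception model:

```python
import torch
import torch.nn as nn

class TinyVisionNet(nn.Module):
    """Early layers respond to simple contours; deeper layers combine them."""
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # edges
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # shapes
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)  # e.g. car/bike/pedestrian/sign

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

logits = TinyVisionNet()(torch.rand(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 4]): one score per class
```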

These networks take the form of specialised models, each with a role within the vehicle's "visual system". For example:

  • YOLO (You Only Look Once) is able to detect objects in real time, ideal for responding quickly if a pedestrian crosses unexpectedly (a minimal usage sketch follows this list)
  • Faster R-CNN is more accurate and better suited to manoeuvres where every centimetre matters, such as in automated parking
  • Mask R-CNN adds the exact contour of the detected object, useful for estimating real distances
  • U-Net or SegNet segment the scene by zones: road, pavement, bike lane, etc.
  • Kalman Filter and Deep SORT allow the car to continuously track moving objects, such as a motorbike zigzagging between lanes
  • And new visual transformers such as ViT or Swin Transformer make it possible to understand entire scenes and anticipate possible future behaviour
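Picking up the YOLO bullet above, here is a minimal detection sketch using the third-party ultralytics package; the model file and the image path are illustrative, and the small pretrained model downloads automatically on first use:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")     # small pretrained detection model
results = model("street.jpg")  # run detection on a single frame

# Print one line per detected object with its confidence score.
for box in results[0].boxes:
    cls = results[0].names[int(box.cls)]
    print(cls, float(box.conf))  # e.g. "person 0.91"
```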

The goal is for the neural network to generalise: if it has learned to detect a dog in thousands of images, it will also recognise one partially hidden behind a fence or running in the rain; likewise, if it has learned that a right indicator on the car ahead signals an upcoming right turn, it can infer that the same vehicle will probably slow down.

Once trained, the model is integrated into the car's system and tested. Firstly, in all kinds of virtual simulators, where its behaviour is evaluated in different situations; in a second stage, on closed and controlled circuits; and finally, if everything has gone well, in real-life scenarios.

This is the approach followed by companies like Waymo, which already operates driverless robotaxi services in cities like Phoenix and San Francisco, where the vehicles drive fully autonomously within delimited areas and under remote supervision because...


A wall of reality 

Despite impressive advances in computer vision, there are still very few autonomous cars driving around our cities. Why? Because reality is complex, chaotic and, above all, unpredictable.

One of the main challenges remains human behaviour. No matter how well trained a model is, it still finds it difficult to anticipate erratic decisions: a pedestrian who stops suddenly in the middle of the road, a cyclist who jumps a traffic light or a driver who turns without warning... The autonomous car needs more than just vision: it needs to understand intentions, and this reading of the social context remains one of the most difficult areas for an AI.

In addition, the state of the infrastructure plays a critical role. Blurred lane markings, signs obscured by vegetation, poor lighting or adverse weather conditions (such as dense fog, heavy rain or sun glare) can drastically degrade the accuracy of sensors and models. Although perception systems have improved enormously, they are not yet ready to operate with complete safety in any environment, at any time and under any conditions.

For this reason, most of today's vehicles operate in hybrid or advanced assistance modes (known as autonomy levels 2 and 3), where the driver is still needed as a backup in unexpected situations. Fully autonomous driving (level 5), with no human intervention at any time, remains a medium- to long-term goal.

However, there are reasons to be optimistic. According to Waymo's own data, an evaluation of 7.1 million driverless miles recorded only three collisions involving injuries, and various studies suggest that autonomous systems make fewer errors per kilometre than human drivers. In a world where human error is behind more than 90% of traffic accidents, every advance in computer vision and automated decision-making represents not just a technological step forward, but a potential improvement in road safety.
