Scalability and Power: The Boundaries of Deep Learning

edited by Fabio Gnassi
interview with Marco Canini

As is now well known, the recent rise of artificial intelligence stems from the convergence of two key factors: the vast amounts of data generated by contemporary digital life—readily available online—and the ongoing advancement of computational power.
While the accumulation of data is an intuitive phenomenon, rooted in our daily experience, the evolution of computing capabilities follows more complex and less tangible dynamics. Yet these are crucial to fully understanding the nature and limitations of AI.


The deep learning process relies on three factors: computational power, the number of parameters, and dataset size. Could you explain what these are and why they matter?

Deep learning models are trained through an iterative process in which their capabilities gradually improve over repeated update cycles. The model starts with randomly initialised parameters, so at first it is of little use. It is then exposed to data tailored to the specific task at hand. This data is typically labelled with information that acts as the “solutions”, helping the model learn how to perform the task.
Naturally, performing the computations required during training becomes increasingly demanding as the model grows. A larger number of parameters increases the computational load, but it also enables the model to capture more complex correlations and finer details.
The size of the training dataset is equally critical: larger datasets enhance the model’s learning ability, resulting in more accurate predictions and higher overall quality.
In general, we can say that in deep learning, improvements in computational power, parameter count, and data volume lead to significant performance gains.
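
To make these three ingredients concrete, the following is a minimal, self-contained PyTorch sketch (illustrative only, not code from the interview): a small model starts from randomly initialised parameters and is trained over repeated passes through synthetic labelled data. The model shape, dataset size and hyperparameters are arbitrary assumptions chosen for the example; the point is simply that parameter count, dataset size and the number of update cycles determine how much computation training consumes.

    # Illustrative training loop: parameter count, dataset size and repeated
    # update cycles are the three factors discussed above. All sizes are
    # arbitrary choices for the example.
    import torch
    import torch.nn as nn

    # Synthetic labelled data standing in for a task-specific dataset.
    NUM_SAMPLES, NUM_FEATURES, NUM_CLASSES = 10_000, 64, 10
    X = torch.randn(NUM_SAMPLES, NUM_FEATURES)
    y = torch.randint(0, NUM_CLASSES, (NUM_SAMPLES,))

    # The model starts from randomly initialised parameters.
    model = nn.Sequential(
        nn.Linear(NUM_FEATURES, 256), nn.ReLU(),
        nn.Linear(256, NUM_CLASSES),
    )
    print("parameter count:", sum(p.numel() for p in model.parameters()))

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(X, y), batch_size=128, shuffle=True)

    # Each epoch is one pass over the data: more data and more parameters
    # mean more arithmetic per pass, i.e. more computational power needed.
    for epoch in range(5):
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(xb), yb)  # compare predictions with the labelled "solutions"
            loss.backward()                # compute gradients
            optimizer.step()               # update the parameters
        print(f"epoch {epoch}: loss {loss.item():.3f}")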

What is Moore’s Law, and how has it influenced the development and scalability of deep learning?

Moore’s Law is named after Gordon Moore, co-founder and CEO of Intel, who observed that the number of transistors in digital circuits doubled approximately every two years. From this historical trend emerged a predictive “law” suggesting that technological advances would allow the semiconductor industry to continue boosting circuit performance by increasing transistor density.
This law has had a major impact on the development and scalability of deep learning. Although the origins of deep learning date back to the 1970s with early neural networks, recent breakthroughs have only become feasible thanks to access to vast datasets and rising computational power—advancements made possible in large part due to Moore’s Law.
In the past, performing complex calculations was prohibitively expensive. Today, thanks to GPUs, we can achieve computational throughput measured in teraflops (one teraflop is one trillion floating-point operations per second). This has dramatically transformed the field, enabling tasks that were once unimaginable.


One solution to the scalability challenge is “Distributed Deep Learning”. What are its characteristics?

Despite the expectations set by Moore’s Law, today’s growth in computing power is driven largely by parallelisation across many processors and accelerators rather than by increases in the speed of individual processors.
However, one of the slowing trends concerns memory capacity and bandwidth. For example, the memory available in modern GPUs is not increasing at the same pace as computational power. This presents a bottleneck, as deep learning models demand ever-larger parameter sets to function effectively.
When a single GPU no longer offers sufficient memory for training, the process becomes distributed, using multiple GPUs in parallel to share the computational workload.
This approach is known as distributed deep learning. It addresses scalability by distributing training tasks across several GPUs. In theory, this allows computational capacity and training speed to grow linearly with the number of GPUs used, reducing training time accordingly. In practice, however, scaling is not perfectly linear, largely because of the communication and synchronisation overhead between GPUs.
Despite these challenges, substantial progress has been made in making parallel computation more efficient. The emergence of Transformer-based models, used in Large Language Models (LLMs), has accelerated this trend. These models contain tens or even hundreds of billions of parameters, making it necessary to train them using hundreds or thousands of GPUs in parallel.
While great strides have been made, research into improving these processes is ongoing, driven by efforts from companies, universities, and researchers worldwide.
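
As a rough sketch of the data-parallel training described above (an illustration, not the setup of any specific system mentioned in the interview), the snippet below uses PyTorch’s DistributedDataParallel with the CPU-only “gloo” backend so that four processes stand in for four GPUs. Each worker trains a replica of the same model on its own shard of data; during the backward pass the gradients are averaged across workers, so every replica applies the same update. The worker count, port and model are assumptions chosen for the example.

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def worker(rank, world_size):
        # One process per worker; the CPU "gloo" backend stands in for GPUs.
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"   # arbitrary free port for the example
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

        model = DDP(nn.Linear(32, 10))        # DDP averages gradients across workers
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        loss_fn = nn.CrossEntropyLoss()

        # Each worker sees its own shard of (synthetic) data.
        torch.manual_seed(rank)
        x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))

        for _ in range(10):
            opt.zero_grad()
            loss_fn(model(x), y).backward()   # backward() triggers the gradient all-reduce
            opt.step()                        # every replica applies the same averaged update
        if rank == 0:
            print("finished training with", world_size, "workers")
        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = 4                        # four processes standing in for four GPUs
        mp.spawn(worker, args=(world_size,), nprocs=world_size)

In practice, the all-reduce communication performed at every step is precisely the kind of overhead that keeps scaling from being perfectly linear.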

Recent studies suggest that neural network-based artificial intelligence has reached a stage where further efficiency gains are increasingly difficult. Is that true?

Yes, particularly in the case of LLMs, there is a well-documented relationship between computational power, parameter count, and dataset size. Improvements in model quality come from increasing all three, but the returns diminish steeply: each further, modest quality gain requires a multiplicative increase in computing resources, parameters, and data. As a result, making models more efficient has become increasingly difficult, and marginal improvements now come at enormous cost.
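
One common way to picture these diminishing returns (an illustration, not a formula quoted in the interview) is the power-law fit reported in empirical scaling-law studies, in which training loss falls roughly as a power of the compute budget. The toy calculation below assumes a small exponent purely for the example and shows how even a modest reduction in loss translates into a multiplicative increase in compute.

    # Toy power-law scaling curve: loss ~ k * compute ** (-ALPHA).
    # The exponent is an assumption chosen for illustration.
    ALPHA = 0.05

    def compute_multiplier(loss_reduction: float, alpha: float = ALPHA) -> float:
        """Factor by which compute must grow to cut the loss by `loss_reduction`
        (e.g. 0.10 for a 10% lower loss), if loss = k * C ** (-alpha)."""
        return (1.0 - loss_reduction) ** (-1.0 / alpha)

    for r in (0.05, 0.10, 0.20):
        print(f"{r:.0%} lower loss -> roughly {compute_multiplier(r):.0f}x more compute")

Under this assumed exponent, a 10% lower loss costs roughly eight times the compute, and a 20% lower loss closer to ninety times, which is why marginal improvements are so expensive.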
Training large models today can require tens of millions of dollars, limiting the ability to explore new ideas or substantially refine existing methods. This often leads to a preference for replicating established approaches rather than innovating, due to high costs and complexity.
This environment makes it more challenging to introduce novel techniques, despite vibrant research and innovation. Much of this innovation is driven by large corporations that have the necessary resources, while academia, despite its talent pool, is constrained by high costs.
Moreover, much of the knowledge gained remains proprietary, driven by commercial interests. However, open-source initiatives—such as Meta’s LLaMA—play a vital role by contributing to the wider community and encouraging knowledge sharing.

MARCO CANINI

Marco Canini is Associate Professor of Computer Science at KAUST. He received his PhD in computer science and engineering from the University of Genoa in 2009 after spending the last year as a visiting student at the University of Cambridge. He was a postdoctoral researcher at EPFL and a senior researcher at Deutsche Telekom Innovation Labs and TU Berlin. Before joining KAUST, he was an associate professor at UCLouvain. He has also held positions at Intel, Microsoft and Google.
