
How Limits Make AI Limitless: Intuition for the Derivative

Aakash Nand · Senior Data Engineer @ Kraken Technologies

What does the slope of a curve have to do with ChatGPT? More than you might think. The high school calculus we all studied, specifically differentiation, is the mathematical foundation behind how AI learns. You may know differentiation by this notation: \(\frac{dy}{dx}\)

A few years ago I started a graduate course in AI and tried to understand what happens behind the scenes. Since then I have learned a great deal, and in every course there is mention of Gradient Descent. It is the algorithm that makes AI learn. Of course, it is not the only reason we are witnessing this impact of AI; many other things combine to make it possible, such as GPU power, huge datasets, embeddings, transformers, and more. While learning these concepts, I revisited limits and differentiation and encountered a really good mathematics channel by Eddie Woo. Eddie teaches with so much passion and enthusiasm, and makes mathematics genuinely interesting. What struck me watching his videos was how naturally the idea of a derivative emerges from a simple geometric problem, one I want to walk you through today. This article is inspired by his video on the same topic. Credit to him for making these concepts so accessible. Definitely check out his channel.

To set the stage, let us start with a real example. Isaac Newton was sitting under a tree and witnessed an apple fall. It made him curious: why does the moon, which is also up in the sky, not fall like the apple from the tree? Why was gravity affecting the apple but not the moon? This curiosity led him to formulate the problem in equations and find the relationship between the distance separating two bodies and the gravitational force between them. The famous formula, also known as Newton’s law of universal gravitation, goes as follows:

$$ F = G\frac{m_1 m_2}{r^2} $$ Where:

  • \(F\) is the gravitational force between two objects
  • \(G\) is the gravitational constant \(6.674 \times 10^{-11} \ \text{Nm}^2/\text{kg}^2\)
  • \(m_1\) and \(m_2\) are the masses of the two objects
  • \(r\) is the distance between the centers of the two objects
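To make the formula concrete, here is a small Python sketch that plugs in illustrative values: the mass of the Earth, a roughly 100 g apple, and the Earth's radius as the distance between centers. These particular numbers are my own assumptions for illustration, not part of the original example.

```python
# Newton's law of universal gravitation: F = G * m1 * m2 / r^2
# The masses and distance below are illustrative values only.

G = 6.674e-11        # gravitational constant, N·m²/kg²
m_earth = 5.972e24   # mass of the Earth, kg
m_apple = 0.1        # mass of an apple, kg (~100 g)
r = 6.371e6          # distance between centers ≈ Earth's radius, m

F = G * m_earth * m_apple / r**2
print(f"Gravitational force on the apple: {F:.2f} N")  # ≈ 0.98 N
```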

The graph below shows the relationship between distance \(r\) and gravitational force \(F\): the force falls off with the square of the distance, an inverse-square relationship. Notice that the curve is not a straight line:

[Figure: gravitational force \(F\) plotted against distance \(r\)]

This curve is exactly the kind of problem we want to solve. Suppose Newton wanted to know: at a given distance \(r\), how quickly is the gravitational force changing? That is, what is the slope of this curve at any given point?

To understand slope on a straight line, we use \(m= \frac{rise}{run}\). Given two points \((x_1,y_1)\) and \((x_2,y_2)\), the slope is \(m= \frac{y_2-y_1}{x_2-x_1}\). On a straight line the slope is always the same no matter which two points you pick. However, on a curve like Newton’s gravitational formula, the slope keeps changing as the points move. This is the central problem: how do we calculate the slope at a single point on a nonlinear curve?
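To see the difference numerically, here is a short Python sketch; the line and curve chosen below are my own illustrative examples. On the straight line the two-point slope is the same for any pair of points, while on the curve it keeps changing depending on where we measure.

```python
# Two-point slope m = (y2 - y1) / (x2 - x1), applied to a line and to a curve.
# The example functions are illustrative choices, not from the original article.

def slope(f, x1, x2):
    return (f(x2) - f(x1)) / (x2 - x1)

line = lambda x: 3 * x + 1   # straight line: slope should always be 3
curve = lambda x: x ** 2     # curve: slope depends on where we measure

print(slope(line, 0, 1), slope(line, 5, 9))    # 3.0 3.0  -> constant
print(slope(curve, 0, 1), slope(curve, 5, 9))  # 1.0 14.0 -> keeps changing
```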

The answer, in principle, is a tangent line. A tangent at a point tells us the instantaneous slope there. But here is the problem: to calculate a slope we need two points, and a tangent by definition touches the curve at only one point.

[Figure: a tangent line touching the curve at a single point]

The way out of this is to use a secant line, which passes through two points on the curve. Now let’s look at what happens as we slide those two points closer and closer together. The secant line rotates and starts to look more and more like the tangent. In the limit, when the two points are infinitely close, the secant becomes the tangent. This is the key insight.
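Here is a quick numerical sketch of that insight, using the gravitational curve from earlier with the same illustrative constants (and an assumed distance \(r\)). As the second point slides toward the first, i.e. as the gap \(h\) shrinks, the secant slope settles toward a single value: the slope of the tangent.

```python
# Secant slope (F(r + h) - F(r)) / h for smaller and smaller h.
# Constants and the distance r = 6.371e6 m are illustrative assumptions.

G, m1, m2 = 6.674e-11, 5.972e24, 0.1

def F(r):
    return G * m1 * m2 / r**2

r = 6.371e6
for h in [1e6, 1e5, 1e4, 1e3]:
    secant = (F(r + h) - F(r)) / h
    print(f"h = {h:>9.0f} m  ->  secant slope = {secant:.3e} N/m")
# The slopes converge toward the tangent slope at r as h shrinks toward 0.
```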

Now let’s turn this into a formula. For a secant that cuts the curve at two points, \((x, f(x))\) and \((x+h, f(x+h))\), as shown below, we can write its gradient as follows:

[Figure: a secant line through the points \((x, f(x))\) and \((x+h, f(x+h))\)]

$$ m_{\text{secant}} = \frac{f(x+h)-f(x)}{h} $$

To get the gradient of the tangent from this, we need the distance \(h\) between the two points to become 0. But \(h\) cannot actually be zero because we are dividing by \(h\).

This is exactly where limits come in. We do not ask what happens when \(h\) equals zero. We ask what happens as \(h\) gets closer and closer to zero, which in mathematical notation is denoted as \(\displaystyle{\lim_{h \to 0}}\).

Before we apply this to our gradient formula, let us build intuition for limits with a simpler example:

$$ \lim_{x \to 5}\frac{x^2-25}{x-5} $$

We cannot substitute \(x=5\) directly because the denominator \(x-5\) would be zero, but if we simplify the expression:

$$ \lim_{x \to 5}\frac{x^2-25}{x-5} = \lim_{x \to 5}\frac{(x+5)(x-5)}{x-5} = \lim_{x \to 5}(x+5) = 10 $$

Even though we cannot substitute \(5\) into the original expression directly, we can still determine what the function approaches as \(x\) gets closer to \(5\). The limit gives us the answer without ever needing to divide by zero.
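You can see the same behaviour numerically. The short sketch below evaluates the expression at values of \(x\) closer and closer to \(5\); the sample values themselves are just my choice:

```python
# (x^2 - 25) / (x - 5) evaluated as x approaches 5 (never at x = 5 itself).

def f(x):
    return (x**2 - 25) / (x - 5)

for x in [4.9, 4.99, 4.999, 5.001, 5.01, 5.1]:
    print(f"x = {x:<6}  f(x) = {f(x):.4f}")
# The values approach 10 from both sides, matching the limit.
```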

Applying the same idea to our gradient problem gives us the formula for the tangent:

$$ m_{\text{tangent}}=\lim_{h \to 0} \frac{f(x+h)-f(x)}{h} $$

Simply put, this is just the change in \(y\) divided by the change in \(x\), which is denoted by \(\frac{dy}{dx}\). This is known as differentiation from first principles. Let us use it to calculate the derivative of \(f(x)=x^2\):

$$ \begin{align*} f(x) &= x^2 \\ f'(x) &= \lim_{h \to 0} \frac{f(x+h)-f(x)}{h} \\ &= \lim_{h \to 0} \frac{(x+h)^2-x^2}{h} \\ &= \lim_{h \to 0} \frac{x^2+2xh+h^2-x^2}{h} \\ &= \lim_{h \to 0} \frac{2xh+h^2}{h} \\ &= \lim_{h \to 0} \left(2x+h\right) \\ &= 2x \end{align*} $$

Notice where the limit does its work: in the second to last step we have \(2x + h\). Because we are taking the limit as \(h \to 0\), the \(h\) simply vanishes, leaving us with \(2x\). That single step is the heart of calculus.
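Here is a quick numerical check of that result. For illustration I evaluate at the arbitrary point \(x = 3\), where \(2x\) says the slope should be \(6\):

```python
# First-principles quotient (f(x + h) - f(x)) / h for f(x) = x^2, at x = 3.
# The point x = 3 is an illustrative choice.

def f(x):
    return x ** 2

x = 3.0
for h in [0.1, 0.01, 0.001, 0.0001]:
    print(f"h = {h:<7}  slope = {(f(x + h) - f(x)) / h:.5f}")
# The slopes approach 6.0, which is 2x at x = 3.
```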

So how does this connect to AI? When a neural network trains, it makes predictions and measures how wrong those predictions are using a loss function: a measure of the error between predicted and actual values. To improve, the network needs to know which direction to nudge each parameter to reduce that loss. That direction is the gradient, and it is computed using exactly the derivative we just derived. The algorithm that does this repeatedly until the loss is minimized is called gradient descent, and it deserves its own separate article.
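As a rough sketch of that idea (a toy example of my own, not how any particular framework implements it), gradient descent repeatedly moves a parameter against its derivative. Here the loss is \(L(w) = (w-4)^2\), whose derivative \(2(w-4)\) follows from the same rule we just derived; the learning rate and starting point are arbitrary choices.

```python
# Toy gradient descent: minimise L(w) = (w - 4)^2 using its derivative 2 * (w - 4).
# Learning rate and starting point are illustrative assumptions.

def loss(w):
    return (w - 4) ** 2

def grad(w):
    return 2 * (w - 4)   # derivative of the loss, from first principles

w = 0.0    # initial parameter
lr = 0.1   # learning rate (step size)
for step in range(25):
    w -= lr * grad(w)   # step in the direction that reduces the loss

print(f"w = {w:.4f}, loss = {loss(w):.6f}")  # w ends up close to 4, loss near 0
```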

Next time you hear that an AI model was “trained”, you now know that somewhere underneath all of that, a limit was taken, a derivative was computed, and a small step was made in the right direction. Millions of times over. That is the calculus powering AI.