It’s not hard, just unfamiliar
I heard a presenter at a conference make this statement about some new programming language constructs that had been introduced at the time. I think it captures a way to fight the feeling many of us get when we think about math.
Don’t get me wrong. I know some of us enjoy math, me included. However, when it comes to math for machine learning, I believe the concepts are often presented as more of a barrier to understanding than is warranted.
This post attempts to dispel the idea that you need to master difficult mathematical concepts to do machine learning. By reviewing the common math concepts and focusing on the intuition behind them, I hope that, after a while, those concepts will become more familiar. And, in the end, you’ll find most of the hard stuff has already been done and is hidden away behind simple coding libraries, anyway.
What do all these words mean?
In this section, I just want to talk about vocabulary. The words used in the machine learning domain may sound familiar, but they mean very specific things, and sometimes the words are overloaded, causing confusion.
Let’s start with regression. In machine learning, regression refers to tasks where we predict a continuous outcome (e.g., forecasting temperature) rather than classifying data into categories. Linear regression, more specifically, is a foundational statistical technique that models the linear relationship between a dependent variable and one or more independent variables; it is often used both for prediction and for interpreting data relationships. Auto-regressive transformers, despite the similar-sounding name, are something else again: they generate text sequentially, predicting one token at a time and conditioning each prediction on all previously generated tokens. Lastly, regression in psychology refers to a temporary return to earlier behavior patterns under stress, which is entirely different from the technical concepts we’ve discussed. (Hopefully, reading this article won’t cause you to regress!)
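To make that concrete, here’s a minimal linear regression sketch using scikit-learn. The sunshine-vs-temperature data is made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: hours of sunshine vs. temperature (a continuous outcome)
X = np.array([[1], [2], [3], [4], [5]])       # independent variable, shape (n_samples, 1)
y = np.array([15.0, 18.0, 21.5, 24.0, 27.5])  # dependent variable

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # the learned slope and intercept
print(model.predict([[6]]))           # predict a continuous value for a new input
```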
Another common word is ‘linear’. Linear concepts appear frequently. Besides linear regression, we encounter linear projection and linear transformation. A linear projection maps data from a higher-dimensional space onto a lower-dimensional one—imagine casting a 3D object’s shadow onto a 2D surface, a process used to reduce dimensions or simplify data. On the other hand, a linear transformation typically scales the input (as in y = mx) but, if you include an additional bias term (as in y = mx + b), the operation is technically an affine transformation, which slightly extends the definition. This nuance is important in mathematical contexts.
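Here’s a quick sketch of that distinction in NumPy (the numbers are arbitrary):

```python
import numpy as np

m, b = 2.0, 1.0
x = np.array([1.0, 2.0, 3.0])

linear = m * x      # y = mx: a linear transformation
affine = m * x + b  # y = mx + b: the bias term makes it affine

# One property of a truly linear map: f(u + v) == f(u) + f(v).
# It holds for y = mx but fails once the bias is added.
f = lambda v: m * v
g = lambda v: m * v + b
print(f(1.0) + f(2.0) == f(3.0))  # True
print(g(1.0) + g(2.0) == g(3.0))  # False
```

Incidentally, PyTorch’s nn.Linear layer includes a bias term by default, so strictly speaking it computes an affine transformation—the name sticks anyway.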
Another word I hear a lot is ‘pipelines’. As a software developer, I’m familiar with Git pipelines used for continuous integration and delivery. In machine learning, however, pipelines refer to data pipelines—systems designed to ingest, clean, transform, and load data into the model or application. Data transformations within these pipelines can include a variety of operations, such as normalization, encoding, feature scaling, or applying specific mathematical functions (which may be linear or non-linear) to prepare the data for analysis or model training.
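For example, scikit-learn’s Pipeline chains transformation steps with a final model. The features and labels below are made up for illustration:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Made-up features and labels, just to show the flow of a pipeline
X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 240.0], [4.0, 210.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([
    ("scale", StandardScaler()),     # transform step: feature scaling
    ("model", LogisticRegression())  # final step: the model itself
])
pipe.fit(X, y)                       # every step runs in order during fit
print(pipe.predict([[2.5, 215.0]]))
```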
What are parameters and hyper-parameters? Parameters and hyper-parameters are fundamental concepts in machine learning. Parameters are the components of a model—such as the weights and biases in an artificial neural network—that are learned automatically during training through processes like backpropagation. In contrast, hyper-parameters are the configurable settings that you set before the training process begins. These include aspects such as the learning rate, number of layers, batch size, and the number of training epochs. While the model optimizes its parameters during training, hyper-parameters must be adjusted manually, often through trial and error or systematic search methods.
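A short PyTorch sketch makes the distinction concrete (the layer sizes and learning rate here are arbitrary choices, not recommendations):

```python
import torch
import torch.nn as nn

# Hyper-parameters: chosen by you, before training starts
learning_rate = 0.01
hidden_size = 16

# Parameters: the weights and biases inside the model, learned during training
model = nn.Sequential(
    nn.Linear(4, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, 1),
)
print(sum(p.numel() for p in model.parameters()))  # count of learnable parameters

# The optimizer updates the parameters; the learning rate stays fixed unless you change it
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
```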
Mathematical data structures
One thing you’ll find if you start trying to code machine learning models right away is that there are data structures that look very math-y. Rather than trying to understand these data structures intellectually, I believe it is far easier to understand them operationally.
Let’s take the PyTorch library’s data structure tensor. Now, you may know this: tensor calculus is the math Albert Einstein famously used to formulate general relativity. But, that ain’t it, bruh; so, don’t worry about it. A PyTorch tensor is just a data structure for storing high-dimensional vectors or matrices and performing efficient mathematical operations on them. Use it as you need it; learn it as you use it.
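Here’s what “use it as you need it” looks like in practice, with a couple of made-up values:

```python
import torch

t = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # a 2x2 tensor (just a matrix)
v = torch.tensor([10.0, 20.0])              # a 1-D tensor (just a vector)

print(t.shape)   # torch.Size([2, 2])
print(t @ v)     # matrix-vector multiplication
print(t * 2)     # elementwise scaling
print(t.mean())  # reductions work too
```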
Another common data structure that is very powerful is the Pandas DataFrame. This structure is like having Microsoft Excel in your pocket. Here’s a nice reference: Comparison with spreadsheets — pandas 2.2.3 documentation.
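A tiny, made-up example of the spreadsheet feel:

```python
import pandas as pd

# A made-up table, like a small spreadsheet
df = pd.DataFrame({
    "city": ["Austin", "Boston", "Chicago"],
    "temp_f": [95, 78, 82],
})
df["temp_c"] = (df["temp_f"] - 32) * 5 / 9  # a computed column, like a spreadsheet formula
print(df.describe())                        # summary statistics in one call
print(df[df["temp_f"] > 80])                # filtering rows, like a spreadsheet filter
```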
And then there’s the actual math
Activation Functions
- Non-Linearity: Without activation functions, a neural network would behave like a linear model regardless of the number of layers. The non-linearity allows the network to capture more intricate relationships.
- Transformation: An activation function transforms the weighted sum of inputs into an output signal for the neuron, often squashing the output within a certain range.
Common Examples:
- Tanh: Maps inputs to values between -1 and 1, centering the output around zero, which can be useful for certain tasks.
- Sigmoid: Maps input values to a range between 0 and 1. Often used in binary classification.
- ReLU (Rectified Linear Unit): Outputs zero for negative inputs and a linear relationship for positive inputs. Widely used in hidden layers for its computational efficiency.
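Here are those three functions evaluated with PyTorch on a handful of made-up inputs, so you can see the squashing behavior directly:

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(torch.tanh(x))     # squashed into (-1, 1), centered around zero
print(torch.sigmoid(x))  # squashed into (0, 1)
print(torch.relu(x))     # zero for negatives, unchanged for positives
```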
Loss Functions
Below are five common loss functions along with brief explanations of each:
- Mean Squared Error (MSE) Loss
  - Description: MSE computes the average of the squares of the differences between the predicted and actual values.
  - Usage: It is predominantly used for regression problems because it heavily penalizes larger errors, encouraging the model to reduce big mistakes.
- Mean Absolute Error (MAE) Loss
  - Description: MAE calculates the average of the absolute differences between the predicted and actual values.
  - Usage: This loss function is also used in regression tasks and is more robust to outliers than MSE since it does not square the error terms.
- Huber Loss
  - Description: Huber Loss is a combination of MSE and MAE. It behaves like MSE for small error values and switches to MAE for larger errors, reducing the sensitivity to outliers.
  - Usage: It is particularly useful when you want to balance the robustness of MAE with the smooth gradients of MSE, especially in regression tasks with potential outliers.
- Hinge Loss
  - Description: Hinge Loss is designed for classification tasks—typically used with Support Vector Machines (SVMs). It calculates the loss based on the margin between the correct class and the incorrect classes.
  - Usage: This loss function is useful for “maximum-margin” classification, ensuring that the data points are classified correctly with a certain margin of confidence.
- Cross-Entropy Loss
  - Description: Cross-Entropy Loss (often referred to as Log Loss in the binary case) measures the difference between two probability distributions. In classification tasks, it quantifies the dissimilarity between the predicted probability distribution and the true distribution.
  - Usage:
    - Binary Cross-Entropy: Applied in binary classification problems.
    - Categorical Cross-Entropy: Used in multiclass classification problems, where the model predicts probabilities across multiple classes.
Each of these loss functions serves a specific purpose and is chosen based on the nature of the problem (regression vs. classification) and the behavior (sensitivity to outliers, margin-based constraints) desired during training.
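If you’re curious what these look like in code, PyTorch ships built-in versions of most of them. Here’s a minimal sketch: the predictions, targets, and logits are made up; nn.HuberLoss assumes a reasonably recent PyTorch (1.9+); and nn.MultiMarginLoss is PyTorch’s multiclass hinge-style loss, standing in for classic SVM hinge loss:

```python
import torch
import torch.nn as nn

# Made-up regression predictions and targets
pred = torch.tensor([2.5, 0.0, 2.0])
target = torch.tensor([3.0, -0.5, 2.0])

print(nn.MSELoss()(pred, target))    # Mean Squared Error
print(nn.L1Loss()(pred, target))     # Mean Absolute Error
print(nn.HuberLoss()(pred, target))  # Huber: MSE for small errors, MAE for large

# Classification losses expect raw class scores (logits) and an integer label
logits = torch.tensor([[2.0, 0.5, -1.0]])    # one sample, three classes
label = torch.tensor([0])                    # the true class index
print(nn.CrossEntropyLoss()(logits, label))  # categorical cross-entropy
print(nn.MultiMarginLoss()(logits, label))   # multiclass hinge-style loss
```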
What about all the statistical terminology?
Independent and dependent variables, hypothesis testing, probability distributions, normalization, L1 and L2 regularization, correlation and confounding variables, confusion matrices, t-statistic, F-statistic, p-value, standard deviation, mean squared error, variance, Pearson’s coefficient, …, and on and on???!! That subject deserves a separate post.
Machine learning has a lot of concepts, math, and tools, and it is ever evolving. So, how do you learn everything? The short answer is you don’t. Instead, look for similar programmatic pieces you can experiment with and add to your toolset. Use Google and, of course, by all means, use a GenAI to help explain and generate examples!
Don’t let the math bog you down
Remember, your goal is to train models to solve problems, not to become a mathematician (not that there’s anything wrong with that). Besides, many of the libraries you’ll use to train those models have done all the hard math; it’s the intuition behind them you need to master. Two mantras can help keep you from getting bogged down:
- It’s not hard, just unfamiliar
- Use it as you learn it; learn it as you use it