The StatQuest Illustrated Guide to Machine Learning!!!

5 minute read

Published: June 09, 2025

Author: Josh Starmer, Ph.D.
Book Amazon link: https://www.amazon.com/dp/B0BLM4TLPY
YouTube link: https://www.youtube.com/@statquest

Chapter 1 : Fundamental Concepts of ML

What is Machine Learning (ML)?

According to the author, ML is a collection of tools and techniques that transforms data into decisions.
Basically, ML is about 2 things:
1. Classifying things (Classification) and
2. Quantifying Predictions (Regression).
Comparing ML Methods:
- To choose which method to use for an application, we compare each model’s predictions with the actual outcomes. This is called model evaluation, and the measures used are called evaluation metrics.
- First, we fit the model to the training data.
- Then we make predictions with the trained model.
- Then we compare the predictions made on the test set with the actual outcomes.
- We repeat this for different models/methods and use the evaluation metrics to select a suitable method for our application.
- Just because a machine learning method fits the training data well doesn’t mean it will perform well on the testing data.
- Fits the training data well but predicts poorly on the testing data = Overfitting.
- Doesn’t fit the training data well = Underfitting.
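The workflow above can be sketched in a few lines of plain Python. The data points, the two toy models (`fit_line` and `fit_memorizer`), and the choice of mean squared error as the evaluation metric are all made up for illustration and do not come from the book.

```python
def mse(model, data):
    """Mean squared error of a model's predictions on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


def fit_line(train):
    """Least-squares straight line y = slope * x + intercept."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(train):
    """Memorizes the training set and answers with the nearest stored point."""
    return lambda x: min(train, key=lambda point: abs(point[0] - x))[1]


# Hypothetical data: roughly y = 2x, with a little noise in the training set.
train_set = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1), (6, 11.9)]
test_set = [(1.5, 3.0), (3.5, 7.0), (5.5, 11.0)]

results = {}
for name, fit in [("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train_set)                    # fit the model to the training data
    results[name] = {
        "train_mse": mse(model, train_set),   # how well it fits the training data
        "test_mse": mse(model, test_set),     # how well it predicts unseen data
    }

for name, scores in results.items():
    print(name, scores)
```

In this toy setup the memorizer reproduces the training data perfectly yet does worse than the line on the test set, which is the overfitting pattern described above.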
Independent and Dependent Variables
- variable: a quantity whose value varies from record to record.
- Say that we have two variables, ‘height’ and ‘weight’, and that we predict height from a person’s weight. Then ‘height’ is the dependent variable and ‘weight’ is the independent variable, because it is used to predict the dependent variable.
- Independent variables are also called features.
Discrete and Continuous Data
- discrete data: countable values; it only takes specific values (for example, the number of people in a room).
- continuous data: measurable values that can take any value within a range (for example, a person’s height).

Chapter 2: Cross Validation

From Chapter 1 we learned that we train the model on the ‘train set’ and evaluate it on the ‘test set’.
But how do we decide which data points go into the ‘test set’ and which go into the ‘train set’?
The answer is cross validation.
Say you have 10 data points and we have chosen an 80/20 train-test split. This means we randomly assign 8 points to the train set and the remaining 2 points to the test set. In the next iteration, the 2 points that were already tested are not used for testing again, so we choose another 2 points for the test set and use the remaining 8 for training. We can do this 5 times, since we have 10 data points in total and an 80/20 split.
Therefore, cross validation solves the problem of not knowing which points are best for testing by using all of them, one group at a time.
You can also think of it as making 5 groups, and each time using one group as the ‘test set’ and the remaining groups as the ‘train set’.
These iterations/groups are called folds. Therefore, this is an example of 5-fold cross validation.
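As a minimal sketch of the 10-point, 5-fold example, the function below shuffles the point indices and hands out each group once as the test set. The name `k_fold_splits` and the fixed `seed` are assumptions made for this illustration.

```python
import random


def k_fold_splits(num_points, num_folds, seed=0):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(num_points))
    random.Random(seed).shuffle(indices)       # random assignment of points to folds
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test_fold)


splits = list(k_fold_splits(num_points=10, num_folds=5))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Every point lands in exactly one test set, so each iteration trains on 8 points and tests on the other 2.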
But why can’t we use all the data as the ‘train set’?
- Because the only way to determine whether a model has been overfit is to evaluate it on new data.
- Reusing the same data points for training and testing is called Data Leakage.
The main advantage of cross-validation is that it gives a more reliable measure of how well a model performs, instead of relying on the luck of a single train-test split. If the test set happens to be easy, the model will look better than it actually is.
When we have a lot of data, 10-Fold Cross Validation is commonly used.
Another commonly used form of cross validation is Leave-One-Out.
- Use all but one point for training, and the remaining point for testing.
- Iterate until every single point has been tested.
- We usually use this for small datasets.
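Leave-One-Out can be sketched the same way. The toy model here, which simply predicts the mean of the training points, and the three data values are assumptions chosen to keep the arithmetic easy to follow.

```python
def leave_one_out_mse(values):
    """Average squared error of a 'predict the training mean' model under Leave-One-Out."""
    errors = []
    for i, held_out in enumerate(values):
        train = values[:i] + values[i + 1:]      # all points except one
        prediction = sum(train) / len(train)     # toy model: mean of the training points
        errors.append((prediction - held_out) ** 2)
    return sum(errors) / len(errors)


loo_score = leave_one_out_mse([2.0, 4.0, 6.0])
print(loo_score)
```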
Often, one model performs better in some iterations and another model performs better in others. In such cases, we use statistics to decide which model is better.

Chapter 3: Fundamental Concepts in Statistics!!!

Main Idea of Statistics:
- Statistics provides us with a set of tools to quantify the variation that we find in everyday life.
- For example, the number of fries you get in a bucket is not always the same. But say that we keep track of it. Then, using statistics, we can predict how many fries we will get tomorrow, and also determine how confident we can be in that prediction.
- Say that you predict a positive result but are not confident in it; then you will look for an alternative approach.
- To make a prediction, we need to understand the trend of the data.
  - A histogram is a good way of visualizing the trend of the data.
    - Divide the range into a number of bins.
    - Stack the elements according to how many of them fall into each bin.
    - The question to think about when making a histogram is how many bins to use.
    - The Naive Bayes algorithm can make predictions using histograms.
    - Calculating probability:
      - The probability of something occurring is the number of times it occurred divided by the total number of observations made.
      - The more observations we have, the more confident we can be in our predictions.
      - But collecting more samples is expensive in both money and time.
      - We can solve this problem using a probability distribution.
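The histogram and probability steps above can be sketched as follows. The fries counts, the bin range of 18 to 28, and the choice of 5 bins are hypothetical values picked for illustration.

```python
def histogram(values, low, high, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for value in values:
        # A value exactly at the top edge is placed in the last bin.
        index = min(int((value - low) / width), num_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical record of how many fries were in each bucket.
fries_per_bucket = [18, 20, 21, 21, 22, 22, 22, 23, 24, 27]
counts = histogram(fries_per_bucket, low=18, high=28, num_bins=5)

# Probability of a bin = occurrences in that bin / total observations.
probabilities = [count / len(fries_per_bucket) for count in counts]

for i, (count, prob) in enumerate(zip(counts, probabilities)):
    print(f"{18 + 2 * i}-{20 + 2 * i} fries: {'#' * count} p={prob:.1f}")
```

Changing `num_bins` changes the picture noticeably with so few observations, which is why the number of bins deserves some thought.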
Probability Distribution

Why Triton is gaining popularity?

1 minute read

Published: June 25, 2025

References:

Triton

In a typical machine learning (ML) workflow, we program feature production, training, and inference. We mostly do that with frameworks, which let us write high-level programs without managing the low-level details that ML or deep learning (DL) requires. Frameworks such as PyTorch or TensorFlow call CUDA if it is available, and the operations are then performed on the GPU. DL models have achieved state-of-the-art (SOTA) performance in multiple domains thanks to their hierarchical structure of parametric as well as non-parametric layers, and each of those layers has to be mapped onto GPU operations. Libraries like cuBLAS, cuDNN, and PyTorch’s built-in kernels are highly optimized for common operations (matrix multiplication, convolutions, etc.). But if our application has specialized algorithms, unique data layouts, or non-standard precisions or formats, those prebuilt kernels might not perform well. In that case, you write your own CUDA program for faster execution.

However, CUDA programming is very manual and tedious. It works on the principle of Scalar Program, Blocked Threads: we define what each individual thread does and manage the threads ourselves. It is a low-level programming method. Triton was developed to make specialized algorithms fast while making GPU programming less tedious and manual. Triton is a higher-level way of writing GPU kernels. It works on the principle of Blocked Program, Scalar Threads: instead of managing each thread, we write the program for a block of data and manage groups of threads. Triton then handles the actual per-thread operations based on our memory flow and data flow, and chooses an efficient way to perform the given task/operation. This makes the specialized use cases faster.
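To make the two principles concrete, here is a plain-Python simulation of a vector addition written both ways. It is only a conceptual sketch: it uses neither CUDA nor the Triton API, the loops stand in for hardware parallelism, and `BLOCK_SIZE` and the function names are assumptions invented for this example.

```python
BLOCK_SIZE = 4  # hypothetical number of elements handled per block


def add_scalar_style(x, y):
    """Scalar program: the kernel body says what ONE thread does with ONE element."""
    out = [0.0] * len(x)

    def kernel(thread_id):
        if thread_id < len(x):                       # per-thread bounds check
            out[thread_id] = x[thread_id] + y[thread_id]

    num_blocks = -(-len(x) // BLOCK_SIZE)            # ceiling division
    for thread_id in range(num_blocks * BLOCK_SIZE): # loop stands in for parallel threads
        kernel(thread_id)
    return out


def add_block_style(x, y):
    """Blocked program: the program body says what ONE block does with a tile of elements."""
    out = [0.0] * len(x)

    def program(block_id):
        offsets = range(block_id * BLOCK_SIZE, (block_id + 1) * BLOCK_SIZE)
        in_bounds = [i for i in offsets if i < len(x)]  # plays the role of a mask
        for i in in_bounds:                          # per-element scheduling is not our concern here
            out[i] = x[i] + y[i]

    num_blocks = -(-len(x) // BLOCK_SIZE)
    for block_id in range(num_blocks):               # loop stands in for parallel blocks
        program(block_id)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
print(add_scalar_style(x, y))
print(add_block_style(x, y))
```

Both versions compute the same result. The difference is the unit the programmer reasons about: a single thread in the first case, a whole block of data in the second.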

Therefore, Triton has gained popularity and is helping researchers and developers with GPU programming.

Book Notes: LLM Engineer’s Handbook

8 minute read

Published: June 23, 2025

Book:

Amazon link

Notes

An LLM engineer should have knowledge of the following:
- Data preparation
- LLM fine-tuning
- Inference Optimization
- Product Deployment (MLOps)
What the book will teach:
- Data Engineering
- Supervised Fine-tuning
- Model Evaluation
- Inference Optimization
- RAG
Every project needs planning. The three planning steps the book talks about are as follows:
1. Understand the problem
  - What do we want to build?
  - Why are we building it?
2. Minimum Viable Product (MVP) reflecting a real-world scenario.
  - Bridge the gap between the ideal and the reality of what can be built.
  - What steps are required to build it?
  - not clear on this part
3. System Design step
  - Core architecture and design choices
  - How are we going to build it?
What the book covers:

Chapter 1: Understanding

The chapter covers the following topics:
- Understanding the LLM Twin concept
- Planning the MVP of the LLM Twin product.
- Building ML systems with feature/training/inference pipelines
- Designing the system architecture of the LLM Twin
The key to the LLM Twin lies in the following:
- What data we collect
- How we preprocess the data
- How we feed the data into the LLM
- How we chain multiple prompts for the desired results
- How we evaluate the generated content
We have to consider how to do the following (MLOps):
- Ingest, clean, and validate fresh data
- Training versus inference setups
- Compute and serve features in the right environment
- Serve the model in a cost-effective way
- Version, track, and share the datasets and models
- Monitor your infrastructure and models
- Deploy the model on a scalable infrastructure
- Automate the deployments and training
Every classic software architecture follows Database -> Business Logic -> UI, and any layer can be as complex as required. But what do we need for ML? That is the FTI architecture: Feature -> Training -> Inference.

FTI Architecture

To conclude, the most important thing you must remember about the FTI pipelines is their interface:

The feature pipeline takes in data and outputs the features and labels saved to the feature store.
The training pipeline queries the feature store for features and labels and outputs a model to the model registry.
The inference pipeline uses the features from the feature store and the model from the model registry to make predictions.
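The three interfaces can be sketched with two dictionaries standing in for the feature store and the model registry. Everything here (the dictionary keys, the one-weight toy model, the raw rows) is an assumption made for illustration, not the book's implementation.

```python
feature_store = {}    # stand-in for a real feature store
model_registry = {}   # stand-in for a real model registry


def feature_pipeline(raw_rows):
    """Raw data in, features and labels out (saved to the feature store)."""
    feature_store["features"] = [float(x) for x, _ in raw_rows]
    feature_store["labels"] = [float(y) for _, y in raw_rows]


def training_pipeline():
    """Features and labels in, model out (saved to the model registry)."""
    xs, ys = feature_store["features"], feature_store["labels"]
    weight = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    model_registry["latest"] = {"weight": weight}   # toy model: y = weight * x


def inference_pipeline():
    """Features and model in, predictions out."""
    weight = model_registry["latest"]["weight"]
    return [weight * x for x in feature_store["features"]]


feature_pipeline([(1, 2), (2, 4), (3, 6)])   # hypothetical raw data
training_pipeline()
predictions = inference_pipeline()
print(predictions)
```

Because the pipelines only communicate through the store and the registry, each one could be rewritten or scaled independently as long as those interfaces stay the same.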

Requirements of the ML system from a purely technical perspective:

Data
- collect
- standardize
- clean the raw data
- create an instruction dataset for fine-tuning an LLM
- chunk and embed the cleaned data. Store the vectorized data into a vector DB for RAG.
Training
- Fine-tune LLMs of various sizes
- Fine-tune on instruction datasets of multiple sizes.
- Switch between LLM types
- Track and compare experiments.
- Test potential production LLM candidates before deploying them.
- Automatically start the training when new instruction datasets are available.
Inference
- A REST API interface for clients to interact with the LLM
- Access to the vector DB in real time for RAG.
- Inference with LLMs of various sizes
- Autoscaling based on user requests
- Automatically deploy the LLMs that pass the evaluation step
LLMOps
- Instruction dataset versioning, lineage, and reusability
- Model versioning, lineage, and reusability
- Experiment tracking
- Continuous training, continuous integration, and continuous delivery (CT/CI/CD)
- Prompt and system monitoring

LLM Twin high-level architecture

Chapter 2: Tooling and Installation

The chapter covers:
- Python ecosystem and project installation
- MLOps and LLMOps tooling
- Databases for storing unstructured and vector data
- Preparing for AWS
Any Python project needs three fundamental tools: the Python interpreter, dependency management, and a task execution tool.
Poetry is one of the most popular dependency and virtual environment managers within the Python ecosystem.
An orchestrator is a system that automates, schedules, and coordinates all your ML pipelines. It ensures that each pipeline—such as data ingestion, preprocessing, model training, and deployment—executes in the correct order and handles dependencies efficiently.
ZenML is one such orchestrator.
- It orchestrates using pipelines and steps, which are just Python functions: steps are called inside pipeline functions. This requires writing modular code.
- ZenML transforms any step output into an artifact.
- An artifact is any file produced during the ML lifecycle.
Experiment Tracker:
- Training ML models is an entirely iterative and experimental process. Therefore, an experiment tracker is required.
- Comet ML is one tool that helps us with this.
Prompt monitoring
- You cannot use standard logging tools, as prompts are complex and unstructured chains.
- Opik is simpler to use than other prompt monitoring tools.
MongoDB: a NoSQL database.
Qdrant: a vector database.
For our MVP, AWS is the perfect option, as it provides robust features for everything we need, such as S3 (object storage), ECR (container registry), and SageMaker (compute for training and inference).

Chapter 3: Data Engineering

In this chapter, we will study the following topics:
- Designing the LLM Twin’s data collection pipeline
- Implementing the LLM Twin’s data collection pipeline
- Gathering raw data into the data warehouse
Collect and curate the dataset
From raw data, Extract -> Transform -> Load into MongoDB. (ETL)
- crawling
- standardizing data
- load into data warehouse
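A minimal sketch of the Extract -> Transform -> Load flow is shown below. The list `warehouse` stands in for the MongoDB collection, `extract` returns hard-coded documents instead of crawling, and all field names are assumptions made for this example.

```python
warehouse = []  # in-memory stand-in for the MongoDB collection


def extract():
    """Pretend crawler: returns raw documents (hypothetical hard-coded data)."""
    return [
        {"Title": "  My First Post ", "body": "Hello   world", "source": "BLOG"},
        {"Title": "Notes on RAG", "body": "Retrieval\naugmented generation", "source": "blog"},
    ]


def transform(raw_docs):
    """Standardize field names, whitespace, and casing."""
    return [
        {
            "title": doc["Title"].strip(),
            "content": " ".join(doc["body"].split()),
            "platform": doc["source"].lower(),
        }
        for doc in raw_docs
    ]


def load(clean_docs):
    """Append the standardized documents to the warehouse."""
    warehouse.extend(clean_docs)


load(transform(extract()))
print(warehouse)
```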

Chapter 4: RAG Feature Pipeline

Retrieval-augmented generation (RAG)
The chapter teaches you what RAG is and how to implement it.
The main sections of this chapter are:
- Understanding RAG
- An overview of advanced RAG
- Exploring the LLM Twin’s RAG feature pipeline architecture
- Implementing the LLM Twin’s RAG feature pipeline

Chapter 5: Supervised Fine-Tuning

SFT refines the capabilities of a pre-trained model (one that can already predict the next token in a sequence) by training it to predict instruction-answer pairs.
It adapts the general language understanding of a pre-trained LLM to a specific application, in this case making it more conversational.
In this chapter, the authors cover the following topics:
- Creating a high-quality instruction dataset
- SFT techniques
- Implementing fine-tuning in practice

Chapter 6: Fine-Tuning with Preference Alignment

SFT cannot capture human preferences about how a conversation should go, so we use preference alignment, specifically Direct Preference Optimization (DPO).
The authors cover the following topics in this chapter:
- Understanding preference datasets
- How to create our own preference dataset
- Direct preference optimization (DPO)
- Implementing DPO in practice to align our model

Chapter 7: Evaluating LLMs

There is no unified approach to measuring a model’s performance, but there are patterns and recipes that we can adapt to specific use cases.
The chapter covers:
- Model evaluation
- RAG evaluation
- Evaluating TwinLlama-3.1-8B

Chapter 8: Inference Optimization

Some tasks, like document generation, can take hours, while others, like code completion, take only a small amount of time, which is why optimizing inference is so important. The things that are optimized are latency (how quickly the first token is generated), throughput (the number of tokens generated per second), and the memory footprint of the LLM.
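As a small worked example of the first two metrics, the function below derives them from token timestamps. The timestamps are hypothetical, and measuring throughput over the whole request (including the wait for the first token) is one possible convention, assumed here for simplicity.

```python
def inference_metrics(request_start, token_times):
    """Return (latency to the first token in seconds, tokens generated per second)."""
    latency = token_times[0] - request_start
    total_time = token_times[-1] - request_start
    throughput = len(token_times) / total_time
    return latency, throughput


# Hypothetical timestamps (in seconds) at which each token was emitted.
latency, throughput = inference_metrics(0.0, [0.5, 0.6, 0.7, 0.8, 0.9])
print(f"latency: {latency:.2f} s, throughput: {throughput:.2f} tokens/s")
```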
The chapter covers:
- Model optimization strategies
- Model parallelism
- Model quantization

Chapter 9: RAG Inference Pipeline

This is where the magic happens for the RAG system.
The chapter covers the following topics:
- Understanding the LLM Twin’s RAG inference pipeline
- Exploring the LLM Twin’s advanced RAG techniques
- Implementing the LLM Twin’s RAG inference pipeline

Chapter 10: Inference Pipeline Deployment

The chapter covers:
- Criteria for choosing deployment types
- Understanding inference deployment types
- Monolithic versus microservices architecture in model serving
- Exploring the LLM Twin’s inference pipeline deployment strategy
- Deploying the LLM Twin service
- Autoscaling capabilities to handle spikes in usage

Chapter 11: MLOps and LLMOps

This chapter covers:
- The path to LLMOps: Understanding its roots in DevOps and MLOps
- Deploying the LLM Twin’s pipelines to the cloud
- Adding LLMOps to the LLM Twin

Dynamic Arrays

2 minute read

Published: June 23, 2025

Resources:

Video: Dynamic Arrays
Link: The Simple and Elegant Idea behind Efficient Dynamic Arrays
Link: What if you had to invent a dynamic array?

Dynamic Array

A dynamic array should be able to change its size dynamically.
- It should be able to add/delete elements fast.
- It should be able to insert/delete an element in the middle.
We need to make this as efficient as possible, so let’s work out what we would have done if we had to invent it ourselves.
First, we take its functionality and try to simplify it as much as possible.
Here, let’s take only the ‘adding dynamically’ part.

Adding Dynamically

Say we have a fixed array of 4 elements. How can we make it possible to add an element to it?
We know that we need to decide beforehand on a fixed amount of space for our task in order to use memory (refer to how memory works).

Alternative #1: Make an array of 5 elements, then copy all the data to the new array.

With this we can make a dynamic array. However, it is very expensive to do for a huge amount of data.
For example, to build an array of length 1M this way, we need to perform roughly 500 billion copies (about $N^2/2$).
Let’s assume we are continuously adding elements to the array. For the 5th element we need 5 copy operations. For the 6th element, we first create a new array of size 6 and copy the 6 elements, so our total is 5+6 operations. For the 7th element, it is 5+6+7. In big O notation this is $O(N^2)$.

Alternative #2: Make a new array whose size is the fixed array’s size + 8 (say).

This removes a lot of copy operations; however, it is still of $O(N^2)$ complexity.

Alternative #3: Make the new array double the size of the current array.

Here, the total number of copy operations needed to build an array of size N is proportional to N (fewer than 2N), so its complexity is $O(N)$.
This is a very cool problem, so if you are math savvy, take out a piece of paper and do the math; it is quite fun to think about. Find out why this is $O(N)$.
This is how programming languages implement dynamic arrays.
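The three alternatives can be compared by simply counting copies. In this sketch, `count_copies` and its `grow` parameter are names invented for the example, and only copies of already-stored elements are counted (writing the new element itself is not).

```python
def count_copies(num_appends, grow, initial_capacity=4):
    """Count element copies caused by resizing while appending num_appends items."""
    capacity, size, copies = initial_capacity, 0, 0
    for _ in range(num_appends):
        if size == capacity:        # array is full: allocate a bigger one
            capacity = grow(capacity)
            copies += size          # every existing element is copied over
        size += 1
    return copies


n = 1000
copies_plus_one = count_copies(n, lambda cap: cap + 1)    # Alternative #1
copies_plus_eight = count_copies(n, lambda cap: cap + 8)  # Alternative #2
copies_doubling = count_copies(n, lambda cap: cap * 2)    # Alternative #3
print(copies_plus_one, copies_plus_eight, copies_doubling)
```

For 1000 appends this prints `499494 62500 1020`. Growing by 8 divides the work by roughly 8 but still grows quadratically, while doubling stays below 2N.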

For deletion, if the number of filled elements drops below half of the capacity, we reduce the size of the array by half. This keeps memory usage in check as well.
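Putting the doubling and halving rules together gives a small dynamic array. This is a minimal sketch under stated assumptions: a Python list of fixed length plays the role of raw memory, and the class and method names are made up for illustration.

```python
class DynamicArray:
    """Growable array on top of a fixed-size buffer (a Python list used as raw storage)."""

    def __init__(self, capacity=4):
        self._capacity = capacity
        self._size = 0
        self._buffer = [None] * capacity

    def _resize(self, new_capacity):
        new_buffer = [None] * new_capacity
        for i in range(self._size):          # copy existing elements to the new buffer
            new_buffer[i] = self._buffer[i]
        self._buffer = new_buffer
        self._capacity = new_capacity

    def append(self, value):
        if self._size == self._capacity:     # full: double the capacity
            self._resize(self._capacity * 2)
        self._buffer[self._size] = value
        self._size += 1

    def pop(self):
        if self._size == 0:
            raise IndexError("pop from empty array")
        self._size -= 1
        value = self._buffer[self._size]
        self._buffer[self._size] = None
        # Shrink when fewer than half of the slots are used.
        if self._size < self._capacity // 2 and self._capacity > 1:
            self._resize(self._capacity // 2)
        return value

    def to_list(self):
        return self._buffer[:self._size]


arr = DynamicArray()
for i in range(5):
    arr.append(i)
capacity_after_appends = arr._capacity
arr.pop()
arr.pop()
capacity_after_pops = arr._capacity
print(capacity_after_appends, capacity_after_pops, arr.to_list())
```

Shrinking at exactly half can cause repeated resizing when the size hovers around the threshold, so a common variation is to shrink only once the array is a quarter full.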

In the same way, the spare capacity lets us insert or delete in the middle or at the front of the array without reallocating every time, although the elements after that position still have to be shifted.

Array and Hashing

less than 1 minute read

Published: June 23, 2025

Prerequisites:

Bishnu Khadka
