LLM Inference on Lambda

When it comes to deploying machine learning models, there are various options to choose from depending on your scalability and cost requirements. A dedicated instance, for example, offers a stable environment for serving models but often falls short in scalability, making it less ideal for workloads with unpredictable traffic patterns. This is where a scalable, distributed system like AWS Lambda comes in. What does Lambda provide? AWS Lambda offers a serverless architecture that scales automatically with demand, ensuring you only pay for the actual compute time used.
Read more

OpenMP Parallelization

OpenMP (Open Multi-Processing) is a high-level API for parallel programming in C, C++, and Fortran. It’s a widely used, open-standard, and platform-independent library that allows developers to write parallel code that can run on multiple CPU cores, reducing the overall execution time of a program. OpenMP does the heavy lifting in several areas: Thread Management: OpenMP automatically manages thread creation, synchronization, and termination, freeing the developer from the complexity of thread management.
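
As a small illustration of that idea (a sketch, not code from the full post), the snippet below parallelizes an array sum with a single `#pragma omp parallel for` directive; OpenMP takes care of creating the thread team, splitting the loop iterations, and combining the per-thread partial sums via the `reduction` clause.

```c
#include <omp.h>
#include <stdio.h>

/* Minimal sketch: sum an array in parallel.
   OpenMP creates the threads, divides the loop iterations among them,
   and merges the per-thread partial sums through the reduction clause,
   so no manual thread management is needed. */
int main(void) {
    enum { N = 1000000 };
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0;

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += a[i];
    }

    printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```

Compiled with an OpenMP-aware compiler (e.g. `gcc -fopenmp`), the same source runs serially or in parallel depending on the flags, which is part of OpenMP's appeal.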
Read more

Understanding floating point numbers

Introduction: In 1991, a Patriot missile defense system in Saudi Arabia failed to intercept an incoming Scud missile, resulting in the tragic deaths of 28 American soldiers. The cause? A tiny floating point error that accumulated over time, throwing off the system’s timing by just 0.34 seconds. This incident highlights the critical importance of understanding floating point numbers in computer systems. Floating point representation allows computers to work with a wide range of real numbers, from the microscopic to the astronomical.
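
As a quick, hedged illustration of that kind of drift (not the Patriot system's actual arithmetic), the snippet below repeatedly adds 0.1, which has no exact binary representation, and compares the accumulated value against the exact answer.

```c
#include <stdio.h>

/* Illustrative sketch: 0.1 cannot be represented exactly in binary
   floating point, so adding it repeatedly drifts away from the
   mathematically exact result. */
int main(void) {
    float t = 0.0f;
    for (int i = 0; i < 360000; i++) {  /* 10 ticks per second for 10 hours */
        t += 0.1f;
    }
    printf("accumulated: %f  exact: %f  drift: %f seconds\n",
           t, 36000.0, t - 36000.0);
    return 0;
}
```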
Read more

Distributed Training

This blog dives into the challenges and innovations in distributed training for deep learning models. We explore how GPU and TPU accelerators are crucial for managing large neural networks and detail the use of gradient descent in model optimization. Key discussions include data-parallel and distributed data-parallel strategies, with insights into implementations like Horovod’s ring-allreduce and PyTorch’s process-based approach. This overview provides a solid foundation for understanding how modern computational resources can be leveraged to enhance deep learning training efficiency.
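
As a rough sketch of the synchronization step at the heart of data-parallel training (using plain MPI as a stand-in for the Horovod and PyTorch collectives mentioned above), each worker computes gradients on its own data shard and an allreduce averages them so every replica applies the same update.

```c
#include <mpi.h>
#include <stdio.h>

#define NPARAMS 4

/* Hedged sketch of the data-parallel gradient exchange:
   every rank computes local gradients on its own mini-batch,
   MPI_Allreduce sums them across ranks, and dividing by the
   world size yields the averaged gradient each replica applies. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, world;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world);

    /* stand-in for gradients computed on this rank's data shard */
    float local_grad[NPARAMS], avg_grad[NPARAMS];
    for (int i = 0; i < NPARAMS; i++) local_grad[i] = (float)(rank + 1);

    MPI_Allreduce(local_grad, avg_grad, NPARAMS, MPI_FLOAT,
                  MPI_SUM, MPI_COMM_WORLD);
    for (int i = 0; i < NPARAMS; i++) avg_grad[i] /= (float)world;

    if (rank == 0)
        printf("averaged gradient[0] = %f\n", avg_grad[0]);

    MPI_Finalize();
    return 0;
}
```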

Read more

Online Softmax

One of the most important tasks in deep learning is classification. This involves predicting the class to which a given input belongs. Models such as convolutional neural networks (CNNs) and large language models use classification layers. These models produce output predictions for all possible classes, but these predictions are not immediately usable as they can be any floating point number. The softmax function is essential in converting these raw model outputs, known as logits, into probabilities that sum to one, making them interpretable.
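
As a hedged sketch of the idea in the title (an illustration, not necessarily the formulation in the full post), the function below computes a numerically stable softmax in a single streaming pass over the logits: it tracks a running maximum and rescales the running denominator whenever a larger logit appears, so the exponentials never overflow.

```c
#include <math.h>
#include <stdio.h>

/* One-pass ("online") softmax sketch: maintain the running maximum m
   and the running denominator d, rescaling d whenever a larger logit
   is seen, then normalize so the probabilities sum to one. */
void online_softmax(const float *logits, float *probs, int n) {
    float m = -INFINITY;   /* running maximum of the logits seen so far */
    float d = 0.0f;        /* running sum of exp(logit - m) */
    for (int i = 0; i < n; i++) {
        float m_new = logits[i] > m ? logits[i] : m;
        d = d * expf(m - m_new) + expf(logits[i] - m_new);
        m = m_new;
    }
    for (int i = 0; i < n; i++)
        probs[i] = expf(logits[i] - m) / d;
}

int main(void) {
    float logits[] = {2.0f, 1.0f, 0.1f};
    float probs[3];
    online_softmax(logits, probs, 3);
    printf("%f %f %f\n", probs[0], probs[1], probs[2]);
    return 0;
}
```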
Read more