LLM Inference on Lambda
When it comes to deploying machine learning models, there are several options to choose from depending on your scalability and cost requirements. A dedicated instance, for example, offers a stable environment for serving models but often falls short on scalability, making it less ideal for workloads with unpredictable traffic patterns. This is where a scalable, serverless service like AWS Lambda comes in.
What does Lambda provide?#
AWS Lambda offers a serverless architecture that scales automatically with demand, so you only pay for the compute time you actually use. For lightweight, quantized machine learning models, especially those fine-tuned for specific tasks, Lambda provides an efficient deployment option. With support for up to 6 vCPUs and 10 GB of memory, it can handle smaller models effectively: enough to run some mobile YOLO models and LLMs that have been optimized for CPU inference with GGML.
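Two settings are worth raising before deploying: the memory size (Lambda allocates vCPUs in proportion to memory, so the 6 vCPUs are only reached near the 10 GB ceiling) and the /tmp ephemeral storage that will hold the model file, which defaults to 512 MB. A minimal boto3 sketch, with the function name as a placeholder:

import boto3

lambda_client = boto3.client("lambda")
lambda_client.update_function_configuration(
    FunctionName="llama-inference",    # placeholder; use your function's name
    MemorySize=10240,                  # maximum memory in MB; vCPU count scales with it
    EphemeralStorage={"Size": 10240},  # grow /tmp (in MB) so the GGUF model file fits
)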
However, deploying complex models with a runtime like llama.cpp in the virtualized Lambda environment comes with unique challenges:
- Restrictions on specialized CPU instructions (such as AVX-512 and AMX), which exist on some of the servers AWS runs Lambda on but cannot be relied upon by your function.
- Cold starts: if a Lambda function sits idle, its execution environment is reclaimed, and the next invocation pays the startup cost of allocating a machine and starting the server again (a warm-up sketch follows this list).
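One common mitigation is to expose a cheap health-check path in the handler (as the handler below does) and invoke it on a schedule, for example from an EventBridge rule, so a warm environment with the model already loaded stays available. A minimal sketch, with the function name as a placeholder:

import json
import boto3

client = boto3.client("lambda")

def warm_function():
    # Send a lightweight health-check request; the handler loads the model
    # into memory and returns without running any inference.
    client.invoke(
        FunctionName="llama-inference",  # placeholder; use your function's name
        InvocationType="RequestResponse",
        Payload=json.dumps({"body": {"health_check": True}}),
    )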
Running compute-hungry machine learning models under these constraints requires some careful configuration.
Below is a basic configuration for running llama.cpp on AWS Lambda. With this setup you should be able to run nearly any model on Lambda once it has been converted to the GGUF format.
from llama_cpp import Llama

# The Llama object is cached at module level so it survives warm invocations
llm = None

def model_exists_in_tmp():
    # Stub: check whether the GGUF file is already present in /tmp
    return False

def download_model_to_tmp():
    # Stub: download the GGUF model (for example from S3) into /tmp
    pass

def initialize_model():
    global llm
    if not model_exists_in_tmp():
        download_model_to_tmp()
    llm = Llama(
        model_path="/tmp/llama-model.gguf",
    )
    return True

def lambda_handler(event, context):
    # A health-check invocation is used to warm the function and load the model
    if event.get("body", {}).get("health_check"):
        return initialize_model()
    if llm is None:
        initialize_model()
    response = llm(
        "Q: Name the planets in the solar system? A: ",  # prompt
        max_tokens=32,
        stop=["Q:", "\n"],
        echo=True,
    )
    return response
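model_exists_in_tmp and download_model_to_tmp are left as stubs above; one straightforward way to fill them in is to pull the GGUF file from S3 with boto3 (which the Dockerfile below installs). A rough sketch, with the bucket and key as placeholders:

import os
import boto3

MODEL_PATH = "/tmp/llama-model.gguf"

def model_exists_in_tmp():
    # /tmp persists across warm invocations of the same execution environment
    return os.path.exists(MODEL_PATH)

def download_model_to_tmp():
    # Placeholder bucket and key; point these at wherever the model is stored
    s3 = boto3.client("s3")
    s3.download_file("my-model-bucket", "models/llama-model.gguf", MODEL_PATH)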
Dockerfile required for the build#
# Build stage: compile llama-cpp-python and collect the Python dependencies
FROM --platform=linux/amd64 python:3.11-slim as build-image
ARG FUNCTION_DIR="/function"
RUN mkdir -p ${FUNCTION_DIR}
WORKDIR ${FUNCTION_DIR}
COPY inference.py .
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
build-essential \
cmake \
libopenblas-dev \
libgomp1 \
pkg-config \
&& rm -rf /var/lib/apt/lists/*
# Build llama-cpp-python with OpenBLAS, and with AVX/AVX2/AVX-512/FMA/F16C disabled
# because these instruction sets cannot be relied upon inside Lambda
RUN CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS \
-DCMAKE_CXX_FLAGS=\"-march=x86-64\" -DLLAMA_AVX512=OFF \
-DLLAMA_AVX2=OFF -DLLAMA_AVX=OFF -DLLAMA_FMA=OFF \
-DLLAMA_F16C=OFF -DLLAMA_BUILD_SERVER=1 \
-DLLAMA_CUBLAS=OFF -DGGML_NATIVE=OFF" \
pip install --target ${FUNCTION_DIR} llama-cpp-python
RUN pip install --target ${FUNCTION_DIR} --no-cache-dir boto3
RUN pip install --target ${FUNCTION_DIR} --no-cache-dir awslambdaric==2.0.7
COPY --from=public.ecr.aws/lambda/python:3.11 /var/runtime /var/runtime
COPY --from=public.ecr.aws/lambda/python:3.11 /var/lang /var/lang
COPY --from=public.ecr.aws/lambda/python:3.11 /usr/lib64 /usr/lib64
COPY --from=public.ecr.aws/lambda/python:3.11 /opt /opt
# Runtime stage: slim image with only the shared libraries and the built packages
FROM --platform=linux/amd64 python:3.11-slim as runtime-image
ARG FUNCTION_DIR="/function"
WORKDIR ${FUNCTION_DIR}
RUN apt-get update \
&& apt-get -y install libopenblas-dev libgomp1 \
&& rm -rf /var/lib/apt/lists/*
COPY --from=build-image ${FUNCTION_DIR} ${FUNCTION_DIR}
COPY --from=public.ecr.aws/lambda/python:3.11 /var/runtime /var/runtime
# Start the AWS Lambda Runtime Interface Client, which invokes the handler below
ENTRYPOINT [ "/usr/local/bin/python", "-m", "awslambdaric" ]
CMD [ "inference.lambda_handler" ]
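Once the image is built and pushed to ECR, the function can be created from it. Note that Lambda's default 3-second timeout is far too short for LLM inference. A rough boto3 sketch, with the image URI, role ARN, and function name as placeholders:

import boto3

client = boto3.client("lambda")
client.create_function(
    FunctionName="llama-inference",  # placeholder name
    PackageType="Image",
    Code={"ImageUri": "<account>.dkr.ecr.<region>.amazonaws.com/llama-lambda:latest"},
    Role="arn:aws:iam::<account>:role/<lambda-execution-role>",
    Timeout=300,                     # generation on CPU can take minutes
    MemorySize=10240,
    EphemeralStorage={"Size": 10240},
)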