🔧  Training Classifiers with Modal

Developer Blog: Training and Deploying Hundreds of Classifiers with Modal Labs



At Taylor, we offer powerful off-the-shelf text classification models for complex taxonomies. For example, we help engineering teams at recruiting and HR companies categorize a job description into one of 1000 O*NET occupation codes. We have pre-trained models for topic classification, occupation classification, intent classification, and more. But for customers that want to use their own private taxonomy, we provide custom models, trained by us or through our self-serve training flow. Training and deploying custom models per-user presents several engineering challenges, including cold-starts, autoscaling, and long-running background processes. In this blog, I'll explain how we tackled these challenges using the best-in-class infrastructure provided by Modal Labs.

What is Modal?

Modal is a serverless platform designed for AI and data-intensive workloads. It lets developers deploy a function to the cloud that auto-scales to meet demand, and scales to zero when not in use. Compared to AWS Lambda, Modal offers a greatly improved developer experience, along with killer features like containers with GPUs, simple Python container definitions, and sandboxes for safe execution of untrusted code. For startups, using Modal may mean never having to set up Celery or a similar tool for background workers, as any job too slow or heavy for the webserver can be offloaded to a Modal function that runs in the background for up to 24 hours.

At Taylor, we began using Modal to finetune open-source language models like T5, Llama, and Mistral. Modal greatly simplified the infrastructure headaches associated with finetuning, like finding GPUs, setting up the environment without breaking CUDA, and most importantly, remembering to spin it all down when the job is done. As we grew, we found more and more uses for Modal in our infrastructure. We now use it to send Slack notifications, schedule cron jobs to test our API, roll over subscriptions, and more.

In the following sections, I'll focus specifically on the infrastructure challenges associated with training and deploying custom classification models, and how we built a robust system for managing this on Modal.

Training Classifiers

Taylor classification models are compact and don't require a GPU for inference, but training them is still a complex process. We offer customers the ability to train classifiers with no data, or with only unlabeled data. This means our job is a lot harder than import xgboost. Instead, our semi-supervised labeling framework uses language models to generate, filter, and label training examples, and our ensemble search trains and tests hundreds of combinations of different models to find the best one. For the complex taxonomies that are our bread and butter, this can take a while—anywhere from a few minutes to a few hours.
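
To make the ensemble search concrete, here's a heavily simplified sketch in scikit-learn of what searching over candidate model combinations can look like. The vectorizers and classifiers below are illustrative stand-ins rather than our actual candidates, and the real pipeline also handles the LLM-generated and LLM-filtered training data described above.

# illustrative sketch of an ensemble search, not our production pipeline
from itertools import product

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def ensemble_search(texts: list[str], labels: list[str]):
    vectorizers = [
        TfidfVectorizer(ngram_range=(1, 1)),
        TfidfVectorizer(ngram_range=(1, 2)),
    ]
    classifiers = [LogisticRegression(max_iter=1000), LinearSVC()]

    best_score, best_pipeline = 0.0, None
    for vectorizer, classifier in product(vectorizers, classifiers):
        candidate = make_pipeline(vectorizer, classifier)
        # cross-validate each combination and keep the best one
        score = cross_val_score(candidate, texts, labels, cv=3).mean()
        if score > best_score:
            best_score, best_pipeline = score, candidate
    return best_pipeline.fit(texts, labels)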

[Architecture diagram]

A long-running job like this has to happen in the background—the webserver starts the training, informs the user, and then the user can go run some errands or make toast while the model is training. Very lightweight background jobs like asynchronous logging can run on the webserver (e.g. with FastAPI's BackgroundTasks API), but model training does not fit the bill, as it is resource-intensive and could slow the server down substantially. It also requires packages like torch that aren't used by the server and would slow down the build. So, this has to happen somewhere else, and it should ideally scale to zero, since training jobs are bursty and unpredictable.
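
For contrast, here's roughly what that lightweight path looks like with FastAPI's BackgroundTasks; the log_request helper is hypothetical, and anything as heavy as model training should not go through this route.

# lightweight background work that can stay on the webserver
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

def log_request(user_id: str) -> None:
    # stand-in for real asynchronous logging
    print(f"classification requested by {user_id}")

@app.post("/classify")
def classify(user_id: str, background_tasks: BackgroundTasks):
    # runs after the response is sent, in the same webserver process
    background_tasks.add_task(log_request, user_id)
    return {"label": "..."}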

Since AWS Lambda has 15-minute timeouts and no GPUs, an AWS solution might instead involve a custom workload on SageMaker or AWS Batch, triggered whenever a customer wants to train a model. Setting this up and starting jobs from the webserver would require a lot of painful DevOps, costing us our most valuable asset: developer time. Instead, with Modal, any experiment that works in a Jupyter notebook can instantly become production code by wrapping it in a function and adding a decorator.

# this is a Modal application
 
from modal import Image, App
 
image = Image.debian_slim().pip_install("torch")
 
app = App('train-custom-model')
 
@app.function(image=image)
def really_expensive_training_job(model_name: str):
    import torch
    # paste your favorite jupyter notebook code here
    [...]

After deploying (a single modal deploy from the CLI), this function can be looked up from any Python runtime, including our webserver, and spawned to run on Modal's infrastructure for up to 24 hours (with GPUs if needed).

# do this on the webserver
import modal
from fastapi import FastAPI
 
app = FastAPI()
 
@app.get("/training_job")
def start_training_job(model_name: str):
    training_fn = modal.Function.lookup(
        'train-custom-model',
        'really_expensive_training_job'
    )
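    # run in the background on Modal; spawn returns immediately with a handle to the job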
    training_fn.spawn(model_name)
 
    return {
        "message": "your training job has begun!"
    }

The ability to spawn jobs from one Python environment that run in a totally different Python environment (different packages, more CPUs, GPUs, etc.) is a great benefit to machine learning engineers, and makes it simple to train models on demand.
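
As a sketch of that separation, everything about the remote environment lives in the decorator. The app name, GPU type, CPU count, timeout, and package list below are illustrative values, not our production configuration.

# the remote environment is declared entirely in the decorator
from modal import App, Image

gpu_image = Image.debian_slim().pip_install("torch", "transformers")

app = App('finetune-llm')

@app.function(image=gpu_image, gpu="A10G", cpu=8.0, timeout=60 * 60 * 12)
def finetune(model_name: str):
    import torch  # available here even though the webserver never installs it
    ...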

Deploying Classifiers

Classifiers trained on Taylor are often used to support live, user-facing product features, which means they have to be fast. This presents a different set of challenges from training, as the customer expects results in a few seconds or milliseconds, rather than a few hours. The two main technical hurdles for hosting lots of custom classifiers are scaling and reducing latency.

Because Taylor allows all users to create as many custom classifiers as they want, we end up with lots of custom models to host, some of which may only be tested once and never used again. If each model had dedicated compute, we'd quickly run out of money hosting models that were sitting idle. So for custom models, autoscaling and scaling to zero are an absolute must. Unfortunately, in typical autoscaling setups, cold-starts after scaling to zero add at least 20-30 seconds of latency to the first request while the compute is being provisioned and the image is being set up. Machine learning model inference, which additionally requires downloading weights and loading them into memory, makes this worse.

Of all the serverless inference solutions we've tried, Modal has done the most to reduce the cold-start penalty, even if it hasn't eliminated it altogether. A ton of their engineering effort has gone into keeping cold-starts as short as possible with system optimizations, such as lazily loading container images. And because Modal exposes so much control to the developer, there are further remediations available to cut cold-start times: downloading models at build time and "baking" them into the image, taking resumable snapshots of memory to amortize slow imports, and using familiar class syntax to re-use Python objects across invocations.
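
Putting a few of those together, here's a minimal sketch of what a classifier with baked-in weights and memory snapshots can look like; the model name, helper function, and class structure are illustrative rather than our production setup.

# illustrative sketch of cold-start remediations on Modal
import modal

def download_weights():
    # runs at image build time, so the weights end up baked into the image
    from huggingface_hub import snapshot_download
    snapshot_download("distilbert-base-uncased-finetuned-sst-2-english")

image = (
    modal.Image.debian_slim()
    .pip_install("torch", "transformers", "huggingface_hub")
    .run_function(download_weights)
)

app = modal.App('custom-classifier')

@app.cls(image=image, enable_memory_snapshot=True)
class Classifier:
    @modal.enter(snap=True)
    def load(self):
        # slow imports and weight loading happen once, then get snapshotted
        from transformers import pipeline
        self.pipe = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )

    @modal.method()
    def predict(self, text: str):
        return self.pipe(text)

Baking the weights in at build time removes a network download from the cold-start path, and the memory snapshot means slow imports and model loading are paid once rather than on every cold boot.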

For high-traffic models, we can also control how many warm instances are always on, avoiding cold-starts altogether. This is as simple as setting the keep_warm parameter in the function decorator:

@app.function(keep_warm=10)
def important_classifier(text: str):
    return predict(text)

With this, 10 copies of the function will always be ready to run, and if the workload grows beyond that, Modal will automatically scale up to meet demand. This is a great feature for high-traffic models, as it allows us to keep latency low and costs predictable.

Conclusion

When building a startup, the only advantage you have over bigger, established companies is shipping speed (and maybe taste). Hosting our core infrastructure on Modal instead of an established hyperscaler like AWS sounds risky—but over the past year or so of building Taylor, using Modal has allowed us to ship new features incredibly fast, avoiding slow, painful DevOps work and focusing on the core product. As a result, we now have happy customers who use our classifiers to build recruiting platforms, compensation software, trust and safety tools, and more. If you're interested in building a super-accurate text classifier for your taxonomy, check out our product, or book time with us here.