Jobs & Deployments API

Lightning AI supports running async workloads as Jobs (single machine) or Multi-Machine Training (MMT) jobs, and deploying models as persistent Deployments (inference servers).


Jobs

Jobs run async compute workloads on a specified machine type. They are typically created from a Studio’s code snapshot.

Base path: /v1/projects/{projectId}/jobs

List Jobs

GET /v1/projects/{projectId}/jobs

Returns all jobs in a project.

Query parameters:

| Parameter | Type | Description |
|---|---|---|
| phase | string | Filter by status phase |
| name | string | Filter by job name |
| cloudspace_id | string | Filter by Studio ID |

Response:

{
  "jobs": [
    {
      "id": "job-id-001",
      "name": "my-training-job",
      "status": {
        "phase": "JOB_STATE_RUNNING"
      },
      "spec": {
        "cluster_id": "lightning-cloud",
        "requested_compute": {
          "name": "lit-a100-1"
        }
      },
      "created_at": "2024-01-15T10:00:00Z"
    }
  ]
}

Job status phases:

| Phase | Description |
|---|---|
| JOB_STATE_PENDING | Job is queued |
| JOB_STATE_RUNNING | Job is executing |
| JOB_STATE_SUCCEEDED | Job completed successfully |
| JOB_STATE_FAILED | Job failed |
| JOB_STATE_STOPPED | Job was manually stopped |
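A job is done once it reaches one of the last three phases. A minimal polling sketch using only the Python standard library (the `wait_for_job`/`get_job_phase` helpers, the `base_url` parameter, and the 30-second interval are illustrative, not part of the API):

```python
import json
import time
import urllib.request

# Phases from which a job will not transition further.
TERMINAL_PHASES = {"JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED", "JOB_STATE_STOPPED"}

def get_job_phase(base_url, project_id, job_id, auth):
    """Fetch a single job and return its status phase."""
    req = urllib.request.Request(
        f"{base_url}/v1/projects/{project_id}/jobs/{job_id}",
        headers={"Authorization": f"Basic {auth}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["status"]["phase"]

def wait_for_job(base_url, project_id, job_id, auth, poll_seconds=30):
    """Poll until the job reaches a terminal phase, then return that phase."""
    while True:
        phase = get_job_phase(base_url, project_id, job_id, auth)
        if phase in TERMINAL_PHASES:
            return phase
        time.sleep(poll_seconds)
```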

Example:

curl -s -H "Authorization: Basic ${AUTH}" \
  "https://lightning.ai/v1/projects/${PROJECT_ID}/jobs" | jq '.jobs[].name'

Get Job

GET /v1/projects/{projectId}/jobs/{id}

Returns a single job by ID.

curl -s -H "Authorization: Basic ${AUTH}" \
  "https://lightning.ai/v1/projects/${PROJECT_ID}/jobs/${JOB_ID}"

Get Job by Name

GET /v1/projects/{projectOwnerName}/{projectName}/jobs/{jobName}

Returns a job by its human-readable name.


Create Job

POST /v1/projects/{projectId}/jobs

Submits a new async job.

Request body:

{
  "name": "my-training-job",
  "spec": {
    "cluster_id": "lightning-cloud",
    "requested_compute": {
      "name": "lit-a100-1",
      "spot": false
    },
    "lightningapp_instance_id": "studio-id",
    "run": {
      "kind": "LIGHTNINGAPP_INSTANCE",
      "entrypoint_command": "python train.py --epochs 10",
      "env": [
        {"name": "BATCH_SIZE", "value": "32"}
      ]
    }
  }
}

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Unique job name within the project |
| spec.cluster_id | string | No | Cloud account ID |
| spec.requested_compute.name | string | Yes | Machine slug (e.g., lit-a100-1) |
| spec.requested_compute.spot | boolean | No | Use interruptible instances |
| spec.lightningapp_instance_id | string | No | Studio ID to snapshot for the job |
| spec.run.entrypoint_command | string | Yes | Command to execute |
| spec.run.env | array | No | Environment variable overrides |

Response: Returns the created job object.

Example:

curl -s -X POST \
  -H "Authorization: Basic ${AUTH}" \
  -H "Content-Type: application/json" \
  "https://lightning.ai/v1/projects/${PROJECT_ID}/jobs" \
  -d '{
    "name": "train-resnet",
    "spec": {
      "requested_compute": {"name": "lit-a100-1"},
      "lightningapp_instance_id": "'${STUDIO_ID}'",
      "run": {
        "entrypoint_command": "python train.py"
      }
    }
  }'
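The same request body can be assembled programmatically before POSTing it. A sketch of a hypothetical `build_job_payload` helper that mirrors the field table above (only the field names come from this document; the helper itself is illustrative):

```python
import json

def build_job_payload(name, machine, command, studio_id=None, env=None, spot=False):
    """Assemble the request body for POST /v1/projects/{projectId}/jobs."""
    spec = {
        "requested_compute": {"name": machine, "spot": spot},
        "run": {"entrypoint_command": command},
    }
    if studio_id:
        spec["lightningapp_instance_id"] = studio_id
    if env:
        # The API expects a list of {"name": ..., "value": ...} objects.
        spec["run"]["env"] = [{"name": k, "value": v} for k, v in env.items()]
    return {"name": name, "spec": spec}

payload = build_job_payload(
    "train-resnet", "lit-a100-1", "python train.py", env={"BATCH_SIZE": "32"}
)
print(json.dumps(payload, indent=2))
```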

Update Job

PUT /v1/projects/{projectId}/jobs/{id}

Updates a job (e.g., stop it by updating its desired phase).


Delete Job

DELETE /v1/projects/{projectId}/jobs/{id}

Deletes a completed or failed job and its artifacts.

curl -s -X DELETE \
  -H "Authorization: Basic ${AUTH}" \
  "https://lightning.ai/v1/projects/${PROJECT_ID}/jobs/${JOB_ID}"

Get Job Logs

GET /v1/projects/{projectId}/jobs/{id}/page-logs

Returns paginated logs from a job.

Query parameters:

| Parameter | Description |
|---|---|
| cursor | Pagination cursor |
| limit | Max number of log lines |
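Fetching all logs means following the cursor until it runs out. A sketch of that loop; note that the response field names `lines` and `next_cursor` are assumptions here (the exact response shape is not specified in this document), and the helpers are illustrative:

```python
import json
import urllib.parse
import urllib.request

def logs_url(base_url, project_id, job_id, cursor=None, limit=500):
    """Build the page-logs URL, adding the cursor once one is known."""
    params = {"limit": str(limit)}
    if cursor:
        params["cursor"] = cursor
    return (f"{base_url}/v1/projects/{project_id}/jobs/{job_id}/page-logs"
            f"?{urllib.parse.urlencode(params)}")

def iter_job_logs(base_url, project_id, job_id, auth, limit=500):
    """Yield log lines page by page until the cursor is exhausted."""
    cursor = None
    while True:
        req = urllib.request.Request(
            logs_url(base_url, project_id, job_id, cursor, limit),
            headers={"Authorization": f"Basic {auth}"},
        )
        with urllib.request.urlopen(req) as resp:
            page = json.load(resp)
        yield from page.get("lines", [])
        cursor = page.get("next_cursor")
        if not cursor:
            return
```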

Download Logs:

GET /v1/projects/{projectId}/jobs/{id}/download-logs

Returns a URL to download the full log file.


Get Job System Metrics

GET /v1/projects/{projectId}/jobs/system-metrics

Returns CPU/GPU/memory metrics for jobs in the project.


Multi-Machine Training (MMT)

MMT jobs run the same command across multiple machines simultaneously (e.g., for distributed training with torchrun).

Base path: /v1/projects/{projectId}/multi-machine-jobs

Create MMT Job

POST /v1/projects/{projectId}/multi-machine-jobs

Request body:

{
  "name": "distributed-training",
  "spec": {
    "num_machines": 4,
    "cluster_id": "lightning-cloud",
    "requested_compute": {
      "name": "lit-h100-8",
      "spot": false
    },
    "lightningapp_instance_id": "studio-id",
    "run": {
      "entrypoint_command": "torchrun --nproc_per_node=8 --nnodes=4 train.py"
    }
  }
}

| Field | Type | Description |
|---|---|---|
| spec.num_machines | integer | Number of machines to use |
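Because a `torchrun --nnodes` value that disagrees with `spec.num_machines` is an easy way to hang a distributed job, a payload builder can cross-check the two. A sketch (the `build_mmt_payload` helper is illustrative; the field names follow the request body above):

```python
import shlex

def build_mmt_payload(name, machine, num_machines, command, studio_id=None):
    """Assemble the body for POST /v1/projects/{projectId}/multi-machine-jobs.

    Sanity-checks that a torchrun --nnodes flag matches num_machines.
    """
    for token in shlex.split(command):
        if token.startswith("--nnodes="):
            nnodes = int(token.split("=", 1)[1])
            if nnodes != num_machines:
                raise ValueError(
                    f"--nnodes={nnodes} does not match num_machines={num_machines}"
                )
    spec = {
        "num_machines": num_machines,
        "requested_compute": {"name": machine},
        "run": {"entrypoint_command": command},
    }
    if studio_id:
        spec["lightningapp_instance_id"] = studio_id
    return {"name": name, "spec": spec}
```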

List MMT Jobs

GET /v1/projects/{projectId}/multi-machine-jobs

Get MMT Job

GET /v1/projects/{projectId}/multi-machine-jobs/{id}

Get MMT Job by Name

GET /v1/projects/{projectId}/multi-machine-jobs/{name}/getbyname

Get MMT Job Events

GET /v1/projects/{projectId}/multi-machine-jobs/{id}/events

Delete MMT Job

DELETE /v1/projects/{projectId}/multi-machine-jobs/{id}

Update MMT Job

PUT /v1/projects/{projectId}/multi-machine-jobs/{id}


Deployments (Inference Servers)

Deployments run persistent services (e.g., model inference APIs). They auto-scale based on traffic.

Base path: /v1/projects/{projectId}/deployments

List Deployments

GET /v1/projects/{projectId}/deployments

Returns all deployments in a project.


Get Deployment

GET /v1/projects/{projectId}/deployments/{id}


Get Deployment by Name

GET /v1/projects/{projectId}/deployments/{name}/getbyname


Get Deployment by Owner/Project/Name

GET /v1/projects/{projectOwnerName}/{projectName}/deployments/{deploymentName}


Create Deployment

POST /v1/projects/{projectId}/deployments

Request body:

{
  "name": "my-llm-server",
  "spec": {
    "cluster_id": "lightning-cloud",
    "requested_compute": {
      "name": "lit-a100-1"
    },
    "work": {
      "image": "ghcr.io/my-org/my-llm:latest",
      "env": [
        {"name": "MODEL_PATH", "value": "/models/llm"}
      ]
    },
    "min_replicas": 1,
    "max_replicas": 3
  }
}
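A sketch of a hypothetical builder for this body that also enforces the replica bounds before submission (the helper name and validation are illustrative; the field names follow the request body above):

```python
def build_deployment_payload(name, machine, image, env=None,
                             min_replicas=1, max_replicas=1):
    """Assemble the body for POST /v1/projects/{projectId}/deployments."""
    if not 0 <= min_replicas <= max_replicas:
        raise ValueError("need 0 <= min_replicas <= max_replicas")
    work = {"image": image}
    if env:
        work["env"] = [{"name": k, "value": v} for k, v in env.items()]
    return {
        "name": name,
        "spec": {
            "requested_compute": {"name": machine},
            "work": work,
            "min_replicas": min_replicas,
            "max_replicas": max_replicas,
        },
    }
```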

Delete Deployment

DELETE /v1/projects/{projectId}/deployments/{id}


Update Deployment

PUT /v1/projects/{projectId}/deployments/{id}

Update deployment configuration (e.g., replicas, environment variables).


Get Deployment Status

GET /v1/projects/{projectId}/deployments/{id}/status

Returns the current status and replica count.


Duplicate Deployment

POST /v1/projects/{projectId}/deployments/{sourceDeploymentId}/duplicate

Creates a copy of an existing deployment.


Get Deployment Telemetry

GET /v1/projects/{projectId}/deployments/{id}/telemetry

Returns request/response metrics.

GET /v1/projects/{projectId}/deployments/{id}/telemetry-aggregated

Returns aggregated telemetry.


Deployment Alerting

Create alerting policy: POST /v1/projects/{projectId}/deployments/{deploymentId}/alerting-policies

List alerting policies: GET /v1/projects/{projectId}/deployments/{deploymentId}/alerting-policies

Update alerting policy: PUT /v1/projects/{projectId}/deployments/{deploymentId}/alerting-policies

Delete alerting policy: DELETE /v1/projects/{projectId}/deployments/{deploymentId}/alerting-policies/{id}


Deployment Releases

List releases: GET /v1/projects/{projectId}/deployments/{deploymentId}/releases

Get release: GET /v1/projects/{projectId}/deployments/{deploymentId}/releases/{id}

Create release: POST /v1/projects/{projectId}/deployments/{deploymentId}/releases/{id}


Validate Job/Deployment

POST /v1/deployments/validate

Validates a job or deployment spec before submitting.


Studio Jobs (Serverless)

Studio Jobs are lighter-weight jobs that run within Studio infrastructure.

Base path: /v1/projects/{projectId}/studioapp/jobs

List Studio Jobs

GET /v1/projects/{projectId}/studioapp/jobs

Get Studio Job

GET /v1/projects/{projectId}/studioapp/jobs/{id}

Create Studio Job

POST /v1/projects/{projectId}/studioapp/jobs

Stop Studio Job

POST /v1/projects/{projectId}/studioapp/jobs/{id}/stop

Delete Studio Job

DELETE /v1/projects/{projectId}/studioapp/jobs/{id}


Python SDK Usage

The lightning-sdk provides high-level wrappers for Jobs:

from lightning_sdk import Studio, Machine

# Connect to an existing Studio
studio = Studio(
    name="my-studio",
    teamspace="my-team",
    org="my-org"  # or user="username"
)

# Run a simple job
job = studio.run_job(
    name="train-model",
    machine=Machine.A100,
    command="python train.py --epochs 100",
    env={"BATCH_SIZE": "64"},
    interruptible=False,
)

# Check the job's status
print(job.status)

# Run multi-machine training
mmt = studio.run_mmt(
    name="distributed-train",
    num_machines=4,
    machine=Machine.H100_X_8,
    command="torchrun --nproc_per_node=8 --nnodes=4 train.py",
)

Alternatively, use the Job class directly:

from lightning_sdk import Job, Machine, Studio

job = Job.run(
    name="my-job",
    machine=Machine.A100,
    command="python train.py",
    studio=Studio(name="my-studio", teamspace="my-team"),
)
print(f"Job status: {job.status}")

Schedules

Jobs can be run on a schedule using cron expressions.

Base path: /v1/projects/{projectId}/schedules

Create Schedule

POST /v1/projects/{projectId}/schedules

Request body:

{
  "name": "nightly-training",
  "cron_expression": "0 2 * * *",
  "cloudspace_id": "studio-id",
  "command": "python train.py"
}
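Cron expressions use the standard five fields (minute, hour, day-of-month, month, day-of-week), so `"0 2 * * *"` fires every day at minute 0 of hour 2. A sketch of a hypothetical helper that does a basic field-count check before submitting:

```python
def build_schedule_payload(name, cron_expression, studio_id, command):
    """Assemble the body for POST /v1/projects/{projectId}/schedules.

    Checks that the cron expression has the standard five fields:
    minute hour day-of-month month day-of-week.
    """
    if len(cron_expression.split()) != 5:
        raise ValueError(f"expected 5 cron fields, got: {cron_expression!r}")
    return {
        "name": name,
        "cron_expression": cron_expression,
        "cloudspace_id": studio_id,
        "command": command,
    }

# "0 2 * * *" = every day at 02:00
payload = build_schedule_payload(
    "nightly-training", "0 2 * * *", "studio-id", "python train.py"
)
```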

List Schedules

GET /v1/projects/{projectId}/schedules

Get Schedule

GET /v1/projects/{projectId}/schedules/{id}

Delete Schedule

DELETE /v1/projects/{projectId}/schedules/{id}

Update Schedule

PUT /v1/projects/{projectId}/schedules/{id}

List Schedule Runs

GET /v1/projects/{projectId}/cloudspaces/{cloudspaceId}/schedules/{scheduleId}/runs

Create Schedule Run (trigger manually)

POST /v1/projects/{projectId}/cloudspaces/{cloudspaceId}/schedules/{scheduleId}/runs