Notebook

A Simple Script to Auto-Regulate Slurm Job Submissions to Avoid Resource Conflicts

2024-11-07

When running hundreds of computational tasks (for example, accumulating data for high-throughput screening), it's common to face resource conflicts in Slurm queues. This simple script manages job submissions by holding or releasing your pending jobs according to the time of day and the number of idle nodes.

Core Features

1. Night Mode (21:30-8:30)

  • Releases all pending jobs

  • Maximizes resource utilization during off-hours

2. Day Mode (8:30-21:30)

  • Only releases jobs when enough nodes are idle (≥3)

  • Keeps 2 nodes as buffer for interactive tasks

  • Automatically holds jobs when resources are limited
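
The two modes reduce to a small decision rule. The sketch below writes that rule as a pure function so it can be checked without a cluster. The name decide_action and its output words are made up for illustration; the thresholds (21:30, 8:30, 3 idle nodes, 2-node buffer) are the ones listed above.

```bash
#!/bin/bash
# Hypothetical helper (not part of job_control.sh): map a clock time and an
# idle-node count to the action described in the feature list.
# Usage: decide_action HH MM IDLE_NODES
decide_action() {
    # 10# forces base 10 so that "08" or "09" are not parsed as octal
    local now=$((10#$1 * 60 + 10#$2))
    local idle=$3
    local night_start=$((21 * 60 + 30))
    local night_end=$((8 * 60 + 30))

    if [ "$now" -ge "$night_start" ] || [ "$now" -le "$night_end" ]; then
        echo "release_all"                # night mode
    elif [ "$idle" -ge 3 ]; then
        echo "release $((idle - 2))"      # day mode, keep 2 nodes as buffer
    else
        echo "hold_all"                   # day mode, too few idle nodes
    fi
}

decide_action 23 15 0   # prints: release_all
decide_action 08 45 5   # prints: release 3
decide_action 14 00 2   # prints: hold_all
```

Keeping the rule in one function makes the boundary cases (exactly 21:30, exactly 8:30, exactly 3 idle nodes) easy to test by hand.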

Usage

  • Replace every your_jobs_partition in the script with the name of your jobs' partition.

  • Make the script executable: chmod +x /path/to/job_control.sh

  • Set up cron job:
    crontab -e

  • Then add the following line (*/5 in the minute field sets the check interval, here every 5 minutes):
    */5 * * * * /path/to/job_control.sh

  • Once configured, you can view the current crontab with:
    crontab -l

  • The script appends its log to run.log in the same directory as the script.
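
Cron usually runs jobs with a much shorter PATH than a login shell, so sinfo, squeue, and scontrol may not be found even when the script works by hand. A check along the following lines could be placed near the top of the script. This is only a sketch: missing_commands is a hypothetical helper, not something from the original script.

```bash
#!/bin/bash
# Hypothetical helper: print the names of commands that are not on PATH.
missing_commands() {
    local cmd missing=""
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
    done
    echo "${missing# }"   # strip the single leading space
}

# Warn on stderr if a Slurm command used by job_control.sh is unavailable.
MISSING=$(missing_commands sinfo squeue scontrol)
if [ -n "$MISSING" ]; then
    echo "Not found on PATH: $MISSING" >&2
fi
```

If something is reported missing under cron, one common remedy is to set PATH at the top of the script or in the crontab. The right directory depends on your cluster.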

job_control.sh

#!/bin/bash

# Get current user

USER=$(whoami)

# Set up logging

SCRIPT_DIR=$(dirname "$0")

LOG_FILE="$SCRIPT_DIR/run.log"

log_message() {

    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$LOG_FILE"

}

# Convert current time to minutes for easier comparison

# 10# forces base 10; without it, "08" and "09" are rejected as invalid octal

HOUR=$(date +%H)

MINUTE=$(date +%M)

CURRENT_TIME=$((10#$HOUR * 60 + 10#$MINUTE))

# Define time windows for night mode (21:30-8:30)

NIGHT_START=$((21 * 60 + 30))

NIGHT_END=$((8 * 60 + 30))

# Get number of idle nodes in the your_jobs_partition partition

# Sum the node counts over all output lines; prints 0 if no idle nodes are found

IDLE_NODES=$(sinfo -p your_jobs_partition -t idle -h -o "%D" | awk '{n += $1} END {print n + 0}')

IDLE_NODES=${IDLE_NODES:-0}

log_message "Script started - Current time: ${HOUR}:${MINUTE}, Idle nodes: ${IDLE_NODES}"

# Night mode (21:30-8:30): Release all pending jobs

# This maximizes resource utilization during off-peak hours

if [ "$CURRENT_TIME" -ge "$NIGHT_START" ] || [ "$CURRENT_TIME" -le "$NIGHT_END" ]; then

    log_message "Night mode activated - Releasing all pending jobs"

    PENDING_JOBS=$(squeue -u "$USER" -p your_jobs_partition -t PD -h | wc -l)

    squeue -u "$USER" -p your_jobs_partition -t PD -h -o "%i" | xargs -I {} scontrol release {} 2>/dev/null

    log_message "Released ${PENDING_JOBS} pending jobs in night mode"

else

    # Day mode: release jobs only while enough nodes are idle

    # Keep at least 2 nodes free for urgent tasks

    if [ "$IDLE_NODES" -ge 3 ]; then

        JOBS_TO_RELEASE=$((IDLE_NODES - 2))

        log_message "Day mode - ${IDLE_NODES} idle nodes available, releasing up to ${JOBS_TO_RELEASE} jobs"

        # Pick the lowest job IDs (the earliest submissions) first

        RELEASED_JOBS=$(squeue -u "$USER" -p your_jobs_partition -t PD -h -o "%i" | sort -n | head -n "$JOBS_TO_RELEASE")

        if [ -n "$RELEASED_JOBS" ]; then

            echo "$RELEASED_JOBS" | xargs -I {} scontrol release {} 2>/dev/null

            # Join the IDs with spaces so the log entry stays on one line

            log_message "Released jobs: $(printf '%s' "$RELEASED_JOBS" | tr '\n' ' ')"

        else

            log_message "No pending jobs to release"

        fi

    else

        # Hold all pending jobs when resources are limited

        log_message "Day mode - Not enough idle nodes (${IDLE_NODES} available, need at least 3)"

        HELD_JOBS=$(squeue -u "$USER" -p your_jobs_partition -t PD -h | wc -l)

        squeue -u "$USER" -p your_jobs_partition -t PD -h -o "%i" | xargs -I {} scontrol hold {} 2>/dev/null

        log_message "Held ${HELD_JOBS} pending jobs"

    fi

fi

log_message "Script finished"
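
In day mode the script takes the lowest-numbered pending jobs whether or not they are currently held, so some of the selected jobs may already be released and simply waiting. A possible refinement, sketched here, is to pick only jobs whose pending reason is JobHeldUser, the reason squeue reports for a user hold. pick_held_jobs is a hypothetical helper, and the sample lines fed to it are made-up stand-ins for the output of squeue -h -t PD -o "%i %r".

```bash
#!/bin/bash
# Hypothetical helper: read "JOBID REASON" lines on stdin and print the
# N lowest job IDs whose reason is JobHeldUser.
# Usage: squeue ... -h -t PD -o "%i %r" | pick_held_jobs N
pick_held_jobs() {
    awk '$2 == "JobHeldUser" {print $1}' | sort -n | head -n "$1"
}

# Made-up sample input for illustration (not real squeue output)
printf '%s\n' \
    "1003 JobHeldUser" \
    "1001 Priority" \
    "1002 JobHeldUser" \
    "1004 JobHeldUser" | pick_held_jobs 2
# prints 1002 and 1003, one per line
```

In the script, this would replace the sort -n | head -n stage of the RELEASED_JOBS pipeline, with the squeue output format changed from "%i" to "%i %r".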