A Simple Script to Auto-Regulate Slurm Job Submissions to Avoid Resource Conflicts
2024-11-07
When running hundreds of computational tasks (for example, accumulating data for high-throughput screening), it's common to run into resource conflicts in Slurm queues. This simple script manages job submissions by automatically holding or releasing pending jobs based on the time of day and the number of idle nodes.
Core Features
1. Night Mode (21:30-8:30)
Releases all pending jobs
Maximizes resource utilization during off-hours
2. Day Mode (8:30-21:30)
Only releases jobs when enough nodes are idle (≥3)
Keeps 2 nodes as buffer for interactive tasks
Automatically holds jobs when resources are limited
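The two modes above boil down to one decision per run. The following is a minimal sketch of that decision only, with no Slurm calls; the function name decide_action and its output strings are hypothetical and are not part of job_control.sh.

```shell
#!/bin/bash
# decide_action MINUTES_SINCE_MIDNIGHT IDLE_NODES
# Hypothetical helper: prints "release_all", "release N", or "hold_all".
decide_action() {
    local now="$1" idle="$2"
    local night_start=$((21 * 60 + 30))   # 21:30
    local night_end=$((8 * 60 + 30))      # 08:30
    local buffer=2                        # nodes kept free in day mode

    if [ "$now" -ge "$night_start" ] || [ "$now" -le "$night_end" ]; then
        echo "release_all"                 # night mode
    elif [ "$idle" -gt "$buffer" ]; then
        echo "release $((idle - buffer))"  # day mode, at least 3 idle nodes
    else
        echo "hold_all"                    # day mode, resources limited
    fi
}

decide_action $((23 * 60)) 0   # prints: release_all
decide_action $((10 * 60)) 5   # prints: release 3
decide_action $((10 * 60)) 2   # prints: hold_all
```

Because the night window wraps past midnight, the check is an OR of the two comparisons rather than a single range test.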
Usage
Change your_jobs_partition in the script to your jobs' partition.
Run chmod +x on the script to make it executable.
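The placeholder appears several times in the script, so replacing it in one pass may be less error-prone than editing by hand. This is a minimal sketch: set_partition is a hypothetical helper, and it assumes the partition name contains no characters that are special to sed (such as / or &).

```shell
#!/bin/bash
# set_partition SCRIPT PARTITION
# Hypothetical helper: replaces every occurrence of the placeholder
# your_jobs_partition in SCRIPT with PARTITION, keeping file permissions.
set_partition() {
    local script="$1" partition="$2" tmp
    tmp=$(mktemp) || return 1
    # Write to a temporary file first, then copy the contents back,
    # so the original file keeps its mode bits (including +x).
    sed "s/your_jobs_partition/${partition}/g" "$script" > "$tmp" \
        && cat "$tmp" > "$script"
    local status=$?
    rm -f "$tmp"
    return "$status"
}

# Example (the partition name "compute" is an assumption):
#   set_partition job_control.sh compute && chmod +x job_control.sh
```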
Set up cron job:
crontab -e
Then add the following as a new line (*/5 sets the checking interval, here every 5 minutes):
*/5 * * * * /path/to/job_control.sh
After the configuration, you can view the current crontab through the following command:
crontab -l
The script writes its log to run.log in the same directory as the script itself.
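Cron usually runs jobs with a much shorter PATH than an interactive login shell, so it may be worth confirming that the Slurm commands the script calls can be found. A minimal sketch follows; check_commands is a hypothetical helper, and whether sinfo, squeue, and scontrol are on cron's PATH depends on how Slurm is installed on your cluster.

```shell
#!/bin/bash
# check_commands CMD...
# Hypothetical helper: reports whether each command is found on PATH.
# Returns the number of missing commands (0 means all were found).
check_commands() {
    local cmd missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found:   $cmd"
        else
            echo "missing: $cmd"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# The three Slurm commands used by job_control.sh
check_commands sinfo squeue scontrol \
    || echo "Some Slurm commands were not found on PATH"
```

If a command is reported missing when this runs from cron, one option is to set PATH at the top of the crontab or of the script.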
job_control.sh
#!/bin/bash
# Get current user
USER=$(whoami)
# Set up logging
SCRIPT_DIR=$(dirname "$0")
LOG_FILE="$SCRIPT_DIR/run.log"
log_message() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$LOG_FILE"
}
# Convert current time to minutes for easier comparison
# (10# forces base 10 so zero-padded values such as 08 and 09 are not parsed as octal)
HOUR=$(date +%H)
MINUTE=$(date +%M)
CURRENT_TIME=$((10#$HOUR * 60 + 10#$MINUTE))
# Define time windows for night mode (21:30-8:30)
NIGHT_START=$((21 * 60 + 30))
NIGHT_END=$((8 * 60 + 30))
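# Get the number of idle nodes in the your_jobs_partition partition
# (%D prints the node count; awk sums it in case sinfo returns several lines)
# If no idle nodes are found, the result is 0
IDLE_NODES=$(sinfo -p your_jobs_partition -t idle -h -o "%D" | awk '{n += $1} END {print n + 0}')
IDLE_NODES=${IDLE_NODES:-0}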
log_message "Script started - Current time: ${HOUR}:${MINUTE}, Idle nodes: ${IDLE_NODES}"
# Night mode (21:30-8:30): Release all pending jobs
# This maximizes resource utilization during off-peak hours
if [ "$CURRENT_TIME" -ge "$NIGHT_START" ] || [ "$CURRENT_TIME" -le "$NIGHT_END" ]; then
    log_message "Night mode activated - Releasing all pending jobs"
    PENDING_JOBS=$(squeue -u "$USER" -p your_jobs_partition -t PD -h | wc -l)
    squeue -u "$USER" -p your_jobs_partition -t PD -h -o "%i" | xargs -I {} scontrol release {} 2>/dev/null
    log_message "Released ${PENDING_JOBS} pending jobs in night mode"
else
    # Day mode: Smart job release based on available nodes
    # Keep at least 2 nodes free for urgent tasks
    if [ "$IDLE_NODES" -ge 3 ]; then
        JOBS_TO_RELEASE=$((IDLE_NODES - 2))
        log_message "Day mode - ${IDLE_NODES} idle nodes available, releasing ${JOBS_TO_RELEASE} jobs"
        RELEASED_JOBS=$(squeue -u "$USER" -p your_jobs_partition -t PD -h -o "%i" | sort -n | head -n "$JOBS_TO_RELEASE")
        if [ -n "$RELEASED_JOBS" ]; then
            echo "$RELEASED_JOBS" | xargs -I {} scontrol release {} 2>/dev/null
            log_message "Released jobs: ${RELEASED_JOBS}"
        else
            log_message "No pending jobs to release"
        fi
    else
        # Hold all pending jobs when resources are limited
        log_message "Day mode - Not enough idle nodes (${IDLE_NODES} available, need at least 3)"
        HELD_JOBS=$(squeue -u "$USER" -p your_jobs_partition -t PD -h | wc -l)
        squeue -u "$USER" -p your_jobs_partition -t PD -h -o "%i" | xargs -I {} scontrol hold {} 2>/dev/null
        log_message "Held ${HELD_JOBS} pending jobs"
    fi
fi
log_message "Script finished"
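One detail of the time conversion deserves care: date +%H and date +%M return zero-padded values, and bash arithmetic treats a leading zero as octal, so 08 and 09 are rejected as invalid numbers. A minimal sketch of a conversion that forces base 10 follows; the function name to_minutes is hypothetical and does not appear in the script.

```shell
#!/bin/bash
# to_minutes HOUR MINUTE
# Hypothetical helper: converts a zero-padded HH and MM to minutes since
# midnight. The 10# prefix tells bash to read each value as base 10.
to_minutes() {
    local hour="$1" minute="$2"
    echo $((10#$hour * 60 + 10#$minute))
}

to_minutes 08 30   # prints: 510
to_minutes 21 09   # prints: 1269
```

Without the 10# prefix, an expression such as $((08 * 60)) fails with a "value too great for base" error, which would affect any run between 08:00 and 09:59 or at minutes 08 and 09 of any hour.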