[Resolved] SLURM Socket Timeout Errors

Posted by burgerdm on Monday, March 26, 2018 in Cluster Status Notice.

Updated 4/3, 2:44pm: We will continue to monitor SLURM responsiveness closely, but for now it is very good. Please submit a ticket if you encounter any further problems. Thanks!

Updated 3/28, 1:35pm: We are continuing to work on SLURM responsiveness, though it has improved since Sunday. We are in communication with the developers of SLURM and hope to resolve the issue soon. Thanks for your patience!

Updated 3/26, 6:12pm: SLURM has overall been more responsive today. We have identified a few potentially problematic workflows and are working with those users/groups to make appropriate changes. As a reminder:

– Please avoid large groups (>300) of jobs that do not use job arrays.

– Please avoid large groups of jobs that each run for less than 30 minutes. Bundle multiple short-running jobs into a single job: we strongly prefer one 6-hour job to 360 one-minute jobs. Depending on how busy the cluster is and what resources you request, the bundled job may actually finish sooner because it spends less time waiting in the queue. (A sketch of both patterns appears after this update.)

– If your group uses an automated pipeline for submitting and monitoring jobs (we are aware of roughly 6-8 groups doing this, but there are probably more), please be prudent about how often you request information from SLURM with commands like squeue and scontrol.

SLURM sluggishness is generally the cumulative effect of multiple users not following the best practices outlined above. If you are unsure, please reach out to us, and please let us know if you have recommendations or suggestions. As I mentioned last night, we are in the process of moving SLURM to solid state drives, and we have a few other ideas we are considering internally. Roughly 900 unique researchers across VU and VUMC make use of ACCRE resources; please be responsible and mindful that this is a shared resource and that your decisions may impact other researchers.
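To make the first two reminders concrete, here is a minimal sketch of the job-array pattern. It assumes a hypothetical process_sample program and input_N.dat files; the job name, time, and memory values are placeholders rather than ACCRE recommendations, so adjust everything to your own workflow.

    #!/bin/bash
    # Pattern 1: submit 360 related tasks as ONE array job instead of 360
    # separate jobs, so the scheduler tracks a single job record.
    #SBATCH --job-name=array_example      # placeholder name
    #SBATCH --array=1-360                 # one task per input file
    #SBATCH --time=00:30:00               # per-task walltime (placeholder)
    #SBATCH --mem=2G                      # per-task memory (placeholder)
    #SBATCH --output=logs/task_%A_%a.out  # %A = array job ID, %a = task index

    # Each array task selects its own input via SLURM_ARRAY_TASK_ID.
    ./process_sample input_${SLURM_ARRAY_TASK_ID}.dat

Alternatively, still using the same hypothetical process_sample program, many very short tasks can be bundled into a single allocation:

    #!/bin/bash
    # Pattern 2: one 6-hour job instead of 360 one-minute jobs.
    #SBATCH --job-name=bundle_example     # placeholder name
    #SBATCH --time=06:30:00               # total walltime with a small buffer
    #SBATCH --mem=2G                      # placeholder memory request

    for i in $(seq 1 360); do
        ./process_sample input_${i}.dat   # tasks run back-to-back in one job
    done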
Original post: For about the past week our job scheduler, SLURM, has been sluggish, especially over the weekend. SLURM commands (e.g. squeue or sbatch) may time out with “socket timeout” errors or be very slow to complete. We are very sorry for the inconvenience and frustration this has caused. This problem generally occurs when the job scheduler is overloaded by large batches of non-array jobs or by extremely short jobs (see: https://www.vanderbilt.edu/accre/support/faq/#a-slurm-command-fails-with-a-socket-timeout-message-whats-the-problem); however, as far as we can tell, neither of these factors appears to be playing a role in this case. We have made a few changes tonight in an attempt to improve SLURM responsiveness. If things have not improved by the morning, we will open a ticket with the SLURM developers for input and advice. If you are actively running jobs, please review the link above carefully and make changes to your workflow if necessary. We are available to assist. Note that jobs that are already running should not be impacted by these problems (unless you are invoking other SLURM commands like squeue or srun from within your SLURM job). We plan to move SLURM to dedicated solid state drives in the very near future, which we expect to improve responsiveness and greatly reduce the occurrence of socket timeout errors.
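As a follow-up to the automated-pipeline reminder above, here is one low-impact way for a pipeline to wait on a submitted job: poll the scheduler at a generous interval rather than in a tight loop. This is only a sketch, not an ACCRE-provided tool; run_analysis.slurm is a hypothetical script and the 5-minute interval is an arbitrary placeholder.

    #!/bin/bash
    # Submit a job and capture its ID (--parsable prints just the job ID).
    jobid=$(sbatch --parsable run_analysis.slurm)

    # One squeue call every 5 minutes per pipeline keeps the load on the
    # scheduler negligible; polling every few seconds does not.
    while squeue --noheader -j "$jobid" | grep -q .; do
        sleep 300
    done
    echo "Job $jobid has left the queue."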