HPC Systems Administrator
Milwaukee, WI 
Posted 57 months ago
Position No Longer Available
Job Description
Milwaukee School of Engineering (MSOE) invites applications for a full-time HPC Systems Administrator to join our Electrical Engineering and Computer Science department. 

MSOE is a leading undergraduate and master’s level private, non-profit educational institution founded in 1903.  The institution serves approximately 3,000 undergraduate and graduate students in the areas of engineering, computing, mathematics, business, nursing and communications.  The mission of MSOE is to be the university of choice for those seeking an inclusive community of experiential learners driven to solve the complex challenges of today and tomorrow. The MSOE community is guided by six values – collaboration, excellence, inclusion, innovation, integrity and stewardship – that represent the core of our campus culture.

MSOE is in the heart of downtown Milwaukee, a burgeoning tech hub known for its numerous festivals, beautiful lakefront, and cultural activities.  MSOE offers competitive benefits including medical, dental, life, disability and vision insurance, and a retirement plan.  Educational benefits are available for employees and family members as well as an on-campus fitness center for all employees.

The EECS department offers an undergraduate degree in CS with a focus on artificial intelligence as well as an established software engineering undergraduate degree. Significant investments are being made to support MSOE’s academic mission, including the construction of the $34M Dwight and Dian Diercks Computational Science Hall that will house the SE and CS programs, as well as the HPC Systems Administrator.  More information about the programs and building is available at https://www.msoe.edu/CS/.

The HPC Systems Administrator will lead efforts related to the daily operation of a small to medium-sized GPU-based high-performance computing (HPC) cluster. The individual in this position will provide engineering and administration support for HPC hardware and software.  They will be responsible for day-to-day operations, including assisting faculty and students with troubleshooting HPC workloads, developing and delivering training, interacting and collaborating with businesses and industry, and planning for growth of the system. Successful candidates will identify and address the needs of faculty and students across a variety of academic and research workloads.

Essential Duties and Responsibilities
  • Day-to-day operations of the systems including systems administration and monitoring
  • Installing, testing, and maintaining software applications on the cluster
  • Configuration and monitoring of the cluster scheduling and queuing system
  • Documenting system administration procedures for routine and complex tasks
  • Developing and delivering training materials related to running and monitoring workloads for users
  • Creating, administering, archiving, and deleting user network accounts (~500), user groups and file systems
  • Clearly communicating and enforcing policies and procedures
  • Leveraging automation tools where valuable and appropriate
  • Providing reactive support and excellent customer service to all students, faculty, and staff
  • Working closely with the campus Information Technology department
Qualifications
  • Bachelor's degree in a related field with two years of professional experience.
  • Experience installing, configuring, and maintaining enterprise Linux systems.
  • Experience configuring and managing network attached storage systems and networks.
  • Experience managing and troubleshooting high performance network equipment.
  • Experience with application and systems programming languages such as Python, Java, or JavaScript.
  • Strong understanding of open source and commercial database systems including MySQL, Oracle, and Microsoft SQL Server.
  • Experience with identity management technology such as LDAP or Kerberos.
  • Demonstrated ability to manage the full stack (datacenter rack equipment, server hardware, OS, network, and security) of multi-tenant Linux-based systems both individually and within a team environment.
Preferred
  • Experience working in an academic environment.
  • Familiarity with container solutions such as Docker and Singularity. 
  • Experience with cluster management tools such as Slurm. 
  • Two years’ experience in providing support for a GPU-based cluster.
  • Familiarity with NVIDIA NGC containers and/or CUDA.
  • Ability to write technical documentation in a clear and concise manner.
  • Self-motivated and works independently and as part of a team. Demonstrates problem-solving skills. Able to learn effectively and meet deadlines.
  • Understanding of system performance monitoring and actions that can be taken to improve or correct performance.
  • Demonstrated skills associated with adapting equipment and technology to serve user needs. Demonstrated comprehensive understanding of how system management actions affect other systems, system users and dependent/related functions.
  • Demonstrated experience writing and editing complex scripts used to perform system maintenance and administration.
  • Applied experience with these or similar technologies: Amazon Web Services; .NET Framework; application load balancers; distributed version control and continuous integration/continuous deployment platforms; systems configuration management.
  • General knowledge of other areas of IT. Thorough understanding of and experience with systems-related issues and actions that can be taken to improve or correct performance.
  • Advanced knowledge of computer security best practices and policies including demonstrated experience securing server-based software.

Job Code: 222
 

 

Job Summary
Employment Term and Type
Regular, Full Time
Required Education
Bachelor's Degree
Required Experience
2+ years