Milwaukee School of Engineering (MSOE) invites applications for a full-time HPC Systems Administrator to join our Electrical Engineering and Computer Science department.
MSOE is a leading undergraduate and master’s level private, non-profit educational institution founded in 1903. The institution serves approximately 3,000 undergraduate and graduate students in the areas of engineering, computing, mathematics, business, nursing and communications. The mission of MSOE is to be the university of choice for those seeking an inclusive community of experiential learners driven to solve the complex challenges of today and tomorrow. The MSOE community is guided by six values – collaboration, excellence, inclusion, innovation, integrity and stewardship – that represent the core of our campus culture.
MSOE is in the heart of downtown Milwaukee, a burgeoning tech hub known for its numerous festivals, beautiful lakefront, and cultural activities. MSOE offers competitive benefits including medical, dental, life, disability and vision insurance, and a retirement plan. Educational benefits are available for employees and family members as well as an on-campus fitness center for all employees.
The EECS department offers an undergraduate degree in CS with a focus on artificial intelligence as well as an established software engineering undergraduate degree. Significant investments are being made to support MSOE’s academic mission, including the construction of the $34M Dwight and Dian Diercks Computational Science Hall that will house the SE and CS programs, as well as the HPC Systems Administrator. More information about the programs and building is available at https://www.msoe.edu/CS/.
The HPC Systems Administrator will lead efforts related to the daily operation of a small to medium-sized GPU-based high-performance computing (HPC) cluster. The individual in this position will provide engineering and administration support for HPC hardware and software. They will be responsible for day-to-day operations, including assisting faculty and students with troubleshooting HPC workloads, developing and delivering training, interacting and collaborating with businesses and industry, and planning for growth of the system. Successful candidates will identify and engage with the needs of faculty and students in order to support a variety of academic and research workloads.
Essential Duties and Responsibilities
Day-to-day operations of the systems including systems administration and monitoring
Installing, testing, and maintaining software applications on the cluster
Configuration and monitoring of the cluster scheduling and queuing system
Documenting system administration procedures for routine and complex tasks
Developing and delivering training materials for users on running and monitoring workloads
Creating, administering, archiving, and deleting user network accounts (~500), user groups, and file systems
Clearly communicating and enforcing policies and procedures
Leveraging automation tools where valuable and appropriate
Providing reactive support and excellent customer service to all students, faculty, and staff
Working closely with the campus Information Technology department
Bachelor's degree in a related field with two years of professional experience.
Experience installing, configuring, and maintaining enterprise Linux systems.
Experience configuring and managing network-attached storage systems and networks.
Experience managing and troubleshooting high-performance network equipment.
Strong understanding of open source and commercial database systems including MySQL, Oracle, and Microsoft SQL Server.
Experience with identity management technology such as LDAP, Kerberos, etc.
Demonstrated ability to manage the full stack (datacenter rack equipment, server hardware, OS, network, and security) of multi-tenant Linux-based systems both individually and within a team environment.
Experience working in an academic environment.
Familiarity with container solutions such as Docker and Singularity.
Experience with cluster management tools such as Slurm.
Two years’ experience in providing support for a GPU-based cluster.
Familiarity with NVIDIA NGC containers and/or CUDA.
Ability to write technical documentation in a clear and concise manner.
Self-motivated and able to work both independently and as part of a team. Demonstrates problem-solving skills. Able to learn effectively and meet deadlines.
Understanding of system performance monitoring and actions that can be taken to improve or correct performance.
Demonstrated skills associated with adapting equipment and technology to serve user needs. Demonstrated comprehensive understanding of how system management actions affect other systems, system users and dependent/related functions.
Demonstrated experience writing and editing complex scripts used to perform system maintenance and administration.
Applied experience with these or similar technologies: Amazon Web Services; .NET Framework; application load balancers; distributed version control and continuous integration/continuous deployment platforms; systems configuration management.
General knowledge of other areas of IT. Thorough understanding of and experience with systems-related issues and actions that can be taken to improve or correct performance.
Advanced knowledge of computer security best practices and policies including demonstrated experience securing server-based software.
Job Code: 222
It is the policy of MSOE to provide equal employment opportunity to all individuals regardless of their race, ethnicity, color, creed, religion, sex, age, national origin, physical or mental disability, military and veteran status, sexual orientation, gender identity, genetic characteristics, marital status or any other characteristic protected by local, state or federal law. This policy applies to all jobs at the University and to all the terms, benefits, and conditions of employment/enrollment.