Do Traditional Cyber Security Rules Apply to HPC Environments?

Good security is about dissecting something to understand how it works, then looking for ways to manipulate it into doing things it was never intended to do.

As a good example of this, let’s look at how authentication works in TORQUE. It is a multi-stage process that uses the trqauthd daemon:

  1. The client (the user submitting a job) establishes a connection from a privileged port (facilitated by a setUID root UNIX socket).
  2. The client then establishes a connection from a non-privileged port.
  3. Within the privileged connection, the user authorizes the non-privileged connection to undertake operations.
  4. Trqauthd validates that the userID passed in the privileged connection and the userID passed in the non-privileged connection match; if so, the connection is authorized.
  5. From this point onward, the user is able to submit jobs to pbs_server for execution.

[Figure: sequence diagram showing how authentication occurs with trqauthd in TORQUE]
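
The comparison in step 4 can be sketched in a few lines of Python. This is an illustrative model only, not TORQUE's code: the `Connection` record, its fields and the `authorize` function are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    """Hypothetical view of one client connection as the daemon sees it."""
    source_port: int
    user_id: str  # the userID passed by the client on this connection


def authorize(privileged: Connection, unprivileged: Connection) -> bool:
    """Model of step 4: authorize only when both connections name the same user."""
    return privileged.user_id == unprivileged.user_id


priv = Connection(source_port=1001, user_id="alice")
print(authorize(priv, Connection(source_port=40000, user_id="alice")))    # True
print(authorize(priv, Connection(source_port=40000, user_id="mallory")))  # False
```

The match is only as trustworthy as the userID carried on the privileged connection, which is why everything hinges on that connection being one an ordinary user cannot forge.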

The reason for using a setUID socket is that it gives the operating system a means of positively identifying the user in a way that a non-root user should not be able to manipulate. (Being able to manipulate that identity is a flaw that HPCsec has found to affect other job management software, such as Alps (Cray), LSF (IBM) and Moab (Adaptive).)

Assuming a secure implementation, the trqauthd design within TORQUE is solid in principle. But look at it through the eyes of an attacker and the chink in the armour is the reliance on privileged ports. Non-root users are not able to establish connections from ports below 1024. As attackers, however, we’re focused on manipulating this, so with our attacker hat on, here are a few ways to defeat it:

Use a system that you have root access to in order to establish the privileged connection – this obviously relies on you being in a position to communicate with pbs_server. Unfortunately, when I last checked, there were an uncomfortable number of instances exposed to the internet.

Gain root access to a system that can talk to pbs_server – if you’re not in a position to communicate from a system in your control, there are often a lot of devices that can do the job for you. Supercomputers are awash with embedded devices for things like cooling and, if accessible, they’re rarely secured; you’d be surprised at how far a set of default credentials will get you. Root on an embedded BusyBox-controlled cooling device is one simple hop away from root across an entire cluster!

TORQUE has a “--disable-privports” option which is often used – if this is the case, you don’t even need to worry about connecting from a privileged port: the privileged connection can come from any port, so it can be initiated by any user.
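
The three approaches above all target the same check, which can be modelled as follows. This is a minimal sketch under stated assumptions, not TORQUE's implementation: `trusts_peer`, `is_privileged_port` and the `privports_enabled` flag are hypothetical names, with the flag standing in for a build where the privileged port requirement has been disabled.

```python
PRIVILEGED_PORT_LIMIT = 1024  # on most UNIX systems, binding below this needs root


def is_privileged_port(port: int) -> bool:
    """True for source ports that normally only root can bind."""
    return 0 < port < PRIVILEGED_PORT_LIMIT


def trusts_peer(peer_host: str, peer_port: int, privports_enabled: bool = True) -> bool:
    """Hypothetical server-side trust decision based only on the source port."""
    # peer_host is deliberately unused: a source port says nothing about
    # which machine the peer is root on.
    if not privports_enabled:
        return True  # models the check being switched off entirely
    return is_privileged_port(peer_port)


print(trusts_peer("login-node", 40000))                           # False
print(trusts_peer("login-node", 1001))                            # True
print(trusts_peer("attacker-owned-host", 1001))                   # True
print(trusts_peer("login-node", 40000, privports_enabled=False))  # True
```

The third and fourth calls are the point: root on any reachable host satisfies the check, and with the check disabled no privilege is needed at all.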

Bypass pbs_server altogether and execute commands directly on a compute node (pbs_mom) – unfortunately, earlier versions of TORQUE’s pbs_mom did not validate the source of the jobs pushed to compute nodes. Job execution on a compute node looks like this:

[Figure: sequence diagram for pbs_mom authentication in TORQUE]

However, pbs_mom validated neither that the request to run a job came from a privileged port nor that it came from pbs_server. Consequently, any user could submit jobs directly to a pbs_mom and request that they run as root.
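
The two checks described as missing can be sketched as below. This is an assumption-laden illustration rather than the actual fix: `accept_job_request` and `TRUSTED_SERVERS` are hypothetical names, and the addresses come from a range reserved for documentation.

```python
PRIVILEGED_PORT_LIMIT = 1024  # ports below this normally require root to bind

# Hypothetical allow-list: the address of pbs_server as configured on the node.
# 192.0.2.10 is from a documentation-only address range.
TRUSTED_SERVERS = frozenset({"192.0.2.10"})


def accept_job_request(peer_ip: str, peer_port: int,
                       trusted_servers: frozenset = TRUSTED_SERVERS) -> bool:
    """Accept a job only if it comes from a trusted server and a privileged port."""
    from_server = peer_ip in trusted_servers
    from_privileged_port = 0 < peer_port < PRIVILEGED_PORT_LIMIT
    return from_server and from_privileged_port


print(accept_job_request("192.0.2.10", 1001))   # True: pbs_server, privileged port
print(accept_job_request("192.0.2.55", 1001))   # False: not pbs_server
print(accept_job_request("192.0.2.10", 40000))  # False: unprivileged source port
```

With neither condition enforced, every one of these requests would have been accepted, including one asking for the job to run as root.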

As you can see, flaws in authentication mechanisms such as trqauthd are a significant issue that can easily lead to the full compromise of a supercomputer. However, this is not an issue that anyone who has not dissected TORQUE will have a working knowledge of. Were this type of flaw present in Windows Active Directory, it would almost certainly be widely documented and shared amongst the security community.

And therein lies the issue: HPC security is a niche area that most security people will never get an opportunity to work in, and this is reflected in the tools that exist. Most security work starts with a sweep for known vulnerabilities, but most tools are not aware of HPC technologies (in fact I know of no tools that are), so they’re not going to identify anything… Or rather they will, but not the type of things you want them to. You can be almost certain that you’ll be informed that you have passwordless key-based SSH and that this needs rectifying immediately!

Most security practitioners have never worked with HPC technologies, so even those who are extremely skilled will need to spend a lot of time investigating services and technologies and extracting as much knowledge as they can handle from your own HPC experts. We hope that some of the content on the HPCsec website can help bridge that knowledge gap, but the foundation is always the most difficult piece to get in place.

This article is probably a very long-winded way of saying that yes, traditional approaches do apply, but they have been built up over years of knowledge acquisition and sharing, and that sharing is not well practiced in the HPC world. If you’re willing to support some capable security practitioners by imparting HPC knowledge as they impart their security expertise, it’ll go some way to bridging that gap. Not only that, but because the knowledge sharing goes both ways, your HPC experts get a crash course in security and become HPC security experts themselves.

The TORQUE vulnerabilities which this article uses as examples appear as advisories within our HPCsec advisory section.