From a cyber security perspective the current HPE Cray EX systems marked a pretty noticeable departure, in terms of system architecture, from what we have seen previously – not just compared to previous Cray models, but also when compared to models from other HPC vendors. At HPCsec we specialise in securing HPC systems, so let’s delve briefly into what these changes mean from a security perspective:
As we’ve already said, a lot has changed from the previous generation of systems, an awful lot. These changes were no doubt primarily for performance reasons, but our interest here is from a security perspective and on this front the vast majority of the changes we’re interested in revolve around system management, administration and architecture.
We still see some Cray-specific technologies like DVS, and obviously the Slingshot network interconnects are there, but there is little else software-wise specifically developed by Cray. ALPS/aprun are gone in favour of Slurm or PBS Professional, and some other expected HPC technologies exist, like Lustre for the ClusterStor storage. But the vast majority of what is new here is a plethora of very capable technologies that have mostly grown up in other worlds (typically cloud) and have been adopted here for HPC purposes. We see the introduction of containerisation delivered via Kubernetes, which is a pretty key change, but it doesn’t stop there; we see Ceph, Elasticsearch, Kibana, Istio, Redfish, Keycloak, Ansible, Prometheus, Grafana… the list goes on. Each of these technologies plays a part in the operation of these systems in a way we’ve not seen before, and Cray have obviously spent a lot of time configuring everything to work together in what is a very complex but powerful system.
With so many new technologies introduced, the potential attack surface becomes huge. However, unlike a lot of the technology we traditionally see on HPC systems, these technologies are not exclusive to HPC, so they’ve already had a level of security scrutiny applied to them. It gives an interesting dynamic – a larger potential attack surface, but one that is at least a little battle-hardened this time around.
Cray have implemented an API through which most of the backend administrative services are accessed, which gives a great opportunity to reduce this attack surface somewhat. It is an opportunity to keep the innards of the system management tooling away from the prying eyes of the users – assuming the segmentation is well implemented anyway. Kubernetes plays a big role in this. Stick an Istio service mesh on top of that (as Cray have done) and there is a good opportunity to take really granular control over what is going on in that Kubernetes cluster and to stop security events that shouldn’t happen from happening.
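To make the idea of granular control a little more concrete, here is a minimal Python sketch of the default-deny, allow-list decision that a service mesh authorisation layer makes for each request. It is not Istio or Cray code, and every service name and rule in it is hypothetical; it only illustrates the principle that anything not explicitly permitted is refused.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source: str          # identity of the calling workload (hypothetical name)
    destination: str     # service being called (hypothetical name)
    methods: frozenset   # HTTP methods this caller may use


# Hypothetical allow-list: only the API gateway may reach backend services.
ALLOW_RULES = (
    Rule("api-gateway", "boot-service", frozenset({"GET", "POST"})),
    Rule("api-gateway", "inventory-service", frozenset({"GET"})),
)


def is_allowed(source, destination, method, rules=ALLOW_RULES):
    """Default deny: a request passes only if an explicit rule matches it."""
    return any(
        rule.source == source
        and rule.destination == destination
        and method in rule.methods
        for rule in rules
    )


if __name__ == "__main__":
    print(is_allowed("api-gateway", "boot-service", "POST"))
    print(is_allowed("user-pod", "boot-service", "GET"))
```

The point of the design is that a request from an unexpected workload, or using an unexpected method, fails without anyone having had to anticipate it in advance.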
Nevertheless, an EX system is unquestionably a complex beast with many components. A lot could be achieved with well thought out and structured segmentation, which is something we rarely see in HPC systems, and its absence has been the downfall of most that we’ve looked at historically. However, segmentation is the opposite of where Cray are hoping to go with things, and their Customer Access Network (CAN) approach highlights this – it’s designed to open up the innards of the cluster to the wider customer network. This is risky: many HPC technologies, not least workload managers and parallel file systems, rely on the users interacting with them not being privileged – something that is much harder to control when suddenly you plug a whole new network of systems into the cluster.
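As a rough illustration of what structured segmentation means in practice, the Python sketch below models network zones and an explicit list of permitted zone-to-zone flows. The zone names and address ranges are assumptions made up for the example, not the addressing an EX system actually uses.

```python
import ipaddress

# Hypothetical address plan; real deployments will differ.
ZONES = {
    "customer": ipaddress.ip_network("10.10.0.0/16"),
    "login": ipaddress.ip_network("10.20.0.0/24"),
    "management": ipaddress.ip_network("10.30.0.0/24"),
}

# Zone pairs allowed to talk to each other; anything not listed is denied.
ALLOWED_FLOWS = {("customer", "login")}


def zone_of(address):
    """Return the name of the zone containing the address, or None."""
    ip = ipaddress.ip_address(address)
    for name, network in ZONES.items():
        if ip in network:
            return name
    return None


def flow_permitted(src, dst):
    """Permit traffic within a zone or along an explicitly allowed flow."""
    src_zone, dst_zone = zone_of(src), zone_of(dst)
    if src_zone is None or dst_zone is None:
        return False
    return src_zone == dst_zone or (src_zone, dst_zone) in ALLOWED_FLOWS


if __name__ == "__main__":
    print(flow_permitted("10.10.4.7", "10.20.0.5"))   # customer -> login
    print(flow_permitted("10.10.4.7", "10.30.0.9"))   # customer -> management
```

In this model a host on the customer network can reach a login node but not the management zone. Opening the cluster up to a wider network is the equivalent of adding entries to `ALLOWED_FLOWS`, and each new entry is a trust decision.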
It is sadly true to say that historically the security of most supercomputers and HPC clusters was very much a consequence of the system being relatively standalone: a black box not subject to any form of security scrutiny (by vendor or customer), with access restricted to a number of trusted users who had non-privileged access, rather than a consequence of the system being inherently secure. This was never a good or recommended security model and it’s obviously reassuring to see that Cray are not walking this path with their EX systems.
So is it more secure? I don’t think there is a straightforward yes or no answer to this. It’s definitely different. Its complexity certainly adds a different element of security through obscurity to it. The newly introduced technologies are much more widely used outside of the HPC world and so have been subject to a much greater level of scrutiny than we see for HPC-specific technology – with many of the vendors actively publishing security advisories with trackable CVEs, good evidence of an active approach to security. That said, we already know that security patches are not a timely thing in the HPC world – it can often be months before patches for critical kernel vulnerabilities are shipped by HPC vendors (Cray included) – and there is a lot of technology here to be kept up to date with security patches.
Do I think Cray have done enough? Probably not; the security piece still feels like an afterthought. The good security elements feel more like a consequence of the technologies used than the result of a plan to ensure a secure system. With a little extra work it could have been slick, but there is a lot of potential and it is of course a developing platform, so perhaps we’ll see this change over time. Cray have certainly picked the right path; perhaps they’re just a little early on the journey.