The story behind Alps, Europe's second-largest computer
www.computerweekly.com
The director of CSCS and professor of computational physics at ETH, Thomas Schulthess, explains the development of one of the world's fastest supercomputers, Alps, in Switzerland.

By Pat Brans, Pat Brans Associates/Grenoble Ecole de Management
Published: 10 Feb 2025

The Swiss National Supercomputing Centre, also known as CSCS, built and deployed a new supercomputer in collaboration with Nvidia and HPE. The machine, called Alps, came online at the end of 2024 and is already listed as the world's seventh most powerful supercomputer and Europe's second most powerful. Computer Weekly sat down with Thomas Schulthess, director of CSCS and professor of computational physics at ETH [Eidgenössische Technische Hochschule, or Federal Institute of Technology] Zurich, to find out more.

What is the history of Alps, and what architectural decisions did you make along the way?

Thomas Schulthess: I'll start by explaining the difference between CSCS and Alps. CSCS is a centre with people. The main facility is in Lugano, near the football stadium and the ice hockey stadium. It was founded in 1991, long before I arrived, and it's where we deploy and operate supercomputers, the biggest of which is Alps, which came online in 2024. Before Alps, we had already deployed many other supercomputers.

For example, we had Piz Daint, a hybrid Cray XC40/XC50 machine, which was the first GPU-based supercomputer in Europe. We deployed it around 2012 to 2013, which was around the time of Jaguar at Oak Ridge National Laboratory in the US.

One of the things that makes us special is that we design, build and operate supercomputers for MeteoSwiss, the Swiss meteorological service. Normally, weather services run their own computers, but in our case, we do it for them.
As a result, we have had a strong collaboration with MeteoSwiss for decades.

Alps is an effort to bring different computers onto one platform. It was motivated by a peer review of the centre in 2015, where we got a very strong message: we had done a great job deploying Piz Daint, but now we had to face the challenges of data and complex workflows in scientific computing.

That's when we started to look at options for how to evolve supercomputing. What came out of it was a collaboration with what was then Cray, now HPE, which acquired Cray in 2019. At the time, Cray was pushing its system in the direction of a micro-service architecture, which is sometimes called a cloud-native architecture. For us, this was a really good development, but it turned out to be very difficult, much more difficult than anybody predicted.

But we decided to go this way around 2018 to 2019. We ran the procurement, and Cray won the contract. We then considered competing architectures, Nvidia versus AMD, and in the end, we went for both. We did the scale-out with Grace Hopper [from Nvidia], and now we also have a significant partition of MI300A accelerators [from AMD] on Alps.

And how is Alps running today?

Schulthess: The way Alps works today is that it has a very large Slingshot network, like Frontier and LUMI, and we can partition the network. At the end of every network endpoint is either a storage device or a compute node. The compute nodes are either Grace Hopper (GH200)-based or AMD MI300A-based. We also have Nvidia A100 and AMD MI250X processors, which makes those nodes the same as in LUMI and in Frontier. We have AMD Rome-based nodes as well, so a traditional multicore partition.

Hence, we support a multitude of compute architectures on Alps. The idea is that we can serve different workloads. We also have a big focus on application software development, so we can make all these kinds of architectures available to software developers.
And that's where we are today.

How do you offer services on Alps?

Schulthess: You can view Alps as offering a cloud-like experience, with different types of service. We can offer infrastructure as a service (IaaS). Typically, we offer IaaS to other research infrastructures, such as the Paul Scherrer Institute, which runs several large user programmes, including access to a synchrotron [the Swiss Light Source], a free-electron laser [SwissFEL], and the Swiss spallation neutron and muon science facilities. They get a partition on Alps and run their own platforms on it.

In other cases, we might create a platform for AI, traditional HPC, or climate and weather for users. Then we have users or communities that run their own function as a service, and we provide them with a platform as a service. We are also involved with large experiments, such as the Square Kilometre Array, and the Swiss Tier-2 centre for LHC data analysis, which is part of the Worldwide LHC Computing Grid and runs in a partition on Alps.

And probably the most important thing now is that, where we used to have a separate computer for MeteoSwiss, under the new model we run their numerical forecasting system, ICON, in a partition on Alps.

Is the fact that ICON now runs in a partition a good indication of the size of Alps?

Schulthess: Well, it shows you the size, but also the breadth that we can cover. Traditionally, a supercomputer is a unique system. It may be heterogeneous. For example, Piz Daint is heterogeneous in that it has multicore nodes and GPU-accelerated nodes. But it was architected as a uniform system, a one-size-fits-all solution in terms of the programming environment and things like that.

Typically, users have to adapt to a particular supercomputer. So, you basically have a hammer, and you need to make everything look like a nail.
Now on Alps, we can create partitions, and the software environment in those partitions adapts to the users.

Who funds CSCS and Alps?

Schulthess: Alps, as a research infrastructure, is funded by the ETH Domain. CSCS is a unit of ETH Zurich, where I am also a professor of physics. ETH Zurich, EPFL, its sister school in Lausanne, and four national labs are joined together under what is called the ETH Domain.

The whole domain is funded by the State Secretariat for Education, Research and Innovation; that's our main funding source. But the MeteoSwiss part is funded by MeteoSwiss and whatever their funding sources are, so we have to maintain a clear separation there. We also have third-party funding, like most research infrastructures, in the range of around 20%.

Because we are a publicly funded infrastructure, even if we work with other third parties and get full cost recovery, we are still subsidised, and subsidies don't scale. We cannot host commercial activities on our infrastructure, though we can engage in research collaborations with commercial companies. And when we do collaborate with companies, they must fund the full costs of those collaborations.

What about your involvement in the OpenCHAMI consortium?

Schulthess: The OpenCHAMI consortium currently includes five partners: Los Alamos National Laboratory, NERSC [the National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory], the University of Bristol, HPE, and CSCS.

The consortium is developing the system management infrastructure of the future. Alps is an essential use case in this development.
That's why the system management software will continue to evolve over the next two or three years here at CSCS, but also in Bristol, Los Alamos and Berkeley.