Tutorials

The following tutorials will be held during the conference:

The ∆QSD Paradigm: Designing Systems with Predictable Performance at High Load

Authors: Peter Van Roy (Université catholique de Louvain, Belgium) and Seyed Hossein Haeri (University of Bergen, Norway)

Abstract

The ∆Q Systems Development paradigm (∆QSD) is an industrially derived methodology for developing complex real-world distributed systems that embeds statistically based performance metrics from the outset of the system design process and throughout the entire software production life cycle. It uses a stochastic approach to specify system behaviour, using cumulative distribution functions to model both delay and failure. Experience shows that this is a ‘sweet spot’ that gives good results with respect to the amount of computation needed. Predictions are accurate when the system model correctly captures both independent and dependent parts. The paradigm has been developed by the company PNSol over a period of 20+ years, in collaboration with IOG (formerly IOHK), BT, Vodafone, Boeing Space and Defence, and other major companies that focus on the development of reliable, high-quality, high-integrity distributed software systems with strong real-time requirements. In particular, it:

  • is outcome-centric and especially concerns itself with the timeliness (and probability of success) of an activity of interest.
  • works from validating initial goals through to in-life service assurance.
  • permits top-down and bottom-up (or a mixture) design approaches.
  • can formulate both system-centric and user-centric “experience” questions.
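The stochastic modelling idea above can be sketched in a few lines: an outcome’s ∆Q is an “improper” delay distribution whose total probability mass, when below 1, encodes failure, and sequential composition of outcomes corresponds to convolution of their delay distributions. The following is an illustrative sketch only, not code from the tutorial; the delays and probabilities are invented.

```python
import numpy as np

MAX_BINS = 200  # delay discretized into 1 ms bins

def outcome(delays_ms, probs):
    """An 'improper' delay PMF: total mass below 1 encodes failure."""
    pmf = np.zeros(MAX_BINS)
    for d, p in zip(delays_ms, probs):
        pmf[d] += p
    return pmf

def seq(a, b):
    """Sequential composition: delays add, so distributions convolve."""
    return np.convolve(a, b)[:MAX_BINS]

# A network hop: 10 ms with prob 0.70, 30 ms with prob 0.29, 1% loss.
hop = outcome([10, 30], [0.70, 0.29])
round_trip = seq(hop, hop)            # two hops in sequence

success = round_trip.sum()            # overall probability of success
cdf = np.cumsum(round_trip)           # ∆Q as a CDF over delay
```

In the full ∆QSD calculus, probabilistic choice and first-to-finish composition are handled analogously, and failure is simply the mass missing from the top of the CDF.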

∆QSD primarily targets systems with many independent users where real-time performance is important, including systems with large flows of mostly independent data items and systems subject to frequent overload. Application areas include:

  • service assurance and strategic planning of national broadband deployments.
  • validating potential effectiveness of safety-of-life distributed systems in adversarial environments.
  • design, development, and deployment of the largest proof-of-stake based ledger technology where assuring timeliness is a key security property.

Short Bio

Peter Van Roy (peter.vanroy@uclouvain.be) is a professor in the ICTEAM Institute at the Université catholique de Louvain (UCLouvain), where he heads the Programming Languages and Distributed Computing Research Group. He coordinated the EU projects SELFMAN and LIGHTKONE and was a partner in the projects EMJD-DC, SYNCFREE, MANCOOSI, EVERGROW, and PEPITO. He is a developer of the Mozart Programming System and author of a well-known textbook on computer programming published by MIT Press. P. Van Roy is developing the ∆QSD tutorial as the main instrument for the dissemination of the ∆QSD paradigm. He has given four versions of this tutorial at international conferences since 2022, namely at DisCoTec 2022, HiPEAC 2022, EuroPar 2022, and HiPEAC 2023. He also gives lectures on ∆QSD in his master-level distributed systems course LINFO2345 at UCLouvain (videos on YouTube channel @PeterVanRoy).

Seyed Hossein Haeri (hossein.haeri@gmail.com) is an associate professor in the BLDL institute at the University of Bergen, Norway, as well as a software scientist at IOG, Singapore, and a language specialist at Entropy Software Foundation, US. H. Haeri is a theoretical computer scientist who is developing the mathematical foundation of ∆QSD. His research lies at the intersection of Programming Languages and Software Engineering; over the past decade it has also taken on a Distributed Systems flavour. In the recent past, he has delivered a dozen ∆QSD talks at different venues, including QAVS 2022, ICE 2023, and NWPT 2023.

Software Performance Analysis – Industry Perspectives

Authors: Kingsum Chow (Zhejiang University, China), Chengdong Li (Optimatist Technology, China), Anil Rajput (Advanced Micro Devices, United States) and Xinyu Jiang (Zhejiang University, China)

Abstract

Over the decades, software performance analysis has become a fundamental aspect of systems architecture. This shift came to the forefront with the accelerating growth in the number of cores per CPU socket, necessitating a higher level of understanding and practice in the field. The number of cores per socket, specifically in CPUs developed by industry leaders like Intel, AMD, and ARM, has increased significantly over the years. Servers in 2022-2023 from AMD, ARM, and Intel have shown a trend towards more cores, leading to better performance, with AMD and Intel competing fiercely in this realm. AMD has adopted increasing core counts as a primary way to save power and cost, achieving up to 96 Zen 4 cores in Genoa and 128 Zen 4c cores in Bergamo. At Hot Chips 2023, Intel announced its 144-core Xeon Sierra Forest and, as its CEO Pat Gelsinger stated, aims for higher core-count versions, driving up the per-socket Aggregate Selling Price (ASP) more aggressively. ARM, on the other hand, has focused on the development of server processors with large clusters of CPU cores, enhancing their general-purpose performance from cloud computing to high-performance computing; for instance, Ampere’s Altra Max, based on ARM architecture, offers up to 128 cores per socket. Overall, this data suggests a growing trend and demand toward multi-core CPUs, which enhances parallel processing capabilities, resulting not only in superior performance but also in improved power efficiency. This drastic increase in the number of cores per socket emphasizes the need for advanced performance analysis to manage and utilize these resources effectively. Such developments in hardware design are a testament to the ingenuity and relentless innovation prevalent in the tech industry. However, they also pose a critical challenge for software performance analysis: a stark rise in its complexity, since the task at hand now involves an increased number of CPUs, extensive data sets, complex workflows, and large-scale systems, all of which are integral parts of modern computing architectures.

Performance benchmarking and testing are crucial for ensuring software quality. Typically, they involve using specific indicators alongside benchmarks or workloads, while engineers also employ system tools to gather data on system resource utilization. However, we have observed that many engineers misinterpret the performance data generated by these collection tools due to an insufficient understanding of their underlying mechanisms. In this tutorial, we will unveil the mechanisms powering these performance data collection tools by identifying common patterns, and we will discuss the impact of different mechanisms through several illustrative cases. Attendees will learn how performance tools can lead engineers astray, and why understanding performance data collection mechanisms matters.
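As a toy illustration of how a collection mechanism shapes the data it reports (an invented example, not one from the tutorial), consider a tool that averages CPU utilization over one-minute windows versus one that samples every second:

```python
# A bursty workload: 100% busy for 6 s, then idle for 54 s.
samples = [100.0] * 6 + [0.0] * 54

# A coarse tool reporting one averaged number per minute
# sees a lightly loaded machine...
minute_avg = sum(samples) / len(samples)   # 10.0 %

# ...while per-second sampling reveals saturation bursts that
# explain latency spikes the average cannot.
peak = max(samples)                                        # 100.0 %
saturated_seconds = sum(1 for s in samples if s >= 99.0)   # 6
```

Both tools are “correct”; they simply measure different things, which is exactly why knowing the collection mechanism matters before drawing conclusions.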

Short Bio

Kingsum Chow (kingsum.chow@gmail.com) is a professor at the School of Software Technology, Zhejiang University. He received his Ph.D. in Computer Science and Engineering at the University of Washington in 1996. Prior to joining Zhejiang University in 2023, Kingsum worked as a chief scientist and senior principal engineer in industry. He has extensive experience in software-hardware co-optimization from thirty years of working at Intel and Alibaba. He delivered two QCon keynotes and appeared four times in JavaOne keynotes. He has been issued 30 patents and has delivered more than 100 technical presentations. He has collaborated with many industry groups, including groups at Alibaba, Amazon, AMD, Ampere, Appeal, Arm, BEA, ByteDance, Facebook, Google, IBM, Intel, Microsoft, Netflix, Oracle, Siebel, Sun, Tencent and Twitter. In his spare time, he volunteers to coach multiple robotics teams to bring the joy of learning Science, Technology, Engineering and Mathematics to K-12 students in the USA and China.

Chengdong Li (chengdongli@optimatist.com) is a performance engineer, with more than 13 years of industry experience. He led and built several large-scale performance tools with both Tencent and Alibaba. In addition to his expertise in performance engineering, he is also interested in debugging, programming languages, and software-hardware co-optimization. He is the founder and CEO of Optimatist Technology, helping customers improve their hardware resource utilization and optimize workload running performance.

Anil Rajput (anil_Rajput@yahoo.com) is an AMD Fellow, Software System Design, serving as a core architect for datacenter and cloud with a focus on performance, deployments, optimizations, and best practices. He received his certification in data analytics from the Harvard Business Analytics Program in 2022 and his Master’s in Electrical and Computer Engineering from Portland State University in 1997. Currently, Anil’s focus areas are workload characterization, platform evaluation, cloud deployments, and on-prem datacenters, as well as understanding and resolving large deployment issues at scale for critical customers. Earlier, he spent more than 20 years at Intel Corporation, playing various roles in the Software and Services Group, leading platform design, managed runtimes like Java and .NET, scripting languages, and the development of representative benchmarks as chair of the Java committee at SPEC. He was a key member of the teams that architected and developed several benchmarks such as SPECjbb2005, SPECjvm2008, SPECjEnterprise2010, and SPECpower_ssj2008. Anil also mentors graduate students and participates in local high school science fairs in Oregon, USA, to encourage kids toward STEM.

Xinyu Jiang (bernardjiang5@outlook.com) is a postgraduate student at Zhejiang University. His advisor is Kingsum Chow. His research interest focuses on system performance analysis and optimization.

DTraComp: Distributed Trace Compare

Authors: Maryam Ekhlasi (Polytechnique Montreal, Canada) and Nasser Ezzati-Jivan (Brock University, Canada)

Abstract

Microservice architectures improve software development through the use of diverse programming languages and deployment models, the containment of failures to specific services, and faster identification and resolution of issues in separate services. However, identifying the source of performance issues is difficult due to the numerous interacting service instances and the complexities introduced by parallelism. While end-to-end tracing provides a way to follow execution paths and identify latency across services, it falls short in identifying specific root causes of performance lags between processes. Furthermore, the lack of a comparison feature in many performance analysis tools prevents a detailed understanding of performance variances between different sets of requests. We will present DTraComp (Distributed Trace Compare), an open-source tool that works with various microservice tracing standards and integrates with Eclipse Trace Compass™. DTraComp enhances the analysis of distributed systems by offering a powerful visual comparison of two sets of executions, including parallel nested spans, and delivers detailed system kernel information for each thread in every span. This depth of analysis helps in pinpointing specific causes of performance issues across distributed systems.
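The core comparison idea can be illustrated in the abstract. The sketch below is not DTraComp’s API; it is a minimal, hypothetical example of comparing per-operation span durations between two trace sets to surface the operation that regressed most (all names and numbers are invented):

```python
from statistics import median

# Two trace sets, each mapping a span name to its observed
# durations in ms. Purely hypothetical data.
baseline  = {"auth": [4, 5, 5], "db.query": [12, 13, 15], "render": [8, 9]}
regressed = {"auth": [4, 5, 6], "db.query": [40, 42, 47], "render": [8, 10]}

def compare(a, b):
    """Per-operation median delta, largest regression first."""
    deltas = {op: median(b[op]) - median(a[op]) for op in a if op in b}
    return sorted(deltas.items(), key=lambda kv: -kv[1])

report = compare(baseline, regressed)
# The db.query spans stand out as the likely root cause.
```

A tool like DTraComp goes much further, comparing nested span hierarchies visually and attaching kernel-level detail to each thread, but the aggregate-then-diff step above is the essence of why comparison matters for root-cause analysis.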

Short Bio

Professor Nasser Ezzati-Jivan is an esteemed faculty member at Brock University’s Department of Computer Science, known for his pioneering research across multiple domains of computing. His expertise spans Software Debugging and Monitoring, Performance Engineering, Software Tracing, Distributed Multicore Systems, Cloud Computing and Virtualization, and Streaming Data Analysis. Through his work, Professor Ezzati-Jivan aims to solve complex computing problems, enhance software reliability, and improve system performance. His contributions to the field are marked by a deep commitment to innovation and excellence, positioning him as a leading figure in computer science research and education. Email address: nezzatijivan@brocku.ca

Maryam Ekhlasi is a Ph.D. candidate at Polytechnique Montréal, specializing in software performance with a focus on complex distributed systems. With over eight years of experience as a software designer and developer, including experience working at Ericsson, she has developed expertise in performance metrics and Linux kernel events. Her work involves collaborating with major high-tech companies to address performance issues and enhance cloud efficiency. Additionally, she is recognized as one of the presenters at the Ericsson Developers Conference, demonstrating a practical application of her research in diagnosing software performance. Email address: Maryam.ekhlasi@polymtl.ca