Computer Architecture Group

George Michelogiannakis

Staff Scientist
Applied Mathematics & Computational Research Division
Phone: 510-495-2011
Fax: 650-390-6856
Lawrence Berkeley National Laboratory
One Cyclotron Rd
Berkeley, CA 94720 US

Biographical Sketch

George Michelogiannakis is a staff scientist in the Computer Architecture Group (CAG) of the AMCR division. He has worked extensively on networking (both off- and on-chip) and computer architecture. His latest work focuses on the post-Moore's-law era, looking into superconducting digital logic with novel compute models, compute and memory architectures, specialization, emerging devices (transistors), photonics, and 3D integration. He is also characterizing the use of key resources in modern HPC systems to reveal opportunities for resource disaggregation, and is designing photonically resource-disaggregated racks.

Journal Articles

Meriam Gay Bautista, Darren Lyles, Kylie Huch, Patricia Gonzalez-Guerrero, George Michelogiannakis, "Area Efficient Asynchronous SFQ Pulse Round-Robin Distribution Network", IEEE Transactions on Circuits and Systems I: Regular Papers, November 2023,

Zhenguo Wu, Liang Yuan Dai, Asher Novick, Madeleine Glick, Ziyi Zhu, Sébastien Rumley, George Michelogiannakis, John Shalf, Keren Bergman, "Peta-Scale Embedded Photonics Architecture for Distributed Deep Learning Applications", IEEE Journal of Lightwave Technology, May 2023,

Kylie Huch, Patricia Gonzalez-Guerrero, Darren Lyles, George Michelogiannakis, "Superconducting Hyperdimensional Associative Memory Circuit for Scalable Machine Learning", IEEE Transactions on Applied Superconductivity, May 2023,

Dilip Vasudevan, George Michelogiannakis, "Efficient Temporal Arithmetic Logic Design for Superconducting RSFQ Logic", IEEE Transactions on Applied Superconductivity, March 2023,

Darren Lyles, Patricia Gonzalez-Guerrero, Meriam Gay Bautista, George Michelogiannakis, "PaST-NoC: A Packet-Switched Superconducting Temporal NoC", IEEE Transactions on Applied Superconductivity, January 2023,

Meriam Gay Bautista, Patricia Gonzalez-Guerrero, Darren Lyles, George Michelogiannakis, "Superconducting Shuttle-Flux Shift Register for Race Logic and Its Applications", IEEE Transactions on Circuits and Systems I: Regular Papers, October 2022,

George Michelogiannakis, Benjamin Klenk, Brandon Cook, Min Yee Teh, Madeleine Glick, Larry Dennison, Keren Bergman, John Shalf, "A Case For Intra-Rack Resource Disaggregation in HPC", ACM Transactions on Architecture and Code Optimization, February 2022,

Georgios Tzimpragos, Jennifer Volk, Dilip Vasudevan, Nestan Tsiskaridze, George Michelogiannakis, Advait Madhavan, John Shalf, Timothy Sherwood, "Temporal Computing With Superconductors", IEEE Micro, March 2021, 41:71-79, doi: 10.1109/MM.2021.3066377

Madeleine Glick, Nathan C. Abrams, Qixiang Cheng, Min Yee Teh, Yu-Han Hung, Oscar Jimenez, Songtao Liu, Yoshitomo Okawachi, Xiang Meng, Leif Johansson, Manya Ghobadi, Larry Dennison, George Michelogiannakis, John Shalf, Alan Liu, John Bowers, Alex Gaeta, Michal Lipson, and Keren Bergman, "PINE: Photonic Integrated Networked Energy efficient datacenters (ENLITENED Program)", IEEE Journal of Optical Communications and Networking, 2020, 12:443-456,

W Cui, G Tzimpragos, Y Tao, J Mcmahan, D Dangwal, N Tsiskaridze, G Michelogiannakis, DP Vasudevan, T Sherwood, "Language Support for Navigating Architecture Design in Closed Form", ACM Journal on Emerging Technologies in Computing Systems, January 2019, 16:1-28, doi: 10.1145/3360047

George Michelogiannakis, Xiaoye S. Li, David H. Bailey, John Shalf, "Extending Summation Precision for Network Reduction Operations", Springer International Journal of Parallel Programming, December 2015, 43:6:1218-1243, doi: 10.1007/s10766-014-0326-5

George Michelogiannakis, William J. Dally, "Elastic Buffer Flow Control for On-Chip Networks", IEEE Transactions on Computers, 2013,

Networks-on-chip (NoCs) were developed to meet the communication requirements of large-scale systems. The majority of current NoCs spend considerable area and power for router buffers. In our past work, we have developed elastic buffer (EB) flow control which adds simple control logic in the channels to use pipeline flip-flops (FFs) as EBs with two storage locations. This way, channels act as distributed FIFOs and input buffers are no longer required. Removing buffers and virtual channels (VCs) significantly simplifies router design. Compared to VC networks, EB networks provide an up to 45% shorter cycle time, 16% more throughput per unit power or 22% more throughput per unit area. EB networks provide traffic classes using duplicate physical subnetworks. However, this approach negates the cost gains or becomes infeasible for a large number of traffic classes. Therefore, in this paper we propose a hybrid EB-VC router which provides an arbitrary number of traffic classes by using an input buffer to drain flits facing severe contention or deadlock. Thus, hybrid routers operate as EB routers in the common case, and as VC routers when necessary. For this reason, the hybrid EB-VC scheme offers 21% more throughput per unit power than VC networks and 12% than EB networks.
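
The abstract above describes elastic buffers as pipeline flip-flops with two storage locations that make a channel behave like a distributed FIFO. As a rough illustration only, the Python sketch below models a chain of such stages with a ready/valid handshake sampled at the start of each cycle. The class `ElasticBuffer`, the function `step`, and the cycle-level abstraction are assumptions made for this sketch, not the authors' hardware design.

```python
from collections import deque


class ElasticBuffer:
    """One channel stage with two storage slots (hypothetical model)."""

    CAPACITY = 2

    def __init__(self):
        self.slots = deque()

    @property
    def ready(self):
        # The stage can accept a flit if a slot is free.
        return len(self.slots) < self.CAPACITY

    @property
    def valid(self):
        # The stage has a flit to offer downstream.
        return len(self.slots) > 0


def step(channel, incoming, sink_ready):
    """Advance a chain of elastic buffers by one cycle.

    Handshake signals are sampled once at the start of the cycle, the way
    registered ready/valid logic would see them. Returns the flit delivered
    to the sink (or None) and whether `incoming` was accepted.
    """
    valid = [eb.valid for eb in channel]
    ready = [eb.ready for eb in channel]

    deliver = valid[-1] and sink_ready
    moves = [valid[i] and ready[i + 1] for i in range(len(channel) - 1)]
    accept = incoming is not None and ready[0]

    delivered = channel[-1].slots.popleft() if deliver else None
    # Apply transfers from the downstream end so FIFO order is preserved.
    for i in range(len(channel) - 2, -1, -1):
        if moves[i]:
            channel[i + 1].slots.append(channel[i].slots.popleft())
    if accept:
        channel[0].slots.append(incoming)
    return delivered, accept


if __name__ == "__main__":
    channel = [ElasticBuffer() for _ in range(3)]
    # With a stalled sink, the channel itself absorbs flits until it is full.
    for flit in range(7):
        _, accepted = step(channel, flit, sink_ready=False)
        print(f"flit {flit} accepted: {accepted}")
```

In this model a three-stage channel with a stalled sink absorbs six flits, two per stage, before refusing a seventh. This is the sense in which the channel, rather than a router input buffer, provides the storage.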

George Michelogiannakis, Nan Jiang, Daniel U. Becker, William J. Dally, "Packet Chaining: Efficient Single-Cycle Allocation for On-Chip Networks", IEEE Computer Architecture Letters, July 1, 2011,

This paper introduces packet chaining, a simple and effective method to increase allocator matching efficiency and hence network performance, particularly suited to networks with short packets and short cycle times. Packet chaining operates by chaining packets destined to the same output together, to reuse the switch connection of a departing packet. This allows an allocator to build up an efficient matching over a number of cycles, like incremental allocation, but not limited by packet length. For a 64-node 2D mesh at maximum injection rate and with single-flit packets, packet chaining increases network throughput by 15% compared to a conventional single-iteration separable iSLIP allocator, outperforms a wavefront allocator, and gives comparable throughput with an augmenting paths allocator. Packet chaining achieves this performance with a cycle time comparable to a single-iteration separable allocator. Packet chaining also reduces average network latency by 22.5%. Finally, packet chaining increases IPC up to 46% (16% average) for application benchmarks because short packets are critical in a typical cache-coherent CMP. These are considerable improvements given the maturity of network-on-chip routers and allocators.

George Michelogiannakis, Daniel U. Becker, William J. Dally, "Evaluating Elastic Buffer and Wormhole Flow Control", IEEE Transactions on Computers, 2011,

With the emergence of on-chip networks, router buffer power has become a primary concern. Elastic buffer (EB) flow control utilizes existing pipeline flip-flops in the channels to implement distributed FIFOs, eliminating the need for input buffers at the routers. EB routers have been shown to be more efficient than virtual channel routers, as they do not require input buffers or complex logic for managing virtual channels and tracking credits. Wormhole routers are more comparable in terms of complexity because they also lack virtual channels. This paper compares EB and wormhole routers and explores novel hybrid designs to more closely examine the effect of design simplicity and input buffer cost. Our results show that EB routers have up to 25 percent smaller cycle time compared to wormhole and hybrid routers. Moreover, EB flow control requires 10 percent less energy to transfer a single bit through a router and offers three percent more throughput per unit energy as well as 62 percent more throughput per unit area. The main contributor to these results is the cost and delay overhead of the input buffer.

Daniel Sanchez, George Michelogiannakis, Christos Kozyrakis, "An Analysis of Interconnection Networks for Large Scale Chip Multiprocessors", ACM Transactions on Architecture and Code Optimization, 2010,

With the number of cores of chip multiprocessors (CMPs) rapidly growing as technology scales down, connecting the different components of a CMP in a scalable and efficient way becomes increasingly challenging. In this article, we explore the architectural-level implications of interconnection network design for CMPs with up to 128 fine-grain multithreaded cores. We evaluate and compare different network topologies using accurate simulation of the full chip, including the memory hierarchy and interconnect, and using a diverse set of scientific and engineering workloads.

We find that the interconnect has a large impact on performance, as it is responsible for 60% to 75% of the miss latency. Latency, and not bandwidth, is the primary performance constraint, since, even with many threads per core and workloads with high miss rates, networks with enough bandwidth can be efficiently implemented for the system scales we consider. From the topologies we study, the flattened butterfly consistently outperforms the mesh and fat tree on all workloads, leading to performance advantages of up to 22%. We also show that considering interconnect and memory hierarchy together when designing large-scale CMPs is crucial, and neglecting either of the two can lead to incorrect conclusions. Finally, the effect of the interconnect on overall performance becomes more important as the number of cores increases, making interconnection choices especially critical when scaling up.

Conference Papers

Hamza Errahmouni Barkam, Sanggeon Yun, Hanning Chen, Paul Gensler, Albi Mema, Andrew Ding, George Michelogiannakis, Hussam Amrouch, Mohsen Imani, "Reliable hyperdimensional reasoning on unreliable emerging technologies", IEEE/ACM International Conference on Computer Aided Design (ICCAD), November 2023,

George Michelogiannakis, Yehia Arafa, Brandon Cook, Liang Yuan Dai, Abdel-Hameed Hameed Badawy, Madeleine Glick, Yuyang Wang, Keren Bergman, John Shalf, "Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics", IEEE International Conference on Cluster Computing (CLUSTER), November 2023,

Jie Li, George Michelogiannakis, Brandon Cook, Dulanya Cooray, Yong Chen, "Analyzing Resource Utilization in an HPC System: A Case Study of NERSC Perlmutter", ISC High Performance, Elsevier, May 2023,

Patricia Gonzalez-Guerrero, Kylie Huch, Nirmalendu Patra, Thom Popovici, George Michelogiannakis, "An Area Efficient Superconducting Unary CNN Accelerator", IEEE 24th International Symposium on Quality Electronic Design (ISQED), IEEE, April 2023,

Alvin Oliver Glova, Yukai Yang, Yiyao Wan, Zhizhou Zhang, George Michelogiannakis, Jonathan Balkind, Timothy Sherwood, "Establishing Cooperative Computation with Hardware Embassies", IEEE International Symposium on Secure and Private Execution Environment Design, September 2022,

Meriam Gay Bautista, Patricia Gonzalez-Guerrero, Darren Lyles, Kylie Huch, George Michelogiannakis, "Superconducting Digital DIT Butterfly Unit for Fast Fourier Transform Using Race Logic", 2022 20th IEEE Interregional NEWCAS Conference (NEWCAS), IEEE, June 2022, 441-445,

George Michelogiannakis, Madeleine Glick, John Shalf, Keren Bergman, "Photonics as a means to implement intra-rack resource disaggregation", Proceedings Volume 12027, Metro and Data Center Optical Networks and Short-Reach Links V, March 2022, doi: https://doi.org/10.1117/12.2607317

Patricia Gonzalez-Guerrero, Meriam Gay Bautista, Darren Lyles, George Michelogiannakis, "Temporal and SFQ Pulse-Streams Encoding for Area-Efficient Superconducting Accelerators", 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’22), ACM, February 2022,

Meriam Gay Bautista, Patricia Gonzalez-Guerrero, Darren Lyles, George Michelogiannakis, "Superconducting Shuttle-flux Shift Buffer for Race Logic", 2021 IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), August 2021,

George Michelogiannakis, Darren Lyles, Patricia Gonzalez-Guerrero, Meriam Bautista, Dilip Vasudevan, Anastasiia Butko, "SRNoC: A Statically-Scheduled Circuit-Switched Superconducting Race Logic NoC", IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2021,

George Michelogiannakis, Min Yee Teh, Madeleine Glick, John Shalf, Keren Bergman, "Maximizing the impact of emerging photonic switches at the system level", SPIE 11692, Optical Interconnects XXI, 116920Z, March 2021,

Anastasiia Butko, George Michelogiannakis, Samuel Williams, Costin Iancu, David Donofrio, John Shalf, Jonathan Carter, Irfan Siddiqi, "Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization", IEEE Conference on Rebooting Computing (ICRC), December 2020,

Min Yee Teh, Yu-Han Hung, George Michelogiannakis, Shijia Yan, Madeleine Glick, John Shalf, Keren Bergman, "TAGO: rethinking routing design in high performance reconfigurable networks", SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2020,

John Shalf, George Michelogiannakis, Brian Austin, Taylor Groves, Manya Ghobadi, Larry Dennison, Tom Gray, Yiwen Shen, Min Yee Teh, Madeleine Glick, and Keren Bergman, "Photonic Memory Disaggregation in Datacenters", OSA Advanced Photonics Congress (AP), July 2020,

Georgios Tzimpragos, Dilip Vasudevan, Nestan Tsiskaridze, George Michelogiannakis, Advait Madhavan, Jennifer Volk, John Shalf, Timothy Sherwood, "A Computational Temporal Logic for Superconducting Accelerators", ASPLOS '20: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, March 2020,

George Michelogiannakis, Yiwen Shen, Min Yee Teh, Xiang Meng, Benjamin Aivazi, Taylor Groves, John Shalf, Madeleine Glick, Manya Ghobadi, Larry Dennison, Keren Bergman, "Bandwidth Steering in HPC Using Silicon Nanophotonics", SC19: The International Conference for High Performance Computing, Networking, Storage, and Analysis, November 2019,

Pooria Mohammadiyaghni, George Michelogiannakis, Paul V. Gratz, "SpecLock: Speculative Lock Forwarding", International Conference on Computer Design (ICCD), November 2019,

S Werner, P Fotouhi, X Xiao, M Fariborz, SJB Yoo, G Michelogiannakis, D Vasudevan, "3D photonics as enabling technology for deep 3D DRAM stacking", Proceedings of the International Symposium on Memory Systems - MEMSYS 19, ACM Press, September 2019, doi: 10.1145/3357526.3357559

Anastasiia Butko, George Michelogiannakis, David Donofrio, John Shalf, "TIGER: topology-aware task assignment approach using ising machines", Proceedings of the 16th ACM International Conference on Computing Frontiers, April 2019,

Anastasiia Butko, George Michelogiannakis, David Donofrio, John Shalf, "Extending classical processors to support future large scale quantum accelerators", Proceedings of the 16th ACM International Conference on Computing Frontiers, April 2019,

George Michelogiannakis, Jeremiah Wilke, Min Yee Teh, Madeleine Glick, John Shalf, Keren Bergman, "Challenges and opportunities in system-level evaluation of photonics", Proceedings Volume 10946, Metro and Data Center Optical Networks and Short-Reach Links II, February 2019, doi: https://doi.org/10.1117/12.2510443

D Vasudevan, G Michelogiannakis, D Donofrio, J Shalf, "PARADISE - Post-Moore Architecture and Accelerator Design Space Exploration Using Device Level Simulation and Experiments", 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), IEEE, January 2019, doi: 10.1109/ispass.2019.00022

George Michelogiannakis, Benjamin Aivazi, Yiwen Shen, Larry Dennison, John Shalf, Keren Bergman, Madeleine Glick, "Architectural Opportunities and Challenges from Emerging Photonics in Future Systems", Photonics in Switching and Computing (PSC), September 2018,

Joseph P. Kenny, Khachik Sargsyan, Samuel Knight, George Michelogiannakis, Jeremiah J. Wilke, "The Pitfalls of Provisioning Exascale Networks: A Trace Replay Analysis for Understanding Communication Performance", ISC High Performance 2018, June 2018, 10876,

Keren Bergman, John Shalf, George Michelogiannakis, Sebastien Rumley, Larry Dennison, Manya Ghobadi, "PINE: An Energy Efficient Flexibly Interconnected Photonic Data Center Architecture for Extreme Scalability", 31st annual conference of the IEEE Photonics Society, IEEE, June 2018,

George Michelogiannakis, John Shalf, "Last Level Collective Hardware Prefetching For Data-Parallel Applications", IEEE 24th International Conference on High Performance Computing, IEEE, December 2017,

Dilip Vasudevan, George Michelogiannakis, John Shalf, "CASPER - Configurable Design Space Exploration of Programmable Architectures for Machine Learning using Beyond Moore Devices", IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH), July 2017,

George Michelogiannakis, Khaled Z. Ibrahim, John Shalf, Jeremiah J. Wilke, Samuel Knight, Joseph P. Kenny, "APHiD: Hierarchical Task Placement to Enable a Tapered Fat Tree Topology for Lower Power and Cost in HPC Networks", 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, IEEE, May 2017, LBNL 1007126,

D Vasudevan, A Butko, G Michelogiannakis, D Donofrio, J Shalf, "Towards an Integrated Strategy to Preserve Digital Computing Performance Scaling Using Emerging Technologies", Springer International Publishing, January 1, 2017, 115-123, doi: 10.1007/978-3-319-67630-2_10

With the decline and eventual end of historical rates of lithographic scaling, we arrive at a crossroad where synergistic and holistic decisions are required to preserve Moore's law technology scaling. Numerous emerging technologies aim to extend digital electronics scaling of performance, energy efficiency, and computational power/density, ranging from devices (transistors) and memories to 3D integration capabilities, specialized architectures, photonics, and others. The wide range of technology options creates the need for an integrated strategy to understand the impact of these emerging technologies on future large-scale digital systems for diverse application requirements and optimization metrics. In this paper, we argue for a comprehensive methodology that spans the different levels of abstraction, from materials, to devices, to complex digital systems and applications. Our approach integrates compact models of low-level characteristics of the emerging technologies to inform higher-level simulation models to evaluate their responsiveness to application requirements. The integrated framework can then automate the search for an optimal architecture using available emerging technologies to maximize a targeted optimization metric.

George Michelogiannakis, Dave Donofrio, John Shalf, "Modeling of Novel Transistors, Manufacturing Technologies, and Architectures to Preserve Digital Computing Performance Scaling", 1st International Workshop on Post-Moore’s Era Supercomputing (PMES), November 2016,

Farzad Fatollahi-Fard, David Donofrio, George Michelogiannakis, John Shalf, "OpenSoC Fabric: On-Chip Network Generator", ISPASS 2016: International Symposium on Performance Analysis of Systems and Software, IEEE, April 2016, 194-203, doi: 10.1109/ISPASS.2016.7482094

D Unat, T Nguyen, W Zhang, MN Farooqi, B Bastem, G Michelogiannakis, A Almgren, J Shalf, "TiDA: High-level programming abstractions for data locality management", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), January 2016, 9697:116-135, doi: 10.1007/978-3-319-41321-1_7

Farzad Fatollahi-Fard, David Donofrio, George Michelogiannakis, John Shalf, "OpenSoC Fabric: On-Chip Network Generator", Proceedings of the Workshop on Network on Chip Architectures, ACM, December 2014, 45-50, LBNL-1005675, doi: 10.1145/2685342.2685351

George Michelogiannakis, John Shalf, "Variable-Width Datapath for On-Chip Network Static Power Reduction", 8th International Symposium on Networks-on-Chip (NOCS), September 2014,

George Michelogiannakis, Alexander Williams, Samuel Williams, John Shalf, "Collective Memory Transfers for Multi-Core Chips", International Conference on Supercomputing (ICS), June 2014, doi: 10.1145/2597652.2597654

George Michelogiannakis, Nan Jiang, Daniel U. Becker, William J. Dally, "Channel Reservation Protocol for Over-Subscribed Channels and Destinations", Conference on High Performance Computing Networking, Storage and Analysis, ACM, 2013,

George Michelogiannakis, Xiaoye S. Li, David H. Bailey, John Shalf, "Extending Summation Precision for Network Reduction Operations", 25th International Symposium on Computer Architecture and High Performance Computing, IEEE Computer Society, October 2013,

Double precision summation is at the core of numerous important algorithms such as Newton-Krylov methods and other operations involving inner products, but the effectiveness of summation is limited by the accumulation of rounding errors, which are an increasing problem with the scaling of modern HPC systems and data sets. To reduce the impact of precision loss, researchers have proposed increased- and arbitrary-precision libraries that provide reproducible error or even bounded error accumulation for large sums, but do not guarantee an exact result. Such libraries can also increase computation time significantly. We propose big integer (BigInt) expansions of double precision variables that enable arbitrarily large summations without error and provide exact and reproducible results. This is feasible with performance comparable to that of double-precision floating point summation, by the inclusion of simple and inexpensive logic into modern NICs to accelerate performance on large-scale systems. 
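
The idea of expanding doubles into big integers so that summation becomes exact can be illustrated in software. The sketch below is only an approximation of the concept: it relies on Python's arbitrary-precision integers and the fact that every finite double is an integer multiple of 2^-1074. The function names and the fixed scale are assumptions for this example; the paper's NIC-side logic and its actual BigInt layout are not reproduced here.

```python
from fractions import Fraction

# Every finite IEEE 754 double is an integer multiple of 2**-1074,
# the smallest positive subnormal value.
SCALE_BITS = 1074


def double_to_bigint(x):
    """Return the integer k such that x == k * 2**-SCALE_BITS exactly."""
    numerator, denominator = x.as_integer_ratio()  # denominator is a power of two
    exponent = denominator.bit_length() - 1
    return numerator << (SCALE_BITS - exponent)


def bigint_to_double(k):
    """Round the exact value k * 2**-SCALE_BITS to the nearest double."""
    return float(Fraction(k, 1 << SCALE_BITS))


def exact_sum(values):
    """Sum finite doubles without intermediate rounding.

    Integer addition is associative, so the result does not depend on
    the order in which the values arrive (for example, in a reduction tree).
    """
    total = 0
    for value in values:
        total += double_to_bigint(value)
    return bigint_to_double(total)


if __name__ == "__main__":
    data = [1e100, 1.0, -1e100]
    print("naive sum:", sum(data))        # the 1.0 is lost to rounding
    print("exact sum:", exact_sum(data))  # the 1.0 is recovered
```

Only the final conversion back to a double rounds, and it rounds once. That is why the result is both exact up to the output format and reproducible regardless of summation order.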

Nan Jiang, Daniel U. Becker, George Michelogiannakis, James Balfour, Brian Towles, John Kim, William J. Dally, "A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator", International Symposium on Performance Analysis of Systems and Software, IEEE Computer Society, April 2013,

Daniel U. Becker, Nan Jiang, George Michelogiannakis, William J. Dally, "Adaptive Backpressure: Efficient Buffer Management for On-Chip Networks", International Conference on Computer Design, IEEE Computer Society, 2012,

This paper introduces Adaptive Backpressure, a novel scheme that improves the utilization of dynamically managed router input buffers by continuously adjusting the stiffness of the flow control feedback loop in response to observed traffic conditions. Through a simple extension to the router’s flow control mechanism, the proposed scheme heuristically limits the number of credits available to individual virtual channels based on estimated downstream congestion, aiming to minimize the amount of buffer space that is occupied unproductively. This leads to more efficient distribution of buffer space and improves isolation between multiple concurrently executing workloads with differing performance characteristics.

Experimental results for a 64-node mesh network show that Adaptive Backpressure improves network stability, leading to an average 2.6× increase in throughput under heavy load across traffic patterns. In the presence of background traffic, the proposed scheme reduces zero-load latency by an average of 31%. Finally, it mitigates the performance degradation encountered when latency- and throughput-optimized execution cores contend for network resources in a heterogeneous chip multi-processor; across a set of PARSEC benchmarks, we observe an average reduction in execution time of 34%.

Nan Jiang, Daniel U. Becker, George Michelogiannakis, William J. Dally, "Network Congestion Avoidance through Speculative Reservation", International Symposium on High Performance Computer Architecture, IEEE Computer Society, 2012,

Congestion caused by hot-spot traffic can significantly degrade the performance of a computer network. In this study, we present the Speculative Reservation Protocol (SRP), a new network congestion control mechanism that relieves the effect of hot-spot traffic in high bandwidth, low latency, lossless computer networks. Compared to existing congestion control approaches like Explicit Congestion Notification (ECN), which react to network congestion through packet marking and rate throttling, SRP takes a proactive approach of congestion avoidance. Using a light-weight endpoint reservation scheme and speculative packet transmission, SRP avoids hot-spot congestion while incurring minimal overhead. Our simulation results show that SRP responds more rapidly to the onset of severe hot-spots than ECN and has a higher network throughput on bursty network traffic. SRP also performs comparably to networks without congestion control on benign traffic patterns by reducing the latency and throughput overhead commonly associated with reservation protocols.

George Michelogiannakis, Nan Jiang, Daniel U. Becker, William J. Dally, "Packet Chaining: Efficient Single-Cycle Allocation for On-Chip Networks", International Symposium on Microarchitecture, ACM, 2011,

This paper introduces packet chaining, a simple and effective method to increase allocator matching efficiency and hence network performance, particularly suited to networks with short packets and short cycle times. Packet chaining operates by chaining packets destined to the same output together, to reuse the switch connection of a departing packet. This allows an allocator to build up an efficient matching over a number of cycles like incremental allocation, but not limited by packet length. For a 64-node 2D mesh at maximum injection rate and with single-flit packets, packet chaining increases network throughput by 15% compared to a highly-tuned router using a conventional single-iteration separable iSLIP allocator, and outperforms significantly more complex allocators. Specifically, it outperforms multiple-iteration iSLIP allocators and wavefront allocators by 10% and 6% respectively, and gives comparable throughput with an augmenting paths allocator. Packet chaining achieves this performance with a cycle time comparable to a single-iteration separable allocator. Packet chaining also reduces average network latency by 22.5% compared to a single-iteration iSLIP allocator. Finally, packet chaining increases IPC up to 46% (16% average) for application benchmarks because short packets are critical in a typical cache-coherent chip multiprocessor.

George Michelogiannakis, Daniel Sanchez, William J. Dally, Christos Kozyrakis, "Evaluating Bufferless Flow Control for On-chip Networks", International Symposium on Networks-on-Chip, IEEE Computer Society, 2010,

With the emergence of on-chip networks, the power consumed by router buffers has become a primary concern. Bufferless flow control addresses this issue by removing router buffers, and handles contention by dropping or deflecting flits. This work compares virtual-channel (buffered) and deflection (packet-switched bufferless) flow control. Our evaluation includes optimizations for both schemes: buffered networks use custom SRAM-based buffers and empty buffer bypassing for energy efficiency, while bufferless networks feature a novel routing scheme that reduces average latency by 5%. Results show that unless process constraints lead to excessively costly buffers, the performance, cost and increased complexity of deflection flow control outweigh its potential gains: bufferless designs are only marginally (up to 1.5%) more energy efficient at very light loads, and buffered networks provide lower latency and higher throughput per unit power under most conditions.

George Michelogiannakis, William J. Dally, "Router Designs for Elastic Buffer On-Chip Networks", Conference on High Performance Computing Networking, Storage and Analysis, ACM, 2009,

This paper explores the design space of elastic buffer (EB) routers by evaluating three representative designs. We propose an enhanced two-stage EB router which maximizes throughput by achieving a 42% reduction in cycle time and 20% reduction in occupied area by using look-ahead routing and replacing the three-slot output EBs in the baseline router of [17] with two-slot EBs. We also propose a single-stage router which merges the two pipeline stages to avoid pipelining overhead. This design reduces zero-load latency by 24% compared to the enhanced two-stage router if both are operated at the same clock frequency; moreover, the single-stage router reduces the required energy per transferred bit and occupied area by 29% and 30% respectively, compared to the enhanced two-stage router. However, the cycle time of the enhanced two-stage router is 26% smaller than that of the single-stage router.

George Michelogiannakis, James Balfour, William J. Dally, "Elastic Buffer Flow Control for On-Chip Networks", International Symposium on High Performance Computer Architecture, IEEE Computer Society, 2009,

This paper presents elastic buffers (EBs), an efficient flow-control scheme that uses the storage already present in pipelined channels in place of explicit input virtual-channel buffers (VCBs). With this approach, the channels themselves act as distributed FIFO buffers. Without VCBs, and hence virtual channels (VCs), deadlock prevention is achieved by duplicating physical channels. We develop a channel occupancy detector to apply universal globally adaptive load-balancing (UGAL) routing to load balance traffic in networks using EBs. Using EBs results in up to 8% (12% for low-swing channels) improvement in peak throughput per unit power compared to a VC flow-control network. These gains allow for a wider network datapath to be used to offset the removal of VCBs and increase throughput for a fixed power budget. EB networks have identical zero-load latency to VC networks operating under the same frequency. The microarchitecture of an EB router is considerably simpler than a VC router because allocators and credits are not required. For 5×5 mesh routers, this results in an 18% improvement in the cycle time.

Vassilis Papaefstathiou, Dionisios Pnevmatikatos, Manolis Marazakis, Giorgos Kalokairinos, Aggelos Ioannou, Michael Papamichael, Stamatis Kavadias, George Michelogiannakis, Manolis Katevenis, "Prototyping Efficient Interprocessor Communication Mechanics", International Conference on Embedded Computer Systems: Architectures, Modelling and Simulations, IEEE Computer Society, 2007,

Parallel computing systems are becoming widespread and grow in sophistication. Besides simulation, rapid system prototyping becomes important in designing and evaluating their architecture. We present an efficient FPGA-based platform that we developed and use for research and experimentation on high speed interprocessor communication, network interfaces and interconnects. Our platform supports advanced communication capabilities such as Remote DMA, Remote Queues, zero-copy data delivery and flexible notification mechanisms, as well as link bundling for increased performance. We report on the platform architecture, its design cost, complexity and performance (latency and throughput). We also report our experiences from implementing benchmarking kernels and a user-level benchmark application, and show how software can take advantage of the provided features, but also expose the weaknesses of the system.

George Michelogiannakis, Dionisios Pnevmatikatos, Manolis Katevenis, "Approaching Ideal NoC Latency with Pre-Configured Routes", First International Symposium on Networks-on-Chip, IEEE Computer Society, 2007,

In multi-core ASICs, processors and other compute engines need to communicate with memory blocks and other cores with latency as close as possible to the ideal of a direct buffered wire. However, current state of the art networks-on-chip (NoCs) suffer, at best, latency of one clock cycle per hop. We investigate the design of a NoC that offers close to the ideal latency in some preferred, run-time configurable paths. Processors and other compute engines may perform network reconfiguration to guarantee low latency over different sets of paths as needed. Flits in non-preferred paths are given lower priority than flits in preferred ones, and suffer a delay of one clock cycle per hop when there is no contention. To achieve our goal, we use the "madpostman" [5] technique: every incoming flit is eagerly (i.e. speculatively) forwarded to the input's preferred output, if any. This is accomplished with the mere delay of a single pre-enabled tri-state driver. We later check if that decision was correct, and if not, we forward the flit to the proper output. Incorrectly forwarded flits are classified as dead and eliminated in later hops. We use a 2D mesh topology tailored for processor-memory communication, and a modified version of XY routing that remains deadlock-free. Our evaluation shows that, for the preferred paths, our approach offers typical latency around 500 ps versus 1500 ps for a full clock cycle or 135 ps for an ideal direct connect, in a 130 nm technology; non-preferred paths suffer a one clock cycle delay per hop, similar to that of other approaches. Performance gains are significant and can prove highly useful in other application domains as well.
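
The eager-forwarding behaviour described in this abstract can be summarised as a small decision function. The sketch below is a simplified software model under stated assumptions: the `Flit` type, the `forward` function, the port names, and the use of a delay of 0 to stand for "less than a clock cycle" are all hypothetical and are not taken from the paper's router design.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Flit:
    payload: str
    dead: bool = False  # dead flits are discarded by a later hop


def forward(flit: Flit, preferred: Optional[str], correct: str) -> List[Tuple[str, Flit, int]]:
    """Return (output port, flit, delay in cycles) for each copy leaving a router.

    `preferred` is the pre-configured output for the input the flit arrived on
    (None if that input has no preferred route); `correct` is the output the
    routing function actually selects.
    """
    if flit.dead:
        # A flit that an earlier hop misforwarded is eliminated here.
        return []
    if preferred is None:
        # No pre-configured route: ordinary one-cycle hop.
        return [(correct, flit, 1)]
    if preferred == correct:
        # The eager guess was right: the flit passes without a cycle of delay.
        return [(preferred, flit, 0)]
    # The eager guess was wrong: the speculative copy is marked dead and the
    # real flit follows on the proper output one cycle later.
    return [(preferred, replace(flit, dead=True), 0), (correct, flit, 1)]


if __name__ == "__main__":
    request = Flit("load A")
    print(forward(request, preferred="east", correct="east"))
    print(forward(request, preferred="east", correct="north"))
```

The model makes the cost asymmetry visible: a flit on a preferred path never waits for a clock edge, while a wrong guess costs the usual one-cycle hop plus one wasted transfer on the preferred output.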

Presentation/Talks

George Michelogiannakis, Analyzing Resource Utilization in an HPC System: A Case Study of NERSC’s Perlmutter, ISC High Performance, May 2023,

George Michelogiannakis, A Case for Intra-Rack Resource Disaggregation for HPC, HiPEAC conference 2023, January 17, 2023,

George Michelogiannakis, Intra-Rack Resource Disaggregation Using Emerging Photonics, OCP Global Summit, October 19, 2022,

John Shalf, George Michelogiannakis, Heterogeneous Integration for HPC, OCP Global Summit, October 19, 2022,

George Michelogiannakis, Madeleine Glick, John Shalf, Keren Bergman, Photonics as a Means to Implement Intra-rack Resource Disaggregation, SPIE Photonics West, March 2022,

Patricia Gonzalez-Guerrero, Meriam Gay Bautista, Darren Lyles, George Michelogiannakis, Temporal and SFQ Pulse-Streams Encoding for Area-Efficient Superconducting Accelerators, 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’22), February 2022,

George Michelogiannakis, SRNoC: A Statically-Scheduled Circuit-Switched Superconducting Race Logic NoC, IEEE International Parallel and Distributed Processing Symposium, May 2021,

George Michelogiannakis, Min Yeh Teh, Madeleine Glick, John Shalf, Keren Bergman, Maximizing The Impact of Emerging Photonic Switches At The System Level, SPIE Photonics West, March 2021,

George Michelogiannakis, Forecasting the future of HPC systems, RIPCON 2020, August 2020,

George Michelogiannakis, Bandwidth Steering in HPC Using Silicon Nanophotonics, SC19: The International Conference for High Performance Computing, Networking, Storage, and Analysis, November 20, 2019,

George Michelogiannakis, Computation and Communication in a Post Moore’s Law Era, Post Exascale workshop part of HiPEAC conference, January 2019,

George Michelogiannakis, How Open Source Hardware Will Drive the Next Generation of HPC Systems, CROSS Symposium at UCSC, October 2018,

George Michelogiannakis, John Shalf, Benjamin Aivazi, Yiwen Shen, Keren Bergman, Madeleine Glick, Larry Dennison, Architectural Opportunities and Challenges from Emerging Photonics in Future Systems, IEEE conference on Photonics in Switching and Computing (PSC), September 2018,

George Michelogiannakis, An Architect’s Point of View of the Post Moore Era, 3rd International Workshop on Advanced Interconnect Solutions and Technologies for Emerging Computing Systems (AISTECS with HiPEAC 2018), January 2018,

George Michelogiannakis, Open-Source Hardware in the Post Moore Era, NovelHPC: Beyond Exascale: Workshop on Novel HPC Architectures (HiPEAC 2018), January 2018,

George Michelogiannakis, John Shalf, Last Level Collective Hardware Prefetching For Data-Parallel Applications, IEEE 24th International Conference on High Performance Computing, December 18, 2017,

George Michelogiannakis, David Donofrio, John Shalf, Modeling of Novel Transistors, Manufacturing Technologies, and Architectures to Preserve Digital Computing Performance Scaling, Post-Moore's Era Supercomputing (PMES) Workshop, November 2016,

George Michelogiannakis, John Shalf, Variable-Width Datapath for On-Chip Network Static Power Reduction, 8th International Symposium on Networks-on-Chip, September 2014,

Didem Unat, George Michelogiannakis, John Shalf, The Role of Modeling in Locality Optimizations, Modeling and simulation workshop (MODSIM), August 2014,

George Michelogiannakis, Collective Memory Transfers for Multi-Core Chips, International Conference on Supercomputing (ICS), June 2014,

George Michelogiannakis, Channel Reservation Protocol for Over-Subscribed Channels and Destinations, Conference on High Performance Computing Networking, Storage and Analysis, 2013,

George Michelogiannakis, Hardware Support for Collective Memory Transfers in Stencil Computations, Workshop on Optimizing Stencil Computations, October 2013,

George Michelogiannakis, Extending Summation Precision for Distributed Network Operations, 25th International Symposium on Computer Architecture and High Performance Computing, October 2013,

George Michelogiannakis, Packet Chaining: Efficient Single-Cycle Allocation for On-Chip networks, International Symposium on Microarchitecture, 2011,

George Michelogiannakis, Evaluating Bufferless Flow Control for On-chip Networks, International Symposium on Networks-on-Chip, 2010,

George Michelogiannakis, Router Designs for Elastic Buffer On-Chip Networks, Conference on High Performance Computing Networking, Storage and Analysis, 2009,

George Michelogiannakis, Elastic Buffer Flow Control for On-Chip Networks, International Symposium on High Performance Computer Architecture, 2009,

George Michelogiannakis, Approaching Ideal NoC Latency with Pre-Configured Routes, International Symposium on Networks-on-Chip, 2007,

Reports

George Michelogiannakis, John Shalf, David Donofrio, John Bachan, "Continuing the Scaling of Digital Computing Post Moore’s Law", LBNL report, April 2016, LBNL 1005126,

Traditional CMOS technology scaling, which up until now has followed Moore's law, is coming to an end in the next decade. However, the DOE has come to depend on the rapid, predictable, and cheap scaling of computing performance to meet mission needs for scientific theory, large-scale experiments, and national security. Moving forward, performance scaling of digital computing will need to originate from energy and cost reductions that are a result of novel architectures, devices, manufacturing technologies, and programming models. The deeper issue presented by these changes is the threat to DOE’s mission and to the future economic growth of the U.S. computing industry and to society as a whole. With the impending end of Moore’s law, it is imperative for the Office of Advanced Scientific Computing Research (ASCR) to develop a balanced research agenda to assess the viability of novel semiconductor technologies and navigate the ensuing challenges. This report identifies four areas and research directions for ASCR and how each can be used to preserve performance scaling of digital computing beyond exascale and after Moore's law ends.

Thesis/Dissertations

Energy-Efficient Flow-Control for On-Chip Networks, George Michelogiannakis, Stanford University, 2012,

With the emergence of on-chip networks, the power consumed by router buffers has become a primary concern. Bufferless flow control has been proposed to address this issue by removing router buffers and handling contention by dropping or deflecting flits. In this thesis, we compare virtual-channel (buffered) and deflection (packet-switched bufferless) flow control. Our study shows that unless process constraints lead to excessively costly buffers, the performance, cost and increased complexity of deflection flow control outweigh its potential gains. To provide buffering in the network but without the cost and timing overhead of router buffers, we propose elastic buffer (EB) flow control, which adds simple control logic in the channels to use pipeline flip-flops (FFs) as EBs with two storage locations. This way, channels act as distributed FIFOs, and neither input buffers nor the complexity of virtual channels (VCs) is required. Therefore, EB networks have a shorter cycle time and offer more throughput per unit power than VC networks. We also propose a hybrid EB-VC router, which is used to provide traffic separation for a number of traffic classes large enough for duplicate physical channels to be inefficient. These hybrid routers offer more throughput per unit power than both EB and VC routers. Finally, this thesis proposes packet chaining, which addresses the tradeoff between allocation quality and cycle time traditionally present in routers with VCs. Packet chaining is a simple and effective method to increase allocator matching efficiency to be comparable or superior to that of more complex and slower allocators without extending cycle time, and is particularly suited to networks with short packets.

Approaching Ideal NoC Latency with Pre-Configured Routes, George Michelogiannakis, University of Crete, 2007,

IPIF to PCI bridge specification, George Michelogiannakis, University of Crete, 2005,