2006 Workshop on On- and Off-Chip Interconnection Networks for Multicore Systems

6-7 December 2006, Stanford, California

Posters

Chung-Kuan Cheng, Communication Latency Aware Low Power NoC Synthesis Through Topology and Wire Style Optimization [ pdf ]

Communication latency and power consumption are two competing objectives in Network-on-Chip (NoC) design. We propose a novel method that unifies these two objectives in a multicommodity flow (MCF) formulation. With an improved fully polynomial approximation algorithm, a power-efficient design of a homogeneous 8 by 8 NoC meeting given communication bandwidth requirements can be found under given average-latency constraints. Our methodology features three key characteristics. First, we introduce a variety of wire styles into NoC design and incorporate latency constraints and power-minimization objectives into a unified MCF model, simultaneously optimizing network topology, physical embedding, and interconnect wire styles. Second, we heuristically explore a large design space of network topologies. Third, we implement and optimize the MCF solver using approximation algorithms, making it significantly faster than CPLEX. Experimental results for an 8 by 8 NoC suggest that (1) compared with mesh, torus, and hypercube topologies, our optimized design improves the power-latency product by up to 52.1%, 29.4%, and 35.6%, respectively; and (2) by sacrificing 2% in latency, the power consumption of the optimized design can be improved by up to 19.4%, which indicates the importance of power-latency co-optimization in NoC design.
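As a rough illustration of the kind of formulation involved (the notation and constraints below are our reconstruction, not necessarily the authors' exact model): each communication flow is a commodity k with bandwidth demand d_k, each candidate link e has a wire-style-dependent power cost p(e), delay t(e), and capacity c(e), and the average-latency budget is L.

```latex
% Hedged sketch of a latency-constrained, power-minimizing MCF (our notation).
\begin{align*}
\min \quad & \sum_{e} p(e) \sum_{k} f_k(e)
    && \text{(total communication power)} \\
\text{s.t.} \quad
  & \sum_{e \in \delta^{+}(v)} f_k(e) - \sum_{e \in \delta^{-}(v)} f_k(e) = b_k(v)
    && \forall k, v \quad \text{(flow conservation)} \\
  & \sum_{k} f_k(e) \le c(e)
    && \forall e \quad \text{(link capacity)} \\
  & \frac{1}{\sum_k d_k} \sum_{k} \sum_{e} t(e)\, f_k(e) \le L
    && \text{(average latency)}
\end{align*}
```

Here b_k(v) is d_k at commodity k's source, -d_k at its sink, and 0 elsewhere; fully polynomial-time approximation schemes solve such programs to within (1+ε) of optimal.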

David H. Albonesi (with Nevin Kırman, Meyrem Kırman, Rajeev K. Dokania, José F. Martínez, Alyssa B. Apsel, and Matthew A. Watkins), Leveraging Optical Technology in Future Chip Multiprocessors [ pdf ]

Although silicon-based optical technology is still in its formative stages, and its nearer-term application is chip-to-chip communication, rapid advances have been made in the development of on-chip optical interconnects. We investigate integrating CMOS-compatible optical technology into on-chip cache-coherent buses in future CMPs. While not exhaustive, our investigation yields a hierarchical opto-electrical system that exploits the advantages of optical technology while abiding by its projected limitations. Our evaluation shows that, for the applications considered, significant performance improvements can be achieved with an opto-electrical bus compared to an aggressive all-electrical bus of similar power and area. This performance improvement depends largely on the application's bandwidth demand and on the number of implemented wavelengths per optical waveguide. We also present a number of critical areas for future work that we discovered in the course of our research.

Angelos Bilas (with Manolis Marazakis, Konstantinos Xinidis, Vassilis Papaefstathiou), Efficient Remote Block-level I/O over an RDMA-capable NIC [ pdf ]

Modern storage systems are required to scale to large storage capacities and high I/O throughput in a cost-effective manner. For this reason, they are increasingly being built out of commodity components, mainly PCs equipped with large numbers of disks and interconnected by high-performance system area networks. A main challenge in these efforts is to achieve high I/O throughput over commodity, low-cost system area networks and commodity operating systems.

In this work, we examine in detail the performance of remote block-level storage I/O over commodity, RDMA-capable network interfaces and networks. We examine the support that is required from the network interface for achieving high throughput. We also examine in detail the overheads associated with kernel-level protocols for networked storage access. We find that base system performance is limited by (a) interrupt cost, (b) request size, and (c) protocol message size. We examine the impact of techniques to alleviate (a) and (b) and find that each of our techniques can improve throughput by up to 50% over the unoptimized version. Our current prototype achieves a throughput of about 200 MBytes/s over a network capable of delivering about 500 MBytes/s, and it is mostly limited by small messages in the remote storage access protocol.
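The limiting effect of small messages can be seen with a back-of-envelope model; all constants below are illustrative assumptions, not measurements from the poster:

```python
# Back-of-envelope model: each request pays a fixed per-message overhead
# (interrupt handling, protocol processing) on top of its wire time.
# All constants are illustrative assumptions, not measured values.

def throughput_mb_s(request_kb, overhead_us, link_mb_s=500.0):
    wire_us = request_kb / 1024.0 / link_mb_s * 1e6  # time to move the payload
    total_us = overhead_us + wire_us                 # overhead dominates small requests
    return (request_kb / 1024.0) / (total_us / 1e6)

# Larger requests (and coalesced interrupts) amortize the fixed cost:
for kb in (4, 64, 256):
    print(f"{kb:>4} KB requests -> {throughput_mb_s(kb, overhead_us=20.0):6.1f} MB/s")
```

With a 20 µs per-request overhead, 4 KB requests reach only ~140 MB/s on a 500 MB/s link, while 256 KB requests approach ~480 MB/s, which is why batching and interrupt coalescing matter.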

Keren Bergman, Luca Carloni, and David Albonesi (with M. Watkins, M. Kırman, N. Kırman, J. Martínez, and A. Shacham), Optical On-Chip Networks for High-Performance, Energy-Efficient Multi-Core Architectures [ pdf ]

This poster reports on a recently initiated collaborative project whose objective is to co-architect high-bandwidth optical switch architectures and fine-grain Chip Multi-Processors to speed up applications with high bandwidth demands.

The primary research focus is on designing a high-bandwidth grid network implemented on a dedicated optical layer and composed of very low power optical components. The interconnection network is used within a fine-grain CMP system of low-power cores, enabling learning algorithms to fully exploit high-bandwidth, low-power communication.

Fabrizio Petrini (with Daniele Paolo Scarpazza and Oreste Villa), Challenges in Mapping Graph Exploration Algorithms on Advanced Multi-core Processors

Numerous applications require the exploration of large graphs. The problem has been tackled in the past through a variety of solutions, based either on commodity processors or on custom-designed hardware. Processors based on multiple cores, like the Cell Broadband Engine (CBE), are gaining popularity as basic building blocks for high-performance clusters. Nevertheless, no studies have yet investigated how effectively the CBE architecture can explore large graphs, and how its performance compares with other architectural solutions.

In this talk, we describe the challenges and design choices involved in mapping a breadth-first search (BFS) algorithm onto the CBE. Our implementation has been driven by an accurate performance model that has allowed seamless coordination between on-chip communication, off-chip memory access, and computation.
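For concreteness, here is a minimal level-synchronous BFS of the kind that would be partitioned across the CBE's synergistic processing units; this serial Python sketch only illustrates the frontier-expansion structure, not the authors' actual implementation:

```python
# Minimal level-synchronous BFS; a serial sketch of the structure that
# would be partitioned across the CBE's SPEs (frontier chunks per SPE,
# with DMA prefetching adjacency lists), not the authors' implementation.

def bfs_levels(adj, root):
    """adj maps each vertex to its list of neighbours."""
    level = {root: 0}
    frontier = [root]
    depth = 0
    while frontier:                     # one iteration per BFS level
        depth += 1
        next_frontier = []
        for u in frontier:              # on the CBE: chunks of the frontier per SPE
            for v in adj[u]:
                if v not in level:      # first visit fixes the vertex's level
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(bfs_levels(adj, 0))  # {0: 0, 1: 1, 2: 1, 3: 2}
```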

Preliminary results obtained on a pre-production prototype running at 3.2 GHz show almost linear speedups when using multiple synergistic processing units and impressive levels of performance when compared to other processors. A single CBE can provide the same processing rate as 256 BlueGene/L processors; it is twenty times faster than a top-of-the-line AMD Opteron clocked at the same frequency and more than ten times faster than a dual-core Intel Woodcrest.

D. N. (Jay) Jayasimha (with Bilal Zafar and Yatin Hoskote), On-chip and Off-chip Interconnect Differences

With the potential emergence of chip multiprocessors with tens to hundreds of processing elements, the on-die interconnect becomes a key design element. Interconnects for off-chip multiprocessors are well studied. We argue that there are key differences between on-chip interconnects and their off-chip counterparts with respect to wiring, topology embedding, latencies of interest, and power. We suggest new metrics related to wiring density and length that need to be considered in addition to traditional metrics. We investigate several well-known topologies using these metrics and show that the analysis can lead to non-intuitive choices for on-die interconnect topologies.
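A hedged sketch of the kind of wiring-length metric at issue, comparing a k x k mesh and torus laid out on a planar die with unit router spacing (the metric definitions are our illustrative assumptions, not the authors'):

```python
# Total and maximum link length for a k x k mesh vs. torus laid out on a
# planar die with unit router spacing. Metric definitions are our
# illustrative assumptions, not the authors' exact formulation.

def mesh_wiring(k):
    total = 2 * k * (k - 1)      # k rows and k columns of (k-1) unit links
    return total, 1              # (total length, max link length)

def torus_wiring(k, folded=True):
    if folded:
        # Folded embedding: 2k rings of k links, each of length ~2;
        # folding bounds the longest wire at the cost of total length.
        return 2 * k * k * 2, 2
    # Naive embedding: per ring, (k-1) unit links plus one length-(k-1) wrap.
    return 2 * k * 2 * (k - 1), k - 1

for k in (4, 8):
    print(k, "mesh:", mesh_wiring(k),
          "torus folded:", torus_wiring(k),
          "torus naive:", torus_wiring(k, folded=False))
```

Even this toy comparison shows why on-die choices can be non-intuitive: a torus roughly doubles total wire length over a mesh, and only a folded embedding keeps its longest wire short.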

Manolis G.H. Katevenis, Light-Weight, Tightly-Coupled, High-Performance Network Interfaces [ pdf ]

Processor-to-Network Interfaces (NI) are the next-to-be-removed system bottleneck; low latency, high throughput, high flexibility, and low cost are needed. The NI must be small when compared to the processor and the memory that it connects to; in multi-core environments, this entails a "light-weight" NI that is small and fast when compared to the L1 cache of the processor. Hence, (a) the NI must not require dedicated memory of its own, but rather it must dynamically share a portion of L1 cache; and (b) sending/receiving information (enqueue/dequeue/RDMA) must be as fast as reading/writing a few words in L1 cache.

A powerful and simplifying architecture is to combine and integrate the network interface into/with the cache controller. Send/enqueue resembles cache block flush/replace, or write-update protocols, or writing into non-cacheable address space with write-combine. Receive/dequeue resembles/benefits from cache block prefetching. Support for synchronization primitives can be provided by enqueue/dequeue operations optionally triggering new events or packet generation; these resemble cache coherence protocol actions.
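A toy model of the "enqueue as a cache-line write" idea may help; the queue layout and interface below are illustrative assumptions, not the actual proposal:

```python
# Toy model of "enqueue as a cache-line write": a circular queue of
# cache-line-sized slots in a region shared between the NI and the L1.
# Names and sizes are illustrative assumptions, not the actual interface.

LINE_BYTES = 64

class NIQueue:
    def __init__(self, num_slots):
        self.slots = [None] * num_slots
        self.head = 0  # consumer index (the NI, on the send path)
        self.tail = 0  # producer index (the processor, on the send path)

    def enqueue(self, payload: bytes):
        """Resembles writing/flushing one cache line; a cache-integrated NI
        could snoop this write and trigger packet generation."""
        assert len(payload) <= LINE_BYTES
        if (self.tail + 1) % len(self.slots) == self.head:
            raise BufferError("queue full: would back-pressure the core")
        self.slots[self.tail] = payload
        self.tail = (self.tail + 1) % len(self.slots)

    def dequeue(self) -> bytes:
        """Resembles (and could benefit from) cache-block prefetching."""
        if self.head == self.tail:
            raise BufferError("queue empty")
        payload = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        return payload

q = NIQueue(8)
q.enqueue(b"hello")
print(q.dequeue())  # b'hello'
```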

Mithuna Thottethodi, Power, Performance, Reliability and Design Reuse of On-Chip and Off-Chip Networks: A Research Snapshot

We present a snapshot of recent, ongoing and upcoming interconnection network research in our group.

Performance and Power Optimizations: We present recent work including performance optimization techniques that target routing and switch arbitration in two-dimensional networks (the preferred topology of multicore on-chip networks). Our techniques include a provably near-optimal worst case throughput routing algorithm and a table-lookup-based maximum-cardinality-matching switch arbitration algorithm. We also present techniques based on FIFO banking and decoupling of switch arbitration granularity and flow-control granularity to reduce both dynamic and static power consumption in input buffers of on-chip routers.
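As an illustration of the arbitration problem, switch allocation can be cast as bipartite matching between input requests and output ports; the augmenting-path algorithm below is a standard way to compute a maximum-cardinality matching, whereas the authors realize the matching via table lookup, which this sketch does not model:

```python
# Switch allocation as bipartite matching: inputs request output ports and
# a maximum-cardinality matching grants as many crossbar connections per
# cycle as possible. This standard augmenting-path (Kuhn) algorithm shows
# the computation; the authors' table-lookup realization is not modeled.

def max_matching(requests, num_outputs):
    """requests[i] lists the output ports input i wants."""
    match_out = [-1] * num_outputs        # output port -> granted input

    def try_assign(i, seen):
        for o in requests[i]:
            if not seen[o]:
                seen[o] = True
                # Grant o to i if o is free or its holder can move elsewhere.
                if match_out[o] == -1 or try_assign(match_out[o], seen):
                    match_out[o] = i
                    return True
        return False

    grants = sum(try_assign(i, [False] * num_outputs)
                 for i in range(len(requests)))
    return grants, match_out

# Four inputs contending for the four output ports of a 2D-mesh router:
print(max_matching([[0, 1], [0], [1, 2], [2]], 4))  # (3, [1, 0, 2, -1])
```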

Soft Error Protection: Transient errors in a router's control logic (including FIFO operation, routing, VC allocation and switch arbitration) can cause a variety of faults including injection of spurious flits, misrouted flits, unroutable flits, lost flits and deadlocks. We quantify the architectural vulnerability factor (AVF) of the various components of a router's control logic and demonstrate simple protection techniques that enable significant reduction in AVF.
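AVF here presumably follows the standard definition of Mukherjee et al.: the fraction of bit-cycles during which a structure holds state required for architecturally correct execution (ACE):

```latex
% Standard AVF definition, over an execution of N cycles:
\mathrm{AVF} \;=\; \frac{\sum_{t=1}^{N} (\text{ACE bits resident in the structure at cycle } t)}
                        {(\text{total bits in the structure}) \times N}
```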

Semantic-Aware Networks: Network service requirements vary according to the varying semantics (such as real-time constraints, QoS requirements, in-order vs. out-of-order delivery and reliable vs. lossy transmission) of higher layers of the protocol stack. The various service requirements are traditionally realized in the software protocol stack with the network being oblivious of the higher level semantics. We propose an alternate semantic-aware network design that demonstrates opportunity to improve power, performance and design reusability of on-chip networks.

Olav Lysne, The realization of Virtual Compute Resources in Utility Computing Data Centers and Many-Core CPUs [ pdf ]

The envisaged mode of operation within a system comprising one or more many-core CPUs is that compute resources are assigned on demand to incoming jobs. These jobs will typically request a subset of the resources on the chip for a more or less defined period of time. A similar scenario is seen in Utility Computing Data Centers, where virtual servers containing a subset of the available resources (e.g., computing power and data storage) are dynamically created to fulfill customer demands.

The challenges imposed by these developments are not new. They are recognizable as the requirements that time-sharing mainframes first faced some decades ago, and that led to the development of solutions such as CPU and disk scheduling and virtual memory. For the system interconnect, however, there are a number of problems that have yet to be addressed. Some of these are listed below:

Flexible and Robust Partitioning: Since several jobs will run concurrently, and the quality of the software of these jobs will vary, it is important that a misbehaving job will not consume interconnect bandwidth (or the resources of peripheral devices) to the detriment of other jobs.

Fault Tolerance: The effect of a faulty component in the interconnection network should be constrained to as few jobs as possible, and the set of jobs that are terminated by the fault should, to the greatest extent possible, be controlled from a specification of job importance. A solution to fault tolerance should allow unaffected jobs to run uninterruptedly.

Predictable Service: With regard to partitioning and multiple jobs, it should be possible to guarantee a specific portion of the interconnect network capacity to each partition or job. Further, it should be possible to differentiate between jobs based on importance.

Our approach to these questions is based on a combination of flexible routing functions and reconfiguration capabilities. Our efforts will be aimed at providing solutions to the above problems to support the evolution of many-core interconnects and to maximise the use of resources within such systems.

Dhabaleswar K. (DK) Panda, Designing High Performance Communication Middleware with Emerging Multi-core Architectures [ pdf ]

For cluster-based systems with emerging multi-core architectures, message distributions for applications over on-chip and off-chip interconnects are taking a new trend. With this new trend, it is becoming critical to optimize on-chip communication at the middleware layers to provide maximum benefits to the applications. In this research we provide detailed profiling of a range of MPI (Message Passing Interface) applications on clusters with dual- and quad-core processors. We provide new MPI-level designs that optimize on-chip communication while taking advantage of the cache architectures on these systems. We show the benefits of such designs for modern clusters with multi-core architectures and the emerging InfiniBand interconnect. These designs have been incorporated into the open-source MVAPICH2 software, which is used worldwide to take advantage of emerging multi-core architectures for MPI applications.

Pat Conway, AMD's Direct Connect Architecture [ pdf ]

This poster illustrates the variety of Opteron system topologies built to date using HyperTransport and AMD's Direct Connect architecture for glueless multiprocessing, and summarizes the latency/bandwidth characteristics of these systems. Building and using Opteron systems has provided useful, real-world lessons that expose the challenges to be addressed when designing future system interconnects, memory hierarchies, and I/O to scale up both the number of cores and sockets in future x86 CMP architectures.

Li-Shiuan Peh (with Noel Eisley and Li Shang), In-Network Cache Coherence [ pdf ]

With the trend towards an increasing number of processor cores in future chip architectures, scalable directory-based protocols for maintaining cache coherence will be needed. However, directory-based protocols face well-known problems in delay and scalability. Most current protocol optimizations targeting these problems maintain a firm abstraction of the interconnection network fabric as a communication medium: protocol optimizations consist of end-to-end messages between requestor, directory, and sharer nodes, while network optimizations separately target lowering communication latency for coherence messages. In this paper, we propose an implementation of the cache coherence protocol within the network, embedding directories within each router node that manage and steer requests towards nearby data copies, enabling in-transit optimization of memory access delay. Simulation results across a range of SPLASH-2 benchmarks demonstrate significant performance improvement and good system scalability, with up to 44.5% and 56% savings in average memory access latency for 16- and 64-node systems, respectively, when compared against the baseline directory cache coherence protocol. Detailed microarchitecture and implementation characterization affirms the low area and delay impact of in-network coherence.
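An illustrative sketch of the steering idea: each router keeps a small directory of nearby copies and redirects a read request toward the closest known sharer rather than forwarding it all the way to the home node. The data structures and policy below are our simplification, not the paper's protocol:

```python
# Sketch of in-network steering: each router holds a small directory of
# nearby copies and retargets a read request toward the closest known
# sharer instead of the home node. Structures and policy are our
# simplification of the idea, not the paper's protocol.

def route_request(router_dirs, src, home, block):
    """router_dirs[(x, y)] maps a block to a node holding a nearby copy."""
    pos, dest, hops = src, home, 0
    while pos != dest:
        hit = router_dirs.get(pos, {}).get(block)
        if hit is not None:
            dest = hit                    # steer toward the in-network copy
        x, y = pos                        # dimension-order (XY) routing
        if x != dest[0]:
            pos = (x + (1 if dest[0] > x else -1), y)
        else:
            pos = (x, y + (1 if dest[1] > y else -1))
        hops += 1
    return dest, hops

# A copy at (1, 1), recorded in router (1, 0)'s directory, is found after
# two hops, short-circuiting the trip to the home node at (3, 3):
print(route_request({(1, 0): {"A": (1, 1)}}, src=(0, 0), home=(3, 3), block="A"))
```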

Vijaykrishnan Narayanan, 3-Dimensional Network in Memory

Three-dimensional chips and Networks-on-Chip (NoC) are two solutions aimed at addressing the growing interconnect design complexity. This poster will present the design of a hybrid network-on-chip customized for a 3D Chip-Multiprocessor memory system. The poster will also briefly present other techniques that are currently being explored for the interconnect fabric of 3D chips.

Rajeev Balasubramonian (with Naveen Muralimanohar), A Case for Interconnect-Aware Architectures [ pdf ]

In future multi-core chips, a large fraction of chip area will be dedicated to cache hierarchies and interconnects between various cache components. A careful consideration of interconnect properties can help improve the design of cache hierarchies and cache coherence protocols. As an example case study, we show results for a design space exploration of non-uniform cache architecture (NUCA) organizations. We also make a case for a heterogeneous interconnect architecture in which each link is composed of different wire types. Wires can be implemented to provide different latency/power/bandwidth properties, and data transfers make different latency and bandwidth demands depending on their architectural criticality. We show that an intelligent mapping of data to wires can yield performance and power improvements within the on-chip cache hierarchy.
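A minimal sketch of criticality-driven wire selection, assuming two hypothetical wire classes (fat, low-latency wires versus thin, low-power ones); the parameters and message classification are illustrative, not taken from the paper:

```python
# Criticality-driven mapping onto a heterogeneous link with two
# hypothetical wire classes; latencies/energies are illustrative
# placeholders, not values from the paper.

WIRE_TYPES = {
    # name: (relative latency, relative energy per bit)
    "fat_low_latency": (1.0, 2.0),   # wide, widely spaced wires
    "thin_low_power":  (2.0, 0.5),   # narrow wires, cheaper per bit
}

def pick_wire(transfer):
    """Latency-critical transfers (e.g. a load miss on the critical path)
    take the fast wires; the rest save energy on the slow ones."""
    return "fat_low_latency" if transfer["critical"] else "thin_low_power"

for t in ({"kind": "read-miss reply", "critical": True},
          {"kind": "writeback", "critical": False}):
    print(t["kind"], "->", pick_wire(t))
```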