Hardware Support for Productive Partitioned Global Address Space (PGAS) Programming
Abstract: To exploit the increasing number of transistors, and because of the limitations of frequency scaling, the number of cores inside a chip keeps growing. As many-core chips become ubiquitous, there is a greater need for a more productive and efficient parallel programming model. The easy-to-use, but locality-agnostic, shared memory model (e.g., OpenMP) is unable to efficiently exploit memory locality in systems with Non-Uniform Memory Access (NUMA) and Non-Uniform Cache Access (NUCA) effects. The locality-aware, but explicit, message-passing model (e.g., MPI) does not provide a productive development environment due to its two-sided communication and its distributed (and isolated) memory model. The Partitioned Global Address Space (PGAS) programming model strikes a balance between those two extremes via a global address space that is provided for ease of use, but is partitioned for locality awareness. The user-friendly PGAS memory model, however, comes at a performance cost because of the address mapping it requires. To mitigate this overhead and achieve full performance, compiler optimizations may be applied, but they are often insufficient. Alternatively, manual optimizations can be applied, but they are cumbersome and therefore unproductive. As a result, the overall benefit of PGAS has been severely limited. In this dissertation, we improved both the productivity and performance of PGAS by introducing novel hardware support. This PGAS hardware support efficiently handles the complex PGAS mapping and communication without the intervention of an application developer. By introducing the new hardware at the micro-architecture level, fine-grained, low-latency local shared memory accesses are supported. The hardware is also made available through an ISA extension, so that it can easily be exploited by PGAS compilers to efficiently access and traverse the PGAS memory space.
The automatic code generation eliminates the need for hand-tuning, and thus simultaneously improves both the performance and productivity of PGAS languages. This research also introduces and evaluates the possibility for the hardware support to handle a variety of PGAS languages. Results are obtained on two different system implementations. The first is based on the well-adopted full system simulator Gem5, which allows the precise evaluation of the performance gain. Two prototype compilers supporting the new hardware are created for experimentation by extending the Berkeley Unified Parallel C (UPC) compiler and the Cray Chapel compiler. This allows unmodified code to use the new instructions without any user intervention, thereby creating a productive programming environment. The second, proof-of-concept implementation is a hardware prototype based on the multi-core Leon3 softcore processor running on a Virtex-6 FPGA. This allowed us not only to verify the feasibility of the implementation but also to evaluate the cost of the new hardware and its instructions. This research has shown very promising results. With benchmarks in UPC and Chapel, including the NAS Parallel Benchmarks implemented in UPC, a speedup of up to 5.5x is demonstrated when using the hardware support with unmodified code. In some cases, unmodified code using this hardware was also shown to surpass the performance of manually optimized UPC code by up to 10%. With Chapel, we obtained measurable speedups of up to 19x. Additionally, the hardware prototype demonstrated that only a very small area increase is needed.
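
To make the address mapping cost mentioned in the abstract concrete, the following is a minimal C sketch of the index translation that a block-cyclic shared array (the layout UPC uses for blocked shared arrays) requires on every access. The type `pgas_loc` and the function `pgas_locate` are hypothetical names chosen for illustration. They are not taken from the dissertation, and the sketch does not represent the proposed hardware or its instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical decomposition of a shared-array index (illustrative only). */
typedef struct {
    size_t thread;  /* thread that owns the element (its affinity) */
    size_t phase;   /* position of the element inside its block */
    size_t offset;  /* element offset inside the owner's local partition */
} pgas_loc;

/* Map a global element index to its location under a block-cyclic layout:
 * blocks of block_size elements are dealt round-robin to the threads. */
static pgas_loc pgas_locate(size_t index, size_t block_size, size_t threads)
{
    assert(block_size > 0 && threads > 0);

    size_t block = index / block_size;   /* global block number */
    pgas_loc loc;
    loc.thread = block % threads;        /* round-robin block owner */
    loc.phase  = index % block_size;     /* offset within the block */
    /* Each full round of blocks adds one block to every thread's partition. */
    loc.offset = (block / threads) * block_size + loc.phase;
    return loc;
}
```

With a block size of 4 and 3 threads, for instance, element 13 lies in block 3, which belongs to thread 0 at local offset 5. Performed in software, the divisions and modulo operations above recur on each shared access and each pointer increment, which is the kind of overhead the abstract attributes to PGAS address mapping.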