Volume 2, Chapter 15: Open MPI

Original article: http://www.aosabook.org/en/openmpi.html

Author: Jeffrey M. Squyres

15.1. Background

Open MPI [GFB+04] is an open source software implementation of The Message Passing Interface (MPI) standard. Before the architecture and innards of Open MPI will make any sense, a little background on the MPI standard must be discussed.

The Message Passing Interface (MPI)

The MPI standard is created and maintained by the MPI Forum, an open group consisting of parallel computing experts from both industry and academia. MPI defines an API that is used for a specific type of portable, high-performance inter-process communication (IPC): message passing. Specifically, the MPI document describes the reliable transfer of discrete, typed messages between MPI processes. Although the definition of an "MPI process" is subject to interpretation on a given platform, it usually corresponds to the operating system's concept of a process (e.g., a POSIX process). MPI is specifically intended to be implemented as middleware, meaning that upper-level applications call MPI functions to perform message passing.

MPI defines a high-level API, meaning that it abstracts away whatever underlying transport is actually used to pass messages between processes. The idea is that sending-process X can effectively say "take this array of 1,073 double precision values and send them to process Y". The corresponding receiving-process Y effectively says "receive an array of 1,073 double precision values from process X." A miracle occurs, and the array of 1,073 double precision values arrives in Y's waiting buffer.
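
For illustration, the exchange just described maps onto two calls from the standard MPI C API. The sketch below is not specific to Open MPI; it simply assumes the program is run with at least two processes:

    /* Process 0 sends an array of 1,073 doubles to process 1, which posts
       a matching receive.  No connections, byte streams, or network
       addresses appear anywhere in the application code. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        double values[1073];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < 1073; i++) {
                values[i] = 0.0;    /* fill with real data in practice */
            }
            /* "Take these 1,073 double precision values and send them
               to process 1." */
            MPI_Send(values, 1073, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* "Receive 1,073 double precision values from process 0." */
            MPI_Recv(values, 1073, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }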

Notice what is absent in this exchange: there is no concept of a connection occurring, no stream of bytes to interpret, and no network addresses exchanged. MPI abstracts all of that away, not only to hide such complexity from the upper-level application, but also to make the application portable across different environments and underlying message passing transports. Specifically, a correct MPI application is source-compatible across a wide variety of platforms and network types.

MPI defines not only point-to-point communication (e.g., send and receive), it also defines other communication patterns, such as collective communication. Collective operations are where multiple processes are involved in a single communication action. Reliable broadcast, for example, is where one process has a message at the beginning of the operation, and at the end of the operation, all processes in a group have the message. MPI also defines other concepts and communications patterns that are not described here. (As of this writing, the most recent version of the MPI standard is MPI-2.2 [For09]. Draft versions of the upcoming MPI-3 standard have been published; it may be finalized as early as late 2012.)
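
As a concrete example of a collective, the reliable broadcast just described is a single call made by every process in the group. A minimal sketch, assuming rank 0 is the root of the broadcast:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        double data[1024];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < 1024; i++) {
                data[i] = (double) i;   /* only the root has the data initially */
            }
        }

        /* Collective operation: every process in MPI_COMM_WORLD makes the
           same call; afterwards, all of them hold rank 0's copy of the array. */
        MPI_Bcast(data, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        printf("rank %d now has data[42] = %g\n", rank, data[42]);
        MPI_Finalize();
        return 0;
    }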

Uses of MPI

There are many implementations of the MPI standard that support a wide variety of platforms, operating systems, and network types. Some implementations are open source, some are closed source. Open MPI, as its name implies, is one of the open source implementations. Typical MPI transport networks include (but are not limited to): various protocols over Ethernet (e.g., TCP, iWARP, UDP, raw Ethernet frames, etc.), shared memory, and InfiniBand.

MPI implementations are typically used in so-called "high-performance computing" (HPC) environments. MPI essentially provides the IPC for simulation codes, computational algorithms, and other "big number crunching" types of applications. The input data sets on which these codes operate typically represent too much computational work for just one server; MPI jobs are spread out across tens, hundreds, or even thousands of servers, all working in concert to solve one computational problem.

That is, the applications using MPI are both parallel in nature and highly compute-intensive. It is not unusual for all the processor cores in an MPI job to run at 100% utilization. To be clear, MPI jobs typically run in dedicated environments where the MPI processes are the only application running on the machine (in addition to bare-bones operating system functionality, of course).

As such, MPI implementations are typically focused on providing extremely high performance, measured by metrics such as:

  • Extremely low latency for short message passing. As an example, a 1-byte message can be sent from a user-level Linux process on one server, through an InfiniBand switch, and received at the target user-level Linux process on a different server in a little over 1 microsecond (i.e., 0.000001 second). (A simple way to measure this metric is sketched after this list.)
  • Extremely high message network injection rate for short messages. Some vendors have MPI implementations (paired with specified hardware) that can inject up to 28 million messages per second into the network.
  • Quick ramp-up (as a function of message size) to the maximum bandwidth supported by the underlying transport.
  • Low resource utilization. All resources used by MPI (e.g., memory, cache, and bus bandwidth) cannot be used by the application. MPI implementations therefore try to maintain a balance of low resource utilization while still providing high performance.
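
The first metric above is typically measured with a ping-pong loop: rank 0 sends a tiny message, rank 1 echoes it back, and half the average round-trip time approximates the one-way latency. The following is a simplified, illustrative sketch rather than a tuned benchmark (real tools add warm-up iterations and more careful statistics):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int iters = 10000;
        char byte = 0;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double start = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - start;

        if (rank == 0) {
            /* Half the average round trip approximates one-way latency. */
            printf("1-byte latency: %.3f microseconds\n",
                   elapsed / iters / 2.0 * 1e6);
        }

        MPI_Finalize();
        return 0;
    }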

Open MPI

The first version of the MPI standard, MPI-1.0, was published in 1994 [Mes93]. MPI-2.0, a set of additions on top of MPI-1, was completed in 1996 [GGHL+96].

In the first decade after MPI-1 was published, a variety of MPI implementations sprung up. Many were provided by vendors for their proprietary network interconnects. Many other implementations arose from the research and academic communities. Such implementations were typically "research-quality," meaning that their purpose was to investigate various high-performance networking concepts and provide proofs-of-concept of their work. However, some were high enough quality that they gained popularity and a number of users.

Open MPI represents the union of four research/academic, open source MPI implementations: LAM/MPI, LA/MPI (Los Alamos MPI), and FT-MPI (Fault-Tolerant MPI). The members of the PACX-MPI team joined the Open MPI group shortly after its inception.

The members of these four development teams decided to collaborate when we had the collective realization that, aside from minor differences in optimizations and features, our software code bases were quite similar. Each of the four code bases had their own strengths and weaknesses, but on the whole, they more-or-less did the same things. So why compete? Why not pool our resources, work together, and make an even better MPI implementation?

After much discussion, the decision was made to abandon our four existing code bases and take only the best ideas from the prior projects. This decision was mainly predicated upon the following premises:

  • Even though many of the underlying algorithms and techniques were similar among the four code bases, they each had radically different implementation architectures, and would be incredibly difficult (if not impossible) to merge.
  • Each of the four also had their own (significant) strengths and (significant) weaknesses. Specifically, there were features and architecture decisions from each of the four that were desirable to carry forward. Likewise, there were poorly optimized and badly designed code in each of the four that were desirable to leave behind.
  • The members of the four developer groups had not worked directly together before. Starting with an entirely new code base (rather than advancing one of the existing code bases) put all developers on equal ground.

Thus, Open MPI was born. Its first Subversion commit was on November 22, 2003.

15.2. Architecture

For a variety of reasons (mostly related to either performance or portability), C and C++ were the only two possibilities for the primary implementation language. C++ was eventually discarded because different C++ compilers tend to lay out structs/classes in memory according to different optimization algorithms, leading to different on-the-wire network representations. C was therefore chosen as the primary implementation language, which influenced several architectural design decisions.
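
The concern is easiest to see with a concrete example. If the raw bytes of a header structure are ever placed on the wire, its layout must be identical for every peer. In C, fixed-width types and explicit offset checks make that layout straightforward to pin down and verify at build time. The header below is purely hypothetical, not Open MPI's actual wire format:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical message header: four naturally aligned fixed-width
       fields, so no conforming compiler needs to insert padding. */
    struct wire_header {
        uint32_t src_rank;   /* sending process rank     */
        uint32_t tag;        /* MPI message tag          */
        uint32_t msg_id;     /* sequence number          */
        uint32_t length;     /* payload length in bytes  */
    };

    int main(void)
    {
        /* If a compiler laid this structure out differently, these checks
           would catch it before two peers disagreed on the wire format. */
        _Static_assert(offsetof(struct wire_header, tag) == 4, "layout");
        _Static_assert(offsetof(struct wire_header, length) == 12, "layout");
        _Static_assert(sizeof(struct wire_header) == 16, "layout");

        printf("wire header is %zu bytes\n", sizeof(struct wire_header));
        return 0;
    }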

When Open MPI was started, we knew that it would be a large, complex code base:

  • In 2003, the current version of the MPI standard, MPI-2.0, defined over 300 API functions.
  • Each of the four prior projects were large in themselves. For example, LAM/MPI had over 1,900 files of source code, comprising over 300,000 lines of code (including comments and blanks).
  • We wanted Open MPI to support more features, environments, and networks than all four prior projects put together.

We therefore spent a good deal of time designing an architecture that focused on three things:

  1. Grouping similar functionality together in distinct abstraction layers.
  2. Using run-time loadable plugins and run-time parameters to choose between multiple different implementations of the same behavior (a minimal sketch of this technique follows this list).
  3. Not allowing abstraction to get in the way of performance.
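
The second point deserves a brief illustration. Open MPI's own plugin machinery is considerably more elaborate, but the underlying technique is the familiar one sketched below: each implementation of a behavior lives in a shared object that exports an agreed-upon symbol, and a run-time parameter selects which one to open. All of the names here (transport_plugin_t, transport_tcp.so, and so on) are hypothetical, not Open MPI's actual interfaces:

    #include <dlfcn.h>
    #include <stdio.h>

    /* Hypothetical interface that every "transport" plugin implements. */
    typedef struct {
        const char *name;
        int (*send)(const void *buf, int len);
    } transport_plugin_t;

    /* Open "./transport_<name>.so" and look up its descriptor symbol. */
    static transport_plugin_t *load_transport(const char *name)
    {
        char path[256];
        snprintf(path, sizeof(path), "./transport_%s.so", name);

        void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
        if (handle == NULL) {
            fprintf(stderr, "dlopen failed: %s\n", dlerror());
            return NULL;
        }
        /* Every plugin must export a symbol with this agreed-upon name. */
        return (transport_plugin_t *) dlsym(handle, "transport_plugin");
    }

    int main(int argc, char **argv)
    {
        /* A run-time parameter chooses among multiple implementations of
           the same behavior, e.g. "tcp" versus "shmem". */
        transport_plugin_t *t = load_transport(argc > 1 ? argv[1] : "tcp");
        if (t != NULL) {
            printf("selected transport: %s\n", t->name);
        }
        return 0;
    }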

Abstraction Layer Architecture

Open MPI has three main abstraction layers, shown in Figure 15.1:

  • Open, Portable Access Layer (OPAL): OPAL is the bottom layer of Open MPI's abstractions. Its abstractions are focused on individual processes (versus parallel jobs). It provides utility and glue code such as generic linked lists, string manipulation, debugging controls, and other mundane—yet necessary—functionality.
    OPAL also provides Open MPI's core portability between different operating systems, such as discovering IP interfaces, sharing memory between processes on the same server, processor and memory affinity, high-precision timers, etc.

  • Open MPI Run-Time Environment (ORTE) (pronounced "or-tay"): An MPI implementation must provide not only the required message passing API, but also an accompanying run-time system to launch, monitor, and kill parallel jobs. In Open MPI's case, a parallel job is comprised of one or more processes that may span multiple operating system instances, and are bound together to act as a single, cohesive unit.
    In simple environments with little or no distributed computational support, ORTE uses rsh or ssh to launch the individual processes in parallel jobs. More advanced, HPC-dedicated environments typically have schedulers and resource managers for fairly sharing computational resources between many users. Such environments usually provide specialized APIs to launch and regulate processes on compute servers. ORTE supports a wide variety of such managed environments, such as (but not limited to): Torque/PBS Pro, SLURM, Oracle Grid Engine, and LSF.

  • Open MPI (OMPI): The MPI layer is the highest abstraction layer, and is the only one exposed to applications. The MPI API is implemented in this layer, as are all the message passing semantics defined by the MPI standard.
    Since portability is a primary requirement, the MPI layer supports a wide variety of network types and underlying protocols. Some networks are similar in their underlying characteristics and abstractions; some are not.
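
To tie the three layers to something concrete: when a trivial MPI program such as the sketch below is started (typically through a launcher such as mpirun, e.g. "mpirun -np 4 ./hello"), ORTE launches and monitors the processes across one or more servers, OPAL provides the per-process portability glue underneath, and the OMPI layer implements the MPI_* calls that the application actually makes:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        /* Implemented by the OMPI layer; by the time this returns, the
           run-time system has launched and connected all the processes. */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        printf("hello from process %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }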
