Volume 2, Chapter 15: Open MPI

Updated 2018-02-24 15:55

Original article: http://www.aosabook.org/en/openmpi.html

Author: Jeffrey M. Squyres

15.1. Background

Open MPI [GFB+04] is an open source software implementation of The Message Passing Interface (MPI) standard. Before the architecture and innards of Open MPI will make any sense, a little background on the MPI standard must be discussed.

The Message Passing Interface (MPI)

The MPI standard is created and maintained by the MPI Forum, an open group consisting of parallel computing experts from both industry and academia. MPI defines an API that is used for a specific type of portable, high-performance inter-process communication (IPC): message passing. Specifically, the MPI document describes the reliable transfer of discrete, typed messages between MPI processes. Although the definition of an "MPI process" is subject to interpretation on a given platform, it usually corresponds to the operating system's concept of a process (e.g., a POSIX process). MPI is specifically intended to be implemented as middleware, meaning that upper-level applications call MPI functions to perform message passing.

MPI defines a high-level API, meaning that it abstracts away whatever underlying transport is actually used to pass messages between processes. The idea is that sending-process X can effectively say "take this array of 1,073 double precision values and send them to process Y". The corresponding receiving-process Y effectively says "receive an array of 1,073 double precision values from process X." A miracle occurs, and the array of 1,073 double precision values arrives in Y's waiting buffer.

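To make that exchange concrete, here is a minimal sketch in C of what the 1,073-value exchange looks like at the MPI API level. It is illustrative only (the tag value, the data contents, and the printed output are arbitrary, and it assumes the job is launched with at least two processes); it is not code from this chapter or from Open MPI itself:

    #include <mpi.h>
    #include <stdio.h>

    #define COUNT 1073   /* number of double precision values, as in the prose above */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double values[COUNT];
        if (rank == 0) {                     /* the sending process, "X" */
            for (int i = 0; i < COUNT; i++)
                values[i] = (double)i;
            MPI_Send(values, COUNT, MPI_DOUBLE, 1, 7, MPI_COMM_WORLD);
        } else if (rank == 1) {              /* the receiving process, "Y" */
            MPI_Recv(values, COUNT, MPI_DOUBLE, 0, 7, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %d doubles; values[42] = %g\n", COUNT, values[42]);
        }

        MPI_Finalize();
        return 0;
    }

The program names only a peer rank, a message tag, and a datatype; no sockets, addresses, or byte streams appear anywhere, which is exactly the point of the abstraction discussed next.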

Notice what is absent in this exchange: there is no concept of a connection occurring, no stream of bytes to interpret, and no network addresses exchanged. MPI abstracts all of that away, not only to hide such complexity from the upper-level application, but also to make the application portable across different environments and underlying message passing transports. Specifically, a correct MPI application is source-compatible across a wide variety of platforms and network types.

請(qǐng)注意在此交換中未出現(xiàn)的東西:沒(méi)有連接建立的概念,沒(méi)有需要解釋的字節(jié)流,也沒(méi)有網(wǎng)絡(luò)地址的交換。MPI將這些都進(jìn)行了抽象,不只是對(duì)上層應(yīng)用隱藏了這些復(fù)雜性,而且可以使應(yīng)用在不同環(huán)境和底層的消息傳輸上可移植。特別的,一個(gè)正確的MPI應(yīng)用可以在廣泛的平臺(tái)和網(wǎng)絡(luò)類型上源代碼級(jí)兼容。

MPI defines not only point-to-point communication (e.g., send and receive), it also defines other communication patterns, such as collective communication. Collective operations are where multiple processes are involved in a single communication action. Reliable broadcast, for example, is where one process has a message at the beginning of the operation, and at the end of the operation, all processes in a group have the message. MPI also defines other concepts and communications patterns that are not described here. (As of this writing, the most recent version of the MPI standard is MPI-2.2 [For09]. Draft versions of the upcoming MPI-3 standard have been published; it may be finalized as early as late 2012.)

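As an illustration of a collective operation, the following sketch (again illustrative C, not taken from the chapter) performs the reliable broadcast just described: before the call only the root holds the message; afterwards every rank in the group does:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int message[4] = {0, 0, 0, 0};
        if (rank == 0) {                 /* only the root starts with the data */
            message[0] = 11; message[1] = 22; message[2] = 33; message[3] = 44;
        }

        /* Every process in the group calls the same collective. */
        MPI_Bcast(message, 4, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d now has message = {%d, %d, %d, %d}\n",
               rank, message[0], message[1], message[2], message[3]);

        MPI_Finalize();
        return 0;
    }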

Uses of MPI

There are many implementations of the MPI standard that support a wide variety of platforms, operating systems, and network types. Some implementations are open source, some are closed source. Open MPI, as its name implies, is one of the open source implementations. Typical MPI transport networks include (but are not limited to): various protocols over Ethernet (e.g., TCP, iWARP, UDP, raw Ethernet frames, etc.), shared memory, and InfiniBand.

MPI implementations are typically used in so-called "high-performance computing" (HPC) environments. MPI essentially provides the IPC for simulation codes, computational algorithms, and other "big number crunching" types of applications. The input data sets on which these codes operate typically represent too much computational work for just one server; MPI jobs are spread out across tens, hundreds, or even thousands of servers, all working in concert to solve one computational problem.

That is, the applications using MPI are both parallel in nature and highly compute-intensive. It is not unusual for all the processor cores in an MPI job to run at 100% utilization. To be clear, MPI jobs typically run in dedicated environments where the MPI processes are the only application running on the machine (in addition to bare-bones operating system functionality, of course).

As such, MPI implementations are typically focused on providing extremely high performance, measured by metrics such as:

  • Extremely low latency for short message passing. As an example, a 1-byte message can be sent from a user-level Linux process on one server, through an InfiniBand switch, and received at the target user-level Linux process on a different server in a little over 1 microsecond (i.e., 0.000001 second). (A sketch of how such latency is typically measured follows this list.)
  • Extremely high message network injection rate for short messages. Some vendors have MPI implementations (paired with specified hardware) that can inject up to 28 million messages per second into the network.
  • Quick ramp-up (as a function of message size) to the maximum bandwidth supported by the underlying transport.
  • Low resource utilization. All resources used by MPI (e.g., memory, cache, and bus bandwidth) cannot be used by the application. MPI implementations therefore try to maintain a balance of low resource utilization while still providing high performance.

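As a sketch of how the first metric above, short-message latency, is commonly measured, this hypothetical ping-pong microbenchmark (the iteration count and output wording are arbitrary choices, not from any real benchmark suite) bounces a 1-byte message between ranks 0 and 1 and reports half of the average round-trip time:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 10000;
        char byte = 0;

        MPI_Barrier(MPI_COMM_WORLD);         /* start everyone together */
        double start = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - start;

        if (rank == 0)                       /* half the round trip approximates one-way latency */
            printf("approximate one-way latency: %g microseconds\n",
                   elapsed / (2.0 * iters) * 1.0e6);

        MPI_Finalize();
        return 0;
    }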

Open MPI

The first version of the MPI standard, MPI-1.0, was published in 1994 [Mes93]. MPI-2.0, a set of additions on top of MPI-1, was completed in 1996 [GGHL+96].

In the first decade after MPI-1 was published, a variety of MPI implementations sprung up. Many were provided by vendors for their proprietary network interconnects. Many other implementations arose from the research and academic communities. Such implementations were typically "research-quality," meaning that their purpose was to investigate various high-performance networking concepts and provide proofs-of-concept of their work. However, some were high enough quality that they gained popularity and a number of users.

Open MPI represents the union of four research/academic, open source MPI implementations. Three of them, LAM/MPI, LA/MPI (Los Alamos MPI), and FT-MPI (Fault-Tolerant MPI), merged to form the project; the members of the fourth, the PACX-MPI team, joined the Open MPI group shortly after its inception.

The members of these four development teams decided to collaborate when we had the collective realization that, aside from minor differences in optimizations and features, our software code bases were quite similar. Each of the four code bases had their own strengths and weaknesses, but on the whole, they more-or-less did the same things. So why compete? Why not pool our resources, work together, and make an even better MPI implementation?

After much discussion, the decision was made to abandon our four existing code bases and take only the best ideas from the prior projects. This decision was mainly predicated upon the following premises:

  • Even though many of the underlying algorithms and techniques were similar among the four code bases, they each had radically different implementation architectures, and would be incredibly difficult (if not impossible) to merge.
  • Each of the four also had its own (significant) strengths and (significant) weaknesses. Specifically, there were features and architecture decisions from each of the four that were desirable to carry forward. Likewise, there was poorly optimized and badly designed code in each of the four that was desirable to leave behind.
  • The members of the four developer groups had not worked directly together before. Starting with an entirely new code base (rather than advancing one of the existing code bases) put all developers on equal ground.

Thus, Open MPI was born. Its first Subversion commit was on November 22, 2003.

15.2. Architecture

For a variety of reasons (mostly related to either performance or portability), C and C++ were the only two possibilities for the primary implementation language. C++ was eventually discarded because different C++ compilers tend to lay out structs/classes in memory according to different optimization algorithms, leading to different on-the-wire network representations. C was therefore chosen as the primary implementation language, which influenced several architectural design decisions.

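The layout concern can be illustrated with a small, hypothetical C snippet (not Open MPI code): if a header structure were ever copied onto the wire as raw bytes, the compiler's alignment and padding decisions would become part of the network protocol, and C++ class layout varies even more across compilers and ABIs:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical message header, for illustration only. */
    struct msg_header {
        uint8_t  kind;         /* most ABIs insert 3 bytes of padding after this */
        uint32_t tag;
        uint64_t payload_len;  /* 8-byte alignment usually pushes this to offset 8 */
    };

    int main(void) {
        /* The size (and therefore any raw-bytes wire format) depends on the
         * compiler's alignment and packing rules; commonly 16 bytes here. */
        printf("msg_header occupies %zu bytes\n", sizeof(struct msg_header));
        return 0;
    }

Sticking to plain C structs keeps such layouts easier to reason about and, where necessary, to control explicitly.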

When Open MPI was started, we knew that it would be a large, complex code base:

  • In 2003, the current version of the MPI standard, MPI-2.0, defined over 300 API functions.
  • Each of the four prior projects was large in itself. For example, LAM/MPI had over 1,900 files of source code, comprising over 300,000 lines of code (including comments and blanks).
  • We wanted Open MPI to support more features, environments, and networks than all four prior projects put together.

We therefore spent a good deal of time designing an architecture that focused on three things:

  1. Grouping similar functionality together in distinct abstraction layers.
  2. Using run-time loadable plugins and run-time parameters to choose between multiple different implementations of the same behavior.
  3. Not allowing abstraction to get in the way of performance.

Abstraction Layer Architecture

Open MPI has three main abstraction layers, shown in Figure 15.1:

  • Open, Portable Access Layer (OPAL): OPAL is the bottom layer of Open MPI's abstractions. Its abstractions are focused on individual processes (versus parallel jobs). It provides utility and glue code such as generic linked lists, string manipulation, debugging controls, and other mundane—yet necessary—functionality.
    OPAL also provides Open MPI's core portability between different operating systems, such as discovering IP interfaces, sharing memory between processes on the same server, processor and memory affinity, high-precision timers, etc.

  • Open MPI Run-Time Environment (ORTE) (pronounced "or-tay"): An MPI implementation must provide not only the required message passing API, but also an accompanying run-time system to launch, monitor, and kill parallel jobs. In Open MPI's case, a parallel job is comprised of one or more processes that may span multiple operating system instances, and are bound together to act as a single, cohesive unit.
    In simple environments with little or no distributed computational support, ORTE uses rsh or ssh to launch the individual processes in parallel jobs. More advanced, HPC-dedicated environments typically have schedulers and resource managers for fairly sharing computational resources between many users. Such environments usually provide specialized APIs to launch and regulate processes on compute servers. ORTE supports a wide variety of such managed environments, such as (but not limited to): Torque/PBS Pro, SLURM, Oracle Grid Engine, and LSF.

  • Open MPI (OMPI): The MPI layer is the highest abstraction layer, and is the only one exposed to applications. The MPI API is implemented in this layer, as are all the message passing semantics defined by the MPI standard.
    Since portability is a primary requirement, the MPI layer supports a wide variety of network types and underlying protocols. Some networks are similar in their underlying characteristics and abstractions; some are not. (A minimal program that touches only this topmost layer is sketched after this list.)

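To tie the three layers together, here is a minimal, hypothetical MPI program. It calls only the MPI (OMPI) layer; ORTE launches and wires up the processes (for example via ssh or a resource manager), and OPAL supplies the portable operating-system glue underneath, yet neither of those layers appears in application code:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);                 /* the MPI API is the only layer visible here */

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes are in the job? */
        printf("rank %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }

The same source compiles and runs unchanged across the environments, resource managers, and network types described above.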
