Multiprocessing (parallel computer architecture)
- Reference: Computer Architecture (6th Edition)
Contents
- Multiprocessing
- SIMD (Data-Level Parallelism)
- MIMD (Thread-Level Parallelism)
- Cache Coherence and Coherence Protocol
Multiprocessing
Vocabulary: coprocessor
Classification of computer architecture
Flynn’s Taxonomy:
- Basic idea: a computer’s operation is the execution of an instruction stream and the processing of a data stream; architectures are classified by the multiplicity of their instruction and data streams
- Instruction stream: the sequence of instructions executed by the machine
- Data stream: the sequence of data called for by the instruction stream (including input data and intermediate results)
- Multiplicity: the maximum number of instructions or data items at the same stage of execution in the system’s bottleneck component (i.e., how many data items one instruction can operate on at once)
- SISD: single instruction stream, single data stream; how a uniprocessor works
- The model adopted by the von Neumann architecture
- SIMD (Data-Level Parallelism): a single instruction stream → runs on one machine, but can operate on multiple data items simultaneously
- Vector processors and array processors
- MISD: does not exist in practice
- MIMD (Thread/Process-Level Parallelism): multiple instruction streams → multiprogramming → multiple processors; each instruction stream processes its own data
- Multi-computers: clusters
- Multi-processors: multiple CPUs in one machine, or multiple cores (multi-core)
- Reason for multicores: physical limitations can cause significant heat dissipation and data synchronization problems
- e.g. the Intel Core 2 dual-core processor, with CPU-local Level 1 caches + a shared, on-die Level 2 cache (the cores share the L2 cache, which makes data synchronization easier to handle)
SISD v. SIMD v. MIMD
PU: processing unit; Instruction Pool: instruction cache; Data Pool: data cache
Challenges to Parallel Programming
The first challenge is the percentage of a program that is inherently sequential
- Addressed primarily via new algorithms that have better parallel performance
Example
- Suppose we want an 80X speedup from 100 processors. What fraction of the original program can be sequential?
- a. 10% b. 5% c. 1% d. <1% (by Amdahl’s Law, the answer is <1%)
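A worked derivation: let $F_{seq}$ be the sequential fraction; Amdahl’s Law for 100 processors gives

$$\text{Speedup} = \frac{1}{F_{seq} + \frac{1 - F_{seq}}{100}} = 80 \;\Rightarrow\; 100\,F_{seq} + (1 - F_{seq}) = \frac{100}{80} = 1.25 \;\Rightarrow\; F_{seq} = \frac{0.25}{99} \approx 0.25\%$$

so the sequential fraction must be well below 1% (answer d).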
The second challenge is the long latency of remote memory accesses
- Addressed both by the architect and by the programmer, e.g., by reducing the frequency of remote accesses via:
- Caching shared data (HW)
- Restructuring the data layout to make more accesses local (SW)
Example
- Suppose a 32-CPU MP, 2 GHz clock, 200 ns remote memory latency, all local accesses hit in the memory hierarchy, and a base CPI of 0.5. What is the performance impact if 0.2% of instructions involve a remote access?
- a. 1.5X b. 2.0X c. 2.6X (use the CPI equation; answer: 2.6X)
- Remote access cost = 200 ns × 2 GHz = 400 clock cycles
- CPI = Base CPI + Remote request rate × Remote request cost = 0.5 + 0.2% × 400 = 0.5 + 0.8 = 1.3
- The no-communication case (all references local) is 1.3/0.5 = 2.6 times faster than the case where 0.2% of instructions involve a remote access
SIMD (Data-Level Parallelism)
Vector Processor
Why Vector Processors?
- A single vector instruction specifies a great deal of work—it is equivalent to executing an entire loop. Because an entire loop is replaced by a vector instruction whose behavior is predetermined, control hazards that would normally arise from the loop branch are nonexistent.
- The computation of each result (for one element) in the vector is independent of the computation of other results in the same vector and so hardware does not have to check for data hazards within a vector instruction. Hardware need only check for data hazards between two vector instructions once per vector operand, not once for every element within the vectors.
- Vector instructions that access memory have a known access pattern (an array’s layout in memory is fixed, so the memory accesses are fixed too).
Basic Vector Architecture
vector-register processors
- In a vector-register processor, all vector operations—except load and store—are among the vector registers (the vector registers must therefore be large enough)
memory-memory vector processors
- In a memory-memory vector processor, all vector operations are memory to memory.
Vector Memory-Memory vs. Vector Register Machines
A “V” appended to an instruction mnemonic denotes a vector instruction
- Vector memory-memory architectures (VMMA) require greater main memory bandwidth, why?
- All operands must be read in and out of memory
- VMMAs make it difficult to overlap execution of multiple vector operations, why?
- Must check dependencies on memory addresses
- VMMAs incur greater startup latency
→ All major vector machines since the Cray-1 have had vector register architectures (we ignore vector memory-memory from now on)
- Scalar code was faster on the CDC Star-100 (memory-memory) for vectors < 100 elements
- For the Cray-1 (vector register), the vector/scalar break-even point was around 2 elements
Vector Supercomputers
Epitomized by Cray-1, 1976: Scalar Unit + Vector Extensions
- Load/Store Architecture, Vector Registers, Vector Instructions, Hardwired Control, Interleaved Memory System, Highly Pipelined Functional Units (the per-element computations of a vector instruction are not actually done in parallel; they flow through a fast pipeline, fast because no dependence checking is needed within a vector)
Vector Programming Model
Stride: needed, for example, when fetching a two-dimensional array by column
Multimedia Extensions (aka SIMD extensions)
What current CPUs integrate are generally multimedia extensions, which are similar to vector operations
- Very short vectors added to existing ISAs for microprocessors. Use existing 64-bit registers split into 2×32b, 4×16b, or 8×8b (newer designs have wider registers; see the SSE sketch after this list)
- Single instruction operates on all elements within register
Multimedia Extensions versus Vectors
- Limited instruction set: no vector length control, no strided load/store or scatter/gather, unit-stride loads must be aligned to 64/128-bit boundary
- Limited vector register length: requires superscalar dispatch to keep multiply/add/load units busy, loop unrolling to hide latencies increases register pressure
- Trend towards fuller vector support in microprocessors: Better support for misaligned memory accesses; Support of double-precision (64-bit floating-point); New Intel AVX spec (announced April 2008), 256b vector registers (expandable up to 1024b)
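For concreteness, a minimal sketch of this subword parallelism using x86 SSE intrinsics (SSE and its 128-bit registers are one concrete instance of such an extension; the intrinsics are not discussed in the original notes):

```c
/* 4 x 32-bit single-precision adds with one SSE instruction.
   A minimal sketch of the "split register" idea; assumes an
   x86 CPU with SSE and a compiler providing <xmmintrin.h>. */
#include <xmmintrin.h>
#include <stdio.h>

int main(void) {
    float x[4] = {1, 2, 3, 4}, y[4] = {10, 20, 30, 40}, z[4];
    __m128 a = _mm_loadu_ps(x);      /* unaligned load of 4 floats      */
    __m128 b = _mm_loadu_ps(y);
    __m128 c = _mm_add_ps(a, b);     /* one instruction, 4 element adds */
    _mm_storeu_ps(z, c);
    printf("%.0f %.0f %.0f %.0f\n", z[0], z[1], z[2], z[3]);
    return 0;
}
```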
The basic structure of a vector-register architecture
VMIPS
Primary Components of VMIPS
- Vector registers — VMIPS has eight vector registers, and each holds 64 elements. Each vector register must have at least two read ports and one write port.
- Vector functional units — Each unit is fully pipelined and can start a new operation on every clock cycle.
- In VMIPS, vector operations use the same names as MIPS operations, but with the letter “V” appended.
- Vector load-store unit —The VMIPS vector loads and stores are fully pipelined, so that words can be moved between the vector registers and memory with a bandwidth of 1 word per clock cycle, after an initial latency.
- A set of scalar registers —Scalar registers can also provide data as input to the vector functional units, as well as compute addresses to pass to the vector load-store unit.
Vector Code Example
VLR: vector length register
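The code figure from the original post is not reproduced here; the textbook’s canonical example is the DAXPY loop, Y = a·X + Y. A sketch of the scalar C version, with the corresponding VMIPS vector instructions noted in a comment (assuming n equals the 64-element vector register length; VLR setup is omitted):

```c
/* DAXPY (Y = a*X + Y): the canonical vector-code example.
   Scalar C version for reference. */
void daxpy(long n, double a, double *x, double *y) {
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
/* VMIPS equivalent for n = 64 (scalar a already in F0):
     LV      V1,Rx        ; load vector X
     MULVS.D V2,V1,F0     ; vector-scalar multiply a*X
     LV      V3,Ry        ; load vector Y
     ADDV.D  V4,V2,V3     ; vector add
     SV      Ry,V4        ; store result to Y            */
```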
Automatic Code Vectorization
- Scalar Sequential Code:
- Vectorized Code: vectorization is a massive compile-time reordering of operation sequencing → requires extensive loop dependence analysis
Vector Arithmetic Execution
- Use deep pipeline (=> fast clock) to execute element operations. Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)
Vector Stripmining
Strip mining: processing a long vector in segments
- Problem: Vector registers have finite length
- The solution is to create a vector-length register (VLR), which controls the length of any vector operation. The value in the VLR, however, cannot be greater than the length of the vector registers— maximum vector length (MVL).
- If the vector is longer than the maximum length, stripmining is used. (Break loops into pieces that fit into vector registers)
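A sketch of strip mining in C (MVL = 64 as in VMIPS; the loop structure mirrors the textbook’s, with an odd-sized first piece followed by full-length pieces):

```c
/* Strip-mined DAXPY: process an arbitrary-length vector in
   chunks of at most MVL elements. */
#define MVL 64

void daxpy_stripmined(long n, double a, double *x, double *y) {
    long low = 0;
    long vl = n % MVL;              /* odd-size first piece, may be 0  */
    if (vl == 0) vl = MVL;
    while (low < n) {
        /* on a vector machine: set VLR = vl, then one vector sequence */
        for (long i = low; i < low + vl; i++)
            y[i] = a * x[i] + y[i];
        low += vl;
        vl = MVL;                   /* all remaining pieces are full   */
    }
}
```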
Vector Stride
- In matrix multiply, we could vectorize the multiplication of each row of B with each column of C. When an array is allocated memory, it is linearized and must be laid out in either row-major or column-major order. This linearization means that either the elements in the row or the elements in the column are not adjacent in memory (one of the two is always non-contiguous).
- This distance separating elements that are to be gathered into a single register is called the stride.
- The vector stride, like the vector starting address, can be put in a general-purpose register. Then the VMIPS instruction LVWS (load vector with stride) can be used to fetch the vector into a vector register. Likewise, when a nonunit stride vector is being stored, SVWS (store vector with stride) can be used.
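A small C illustration of where a non-unit stride comes from (hypothetical helper; assumes row-major layout):

```c
/* Stride example: summing one column of a row-major matrix.
   Consecutive column elements are ncols doubles apart, so a
   vector load of a column needs stride = ncols (in elements). */
double column_sum(const double *m, long nrows, long ncols, long col) {
    double s = 0.0;
    for (long i = 0; i < nrows; i++)
        s += m[i * ncols + col];   /* address advances by ncols each step */
    return s;
}
/* On VMIPS this column could be fetched with LVWS (load vector
   with stride), the stride held in a general-purpose register. */
```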
Vector Chaining
- Vector version of register bypassing: the concept of forwarding extended to vector registers, making a sequence of dependent vector operations run faster (because the element operations within a vector instruction are pipelined, as soon as one element of the first instruction’s result vector is produced, it can be forwarded into the next vector instruction)
Multiple Lanes
- Increase the peak performance of a vector machine by adding more parallel execution units (lanes)
Array Processor (SIMD)
Basic idea:
- A single control unit (not multiple cores) provides the signals to drive many Processing Elements, which run in lockstep (the PEs execute the same instruction on different data → SIMD)
- PE organization: each functional unit is controlled by the control unit, and the control unit is driven by the instruction/control bus
GPU
GPU vs. CPU architecture
- The GPU Devotes More Transistors to Data Processing.
- CPU: more resources devoted to caches and control
Foundations of CUDA’s high-speed computation
CUDA (Compute Unified Device Architecture) → a programming interface
- Computing coherence
- Single-program, multiple-data (SPMD) execution model (not exactly SIMD): each core executes a piece of the program as a tiny thread dispatched onto an ALU, relying on thousands of cores for parallel computation; also called SIMT (single instruction, multiple threads), sketched in the kernel below
- Massive parallel computing resources: thousands of cores so far; thousands of threads on the fly
- Hiding memory latency: raise the compute-to-communication ratio, coalesce memory accesses to adjacent addresses, switch threads quickly
- Thread switch: about 1 cycle on a GPU vs. ~1000 cycles on a CPU
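A minimal CUDA sketch of the SPMD/SIMT model: thousands of threads run the same kernel body, each on its own element (the SAXPY kernel and launch configuration here are illustrative, not from the original notes):

```cuda
#include <cstdio>

// SPMD/SIMT: every thread runs this same kernel, each computing
// one element identified by its block and thread indices.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // guard: n need not fill the grid
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));   // unified memory for brevity
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);  // grid of 256-thread blocks
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);                // expect 4.0
    cudaFree(x); cudaFree(y);
    return 0;
}
```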
Well-suited applications
- A GPU pays off only on highly data-parallel tasks: large volumes of data stored in something like a regular grid, with essentially the same processing applied to every element
- e.g., image processing, physical simulation (such as computational fluid dynamics), engineering and financial modeling and analysis, search, sorting
Poorly suited applications (these require redesigned algorithms and data structures, or batching)
- Computations over complex data structures such as trees, correlation matrices, linked lists, or spatial subdivision structures are not a good fit for the GPU
- Programs dominated by serial or transactional processing
- Applications with very little parallelism, e.g., only a few parallel threads
- Programs requiring millisecond-scale real-time response
MIMD (Thread-Level Parallelism)
- “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
- Parallel Architecture = Computer Architecture + Communication Architecture
Communication Models
Shared Memory Model
Multi-processors: based on shared memory
- Communication occurs through a shared virtual address space (via loads and stores) ⇒ low overhead for communication
- A single address space does not mean there is physically only one memory: the shared address space can be implemented with one physically shared memory (centralized) or with distributed memories supported by hardware and software (distributed)
Shared-memory multiprocessors are either:
- UMA (Uniform Memory Access time) for shared address, centralized memory MP
- NUMA (Non Uniform Memory Access time multiprocessor) for shared address, distributed memory MP
Message Passing
Multi-computers: based on message passing
- Communication occurs by explicitly passing messages among the processors. (distributed memory system)
- Each processor has its own private local memory, which only that processor can access; other processors cannot access it directly
- The address space can consist of multiple private address spaces. Each processor-memory module is essentially a separate computer
MIMD Memory Architecture: Centralized (SMP) vs. Distributed
2 classes of multiprocessors with respect to memory:
- (1) Centralized Memory Multiprocessor (also called symmetric multiprocessors (SMPs), because the single main memory has a symmetric relationship to all processors; symmetric with respect to memory: every processor accesses memory in exactly the same way)
- A few dozen processor chips and cores
- Small enough to share a single, centralized memory (if the memory were too large, the hardware would become too complex to manage) ⇒ needs larger caches
- Can scale to a few dozen processors by using a switch and by using many memory banks. Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing centralized memory increases
- (2) Physically Distributed-Memory multiprocessor
- Larger number of chips and cores
- Bandwidth demands are met by distributing memory among the processors
- The interconnection network can be Ethernet, optical fiber, or another high-speed network
The Flynn-Johnson classification of computer systems
Horizontal axis: memory architecture; vertical axis: communication model
(Exam point: e.g., explain why the classification is drawn this way)
Typical parallel computer architectures
Symmetric Multiprocessor (SMP)
- Centralized Memory Multiprocessors are also called SMPs because single main memory has a symmetric relationship to all processors
Cluster of Workstations (COW)
- A computer cluster is a group of coupled computers that work together closely so that in many respects they can be viewed as though they are a single computer.
- The components of a cluster are commonly, but not always, connected to each other through fast local area networks.
- MPI is a widely available communications library that enables parallel programs to be written in C, Python…
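A minimal MPI sketch in C (illustrative; assumes an MPI installation such as MPICH or Open MPI, compiled with mpicc and launched with mpirun):

```c
/* Minimal MPI program: each process contributes its rank; rank 0
   gathers the sum via MPI_Reduce. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total process count */

    int local = rank, total = 0;
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d processes, sum of ranks = %d\n", size, total);

    MPI_Finalize();
    return 0;
}
```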
Cluster categorizations
- High-availability (HA) clusters
- operate by having redundant nodes, which are then used to provide service when system components fail
- e.g., failover clusters
- Load-balancing clusters
- operate by distributing a workload evenly over multiple back-end nodes
- Grid/Cloud computing
- grid clusters, a technology closely related to cluster computing. The key differences are that grids connect collections of computers which do not fully trust each other, or which are geographically dispersed. (more like a computing utility than like a single computer)
- support more heterogeneous collections
Massively Parallel Processor (MPP)
- An MPP is a large-scale parallel computer system built from hundreds or thousands of processors. It is used mainly in computation-dominated settings such as scientific computing and engineering simulation, and nowadays also widely in business and network applications
- Hard to develop for and expensive; a symbol of a nation’s overall strength (supercomputers)
- Uses a high-performance, proprietary interconnection network that can deliver messages with low latency and high bandwidth (the biggest difference from a COW)
Cluster vs. MPP
Architectural differences
- (1) Cluster nodes are more complete computers, homogeneous or heterogeneous. Every node has its own disk and a complete resident operating system, and generally has some autonomy: it can still run when detached from the cluster. MPP nodes generally have no disk and host only an operating-system kernel
- (2) An MPP uses the manufacturer’s proprietary (or patented) high-speed communication network, whose interface attaches to the node’s memory bus (tightly coupled); a cluster generally uses commodity, off-the-shelf high-speed LANs or system-area networks, usually attached to the node’s I/O bus (loosely coupled)
Interconnection Networks
Processor-to-Memory Interconnection Networks
- Interconnection network: a network built from switching elements according to a given topology and control scheme, connecting the nodes of a computer system to one another
- Interconnection function: the permutation between the network’s inputs (processors) and outputs (memory banks)
Classification of interconnection networks
- Static networks: fixed connection paths between nodes that do not change while a program executes
- Dynamic networks: built from switching elements whose connection state can change dynamically as the application requires (suited to large interconnection networks); the main dynamic networks are buses, crossbar switches, and multistage switching networks
Connecting Multiple Computers
- Shared Media vs. Switched (“point-to-point”)
Switching scheme
- Circuit switching
- Packet switching
Multistage Network
- To build a large network, crossbar switches can be cascaded into a multistage interconnection network. By controlling the individual crossbar units, such a network can realize arbitrary connections between inputs and outputs, so information on any input can be delivered to any output
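As a concrete instance of an interconnection function, here is the perfect shuffle, which underlies the Omega multistage network (this specific network is an illustrative choice; the notes above do not name one):

```c
/* Interconnection function example: the perfect shuffle on 2^k
   nodes, sigma(x) = rotate-left of x's k-bit address. Cascading
   k stages of shuffle + 2x2 exchange switches yields an Omega
   multistage network. */
#include <stdio.h>

unsigned shuffle(unsigned x, unsigned k) {
    unsigned msb = (x >> (k - 1)) & 1u;          /* top address bit  */
    return ((x << 1) | msb) & ((1u << k) - 1u);  /* rotate left by 1 */
}

int main(void) {
    const unsigned k = 3;                        /* 8 inputs/outputs */
    for (unsigned x = 0; x < (1u << k); x++)
        printf("input %u -> output %u\n", x, shuffle(x, k));
    return 0;
}
```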
Cache Coherence and Coherence Protocol
Multiprocessor cache coherence (this problem can only arise in shared-memory models)
- Symmetric shared-memory machines usually support the caching of both shared and private data.
- Private data are used by a single processor
- When a private item is cached, its location is migrated to the cache, reducing the average access time as well as the memory bandwidth required. Because no other processor uses the data, the program behavior is identical to that in a uniprocessor.
- Shared data are used by multiple processors, essentially providing communication among the processors through reads and writes of the shared data
- When shared data are cached, the shared value may be replicated in multiple caches. In addition to the reduction in access latency and required memory bandwidth, this replication also provides a reduction in contention that may exist for shared data items that are being read by multiple processors simultaneously.
- Caching of shared data, however, introduces a new problem: cache coherence.
What Is Multiprocessor Cache Coherence?
Cache Coherence problem
- Because the view of memory held by two different processors is through their individual caches, the processors could end up seeing different values for the same memory location
Notice that the coherence problem exists because we have both a global state, defined primarily by the main memory, and a local state, defined by the individual caches, which are private to each processor core. Thus, in a multi-core where some level of caching may be shared (e.g., an L3), although some levels are private (e.g., L1 and L2), the coherence problem still exists and must be solved.
Coherent Memory Model
- Definition: Reading an address should return the last value written to that address
- This simple definition contains two different aspects:
- (1) Coherence defines what values can be returned by a read
- (2) Consistency determines when a written value will be returned by a read
- Coherence and consistency are complementary: Coherence defines the behavior of reads and writes to the same memory location, while consistency defines the behavior of reads and writes with respect to accesses to other memory locations.
Coherent Memory System
- Preserve program order: if a processor P writes X and then reads X, with no other processor writing X in between, the read always returns the value that was written
- Coherent view of memory: if one processor writes X and another processor then reads X, with no other writes in between, the read returns the written value (if a processor could continuously read an old data value, we would clearly say that memory was incoherent)
- Write serialization: writes to the same location are serialized, so any two writes to that location by any two processors are seen in the same order by all processors
- For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1.
Memory Consistency Model
- Although the three properties just described are sufficient to ensure coherence, the question of when a written value will be seen is also important.
- To see why, observe that we cannot require that a read of X instantaneously see the value written for X by some other processor.
- The issue of exactly when a written value must be seen by a reader is defined by a memory consistency model
The memory consistency model assumed here:
- (1) A write does not complete (and allow the next write to occur) until all processors have seen the effect of that write
- (2) The processor does not change the order of any write with respect to any other memory access
- if a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A (illustrated in the sketch below)
- These restrictions allow the processor to reorder reads, but force the processor to finish writes in program order
We will rely on this assumption until we reach Section 5.6, where we will see exactly the implications of this definition, as well as the alternatives.
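The classic illustration of property (2) is the flag/data handoff below, sketched with C11 atomics (the variable names and the use of acquire/release atomics are illustrative; the atomics make the required write ordering explicit on hardware that might otherwise reorder):

```c
/* Why write order matters: P1 publishes data, then sets a flag;
   P2 spins on the flag, then reads the data. This works only if
   P2 seeing the new value of `ready` implies it also sees the new
   value of `data`, exactly property (2) above. */
#include <stdatomic.h>

int data = 0;
atomic_int ready = 0;

void producer(void) {               /* runs on processor P1 */
    data = 42;                                              /* write A */
    atomic_store_explicit(&ready, 1, memory_order_release); /* write B */
}

int consumer(void) {                /* runs on processor P2 */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                           /* spin until write B is visible */
    return data;                    /* guaranteed to see 42 */
}
```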
Basic Schemes for Enforcing Coherence
- In a coherent multiprocessor, the caches provide both migration and replication of shared data items.
- Migration – data can be moved to a local cache and used there in a transparent fashion ⇒ reduces both the latency and the bandwidth demand of accessing remotely allocated shared data
- Replication – for shared data that are being simultaneously read, the caches make a copy of the data item in the local cache ⇒ reduces not only access latency but also contention for shared data items read by multiple processors
Cache Coherence Protocols (HW)
- Key to implementing a cache coherence protocol is tracking the state of any sharing of a data block. The state of any cache block is kept using status bits associated with the block, similar to the valid and dirty bits kept in a uniprocessor cache
- (1) Directory based — the sharing status of a block of physical memory is kept in just one location, the directory (which can become a bottleneck)
- (2) Snooping — Every cache with a copy of data also has a copy of sharing status of block, but no centralized state is kept
- All caches are accessible via some broadcast medium (a bus or switch)
- All cache controllers monitor or snoop on the medium to determine whether or not they have a copy of a block that is requested on a bus or switch access
Snooping Coherence Protocols
Write Invalidate, Write Update
- The cache controller “snoops” all transactions on the shared medium (bus or switch) ⇒ if a transaction is relevant (the cache contains that block), it takes action to ensure coherence
- (1) Write Invalidate (exclusive access): exclusive access ensures that no other readable or writable copies of an item exist when the write occurs: all other cached copies of the item are invalidated
- (2) Write Update: update all the cached copies of a data item when that item is written ⇒ uses more broadcast-medium bandwidth ⇒ all recent MPUs use write invalidate
The discussion below focuses on write invalidate.
Basic Implementation Techniques (Write Invalidate)
Broadcast Medium Transactions (e.g., bus)
- To perform an invalidate, the processor acquires bus access and broadcasts the address to be invalidated on the bus. All processors snoop on the bus, watching the addresses. The processors check whether the address on the bus is in their cache. If so, the corresponding data are invalidated.
- The bus also enforces write serialization: if two processors attempt to write shared blocks at the same time, their attempts to broadcast an invalidate operation will be serialized when they arbitrate for the bus.
- One implication of this scheme is that a write to a shared data item cannot actually complete until it obtains bus access.
Locate up-to-date copy of data
Write-through cache
- All written data are always sent to the memory ⇒ the up-to-date copy can be obtained from memory
- Write-through is simpler, given enough memory bandwidth
Write-back cache
- Can use same snooping mechanism:
- supply value: Snoop every address placed on the bus. If a processor has dirty (newest) copy of requested cache block, it provides it in response to a read request and aborts the memory access
- Write-back needs lower memory bandwidth ⇒ supports larger numbers of faster processors ⇒ most multiprocessors use write-back
An Example Protocol (Snoopy, Invalidate)
- Snooping coherence protocol is usually implemented by incorporating a finite-state controller in each node
Write-through Cache Protocol
- 2 states per block in each cache (Valid / Invalid)
Write Back Cache Protocol
- Each cache block is in one state:
- Shared: the block can be read
- Exclusive: the cache has the only copy; it is writable and dirty
- Invalid: the block contains no data (as in a uniprocessor cache)
Write-Back State Machine - CPU
- State machine for CPU requests for each cache block
Writes to clean blocks are treated as misses (in every case an invalidate signal is broadcast on the bus)
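A hedged sketch of this three-state write-back invalidate controller as a per-block transition function in C (state and event names are illustrative; bus actions are only noted in comments, and a real controller also issues write-backs and bus transactions):

```c
typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;

typedef enum {
    CPU_READ,        /* read request from this block's own CPU   */
    CPU_WRITE,       /* write request from this block's own CPU  */
    BUS_READ_MISS,   /* another CPU's read miss seen on the bus  */
    BUS_WRITE_MISS   /* another CPU's write miss / invalidate    */
} Event;

/* Next-state function for one cache block. */
BlockState next_state(BlockState s, Event e) {
    switch (e) {
    case CPU_READ:         /* if INVALID: place read miss on bus       */
        return (s == INVALID) ? SHARED : s;
    case CPU_WRITE:        /* place write miss/invalidate on the bus;  */
        return EXCLUSIVE;  /* a write to a clean block is also a miss  */
    case BUS_READ_MISS:    /* if we hold it EXCLUSIVE (dirty), write   */
        return (s == EXCLUSIVE) ? SHARED : s; /* it back, then share   */
    case BUS_WRITE_MISS:   /* someone else writes: invalidate our copy */
        return INVALID;
    }
    return s;
}
```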
Write-Back State Machine - Bus Requests
- State machine for bus requests for each cache block
Write-Back State Machine - III