Classification of Parallel Computers
graph TD PS[Parallel System] --> SIMD PS --> MIMD MIMD --> DM[Distributed Memory] DM --> MPP DM --> NOW DM --> Clusters MIMD --> SM[Shared Memory] SM --> UMA SM --> NUMA SM --> COMA NUMA --> ccNUMA NUMA --> ncNUMA
Classification of Memory
Distributed Memory
- Architecture
- Programming Model: Message passing.
- Types
- MPP Massive Parallel System
- Specialized system
- Unified Operating system
- Most Super computers run this
- NOW Network of Workstations
- Loosely connected group of individual computers with “standard” network
- Cluster
- Network of Computers with high performance network.
- Separate instance of operating system.
- MPP Massive Parallel System
Shared Memory
- Programming model: Shared variables.
- Simultaneous access to the same variable is not allowed.
- Types
- NUMA
- Each Processor has has a local memory which can be accessed by other processors.
- DIR → Directory. Stores information regarding the files in cache. Like a registry.
- During memory access the DIR is read to know if the file should be accessed from the cache or from memory.
- ccNUMA: Cache Coherent non-Uniform Memory Access
- nccNUMA: non-ccNUMA
- COMA: Cache only memory access
- TAG → It is an identifier for each Datum.
- Datum migrates to where it is needed, hence you need a TAG to identify the data.
- Here the memory is accessed through the cache only.
- This means that this if the same data is being used, the access operation is much faster.
- NUMA
Understanding the architecture is essential to writing efficient programs.
- Example:
- Caches are accessed in form of cache lines of fixed size - like a bus transporting people, it has the same energy is used to transport 1 person and 100 passengers.
- To avoid Cache misses, data should be processed continuously. (cache miss: it occurs when the program looks at the DIR and doesn’t find what’s required hence it searches for it in memory which is expensive)
- Code example
for (int i = 0; i < x; i++) { for (int j = 0; j < y; j++) { data1[i + x * j] = func(i, j); // Data 1 } } for (int i = 0; i < x; i++) { for (int j = 0; j < y; j++) { data2[j + y * i] = func(i, j); // Data 2 } }
- For
data1[i + x * j]
: The inner loop incrementsj
Asj
increases, we’re accessing memory locations that arex
positions apart This creates non-sequential memory access patterns These jumps in memory can cause cache misses - For
data2[j + y * i]
: The inner loop still incrementsj
Asj
increases, we’re accessing sequential memory locations (sincej
is the first term) This sequential access pattern is cache-friendly The inner loop works with “continuous j” data2
is faster because it follows a more cache-friendly memory access pattern where the inner loop variable (j
) corresponds to adjacent memory locations.
- For
Networks
- Distance between nodes
- Shortest path in # edges between two nodes. (#→ number of something )
- Degree
- Largest number of edges at a node.
- Diameter
- Largest distance between two nodes.
Examples
- Fully connected Network
- Linear network
- Ring network or 1D Torus
- dD Torus (d→ dimension)
- D-Dimensional hypercube
- Nodes
- Node connected with neighbours.
- indexed by bits.
- Diameter = =
- Degree = =
- Node index differ from neighbor’s index by 1 bit.
- Patch through the network changes bit at a time
- 000→001→101→111
- Distance between nodes = # Bits different in the index
- Hamming Distance: number of non-matching bits two words of same length.(How many bits have to be changed to go from one data to another data.)