Linux+: Hardware Part 02 - CPU

The main part of any system is the Central Processing Unit (CPU), sometimes called the processor or microprocessor. A processor carries out each instruction in stages. There are four main stages:

Fetch
Decode
Execute
Writeback

The first stage fetches the next instruction or set of instructions. The fetch is attempted from cache first, then from Random Access Memory (RAM). As stated in the previous article, Linux+: Hardware Part 01, the cache is faster than RAM and can increase performance.

The second stage is decoding. When instructions are decoded they are broken down into simpler operations which the CPU can handle. For example, incrementing a variable by 10 would require these steps:

Retrieve the variable from memory
Place the value of the variable into a register
Place the value 10 into another register
Add the two values together
Place the result into a register
Copy the result from the register into the memory space for the variable

NOTE: The list of steps produced by decoding an instruction can be more or less complicated depending on the processor.

The third stage executes the instructions as they have been decoded.

The fourth stage writes back any results from the instructions to the necessary location, such as a register, RAM or video memory.

In older processors one instruction was fetched, decoded, executed and written back before the next instruction was fetched. Newer processors use a pipeline: while one instruction is being fetched, another is being decoded, another is being executed and another is being written back. Every stage is busy at the same time, which gives a large performance boost.

A trick to improve the throughput of the CPU is to execute instructions out of order. Some decoded instructions are simple and fast to perform while others take more time. Instructions may also require the retrieval of data from RAM or the hard disk.
While one instruction is waiting, other instructions which do not depend on it can be completed and their results held until they are needed.

Another trick for boosting performance is branch prediction. Software code contains branches. Branches are where decisions are made, such as ‘if x>20 then do Function A otherwise do Function B’. The CPU pipelines should be kept full to keep the application fast, so the CPU does not wait for the decision. Instead, it predicts which branch will be taken and starts fetching, decoding and executing that branch ahead of time. When the decision is determined and the prediction was correct, the writeback can be performed at once, since the code has already been decoded and the results of the execution saved. If the prediction was wrong, the speculative work is discarded and the correct branch is fetched.

Superscalar Architecture

The Pentium processor introduced superscalar architecture to Intel's x86 line. This means that the processor has multiple execution components such as the integer multiplier, Floating Point Unit (FPU), Arithmetic Logic Unit (ALU), etc.

NOTE: Not every component is duplicated, so a superscalar CPU is not a multi-core CPU. A multi-core processor has two or more complete, but separate, processors in one unit. A multi-core CPU is seen by the system as if it had that many individual processors.

The architecture allows multiple instructions to be handled at once. It can be thought of as having multiple pipelines working in parallel.

MMX technology was also included in the P5 Architecture (Pentium). MMX is a set of instructions which speeds up the processing of graphics, audio and video.

NOTE: MMX has no official meaning; Intel uses the letters only as a name for the technology added to the P5 Architecture.

P6 Architecture

The P6 architecture first appeared in the Pentium Pro processor. To improve performance, the P6 pipeline is much deeper than the five-stage pipeline of the Pentium; it is usually described as having 12 to 14 stages, depending on how the stages are counted.

The P6 has a Dual Independent Bus (DIB), which consists of two buses. One bus connects the CPU to the RAM while the other connects the CPU to the L2 cache. Both buses can be accessed at once. In later P6 processors MMX was enhanced by adding new instructions.
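The benefit of pipelining and of superscalar designs, as described in the sections above, can be estimated with simple arithmetic. The sketch below is an idealized model only: it assumes the four stages named earlier, one clock cycle per stage, and no stalls or mispredicted branches. The function names are made up for this illustration.

```python
import math

# The four stages described in this article.
STAGES = ["fetch", "decode", "execute", "writeback"]


def cycles_sequential(n_instructions, n_stages=len(STAGES)):
    """Older design: each instruction finishes every stage before the next starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages=len(STAGES), n_pipelines=1):
    """Idealized pipeline with no stalls.

    The first group of instructions needs n_stages cycles to fill the
    pipeline; after that, every cycle completes one instruction per pipeline.
    """
    if n_instructions == 0:
        return 0
    groups = math.ceil(n_instructions / n_pipelines)
    return n_stages + groups - 1


if __name__ == "__main__":
    n = 100
    print("sequential:        ", cycles_sequential(n))                # 400
    print("one pipeline:      ", cycles_pipelined(n))                 # 103
    print("two pipelines:     ", cycles_pipelined(n, n_pipelines=2))  # 53
```

Real processors fall short of these numbers because of cache misses, dependencies between instructions and wrong branch predictions, which is why the tricks described above matter.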
Netburst (P68) Architecture

The Pentium 4 brought the Netburst Architecture, which includes Hyper Pipelined Technology, Rapid Execution Technology and the Execution Trace Cache.

The Hyper Pipeline has a depth of 20 stages. The deeper the pipeline, the higher the clock speed can be, but the lower the Instructions Per Cycle (IPC). This is due to a small number of longer pipelines, which can complete fewer instructions per cycle. If the pipelines were shorter, but there were more of them, more instructions could be completed per cycle.

A problem can arise when using branch prediction. Since the pipelines are so long, the work done on instructions for a branch which turns out not to be needed is wasted. To help with this problem, there are two Arithmetic Logic Units (ALUs) which are double pumped. Double pumped means that the ALUs operate at twice the CPU clock speed. The extra speed helps to make up for the low IPC. The double pumped ALUs are the Rapid Execution Technology.

The Execution Trace Cache is part of the L1 Cache of the CPU and holds instructions which have already been decoded. Before fetching and decoding a new instruction, the CPU checks the Trace Cache. If the instruction is there, it is fetched from the Trace Cache in its decoded form.

IA-64 Architecture (Itanium)

The Intel Itanium is the family of IA-64 processors. The 64-bit processor was meant for high-end servers. The Itanium, under good conditions, could execute six instructions per clock cycle.
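The trade-off between clock speed and IPC, and the effect of completing six instructions per cycle, can be compared with one formula: peak instructions per second is roughly IPC multiplied by clock frequency. The following sketch uses made-up IPC and clock values purely for illustration; they are assumptions and not measurements of any real Pentium 4 or Itanium.

```python
GHZ = 1_000_000_000  # cycles per second in one gigahertz


def instructions_per_second(ipc, clock_hz):
    """Peak throughput: instructions per cycle times cycles per second."""
    return ipc * clock_hz


# Hypothetical designs (illustrative numbers only, not real processors).
deep_pipeline = instructions_per_second(ipc=1, clock_hz=3 * GHZ)     # high clock, low IPC
shallow_pipeline = instructions_per_second(ipc=2, clock_hz=1_500_000_000)  # lower clock, higher IPC
wide_issue = instructions_per_second(ipc=6, clock_hz=1 * GHZ)        # six instructions per cycle

if __name__ == "__main__":
    print("deep pipeline:   ", deep_pipeline)
    print("shallow pipeline:", shallow_pipeline)
    print("wide issue:      ", wide_issue)
```

With these assumed values the deep and shallow designs reach the same peak throughput, which shows why a higher clock speed alone does not guarantee a faster processor.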
The later Itanium processors had the following units to improve execution performance:

Six ALUs – Arithmetic Logic Units, which handle general arithmetic and logic operations
Two integer units – perform arithmetic and logical operations on integers
One shift unit – shifts bits as needed
Four data caches – storage for data
Six multimedia units (MMX) – audio, video and graphic processing instructions
One parallel multiply unit – performs multiplication operations
One population counter – counts the 1 bits in a binary string (its digit sum)
Two parallel shift units – shift bits as needed in parallel
Two 82-bit floating point units – manage computations with fractional values
Two SIMD floating point multiply-accumulate units – multiply two values and add the result to an accumulator
Three branch units – make decisions based on branch commands such as if-then-else. Both sides of a branch decision can be executed in parallel

The IA-64 architecture also has Speculative Loading, which loads data into the registers before the data is needed.

AMD Processors

Advanced Micro Devices (AMD) is another company which makes processors. Its processors are comparable to Intel's, but AMD processors are usually less expensive. Based on the previous information about processors, the AMD processors compare as follows:

AMD             Intel
K6              Pentium
Athlon          Pentium II/III
Duron           Celeron
Athlon XP/MP    Pentium 4
Opteron         IA-64
Athlon 64       IA-64

NOTE: AMD's version of MMX is called 3DNow!.

Symmetric Multiprocessing (SMP)

A system with two or more processors which share the same memory, I/O devices and Interrupt Requests (IRQs) is called a symmetric multiprocessing system. In this way applications can be multithreaded. Multithreading is when an application is split into threads which can be executed on different processors simultaneously.

NOTE: For SMP all processors must be identical. System clocks and speeds must be the same or problems can occur with processing.
Any system with multiple processors must have an Operating System (OS) which supports using multiple processors. When this article was written, Linux supported up to 16 processors in a single system if the hardware could support it. The limit depends on the kernel version and how the kernel was built, and current kernels support far more.

SMP also applies to processors with multiple cores. Each core within the processor is considered an individual processor.
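Because each core is treated as an individual processor, a running Linux system lists one entry per logical processor in /proc/cpuinfo. The sketch below shows one way to count them in Python. It assumes the x86 layout of /proc/cpuinfo, where each entry has "processor", "physical id" and "core id" fields; other architectures use different fields. The function name and the SAMPLE text are made up for this illustration.

```python
import os


def summarize_cpuinfo(text):
    """Count logical processors, physical packages and cores.

    Assumes the x86 /proc/cpuinfo layout: one block per logical processor,
    with "physical id" appearing before "core id" inside each block.
    """
    logical = 0
    packages = set()
    cores = set()
    physical_id = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line between processor blocks
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
            physical_id = None
        elif key == "physical id":
            physical_id = value
            packages.add(value)
        elif key == "core id":
            # A core is identified by its package plus its id in that package.
            cores.add((physical_id, value))
    return {"logical": logical, "packages": len(packages), "cores": len(cores)}


# Hypothetical excerpt: one physical processor with two cores.
SAMPLE = """\
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 1
"""

if __name__ == "__main__":
    print("Logical processors reported by the OS:", os.cpu_count())
    try:
        with open("/proc/cpuinfo") as f:
            print(summarize_cpuinfo(f.read()))
    except OSError:
        # Not on Linux, or /proc is unavailable: fall back to the sample text.
        print(summarize_cpuinfo(SAMPLE))
```

On a dual-core system the logical count is 2 even though there is only one physical package, which matches the way SMP treats each core as a separate processor.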