4P_Bulldozer
Here are a LOT of details from the Published Manual:
AMD Family 15h version 1 _AND_ version 2:
Page 20: AMD Family 15h processors have multiple compute units, each containing its own L2 cache and two cores. The cores share their compute unit's L2 cache. Each core incorporates the complete x86 instruction set logic and L1 data cache. Compute units share the processor's L3 cache and Northbridge (see Chapter 2, Microarchitecture of AMD Family 15h Processors).
Page 23: AMD Instruction Set Enhancements - The AMD Family 15h processor has been enhanced with the following new instructions:
• XOP and AVX support—Extended Advanced Vector Extensions provide enhanced instruction encodings and non-destructive operands with an extended set of 128-bit (XMM) and 256-bit (YMM) media registers
• FMA instructions—support for floating-point fused multiply accumulate instructions
• Fractional extract instructions—extract the fractional portion of vector and scalar single-precision and double-precision floating-point operands
• Support for new vector conditional move instructions.
• VPERMILx instructions—allow selective permutation of packed double- and single-precision floating point operands
• VPHADDx/VPSUBx—support for packed horizontal add and subtract instructions
• Support for packed multiply, add and accumulate instructions
• Support for new vector shift and rotate instructions
Page 23: AMD Family 15h processors add support for 128-bit floating-point execution units. As a result, the throughput of both single-precision and double-precision floating-point SIMD vector operations has improved by 2X over the previous generation of AMD processors.
Page 25: Instruction Fetching Improvements - While previous AMD64 processors had a single 32-byte fetch window, AMD Family 15h processors have two 32-byte fetch windows, from which four μops can be selected. These fetch windows, when combined with the 128-bit floating-point execution unit, allow the processor to sustain a fetch/dispatch/retire sequence of four instructions per cycle.
Page 26: Several integer and floating-point instructions have improved latencies and decode types on AMD Family 15h processors.
Page 26: Current AMD Family 15h processors support two SIMD logical/shuffle units, one in the FMUL pipe and another in the FADD pipe, while previous AMD64 processors have only one SIMD logical/shuffle unit in the FMUL pipe. As a result, the SIMD shuffle instructions can be processed at twice the previous bandwidth on AMD Family 15h processors. Furthermore, the PSHUFD and SHUFPx shuffle instructions are now DirectPath instructions instead of VectorPath instructions on AMD Family 15h processors and take advantage of the 128-bit floating point execution units. Hence, these instructions get a further 2X boost in bandwidth, resulting in an overall improvement of 4X in bandwidth compared to the previous generation of AMD processors.
Page 26: Notable Performance Improvements - Several enhancements to the AMD64 architecture have resulted in significant performance improvements in AMD Family 15h processors, including:
• Improved performance of shuffle instructions
• Improved data transfer between floating-point registers and general purpose registers
• Improved floating-point register to floating-point register moves
• Optimization of repeated move instructions
• More efficient PUSH/POP stack operations
• 1-Gbyte paging
Page 30: Key Microarchitecture Features - AMD Family 15h processors include many features designed to improve software performance. The internal design, or microarchitecture, of these processors provides the following key features:
• Integrated DDR3 memory controller with memory prefetcher
• 64-Kbyte L1 instruction cache and 16-Kbyte L1 data cache
• Shared L2 cache between cores of compute unit
• Shared L3 cache between compute units on chip (for supported platforms)
• 32-byte instruction fetch
• Instruction predecode and branch prediction during cache-line fills
• Decoupled prediction and instruction fetch pipelines
• Four-way AMD64 instruction decoding (This is a theoretical limit.)
• Dynamic scheduling and speculative execution
• Two-way integer execution
• Two-way address generation
• Two-way 128-bit wide floating-point execution
• Legacy single-instruction multiple-data (SIMD) instruction extensions, as well as support for XOP, FMA4, VPERMILx, and Advanced Vector Extensions (AVX).
• Superforwarding
• Prefetch into L2 or L1 data cache
• Deep out-of-order integer and floating-point execution
• HyperTransport™ technology
Page 30: Microarchitecture of AMD Family 15h Processors - AMD Family 15h processors implement the AMD64 instruction set by means of macro-ops (the primary units of work managed by the processor) and micro-ops (the primitive operations executed in the processor's execution units). These are simple fixed-length operations designed to include direct support for AMD64 instructions and adhere to the high-performance principles of fixed-length encoding, regularized instruction fields, and a large register set. This enhanced microarchitecture enables higher processor core performance and promotes straightforward extensibility for future designs.
Page 31: Superscalar Processor - The AMD Family 15h processors are aggressive, out-of-order, four-way superscalar AMD64 processors. They can theoretically fetch, decode, and issue up to four AMD64 instructions per cycle using decoupled fetch and branch prediction units and three independent instruction schedulers, consisting of two integer schedulers and one floating-point scheduler. These processors can fetch 32 bytes per cycle and can scan two 16-byte instruction windows for up to four micro-ops, which can be dispatched together in a single cycle. However, this is a theoretical limit.
Page 33:
L1 Instruction Cache - The out-of-order execution engine of AMD Family 15h processors contains a 64-Kbyte, 2-way set-associative L1 instruction cache. Each line in this cache is 64 bytes long. However, only 32 bytes are fetched in every cycle.
L1 Data Cache - The AMD Family 15h processor contains a 16-Kbyte, 4-way predicted L1 data cache with two 128-bit ports. This is a write-through cache that supports up to two 128-bit loads per cycle.
L2 Cache - The AMD Family 15h processor has one shared L2 cache per compute unit. This full-speed on-die L2 cache is mostly inclusive relative to the L1 cache. The L2 is a write-through cache.
L3 Cache - The AMD Family 15h processor supports a maximum of 8MB of L3 cache per die, distributed among four L3 sub-caches which can each be up to 2MB in size.
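To put the cache figures above into perspective, here is a small sketch that works out the number of sets from size, associativity and line size. The 64-byte line size for the L1 data cache is my assumption (the quote only states it for the instruction cache), and the helper name cache_sets is made up for illustration.

```python
def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("cache size is not a multiple of ways * line size")
    return sets


KIB = 1024
MIB = 1024 * KIB

# L1 instruction cache: 64 Kbyte, 2-way, 64-byte lines (all from the quote).
l1i_sets = cache_sets(64 * KIB, ways=2, line_bytes=64)

# L1 data cache: 16 Kbyte, 4-way; the 64-byte line size is an ASSUMPTION.
l1d_sets = cache_sets(16 * KIB, ways=4, line_bytes=64)

# L3: four sub-caches of up to 2 MB each gives the quoted per-die maximum.
l3_max_bytes = 4 * 2 * MIB

print("L1I sets:", l1i_sets)
print("L1D sets (assuming 64-byte lines):", l1d_sets)
print("L3 maximum in MB:", l3_max_bytes // MIB)
```

With only 2 ways in the instruction cache, two hot code regions that map to the same set can evict each other, which is one reason code layout can matter on this design.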
Page 35: The scheduling for integer operations is fully data-dependency driven; proceeding out-of-order based on the validity of source operands and the availability of execution resources. Since the Bulldozer core implements a floating point co-processor model of operation, most scheduling and execution decisions of floating-point operations are handled by the floating point unit.
Page 37: Floating-Point Unit - The AMD Family 15h processor floating point unit (FPU) was designed to provide four times the raw FADD and FMUL bandwidth as the original AMD Opteron and Athlon 64 processors. It achieves this by means of two 128-bit fused multiply-accumulate (FMAC) units which are supported by a 128-bit high-bandwidth load-store system. The FPU is a coprocessor model that is shared between the two cores of one AMD Family 15h compute unit. As such it contains its own scheduler, register files and renamers and does not share them with the integer units. This decoupling provides optimal performance of both the integer units and the FPU. In addition to the two FMACs, the FPU also contains two 128-bit integer units which perform arithmetic and logical operations on AVX, MMX and SSE packed integer data.
Page 39:
Write Combining - AMD Family 15h processors provide four write-combining data buffers that allow four simultaneous streams. ... A Write Coalescing Cache (WCC) has been incorporated into the AMD family 15h microarchitecture. The WCC is 4 KB in size and is 4-way set associative. Stores to cacheable memory and, thus, to the L2 cache are coalesced in this cache.
Integrated Memory Controller - AMD Family 15h processors provide integrated low-latency, high-bandwidth DDR3 memory controllers. The memory controller supports:
• DRAM chips that are 4, 8, and 16 bits wide within a DIMM.
• Interleaving memory within DIMMs.
• ECC checking with single symbol correcting and double symbol detecting.
• Dual-independent 64-bit channel operation.
• Optimized scheduling algorithms and access pattern predictors to improve latency and achieved bandwidth, particularly for interleaved streams of read and write DRAM accesses.
• A data prefetcher.
Page 40: HyperTransport3 increases the aggregate link bandwidth to a maximum of 25.6 Gbyte/s (16-bit link).
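The 25.6 Gbyte/s figure can be reproduced with simple arithmetic. This is only a sketch: the 3.2 GHz link clock and the double-data-rate signalling are my assumptions about the fastest HyperTransport3 link speed, they are not stated in the quote, and the function name is made up.

```python
def ht_aggregate_gbytes_per_s(link_clock_ghz, link_width_bits,
                              transfers_per_clock=2, directions=2):
    """Aggregate bandwidth of a bidirectional link in Gbyte/s."""
    gtransfers_per_s = link_clock_ghz * transfers_per_clock  # DDR signalling
    per_direction = gtransfers_per_s * link_width_bits / 8   # bits -> bytes
    return per_direction * directions


# ASSUMPTION: 3.2 GHz link clock; the 16-bit width is from the quote.
full_link = ht_aggregate_gbytes_per_s(3.2, 16)
print("16-bit link:", full_link, "Gbyte/s aggregate")

# Same assumptions applied to an 8-bit link.
half_link = ht_aggregate_gbytes_per_s(3.2, 8)
print("8-bit link:", half_link, "Gbyte/s aggregate")
```

"Aggregate" here means both directions added together, so each direction on its own carries half of the quoted number.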
Page 41: Multisocket-capable AMD family 15h processors incorporate HyperTransport assist technology (also referred to in some documents as probe filtering).
Page 108: Use streaming instructions instead of PREFETCHW in situations where all of the following conditions are true:
• The code will overwrite one or more complete cache lines with new data.
• The new data will not be used again soon.
Streaming instructions include the non-temporal stores MOVNTDQ, MOVNTI, MOVNTPS, MOVNTPD, MOVNTSD, MOVNTSS and the MMX instruction MOVNTQ.
Page 109: The following performance caveats apply when using streaming stores on AMD Family 15h cores.
• When writing out a single stream of data sequentially, performance of AMD Family 15h processors is comparable to previous generations of AMD processors.
• When writing out two streams of data, AMD Family 15h version 1 processors can be up to three times slower than previous-generation AMD processors. AMD Family 15h version 2 processor performance is approximately 1.5 times slower than previous AMD processors.
• When writing out four non-temporal streams, AMD Family 15h version 1 can be up to three times slower than previous AMD processors. AMD Family 15h version 2 processor performance is comparable to previous AMD processors.
• Using non-temporal stores but not writing out an entire cacheline may cause performance to be up to six times slower than previous AMD processors.
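Since partial cache-line streaming stores are the worst case in the list above, one way to plan a write is to split the destination range into an unaligned head, a run of complete cache lines, and a tail, and only use non-temporal stores for the complete lines. The sketch below just does that address arithmetic; the function name is hypothetical and the 64-byte line size is an assumption for the data side.

```python
def split_by_cache_line(addr, length, line_bytes=64):
    """Split [addr, addr + length) into (head_bytes, full_lines, tail_bytes).

    head_bytes: bytes before the first cache-line boundary (ordinary stores)
    full_lines: complete cache lines (candidates for non-temporal stores)
    tail_bytes: bytes after the last complete line (ordinary stores)
    """
    if length < 0 or line_bytes <= 0:
        raise ValueError("length must be >= 0 and line_bytes must be > 0")
    # Distance from addr up to the next line boundary, capped at length.
    head = min((-addr) % line_bytes, length)
    full_lines, tail = divmod(length - head, line_bytes)
    return head, full_lines, tail


# An aligned 128-byte write covers exactly two complete lines.
print(split_by_cache_line(0, 128))
# A write starting 16 bytes into a line has a head and a tail to handle.
print(split_by_cache_line(16, 200))
```

If full_lines comes back as zero, the write never covers a whole line and, going by the last caveat above, ordinary stores look like the safer choice.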
Page 167: AMD Family 15h processors with 128-bit multipliers and adders achieve better throughput using SIMD instructions. (Double precision throughput is 2× and single precision is 4× the throughput of x87.) ... The SIMD instructions provide a theoretical single-precision peak throughput of four additions and four multiplications per clock cycle, whereas x87 instructions can only sustain one addition and one multiplication per clock cycle. The double-precision peak throughput of the SIMD instructions is two additions and two multiplications per clock cycle.
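The 2× and 4× ratios in that quote fall straight out of the 128-bit vector width. A minimal sketch of the arithmetic, assuming one add and one multiply per lane per cycle as the quote describes (the names are mine):

```python
VECTOR_BITS = 128  # width of the multipliers and adders, from the quote


def lanes(element_bits):
    """How many elements of the given width fit in one 128-bit vector."""
    return VECTOR_BITS // element_bits


def peak_flops_per_cycle(element_bits):
    """One addition plus one multiplication per lane, per clock cycle."""
    additions = lanes(element_bits)
    multiplications = lanes(element_bits)
    return additions + multiplications


X87_FLOPS_PER_CYCLE = 1 + 1  # one addition and one multiplication per cycle

single = peak_flops_per_cycle(32)  # single precision: 32-bit elements
double = peak_flops_per_cycle(64)  # double precision: 64-bit elements

print("single precision vs x87:", single // X87_FLOPS_PER_CYCLE)
print("double precision vs x87:", double // X87_FLOPS_PER_CYCLE)
```

These are theoretical peaks, as the manual says; real code also has to keep both pipes fed from memory.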
Page 192: AMD family 15h and later multiprocessor systems implement cache coherent non-uniform memory access (ccNUMA) architecture to connect two or more processors.
Page 195: Large configurations of more than four processors may contain processors that are not directly connected. However, HyperTransport links may now be divided in half, enabling more direct connections between processors but at a reduced bandwidth. System OEMs may choose to use either divided HyperTransport links or configurations with multi-hop remote memory references.
Page 210: AMD family 15h processors incorporate four distinct cores on a single die and have a cache that all the cores share.
Page 221: Earlier generations of AMD-V processors did not support Flush By ASID VMCB commands, which led to an optimization that involved managing ASIDs. This optimization is functionally correct but is superseded on processors that support Flush By ASID.
Page 224: The VMM provides a set of CPUID results to its guests that represents a common subset of features. That subset may not represent any existing physical processor.
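The "common subset" idea is just a set intersection over the hosts a guest might run on. Here is a rough sketch; the host names and their feature lists are invented for illustration and are not real CPUID data, and a real VMM works on CPUID bit fields rather than strings.

```python
def common_feature_set(hosts):
    """Features present on every host; this is what a guest could be shown."""
    feature_sets = [frozenset(features) for features in hosts]
    if not feature_sets:
        return frozenset()
    return frozenset.intersection(*feature_sets)


# HYPOTHETICAL hosts and feature lists, purely for illustration.
pool = {
    "host_a": {"sse", "avx", "xop"},
    "host_b": {"sse", "avx", "fma4"},
    "host_c": {"sse", "xop", "fma4"},
}

guest_features = common_feature_set(pool.values())
print(sorted(guest_features))

# The advertised subset need not equal any single physical host's feature set.
print(any(guest_features == features for features in pool.values()))
```

In this made-up pool the guest only sees the one feature all three hosts share, which matches the manual's point that the subset may not represent any existing physical processor.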
Page 225: The UD2 opcode should be used if software wishes to create #UD exceptions.
Page 315: Application Power Management (APM) boost support.
Page 322: In general, AMD Family 15h processor-based systems can have up to eight nodes.
Page 333: In the previous generation multi-core processors, each core has its own timestamp counter locked to its core. Starting with AMD Family 15h processors, there exists a single clock source in the NorthBridge for all timestamp counters in a processor and these counters are incremented in lockstep.
Page 335: The precision afforded by IBS (Instruction-Based Sampling) also enables automated optimization techniques (e.g., profile-directed optimization) ...
If you found that interesting then please click the Star. Thanks for reading my Post.