Bulldozer (V1 and V2) - The (partial) Scoop

Status
Not open for further replies.

4P_Bulldozer

Here are a LOT of details from the Published Manual:

AMD Family 15h version 1 _AND_ version 2:

Page 20: AMD Family 15h processors have multiple compute units, each containing its own L2 cache and two cores. The cores share their compute unit's L2 cache. Each core incorporates the complete x86 instruction set logic and L1 data cache. Compute units share the processor's L3 cache and Northbridge (see Chapter 2, Microarchitecture of AMD Family 15h Processors).

Page 23: AMD Instruction Set Enhancements - The AMD Family 15h processor has been enhanced with the following new instructions:
• XOP and AVX support—Extended Advanced Vector Extensions provide enhanced instruction encodings and non-destructive operands with an extended set of 128-bit (XMM) and 256-bit (YMM) media registers
• FMA instructions—support for floating-point fused multiply accumulate instructions
• Fractional extract instructions—extract the fractional portion of vector and scalar single-precision and double-precision floating-point operands
• Support for new vector conditional move instructions.
• VPERMILx instructions—allow selective permutation of packed double- and single-precision floating point operands
• VPHADDx/VPHSUBx—support for packed horizontal add and subtract instructions
• Support for packed multiply, add and accumulate instructions
• Support for new vector shift and rotate instructions

Page 23: AMD Family 15h processors add support for 128-bit floating-point execution units. As a result, the throughput of both single-precision and double-precision floating-point SIMD vector operations has improved by 2X over the previous generation of AMD processors.

Page 25: Instruction Fetching Improvements - While previous AMD64 processors had a single 32-byte fetch window, AMD Family 15h processors have two 32-byte fetch windows, from which four μops can be selected. These fetch windows, when combined with the 128-bit floating-point execution unit, allow the processor to sustain a fetch/dispatch/retire sequence of four instructions per cycle.

Page 26: Several integer and floating-point instructions have improved latencies and decode types on AMD Family 15h processors.

Page 26: Current AMD Family 15h processors support two SIMD logical/shuffle units, one in the FMUL pipe and another in the FADD pipe, while previous AMD64 processors have only one SIMD logical/shuffle unit in the FMUL pipe. As a result, the SIMD shuffle instructions can be processed at twice the previous bandwidth on AMD Family 15h processors. Furthermore, the PSHUFD and SHUFPx shuffle instructions are now DirectPath instructions instead of VectorPath instructions on AMD Family 15h processors and take advantage of the 128-bit floating point execution units. Hence, these instructions get a further 2X boost in bandwidth, resulting in an overall improvement of 4X in bandwidth compared to the previous generation of AMD processors.

Page 26: Notable Performance Improvements - Several enhancements to the AMD64 architecture have resulted in significant performance improvements in AMD Family 15h processors, including:
• Improved performance of shuffle instructions
• Improved data transfer between floating-point registers and general purpose registers
• Improved floating-point register to floating-point register moves
• Optimization of repeated move instructions
• More efficient PUSH/POP stack operations
• 1-Gbyte paging

Page 30: Key Microarchitecture Features - AMD Family 15h processors include many features designed to improve software performance. The internal design, or microarchitecture, of these processors provides the following key features:
• Integrated DDR3 memory controller with memory prefetcher
• 64-Kbyte L1 instruction cache and 16-Kbyte L1 data cache
• Shared L2 cache between cores of compute unit
• Shared L3 cache among compute units on chip (for supported platforms)
• 32-byte instruction fetch
• Instruction predecode and branch prediction during cache-line fills
• Decoupled prediction and instruction fetch pipelines
• Four-way AMD64 instruction decoding (This is a theoretical limit.)
• Dynamic scheduling and speculative execution
• Two-way integer execution
• Two-way address generation
• Two-way 128-bit wide floating-point execution
• Legacy single-instruction multiple-data (SIMD) instruction extensions, as well as support for XOP, FMA4, VPERMILx, and Advanced Vector Extensions (AVX).
• Superforwarding
• Prefetch into L2 or L1 data cache
• Deep out-of-order integer and floating-point execution
• HyperTransport™ technology

Page 30: Microarchitecture of AMD Family 15h Processors - AMD Family 15h processors implement the AMD64 instruction set by means of macro-ops (the primary units of work managed by the processor) and micro-ops (the primitive operations executed in the processor's execution units). These are simple fixed-length operations designed to include direct support for AMD64 instructions and adhere to the high-performance principles of fixed-length encoding, regularized instruction fields, and a large register set. This enhanced microarchitecture enables higher processor core performance and promotes straightforward extensibility for future designs.

Page 31: Superscalar Processor - The AMD Family 15h processors are aggressive, out-of-order, four-way superscalar AMD64 processors. They can theoretically fetch, decode, and issue up to four AMD64 instructions per cycle using decoupled fetch and branch prediction units and three independent instruction schedulers, consisting of two integer schedulers and one floating-point scheduler. These processors can fetch 32 bytes per cycle and can scan two 16-byte instruction windows for up to four micro-ops, which can be dispatched together in a single cycle. However, this is a theoretical limit.

Page 33:
L1 Instruction Cache - The out-of-order execution engine of AMD Family 15h processors contains a 64-Kbyte, 2-way set-associative L1 instruction cache. Each line in this cache is 64 bytes long. However, only 32 bytes are fetched in every cycle.
L1 Data Cache - The AMD Family 15h processor contains a 16-Kbyte, 4-way predicted L1 data cache with two 128-bit ports. This is a write-through cache that supports up to two 128-bit loads per cycle.
L2 Cache - The AMD Family 15h processor has one shared L2 cache per compute unit. This full-speed on-die L2 cache is mostly inclusive relative to the L1 cache. The L2 is a write-through cache.
L3 Cache - The AMD Family 15h processor supports a maximum of 8MB of L3 cache per die, distributed among four L3 sub-caches which can each be up to 2MB in size.
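
As a quick sanity check on the Page 33 figures, here is a minimal sketch that turns cache size, associativity, and line length into a set count. The 64-byte line for the L1 data cache is an assumption (the excerpt only states the line length for the instruction cache), and the helper name num_sets is made up for illustration.

```python
KIB = 1024


def num_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets in a set-associative cache: size / (ways * line)."""
    if size_bytes % (ways * line_bytes) != 0:
        raise ValueError("size must be a multiple of ways * line_bytes")
    return size_bytes // (ways * line_bytes)


# L1 instruction cache: 64 Kbyte, 2-way, 64-byte lines (all from the excerpt).
l1i_sets = num_sets(64 * KIB, ways=2, line_bytes=64)

# L1 data cache: 16 Kbyte, 4-way (from the excerpt); 64-byte line is ASSUMED.
l1d_sets = num_sets(16 * KIB, ways=4, line_bytes=64)

# L3: four sub-caches of up to 2 MB each gives the quoted 8 MB maximum.
l3_max_mb = 4 * 2

print(l1i_sets, l1d_sets, l3_max_mb)
```

With those inputs the instruction cache works out to 512 sets and the data cache to 64 sets.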

Page 35: The scheduling for integer operations is fully data-dependency driven; proceeding out-of-order based on the validity of source operands and the availability of execution resources. Since the Bulldozer core implements a floating point co-processor model of operation, most scheduling and execution decisions of floating-point operations are handled by the floating point unit.

Page 37: Floating-Point Unit - The AMD Family 15h processor floating point unit (FPU) was designed to provide four times the raw FADD and FMUL bandwidth as the original AMD Opteron and Athlon 64 processors. It achieves this by means of two 128-bit fused multiply-accumulate (FMAC) units which are supported by a 128-bit high-bandwidth load-store system. The FPU is a coprocessor model that is shared between the two cores of one AMD Family 15h compute unit. As such it contains its own scheduler, register files and renamers and does not share them with the integer units. This decoupling provides optimal performance of both the integer units and the FPU. In addition to the two FMACs, the FPU also contains two 128-bit integer units which perform arithmetic and logical operations on AVX, MMX and SSE packed integer data.

Page 39:
Write Combining - AMD Family 15h processors provide four write-combining data buffers that allow four simultaneous streams. ... A Write Coalescing Cache (WCC) has been incorporated into the AMD family 15h microarchitecture. The WCC is 4 KB in size and is 4-way set associative. Stores to cacheable memory and, thus, to the L2 cache are coalesced in this cache.

Integrated Memory Controller - AMD Family 15h processors provide integrated low-latency, high-bandwidth DDR3 memory controllers. The memory controller supports:
• DRAM chips that are 4, 8, and 16 bits wide within a DIMM.
• Interleaving memory within DIMMs.
• ECC checking with single symbol correcting and double symbol detecting.
• Dual-independent 64-bit channel operation.
• Optimized scheduling algorithms and access pattern predictors to improve latency and achieved bandwidth, particularly for interleaved streams of read and write DRAM accesses.
• A data prefetcher.

Page 40: HyperTransport3 increases the aggregate link bandwidth to a maximum of 25.6 Gbyte/s (16-bit link).
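
The 25.6 Gbyte/s figure can be reproduced with a little arithmetic. This is only a sketch: the 6.4 GT/s transfer rate is an assumption that is not stated in the excerpt, and "aggregate" is taken to mean both directions of the link added together. The function name is hypothetical.

```python
def ht_aggregate_gbytes_per_s(link_bits: int, gtransfers_per_s: float) -> float:
    """Aggregate (both directions) bandwidth of a HyperTransport link."""
    bytes_per_transfer = link_bits / 8
    per_direction = bytes_per_transfer * gtransfers_per_s
    return 2 * per_direction


# ASSUMPTION: 6.4 GT/s per lane; the excerpt only gives the 25.6 Gbyte/s total.
full_link = ht_aggregate_gbytes_per_s(link_bits=16, gtransfers_per_s=6.4)

# A link divided in half (8 bits wide) gives half the bandwidth under the same assumption.
half_link = ht_aggregate_gbytes_per_s(link_bits=8, gtransfers_per_s=6.4)

print(full_link, half_link)
```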

Page 41: Multisocket-capable AMD family 15h processors incorporate HyperTransport assist technology (also referred to in some documents as probe filtering).

Page 108: Use streaming instructions instead of PREFETCHW in situations where all of the following conditions are true:
• The code will overwrite one or more complete cache lines with new data.
• The new data will not be used again soon.
Streaming instructions include the non-temporal stores MOVNTDQ, MOVNTI, MOVNTPS, MOVNTPD, MOVNTSD, MOVNTSS and the MMX instruction MOVNTQ.

Page 109: The following performance caveats apply when using streaming stores on AMD Family 15h cores.
• When writing out a single stream of data sequentially, performance of AMD Family 15h processors is comparable to previous generations of AMD processors.
• When writing out two streams of data, AMD Family 15h version 1 processors can be up to three times slower than previous-generation AMD processors. AMD Family 15h version 2 processor performance is approximately 1.5 times slower than previous AMD processors.
• When writing out four non-temporal streams, AMD Family 15h version 1 can be up to three times slower than previous AMD processors. AMD Family 15h version 2 processor performance is comparable to previous AMD processors.
• Using non-temporal stores but not writing out an entire cacheline may cause performance to be up to six times slower than previous AMD processors.
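
Since partial cache-line writes are the expensive case for non-temporal stores, one way to plan a copy is to split the destination range into an unaligned head, a body of whole cache lines, and a tail. The sketch below only does the bookkeeping; it issues no stores itself. The 64-byte line size is an assumption for the data side, and split_for_streaming is a made-up helper name.

```python
LINE_BYTES = 64  # ASSUMED cache-line size


def split_for_streaming(start: int, length: int, line: int = LINE_BYTES):
    """Split [start, start+length) into (head, body, tail) byte counts.

    Only 'body' covers complete cache lines and is a candidate for
    streaming stores; 'head' and 'tail' would use ordinary stores.
    """
    end = start + length
    first_full = -(-start // line) * line  # round start up to a line boundary
    last_full = (end // line) * line       # round end down to a line boundary
    if first_full >= last_full:
        # The range never covers a whole line.
        return length, 0, 0
    return first_full - start, last_full - first_full, end - last_full


print(split_for_streaming(0, 256))
print(split_for_streaming(16, 200))
print(split_for_streaming(10, 20))
```

Only the middle count is a multiple of the line size, so only that part satisfies the "complete cache lines" condition from page 108.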

Page 167: AMD Family 15h processors with 128-bit multipliers and adders achieve better throughput using SIMD instructions. (Double precision throughput is 2× and single precision is 4× the throughput of x87.) ... The SIMD instructions provide a theoretical single-precision peak throughput of four additions and four multiplications per clock cycle, whereas x87 instructions can only sustain one addition and one multiplication per clock cycle. The double-precision peak throughput of the SIMD instructions is two additions and two multiplications per clock cycle.
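
The 2× and 4× ratios on page 167 fall straight out of the lane counts. A minimal sketch, assuming one add and one multiply can be issued per lane per cycle as the excerpt describes (the function name is invented for illustration):

```python
OPS_PER_LANE = 2  # one addition plus one multiplication per lane per cycle


def peak_flops_per_cycle(lanes: int) -> int:
    """Peak floating-point operations per cycle for a given lane count."""
    return lanes * OPS_PER_LANE


sp_simd = peak_flops_per_cycle(128 // 32)  # four single-precision lanes
dp_simd = peak_flops_per_cycle(128 // 64)  # two double-precision lanes
x87 = peak_flops_per_cycle(1)              # scalar x87: one add + one multiply

print(sp_simd // x87, dp_simd // x87)
```

That gives 8 single-precision or 4 double-precision operations per cycle against 2 for x87, which matches the quoted 4× and 2×.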

Page 192: AMD family 15h and later multiprocessor systems implement cache coherent non-uniform memory access (ccNUMA) architecture to connect two or more processors.

Page 195: Large configurations of more than four processors may contain processors that are not directly connected. However, Hypertransport links may now be divided in half, enabling more direct connections between processors but at a reduced bandwidth. System OEMs may choose to use either divided Hypertransport links or configurations with multi-hop remote memory references.

Page 210: AMD family 15h processors incorporate four distinct cores on a single die and have a cache that all the cores share.

Page 221: Earlier generations of AMD-V processors did not support Flush By ASID VMCB commands, which led to an optimization that involved managing ASIDs. This optimization is functionally correct but is superseded on processors that support Flush By ASID.

Page 224: The VMM provides a set of CPUID results to its guests that represents a common subset of features. That subset may not represent any existing physical processor.

Page 225: The UD2 opcode should be used if software wishes to create #UD exceptions.

Page 315: Application Power Management (APM) boost support.

Page 322: In general, AMD Family 15h processor-based systems can have up to eight nodes.

Page 333: In the previous generation multi-core processors, each core has its own timestamp counter locked to its core. Starting with AMD Family 15h processors, there exists a single clock source in the NorthBridge for all timestamp counters in a processor and these counters are incremented in lockstep.

Page 335: The precision afforded by IBS (Instruction-Based Sampling) also enables automated optimization techniques (e.g., profile-directed optimization) ...


If you found that interesting then please click the Star. :star: Thanks for reading my Post.
 
There is rumor from others and statements from AMD that there will be a short delay before the launch of Bulldozer.
This is GREAT News if it means we will get a Version 3 Chip (at faster clock speeds) and revised pricing that will kick the later released SBe.
 
Something is about to show up as Asus is releasing high-end AM3+ boards.
 
"There is rumor from others and statements from AMD that there will be a short delay before the launch of Bulldozer.
This is GREAT News if it means we will get a Version 3 Chip (at faster clock speeds) and revised pricing that will kick the later released SBe."
Sales of the Fusion (Zacate/Ontario) chips exploded much more than AMD expected (they're now sold out after 5 million chips). So they decided it would be more profitable to launch Llano first.
At the same time they are also doing another stepping of Bulldozer to improve clocks.

There have been a few leaked benchmarks of B0 stepping BD chips. And the results are worse than K10.
However it appears that this is deliberate. B0 chips aren't reporting correct clock speeds. Actual clocks are an unknown amount lower than reported speeds, and therefore true performance is being obfuscated.
The only reason B0 chips are given out is for compatibility testing for board makers.

AMD is being tighter than the CIA about BD's true performance.
 
OBR has confirmed his benchmarks ran at the posted speeds. I expect bulldozer to be a sack of fail....just like Phenom 1. If it comes out and proves me otherwise I'll eat my words...but from what I've seen so far Bulldozer is a huge sack of overhyped fail.
 
CPU-Z is worthless with new chips. The speed is stamped right there on the front of the CPU. Also respinning won't improve IPC, only clockspeeds.
 
"CPU-Z is worthless with new chips."
Generally, yes.
And in this case, even if Zambezi was 100% supported by CPU-Z, the ES chips could be feeding CPU-Z false clocks.
"The speed is stamped right there on the front of the CPU."
Not exactly.... The model number is stamped on. And on retail chips, the clocks of each model are known.
"Also respinning won't improve IPC, only clockspeeds."
I know. That's the aim.

My main point is that nobody knows Bulldozer's IPC compared to K10.5 or SB. Or clocks.

If I had to guess, I'd say its IPC will be somewhere near SB's IPC.
Will it be higher? lower? I don't know.

Also, when it comes to (single threaded) performance, what counts is a combination of IPC and clocks.
Clock speed by itself is not everything, and IPC by itself is not everything.

If BD has lower IPC but higher clocks than SB, it can still be a winner.
Of course you can't sacrifice too much IPC (Pentium 4). And we do know IPC of BD is definitely higher than K10.5.
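
To put the IPC-versus-clocks point in numbers, here is a rough first-order sketch that treats single-threaded throughput as IPC times clock speed. The chips and values below are invented purely for illustration and are not measurements or predictions for BD, K10.5 or SB.

```python
def single_thread_score(ipc: float, clock_ghz: float) -> float:
    """First-order model: instructions per second ~ IPC * clock."""
    return ipc * clock_ghz


# HYPOTHETICAL chips, made-up numbers.
lower_ipc_higher_clock = single_thread_score(ipc=1.0, clock_ghz=4.2)
higher_ipc_lower_clock = single_thread_score(ipc=1.2, clock_ghz=3.4)

print(lower_ipc_higher_clock > higher_ipc_lower_clock)
```

In this made-up case the chip with 20% lower IPC still comes out ahead because its clock advantage is larger, but drop its IPC much further and the result flips.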


When it comes to multithreaded performance (8 threads) it's easy to expect better than SB performance.
I mean, it has 8 cores... Even 8 * K10.5 cores can beat 4 SB cores if they're all being utilized.
 
Another point we fail to note is that all these tests are being run with Code that is not optimized for the Bulldozer Architecture.

With a new Arch so different from any other you would want to re-compile the Benchmarks with a Compiler that performs Processor optimizations targeted towards the new chip; but since no such Compiler is in the public domain (yet) there remains a bit more to squeeze out of the as yet un-released Chip.

You can bet that sub-par Chips are released (for testing purposes) to places that lack the security necessary to protect AMD's Patents (non-US Companies) and that have employees that might sell a Chip for a few years' wages ($10,000). I am not suggesting that Intel would purchase these Chips but there are PLENTY of Blogs that do (look what happened to Apple's iPhone that was "left in a Bar"). Intel reads the rumors the same as the rest of us.

AMD has suggested a certain level of performance that we are to expect; let's wait and see if they pull off the 4GHz Chip with 1GHz Turbo for $320 in two months before we say they won't.

The recently released Llano has made Record Benchmarks (using "old" Code, though it does not need "new") and is beating most of the SBes (and near the i7) on graphics performance, price, and wattage (important for a "Laptop Chip"); so do not count the Bulldozer out of the running yet, especially with the delay scheduled for the IB.

Part of the "Bulldozer Line" is the 62xx Series and we already know that they win for what they were designed for (Servers - more Cores and lower wattage and not single thread performance).

I think you (gurusan - Translation: without guru) need to re-read the "Published Manual" (now in its third edition) and see how BD3 is going to stack up before you say it's "a sack of fail". The "engineering samples" are not failing, they work. The "ops" units can be updated with a re-spin and higher IPC (fewer cycles) is as likely as higher clockspeed.

I (and many of us) am not seeing much "fail", let alone a whole sack full. Do you have a small sack?
 
I read on XS that the Turbo and P-states are broken in ES chips. So yes, I think the frequencies are not what they should be in leaked benches (although it may not be deliberate).
Also, I think the general consensus on XS is that OBR's benchmarks are fakes (photoshop).

I'm pretty sure at least some of the people on XS have ES chips and are honoring their NDAs too.
chew* might, but he wouldn't admit it if he does.
 