cpu - Floating point operations per cycle - Intel


I have been looking for a while and cannot find an authoritative, conclusive figure for the number of floating point operations per clock cycle that an Intel Xeon quad-core can complete. I have an Intel Xeon quad-core E5530 CPU.

I'm hoping to use it to calculate the maximum theoretical FLOP/s my CPU can achieve.

MAX FLOPS = (# of cores) * (clock frequency (cycles/second)) * (# of FLOPs/cycle)

Anything pointing me in the right direction would be useful. I have found this:

Intel Core 2 and Nehalem:

4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication

8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication

But I'm not sure where these figures come from. Do they assume a fused multiply-add (FMAD) operation?

EDIT: Using this, I calculate the DP arithmetic throughput as 38.4 GFLOP/s, which matches the cited figure. For SP I get double that, 76.8 GFLOP/s. I'm pretty sure 4 DP FLOPs/cycle and 8 SP FLOPs/cycle are correct; I just want confirmation of how the FLOPs/cycle values of 4 and 8 are obtained.
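As a quick check of the arithmetic in the edit above, here is a minimal sketch of the formula in Python. The function name `peak_gflops` is made up for illustration, and the 2.4 GHz clock is an assumption (it is the value implied by 38.4 GFLOP/s with 4 cores and 4 FLOPs/cycle).

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOP/s: cores * clock (GHz) * FLOPs per cycle."""
    return cores * clock_ghz * flops_per_cycle


CORES = 4          # quad-core, from the question
CLOCK_GHZ = 2.4    # assumed clock, consistent with the 38.4 GFLOP/s figure

dp_peak = peak_gflops(CORES, CLOCK_GHZ, 4)  # 4 DP FLOPs/cycle
sp_peak = peak_gflops(CORES, CLOCK_GHZ, 8)  # 8 SP FLOPs/cycle

print(f"DP peak: {dp_peak:.1f} GFLOP/s")
print(f"SP peak: {sp_peak:.1f} GFLOP/s")
```

This is only the theoretical ceiling; it says nothing about whether real code can keep both execution units busy every cycle.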

Nehalem is capable of executing 4 DP or 8 SP FLOPs/cycle. This is accomplished using SSE, which operates on packed floating point values: 2 per register in DP and 4 per register in SP. To reach 4 DP FLOPs/cycle or 8 SP FLOPs/cycle, the core has to execute 2 SSE instructions per cycle. It does so by executing one packed multiply and one packed add per cycle (MULPD and ADDPD in DP, or MULPS and ADDPS in SP). No fused multiply-add is involved.

This is possible because Nehalem has separate execution units for SSE multiply and SSE add, and these units are pipelined so that the throughput is one multiply and one add per cycle. Multiplies are in the pipeline for 4 cycles in SP and 5 cycles in DP. Adds are in the pipeline for 3 cycles, independent of SP/DP. The number of cycles an operation spends in the pipeline is known as the latency.

To calculate the peak FLOPs/cycle, all you need to know is the throughput. With a throughput of 1 SSE vector instruction/cycle for both the multiplier and the adder (2 execution units), you get 2 x 2 = 4 FLOPs/cycle in DP and 2 x 4 = 8 FLOPs/cycle in SP.

To actually sustain this peak, you have to consider latency (you need at least as many independent operations in flight as the pipeline is deep) and whether you are able to feed in enough data. Nehalem has an integrated memory controller capable of very high bandwidth from memory, which it can achieve if the data prefetcher correctly anticipates the access pattern of the data (loading sequentially from memory is a trivial pattern that it can anticipate). Normally there is not enough memory bandwidth to feed all the cores with data at peak FLOPs/cycle, so some amount of data reuse from the cache is necessary to sustain the peak.
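The 4 and 8 figures can be reproduced from the register width and the number of execution units. The sketch below is only an illustration of that counting argument. The helper name `flops_per_cycle` and its parameters are hypothetical, and the 128-bit SSE register width is the only hardware fact it relies on.

```python
SSE_REGISTER_BITS = 128  # width of one SSE (XMM) register


def flops_per_cycle(element_bits, vector_units=2, instr_per_cycle_per_unit=1):
    """Peak FLOPs/cycle for one core.

    element_bits: 64 for double precision, 32 for single precision.
    vector_units: independent SSE units (here: one multiplier, one adder).
    instr_per_cycle_per_unit: throughput of each unit in instructions/cycle.
    """
    lanes = SSE_REGISTER_BITS // element_bits  # packed values per register
    return lanes * vector_units * instr_per_cycle_per_unit


dp_flops = flops_per_cycle(64)  # 2 lanes * 2 units
sp_flops = flops_per_cycle(32)  # 4 lanes * 2 units

print("DP FLOPs/cycle:", dp_flops)
print("SP FLOPs/cycle:", sp_flops)
```

Note that latency does not appear anywhere in this calculation; it only matters for whether the peak can be sustained.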

Details on where you can find the number of independent execution units, and their throughput and latency in cycles, follow.

See page 105 of this document, section 8.9 Execution units.

It says that for Nehalem

The floating point multiplier on port 0 has a latency of 4 for single precision and 5 for double and long double precision. The throughput of the floating point multiplier is 1 operation per clock cycle, except for long double precision on Core 2. The floating point adder is connected to port 1. It has a latency of 3 and is fully pipelined.

To get 8 SP FLOPs/cycle you need 4 SP ADDs/cycle and 4 SP MULs/cycle. The adder and the multiplier are separate execution units on separate ports, and each can operate on 4 packed SP operands simultaneously using SSE packed (vector) instructions (4 x 32 bits = 128 bits). Each has a throughput of one operation per clock cycle. To achieve that throughput you have to take the latency into account, that is, how many cycles must pass after an instruction issues before you can use its result. So you have to issue several independent instructions to cover the latency. In single precision the multiplier has a latency of 4 and the adder a latency of 3.
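To illustrate why independent instructions are needed to cover the latency, here is a toy model in Python. It is a deliberate simplification, not a description of the real scheduler: a single fully pipelined unit that can start one instruction per cycle, fed by a number of dependency chains in which each instruction must wait for the previous result of its own chain. The function name `cycles_to_finish` is made up for this sketch.

```python
def cycles_to_finish(latency, chains, ops_per_chain):
    """Cycles until the last result is ready on one pipelined unit.

    The unit can start at most one instruction per cycle. Within a chain,
    each instruction depends on the previous one, so it can only issue
    once that result is ready (latency cycles after the previous issue).
    """
    ready_at = [0] * chains               # earliest issue cycle per chain
    remaining = [ops_per_chain] * chains  # instructions left per chain
    cycle = 0
    finish = 0
    while any(remaining):
        # Issue from the first chain whose input is ready, if there is one.
        for c in range(chains):
            if remaining[c] and ready_at[c] <= cycle:
                remaining[c] -= 1
                ready_at[c] = cycle + latency
                finish = max(finish, ready_at[c])
                break
        cycle += 1
    return finish


LATENCY = 4  # SP multiply latency quoted above
for chains in (1, 2, 4):
    total_ops = chains * 100
    cycles = cycles_to_finish(LATENCY, chains, 100)
    print(f"{chains} chain(s): {total_ops / cycles:.2f} instructions/cycle")
```

In this model a single dependent chain reaches only 1/latency of the unit's throughput, and the rate approaches one instruction per cycle once the number of independent chains is at least the latency.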

You can find the same throughput and latency numbers for Nehalem in the Intel optimization guide, Table C-15A.


