Hirdetés

Új hozzászólás Aktív témák

  • Petykemano

    veterán

    "One thing I'm still confused about with GCN and my Google-fu is failing me (asking console devs on Twitter might be the easiest route but hopefully someone here knows as well): transcendental/special-function is 1/4 rate on GCN, but do they stall the entire pipeline for 4 cycles, or can FMAs be issued in parallel for some of these cycles?

    Everything I've found implies that they stall the pipeline for 4 cycles, which is pretty bad (speaking from experience for mobile workloads *sigh* maybe not as bad on PC-level workloads) and compares pretty badly with NVIDIA which on Volta/Turing is able to co-issue SFU instructions 100% for free and they don't stall the pipeline unless they're the overall bottleneck (as they've got spare decoder and spare register bandwidth, and they deschedule the warp until the result is ready; obviously they can't co-issue FP+INT+SFU, but FP+SFU and INT+SFU are fine).

    It feels to me like at this point, 1 NVIDIA "CUDA core" is actually quite a bit more "effective flops" than an AMD ALU. It's not just the SFU but also interpolation, cubemap instructions, etc... We can examine other parts of the architecture in a lot of detail as much as we want, but I suspect the lower effective ALU throughput is probably a significant part of the performance difference at this point... unlike the Kepler days when NVIDIA was a lot less efficient per claimed flop than they are today.

    The "Super SIMD" patent is an interesting opportunity to reverse that for AMD, especially if the "extra FMA" can run in parallel to SFU and interpolation instructions and so on... I really hope it gets implemented in Navi and the desktop GPU market gets a little bit more exciting again! :)"

    Szakvélemény, abu?

    Találgatunk, aztán majd úgyis kiderül..

Új hozzászólás Aktív témák