Re: [hatari-devel] OS X performance problem



Hi,

On Thursday 29 May 2014, Anders Eriksson wrote:
> On Thu, 29 May 2014, Eero Tamminen wrote:
> >> Funnily enough, I notice this problem much more with the powerful
> >> discrete AMD Radeon videochip rather than with the slower Intel
> >> videochip.
> > 
> > This kind of trivial blitting operation is memory bandwidth bound,
> > not 3D operations bound.  OSX SDL backend apparently does separate
> > texture upload for each rect update. With the discrete graphics card
> > the data goes over PCI Express bus, with integrated GPU through
> > system memory bus.
> > 
> > What PCI-E connection do you have to your card, and which Intel processor?
> 
> Discrete video:
> 8x PCIe, AMD Radeon HD 6750M with 1 GB memory
> 
> Integrated video:
> Intel HD Graphics 3000, 512 MB
> 
> Processor:
> 2.2 GHz Intel Core i7 Mobile
> http://ark.intel.com/products/50067/Intel-Core-i7-2720QM-Processor-6M-Cache-up-to-3_30-GHz

Ok, this is Sandy Bridge.  Do you have 1600 MHz DDR3 memory
or something slower?


> Looking at bandwidth for the Hatari window updating at 50 Hz, the window
> is 832*576, 4 bytes per pixel, 1916928 bytes per frame.  Times 50 and
> it's ~91 MB/s or 0.089 GB/s.

Is your issue in windowed or fullscreen mode?

What you calculated is only Hatari's part of the operations.  On top
of that come the operations done by the SDL OSX backend and by the OSX
desktop compositor.

Because statusbar updates have this kind of impact, it seems that your
SDL backend makes each SDL_UpdateRect(s) call into a separate window
update, instead of doing just one window update for them.  With
statusbar, there will be two of those updates, one for the emulated
screen content, and another for the statusbar below it
(they don't overlap).
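
To illustrate what "just one window update" could mean, here is a
minimal Python sketch (not Hatari or SDL code) that merges the updated
rectangles into one bounding rectangle, so that a backend would need
to flush the window only once per frame.  The Rect type, the
bounding_rect() helper and the 20-pixel statusbar height are made up
for illustration.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Hypothetical update rectangle: top-left corner plus size, in pixels."""
    x: int
    y: int
    w: int
    h: int


def bounding_rect(rects):
    """Return the smallest rectangle that covers every rectangle in rects."""
    left = min(r.x for r in rects)
    top = min(r.y for r in rects)
    right = max(r.x + r.w for r in rects)
    bottom = max(r.y + r.h for r in rects)
    return Rect(left, top, right - left, bottom - top)


# Assumed layout: emulated screen with the statusbar directly below it.
screen = Rect(0, 0, 832, 576)
statusbar = Rect(0, 576, 832, 20)   # statusbar height is an assumption

# One merged rectangle means one window update instead of two.
merged = bounding_rect([screen, statusbar])
print(merged)
```

Because the two areas are adjacent and don't overlap, the merged
rectangle contains no extra pixels, so nothing is lost by updating
them together.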

If you're using windowed mode, the window content will be composited
to screen.  That's + 3x (reads from window & destination, write to
destination), potentially for whole window, not just for the part
that got updated.

AFAIK most(?) graphics drivers don't support partial screen updates,
so the final blit of screen content will actually be a full-screen
update, i.e. + 1x full resolution write.
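
A rough Python sketch of how these extra steps add up is below.  It is
only an estimate under stated assumptions: the 1680x1050 screen
resolution is a made-up example, and the sketch assumes that both the
3x compositing cost and the full-screen write are repeated for every
separate window update.  The function names are just for illustration.

```python
BYTES_PER_PIXEL = 4
FPS = 50


def frame_bytes(width, height, bpp=BYTES_PER_PIXEL):
    """Size of one full frame of the given dimensions, in bytes."""
    return width * height * bpp


def traffic_per_second(win_w, win_h, scr_w, scr_h, window_updates=1):
    """Estimated memory traffic in bytes/s for a composited window.

    Per frame: Hatari writes the window content once.  Each window
    update is then assumed to cost a whole-window composite (2 reads
    + 1 write = 3x window size) plus one full-screen write.
    """
    window = frame_bytes(win_w, win_h)
    screen = frame_bytes(scr_w, scr_h)
    per_update = 3 * window + screen
    return (window + window_updates * per_update) * FPS


# Hatari window size from this thread; screen size is an assumption.
one_update = traffic_per_second(832, 576, 1680, 1050, window_updates=1)
two_updates = traffic_per_second(832, 576, 1680, 1050, window_updates=2)

print(f"Hatari alone:      {frame_bytes(832, 576) * FPS / 1e6:.0f} MB/s")
print(f"one window update: {one_update / 1e6:.0f} MB/s")
print(f"two window updates: {two_updates / 1e6:.0f} MB/s")
```

Under these assumptions, Hatari's own writes are only a small part of
the total, and splitting the frame into two window updates nearly
doubles it.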


> The processor has PCIe v2, so 4 GB/s for 8
> lanes, much (44 times) above the needed bandwidth.

For DDR3, AFAIK a good rule of thumb for *real* maximum memory
bandwidth is 0.6x of the theoretical memory bandwidth.  You
can achieve that with best pipelined SIMD CPU instructions,
or well pipelined 3D operations that allow GPU to keep its
memory interface fully utilized with optimal memory access
burst patterns.
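
As a worked example of that rule of thumb, a small Python sketch
follows.  It assumes DDR3-1600 in a dual-channel setup with 64-bit
(8-byte) channels; whether your machine actually has that is exactly
what I asked above, so treat the numbers as illustrative.  The
function names are made up.

```python
def ddr_peak_bytes_per_s(mega_transfers_per_s, channels=2, bus_bytes=8):
    """Theoretical peak bandwidth: transfers/s * bytes per transfer * channels."""
    return mega_transfers_per_s * 1_000_000 * bus_bytes * channels


def realistic_bytes_per_s(peak, efficiency=0.6):
    """Apply the 0.6x rule of thumb to a theoretical peak."""
    return peak * efficiency


peak = ddr_peak_bytes_per_s(1600)    # assumed: dual-channel DDR3-1600
real = realistic_bytes_per_s(peak)

# Hatari's own output: 832x576 window, 4 bytes per pixel, 50 Hz.
hatari = 832 * 576 * 4 * 50

print(f"theoretical peak: {peak / 1e9:.1f} GB/s")
print(f"realistic max:    {real / 1e9:.2f} GB/s")
print(f"Hatari share:     {100 * hatari / real:.2f} %")
```

Even with the 0.6x factor applied, Hatari's own output is well under
one percent of what the memory can deliver in this configuration.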

With integrated chips, like the Intel HD Graphics in your Sandy
Bridge, besides sharing the memory interface to system
RAM, CPU & GPU also share the LLC (CPU's L3).


As to PCI-E speed, the issue with that is high latency,
and a lot depends on the drivers, compositor and SDL
backend, what kind of things get uploaded to the graphics
card memory, when/how, and whether they fit there [1].

[1] If you aren't running other applications besides Hatari,
I think we can be pretty sure that rest of the system hasn't
caused graphics card to run out of memory. :-)

I would assume the OSX desktop side to be done sensibly, so
my suspicion is that the SDL 2D backend does something really
braindead on OSX, in addition to turning every UpdateRect(s)
call into a window update.


> But I don't know how Hatari works, maybe it updates the window many many
> times per refresh and thereby exceeds the bandwidth as you suggest. Or
> maybe I did the calculations all wrong :)

You just left out most of the operations. :-)


> One more thing I also noticed was a lucky (but poor in this case) guy
> with a retina display (aka, HI-DPI) where the window needed to have 4x
> pixel zoom to get to a usable window size. Oh boy that machine didn't
> stand much of a chance to keep full frame rate. But running without Hatari
> zoom and then using the OS X built-in zoom (control+mousewheel or
> control+two finger swipe on touchpad) worked well (this zoom does not
> alter the screen resolution and still pushes all the pixels, but I guess
> doing it through OpenGL).

If you have a discrete graphics card, doing the large upscaling
on the gfx card side is of course better.  It has much faster
memory for that, much less data needs to travel over the slower PCI-E
bus to it, and the CPU gets more memory bandwidth for its own operations.

However, as you calculated, at Hatari window sizes and the memory
speeds of current machines even doing it on the CPU side shouldn't be
a problem, *unless* something stupid happens between what Hatari
does and its output ending up on screen.


	 - Eero


