MorphZone

Some doubts about the PegasosII G3

marcik

# 51
Order of the Butterfly

Posts: 268 from 2003/4/12

From: Kielce/Krakow,...

Quote:

I would also like the ability to link mPlayer as the default tool for an icon for a movie file, like the Amiga olden days, using iconx or xicon or whatever it is.

What stops you from doing that now? Configure filetypes in Ambient to use MPlayer as the default player for videos and there you have it (or use one of the filetypes packs that are already configured properly).
»15.07.06 - 13:08

Framiga

# 52
Order of the Butterfly

Posts: 363 from 2003/7/11

From: Milan-Italy

@CISC

So it's the poor CSPPC bandwidth that makes MPlayer so slow (quite unusable) compared to FroggerNG (even with AVIs and MPEGs)? Or is MPlayer just not optimized for our Classic?
»15.07.06 - 14:15

CISC

# 53
MorphOS Developer

Posts: 619 from 2005/8/27

From: the land with ...

Quote:

So it's the poor CSPPC bandwidth that makes MPlayer so slow (quite unusable) compared to FroggerNG (even with AVIs and MPEGs)? Or is MPlayer just not optimized for our Classic?

I'm not sure how you reached that conclusion as Frogger is obviously shifting (more or less) the same amount of data as MPlayer on the same file...

If indeed MPlayer is that bad (I don't know) on CSPPC the problem is most likely elsewhere (do you f.ex. have a gfxcard which has overlay support in the drivers? if not the conversion in MPlayer vs Frogger might make up a significant difference), and are you even doing a comparison on equal grounds (ie, same file, same pixfmts, etc)?

That said it doesn't surprise me if Frogger is faster as it has far fewer intermediate stages (since it has fewer features) for the data than MPlayer, and is more tightly integrated (and probably also optimized better)...

- CISC
»15.07.06 - 14:34

Framiga

# 54
Order of the Butterfly

Posts: 363 from 2003/7/11

From: Milan-Italy

"I'm not sure how you reached that conclusion as Frogger is obviously shifting (more or less) the same amount of data as MPlayer on the same file..."

eh, of course

Same file, no overlay support (CVPPC)
»15.07.06 - 14:57

Velcro_SP

# 55
Priest of the Order of the Butterfly

Posts: 929 from 2003/7/13

From: Universe

|||

[ Edited by Velcro_SP 03.11.2011 - 17:26 ]
Pegasos2 G3, 512 megs RAM
»15.07.06 - 16:47

marcik

# 56
Order of the Butterfly

Posts: 268 from 2003/4/12

From: Kielce/Krakow,...

Action
Name Play with MPlayer
Event DoubleClick
Command AMIGADOS Run <>NIL: MorphOS:Apps/mplayer/mplayer %sp
End

Works here.
»15.07.06 - 17:21

BigGun

# 57
Acolyte of the Butterfly

Posts: 150 from 2004/6/18

From: Nagold - Germany

Merko,

The binaries don't include a detailed description, so you could not know that you are accidentally comparing the best possible performance on MOS with the worst possible on Linux.

The tests are part of a bigger test range done to improve the glibc routines for PowerPC Linux. The test was never meant to compare MOS with Linux (this was never my intention), so it was not documented in any way in this regard.

Rather, the test compares Linux with MacOS, as MacOS is in many ways very well optimized for PowerPC.

I'll try to quickly explain some CPU and test background so that you can understand the tests better.

Oversimplified, you can say that most normal C code does not use the full potential of the G3 and G4 PowerPC CPUs.
A simple C copy loop will starve on G3 and G4 because of load latencies and other cache effects.
BTW, this bad effect is very common on other CPUs such as Athlon or Pentium too.

Oversimplified, you can say that an 'average' self-made C loop only achieves 50% of the possible performance on Pentium, Athlon, G3 or G4.
The G5 is a different thing. The G5 has very clever hardware which improves the performance of average C code drastically.

A 'good' algorithm that is aware of the cache behavior can achieve a much better result on the above-mentioned CPUs.
If you use cache prefetching [streaming] on the G3 and G4 then you can easily increase the performance by 100%, sometimes even more.
The [clever] G5 has hardware built in to automatically start streaming. So in general, hand-optimization of code is a bit less important on the G5 than on the G4.

In general you can say that the main bottleneck on Athlon, Pentium and G4 alike is the read latency.
The fix is the same for all these CPUs: you prefetch data in advance. This is called streaming or data prefetching.
This prefetching makes sense if you work on data bigger than 64 bytes.
If you work on bigger datasets then you will double the performance on the mentioned CPUs by using prefetching.
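
To make the idea of prefetching in advance a bit more concrete, here is a minimal sketch in C. It is not the code from the tests discussed here. It assumes a GCC-compatible compiler (`__builtin_prefetch` is a GCC/Clang builtin and only a hint to the CPU), and the names `copy_plain`, `copy_prefetch` and `PREFETCH_AHEAD`, the 128-byte prefetch distance and the 32-byte cache line are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prefetch distance in bytes: how far ahead of the current
 * read position the cache is asked to start loading. The best value
 * depends on the CPU and on memory latency; 128 is only a guess. */
#define PREFETCH_AHEAD 128

/* Plain word-by-word copy: every load that misses the cache has to
 * wait for memory before the loop can continue. */
void copy_plain(uint32_t *dst, const uint32_t *src, size_t words)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < words; i++)
        dst[i] = src[i];
}

/* Same copy, but with a hint that data a bit further ahead will be
 * needed soon, so the fetch can overlap with the copying. */
void copy_prefetch(uint32_t *dst, const uint32_t *src, size_t words)
{
    assert(dst != NULL && src != NULL);
    const size_t ahead = PREFETCH_AHEAD / sizeof(uint32_t);
    for (size_t i = 0; i < words; i++) {
        /* One hint per assumed 32-byte cache line (8 words),
         * and never past the end of the source block. */
        if ((i & 7) == 0 && i + ahead < words)
            __builtin_prefetch(&src[i + ahead], 0, 0);
        dst[i] = src[i];
    }
}
```

Both functions produce the same result; any difference would only show up in timing, and only if the platform actually honours the hint.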

The G5 can automagically prefetch, so its average memory throughput is always very good.

An operating system should have optimal functions for these common tasks.
You would expect an operating system to use optimized routines that double or triple the performance of a memcopy.
For some operating systems this is true, for others not.
MacOS, for example, has well-optimized routines.
Linux did not have them for the PPC.
The glibc copy on Linux crawled along at less than half the performance of the comparable MacOS functions.
Working on improving those Linux (glibc) routines is what I had the pleasure to do with some IBM engineers.

On Linux as of today: the normal glibc copy is totally PowerPC-unaware and very slow (as slow as the MOS routine).
But a program under Linux can use its own optimized routines and achieve doubled or tripled performance.
When the glibc patches are approved and included in the next version, all Linux programs will see a great performance increase when copying or comparing more than 100 bytes of memory on PowerPC.

On MOS the picture is a bit different.
The special cache settings on MorphOS prevent the usage of streaming. This means you cannot get 700 MB/s read performance on MorphOS, but only 250 MB/s.

The tests are not that well documented on the website, as they were mainly intended as simple examples for the people working on the tests and not for everyone.

If you look at the light-colored functions, these are the ones that are not PowerPC-aware. These functions are slow.
The darker functions are examples using different streaming settings.
So on Linux, for example, the unoptimized memory reads are
read8, read32 and read64; they all score about 250 MB/s.
By activating cache prefetching you get them into the 600-700 MB/s range.

For example, the MacOS memcopy functions (compiled and used under Linux on Pegasos) achieve about 600 MB/s.
Under MOS you will not achieve this good result but rather get a score in the range of 350-400 MB/s at best.
The problem on MOS is that the data streaming that enables the function on MacOS to shine is disabled on MOS.

As no data-streaming enhancements are possible on MOS, the performance is limited to what the worst functions achieve on Linux. In the case of memcopy this is about 350 MB/s.

The mem copy function on MOS is as good or bad as it can get in MOS atm. So the CopyMemQuick call on MOS basically shows you what MOS can do right now (350-400 MB/s). The MOS version was only compiled for a friend who asked for it. It was not intended to prove anything, so it is not documented in that way and it does not clearly show that all the performance enhancement tricks fail on MOS.
I hope that this explanation helps to better understand the situation. If you want, simply compile the Linux source on MOS and verify this yourself. To compile the Linux version you will need to disable the AltiVec stuff, of course.

I don't doubt that the cache settings on MOS have a good reason. But I saw a number of programs being very limited by the memory throughput. MPlayer on Linux, for example, uses its own PowerPC-optimized routines to get speed. But these optimizations, as mentioned, cannot work on MOS with its settings.

I would not be surprised if a new version of MOS changed its settings to be more like MacOS, for example. I say MacOS here and not Linux because in Linux you currently can optimize for PowerPC, whereas on MacOS the OS and many programs are well optimized already.

thanks for reading
»15.07.06 - 20:51

BigGun

# 58
Acolyte of the Butterfly

Posts: 150 from 2004/6/18

From: Nagold - Germany

Quote:

CISC wrote:
Quote:

Proving the memory bandwidth differences between MacOS, Linux, and MOS is easy:

Those benchmarks mainly show cache efficiency (for the particular code at hand) btw, not "memory throughput" like your page claims.

CISC, you are wrong.

The truth is that the test will measure
memory bandwidth as well as 1st-level and 2nd-level cache bandwidth.

The tests work on arrays of different sizes, from small ones to big ones, finally copying blocks of 80 MB each to another 80 MB region.
In that case it is of course 100% memory bandwidth that gets measured, as the Pegasos can never fit 80 MB in its 500 KB cache.
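
A sweep over block sizes like the one described above could be sketched roughly as follows. This is not the actual test program; `copy_mb_per_s` is a hypothetical helper that times the standard `memcpy` with the C library `clock()`, and the choice of 1 MB = 1,000,000 bytes is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Copy a block of `bytes` bytes `reps` times with memcpy and return the
 * measured throughput in MB/s (1 MB = 1e6 bytes here), or -1.0 if the
 * buffers cannot be allocated. */
double copy_mb_per_s(size_t bytes, int reps)
{
    assert(bytes > 0 && reps > 0);

    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, bytes);   /* touch both buffers so the pages are mapped */
    memset(dst, 0, bytes);

    volatile unsigned char sink = 0;  /* keeps the copies from being optimized away */
    clock_t start = clock();
    for (int r = 0; r < reps; r++) {
        src[0] = (unsigned char)r;    /* make every pass depend on new data */
        memcpy(dst, src, bytes);
        sink = dst[bytes - 1];
    }
    clock_t stop = clock();
    (void)sink;

    free(src);
    free(dst);

    double secs = (double)(stop - start) / CLOCKS_PER_SEC;
    if (secs <= 0.0)                  /* run was shorter than the clock resolution */
        secs = 1.0 / CLOCKS_PER_SEC;
    return ((double)bytes * reps) / 1e6 / secs;
}
```

Calling this with sizes from a few KB up to 80 MB should give a curve: blocks that fit into the 1st- or 2nd-level cache mostly measure the cache, while blocks far larger than the cache mostly measure main memory. The repetition count for small blocks has to be large enough for `clock()` to resolve the run.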
»15.07.06 - 21:00

BigGun

# 59
Acolyte of the Butterfly

Posts: 150 from 2004/6/18

From: Nagold - Germany

Quote:

CISC wrote:

Quote:

- under MOS you can not fully utilize the throughput of a G4

Not true (only partially when you are talking about 64bit access to cached areas).

The G4 Pegasos under Linux, using a 32-bit copy loop, gets 700 MB/s.
The best possible copy loop on MOS (MorphOS copymem)
will score around 400 MB/s.
Both numbers are for working on data bigger than 1 KB (cold cache).
(Using Linux you can enable streaming to increase performance.)

Question: Is 400 less than 700?
Is it not true that MOS cannot get over 400 MB/s (cold cache)?
If MOS cannot get into the range of 700 MB/s then MOS is not utilizing the possible throughput of a G4!
(700 MB/s on Linux with 32-bit access!)

So what is true and who is lying?

What is the best throughput you can achieve on MOS when copying, let's say, a block of 100 MB?
On MOS you can't even get close to Linux!

CISC, either acknowledge that on MOS you cannot achieve the same memory bandwidth and memory copying speed as on Linux, or come back with real results and show us how to achieve 700 MB/s when copying, for example, a block of 100 MB.

[ Edited by BigGun on 2006/7/15 21:38 ]
»15.07.06 - 21:27

merko

# 60
Order of the Butterfly

Posts: 328 from 2003/5/19

BigGun: You make some claims, I ask for proof. You point me to some
test program. I run the test. It tells me nothing. Now you admit that
the test is useless. No thanks for wasting my time.

No one is disputing that MorphOS and Linux use different cache
settings. Naturally, they each have their advantages and
disadvantages. If there were no trade-off, no configurability would
have been needed in the first place.

As things stand, I just can't see that you've offered any sort of
credible argument supporting your claim that MOS memory settings would
slow down applications. You gave two examples, MAME and "video
playback", and others here in this thread claim that MAME runs faster
on MOS than on Linux. Until someone posts actual benchmarks, I'm
inclined to believe that these statements are more accurate than
yours.
»15.07.06 - 22:02

merko

# 61
Order of the Butterfly

Posts: 328 from 2003/5/19

Oh, and one more thing: what is the relevance of an app copying 100 MB
of memory? What app except a benchmark utility would ever do this?
Naturally this makes sense if you are improving a memcopy routine,
with nothing else in mind. But what does it tell us about optimal
cache settings for real-world applications? Not a thing.
»15.07.06 - 22:04

SoundSquare

# 62
Paladin of the Pegasos

Posts: 1213 from 2004/12/1

From: Paris, France

the guy asked a simple question and you're all masturbating again... lol
»15.07.06 - 22:31

CISC

# 63
MorphOS Developer

Posts: 619 from 2005/8/27

From: the land with ...

I'm really getting tired of repeating my answers to you, please take some time to process what you read before you reply...

Quote:

The truth is that the test will measure
memory bandwidth as well as 1st-level and 2nd-level cache bandwidth.

I didn't bother looking at the code, but neither the graphs on your page nor your previously pasted figures show any such thing.

Quote:

The tests work on arrays of different sizes, from small ones to big ones, finally copying blocks of 80 MB each to another 80 MB region.
In that case it is of course 100% memory bandwidth that gets measured, as the Pegasos can never fit 80 MB in its 500 KB cache.

I'm not talking about the speed of the cache when I say "cache efficiency", I'm talking about how efficient your code (with the cache-modes of the areas you are manipulating) is with regard to the CPU's ability to keep the cache saturated.

Quote:

The G4 Pegasos under Linux, using a 32-bit copy loop, gets 700 MB/s.
The best possible copy loop on MOS (MorphOS copymem)
will score around 400 MB/s.

You are getting these figures by way of cache-prefetching though, hence my "cache efficiency" claims; this has nothing to do with "memory throughput". When you are cache-prefetching, the CPU will do the transfer in the background with the widest possible operation (in this case 64-bit), thus you are not really benchmarking 32-bit ops to memory, but rather cache (which has next-to-no impact) plus the time the CPU takes to prefetch.

If you want to generate some more realistic figures, try loading/storing random bits and pieces in memory with 32bit ops, that should throw the prefetching out of whack.
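
The kind of test suggested here could look roughly like the following sketch. It is only an illustration, not code from either side of this discussion: `random_touch` and `xorshift32` are hypothetical names, the index generator is Marsaglia's xorshift32, and timing is left to the caller (for example with `clock()` around the call).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Marsaglia's xorshift32 pseudo-random generator; the state must be non-zero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Perform `ops` 32-bit loads and stores at pseudo-random positions in
 * `buf` (which holds `words` 32-bit words). Because the addresses do
 * not follow a linear pattern, sequential prefetching cannot hide the
 * memory latency. Returns a checksum so the work cannot be skipped. */
uint32_t random_touch(uint32_t *buf, size_t words, size_t ops, uint32_t seed)
{
    assert(buf != NULL && words > 0 && seed != 0);
    uint32_t state = seed;
    uint32_t sum = 0;
    for (size_t i = 0; i < ops; i++) {
        size_t idx = xorshift32(&state) % words;
        sum += buf[idx];   /* 32-bit load from a random word */
        buf[idx] = sum;    /* 32-bit store back to the same word */
    }
    return sum;
}
```

For the numbers to say anything about memory rather than cache, the buffer would have to be much larger than the 2nd-level cache.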

Quote:

Question: Is 400 less than 700 ?

Sure, but it's not half (and you seem to repeatedly round MorphOS figures down and Linux figures up).

Quote:

Is it not true that MOS cannot get over 400 MB/s (cold cache)?

Not much more than that (on cached areas) at least, yes it's true.

Quote:

If MOS cannot get into the range of 700 MB/s then MOS is not utilizing the possible throughput of a G4!
(700 MB/s on Linux with 32-bit access!)

Again, you're neither measuring "memory throughput" nor 32-bit access here...

Quote:

So what is true and who is lying?

I don't think anyone is lying (you really are obsessed with this), I'm simply claiming you are mixing up your facts, be it due to incompetence or unwillingness, you choose...

Quote:

What is the best throughput you can achieve on MOS when copying, let's say, a block of 100 MB?
On MOS you can't even get close to Linux!

When you're talking about "cache efficiency" during a large copy you are right, same goes for 64-bit access to cached areas (same thing really); however, as far as "memory throughput" with 32-bit access goes, you are wrong.

Quote:

CISC, either acknowledge that on MOS you cannot achieve the same memory bandwidth and memory copying speed as on Linux, or come back with real results and show us how to achieve 700 MB/s when copying, for example, a block of 100 MB.

I've never disputed that memory-block copy on cached areas is faster in Linux, in fact I believe I mentioned this specifically as the downside of MorphOS' cache-mode, thus you will never achieve that kind of speed on cached areas.

- CISC
»16.07.06 - 08:00

magnetic

# 64
Yokemate of Keyboards

Posts: 2129 from 2003/3/1

From: Los Angeles

Big Gun
Although you seem like a knowledgeable guy, why do you insist on arguing with the actual developers of the OS about the OS? That seems rather foolish. Like a Mercedes mechanic arguing with a Mercedes engineer. Know what I mean?

magnetic
Pegasos 2 Rev 2B3 w/ Freescale 7447 "G4" @ 1ghz / 1gb Nanya Ram
Quad Boot: MorphOS 2.7 | Amiga OS4.1 U4 | Ubuntu PPC GNU/Linux | OS X 10.4
»17.07.06 - 07:42