Improve NetStack on MorphOS?
  • MorphOS Developer
    Nadir
    Posts: 170 from 2003/3/17
    In what sense do you want to improve it? I have had a version running for a few years that is based on a much newer BSD stack. Unfortunately, there is a memtrash issue that is hard to track down, and I haven't had the energy to make a new attempt in a long time.

    Also, the source is closed and restricted to the MorphOS team, so it's not possible for others to work on it.
  • »17.03.25 - 12:52
    Profile
  • MorphOS Developer
    jacadcaps
    Posts: 3182 from 2003/3/5
    From: Canada
    - Read through the NetBSD networking code. See if you can improve that, or make it faster, as part of your learning curve.
    - Get AmiTCP source code (https://aros.aminet.net/package/comm/net/AmiTCP-src-30b2)
    - Merge current NetBSD bits onto the AmiTCP code base

    That's roughly how MorphOS' NetStack was created.

    There are two ways it can be improved: by updating the code as outlined above, or by redesigning the SANA-II interface. The latter would be impossible for you, since it means updating all the closed-source network card drivers, but nothing stops you from doing the former.
  • »17.03.25 - 13:13
    Profile Visit Website
  • Moderator
    Kronos
    Posts: 2380 from 2003/2/24
    As Nadir pointed out, it is closed source controlled by the MorphOS team and you would need to give them a good reason/plan to give you access.

    Which brings us to the real question, do you even know what you want to change/improve?

    As far as my understanding goes the underlying problem is that the stack adheres to the same driver model and uses the same user facing API as AmiTCP which is forcing it to do simple things in a complicated and slow way.

    So what one would need is a new stack with new drivers and a new API, updated apps using that API, and some bsdsocket wrapper for those apps that can't or won't be updated.

    That would require not only a lot of work but also very good planning, to make sure these new APIs can be left standing far into the future.
  • »17.03.25 - 13:14
    Profile
  • Acolyte of the Butterfly
    Georg
    Posts: 121 from 2004/4/7
    Quote:

    Nadir wrote:
    Unfortunately there is a memtrash issue which is hard to track down


    In AROS I used to temporarily modify things such that mungwall memory check of all the allocated memory happened on each and every task switch. Helped. Mem allocation routines were changed to use Disable()/Enable(), etc.

    Also AvailMem(MEMF_CLEAR) was made to trigger the same mungwall memory check of all allocated memory in the system. So calls to this were put all over the (suspect) places to make it more and more difficult for the memtrashing code to hide.
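
    A minimal, portable sketch of the "check every allocation's walls on demand" idea described above. It is not the AROS or MorphOS implementation: the names tracked_alloc() and check_all_walls(), the wall size and the fill byte are all assumptions made for illustration, and freeing is left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define WALL_SIZE 16    /* guard bytes before and after each payload (assumed) */
#define WALL_FILL 0xDB  /* arbitrary fill pattern chosen for this sketch */

/* Bookkeeping header placed in front of every tracked allocation. */
struct tracked_block {
    struct tracked_block *next;
    size_t size;        /* payload size requested by the caller */
};

static struct tracked_block *all_blocks = NULL;

static unsigned char *front_wall(struct tracked_block *b)
{
    return (unsigned char *)(b + 1);
}

static unsigned char *payload(struct tracked_block *b)
{
    return front_wall(b) + WALL_SIZE;
}

static unsigned char *back_wall(struct tracked_block *b)
{
    return payload(b) + b->size;
}

/* Allocate memory surrounded by two filled guard walls and remember it. */
void *tracked_alloc(size_t size)
{
    assert(size > 0);
    struct tracked_block *b = malloc(sizeof *b + 2 * WALL_SIZE + size);
    if (b == NULL)
        return NULL;
    b->size = size;
    b->next = all_blocks;
    all_blocks = b;
    memset(front_wall(b), WALL_FILL, WALL_SIZE);
    memset(back_wall(b), WALL_FILL, WALL_SIZE);
    return payload(b);
}

static int wall_intact(const unsigned char *wall)
{
    for (size_t i = 0; i < WALL_SIZE; i++)
        if (wall[i] != WALL_FILL)
            return 0;
    return 1;
}

/* Walk every live allocation and return the number of damaged blocks. */
size_t check_all_walls(void)
{
    size_t damaged = 0;
    for (struct tracked_block *b = all_blocks; b != NULL; b = b->next)
        if (!wall_intact(front_wall(b)) || !wall_intact(back_wall(b)))
            damaged++;
    return damaged;
}
```

    The more often check_all_walls() is called (on every task switch in the setup described above), the narrower the window in which an under/overflow can go unnoticed.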
  • »17.03.25 - 14:42
    Profile
  • MorphOS Developer
    Nadir
    Posts: 170 from 2003/3/17
    Quote:

    Kronos wrote:
    As far as my understanding goes the underlying problem is that the stack adheres to the same driver model and uses the same user facing API as AmiTCP which is forcing it to do simple things in a complicated and slow way.



    Yes, the driver model is not good and would need to be completely overhauled. As was pointed out, this is something that would require very careful planning and in-depth knowledge of the way the BSD stack works, as well as an understanding of our current driver interface and access to the source of existing device drivers (unless one wants to rewrite them completely). It is definitely not an easy project!

    There are also a lot of other things that are outdated in the current stack. Just one example: TCP congestion control algorithms need to work differently/be more advanced in a wifi network. When the BSD stack that we use was written, wifi was still in its infancy. There are also lots of other things that work much better in a modern stack, so either a new BSD stack should be ported over and integrated or one should write a new one from scratch :-) Even the first option is a major undertaking, since modern stacks have changed, are much more complex internally, and rely on a lot of kernel APIs that we don't have readily available.
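
    To illustrate the congestion control point, here is a rough sketch of a classic loss-based window update in the style of TCP Reno. It is not taken from NetStack or any BSD source; the struct and function names are hypothetical and the initial window is an arbitrary choice. The thing to notice is that every detected loss is treated as congestion, which is a poor fit for a wireless link where packets can also be lost to interference.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-connection congestion state for this sketch. */
struct cc_state {
    uint32_t mss;       /* maximum segment size in bytes */
    uint32_t cwnd;      /* congestion window in bytes */
    uint32_t ssthresh;  /* slow-start threshold in bytes */
};

void cc_init(struct cc_state *s, uint32_t mss)
{
    assert(mss > 0);
    s->mss = mss;
    s->cwnd = 2 * mss;         /* small initial window, chosen for the sketch */
    s->ssthresh = UINT32_MAX;  /* begin in slow start */
}

/* Called for each ACK that acknowledges new data. */
void cc_on_ack(struct cc_state *s)
{
    if (s->cwnd < s->ssthresh) {
        /* Slow start: grow by one segment per ACK. */
        s->cwnd += s->mss;
    } else {
        /* Congestion avoidance: roughly one segment per round trip. */
        uint32_t inc = (uint32_t)(((uint64_t)s->mss * s->mss) / s->cwnd);
        s->cwnd += inc > 0 ? inc : 1;
    }
}

/* Called when a loss is detected: halve the window, with a two-segment floor. */
void cc_on_loss(struct cc_state *s)
{
    uint32_t half = s->cwnd / 2;
    uint32_t min_window = 2 * s->mss;
    s->ssthresh = half > min_window ? half : min_window;
    s->cwnd = s->ssthresh;
}
```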

    If someone is seriously interested in working on this, I recommend carefully reading the TCP/IP Illustrated books, particularly Volume II. That's what I did when I last updated our NetStack (in 2012 or so, I think). Unfortunately, I don't think Volume II of that book series has been updated since 1995, which makes it much harder to tackle the port of a really modern stack, as I myself discovered after spending way too much time on it.

    [ Edited by Nadir 17.03.2025 - 14:50 ]
  • »17.03.25 - 14:47
    Profile
  • Caterpillar
    ChrisC
    Posts: 21 from 2025/3/1
    Nothing to lose giving the guy a shot, can he not have the stack source anyway since it is from the BSD stack?
    Power Mac G5 11,2 Dual 2GHz
    ATI X1950 Pro 256MB
    2GB RAM
    500GB Storage
    MorphOS 3.19 (Licenced)
  • »17.03.25 - 14:54
    Profile
  • MorphOS Developer
    Nadir
    Posts: 170 from 2003/3/17
    Quote:

    Georg wrote:
    Quote:

    Nadir wrote:
    Unfortunately there is a memtrash issue which is hard to track down


    In AROS I used to temporarily modify things such that mungwall memory check of all the allocated memory happened on each and every task switch. Helped. Mem allocation routines were changed to use Disable()/Enable(), etc.

    Also AvailMem(MEMF_CLEAR) was made to trigger the same mungwall memory check of all allocated memory in the system. So calls to this were put all over the (suspect) places to make it more and more difficult for the memtrashing code to hide.




    Yes, I agree it should be possible to make progress with such techniques but I ran out of energy and last looked at this about 6 years ago when I got my second child :-)
  • »17.03.25 - 14:55
    Profile
  • MorphOS Developer
    Nadir
    Posts: 170 from 2003/3/17
    Quote:

    ChrisC wrote:
    Nothing to lose giving the guy a shot, can he not have the stack source anyway since it is from the BSD stack?


    The BSD stack is of course readily available, but the MorphOS layers are closed source. As jaca mentioned, one can get a good idea from the AmiTCP code based on the BSD stack or from the AROS stack, which likely can be ported to MorphOS quite easily. I think Papiosaur is seriously underestimating what is needed to make progress in this area. If it was easy, we would have done this already years ago.

    [ Edited by Nadir 17.03.2025 - 15:03 ]
  • »17.03.25 - 15:02
    Profile
  • Caterpillar
    ChrisC
    Posts: 21 from 2025/3/1
    Understood. It has taken the RISC OS Open devs years to get a newer stack on RISC OS 5, so I get it. I am sure he will too when he starts reading up. Nice to have someone so eager to develop for the platform, though.
    Power Mac G5 11,2 Dual 2GHz
    ATI X1950 Pro 256MB
    2GB RAM
    500GB Storage
    MorphOS 3.19 (Licenced)
  • »17.03.25 - 15:06
    Profile
  • Priest of the Order of the Butterfly
    eliot
    Posts: 631 from 2004/4/15
    IPv6 would be nice to have …
    regards
    eliot
  • »17.03.25 - 16:31
    Profile
  • MorphOS Developer
    Nadir
    Posts: 170 from 2003/3/17
    That should in principle be possible with the newer stack that I got partially running but it would also need an extension of the native APIs
  • »17.03.25 - 16:42
    Profile
  • Yokemate of Keyboards
    Andreas_Wolf
    Posts: 12281 from 2003/5/22
    From: Germany
    > the MorphOS layers are closed source. [...] one can get a good
    > idea from the AmiTCP code based on the BSD stack or from the
    > AROS stack which likely can be ported to MorphOS quite easily.

    Maybe having a look at the MOSNet source code (which is based on AmiTCP v3) and playing around with that can make sense to get acquainted with TCP/IP stack technology. And it's already running on MorphOS...
  • »17.03.25 - 19:53
    Profile
  • MorphOS Developer
    Piru
    Posts: 593 from 2003/2/24
    From: finland, the l...
    Quote:

    Georg wrote:
    In AROS I used to temporarily modify things such that mungwall memory check of all the allocated memory happened on each and every task switch. Helped. Mem allocation routines were changed to use Disable()/Enable(), etc.

    Also AvailMem(MEMF_CLEAR) was made to trigger the same mungwall memory check of all allocated memory in the system. So calls to this were put all over the (suspect) places to make it more and more difficult for the memtrashing code to hide.



    MorphOS 3.16 introduced built-in "memtrack" functionality that in many ways surpasses what Mungwall or Wipeout does. In fact, all MorphOS Beta builds have the memtrack feature enabled by default so that buggy code is detected early. In production builds, this memtrack feature can be enabled via the boot option "ed=permmemtrack".

    However, there are certain types of memory corruption issues that evade simple checks like this. This network stack issue sounds like one of those cases.
  • »18.03.25 - 14:59
    Profile
  • Acolyte of the Butterfly
    Georg
    Posts: 121 from 2004/4/7
    Quote:

    Piru wrote:
    MorphOS 3.16 introduced built-in "memtrack" functionality that in many ways surpasses what Mungwall or Wipeout does. In fact, all MorphOS Beta builds have the memtrack feature enabled by default so that buggy code is detected early. In production builds, this memtrack feature can be enabled via the boot option "ed=permmemtrack".



    In AROS it's built in as well. It's just called "mungwall", but it's not really the AOS-mungwall.

    Quote:


    However, there are certain types of memory corruption issues that evade simple checks like this. This network stack issue sounds like one of those cases.


    That's when you have to use other tricks as said (like make it check things at each task switch), but on MOS it may be difficult/impossible to temporarily hack such things in for normal non-mos-kernel coders.

    The hosted version of AROS also can be run in gdb and you have complete symbols for all the OS, the libs, all the apps running, etc. The hosted OS is just like one big (Linux) program with user level threading (the Exec Tasks). It does not matter if something happens in Forbid/Disable state or if there is a deadlock or whatever. You can always CTRL-Z into the debugger. And then see symbolized backtraces of the current tasks (or other tasks in ready/wait queue).

    For mem trashes you can use (hardware) watchpoints if you know what is trashed. You can use gdb tricks to install/remove such watchpoints dynamically/on the fly. For example, tell gdb to add a breakpoint in function xyz() and then to execute some commands whenever this breakpoint is reached. This could be ~"add hw watch point to xyz() param "taglist" + 1234, add another breakpoint at end of function, continue running, remove hw watch point again". This is for cases where a mem trash happens while inside xyz(), but only 1 out of 1000 times it gets called.

    Sometimes very random memtrashes (for example, when you cannot pinpoint the crash to a specific task/function even with tricks like the above) may be caused by "locking" bugs, such as TDNestCnt going "bad" because of mismatched Forbid()/Permit() calls. If Permit() is called too often somewhere (without a matching Forbid()), the task's later code inside Forbid()/Permit() protection will no longer be protected. If Forbid() is called too often, it may go unnoticed for a long time (because Wait() still breaks the forbid state), but if TDNestCnt overflows there will be one time where code inside a Forbid() block is not protected.
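
    The two failure modes above can be shown with a small simulation. This is only a model, not exec.library code: it assumes the nest counter behaves like the classic signed 8-bit TDNestCnt field, where -1 means "not forbidden" and any value >= 0 means "forbidden". All sim_* names are made up for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated nest counter; -1 means task switching is allowed. */
static int nest_cnt = -1;

/* Wrap a value into [-128, 127], the way a signed 8-bit field would. */
static int wrap8(int v)
{
    return (int)((unsigned)(v + 128) & 0xFFu) - 128;
}

void sim_reset(void)  { nest_cnt = -1; }
void sim_forbid(void) { nest_cnt = wrap8(nest_cnt + 1); }
void sim_permit(void) { nest_cnt = wrap8(nest_cnt - 1); }

/* True while the simulated task is protected from task switches. */
bool sim_is_forbidden(void)
{
    return nest_cnt >= 0;
}
```

    In this model, one unmatched sim_permit() takes the counter to -2, so the next sim_forbid() only brings it back to -1 and the "protected" section runs unprotected. Likewise, 129 unmatched sim_forbid() calls from the initial state wrap the counter from 127 to -128, which again reads as "not forbidden".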
  • »18.03.25 - 17:17
    Profile
  • MorphOS Developer
    Piru
    Posts: 593 from 2003/2/24
    From: finland, the l...
    Quote:

    Georg wrote:
    Quote:

    Piru wrote:
    However, there are certain types of memory corruption issues that evade simple checks like this. This network stack issue sounds like one of those cases.


    That's when you have to use other tricks as said (like make it check things at each task switch), but on MOS it may be difficult/impossible to temporarily hack such things in for normal non-mos-kernel coders.


    Of course.

    Technically, adding something like this is not difficult at all, but it still only helps with a specific kind of corruption. This doesn't work in cases where the corruption occurs to valid, allocated memory. This is the worst kind of corruption and extremely difficult to track.

    Of course, in most cases the corruption is of a simple kind (an under/overflow of some sort) and relatively easy to detect.
  • »18.03.25 - 21:48
    Profile
  • Acolyte of the Butterfly
    Georg
    Posts: 121 from 2004/4/7
    Quote:

    Georg wrote:
    Forbid()



    Btw, in these old AmiTCP 3.0b2 sources they do use Forbid()/Permit() in normal (non-debug) builds for this spl() thing in "synch.h". Only in debug builds do they use semaphores. This will fail if some change ever introduces something that directly or indirectly causes the Forbid state to be broken by a Wait() call.

    In AROS it seems to have been changed to always use semaphores instead. Maybe also because the default Forbid/Permit spl() version differs from the original sources from Aminet in that it had its own spl_variable (instead of abusing SysBase->TDNestCnt directly), and this looks buggy, broken and unsafe.
  • »19.03.25 - 07:21
    Profile
  • Acolyte of the Butterfly
    Georg
    Posts: 121 from 2004/4/7
    Quote:

    Piru wrote:
    This doesn't work in cases where the corruption occurs to valid, allocated memory.



    Depends. If you find a way to make it trash a specific place (not always, maybe just 1 in 100 boots). A place like a window list, or msg list, or gadget text, or a taglist, or just some fixed string in memory. Then - if you can't use watchpoints - add calls to a function all over the place which validates this thing.

    The cause will likely have something to do with locking/synchronization. So one could also check if the problem disappears with extended or additional locking (preventing more stuff from possibly running concurrently). Or with an intentional mismatched Forbid() at the beginning of a task, to make that task work in "cooperative task switching" mode instead of "preemptive task switching" mode.
  • »19.03.25 - 07:39
    Profile
  • MorphOS Developer
    Piru
    Posts: 593 from 2003/2/24
    From: finland, the l...
    Quote:

    Georg wrote:
    Quote:

    Piru wrote:
    This doesn't work in cases where the corruption occurs to valid, allocated memory.



    Depends. If you find a way to make it trash a specific place (not always, maybe just 1 in 100 boots). A place like a window list, or msg list, or gadget text, or a taglist, or just some fixed string in memory. Then - if you can't use watchpoints - add calls to a function all over the place which validates this thing.


    Yes, of course in this situation you'd use data breakpoints. The issue with this is that 99 out of 100 boots you get false alarms when something else hits the address. The data breakpoint works only when you have a consistent address for the out-of-bounds write. I've only used such debugging successfully maybe twice in over 20 years. YMMV of course, but this is my experience.

    Quote:

    The cause will likely have something to do with locking/synchronization. So one could also check if the problem disappears with extended or additional locking (preventing more stuff from possibly running concurrently). Or with an intentional mismatched Forbid() at the beginning of a task, to make that task work in "cooperative task switching" mode instead of "preemptive task switching" mode.


    It's just impossible to know. This is an especially difficult problem with a complicated component like a TCP/IP stack, which runs a process that handles the processing itself, has a bsdsocket.library that interfaces with user applications via a per-task context, and also communicates with the network driver over the SANA-II device API. There are multiple moving parts, so testing this is very painful. Another issue is that changing the way the locking works may indeed make the bug disappear, but not make it any easier to identify the bug. The interactions are so complicated that this may not help at all.
  • »19.03.25 - 09:23
    Profile
  • Acolyte of the Butterfly
    Georg
    Posts: 121 from 2004/4/7
    Quote:

    Piru wrote:

    The issue with this is that 99 out of 100 boots you get false alarms when something else hits the address.


    I meant this (an example for when you do *not* have the option/possibility to use watchpoints for debugging):

    For example you find out that the mem trash sometimes (even if just 1 in 100 boots) causes crash, because the node in some list (maybe messages of a msgport) ends up containing some odd address in some field. Like some IntuiMessage->Window of some Ambient window being 0x77777777.

    For debugging you therefore create some temporary validation function which you can easily call from everywhere (for example by "hiding" it in exec.library/SetFunction()).

    Code:

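    /* Debug-only patch (pseudocode): SetFunction() doubles as a validation
       hook when called with the magic library value 0x11111111. */
    SetFunction(library, funcoffset, newfunction):
        if (library == 0x11111111)
        {
            Forbid
            foreach ambient window
            {
                for each msg in window userport
                {
                    if (isodd(msg->some_field_which_shouldnt_be_odd))
                    {
                        /* newfunction carries the checkpoint label string */
                        kprintf("Validation failed! %s\n", (char *)newfunction);
                    }
                }
            }
            Permit
        }
        else
        {
            ... real setfunction
        }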


    Then all over the place (not just buggy software itself) add tons of calls:

    Code:

    ...
    some code A
    ...
    SetFunction(0x11111111, 0, "1 Memtrash!");
    ...
    some code B
    ...
    SetFunction(0x11111111, 0, "2 Memtrash!");
    ...
    some code C
    ...
    SetFunction(0x11111111, 0, "3 Memtrash!");
    ...


    Then if "2 Memtrash!" appears in the debug output, it gives a hint where the mem trash may be, provided there was no "1 Memtrash!" beforehand. To be sure, you may need to verify again with some of the code enclosed in Forbid/Permit, to rule out other task involvement. If possible, it would be best to make this validation call on every task switch. Are tc_Switch and tc_Launch supported in MOS?

    Even if the memtrash in that place (in this example: msg in userport of window) only happens 1 out of 100 boots it doesn't matter. You just reboot, until it happens. Maybe a bit slow/annoying if you cannot reboot the whole OS in less than a second and script the whole procedure ...
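
    The checkpoint idea can also be sketched in portable C, outside of any Amiga-style OS. Everything here is hypothetical: struct msg_node stands in for whatever list gets trashed, the "odd pointer" test is just the example symptom from above, fprintf() replaces kprintf(), and there is no Forbid()/Permit() equivalent, so the sketch assumes single-threaded use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the structure that ends up trashed. */
struct msg_node {
    struct msg_node *next;
    void *window;  /* expected to be an even (aligned) address */
};

static struct msg_node *watched_list = NULL;  /* list under suspicion */
static const char *first_failure = NULL;      /* label of first failing checkpoint */

/* Register the list to validate and clear any earlier result. */
void watch_list(struct msg_node *head)
{
    watched_list = head;
    first_failure = NULL;
}

/* Returns 1 if every node looks sane, 0 otherwise.
   Only the first failing checkpoint label is recorded and printed. */
int validate_checkpoint(const char *label)
{
    for (struct msg_node *n = watched_list; n != NULL; n = n->next) {
        if (((uintptr_t)n->window & 1u) != 0) {
            if (first_failure == NULL) {
                first_failure = label;
                fprintf(stderr, "Validation failed! %s\n", label);
            }
            return 0;
        }
    }
    return 1;
}

/* Label of the first checkpoint that saw the corruption, or NULL. */
const char *first_failed_checkpoint(void)
{
    return first_failure;
}
```

    Sprinkling validate_checkpoint("1 Memtrash!"), validate_checkpoint("2 Memtrash!") and so on between suspect code sections then narrows the trash down to the code between the last passing and the first failing label.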
  • »19.03.25 - 14:13
    Profile
  • MorphOS Developer
    Piru
    Posts: 593 from 2003/2/24
    From: finland, the l...
    Quote:

    Georg wrote:
    For example you find out that the mem trash sometimes (even if just 1 in 100 boots) causes crash, because the node in some list (maybe messages of a msgport) ends up containing some odd address in some field.


    I've never seen a bug this consistent to be honest.
  • »19.03.25 - 17:06
    Profile