Handling system calls without blocking the kernel [AVR]

smeezekitty · Post by **smeezekitty** » Tue Jul 04, 2017 6:31 pm

Hello,

I am currently developing an operating system that runs on AVR microcontrollers. The target platform is VERY low spec -- I am trying to keep the minimum hardware requirement of an Atmega328 with 32KB of ROM and 2 KB of RAM. From reading the forum, I gather that most people on here are x86 developers. But hopefully my issue is generic enough that the architecture isn't an issue.

Thus far, I have FAT16 file system support, an ELF loader, memory allocator and a scheduler all while using less than 1 KB of kernel space RAM.

I've hit a roadblock though: system calls.

I have done a lot of searching both here and on the general web as well as reading source code of Linux and other open source operating systems but yielding no solution.

In short: When calling a blocking system call, how does one avoid blocking the kernel?

The way I am handling system calls now is that the process calls a vector at a known address and the vector adds the system call to a queue and then goes into a while() loop until the call is completed. At the same time, the kernel task looks for any outstanding system calls in the queue. It loops through the queue and does them in sequence if there are any, marks them complete and puts the result into the queue buffer. The system call vector then sees the call is completed and returns to the process with the result.

The major problem is if a blocking call like read() is performed, it blocks the kernel task so no other system calls can be performed until the system call is completed.

I'm not calling the kernel functions directly from the vector because some of the kernel functions are relatively memory hungry and I was hesitant to run them in the calling processes' stack.

Any ideas?

Thank you

Brendan · Post by **Brendan** » Tue Jul 04, 2017 8:24 pm

Hi,

smeezekitty wrote:In short: When calling a blocking system call, how does one avoid blocking the kernel?

Assume you have some kind of device where the driver issues a request/command to the device (e.g. to read or write some sectors) and then a little later the device causes an IRQ to inform the device driver that its request has completed.

Now assume that your driver maintains some kind of list of pending requests; and when the device's IRQ occurs the driver:

Gathers status from the "just completed" request, and informs something else (e.g. file system code)
Finds the next "pending request" and asks the device to start doing that request

At any time you could add more requests to the driver's list of pending requests (and when this happens you check if there's no "currently in progress" request and tell the disk controller to start the request immediately).

Now assume that individual file systems (e.g. FAT, ext2, ISO9660, whatever) and the virtual file system layer all use similar techniques - e.g. they all work on "requests" and "replies", and they all keep track of "pending requests" and "in-progress requests". For example, from the VFS's perspective:

something sends a request to VFS asking it to fetch 20 KiB from a file, VFS remembers the details
VFS sends a request to FAT file system code asking it to fetch 20 KiB from a file
Sooner or later VFS receives reply from FAT file system, figures out what it's about (the details it remembered earlier) and sends reply back to whatever send the initial request

And from the FAT file system's perspective:

VFS sends a request to FAT file system fetch 20 KiB from a file, FAT file system remembers the details
FAT file system sends a request to storage device driver asking it to fetch 20 KiB from a hard disk
Sooner or later FAT file system receives reply from storage device driver, figures out what it's about (the details it remembered earlier) and sends reply back to VFS

And from the storage device driver's perspective:

FAT file system sends a request to storage device driver to fetch 20 KiB from a hard disk, storage device driver remembers the details
Sooner or later storage device driver sends a command to the disk controller asking it to fetch 20 KiB from a hard disk
Sooner or later storage device driver receives an IRQ from disk controller, figures out what it's about (the details it remembered earlier) and sends reply back to FAT file system

Finally; assume that the kernel API has a "read()" function, which actually does "send request to VFS and block this task" and when the VFS sends a reply back the kernel unblocks the task again (so that the task doesn't get any CPU time while it's waiting for its "read()" to complete). Of course kernel can also have a "read_async()" function which sends a request to the VFS but doesn't block the task.

Now imagine that there's 54321 tasks that are all calling "read()" at various times (and blocking and unblocking); but the VFS is able to have many requests in various states at the same time, and the FAT file system is able to have many requests in various states at the same time, and storage device drivers are able to have many requests in various states at the same time.

Next; imagine that there's multiple file systems (FAT, ext2, ISO9660, etc) and multiple disk controller drivers and/or storage devices. At any point in time VFS might have 1234 requests that it's keeping track of; where FAT might have 234 of the requests, ext2 might have 600 of the requests and ISO9660 might have 400 of the requests; and where SATA disk controller might have 834 of the requests and USB CD-ROM driver might have 400 of the requests.

Also imagine that the VFS layer does file data caching. Whenever kernel/task sends a request to VFS asking to read some data from a file, VFS checks its cache and if the data is in the cache it sends a reply back immediately; and if the data isn't in the cache it sends a request to a file system, and when the file system replies the VFS grabs the data from the FS's reply and puts it in the cache (before VFS sends its own reply back to kernel/task that made the original request). Whenever kernel/task sends a request to VFS asking to write some data to a file, VFS stores the new data in its cache then sends a request to a file system.

Also imagine that all of these requests have an "IO priority"; and that each layer can take this into account when figuring out which request to do next; so that important things happen sooner/faster (and less important things happen later/slower). Then imagine that (for swap space) the kernel may send requests directly to the storage device drivers and that these requests are "very high priority" (so that swap space reads/writes happen very quickly even when there's 1234 lower/normal priority requests waiting to be done).

Note that for a micro-kernel (e.g. where all of these pieces are separate processes, and sending requests and replies is done using IPC/message passing) it should be relatively easy to see how all this works. For a monolithic kernel it's basically the same; except that instead of using IPC you use direct function calls and function pointers. For example, to send a request to a file system the VFS might do "fs_control_data->handle_read_request_function( info );" and to send a reply back to VFS the file system might do "VFS_handle_read_reply ( info );". When direct function calls are being used like this you can cheat a little by passing the same "request info structure" from VFS to FS to storage driver (as requests are made at each level) and then from storage driver to FS to VFS (as replies come back from each level), so VFS and file systems don't necessarily need to keep track of the requests while they're "in progress at a lower layer".

Cheers,

Brendan

~ · Post by ~ » Tue Jul 04, 2017 9:32 pm

A routine in a program or a system call could be very big and could take a long time.

In your system library, you could implement routines as a set of smaller functions that call each other in some sequence. Then you will be able to return as soon as you can, system calls will be executed in a chunked way, so as long as you don't need to get locked to keep state consistent, you will have the chance in your system code not only to keep calling the sequence of functions for an API call, but also to attend other things, switch to other threads or processes.

smeezekitty · Post by **smeezekitty** » Wed Jul 05, 2017 1:03 am

Thank you for the information, Brendan. Unfortunately, implementing a full VFS system is just not practical for a system with such little RAM.
Minimalism is key for embedded systems.

I guess I am curious how the callback would be implemented. Would the lower layers call a function in the upper layer or is there some other method?

In your system library, you could implement routines as a set of smaller functions that call each other in some sequence.

Well the system routines are already broken down into subroutines. I suppose I could poll to service other system calls in the middle of some operations but it doesn't seem like the most elegant solution.

One of the biggest culprits of blocking is serial IO. I think I'll need to break down and implement buffered interrupt driven I/O. I'm trying to minimize buffering as much as possible but at some point it is counter productive. I want to get this blocking issue out of the way because I'm planning on adding an IP stack later.

Another idea I am thinking about trying is adding non-blocking versions of the system functions and just keeping the system call in the queue if it would need to block to satisfy the request. I just worry about the queue buffer filling up with outstanding requests.

Brendan · Post by **Brendan** » Wed Jul 05, 2017 1:26 am

Hi,

smeezekitty wrote:Thank you for the information, Brendan. Unfortunately, implementing a full VFS system is just not practical for a system with such little RAM.
Minimalism is key for embedded systems.

The full description I provided can be stripped back (by ignoring/discarding parts) as much or as little as you like (mostly depending on how "minimal" your final product is), all the way up to the point where the question doesn't make sense in the first place (e.g. the final product is so minimal that no device is capable of "DMA transfer with IRQ on completion").

Note that "embedded" ranges from tiny little micro-controllers in things like microwave ovens and washing machines (where no OS or file system makes sense), all the way up to smartphones (where it's not too different to a full blown desktop OS).

Cheers,

Brendan

~ · Post by ~ » Wed Jul 05, 2017 1:52 am

smeezekitty wrote:Well the system routines are already broken down into subroutines. I suppose I could poll to service other system calls in the middle of some operations but it doesn't seem like the most elegant solution.

Polling is not a key part of what I'm telling you.

What I mean is that if, instead of having a monolithic API function, you can separate in individual functions for very well defined parts (MASTER FUNCTION --> STEP 0, STEP 1, STEP 2).

With that separation of steps, you can multithread. You can make individual steps look like extensions to the instruction set if you make them based on minimal optimized logic.

You can ensure that each individual step is either blocking or non-blocking, just like with atomic (lockable) and non-atomic (non-locking but optionally lockable) instructions.

Then, you can return from each step as soon as it's completed. You can attend other tasks in a multithreaded way, with execution states separated for each of the tasks you want to perform. The APIs would just make use of the state you pass to them.

But you don't need polling. You could make it in an interrupt-driven way or based in a main loop.

You can keep a pointer for the next step that you need to run that a master function can use.

There would be a master function for each API function, and that master function would call its STEP functions as it needed. There could be many common STEP functions, like generic instructions for generic tasks, but instructions implemented by you through functions.

The master function would be sort of a fetcher for the instructions of its own logic task.

linuxyne · Post by **linuxyne** » Wed Jul 05, 2017 3:31 am

smeezekitty wrote: The major problem is if a blocking call like read() is performed, it blocks the kernel task so no other system calls can be performed until the system call is completed.

The kernel task need not wait for the read() to complete. It can record the fact that a process is waiting to read a device, and move on with other duties, until such a time arrives when the device does have data to send (signalled by an interrupt) to the process, at which point the kernel satisfies the request. Similar approach applies to other calls, as well as to other situations which demand a decoupling of requests and responses. A framework, which allows tasks to wait for events and be alerted when they arrive, is needed.

smeezekitty wrote: I just worry about the queue buffer filling up with outstanding requests.

If that occurred during a typical run of the system, wouldn't that prove that the software is improperly designed, or that the hardware itself lacks enough resources?

Korona · Post by **Korona** » Wed Jul 05, 2017 6:25 am

Most OSes (e.g. Linux) for more powerful systems maintain one kernel stack per task and suspend this thread during the read(). They usually use IRQs to wait for read()s to complete. It seems that doing that is not an option in your situation.

Some questions you should work out are:

Can you afford multiple threads? Do you have enough RAM for their stacks? If the answer is no, a synchronous read() system call does not make sense.
Do you need to support more than one concurrent I/O at all? Maybe your use case only requires one operation at a time, which simplifies the design and saves valuable ROM space.
Does a kernel-userspace separation make sense in your resource constrained environment? Do you need system calls at all? Maybe it makes more sense to integrate the drivers into your application code as libraries.

One thing that you might consider is asynchronous I/O: Instead of having a blocking read() you could have a initiate_read() function that starts the read request and a poll_progress() function that performs any polling and returns the read operations that have already completed. Using such a model a single thread is able to perform multiple concurrent I/O operations, e.g. by calling initiate_read() multiple times and then calling poll_progress() in a loop until all operations have been completed. This approach also fits nicely into your system-call-queue code.

smeezekitty · Post by **smeezekitty** » Wed Jul 05, 2017 3:05 pm

With that separation of steps, you can multithread. You can make individual steps look like extensions to the instruction set if you make them based on minimal optimized logic.

You can ensure that each individual step is either blocking or non-blocking, just like with atomic (lockable) and non-atomic (non-locking but optionally lockable) instructions.

Then, you can return from each step as soon as it's completed. You can attend other tasks in a multithreaded way, with execution states separated for each of the tasks you want to perform. The APIs would just make use of the state you pass to them.

But you don't need polling. You could make it in an interrupt-driven way or based in a main loop.

You can keep a pointer for the next step that you need to run that a master function can use.

Okay. That makes a lot more sense. I'll have to give it some thought.

The kernel task need not wait for the read() to complete. It can record the fact that a process is waiting to read a device, and move on with other duties, until such a time arrives when the device does have data to send (signalled by an interrupt) to the process, at which point the kernel satisfies the request. Similar approach applies to other calls, as well as to other situations which demand a decoupling of requests and responses. A framework, which allows tasks to wait for events and be alerted when they arrive, is needed.

Well as of now I don't have a way to call back the tasks but they are stopped while waiting for the call to be serviced. It's kind of a head scratcher to me how to handle callbacks.

Can you afford multiple threads? Do you have enough RAM for their stacks? If the answer is no, a synchronous read() system call does not make sense.

Well, yes and no. I have enough RAM to do multithreading. But not enough for an additional stack per processes. At that point I would just risk running the kernel routine in the process stack.

Do you need to support more than one concurrent I/O at all? Maybe your use case only requires one operation at a time, which simplifies the design and saves valuable ROM space

"Sort of". I probably don't need to be able to handle more than one request to the filesystem simultaneously (it's an SD card so it responds pretty fast). But I don't want to block all I/O waiting for a stream read/write (say serial or network)

Does a kernel-userspace separation make sense in your resource constrained environment? Do you need system calls at all? Maybe it makes more sense to integrate the drivers into your application code as libraries.

Yes. Because if the kernel has the I/O code that it will need anyway to do things like load ELF files anyway, it would use too much flash/RAM to duplicate all that code in each process that is running. Besides there would be the risk of having different processes sending conflicting instructions to an I/O device

While I'm at it, what is the best method to pass argc/argv to main() of a starting process. Do I just stick it in the stack?

~ · Post by ~ » Wed Jul 05, 2017 4:09 pm

For an actual callback, you can standardize the structure of the callback, call an API function including a function pointer to your callback function, then let the API function do whatever it does and call the callback function you passed with a pointer. If the pointer is 0 or NULL, then no callback should be called.

For passing parameters, you could pass a pointer to a system structure that would contain all of the data the program could need at startup time. You could pass that pointer in a register, the stack or a known system variable (more difficult), or resolve the pointer into a header field of your executable intended to contain that system parameter structure.

You could also massively export kernel functions with an index or some constant, and then have the system dynamically link those functions in another list of function pointers provided by the application. It would only need to copy the address of the function to the corresponding program import.

smeezekitty · Post by **smeezekitty** » Thu Jul 13, 2017 3:55 pm

I solved this problem by checking if the action can be performed without blocking and if it cannot, leaving the call in the queue and checking it again next pass. The system calls are somewhat slower than I'd like but for now it works well enough.

I also implemented arguments as well as exit statuses. That was fun

OSDev.org

Handling system calls without blocking the kernel [AVR]

Handling system calls without blocking the kernel [AVR]

Re: Handling system calls without blocking the kernel [AVR]

Re: Handling system calls without blocking the kernel [AVR]

Re: Handling system calls without blocking the kernel [AVR]

Re: Handling system calls without blocking the kernel [AVR]

Re: Handling system calls without blocking the kernel [AVR]

Re: Handling system calls without blocking the kernel [AVR]

Re: Handling system calls without blocking the kernel [AVR]

Re: Handling system calls without blocking the kernel [AVR]

Re: Handling system calls without blocking the kernel [AVR]

Re: Handling system calls without blocking the kernel [AVR]