A simple question

Posted: Wed Sep 03, 2008 6:12 am
by Jeko
I'm writing syscalls for my operating system. Now I'm writing the syscalls that handle the virtual file system, and I have a small problem:
file names can be 256 bytes long, but also 12 bytes, or 14, or 15; that is, any length.
The problem occurs when a user process calls the readdir syscall.
This syscall opens a directory and reads a specific entry of the directory. The problem is how to return the name of the file to the user process.
If the function returns a string allocated in kernel memory, the user process can't access the string. If the user allocates some memory and passes the address to readdir, readdir can only return a filename up to a specific length, because the user process can't know how long the filename is when it allocates the memory.

How can I solve this problem? I've seen that other operating systems set a maximum of, for example, 256 characters, but I want filenames to be of any length.

Re: A simple question

Posted: Wed Sep 03, 2008 9:34 am
by Solar
Jeko wrote:In fact, if the function returns a string allocated in kernel memory, the user process can't access the string.
Map the result into user address space as read-only?

Make the user provide the memory, return zero if successful, and return the amount of memory required if what the user provided isn't long enough?
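
A minimal sketch of that second suggestion, in C. The helper name copy_name_to_user and its parameters are hypothetical, and a plain memcpy stands in for whatever checked copy-to-user routine a real kernel would need after validating the user pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical kernel-side helper (name and signature are assumptions).
 * Copies the NUL-terminated `name` into the caller-provided buffer.
 * Returns 0 on success, otherwise the number of bytes the caller has to
 * provide, including the terminating NUL. Nothing is written on failure. */
size_t copy_name_to_user(const char *name, char *buf, size_t buf_size)
{
    assert(name != NULL);

    size_t needed = strlen(name) + 1;
    if (buf == NULL || buf_size < needed)
        return needed;          /* caller retries with a larger buffer */

    /* A real kernel would validate `buf` and use a checked copy here. */
    memcpy(buf, name, needed);
    return 0;
}
```

Passing a NULL buffer doubles as a pure size query, so the caller can ask for the length first if it prefers.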

Re: A simple question

Posted: Wed Sep 03, 2008 9:47 am
by Jeko
Solar wrote:
Jeko wrote:In fact, if the function returns a string allocated in kernel memory, the user process can't access the string.
Map the result into user address space as read-only?
With this method I must allocate memory for each readdir call. Isn't this a waste of time? But I think it's the only valid method...
However, if I allocate, for example, 8 bytes from the user heap, do I use an entire page for an 8-byte filename?
Solar wrote:Make the user provide the memory, return zero if successful, and return the amount of memory required if what the user provided isn't long enough?
With this method the user must allocate memory, call the function, allocate more memory, and call the function a second time. The other method is better.

Re: A simple question

Posted: Wed Sep 03, 2008 9:08 pm
by AndrewAPrice
Look how Microsoft do it. :)

e.g.
ReturnSomeKernelString(char *buffer, uint sizeOfBuffer);

Re: A simple question

Posted: Wed Sep 03, 2008 11:58 pm
by Solar
Jeko wrote:
Solar wrote:Make the user provide the memory, return zero if successful, and return the amount of memory required if what the user provided isn't long enough?
With this method the user must allocate memory, call the function, allocate more memory, and call the function a second time.
If he did this the first time around, he'd be able to catch 99% of all filename cases:

Code:

#include <stdio.h>   /* FILENAME_MAX */
#include <stdlib.h>  /* malloc */

...
char * buffer = malloc( FILENAME_MAX );
...
Oh, while we're at it - are your filenames in ASCII (if yes, which codepage?), or Unicode (if yes, which encoding?)?
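
Going back to the buffer question: a rough sketch of the "guess first, retry only if needed" pattern, in C. The function fake_readdir_name is a stand-in invented for this example (it is not a real syscall), and read_entry_name is a hypothetical wrapper. The initial size is a parameter only so the retry path is easy to exercise; in practice it would be FILENAME_MAX.

```c
#include <assert.h>
#include <stdio.h>    /* FILENAME_MAX */
#include <stdlib.h>
#include <string.h>

/* Stand-in for the real syscall (an assumption for this sketch): fills buf
 * and returns 0, or returns the required size if buf is too small. */
static const char *fake_entry = "notes.txt";

static size_t fake_readdir_name(char *buf, size_t buf_size)
{
    size_t needed = strlen(fake_entry) + 1;
    if (buf_size < needed)
        return needed;
    memcpy(buf, fake_entry, needed);
    return 0;
}

/* Returns a malloc'ed name that the caller must free, or NULL if an
 * allocation fails. Only a too-small first guess costs a second call. */
char *read_entry_name(size_t first_guess)
{
    assert(first_guess > 0);

    size_t size = first_guess;
    char *buf = malloc(size);
    if (buf == NULL)
        return NULL;

    size_t needed;
    while ((needed = fake_readdir_name(buf, size)) != 0) {
        char *bigger = realloc(buf, needed);
        if (bigger == NULL) {
            free(buf);
            return NULL;
        }
        buf = bigger;
        size = needed;
    }
    return buf;
}
```

A call such as read_entry_name(FILENAME_MAX) succeeds on the first attempt for any name that fits, so the second allocation is only paid for unusually long names.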

Re: A simple question

Posted: Thu Sep 04, 2008 12:32 am
by Jeko
Solar wrote:
Jeko wrote:With this method the user must allocate memory, call the function, allocate more memory, and call the function a second time.
If he did this the first time around, he'd be able to catch 99% of all filename cases:

Code:

#include <stdio.h>   /* FILENAME_MAX */
#include <stdlib.h>  /* malloc */

...
char * buffer = malloc( FILENAME_MAX );
...
Oh, while we're at it - are your filenames in ASCII (if yes, which codepage?), or Unicode (if yes, which encoding?)?
They are ASCII, but I will support Unicode.
(What are codepages, and what are encodings?)

Re: A simple question

Posted: Thu Sep 04, 2008 12:33 am
by Jeko
MessiahAndrw wrote:Look how Microsoft do it. :)

e.g.
ReturnSomeKernelString(char *buffer, uint sizeOfBuffer);
So a filename may be returned only partially: if sizeOfBuffer is smaller than the length of the filename, only part of the filename is put in the buffer.

Re: A simple question

Posted: Thu Sep 04, 2008 12:47 am
by Solar
Jeko wrote:(What are codepages, and what are encodings?)
"Codepage" is DOS vocabulary; more generally speaking, it's about character encodings: ISO 646, ISO 8859, EBCDIC, Windows-125*, KOI-8, ISO 2022...

With "encoding" I meant UTF-8, UTF-16, UTF-32...

I know you usually don't want to bother with these things in the beginning, you just "want it to work". But they are a royal PITA when you attempt to retrofit this kind of international support, because it affects virtually everything down to the simplest of functions.

Re: A simple question

Posted: Thu Sep 04, 2008 1:07 am
by Jeko
Solar wrote:
Jeko wrote:(What are codepages, and what are encodings?)
"Codepage" is DOS vocabulary; more generally speaking, it's about character encodings: ISO 646, ISO 8859, EBCDIC, Windows-125*, KOI-8, ISO 2022...

With "encoding" I meant UTF-8, UTF-16, UTF-32...

I know you usually don't want to bother with these things in the beginning, you just "want it to work". But they are a royal PITA when you attempt to retrofit this kind of international support, because it affects virtually everything down to the simplest of functions.
I think I'll use Unicode, either UTF-8 or UTF-16. I must study the differences between them, but I read that UTF-32 isn't good because it wastes space.

Re: A simple question

Posted: Thu Sep 04, 2008 4:13 am
by Solar
UTF-8 is space-efficient, at least as long as most of your characters are ASCII, or at least BMP (Basic Multilingual Plane). However, it's a multibyte encoding - i.e. you can't know how many bytes you have to skip if you want to skip n characters, as one character can be 1..? bytes.

UTF-32 is not space-efficient, as every character takes 32 bits of space (wide encoding). However, skipping characters, concatenating and many other string operations are more efficient because of this.

ISO/IEC 9899:1999 (the C language standard) more or less assumes that files are stored as multibyte characters, while in-memory operations are usually done in a wide encoding.
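
To make the skipping cost concrete, here is a small sketch in C. The function names utf8_skip and utf32_skip are made up for this example, and the UTF-8 version assumes the input is already valid UTF-8 (where one character takes 1 to 4 bytes); it does no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Skips n code points in a valid, NUL-terminated UTF-8 string. Every byte
 * has to be inspected: continuation bytes have the bit pattern 10xxxxxx.
 * Stops early at the terminating NUL if the string is shorter. */
const char *utf8_skip(const char *s, size_t n)
{
    assert(s != NULL);
    while (n > 0 && *s != '\0') {
        s++;                                        /* lead byte */
        while (((unsigned char)*s & 0xC0) == 0x80)
            s++;                                    /* continuation bytes */
        n--;
    }
    return s;
}

/* In UTF-32 the same operation is plain pointer arithmetic. This sketch
 * does not check for the end of the string. */
const uint32_t *utf32_skip(const uint32_t *s, size_t n)
{
    assert(s != NULL);
    return s + n;
}
```

The UTF-8 loop is linear in the number of bytes skipped, while the UTF-32 version is constant time, which is the trade-off against the extra memory.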

Re: A simple question

Posted: Thu Sep 04, 2008 12:10 pm
by Jeko
Solar wrote:UTF-8 is space-efficient, at least as long as most of your characters are ASCII, or at least BMP (Basic Multilingual Plane). However, it's a multibyte encoding - i.e. you can't know how many bytes you have to skip if you want to skip n characters, as one character can be 1..? bytes.

UTF-32 is not space-efficient, as every character takes 32 bits of space (wide encoding). However, skipping characters, concatenating and many other string operations are more efficient because of this.

ISO/IEC 9899:1999 (the C language standard) more or less assumes that files are stored as multibyte characters, while in-memory operations are usually done in a wide encoding.
Which encoding is the best, in your opinion? I read that UTF-32 isn't good... but what do you think?

Re: A simple question

Posted: Thu Sep 04, 2008 12:44 pm
by Solar
Personally I don't think a newly-written operating system should bother with anything but UTF-8 or UTF-32 natively. Which one you choose is really up to you, but it should be consistent all across the API. (I'd opt for the more comfortable but memory-inefficient UTF-32, but that's because any OS I'd write would be aimed at the desktop / server range. An embedded system certainly would go for UTF-8.)

With regards to the subject "readdir", I'd probably toy around with the kernel's return value. A syscall is kernel-space code, but usually provides a user-space wrapper so you don't have to fiddle with registers and the like but can do a convenient C-syntax function call. I see you want your function to return one file name at a time, which is in line with the way "readdir" as user-space coders know it works.

But who says that you need to call kernel space for every invocation of the syscall, having the memory problem with every invocation? You could, for example, have the kernel return a whole block of information the first time around, which gets stored in user space by the syscall wrapper, and only the first filename from that block actually gets returned to the caller. Subsequent calls to "readdir" are satisfied from the buffer, until that is exhausted and another call to kernel space is made.

Two advantages here: You get fewer context switches, and all memory management remains in the hands of the OS (as the syscall wrapper can return const pointers to its buffer, which is already user-space).
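
A rough sketch of such a buffering wrapper, in C. Everything here is an assumption made for illustration: the Dir structure, dir_read, the block layout (names packed back to back, each NUL-terminated), and fake_sys_getdents, which stands in for the real kernel call and serves a hard-coded directory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DIR_BUF_SIZE 32   /* deliberately small so the refill path is used */

/* Stand-in for the kernel call (hypothetical): packs as many NUL-terminated
 * names as fit into buf, starting at entry *index, advances *index, and
 * returns the number of bytes written. 0 means end of directory. A real
 * interface would also need a distinct "buffer too small" result for a
 * name that is longer than the whole buffer. */
static const char *fake_dir[] = {
    "boot", "kernel.bin", "readme.txt", "a-rather-long-file-name.log"
};
#define FAKE_DIR_COUNT (sizeof fake_dir / sizeof fake_dir[0])

static size_t fake_sys_getdents(size_t *index, char *buf, size_t buf_size)
{
    size_t used = 0;
    while (*index < FAKE_DIR_COUNT) {
        size_t len = strlen(fake_dir[*index]) + 1;
        if (len > buf_size - used)
            break;
        memcpy(buf + used, fake_dir[*index], len);
        used += len;
        (*index)++;
    }
    return used;
}

/* User-space directory handle owned by the wrapper library. */
typedef struct {
    size_t next_index;        /* next entry to request from the kernel */
    size_t pos;               /* read position inside buf */
    size_t len;               /* number of valid bytes in buf */
    char buf[DIR_BUF_SIZE];
} Dir;

/* Returns the next file name, or NULL at the end of the directory. The
 * pointer refers to the wrapper's buffer and stays valid until a later
 * call has to refill that buffer. */
const char *dir_read(Dir *d)
{
    assert(d != NULL);
    if (d->pos == d->len) {   /* buffer exhausted: one trip into the kernel */
        d->len = fake_sys_getdents(&d->next_index, d->buf, sizeof d->buf);
        d->pos = 0;
        if (d->len == 0)
            return NULL;
    }
    const char *name = d->buf + d->pos;
    d->pos += strlen(name) + 1;
    return name;
}
```

With a zero-initialised Dir, the four names above come back from only two kernel calls, plus a third that reports the end of the directory.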

Re: A simple question

Posted: Fri Sep 05, 2008 6:00 am
by AndrewAPrice
Jeko wrote:
MessiahAndrw wrote:Look how Microsoft do it. :)

e.g.
ReturnSomeKernelString(char *buffer, uint sizeOfBuffer);
So a filename may be returned only partially: if sizeOfBuffer is smaller than the length of the filename, only part of the filename is put in the buffer.

Hide it behind a nice interface. E.g. in your library have a function that is:

std::string ReturnSomeKernelString();

Internally it is:

Code:

#include <algorithm>
#include <string>

// somewhere that can be shared between all common functions like this (not thread safe though):
#define CHUNKS_TO_DO_AT_ONCE 1024
char buffer[CHUNKS_TO_DO_AT_ONCE];

// the system call: copies up to charsThatFitInBuffer chars of the name, starting at offsetInName
void sysReturnSomeKernelString(char *bufferToStoreChars, unsigned int offsetInName, unsigned int charsThatFitInBuffer, bool &stillMoreCharactersRemaining);

// the func:
std::string ReturnSomeKernelString()
{
    std::string str;
    bool stillMore = true;
    unsigned int offset = 0;

    while(stillMore)
    {
        sysReturnSomeKernelString(buffer, offset, CHUNKS_TO_DO_AT_ONCE, stillMore);
        offset += CHUNKS_TO_DO_AT_ONCE;
        if(stillMore)
        {
            str.append(buffer, CHUNKS_TO_DO_AT_ONCE); // no null-terminator so we specify size
        }
        else
        {
            // last chunk: stop at the null-terminator, but never read past the end of the buffer
            const char *end = std::find(buffer, buffer + CHUNKS_TO_DO_AT_ONCE, '\0');
            str.append(buffer, end);
        }
    }

    return str;
}
Though some could say it's inefficient because it does multiple allocations. A simpler way would be to have 2 system calls: GetSizeOfSomeKernelString() first, then GetSomeKernelString(). Also make sure you pass a maximum buffer size to the latter system call, in case the contents grow between the two calls.
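
A sketch of that two-call variant, in C. The functions fake_get_size and fake_get_string are stand-ins invented for this example; the first one deliberately changes the string after the first size query to simulate the contents growing between the two calls.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins for the two system calls (assumptions for this sketch). */
static const char *current = "short";
static int size_queries = 0;

/* Returns the size in bytes, including the terminating NUL. */
static size_t fake_get_size(void)
{
    size_t size = strlen(current) + 1;
    if (++size_queries == 1)
        current = "a much longer name";   /* simulated concurrent change */
    return size;
}

/* Returns 0 on success, -1 if the string no longer fits in max bytes. */
static int fake_get_string(char *buf, size_t max)
{
    size_t needed = strlen(current) + 1;
    if (needed > max)
        return -1;
    memcpy(buf, current, needed);
    return 0;
}

/* Returns a malloc'ed copy that the caller must free, or NULL if an
 * allocation fails. Loops because the size can change between the calls. */
char *get_kernel_string(void)
{
    for (;;) {
        size_t size = fake_get_size();
        assert(size > 0);

        char *buf = malloc(size);
        if (buf == NULL)
            return NULL;
        if (fake_get_string(buf, size) == 0)
            return buf;

        free(buf);   /* contents grew in between: query the size again */
    }
}
```

Because the second call is told the buffer size, a string that grew in the meantime is rejected rather than overflowing the buffer, and the wrapper simply starts over.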