OSDev.org

Posted: **Sat May 05, 2012 2:39 am**

berkus wrote:
MessiahAndrw wrote:
berkus wrote:This way you retain the speed (and it DOES slow down a lot when you have to do many comparisons on very long null-terminated strings, unless you use something like KMP), and also keep the API flexible (just be aware that going from "rich" strings to C strings is easy; the other way around may not be).
Well it depends on the language. For example:
If you read what I said you will see that converting from std::string to a C string is easy (just return the data() pointer), but going from a C string to std::string requires running strlen() on the C string again, exactly to reconstruct the length field.

You can easily combine strlen/strcpy for the vast majority of strings. You could just assume that the source string will fit in, say, a 128- or 256-byte buffer, and use a strncpy-like function that returns the number of characters that couldn't be copied. If that return value is greater than zero, just realloc the buffer and finish copying the string. I'm not familiar with std::string implementations, but I'd be surprised if they didn't do something similar to this.
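A rough sketch of that copy-and-measure idea, for illustration only. The names `bounded_copy` and `dup_measured` and the 128-byte first guess are my own assumptions; this is not how any particular std::string implementation works.

```c
#include <stdlib.h>
#include <string.h>

#define INITIAL_CAP 128  /* assumed first-guess buffer size */

/* Copies at most cap characters of src into dst (no terminator written).
 * Stores the number copied in *copied and returns how many characters
 * of src did not fit. */
static size_t bounded_copy(char *dst, size_t cap, const char *src, size_t *copied)
{
    size_t i = 0;
    while (i < cap && src[i] != '\0') {
        dst[i] = src[i];
        i++;
    }
    *copied = i;

    size_t left = 0;
    while (src[i + left] != '\0')   /* count what is still waiting */
        left++;
    return left;
}

/* Duplicates src on the heap and reports its length in *len_out.
 * Returns NULL if an allocation fails. */
char *dup_measured(const char *src, size_t *len_out)
{
    char *buf = malloc(INITIAL_CAP + 1);
    if (buf == NULL)
        return NULL;

    size_t copied;
    size_t left = bounded_copy(buf, INITIAL_CAP, src, &copied);
    if (left > 0) {
        /* The guess was too small: grow once and finish the copy. */
        char *bigger = realloc(buf, copied + left + 1);
        if (bigger == NULL) {
            free(buf);
            return NULL;
        }
        buf = bigger;
        memcpy(buf + copied, src + copied, left);
    }
    buf[copied + left] = '\0';
    *len_out = copied + left;
    return buf;
}
```

Short strings are read once and copied once; only strings longer than the guess pay for the realloc and a second copy of the tail.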

Defining non-null-terminated strings at the system level would be an absolute nightmare in a C/C++ (or similar) environment. Even if you redid all of the functions in your C library that rely on them, a lot of software would break anyway. Since the user can store string lengths on his end to speed things up, I don't see the point.

Code: Select all

char* mystring;
unsigned long long mystring_len;

That isn't exactly difficult to do. You'll also never outgrow the ULL (or just UL on non-Microsoft platforms), and unless you're storing tens of thousands of strings (and their lengths), it won't be any significant overhead.

Posted: **Sat May 05, 2012 6:41 am**

The conclusion is still that non-C strings have obvious algorithmic and safety advantages over C-strings, but they would need to be implemented as a dedicated UDT to respect the language.

And of course you can make it policy not to use C-strings at all. Making a constructor macro for the conversion is not difficult for string constants in source; gcc should even be able to optimize {strlen(constant), constant} for you. But since char * has a much broader meaning than a string, you don't want to blatantly supersede it.

And after all, if you have old code or want to manually optimize string functions even though the library already provides that feature, there's nothing stopping you from adding #include <string.h>.
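One possible shape for such a constructor macro, as a sketch. The type `rstr` and the names `RSTR_LIT` and `rstr_from_cstr` are hypothetical, and I use sizeof instead of strlen so the length is a compile-time constant without relying on the optimizer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical rich-string type: length plus pointer to the bytes. */
typedef struct {
    size_t      len;
    const char *data;
} rstr;

/* For string literals only. sizeof counts the terminator, hence the -1.
 * The "" s "" concatenation makes the macro fail to compile when s is
 * not a literal (for example a char * variable). */
#define RSTR_LIT(s) { sizeof("" s "") - 1, ("" s "") }

/* For arbitrary C strings the length has to be measured at run time. */
static inline rstr rstr_from_cstr(const char *s)
{
    rstr r = { strlen(s), s };
    return r;
}
```

Usage would look like `static const rstr greeting = RSTR_LIT("hello");`, which also works at file scope because both members are constant expressions.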

Posted: **Sun May 06, 2012 4:55 am**

I have played around and implemented some string functions for both null-terminated and rich strings.

The conclusion that seems to stem from my work is that the functions only differ in how they find out what size the string is. And it's obvious that having a prefix makes them faster than having to use str_len(). Only functions that need to calculate the length of a string are slower than their counterparts that work with null-terminated strings. As, I think, bluemoon said, you only shift the process of string length measurement to a different place, but I find it advantageous because of three things:
* you only measure the length of a string once and can use the result many times, as opposed to calling str_len() repeatedly in every function that needs the string's length;
* even when you do need to measure the length of a string, it's often faster than str_len(); for example, in str_cat you would just add the prefixes of the two strings to get the length of the new string;
* constant strings can have their length calculated at (pre-)compile time, so no performance loss at all.
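To make the first two points concrete, here is a small sketch. The type `rstr` and the function names are made up for illustration and are not taken from my actual code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rich string: stored length plus pointer to the bytes. */
typedef struct {
    size_t      len;
    const char *data;
} rstr;

/* Equality can reject strings of different length without reading a byte. */
bool rstr_equal(rstr a, rstr b)
{
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}

/* Suffix test: jumps straight to the tail; no scan over the whole string. */
bool rstr_ends_with(rstr s, rstr suffix)
{
    return suffix.len <= s.len &&
           memcmp(s.data + (s.len - suffix.len), suffix.data, suffix.len) == 0;
}

/* Length of a concatenation is a single addition. */
size_t rstr_concat_len(rstr a, rstr b)
{
    return a.len + b.len;
}
```

With null-terminated strings, the suffix test alone would need a str_len() over both arguments before any comparison could start.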

The only performance hit comes from stack activity, since in my model I pass the string's length separately from the string itself. But I think you'll agree that's negligible compared to the performance gain of not having to loop over a string.

I know that there's little string usage in the kernel. But I intend to use rich strings for the entire OS, meaning not only the kernel, but everything in user space.

Only one question arises. Given that I want to use rich strings, what would you advise: having the length prefix inside or outside the string? My own thoughts are that having it inside makes the prefix a bit uncomfortable to work with, but passing such a string as an argument is nicer, as it's just one argument, whereas having it outside the string results in more stack activity, as I mentioned earlier.
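For reference, the two layouts I'm weighing could look roughly like this in C. All names here are placeholders, and the "inside" variant assumes a C99 flexible array member.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* "Outside": the length travels next to the pointer (two words to pass). */
typedef struct {
    size_t len;
    char  *data;
} str_outside;

/* "Inside": the length is a header stored directly in front of the bytes,
 * so a single pointer identifies the whole string. */
typedef struct {
    size_t len;
    char   data[];   /* C99 flexible array member */
} str_inside;

/* Allocates an inside-prefixed string holding a copy of len bytes.
 * Returns NULL if the allocation fails. */
str_inside *str_inside_new(const char *bytes, size_t len)
{
    str_inside *s = malloc(sizeof *s + len);
    if (s == NULL)
        return NULL;
    s->len = len;
    memcpy(s->data, bytes, len);
    return s;
}

/* Views an inside-prefixed string as a (length, pointer) pair; no copy. */
str_outside str_inside_view(str_inside *s)
{
    str_outside v = { s->len, s->data };
    return v;
}
```

The inside layout passes as one pointer but needs a memory read to get the length, and a substring cannot share the original bytes. The outside layout costs an extra argument but can point into the middle of any buffer.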

Posted: **Sun May 06, 2012 5:25 am**

iansjack wrote:Null-terminated strings are great when you want to scan a string in some way - no need to check the length of the string to know when you have reached the end, just keep going till you hit a zero.

You compare two values each time before looping another time anyway, be it

Code: Select all

while(str[a] != 0)
{
	...

	a++;
}

or

Code: Select all

while(str_len + 1 != a)
{
	...

	a++;
}

, and I agree that the latter is slower because of the addition (which should not be in the loop; the optimizer will probably compute it once and reuse the value), but overall system performance will increase, because this loop only has to happen once, as compared to repeated str_len() calls.

iansjack wrote:And when it comes to dividing a string in two, say at some separator value, it's great. Just scan the string till you find the value and pop a zero in there instead. Now think of the processes to be followed with Pascal type strings.

Null-terminated strings are ideal when you are looking at relatively low-level operations.

You're right, but again, seeing the big picture, I think having the string's length pre-calculated and reused at least three times already makes up for the initial loss.
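As for splitting at a separator, it doesn't have to be painful with length-carrying strings if the length is kept outside the data. A sketch under that assumption (the type `rstr` and the function `rstr_split` are hypothetical):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rich string: stored length plus pointer to the bytes. */
typedef struct {
    size_t      len;
    const char *data;
} rstr;

/* Splits s at the first occurrence of sep. On success, *head receives the
 * bytes before sep and *tail the bytes after it; both point into s.
 * Returns false (and writes nothing) if sep does not occur in s. */
bool rstr_split(rstr s, char sep, rstr *head, rstr *tail)
{
    const char *hit = memchr(s.data, sep, s.len);
    if (hit == NULL)
        return false;

    size_t at = (size_t)(hit - s.data);
    head->len  = at;
    head->data = s.data;
    tail->len  = s.len - at - 1;
    tail->data = hit + 1;
    return true;
}
```

Nothing is written into the original buffer, so this also works on constant strings, which the "pop a zero in there" trick does not.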

Posted: **Sun May 06, 2012 5:40 am**

Consider the following optimisation:

Code: Select all

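#include <stdint.h>
#include <stdlib.h>

typedef struct { char *ptr; size_t size; } BSTR; // assumed layout, BSTR is not defined in this post

void bstrcat(BSTR *dest, BSTR src) // dest = dest + src
{
    // dest is passed by pointer so the new buffer and size reach the caller
    dest->ptr = realloc(dest->ptr, dest->size + src.size); // no word about error checking
    char * dptr = dest->ptr + dest->size;
    const char * sptr = src.ptr;
    size_t bytes = src.size;
    while (bytes >= 8) 
    {
        // copy 8 bytes at a time while we can
        // (assumes unaligned 64-bit accesses are fine and strict aliasing is disabled)
        *(uint64_t*)dptr = *(const uint64_t*)sptr;
        dptr += 8; sptr += 8; bytes -= 8;
    }
    while (bytes > 0)
    {
        // copy remainder
        *dptr++ = *sptr++;
        bytes--;
    }
    dest->size += src.size;
}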

Posted: **Sun May 06, 2012 2:46 pm**

Doesn't memcpy do that for you already?

That depends completely on the person that wrote it.

Point was that you can't pull off that trick when you have to keep checking each byte for a null terminator.

As far as realloc is concerned, there are quite a few optimisations for that too, but the things I can think of right now work just as well with null-terminated strings.

Posted: **Sun May 06, 2012 4:41 pm**

After reading all the posts here I'm beginning to think that null-terminated strings are bad. Sure, you can perhaps pull off two or three nifty tricks with them, but data of arbitrary length with such a dangerous boundary check as a lone \0 will probably cause more problems than it is worth.

I think strings should be treated like any other array of data. Adding a number describing the length of the data (and not the number of characters) is a good idea. If you can't store that number as a separate entity, adding it to the beginning is a good idea, especially if you want to send it as a packet of data back and forth in your system. Perhaps look at it as a little header describing the data to come.
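A sketch of what I mean by a little header. The 4-byte little-endian length field and the names `packet_write` and `packet_read` are just assumptions for the example, not a proposal for a real wire format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HDR_SIZE 4  /* assumed header: 32-bit byte count, little-endian */

/* Writes the byte count followed by the data into out.
 * Returns the total number of bytes written, or 0 if out is too small. */
size_t packet_write(uint8_t *out, size_t out_cap, const void *data, uint32_t len)
{
    if (out_cap < HDR_SIZE || out_cap - HDR_SIZE < len)
        return 0;
    out[0] = (uint8_t)(len);
    out[1] = (uint8_t)(len >> 8);
    out[2] = (uint8_t)(len >> 16);
    out[3] = (uint8_t)(len >> 24);
    memcpy(out + HDR_SIZE, data, len);
    return HDR_SIZE + (size_t)len;
}

/* Reads the header and returns a pointer to the payload inside in,
 * storing its byte count in *len_out. Returns NULL if in is truncated. */
const uint8_t *packet_read(const uint8_t *in, size_t in_len, uint32_t *len_out)
{
    if (in_len < HDR_SIZE)
        return NULL;
    uint32_t len = (uint32_t)in[0]
                 | ((uint32_t)in[1] << 8)
                 | ((uint32_t)in[2] << 16)
                 | ((uint32_t)in[3] << 24);
    if (in_len - HDR_SIZE < len)
        return NULL;
    *len_out = len;
    return in + HDR_SIZE;
}
```

Because the header counts bytes, the payload may contain zero bytes or any multi-byte encoding, and the receiver knows how much to read before touching the data.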

Posted: **Sun May 06, 2012 11:15 pm**

The simple matter of fact is that, in C/C++, strings are defined to be zero-terminated by the language standard. There's nothing you can do that would change that. So the very best you can do is to provide an additional type with different semantics, with all the strings attached (conversion, consistency, violation of principle of least surprise, ...).

I don't say it couldn't be worth the bother, but seriously, how much string munching will you be doing in your kernel API? User-space applications that are heavy on string usage usually do so with custom-made string types anyway, and there's little chance you could satisfy all their needs with your custom kernel type...

(In the app I am maintaining 9-to-5, there are three string types - one for efficient storage, one for efficient searching / comparing, and one for compatibility with the ICU library...)

Posted: **Mon May 07, 2012 4:03 am**

Solar wrote:... with all the strings attached ...

Pun intended?

Posted: **Mon May 07, 2012 6:07 am**

Solar wrote:The simple matter of fact is that, in C/C++, strings are defined to be zero-terminated by the language standard. There's nothing you can do that would change that. So the very best you can do is to provide an additional type with different semantics, with all the strings attached (conversion, consistency, violation of principle of least surprise, ...).

Attached strings or not, one string I'd like to add is that all strings attached to the std:: world already have different semantics for exactly the reasons we've been discussing in this thread - only really leaving the conversion step as an argument.

IMHO C is really inferior when it comes to string support in the language standard - it's exactly that language that completely lacks the managed type.


Non null terminated strings.
