bluemoon wrote:That's a Pascal string; the disadvantages are:
1. string length is restricted by the prefix size, i.e. with a 1-byte prefix you have strings of < 256 (wide) characters; a 2-byte prefix would occupy more space than a null-terminated string
That's a valid point, but I'm not really worried about three additional bytes per string (a 4-byte prefix instead of a 1-byte terminator), given the speed advantage. And with a 32-bit prefix, a string can be up to 4 GB long. I use a 32-bit prefix on 32-bit machines (on 64-bit machines I'd use a 64-bit prefix, and so on), since I've heard that integers of the machine's native word size are the fastest on a given machine. Not that speed is an issue now, but it may be later. After all, I picked this string implementation because of its speed advantages.
bluemoon wrote:2. fixed character width, so it isn't very flexible for implementing UTF-8 on top of it.
Valid too, and this one does worry me. I'd like to implement UTF-8 in the future; not only do I think it's a good standard, it's also a widely used one (I have to correct myself here: I only ignore standards compliance where it gets in the way). But again, the strings can be as big as the entire memory, which is literally more than enough. What I'm not sure about is whether the prefix should indicate how many bytes or how many characters the string has in UTF-8. Maybe a solution will suggest itself when I implement the standard.
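To make the bytes-versus-characters question concrete, here is a minimal sketch, assuming the prefix stores the byte count and the character (code point) count is computed on demand. The struct `lp_str` and the function `lp_str_codepoints` are hypothetical names made up for this illustration, and the input is assumed to be valid UTF-8 already.

```c
#include <stddef.h>

/* Hypothetical length-prefixed string: len counts BYTES, not characters. */
struct lp_str {
    size_t len;        /* number of bytes in data */
    const char *data;  /* UTF-8 bytes, not null-terminated */
};

/* Count code points by skipping UTF-8 continuation bytes (10xxxxxx).
   Assumes the input is valid UTF-8; no validation is done here. */
size_t lp_str_codepoints(struct lp_str s)
{
    size_t count = 0;
    for (size_t i = 0; i < s.len; i++) {
        if (((unsigned char)s.data[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

With a byte-count prefix, copying and concatenating never need to look inside the data; only operations that care about characters pay for the scan. Storing the character count instead would make the byte size unknown without a scan, which is probably the worse trade for memory operations.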
bluemoon wrote:The advantage is speed, but watch out: you still need to count the characters when you create/modify the string, so the total time consumed may not differ much (it's just a way of pre-calculating the strlen).
Maybe, if I create a string byte by byte. But when concatenating two strings, computing the length would be simpler and faster than with null-terminated strings:
Code: Select all
void str_cat(unsigned int str_1_len, char* str_1, unsigned int str_2_len, char* str_2, unsigned int* ret_str_len, char* buf)
{
    /* The new length is just a sum; no scanning for a terminator is needed. */
    *ret_str_len = str_1_len + str_2_len;
    ...
}
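A fuller sketch of the same idea, under a few assumptions: the name `lp_str_cat`, the extra `buf_cap` parameter (capacity of the output buffer in bytes), the `size_t` lengths and the `int` return code are all additions made for illustration and are not part of the snippet above.

```c
#include <stddef.h>
#include <string.h>

/* Concatenate two (length, pointer) strings into buf.
   buf_cap is a hypothetical extra parameter: the capacity of buf in bytes.
   Returns 0 on success, -1 if the result would not fit. */
int lp_str_cat(size_t str_1_len, const char *str_1,
               size_t str_2_len, const char *str_2,
               size_t *ret_str_len, char *buf, size_t buf_cap)
{
    /* Written this way, the check cannot overflow when adding the lengths. */
    if (str_1_len > buf_cap || str_2_len > buf_cap - str_1_len)
        return -1;

    /* Both lengths are known up front, so two block copies are enough. */
    memcpy(buf, str_1, str_1_len);
    memcpy(buf + str_1_len, str_2, str_2_len);
    *ret_str_len = str_1_len + str_2_len;
    return 0;
}
```

A null-terminated strcat() has to walk the destination to find its end before it can copy anything; here the destination offset is simply str_1_len.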
Also it's a bit inconvenient to count the bytes in a string, when I do:
Code: Select all
prnt_str(what?, "a veeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeery long string : p")
Or am I being a noob who doesn't know an automated way to do it?
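One common C idiom that may help here: `sizeof` applied to a string literal yields its size in bytes including the terminating zero the compiler appends, so the length can be computed at compile time. The sketch below assumes prnt_str() takes the length first and the pointer second; the stand-in body and the macro names `LIT_LEN` and `PRNT_LIT` are hypothetical.

```c
#include <stddef.h>

/* sizeof on a string literal counts the '\0' the compiler appends,
   so subtract 1. Works only on literals and arrays, NOT on char* pointers. */
#define LIT_LEN(s) (sizeof(s) - 1)

/* Hypothetical stand-in for prnt_str(): it just returns the length it
   was given, so the example can be checked without real output. */
static size_t prnt_str(size_t len, const char *str)
{
    (void)str;
    return len;
}

/* Expands to both arguments: the computed length, then the literal. */
#define PRNT_LIT(s) prnt_str(LIT_LEN(s), (s))
```

With this, a call such as PRNT_LIT("a very long string") needs no manual counting. Passing a char* variable instead of a literal would silently give the pointer size minus one, so the macro is only safe for literals and arrays.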
Solar wrote:a C userspace string (which is inherently null-terminated)
How come it is inherently null-terminated? As far as I understand, I can build a library that uses length-prefixed strings, can't I? Just like I'm building a kernel.
Solar wrote:You also have to be careful when crossing API boundaries. If the kernel is using Pascal strings, a C userspace string (which is inherently null-terminated) would have to be converted at some point - or userspace would be faced with two different kinds of strings (the native C ones, and the kernel ones). And we all know how great that is for a feature (CString, anyone?).
If you manage to keep that abstraction in the kernel only, you won't have the problem - but you won't have much of an advantage either.
Having the kernel use one kind of string while user space either uses a different kind and converts to the kernel's type, or has to deal with two kinds of strings, is nasty, to say the least. But how come C user space strings are inherently null-terminated? Are you talking about existing C libraries? I intend to build my own, so if that's the problem, there's no problem. : ) I want to find out whether non-null-terminated strings give me an advantage and, if so, extend them to user space, not only kernel space.
iansjack wrote:And when it comes to dividing a string in two, say at some separator value, it's great. Just scan the string till you find the value and pop a zero in there instead. Now think of the processes to be followed with Pascal type strings.
Null-terminated strings are ideal when you are looking at relatively low-level operations.
True, but I believe there are other places where the alternative has advantages, like what I demonstrated with the str_cat() example, and of course str_len().
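On the splitting point, one possible approach with (length, pointer) strings is to describe each half as a view into the original bytes rather than writing a zero into them. This is only a sketch: `str_view` and `str_split` are hypothetical names, and it assumes the length is kept next to the pointer (as in the str_cat() signature) rather than stored in memory in front of the characters.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical view type: a length plus a pointer into existing bytes. */
struct str_view {
    size_t len;
    const char *data;
};

/* Split s at the first occurrence of sep.
   Returns 1 and fills head/tail if sep was found, 0 otherwise.
   The original bytes are not modified and nothing is copied. */
int str_split(struct str_view s, char sep,
              struct str_view *head, struct str_view *tail)
{
    const char *p = memchr(s.data, sep, s.len);
    if (p == NULL)
        return 0;

    head->data = s.data;
    head->len = (size_t)(p - s.data);
    tail->data = p + 1;
    tail->len = s.len - head->len - 1;
    return 1;
}
```

The scan for the separator is the same as in the null-terminated case; the difference is that the source stays intact, so it also works on read-only data. If the prefix lives in memory directly before the characters, the tail half would need its own prefix, which is where the zero-popping trick has the edge.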
Also, Wikipedia says null-terminated strings have speed issues, and there are more sources. Besides, non-null-terminated strings tend to be less error-prone, they say.
My guess now is that one is better in some ways and the other in others. It would be best to optimize for what is used most often. I think I'll try to implement all the string operations I can think of (I haven't yet, because I didn't need them) and see where one type of string or the other is advantageous. But that alone won't tell me which is best for overall performance, as some functions may be used more than others. For example, I've read that str_len() is used very often. I should find the original article that spawned this devil seed in me and re-read it, but I'm kind of in a hurry now.