C Strings Vs Pascal Strings

Candy · Post by **Candy** » Thu Aug 19, 2004 12:01 pm

Curufir wrote: If it can be pre-computed then it should be pre-computed in my opinion.

You can store a graph displaying the occurence of each character in the string, plus its second derivate (how much more does the c appear compared to the b, than the b compared to the a?). You have to draw a line because of overhead.

Curufir · Post by **Curufir** » Thu Aug 19, 2004 12:42 pm

Solar wrote: ...which means you have the data ready at hand for "querying" functions (strlen), but have to do additional algorithmics on "modifying" functions (strcat),

Hmm. Not sure about that. With C strings you'd have to do a couple of strlens before concatentation. With my proposal those strlens are immediately available so memory allocation for the new string becomes trivial, as does figuring out where in memory the contents of the second string should be placed. The problems (I think this is what you were getting at) would occur trying to concatenate two strings of different types and char widths. Easiest solution would be to just deny string operations on strings of different types, forcing the use of some cast-style operations.

and are still at a loss for another set of "querying" functions (strstr, strpbrk).

True, there's no bonus here (Although there will be if the functions can take advantage of searching from both start and end simultaneously), but there's no loss either. Once again there'd have to be some mechanism for type matching, but I don't see that as a major obstacle.

You also significantly blow up space requirements...

Not really. I'm talking about maybe an extra 8 bytes/string. That's low enough to be inconsequential aside from very small strings.

, and require the user to do some serious up-front setup for even the most simple fopen() derivate...

I don't see why. Details for hardcoded filenames can be dealt with at compilation time, strings input by the user can be dealt with as they are typed in, strings returned by the system can be formatted correctly before return. Once the whole system is passing around strings using the same format the overheads aren't as great.

I agree that type/character width conversion is likely to be a pain in the neck, but I don't see too many instances where it's going to be needed (Could just be missing something).

Solar · Post by **Solar** » Thu Aug 19, 2004 1:09 pm

Curufir wrote: (Could just be missing something).

Yes... either you'd have to provide a C library that permanently converts to and fro (YourStrings from UI -> C strings in the app -> YourStrings in the kernel and vice versa), or you provide a C library that isn't "C" anymore - which will definitely annoy developers.

I agree that smells like "because that's the way we always did it".

zloba · Post by **zloba** » Thu Aug 19, 2004 6:51 pm

i am using COM strings ;D
as in, having an interface pointer to a COM object that allows you to determine length, obtain chars, insert/overwrite etc.
yep, a virtual method call for obtaining every single character.

what structure is used internally, is completely irrelevant to the user. furthermore, if i suddenly come up with a better interface for string access, i can just implement it, and no code will be broken.

the general-purpose implementation currently in use has a dynamically allocated buffer, with a buffer size field, and an actual strlen field.

or, i may create another implementation that favors insert/append operations, or saves space, or whatever - such as a linked list of chunks - the user code will never know the difference, as far as usage is concerned.
or maybe an implementation that additionally implements Unicode string functions or something else.

Dreamsmith · Post by **Dreamsmith** » Fri Aug 20, 2004 12:30 am

Curufir wrote:
Solar wrote:...which means you have the data ready at hand for "querying" functions (strlen), but have to do additional algorithmics on "modifying" functions (strcat),
Hmm. Not sure about that. With C strings you'd have to do a couple of strlens before concatentation.

As someone who's implemented <string.h>, I can assure you, Solar is right -- strcat doesn't call strlen at all, much less twice.

He's right in general as well. Getting the length of a C string is a O(n) operation, as opposed to O(1) for Pascal strings. But for nearly every other function in the C standard library, the code is smaller and faster because it uses null-terminated strings rather than length-encoded strings. Not just string modifying routines, either -- try writing strcmp with both types. The null-terminated string version is significantly more trivial than the length-encoded version, and requires fewer comparison per iteration of the loop.

Solar · Post by **Solar** » Fri Aug 20, 2004 12:52 am

Dreamsmith wrote:
Curufir wrote:
Solar wrote:...which means you have the data ready at hand for "querying" functions (strlen), but have to do additional algorithmics on "modifying" functions (strcat),
Hmm. Not sure about that. With C strings you'd have to do a couple of strlens before concatentation.
As someone who's implemented <string.h>, I can assure you, Solar is right -- strcat doesn't call strlen at all, much less twice.

As someone who's implemented <string.h> too, I have to admit that Curufir is right, too - you, as a user, most likely have to call strlen() before you call strcat(), since you have to determine how much space to allocate for s1 (the destination array)...

Dreamsmith · Post by **Dreamsmith** » Fri Aug 20, 2004 1:25 am

Solar wrote:you, as a user, most likely have to call strlen() before you call strcat(), since you have to determine how much space to allocate for s1 (the destination array)...

Well, sometimes you do, and sometimes you don't. The nice thing about the C way of doing things is the library doesn't make any assumptions about that for you. You can do the extra work when you need it, and skip it when you know you don't.

It's pure folly for the toolsmith to assume what'll work best for the string library's users. Only they know their programs' requirements.

There are a great many string operations where knowing the length of the string ahead of time is not at all helpful. When you're going to be needing to know it, you can keep a variable around that holds it length -- I do this all the time. With C strings, you have the choice of maintaining the information or not. With Pascal strings, you're forced to maintain it, adding additional complications to all sorts of functions (even simple ones like strcmp), whether you need it or not, since it's not a properly formatted Pascal string if you don't keep that variable accurate, whether you're using it or not.

Any time you make something required rather than optional, you remove an option from the user of your library. Every time you do that, you guarentee that at least some programs are going to be forced to run more slowly than they otherwise would. With C strings, a programmer can always opt to carry around a length variable with each string to optimize for situations where this is helpful, so Pascal strings offer no speed advantage here. With Pascal strings, a programmer cannot opt to not carry around a length variable when it's not helpful, and it does complicate a lot of functions that would otherwise go much more quickly, so Pascal strings offer a speed disadvantage here. Not a good tradeoff in my book...

Candy · Post by **Candy** » Fri Aug 20, 2004 1:38 am

Isn't it an idea to make a form of transformed C string? Say, a c string with a prefix of 2 bytes containing its length, for text strings only, so most functions (strcat, strlen) can operate in either O(n) or O(1) time, instead of O(m+n) or O(n) time. Keeping track of the first two bytes to be skipped is barely a serious requirement, and making it optional as a substitute library (your own?) makes it usable. But, you'd still have to provide them for the standard C stuff that's to be ported. Changing a few lines because of a minor change is OK, checking every line three times because you want to change the way I think is not OK. Same goes for the GAS code I produce, have to read every line three times to make sure the operands are the AT&T way around, immediates are prefixed with $ and addresses are not, and that each register has the required amount of % signs.

In short, you can make it anything you like as long as you also provide a default implementation. Don't force people, ease them in, with an easy way out. Prefer to aim for a different class of strings that can be applied and removed (!) with a single SED command.

As for the actual implementation, I'm considering switching to UTF8 internal representation, although it makes you get O(n) stuff for some operations (most actually). UTF32 is another alternative but seems too wasteful for unicode (which never uses the top 11 bits), and esp. for plain text which then takes up 4x the space it did before. In all places but Microsoft-land that's unacceptable bloat. But - prefixing the UTF8 string with a few variables and turning it in a c++ class that can be used very transparently in C as if it was a plain string class (educate, do not force, users) makes it a very normal thing to use. This would make the utf8 implementation as fast, or just about as fast as, the normal ascii char functions, with an overhead of 2-8 bytes (possibly more on big strings) for small strings, which have an average length of 16. That's a waste of 12-50%, compared to the 300% waste UTF32 would have. For bigger strings, the overhead lowers, where for UTF32 (or wchar_t on an implementation with similar capabilities) never gets below 300%.

Think I'm going to work on this a while in the near future, say, after 0.0.3 is out...

darn, need to get that boot loader working & the kernel tested

Solar · Post by **Solar** » Fri Aug 20, 2004 2:02 am

Candy wrote: Same goes for the GAS code I produce, have to read every line three times to make sure the operands are the AT&T way around, immediates are prefixed with $ and addresses are not, and that each register has the required amount of % signs.

On the other hand, when I read your GAS code (and you read code more often than you write it), I get all that cosy extra information...

...prefixing the UTF8 string with a few variables and turning it in a c++ class that can be used very transparently in C...

You can't use C++ classes transparently in C.

darn, need to get that boot loader working & the kernel tested

Have you tried GRUB? *ducksandruns*

Pype.Clicker · Post by **Pype.Clicker** » Fri Aug 20, 2004 2:30 am

Just a thought ... What you actually want is an oracle that tells you "string at address X is of len Y", right ? why not putting such information in a reference table, for instance ?

Let's say that the two bytes before your string can be an index into the table (if it's a well-allocated string, and not a portion of a string).

You check that index against the table's length and if within, you compare the address X with the address stored in table[index] ... If they match, you're facing a 'valid string' and you can read the size from the table. If they're not matching, you've been fooled by a middle-of-a-string pointer and should go for the traditionnal 'scan-for-a-null-char' approach...

zloba · Post by **zloba** » Fri Aug 20, 2004 4:29 am

Pype.Clicker

Just a thought ... What you actually want is an oracle that tells you "string at address X is of len Y", right ?

it's funny how this exact idea occurred to me today regarding a memory manager:
given a pointer to delete,
obtain an "index into the allocation table" stored just before it,
verify that the index is within range,
look up the entry in the allocation table to see if the target address matches,
if matches, deallocate using the data obtained; otherwise, make noise: either something has been damaged (index or allocation table), or someone is trying to delete a bogus pointer. at this point, i'd be most interested in a core dump.
there may be other interesting ways to make use of this.

p.s. replaced "identifier" with "index into the allocation table" (RE to Solar )

Solar · Post by **Solar** » Fri Aug 20, 2004 4:46 am

I don't really get what you mean with "identifier", but having a "memory header" in front of what malloc() actually returns is common practice, e.g. in dlmalloc().

Curufir · Post by **Curufir** » Fri Aug 20, 2004 7:58 am

Dreamsmith wrote: Well, sometimes you do, and sometimes you don't. The nice thing about the C way of doing things is the library doesn't make any assumptions about that for you.

Ok, try using a null terminator as something else... ;D

Dunno why you think there's a problem with strcmp.

All that happens is the comparison loop gets a defined limit (Which is no bad thing IMHO) rather than including a check to see if a null-terminator has been found. Chances are it would be quicker.

I honestly don't see where you think this huge extra book-keeping is coming from. If all strings on the system know their lengths then all the book-keeping stuff comes down to very few instructions. Enough to be totally insignificant in comparison to the vast number of memory ops strings require. Sure it'll slow things down if you run a million strings through it, but if you're that desperately concerned about string operation speed you'd have rewritten the string library in assembly (Yup, there are speed gains to be had).

Solar · Post by **Solar** » Fri Aug 20, 2004 8:26 am

Curufir wrote:
All that happens is the comparison loop gets a defined limit (Which is no bad thing IMHO) rather than including a check to see if a null-terminator has been found.

Well, it already has a defined limit:

Code: Select all

int strcmp( const char * s1, const char * s2 )
{
   while ( *s1 && *s2 )
   {
      ++s1;
      ++s2;
   }
   return ( *s1 - *s2 );
}

Chances are it would be quicker.

How come? Sure, you could check for equal length first to find whether they could be equal, but then you'd still have to iterate through both strings, equal length or not.

I honestly don't see where you think this huge extra book-keeping is coming from. If all strings on the system know their lengths then all the book-keeping stuff comes down to very few instructions.

Instead of char * myString = malloc( strlen( otherString ) + 1 ), I'd have to allocate extra space for the bookkeeping. (Since malloc() doesn't know that it's about to reserve a string it can't do that for me.)

Then, instead of doing the trivial strcpy:

Code: Select all

char * strcpy( char * s1, const char * s2 )
{
   char * dest = s1;
   while ( ( *dest++ = *s2++ ) );
   return s1;
}

I would have to track the length of the copied string and write it:

Code: Select all

char * strcpy( char * s1, const char * s2 )
{
   char * dest = s1;
   while ( ( *dest++ = *s2++ ) );
   /* assuming 1-byte length information */
   *(s1 - 1) = (dest - s1);
   return s1;
}

Enough to be totally insignificant in comparison to the vast number of memory ops strings require.

I'll grant you that. But you will be going through all kinds of headaches, both for you and for anyone using your system - because, trust me, it won't be transparent to the C user - for a speed benefit in strcat, strncat, strrchr, and strlen. Perhaps strstr, too, if you're doing some smart algorithmics. The rest won't profit, would even be that bit slower.

And you'd break strtok, not that it's that popular.

...but if you're that desperately concerned about string operation speed you'd have rewritten the string library in assembly (Yup, there are speed gains to be had).

Depends on your compiler, and I want to see it to believe it. In any case, I prefer my code to be maintainable, and I'm better at maintaining C than ASM.

Curufir · Post by **Curufir** » Fri Aug 20, 2004 9:29 am

Solar wrote:
How come? Sure, you could check for equal length first to find whether they could be equal, but then you'd still have to iterate through both strings, equal length or not.

Because I can leverage the fact that I know the lengths to perform operations on dwords not bytes more efficiently.

Eg (Pseudo-code)

Code: Select all

Loop for (str2.len >> 2) times
Do whatever you want with dwords
End Loop

If (str2.len | 0x2)
Do whatever you want with words
End If

If (str2.len | 0x1)
Do whatever you want with bytes
End If

Where str2.len is the shortest length of the two strings

Sure you can do things in dwords with null-terminated string, but you need a bunch of extra logic to deal with cases where the string isn't a factor of 4 in length.

Instead of char * myString = malloc( strlen( otherString ) + 1 ), I'd have to allocate extra space for the bookkeeping. (Since malloc() doesn't know that it's about to reserve a string it can't do that for me.)

Hmm. Well I wouldn't be using strlen anyway. Since there's variable character width allowed strlen is meaningless in terms of memory allocation. It'd make far more sense to tweak sizeof to return the true size (In bytes) for a string and use that for malloc.

Ok, so it breaks the standard, but only a little ;D.

I'll grant you that. But you will be going through all kinds of headaches, both for you and for anyone using your system - because, trust me, it won't be transparent to the C user - for a speed benefit in strcat, strncat, strrchr, and strlen.

Very true, I guess I'd have to put a label on the tin saying "Non-standard C".

Depends on your compiler, and I want to see it to believe it.

It's not much, the C libraries are pretty good about it nowadays, but there are still some gains to be made...trust me (Ever seen your compiler use SSE for strings?).

OSDev.org

C Strings Vs Pascal Strings

Re:C Strings Vs Pascal Strings

Re:C Strings Vs Pascal Strings

Re:C Strings Vs Pascal Strings

Re:C Strings Vs Pascal Strings

Re:C Strings Vs Pascal Strings

Re:C Strings Vs Pascal Strings

Re:C Strings Vs Pascal Strings

Re:C Strings Vs Pascal Strings

Re:C Strings Vs Pascal Strings

Re:C Strings Vs Pascal Strings

Re:C Strings Vs Pascal Strings

Re:C Strings Vs Pascal Strings

Re:C Strings Vs Pascal Strings

Re:C Strings Vs Pascal Strings

Re:C Strings Vs Pascal Strings