You can store a graph displaying the occurence of each character in the string, plus its second derivate (how much more does the c appear compared to the b, than the b compared to the a?). You have to draw a line because of overhead.Curufir wrote: If it can be pre-computed then it should be pre-computed in my opinion.
C Strings Vs Pascal Strings
Re:C Strings Vs Pascal Strings
Re:C Strings Vs Pascal Strings
Hmm. Not sure about that. With C strings you'd have to do a couple of strlens before concatentation. With my proposal those strlens are immediately available so memory allocation for the new string becomes trivial, as does figuring out where in memory the contents of the second string should be placed. The problems (I think this is what you were getting at) would occur trying to concatenate two strings of different types and char widths. Easiest solution would be to just deny string operations on strings of different types, forcing the use of some cast-style operations.Solar wrote: ...which means you have the data ready at hand for "querying" functions (strlen), but have to do additional algorithmics on "modifying" functions (strcat),
True, there's no bonus here (Although there will be if the functions can take advantage of searching from both start and end simultaneously), but there's no loss either. Once again there'd have to be some mechanism for type matching, but I don't see that as a major obstacle.and are still at a loss for another set of "querying" functions (strstr, strpbrk).
Not really. I'm talking about maybe an extra 8 bytes/string. That's low enough to be inconsequential aside from very small strings.You also significantly blow up space requirements...
I don't see why. Details for hardcoded filenames can be dealt with at compilation time, strings input by the user can be dealt with as they are typed in, strings returned by the system can be formatted correctly before return. Once the whole system is passing around strings using the same format the overheads aren't as great., and require the user to do some serious up-front setup for even the most simple fopen() derivate...
I agree that type/character width conversion is likely to be a pain in the neck, but I don't see too many instances where it's going to be needed (Could just be missing something).
Re:C Strings Vs Pascal Strings
Yes... either you'd have to provide a C library that permanently converts to and fro (YourStrings from UI -> C strings in the app -> YourStrings in the kernel and vice versa), or you provide a C library that isn't "C" anymore - which will definitely annoy developers.Curufir wrote: (Could just be missing something).
I agree that smells like "because that's the way we always did it".
Every good solution is obvious once you've found it.
Re:C Strings Vs Pascal Strings
i am using COM strings ;D
as in, having an interface pointer to a COM object that allows you to determine length, obtain chars, insert/overwrite etc.
yep, a virtual method call for obtaining every single character.
what structure is used internally, is completely irrelevant to the user. furthermore, if i suddenly come up with a better interface for string access, i can just implement it, and no code will be broken.
the general-purpose implementation currently in use has a dynamically allocated buffer, with a buffer size field, and an actual strlen field.
or, i may create another implementation that favors insert/append operations, or saves space, or whatever - such as a linked list of chunks - the user code will never know the difference, as far as usage is concerned.
or maybe an implementation that additionally implements Unicode string functions or something else.
as in, having an interface pointer to a COM object that allows you to determine length, obtain chars, insert/overwrite etc.
yep, a virtual method call for obtaining every single character.
what structure is used internally, is completely irrelevant to the user. furthermore, if i suddenly come up with a better interface for string access, i can just implement it, and no code will be broken.
the general-purpose implementation currently in use has a dynamically allocated buffer, with a buffer size field, and an actual strlen field.
or, i may create another implementation that favors insert/append operations, or saves space, or whatever - such as a linked list of chunks - the user code will never know the difference, as far as usage is concerned.
or maybe an implementation that additionally implements Unicode string functions or something else.
Re:C Strings Vs Pascal Strings
As someone who's implemented <string.h>, I can assure you, Solar is right -- strcat doesn't call strlen at all, much less twice.Curufir wrote:Hmm. Not sure about that. With C strings you'd have to do a couple of strlens before concatentation.Solar wrote:...which means you have the data ready at hand for "querying" functions (strlen), but have to do additional algorithmics on "modifying" functions (strcat),
He's right in general as well. Getting the length of a C string is a O(n) operation, as opposed to O(1) for Pascal strings. But for nearly every other function in the C standard library, the code is smaller and faster because it uses null-terminated strings rather than length-encoded strings. Not just string modifying routines, either -- try writing strcmp with both types. The null-terminated string version is significantly more trivial than the length-encoded version, and requires fewer comparison per iteration of the loop.
Re:C Strings Vs Pascal Strings
As someone who's implemented <string.h> too, I have to admit that Curufir is right, too - you, as a user, most likely have to call strlen() before you call strcat(), since you have to determine how much space to allocate for s1 (the destination array)...Dreamsmith wrote:As someone who's implemented <string.h>, I can assure you, Solar is right -- strcat doesn't call strlen at all, much less twice.Curufir wrote:Hmm. Not sure about that. With C strings you'd have to do a couple of strlens before concatentation.Solar wrote:...which means you have the data ready at hand for "querying" functions (strlen), but have to do additional algorithmics on "modifying" functions (strcat),
Every good solution is obvious once you've found it.
Re:C Strings Vs Pascal Strings
Well, sometimes you do, and sometimes you don't. The nice thing about the C way of doing things is the library doesn't make any assumptions about that for you. You can do the extra work when you need it, and skip it when you know you don't.Solar wrote:you, as a user, most likely have to call strlen() before you call strcat(), since you have to determine how much space to allocate for s1 (the destination array)...
It's pure folly for the toolsmith to assume what'll work best for the string library's users. Only they know their programs' requirements.
There are a great many string operations where knowing the length of the string ahead of time is not at all helpful. When you're going to be needing to know it, you can keep a variable around that holds it length -- I do this all the time. With C strings, you have the choice of maintaining the information or not. With Pascal strings, you're forced to maintain it, adding additional complications to all sorts of functions (even simple ones like strcmp), whether you need it or not, since it's not a properly formatted Pascal string if you don't keep that variable accurate, whether you're using it or not.
Any time you make something required rather than optional, you remove an option from the user of your library. Every time you do that, you guarentee that at least some programs are going to be forced to run more slowly than they otherwise would. With C strings, a programmer can always opt to carry around a length variable with each string to optimize for situations where this is helpful, so Pascal strings offer no speed advantage here. With Pascal strings, a programmer cannot opt to not carry around a length variable when it's not helpful, and it does complicate a lot of functions that would otherwise go much more quickly, so Pascal strings offer a speed disadvantage here. Not a good tradeoff in my book...
Re:C Strings Vs Pascal Strings
Isn't it an idea to make a form of transformed C string? Say, a c string with a prefix of 2 bytes containing its length, for text strings only, so most functions (strcat, strlen) can operate in either O(n) or O(1) time, instead of O(m+n) or O(n) time. Keeping track of the first two bytes to be skipped is barely a serious requirement, and making it optional as a substitute library (your own?) makes it usable. But, you'd still have to provide them for the standard C stuff that's to be ported. Changing a few lines because of a minor change is OK, checking every line three times because you want to change the way I think is not OK. Same goes for the GAS code I produce, have to read every line three times to make sure the operands are the AT&T way around, immediates are prefixed with $ and addresses are not, and that each register has the required amount of % signs.
In short, you can make it anything you like as long as you also provide a default implementation. Don't force people, ease them in, with an easy way out. Prefer to aim for a different class of strings that can be applied and removed (!) with a single SED command.
As for the actual implementation, I'm considering switching to UTF8 internal representation, although it makes you get O(n) stuff for some operations (most actually). UTF32 is another alternative but seems too wasteful for unicode (which never uses the top 11 bits), and esp. for plain text which then takes up 4x the space it did before. In all places but Microsoft-land that's unacceptable bloat. But - prefixing the UTF8 string with a few variables and turning it in a c++ class that can be used very transparently in C as if it was a plain string class (educate, do not force, users) makes it a very normal thing to use. This would make the utf8 implementation as fast, or just about as fast as, the normal ascii char functions, with an overhead of 2-8 bytes (possibly more on big strings) for small strings, which have an average length of 16. That's a waste of 12-50%, compared to the 300% waste UTF32 would have. For bigger strings, the overhead lowers, where for UTF32 (or wchar_t on an implementation with similar capabilities) never gets below 300%.
Think I'm going to work on this a while in the near future, say, after 0.0.3 is out...
darn, need to get that boot loader working & the kernel tested
In short, you can make it anything you like as long as you also provide a default implementation. Don't force people, ease them in, with an easy way out. Prefer to aim for a different class of strings that can be applied and removed (!) with a single SED command.
As for the actual implementation, I'm considering switching to UTF8 internal representation, although it makes you get O(n) stuff for some operations (most actually). UTF32 is another alternative but seems too wasteful for unicode (which never uses the top 11 bits), and esp. for plain text which then takes up 4x the space it did before. In all places but Microsoft-land that's unacceptable bloat. But - prefixing the UTF8 string with a few variables and turning it in a c++ class that can be used very transparently in C as if it was a plain string class (educate, do not force, users) makes it a very normal thing to use. This would make the utf8 implementation as fast, or just about as fast as, the normal ascii char functions, with an overhead of 2-8 bytes (possibly more on big strings) for small strings, which have an average length of 16. That's a waste of 12-50%, compared to the 300% waste UTF32 would have. For bigger strings, the overhead lowers, where for UTF32 (or wchar_t on an implementation with similar capabilities) never gets below 300%.
Think I'm going to work on this a while in the near future, say, after 0.0.3 is out...
darn, need to get that boot loader working & the kernel tested
Re:C Strings Vs Pascal Strings
On the other hand, when I read your GAS code (and you read code more often than you write it), I get all that cosy extra information...Candy wrote: Same goes for the GAS code I produce, have to read every line three times to make sure the operands are the AT&T way around, immediates are prefixed with $ and addresses are not, and that each register has the required amount of % signs.
You can't use C++ classes transparently in C....prefixing the UTF8 string with a few variables and turning it in a c++ class that can be used very transparently in C...
Have you tried GRUB? *ducksandruns*darn, need to get that boot loader working & the kernel tested
Every good solution is obvious once you've found it.
- Pype.Clicker
- Member
- Posts: 5964
- Joined: Wed Oct 18, 2006 2:31 am
- Location: In a galaxy, far, far away
- Contact:
Re:C Strings Vs Pascal Strings
Just a thought ... What you actually want is an oracle that tells you "string at address X is of len Y", right ? why not putting such information in a reference table, for instance ?
Let's say that the two bytes before your string can be an index into the table (if it's a well-allocated string, and not a portion of a string).
You check that index against the table's length and if within, you compare the address X with the address stored in table[index] ... If they match, you're facing a 'valid string' and you can read the size from the table. If they're not matching, you've been fooled by a middle-of-a-string pointer and should go for the traditionnal 'scan-for-a-null-char' approach...
Let's say that the two bytes before your string can be an index into the table (if it's a well-allocated string, and not a portion of a string).
You check that index against the table's length and if within, you compare the address X with the address stored in table[index] ... If they match, you're facing a 'valid string' and you can read the size from the table. If they're not matching, you've been fooled by a middle-of-a-string pointer and should go for the traditionnal 'scan-for-a-null-char' approach...
Re:C Strings Vs Pascal Strings
Pype.Clicker
given a pointer to delete,
obtain an "index into the allocation table" stored just before it,
verify that the index is within range,
look up the entry in the allocation table to see if the target address matches,
if matches, deallocate using the data obtained; otherwise, make noise: either something has been damaged (index or allocation table), or someone is trying to delete a bogus pointer. at this point, i'd be most interested in a core dump.
there may be other interesting ways to make use of this.
p.s. replaced "identifier" with "index into the allocation table" (RE to Solar )
it's funny how this exact idea occurred to me today regarding a memory manager:Just a thought ... What you actually want is an oracle that tells you "string at address X is of len Y", right ?
given a pointer to delete,
obtain an "index into the allocation table" stored just before it,
verify that the index is within range,
look up the entry in the allocation table to see if the target address matches,
if matches, deallocate using the data obtained; otherwise, make noise: either something has been damaged (index or allocation table), or someone is trying to delete a bogus pointer. at this point, i'd be most interested in a core dump.
there may be other interesting ways to make use of this.
p.s. replaced "identifier" with "index into the allocation table" (RE to Solar )
Re:C Strings Vs Pascal Strings
I don't really get what you mean with "identifier", but having a "memory header" in front of what malloc() actually returns is common practice, e.g. in dlmalloc().
Every good solution is obvious once you've found it.
Re:C Strings Vs Pascal Strings
Ok, try using a null terminator as something else... ;DDreamsmith wrote: Well, sometimes you do, and sometimes you don't. The nice thing about the C way of doing things is the library doesn't make any assumptions about that for you.
Dunno why you think there's a problem with strcmp.
All that happens is the comparison loop gets a defined limit (Which is no bad thing IMHO) rather than including a check to see if a null-terminator has been found. Chances are it would be quicker.
I honestly don't see where you think this huge extra book-keeping is coming from. If all strings on the system know their lengths then all the book-keeping stuff comes down to very few instructions. Enough to be totally insignificant in comparison to the vast number of memory ops strings require. Sure it'll slow things down if you run a million strings through it, but if you're that desperately concerned about string operation speed you'd have rewritten the string library in assembly (Yup, there are speed gains to be had).
Re:C Strings Vs Pascal Strings
Well, it already has a defined limit:Curufir wrote:
All that happens is the comparison loop gets a defined limit (Which is no bad thing IMHO) rather than including a check to see if a null-terminator has been found.
Code: Select all
int strcmp( const char * s1, const char * s2 )
{
while ( *s1 && *s2 )
{
++s1;
++s2;
}
return ( *s1 - *s2 );
}
How come? Sure, you could check for equal length first to find whether they could be equal, but then you'd still have to iterate through both strings, equal length or not.Chances are it would be quicker.
Instead of char * myString = malloc( strlen( otherString ) + 1 ), I'd have to allocate extra space for the bookkeeping. (Since malloc() doesn't know that it's about to reserve a string it can't do that for me.)I honestly don't see where you think this huge extra book-keeping is coming from. If all strings on the system know their lengths then all the book-keeping stuff comes down to very few instructions.
Then, instead of doing the trivial strcpy:
Code: Select all
char * strcpy( char * s1, const char * s2 )
{
char * dest = s1;
while ( ( *dest++ = *s2++ ) );
return s1;
}
Code: Select all
char * strcpy( char * s1, const char * s2 )
{
char * dest = s1;
while ( ( *dest++ = *s2++ ) );
/* assuming 1-byte length information */
*(s1 - 1) = (dest - s1);
return s1;
}
I'll grant you that. But you will be going through all kinds of headaches, both for you and for anyone using your system - because, trust me, it won't be transparent to the C user - for a speed benefit in strcat, strncat, strrchr, and strlen. Perhaps strstr, too, if you're doing some smart algorithmics. The rest won't profit, would even be that bit slower.Enough to be totally insignificant in comparison to the vast number of memory ops strings require.
And you'd break strtok, not that it's that popular.
Depends on your compiler, and I want to see it to believe it. In any case, I prefer my code to be maintainable, and I'm better at maintaining C than ASM....but if you're that desperately concerned about string operation speed you'd have rewritten the string library in assembly (Yup, there are speed gains to be had).
Every good solution is obvious once you've found it.
Re:C Strings Vs Pascal Strings
Because I can leverage the fact that I know the lengths to perform operations on dwords not bytes more efficiently.Solar wrote:
How come? Sure, you could check for equal length first to find whether they could be equal, but then you'd still have to iterate through both strings, equal length or not.
Eg (Pseudo-code)
Code: Select all
Loop for (str2.len >> 2) times
Do whatever you want with dwords
End Loop
If (str2.len | 0x2)
Do whatever you want with words
End If
If (str2.len | 0x1)
Do whatever you want with bytes
End If
Where str2.len is the shortest length of the two strings
Hmm. Well I wouldn't be using strlen anyway. Since there's variable character width allowed strlen is meaningless in terms of memory allocation. It'd make far more sense to tweak sizeof to return the true size (In bytes) for a string and use that for malloc.Instead of char * myString = malloc( strlen( otherString ) + 1 ), I'd have to allocate extra space for the bookkeeping. (Since malloc() doesn't know that it's about to reserve a string it can't do that for me.)
Ok, so it breaks the standard, but only a little ;D.
Very true, I guess I'd have to put a label on the tin saying "Non-standard C".I'll grant you that. But you will be going through all kinds of headaches, both for you and for anyone using your system - because, trust me, it won't be transparent to the C user - for a speed benefit in strcat, strncat, strrchr, and strlen.
It's not much, the C libraries are pretty good about it nowadays, but there are still some gains to be made...trust me (Ever seen your compiler use SSE for strings?).Depends on your compiler, and I want to see it to believe it.