Re:joining 2 strings
Posted: Mon May 23, 2005 12:41 pm
I have personally tried to write a "sane" string library with immutable strings semantics. And with C++, it's harder than it seems if you want to avoid lots of copies.
Now, there were two fundamental operations: substring and concatenate. Both were constant time operations. Since the strings are immutable, concatenation is really just a question of creating an objects which looks like a string, but in reality contains pointers to two original strings. Doing enough concatenations, you end up with a binary tree of small strings. Per-character indexing becomes a bit slow, but usually you just iterate the whole string (for printing) and this can be handled with an intelligent iterator.
Now, we already have a need for automatic memory management, but since there are no cycles, reference counts will solve the problem. Good sofar.
What is harder is that once you introduce substring, things get ugly. One easy option is to create a string-like object which points to the original string, but has an (offset,length) pair to fix indexing and length. The trouble is, the whole (possibly large) string stays live until you remove the last (possibly very small) substring of it. And if you have a substring of a concatenation of several substrings, you might have tons of strings that can't really be referenced at all.
This can be solved though. The solution involves replacing the original string with a concatenation of it's substrings (one for pre-selection, one for the selection, one for post-selection) in place so that all old references now point to the concatenation. We can also avoid copying at this point if we have a special purpose allocator which allows allocated regions to be split on the fly so that they can later be free'd in any order.
Finally, to avoid the semantic garbage created, we either need to make substrings reconcatenate after finding a substring of a concatenation... or create a special purpose garbage collector which traces those strings that are still live, and discards the semantic garbage. The substring reconcatenation can in worst-case get O(n) performance though, so strictly speaking that isn't an option.
In short, it becomes messy. Alternative solutions (with the same semantics: no copies, "essentially" O(1) substring/concat, no unbounded wasted space) are appreciated, since I'm going to need a string library with the above semantics for a scripting language of mine.
Now, there were two fundamental operations: substring and concatenate. Both were constant time operations. Since the strings are immutable, concatenation is really just a question of creating an objects which looks like a string, but in reality contains pointers to two original strings. Doing enough concatenations, you end up with a binary tree of small strings. Per-character indexing becomes a bit slow, but usually you just iterate the whole string (for printing) and this can be handled with an intelligent iterator.
Now, we already have a need for automatic memory management, but since there are no cycles, reference counts will solve the problem. Good sofar.
What is harder is that once you introduce substring, things get ugly. One easy option is to create a string-like object which points to the original string, but has an (offset,length) pair to fix indexing and length. The trouble is, the whole (possibly large) string stays live until you remove the last (possibly very small) substring of it. And if you have a substring of a concatenation of several substrings, you might have tons of strings that can't really be referenced at all.
This can be solved though. The solution involves replacing the original string with a concatenation of it's substrings (one for pre-selection, one for the selection, one for post-selection) in place so that all old references now point to the concatenation. We can also avoid copying at this point if we have a special purpose allocator which allows allocated regions to be split on the fly so that they can later be free'd in any order.
Finally, to avoid the semantic garbage created, we either need to make substrings reconcatenate after finding a substring of a concatenation... or create a special purpose garbage collector which traces those strings that are still live, and discards the semantic garbage. The substring reconcatenation can in worst-case get O(n) performance though, so strictly speaking that isn't an option.
In short, it becomes messy. Alternative solutions (with the same semantics: no copies, "essentially" O(1) substring/concat, no unbounded wasted space) are appreciated, since I'm going to need a string library with the above semantics for a scripting language of mine.