Upper case converter

Posted: Tue Mar 14, 2006 3:00 pm
by Candamir
I'm now dealing with some string utilities, and I want to make an upper/lower case converter. How can I know which characters actually *have* an upper/lower case and how do I get it?

Re:Upper case converter

Posted: Tue Mar 14, 2006 4:51 pm
by blip
Bit 5 of the character determines case in ASCII: it is clear if the letter is uppercase and set if it is lowercase.
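
As a rough illustration of the bit-5 rule, here is a minimal C sketch. The names CASE_BIT, is_ascii_letter, ascii_upper and ascii_lower are made up for this example, and it assumes an ASCII execution character set. The range check matters because bit 5 only means "case" for the 52 letters.

```c
/* Bit 5 (value 0x20) separates 'A' (0x41) from 'a' (0x61) in ASCII. */
#define CASE_BIT 0x20

/* Hypothetical helper: true only for the 52 ASCII letters. */
static int is_ascii_letter(int c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Clear bit 5 to get the uppercase letter; other values pass through. */
int ascii_upper(int c)
{
    return is_ascii_letter(c) ? (c & ~CASE_BIT) : c;
}

/* Set bit 5 to get the lowercase letter; other values pass through. */
int ascii_lower(int c)
{
    return is_ascii_letter(c) ? (c | CASE_BIT) : c;
}
```

Without the check, flipping the bit mangles other characters: '@' (0x40) with bit 5 set becomes '`' (0x60), and '1' (0x31) with bit 5 cleared becomes the control code 0x11.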

Re:Upper case converter

Posted: Tue Mar 14, 2006 6:37 pm
by Candamir
And what happens to characters that don't have an upper/lower case counterpart?

Re:Upper case converter

Posted: Tue Mar 14, 2006 7:22 pm
by Brendan
Hi,

For ASCII:

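; Input and output in AL. The .l1 labels are local labels (NASM syntax),
; so each one belongs to the routine it appears in.
toupper:
   cmp al,'a'          ; below 'a': not a lowercase letter
   jb .l1
   cmp al,'z'          ; above 'z': not a lowercase letter
   ja .l1
   sub al,'a'-'A'      ; subtract 32, which clears bit 5
.l1:
   ret

tolower:
   cmp al,'Z'          ; above 'Z': not an uppercase letter
   ja .l1
   cmp al,'A'          ; below 'A': not an uppercase letter
   jb .l1
   add al,'a'-'A'      ; add 32, which sets bit 5
.l1:
   ret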
Unicode is left as an exercise for the reader... :o


Cheers,

Brendan

Re:Upper case converter

Posted: Tue Mar 14, 2006 7:40 pm
by Candamir
ASCII conversion in C (I'm not a great fan of ASM ;)):

char lower_case(char c)
{
   if (c >= 'A' && c <= 'Z')
   {
      c = c + 'a' - 'A';
   }
   return c;
}

char upper_case(char c)
{
   if (c >= 'a' && c <= 'z')
   {
      c = c - 'a' + 'A';
   }
   return c;
}
For now, I haven't thought about Unicode, and I have no Unicode table at hand, so I'll think about it...

Re:Upper case converter

Posted: Tue Mar 14, 2006 11:43 pm
by Solar
Candamir wrote: How can I know which characters actually *have* an upper/lower case and how do I get it?
The "char value arithmetic" shown by blip and Brendan here cannot handle anything but 7-bit ASCII, which makes it pretty useless for anyone who is not British or American (i.e., about 95% of the world population).

If you look at the C library header <ctype.h>, you will realize there is a whole family of related functions - toupper(), tolower(), isspace(), ispunct() etc. etc. All these are "locale dependent", i.e. when the program switches locales, these functions will return different values.
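
To make the locale dependence concrete, here is a small sketch using only standard <ctype.h> and <locale.h> calls. The helper name upper_in_locale is made up, and locale names other than "C" are platform-specific, so something like "de_DE.ISO-8859-1" is only an assumption about what a given system might have installed.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: uppercase one byte under the named LC_CTYPE locale.
 * Returns -1 if the locale is not available on this system.
 * Simplification: it switches back to "C" afterwards instead of restoring
 * whatever locale was active before the call.
 */
int upper_in_locale(const char *locale_name, unsigned char c)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;                  /* locale not installed or unknown */

    int result = toupper(c);        /* consults the current locale */

    setlocale(LC_CTYPE, "C");
    return result;
}
```

In the "C" locale, upper_in_locale("C", 0xE4) returns 0xE4 unchanged, because only a-z count as lowercase there. Under an ISO-8859-1 locale, if one is installed, the same byte would be expected to map to 0xC4.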

The idea is to create a translation table for each locale. As long as we are talking about 8-bit characters, consider a char __toupper[256] containing uppercase translation codes:

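#include <stdio.h>   /* EOF */

int toupper( int c )
{
    /* c is EOF or an unsigned char value; EOF must not index the table. */
    if ( c == EOF ) return EOF;
    return (unsigned char)__toupper[ (unsigned char)c ];
}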
Another char-array __tolower and a third array containing status flags for each character value (Is it a whitespace? Is it printable? etc.), and you're set. When a program changes locale, you have to reload the three translation tables with the values for the new locale, which are usually read from disk.
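
The three-table scheme could be sketched roughly as follows. All names here (upper_table, flag_table, load_c_locale, load_latin1_locale, my_toupper and so on) are invented for illustration. The "load" functions fill the tables in code as a stand-in for reading locale data from disk, and the Latin-1 data is deliberately partial.

```c
#include <limits.h>

/* Hypothetical flag bits for the third "status" table. */
enum { CT_UPPER = 1, CT_LOWER = 2, CT_SPACE = 4 };

static unsigned char upper_table[UCHAR_MAX + 1];
static unsigned char lower_table[UCHAR_MAX + 1];
static unsigned char flag_table[UCHAR_MAX + 1];

/* Record that lo and up form a lowercase/uppercase pair. */
static void add_pair(int lo, int up)
{
    upper_table[lo] = (unsigned char)up;
    lower_table[up] = (unsigned char)lo;
    flag_table[lo] |= CT_LOWER;
    flag_table[up] |= CT_UPPER;
}

/* Fill the tables for the plain "C" locale (assumes an ASCII-compatible charset). */
void load_c_locale(void)
{
    static const char spaces[] = " \t\n\v\f\r";

    for (int c = 0; c <= UCHAR_MAX; ++c) {
        upper_table[c] = lower_table[c] = (unsigned char)c;  /* identity */
        flag_table[c] = 0;
    }
    for (int c = 'a'; c <= 'z'; ++c)
        add_pair(c, c - 'a' + 'A');
    for (int i = 0; spaces[i] != '\0'; ++i)
        flag_table[(unsigned char)spaces[i]] |= CT_SPACE;
}

/* Stand-in for loading an ISO-8859-1 locale: adds the accented letter pairs. */
void load_latin1_locale(void)
{
    load_c_locale();
    for (int c = 0xE0; c <= 0xFE; ++c)
        if (c != 0xF7)              /* 0xF7 is the division sign, not a letter */
            add_pair(c, c - 0x20);
    /* 0xDF and 0xFF have no uppercase form within Latin-1 and are left alone. */
}

/* Lookups: anything outside 0..UCHAR_MAX (such as EOF) is passed through. */
int my_toupper(int c) { return (c < 0 || c > UCHAR_MAX) ? c : upper_table[c]; }
int my_tolower(int c) { return (c < 0 || c > UCHAR_MAX) ? c : lower_table[c]; }
int my_isspace(int c) { return c >= 0 && c <= UCHAR_MAX && (flag_table[c] & CT_SPACE) != 0; }
```

After load_c_locale(), my_toupper(0xE4) leaves the byte alone; after load_latin1_locale(), the same call yields 0xC4. The lookup code itself never changes, only the table contents do.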

Two issues here:
  • When you're doing Unicode, using a "flat" translation table would be wasting lots of memory. You might want to implement a "folded" translation table, kind of like the page tables and directories used in virtual memory management.
  • Some alphabets contain characters that expand when converted to upper-/lowercase. A common example is the German 'ß', which expands to 'SS' in uppercase. Standard C has no way of handling this, so you better come up with some OS API function that does. ;-)
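
Both issues can be sketched together in C. This is only one possible layout, with made-up names (case_page, upper_dir, upper_simple, upper_full) and deliberately partial sample data covering just the first 256 code points. A directory of page pointers plays the role of the page directory, and a NULL entry means "nothing in this page changes case", so empty ranges cost one pointer instead of a full page.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_BITS 8
#define PAGE_SIZE (1u << PAGE_BITS)
#define DIR_SIZE  (0x110000u >> PAGE_BITS)   /* covers U+0000..U+10FFFF */

/* One page: signed offset from each code point to its uppercase form. */
typedef struct { int32_t delta[PAGE_SIZE]; } case_page;

static case_page page0_upper;                 /* sample data for U+0000..U+00FF */
static const case_page *upper_dir[DIR_SIZE];  /* NULL = page has no mappings */

/* Fill in the partial sample data; real data would come from locale files. */
void init_upper_tables(void)
{
    for (uint32_t c = 'a'; c <= 'z'; ++c)
        page0_upper.delta[c] = -32;
    for (uint32_t c = 0xE0; c <= 0xFE; ++c)
        if (c != 0xF7)                        /* division sign has no case */
            page0_upper.delta[c] = -32;
    upper_dir[0] = &page0_upper;
}

/* One-to-one mapping via the two-level table; unmapped code points pass through. */
uint32_t upper_simple(uint32_t cp)
{
    if (cp >= 0x110000u)
        return cp;
    const case_page *page = upper_dir[cp >> PAGE_BITS];
    if (page == NULL)
        return cp;
    return (uint32_t)((int32_t)cp + page->delta[cp & (PAGE_SIZE - 1)]);
}

/*
 * Expansion-aware variant: writes the uppercase form to out (room for three
 * code points) and returns how many were written. Only the 'ß' -> "SS" case
 * from the post is special-cased here.
 */
size_t upper_full(uint32_t cp, uint32_t out[3])
{
    if (cp == 0x00DF) {                       /* U+00DF LATIN SMALL LETTER SHARP S */
        out[0] = 'S';
        out[1] = 'S';
        return 2;
    }
    out[0] = upper_simple(cp);
    return 1;
}
```

The point of upper_full returning a count is that the caller can no longer assume the output string has the same length as the input, which is exactly what the int-to-int toupper() interface cannot express.
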
I hope this helps.