Upper case converter
Posted: Tue Mar 14, 2006 3:00 pm
by Candamir
I'm now dealing with some string utilities, and I want to make an upper/lower case converter. How can I know which characters actually *have* an upper/lower case and how do I get it?
Re:Upper case converter
Posted: Tue Mar 14, 2006 4:51 pm
by blip
In ASCII, bit 5 of the character code determines the case of a letter: it is clear if the letter is uppercase and set if it is lowercase.
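To make the bit-5 rule concrete, here is a minimal C sketch. The names ASCII_CASE_BIT, ascii_is_letter, ascii_to_upper and ascii_to_lower are made up for illustration, and the code assumes an ASCII character set. The bit only means "case" for the 52 letters, so a range check has to come first.

```c
/* Bit 5 (value 0x20) is the only difference between 'A' (0x41) and 'a' (0x61). */
#define ASCII_CASE_BIT 0x20

/* Nonzero if c is one of the 52 ASCII letters (assumes ASCII encoding). */
int ascii_is_letter(int c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Clear bit 5 of a letter to get its uppercase form; leave other values alone. */
int ascii_to_upper(int c)
{
    return ascii_is_letter(c) ? (c & ~ASCII_CASE_BIT) : c;
}

/* Set bit 5 of a letter to get its lowercase form; leave other values alone. */
int ascii_to_lower(int c)
{
    return ascii_is_letter(c) ? (c | ASCII_CASE_BIT) : c;
}
```

Without the range check, a character such as '{' (0x7B), which also has bit 5 set, would be turned into '[' (0x5B).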
Re:Upper case converter
Posted: Tue Mar 14, 2006 6:37 pm
by Candamir
And what happens to characters that don't have any upper/lower case counterpart?
Re:Upper case converter
Posted: Tue Mar 14, 2006 7:22 pm
by Brendan
Hi,
For ASCII:
Code: Select all
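; Convert the ASCII character in AL to uppercase; other values pass through.
toupper:
    cmp al,'a'
    jb .l1              ; below 'a': not a lowercase letter
    cmp al,'z'
    ja .l1              ; above 'z': not a lowercase letter
    sub al,'a'-'A'      ; 'a'-'A' = 0x20, so this clears bit 5
.l1:
    ret

; Convert the ASCII character in AL to lowercase; other values pass through.
tolower:
    cmp al,'Z'
    ja .l1              ; above 'Z': not an uppercase letter
    cmp al,'A'
    jb .l1              ; below 'A': not an uppercase letter
    add al,'a'-'A'      ; 'a'-'A' = 0x20, so this sets bit 5
.l1:
    ret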
Unicode is left as an exercise for the reader...
Cheers,
Brendan
Re:Upper case converter
Posted: Tue Mar 14, 2006 7:40 pm
by Candamir
ASCII conversion in C (I'm not a great fan of ASM):
Code: Select all
char lower_case(char c)
{
    if (c >= 'A' && c <= 'Z')
    {
        c = c + 'a' - 'A';
    }
    return c;
}

char upper_case(char c)
{
    if (c >= 'a' && c <= 'z')
    {
        c = c - 'a' + 'A';
    }
    return c;
}
I haven't thought about Unicode yet, and I have no Unicode table at hand, so I'll think about it later...
Re:Upper case converter
Posted: Tue Mar 14, 2006 11:43 pm
by Solar
Candamir wrote:
How can I know which characters actually *have* an upper/lower case and how do I get it?
The "char value arithmetic" shown by blip and Brendan here is indeed unable to handle anything but 7-bit ASCII, which makes it pretty useless for anyone who is not British or American (i.e., about 95% of the world population).
If you look at the C library header <ctype.h>, you will realize there is a whole family of related functions - toupper(), tolower(), isspace(), ispunct() etc. etc. All these are "locale dependent", i.e. when the program switches locales, these functions will return different values.
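As a small illustration of the standard functions, this sketch uppercases a string in place with toupper() from <ctype.h>. The helper name upper_in_place is hypothetical. The cast to unsigned char matters because passing a negative char value to the <ctype.h> functions is undefined behaviour. In the default "C" locale only a-z are converted; every other character comes back unchanged.

```c
#include <ctype.h>

/* Uppercase a NUL-terminated string in place, using the current locale. */
void upper_in_place(char *s)
{
    for (; *s != '\0'; ++s) {
        /* Cast first: toupper() expects an unsigned char value or EOF. */
        *s = (char)toupper((unsigned char)*s);
    }
}
```

Calling it on a buffer holding "abc 123!" in the "C" locale leaves "ABC 123!" in the buffer: digits, the space and the punctuation have no case and are simply passed through.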
The idea is to create a translation table for each locale. As long as we are talking about 8-bit characters, consider a char __toupper[256] containing uppercase translation codes:
Code: Select all
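int toupper( int c )
{
    /* c is EOF or an unsigned char value; anything outside the table passes through. */
    if ( c < 0 || c > 255 )
        return c;
    return (unsigned char) __toupper[ c ];
}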
Another char-array __tolower and a third array containing status flags for each character value (Is it a whitespace? Is it printable? etc.), and you're set. When a program changes locale, you have to reload the three translation tables with the values for the new locale, which are usually read from disk.
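The three-table layout could be sketched roughly as follows. All names here (to_upper_table, flags_table, load_ascii_locale, the CT_* flags, the my_* functions) are hypothetical, and the loader fills in a plain ASCII locale in code, assuming an ASCII character set, where a real library would read the values for the selected locale from disk.

```c
/* Status flags stored per character value (illustrative subset). */
enum { CT_UPPER = 1, CT_LOWER = 2, CT_SPACE = 4, CT_PRINT = 8 };

static unsigned char to_upper_table[256];
static unsigned char to_lower_table[256];
static unsigned char flags_table[256];

/* Fill all three tables for a plain ASCII locale.
   Switching locales would mean refilling the same three tables. */
void load_ascii_locale(void)
{
    const char *ws = " \t\n\v\f\r";
    int c;

    for (c = 0; c < 256; ++c) {
        to_upper_table[c] = (unsigned char)c;   /* identity by default */
        to_lower_table[c] = (unsigned char)c;
        flags_table[c] = (c >= 0x20 && c <= 0x7E) ? CT_PRINT : 0;
    }
    for (c = 'a'; c <= 'z'; ++c) {
        to_upper_table[c] = (unsigned char)(c - 'a' + 'A');
        flags_table[c] |= CT_LOWER;
    }
    for (c = 'A'; c <= 'Z'; ++c) {
        to_lower_table[c] = (unsigned char)(c - 'A' + 'a');
        flags_table[c] |= CT_UPPER;
    }
    for (; *ws != '\0'; ++ws)
        flags_table[(unsigned char)*ws] |= CT_SPACE;
}

/* Each function is a bounds check plus one table lookup. */
int my_toupper(int c) { return (c >= 0 && c < 256) ? to_upper_table[c] : c; }
int my_tolower(int c) { return (c >= 0 && c < 256) ? to_lower_table[c] : c; }
int my_isspace(int c) { return (c >= 0 && c < 256) ? (flags_table[c] & CT_SPACE) : 0; }
int my_isupper(int c) { return (c >= 0 && c < 256) ? (flags_table[c] & CT_UPPER) : 0; }
```

The point of the design is that the functions themselves never change: only the table contents differ between locales, so characters without a case counterpart simply map to themselves.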
Two issues here:
- When you're doing Unicode, using a "flat" translation table would be wasting lots of memory. You might want to implement a "folded" translation table, kind of like the page tables and directories used in virtual memory management.
- Some alphabets contain characters that expand when converted to upper-/lowercase. A common example is the German 'ß', which expands to 'SS' in uppercase. Standard C has no way of handling this, so you had better come up with some OS API function that does.
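Both issues can be sketched in a few lines of C. This is only an illustration under stated assumptions: the names (case_block, upper_dir, init_upper_tables, unicode_toupper, unicode_upper_string) are invented, strings are plain arrays of code points, and the tables hold a tiny hand-filled sample (ASCII, most of Latin-1, basic Cyrillic), not real Unicode data. The directory is indexed by the upper bits of the code point and holds NULL for blocks without any mappings, much like an unmapped page.

```c
#include <stddef.h>

#define BLOCK_BITS  8
#define BLOCK_SIZE  (1u << BLOCK_BITS)         /* 256 code points per block */
#define BLOCK_COUNT (0x110000u >> BLOCK_BITS)  /* covers U+0000..U+10FFFF */

/* Second level: offset to add to a code point to uppercase it (0 = no change). */
typedef struct {
    long delta[BLOCK_SIZE];
} case_block;

static case_block latin_block;     /* U+0000..U+00FF */
static case_block cyrillic_block;  /* U+0400..U+04FF */

/* First level: one pointer per block; NULL means "nothing to convert here". */
static const case_block *upper_dir[BLOCK_COUNT];

/* Fill in the sample data; a real system would load this per locale. */
void init_upper_tables(void)
{
    unsigned long cp;

    for (cp = 0x61; cp <= 0x7A; ++cp)          /* a..z */
        latin_block.delta[cp] = -0x20;
    for (cp = 0xE0; cp <= 0xFE; ++cp)          /* Latin-1 lowercase letters */
        if (cp != 0xF7)                        /* U+00F7 is the division sign */
            latin_block.delta[cp] = -0x20;
    for (cp = 0x430; cp <= 0x44F; ++cp)        /* basic Cyrillic lowercase */
        cyrillic_block.delta[cp & (BLOCK_SIZE - 1)] = -0x20;

    upper_dir[0x00] = &latin_block;
    upper_dir[0x04] = &cyrillic_block;
}

/* One-to-one uppercase mapping via the two-level table. */
unsigned long unicode_toupper(unsigned long cp)
{
    const case_block *block;

    if (cp >= 0x110000u)
        return cp;                             /* not a Unicode code point */
    block = upper_dir[cp >> BLOCK_BITS];
    if (block == NULL)
        return cp;                             /* whole block has no case */
    return (unsigned long)((long)cp + block->delta[cp & (BLOCK_SIZE - 1)]);
}

/* String-level conversion that can grow the text: U+00DF becomes "SS".
   Returns the length the result needs; output beyond cap is not written. */
size_t unicode_upper_string(const unsigned long *in, size_t n,
                            unsigned long *out, size_t cap)
{
    size_t i, used = 0;

    for (i = 0; i < n; ++i) {
        if (in[i] == 0xDF) {                   /* sharp s expands to two letters */
            if (used < cap) out[used] = 0x53;  /* 'S' */
            ++used;
            if (used < cap) out[used] = 0x53;  /* 'S' */
            ++used;
        } else {
            if (used < cap) out[used] = unicode_toupper(in[i]);
            ++used;
        }
    }
    return used;
}
```

With only two blocks populated, the directory costs a few thousand pointers instead of one entry per code point. The expansion case shows why the interface has to work on strings with a separate output buffer: the six code points of "straße" come back as the seven code points of "STRASSE".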
I hope this helps.