Reading strings in-between markers
Hi.
If you want to read a string in-between two markers into a buffer, i.e.:
[MARKER]this is the string i want[MARKER]
Is it possible to do this (in C) with a library function - sscanf() doesn't seem to work so well on strings such as this - or would I have to use pointers?
Thanks.
Re:Reading strings in-between markers
I can't think of anything off-hand which will do this in one line of code.
I would do this:
1. Start in state 0 (int state = 0)
2. Scan through the string character by character
3. If in state 0:
a. If you see a [, enter state 1
b. Else, add the character to a buffer, or do whatever you want with it
4. If in state 1:
a. If you see a ], enter state 0
b. Else, ignore the character
This is a very simple state machine. The machine is in state 0 when it's looking in between two markers, or outside a marker (you can't tell the difference unless you differentiate between "open markers" and "close markers", like HTML/XML). The machine is in state 1 when it's looking inside a marker.
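The steps above can be sketched in C roughly as follows. This is only a minimal illustration: the function name strip_markers and its parameters are made up for the example, and it silently truncates if the output buffer is too small.

```c
#include <stddef.h>

/* Copies `in` to `out`, dropping everything from '[' to ']' inclusive.
   Writes at most outsz - 1 characters plus a terminating '\0' and
   returns the number of characters written (excess input is dropped). */
size_t strip_markers(const char *in, char *out, size_t outsz)
{
    int state = 0;      /* 0: outside a marker, 1: inside a marker */
    size_t n = 0;

    if (outsz == 0)
        return 0;

    for (; *in != '\0'; in++) {
        if (state == 0) {
            if (*in == '[')
                state = 1;          /* marker starts: stop copying */
            else if (n + 1 < outsz)
                out[n++] = *in;     /* ordinary text: keep it */
        } else if (*in == ']') {
            state = 0;              /* marker ends: resume copying */
        }
    }
    out[n] = '\0';
    return n;
}
```

Called on "[MARKER]this is the string i want[MARKER]" with a large enough buffer, this leaves "this is the string i want" in out. As noted above, it cannot tell text between two markers from text outside them.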
Re:Reading strings in-between markers
You may be able to use strtok() to parse the string. For example, to parse a line from a semicolon-delimited database, you could write something like:
(This code has not been tested, and likely contains errors, but it should correctly reflect how it works. Comments and corrections welcome.)
The strtok() function keeps its last position in a static pointer, and if the first argument (str1) is NULL, it uses that pointer to continue parsing the same string.
Code: Select all
#include <string.h>
// ...
char *db_record, *token;
char field[BUFSIZE];
// ...
// db_record must point to writable memory: strtok() overwrites each delimiter with '\0'
token = strtok(db_record, ";");
while (NULL != token)
{
    // strtok() returns a pointer to the start of the current token
    strncpy(field, token, BUFSIZE - 1);
    field[BUFSIZE - 1] = '\0';
    // do whatever you're doing to the data in field[]
    // ...
    token = strtok(NULL, ";");
}
- Pype.Clicker
- Member
- Posts: 5964
- Joined: Wed Oct 18, 2006 2:31 am
- Location: In a galaxy, far, far away
- Contact:
Re:Reading strings in-between markers
I would strongly discourage the use of strtok(). It tends to behave in a very strange way and becomes completely useless as soon as you enter multithreaded programming.
Instead, I would advocate using sscanf() (very powerful when used accurately).
Code: Select all
// %m (POSIX.1-2008; older glibc spelled it %a) allocates target, a char *; free() it afterwards
if (sscanf(buffer, "[MARKER]%m[^[][MARKER] ", &target) == 1) // you have your string in target.
will do what you want, provided that you have no nested markers and that you know what the marker is a priori.
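The sscanf() call above relies on an allocating conversion that is not part of ISO C. A variant that uses only standard conversions might look like the sketch below. The function name read_between_markers and the 64-byte buffer are assumptions for the example; the %n conversion is used to confirm that the closing marker was actually matched, since the return value of sscanf() alone does not show that.

```c
#include <stdio.h>

/* Returns 1 and fills `target` if `buffer` starts with
   [MARKER]text[MARKER], where text is 1 to 63 characters long and
   contains no '['. Returns 0 otherwise. */
int read_between_markers(const char *buffer, char target[64])
{
    int end = -1;   /* set by %n only if the closing marker matched */

    if (sscanf(buffer, "[MARKER]%63[^[][MARKER]%n", target, &end) != 1)
        return 0;
    return end != -1;
}
```

Note that an empty string between the two markers is rejected, because a scanset conversion must match at least one character.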
Re:Reading strings in-between markers
I'm not aware of strtok having tendencies toward strange behavior, although I believe the second argument to strtok is a delimiter set, not a delimiter string, so passing in something like "[MARKER]" would stop at every "[", "M", "A", "R", "K", "E", and "]". And I don't think the delimiters are returned, so a string such as:
char str[] = "[MARKER]A Man Walked Off A Sidewalk[MARKER]" would tokenize with strtok(str, "[MARKER]") as:
Code: Select all
" "
"an Walked Off "
" Sidewalk"
The results are strange, but I wouldn't describe that as strange behavior. It's behaving exactly as it is defined. It would actually be better to call it in a manner that would sort of automate what Tim was talking about:
strtok(str, "[");
followed by
strtok(NULL, "]");
and so on.
A well-written version of the standard library should work fine with multi-threading, as well. Much more dangerous is using strtok in a loop that calls a function that uses strtok.
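One detail worth keeping in mind when alternating delimiter sets like this: strtok() skips any leading delimiters, so for a string that begins with '[' the first call has to use "]" to consume the opening marker. A rough sketch under that assumption follows; the function name between_markers is invented for the example.

```c
#include <string.h>

/* Returns a pointer to the text that follows the first bracketed marker,
   up to the next '[' (or end of string), or NULL if there is none.
   `str` is modified in place and must begin with '['.
   Limitations: the closing marker is not verified, and empty text between
   markers is mishandled because strtok() skips leading delimiters. */
char *between_markers(char *str)
{
    if (str[0] != '[')
        return NULL;
    if (strtok(str, "]") == NULL)   /* consumes "[MARKER" */
        return NULL;
    return strtok(NULL, "[");       /* text up to the next '[' */
}
```

The string passed in must be a writable array, not a string literal, since strtok() writes '\0' over the delimiters it finds.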
Re:Reading strings in-between markers
I thought the problem was that strtok stored an internal pointer to the string to keep track of the tokens. If you have multiple threads that call strtok, then the internal pointer will be changed by the multiple threads.
The GNU library, among others, supports a strtok_r() function that takes an extra char ** argument in which the parsing position is stored.
Joel's right tho. It wouldn't split the text based on tokens.
- Nick
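For reference, a small sketch of how strtok_r() is typically used, here on a semicolon-delimited record like the one discussed earlier in the thread. The function name split_fields is hypothetical, and strtok_r() is POSIX rather than ISO C, so it may not be available everywhere.

```c
#define _POSIX_C_SOURCE 200809L   /* strtok_r() is POSIX, not ISO C */
#include <string.h>

/* Splits `record` in place on ';' and stores up to `max` token pointers
   in `fields`. Returns the number of fields stored. Empty fields are
   skipped, because consecutive delimiters are treated as one. */
size_t split_fields(char *record, char *fields[], size_t max)
{
    char *save = NULL;   /* parser position, owned by the caller */
    size_t n = 0;

    for (char *tok = strtok_r(record, ";", &save);
         tok != NULL && n < max;
         tok = strtok_r(NULL, ";", &save))
        fields[n++] = tok;
    return n;
}
```

Because the position lives in the caller's save variable instead of a hidden static one, two threads (or two nested loops) can each tokenize their own string without interfering.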
Re:Reading strings in-between markers
I think most compilers that support multi-threading come with a thread-safe version of the standard library, and that version should implement strtok in a thread-safe manner. Microsoft's multi-threaded library is supposed to make strtok thread-safe, at least.
- Pype.Clicker
Re:Reading strings in-between markers
Another weirdness of strtok comes from the fact that it manipulates the input string and puts '\0' wherever it feels like it (the overwritten delimiter is not kept), which can be very confusing from time to time.
Quoting "man strtok" :
BUGS
Never use these functions. If you do, note that:
These functions modify their first argument.
These functions cannot be used on constant strings.
The identity of the delimiting character is lost.
The strtok() function uses a static buffer while
parsing, so it's not thread safe. Use strtok_r() if
this matters to you.
Re:Reading strings in-between markers
Again, strtok is defined to put '\0' wherever it finds a delimiter, because it returns a pointer to the token and not a copied version of it.
I'm not sure what the actual standard says about strtok. This is from MSDN:
Warning Each of these functions uses a static variable for parsing the string into tokens. If multiple or simultaneous calls are made to the same function, a high potential for data corruption and inaccurate results exists. Therefore, do not attempt to call the same function simultaneously for different strings and be aware of calling one of these function from within a loop where another routine may be called that uses the same function. However, calling this function simultaneously from multiple threads does not have undesirable effects.
Re:Reading strings in-between markers
Manual char scanning is INCREDIBLY fast, and would do the trick best, with no need for external routines. You wouldn't scan for the strings, but simply scan through for '[', then if found, handle the "MARKER]" and whatnot.
e.g.
char *pstr = stringbuffer; // stringbuffer: your NUL-terminated input string
while (*pstr) {
    if (*pstr++ == '[') {
        // pstr now points just past the '['; check for "MARKER]" here
    }
}
That will do it in no time. For more complex things, such as when you have many different '['-type cases, you could use a switch statement on *pstr++. For the most complex stuff, you can use a jump (function) table with 256 entries (one per 8-bit character; Unicode needs a larger table). It will get through it in no time!
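To make the "check for marker" step concrete, here is one possible way to fill it in. The function name extract_marked, the fixed marker text, and the use of strncmp() for the comparison are choices made for this sketch only.

```c
#include <stddef.h>
#include <string.h>

/* Copies the text between the first two "[MARKER]" tags in `s` into `out`.
   Returns 1 on success, 0 if there are fewer than two markers or the text
   does not fit in `outsz` bytes (including the terminating '\0'). */
int extract_marked(const char *s, char *out, size_t outsz)
{
    static const char tail[] = "MARKER]";   /* what must follow a '[' */
    const size_t tail_len = sizeof tail - 1;
    const char *start = NULL;

    while (*s) {
        if (*s++ == '[' && strncmp(s, tail, tail_len) == 0) {
            s += tail_len;                  /* skip past the marker */
            if (start == NULL) {
                start = s;                  /* first marker: text begins here */
            } else {
                /* second marker: text ends where that marker began */
                size_t len = (size_t)(s - start) - (tail_len + 1);
                if (len >= outsz)
                    return 0;
                memcpy(out, start, len);
                out[len] = '\0';
                return 1;
            }
        }
    }
    return 0;
}
```

Unlike the sscanf() approach, the wanted text may itself contain a '[' here, as long as it is not followed by "MARKER]".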
Re:Reading strings in-between markers
I think some compilers will optimise switch statements into a jump table if the case values are dense enough. I noticed a few switch statements compiled this way in Sierra's AGI interpreter (in code that was obviously written with a compiler).
- Nick