EATextUtil is a collection of string utilities that are of a higher-level nature than those found in the C runtime library or our EAString module. EASTL/string.h and EASTL/algorithm.h contain C++/STL variations of functions that are similar to the C runtime library functions, but which are generally more powerful and flexible than the C functions while usually being more efficient.
All functions in EATextUtil are present in char8_t versions and char16_t versions, and assume UTF8 and UCS2 Unicode encoding respectively. Recall that these encodings are backward-compatible with ASCII and so will work for most or all of the text that you give them.
Each of the functions comes in a version that doesn't allocate any memory but instead uses user-supplied memory in-place. In a few cases, there are String versions that will allocate memory if they need to increase the size of the user-supplied string.
Here's a brief summary of the functions currently found in EATextUtil. We use char_t in the declarations below to refer to both char8_t and char16_t; there are thus two versions of each function.
bool WildcardMatch(const char_t* pString, const char_t* pPattern, bool bCaseSensitive);
bool ParseDelimitedText(const char_t* pText, const char_t* pTextEnd, char_t cDelimiter,
const char_t*& pToken, const char_t*& pTokenEnd, const char_t** ppNewText);
const char_t* GetTextLine(const char_t* pText, const char_t* pTextEnd, const char_t** ppNewText);
bool SplitTokenDelimited(const char_t* pSource, size_t nSourceLength, char_t cDelimiter,
char_t* pToken, size_t nTokenLength, const char_t** ppNewSource = NULL);
bool SplitTokenSeparated(const char_t* pSource, size_t nSourceLength, char_t cDelimiter,
char_t* pToken, size_t nTokenLength, const char_t** ppNewSource = NULL);
void ConvertBinaryDataToASCIIArray(const void* pBinaryData, size_t nBinaryDataLength, char_t* pASCIIArray);
bool ConvertASCIIArrayToBinaryData(const char_t* pASCIIArray, size_t nASCIIArrayLength, void* pBinaryData);
int BoyerMooreSearch(const char8_t* pPattern, int nPatternLength, const char8_t* pSearchString, int nSearchStringLength,
int* pPatternBuffer1, int* pPatternBuffer2, int* pAlphabetBuffer, int nAlphabetBufferSize);
We will proceed to address each of the above functions.
bool WildcardMatch(const char16_t* pString, const char16_t* pPattern, bool bCaseSensitive);
bool WildcardMatch(const char8_t* pString, const char8_t* pPattern, bool bCaseSensitive);
These functions match source strings to wildcard patterns like those used in file specifications. '*' in the pattern means match zero or more consecutive source characters. '?' in the pattern means match exactly one source character. Multiple * and ? characters may be used. Two consecutive * characters are treated as if they were one. Here are some examples:
| Source | Pattern | Result |
| abcde | *e | true |
| abcde | *f | false |
| abcde | ???de | true |
| abcde | ????g | false |
| abcde | *c?? | true |
| abcde | *e?? | false |
| abcde | *???? | true |
| abcde | bcdef | false |
| abcde | *????? | true |
Example usage:
bool result = WildcardMatch("Hello world", "hello*", false); // result becomes true
bool result = WildcardMatch("Hello world", "hello?", false); // result becomes false
bool result = WildcardMatch("Hello world", "Hello??orld", true); // result becomes true
bool result = WildcardMatch("Hello world", "*Hello*world*", true); // result becomes true
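The matching rules above can be sketched in a few lines. The following is a minimal ASCII-only illustration and not the EATextUtil implementation; the name WildcardMatchSketch is hypothetical, and the backtracking strategy is just one common way to handle '*'.

```cpp
#include <cassert>
#include <cctype>

// Hypothetical helper, not part of EATextUtil: a minimal ASCII-only wildcard matcher.
// '*' matches zero or more characters, '?' matches exactly one character.
bool WildcardMatchSketch(const char* pString, const char* pPattern, bool bCaseSensitive)
{
    assert(pString && pPattern);

    const char* pStarPattern = nullptr; // Pattern position just after the most recent '*'.
    const char* pStarString  = nullptr; // String position where that '*' started matching.

    while (*pString)
    {
        const bool bCharsEqual = bCaseSensitive
            ? (*pPattern == *pString)
            : (std::tolower((unsigned char)*pPattern) == std::tolower((unsigned char)*pString));

        if (*pPattern == '*')
        {
            // Remember the star and first let it match zero characters.
            pStarPattern = ++pPattern;
            pStarString  = pString;
        }
        else if (*pPattern == '?' || bCharsEqual)
        {
            ++pPattern;
            ++pString;
        }
        else if (pStarPattern)
        {
            // Mismatch: backtrack and let the last '*' absorb one more character.
            pPattern = pStarPattern;
            pString  = ++pStarString;
        }
        else
            return false;
    }

    // The string is exhausted; only trailing '*' characters may remain in the pattern.
    while (*pPattern == '*')
        ++pPattern;

    return *pPattern == '\0';
}
```

Because consecutive '*' characters simply overwrite the remembered star position, this sketch treats them as one, in line with the description above.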
bool ParseDelimitedText(const char16_t* pText, const char16_t* pTextEnd, char16_t cDelimiter,
const char16_t*& pToken, const char16_t*& pTokenEnd, const char16_t** ppNewText);
bool ParseDelimitedText(const char8_t* pText, const char8_t* pTextEnd, char8_t cDelimiter,
const char8_t*& pToken, const char8_t*& pTokenEnd, const char8_t** ppNewText);
Given a line of text such as this:
342.5, "This is a string", test, "This is a string, with a comma"
ParseDelimitedText parses it into separate fields, like this:
342.5
This is a string
test
This is a string, with a comma
ParseDelimitedText lets you dynamically specify the delimiter. The delimiter can be any char (e.g. space, tab, semicolon) except the quote char itself, which is reserved for the purpose of grouping. See the source code comments for more details. However, in the case of text that is UTF8-encoded, you need to make sure the delimiter char is an ASCII value (less than 128), so as not to collide with the bytes of multi-byte UTF8 encoded chars.
The input is a pointer to the text and the text length. For ASCII, MBCS, and UTF8 text, the length is the number of bytes (chars). For UTF16 (Unicode) text, it is the number of 16-bit characters, not the number of bytes; each UTF16 character occupies two bytes. The input nTextLength can be -1 (kLengthNull) to indicate that the string is null-terminated.
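To make the iterative calling pattern concrete, here is a simplified stand-alone sketch of a quote-aware field parser that uses the same pointer-pair interface. It is not the EATextUtil implementation: ParseFieldSketch and SplitLineSketch are hypothetical names, and the sketch assumes 8-bit text, a non-space delimiter, and space as the only whitespace that gets trimmed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for an iterative delimited-text parser.
// On success, [pToken, pTokenEnd) bounds the field and *ppNewText is where to resume.
bool ParseFieldSketch(const char* pText, const char* pTextEnd, char cDelimiter,
                      const char*& pToken, const char*& pTokenEnd, const char** ppNewText)
{
    assert(pText && pTextEnd && cDelimiter != '"' && cDelimiter != ' ');

    while (pText < pTextEnd && *pText == ' ')   // Skip leading spaces.
        ++pText;
    if (pText == pTextEnd)
        return false;                           // Nothing left to parse.

    const char* p = pText;
    if (*p == '"')
    {
        // Quoted field: runs to the closing quote; delimiters inside are literal text.
        pToken = ++p;
        while (p < pTextEnd && *p != '"')
            ++p;
        pTokenEnd = p;
        while (p < pTextEnd && *p != cDelimiter) // Skip the closing quote and trailing spaces.
            ++p;
    }
    else
    {
        // Unquoted field: runs to the next delimiter, with trailing spaces trimmed.
        pToken = p;
        while (p < pTextEnd && *p != cDelimiter)
            ++p;
        pTokenEnd = p;
        while (pTokenEnd > pToken && pTokenEnd[-1] == ' ')
            --pTokenEnd;
    }

    if (ppNewText)
        *ppNewText = (p < pTextEnd) ? p + 1 : p; // Step over the delimiter, if any.
    return true;
}

// Collects all fields of a line by calling the parser until it returns false.
std::vector<std::string> SplitLineSketch(const std::string& line, char cDelimiter)
{
    std::vector<std::string> fields;
    const char* pText     = line.data();
    const char* pTextEnd  = pText + line.size();
    const char* pToken    = nullptr;
    const char* pTokenEnd = nullptr;

    while (ParseFieldSketch(pText, pTextEnd, cDelimiter, pToken, pTokenEnd, &pText))
        fields.emplace_back(pToken, pTokenEnd);
    return fields;
}
```

Returning pointer pairs rather than copies means the parser itself never allocates; only the convenience wrapper builds strings.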
Example behaviour for string array version (which you can extrapolate to the iterative version):
| Input string | MaxResults | Delimiter | Return value | Output array size | Output array value |
| "" | -1 | ' ' | 0 | 0 | "" |
| "000 111" | -1 | ' ' | 2 | 2 | "000" "111" |
| "000 111 222 333 444 \"555 555\" 666" | -1 | ' ' | 7 | 7 | "000" "111" "222" "333" "444" "555 555" "666" |
| " 000 111 222 333 " | -1 | ' ' | 4 | 4 | "000" "111" "222" "333" |
| " 000 111 222 333 " | -1 | ' ' | 2 | 2 | "000" "111" |
| "" | -1 | ',' | 0 | 0 | "" |
| "000,111" | -1 | ',' | 2 | 2 | "000" "111" |
| "000, 111 , 222 333 ,444 \"555, 555 \" 666" | -1 | ',' | 4 | 4 | "000" "111" "222 333" "444 \"555, 555 \" 666" |
| " ,, 000 ,111, 222, 333 , " | -1 | ',' | 6 | 6 | "" "" "000" "111" "222" "333" |
| " ,, 000 ,111, 222, 333 , " | 2 | ',' | 0 | 0 | "" "" |
void ConvertBinaryDataToASCIIArray(const void* pBinaryData, size_t nBinaryDataLength, char16_t* pASCIIArray);
void ConvertBinaryDataToASCIIArray(const void* pBinaryData, size_t nBinaryDataLength, char8_t* pASCIIArray);
bool ConvertASCIIArrayToBinaryData(const char8_t* pASCIIArray, size_t nASCIIArrayLength, void* pBinaryData);
bool ConvertASCIIArrayToBinaryData(const char16_t* pASCIIArray, size_t nASCIIArrayLength, void* pBinaryData);
The first two functions convert an array of binary data into an encoded ASCII (hexadecimal) format that can later be converted back to binary. You might want to do this if you are trying to embed binary data into a text file (e.g. .ini file) and need a way to encode the binary data as text.
The second two functions take an ASCII string of text and convert it to binary data. This is the reverse of the ConvertBinaryDataToASCIIArray functions. If an invalid hexadecimal character is encountered, it is treated as if it were a '0' character. These functions return true if the input was entirely valid hexadecimal data.
Example usage:
const uint8_t data[4] = { 0x12, 0x34, 0x56, 0x78 };
char8_t ascii[9]; // 8 hex digits, plus one spare char in case a terminating 0 is written
ConvertBinaryDataToASCIIArray(data, 4 * sizeof(uint8_t), ascii); // ascii becomes "12345678"

const char8_t ascii2[] = "12345678";
uint8_t data2[4];
ConvertASCIIArrayToBinaryData(ascii2, 8, data2); // data2 becomes { 0x12, 0x34, 0x56, 0x78 }
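For readers who want to see the encoding itself, the following is a small stand-alone sketch of the round trip. It is not the EATextUtil source: the names BinaryToHexSketch and HexToBinarySketch are hypothetical, and the use of uppercase digits and a terminating 0 on output are assumptions of this sketch.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: writes two hex digits per input byte, then a terminating 0.
// pOut must have room for (nLength * 2) + 1 chars.
void BinaryToHexSketch(const void* pBinaryData, size_t nLength, char* pOut)
{
    assert(pBinaryData && pOut);
    static const char kDigits[] = "0123456789ABCDEF";
    const uint8_t* p = static_cast<const uint8_t*>(pBinaryData);

    for (size_t i = 0; i < nLength; ++i)
    {
        *pOut++ = kDigits[p[i] >> 4];    // High nibble first.
        *pOut++ = kDigits[p[i] & 0x0f];  // Then the low nibble.
    }
    *pOut = '\0';
}

// Hypothetical helper: decodes pairs of hex digits into bytes.
// Invalid digits are treated as '0', and the function then returns false.
bool HexToBinarySketch(const char* pHex, size_t nHexLength, void* pBinaryData)
{
    assert(pHex && pBinaryData && (nHexLength % 2) == 0);
    uint8_t* pOut = static_cast<uint8_t*>(pBinaryData);
    bool bValid = true;

    for (size_t i = 0; i < nHexLength; i += 2)
    {
        uint8_t byte = 0;
        for (size_t j = 0; j < 2; ++j)
        {
            const char c = pHex[i + j];
            uint8_t nibble = 0;
            if (c >= '0' && c <= '9')      nibble = uint8_t(c - '0');
            else if (c >= 'a' && c <= 'f') nibble = uint8_t(c - 'a' + 10);
            else if (c >= 'A' && c <= 'F') nibble = uint8_t(c - 'A' + 10);
            else                           bValid = false; // Keep nibble as 0.
            byte = uint8_t((byte << 4) | nibble);
        }
        *pOut++ = byte;
    }
    return bValid;
}
```

Each byte becomes exactly two characters, so the text form is always twice the size of the binary data.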
const char16_t* GetTextLine(const char16_t* pText, const char16_t* pTextEnd, const char16_t** ppNewText);
const char8_t* GetTextLine(const char8_t* pText, const char8_t* pTextEnd, const char8_t** ppNewText);
Given a block of text, this function reads a line of text and moves to the beginning of the next line. The return value is the end of the current line, just before any newline characters. If ppNewText is supplied, it receives the start of the next line, which will often differ from the return value because the next line starts after the newline characters. The length of the current line is the return value minus pText.
These functions are useful for reading lines of text from a text file via an iterative method, which is perhaps the most flexible way of doing this.
Example usage:
char buffer[256]; // Assumed to have been filled with text elsewhere.
const char* pLineNext = buffer;
const char* pLine;
do{
    pLine = pLineNext;
    const char* pLineEnd = GetTextLine(pLine, buffer + 256, &pLineNext);
    // Use the line [pLine, pLineEnd) here
}while(pLineNext != (buffer + 256));
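The line-scanning logic described above can be sketched as follows. This is a stand-alone illustration rather than the library code: GetTextLineSketch and SplitLinesSketch are hypothetical names, and treating "\r\n" as a single line break is an assumption of this sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: returns the end of the line that starts at pText and, if
// ppNewText is supplied, stores the start of the following line there.
const char* GetTextLineSketch(const char* pText, const char* pTextEnd, const char** ppNewText)
{
    assert(pText && pTextEnd && pText <= pTextEnd);

    const char* pLineEnd = pText;
    while (pLineEnd < pTextEnd && *pLineEnd != '\r' && *pLineEnd != '\n')
        ++pLineEnd;

    if (ppNewText)
    {
        const char* pNext = pLineEnd;
        if (pNext < pTextEnd)
        {
            const char cFirst = *pNext++;          // Step over the first newline char.
            if (cFirst == '\r' && pNext < pTextEnd && *pNext == '\n')
                ++pNext;                           // Treat "\r\n" as one line break.
        }
        *ppNewText = pNext;
    }
    return pLineEnd;
}

// Iterates over a block of text, copying each line into a vector.
std::vector<std::string> SplitLinesSketch(const std::string& text)
{
    std::vector<std::string> lines;
    const char* pLine = text.data();
    const char* pEnd  = pLine + text.size();

    while (pLine < pEnd)
    {
        const char* pNext    = nullptr;
        const char* pLineEnd = GetTextLineSketch(pLine, pEnd, &pNext);
        lines.emplace_back(pLine, pLineEnd);       // Line length is pLineEnd - pLine.
        pLine = pNext;
    }
    return lines;
}
```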
bool SplitTokenDelimited(const char16_t* pSource, size_t nSourceLength, char16_t cDelimiter,
char16_t* pToken, size_t nTokenLength, const char16_t** ppNewSource = NULL);
bool SplitTokenDelimited(const char8_t* pSource, size_t nSourceLength, char8_t cDelimiter,
char8_t* pToken, size_t nTokenLength, const char8_t** ppNewSource = NULL);
SplitTokenDelimited returns tokens that are delimited by a single character; repetitions of that character result in empty tokens being returned. This is most commonly useful when you want to parse a string of text delimited by commas or spaces. It returns true whenever it extracts a token, which is the case whenever the source is non-empty, so the extracted token may itself be empty. It returns false if the source is empty. If the input pToken is non-null, the text before the delimiter is copied to it.
Example behaviour (delimiter is a comma):
| Source | Token | New source | Return value |
| "a,b" | "a" | "b" | true |
| " a , b " | " a " | " b " | true |
| "ab,b" | "ab" | "b" | true |
| ",a,b" | "" | "a,b" | true |
| ",b" | "" | "b" | true |
| ",,b" | "" | ",b" | true |
| ",a," | "" | "a," | true |
| "a," | "a" | "" | true |
| "," | "" | "" | true |
| ", " | "" | " " | true |
| "a" | "a" | "" | true |
| " " | " " | "" | true |
| "" | "" | "" | false |
| NULL | "" | NULL | false |
Example usage:
const char8_t* pString = "a, b, c, d";
char8_t pToken[16];
while(SplitTokenDelimited(pString, kLengthNull, ',', pToken, 16, &pString))
    printf("%s\n", pToken);
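The behaviour in the table above can be reproduced with a short stand-alone sketch. It is not the EATextUtil implementation: SplitTokenDelimitedSketch is a hypothetical name, and for brevity the sketch assumes a null-terminated 8-bit source and returns the token in a std::string rather than a caller-supplied buffer.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: extracts the text before the first delimiter.
// Returns false only when the source is NULL or empty.
bool SplitTokenDelimitedSketch(const char* pSource, char cDelimiter,
                               std::string& token, const char** ppNewSource)
{
    assert(cDelimiter != '\0');
    token.clear();

    if (!pSource || !*pSource)
    {
        if (ppNewSource)
            *ppNewSource = pSource;          // Nothing consumed.
        return false;
    }

    const char* p = pSource;
    while (*p && *p != cDelimiter)
        ++p;
    token.assign(pSource, p);                // May be empty, e.g. for ",b".

    if (ppNewSource)
        *ppNewSource = *p ? p + 1 : p;       // Step over exactly one delimiter.
    return true;
}
```

Because exactly one delimiter is consumed per call, a source such as ",,b" yields two empty tokens before "b".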
bool SplitTokenSeparated(const char16_t* pSource, size_t nSourceLength, char16_t cDelimiter,
char16_t* pToken, size_t nTokenLength, const char16_t** ppNewSource = NULL);
bool SplitTokenSeparated(const char8_t* pSource, size_t nSourceLength, char8_t cDelimiter,
char8_t* pToken, size_t nTokenLength, const char8_t** ppNewSource = NULL);
SplitTokenSeparated returns tokens that are separated by one or more instances of a character. Returns true whenever it extracts a token.
Example behaviour (delimiter is a space char):
| Source | Token | New source | Return value |
| "a" | "a" | "" | true |
| "a b" | "a" | "b" | true |
| "a b" | "a" | "b" | true |
| " a b" | "a" | "b" | true |
| " a b " | "a" | "b " | true |
| " a " | "a" | "" | true |
| " a " | "a" | "" | true |
| "" | "" | "" | false |
| " " | "" | "" | false |
| NULL | "" | NULL | false |
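The difference from SplitTokenDelimited is that runs of the separator collapse and never produce empty tokens. A minimal stand-alone sketch of that behaviour follows; SplitTokenSeparatedSketch is a hypothetical name and not the library function, and it assumes a null-terminated 8-bit source with the token returned in a std::string.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: skips leading separators, extracts one token, then skips
// the separators that follow it. Returns false if no token was found.
bool SplitTokenSeparatedSketch(const char* pSource, char cDelimiter,
                               std::string& token, const char** ppNewSource)
{
    assert(cDelimiter != '\0');
    token.clear();

    if (!pSource)
    {
        if (ppNewSource)
            *ppNewSource = nullptr;
        return false;
    }

    const char* p = pSource;
    while (*p == cDelimiter)                 // Skip leading separators.
        ++p;

    const char* pTokenBegin = p;
    while (*p && *p != cDelimiter)           // Scan the token itself.
        ++p;
    token.assign(pTokenBegin, p);

    while (*p == cDelimiter)                 // Skip the separators after the token.
        ++p;

    if (ppNewSource)
        *ppNewSource = p;
    return !token.empty();
}
```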
/// BoyerMooreSearch
///
/// patternBuffer1 is a user-supplied buffer and must be at least as long as the search pattern.
/// patternBuffer2 is a user-supplied buffer and must be at least as long as the search pattern.
/// alphabetBuffer is a user-supplied buffer and must be at least as long as the highest character value
/// used in the searched string and search pattern.
///
int BoyerMooreSearch(const char* pPattern, int nPatternLength, const char* pSearchString, int nSearchStringLength,
int* pPatternBuffer1, int* pPatternBuffer2, int* pAlphabetBuffer, int nAlphabetBufferSize)
Boyer-Moore is a very fast string search compared to most others, including those in the STL. However, you need to be searching a string of at least 100 chars and have a search pattern of at least 3 characters for the speed to show, as Boyer-Moore has a startup precalculation that costs some cycles. This startup precalculation is proportional to the size of your search pattern and the size of the alphabet in use. Thus, doing Boyer-Moore searches on the entire Unicode alphabet is going to incur a fairly expensive precalculation cost.
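To show where the precalculation cost comes from, here is a sketch of the simpler Boyer-Moore-Horspool variant, which uses only a per-character shift table. It is not the EATextUtil BoyerMooreSearch implementation (which also takes two pattern buffers); the name HorspoolSearchSketch, the 256-entry alphabet assumption, and the -1 "not found" return value are all choices made for this sketch.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: Boyer-Moore-Horspool search over 8-bit text.
// Returns the index of the first match, or -1 if there is none.
int HorspoolSearchSketch(const char* pPattern, int nPatternLength,
                         const char* pSearchString, int nSearchStringLength,
                         int* pAlphabetBuffer, int nAlphabetBufferSize)
{
    assert(pPattern && pSearchString && pAlphabetBuffer);
    assert(nPatternLength > 0 && nAlphabetBufferSize >= 256);

    // Precalculation: cost is proportional to alphabet size plus pattern length.
    for (int i = 0; i < nAlphabetBufferSize; ++i)
        pAlphabetBuffer[i] = nPatternLength;               // Default: skip the whole pattern.
    for (int i = 0; i < nPatternLength - 1; ++i)
        pAlphabetBuffer[(unsigned char)pPattern[i]] = nPatternLength - 1 - i;

    int pos = 0;
    while (pos + nPatternLength <= nSearchStringLength)
    {
        if (std::memcmp(pSearchString + pos, pPattern, (size_t)nPatternLength) == 0)
            return pos;

        // Shift by the table entry for the text char aligned with the pattern's last char.
        pos += pAlphabetBuffer[(unsigned char)pSearchString[pos + nPatternLength - 1]];
    }
    return -1;
}
```

The first loop touches every alphabet entry, which is why a table covering the full Unicode range would make the startup cost much higher than for 8-bit text.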