EATextUtil

Introduction

EATextUtil is a collection of string utilities that are of a higher-level nature than those found in the C runtime library or our EAString module. EASTL/string.h and EASTL/algorithm.h contain C++/STL variations of functions that are similar to the C runtime library functions, but which are generally more powerful and flexible than the C functions while usually being more efficient.

All functions in EATextUtil are present in char8_t versions and char16_t versions, and assume UTF8 and UCS2 Unicode encoding respectively. Recall that these encodings are backward-compatible with ASCII and so will work for most or all of the text that you give them.

Each of the functions comes in a version that doesn't allocate any memory but instead uses user-supplied memory in-place. In a few cases, there are String versions that will allocate memory if they need to increase the size of the user-supplied string.

Here's a brief summary of the functions currently found in EATextUtil. We use char_t in the declarations below to refer to both char8_t and char16_t; there are thus two versions of each function.

bool WildcardMatch(const char_t* pString, const char_t* pPattern, bool bCaseSensitive);
 
bool ParseDelimitedText(const char_t*  pText,  const char_t*  pTextEnd,  char_t cDelimiter, 
                        const char_t*& pToken, const char_t*& pTokenEnd, const char_t** ppNewText);
  
const char_t* GetTextLine(const char_t* pText, const char_t* pTextEnd, const char_t** ppNewText);
 
bool SplitTokenDelimited(const char_t* pSource, size_t nSourceLength, char_t cDelimiter, 
                         char_t* pToken, size_t nTokenLength, const char_t** ppNewSource = NULL);
bool SplitTokenSeparated(const char_t* pSource, size_t nSourceLength, char_t cDelimiter, 
                         char_t* pToken, size_t nTokenLength, const char_t** ppNewSource = NULL);

void ConvertBinaryDataToASCIIArray(const void* pBinaryData, size_t nBinaryDataLength, char_t* pASCIIArray);
bool ConvertASCIIArrayToBinaryData(const char_t* pASCIIArray, size_t nASCIIArrayLength, void* pBinaryData);
 
int BoyerMooreSearch(const char8_t* pPattern, int nPatternLength, const char8_t* pSearchString, int nSearchStringLength, 
                     int* pPatternBuffer1, int* pPatternBuffer2, int* pAlphabetBuffer, int nAlphabetBufferSize);

We will proceed to address each of the above functions.

WildcardMatch

bool WildcardMatch(const char16_t* pString, const char16_t* pPattern, bool bCaseSensitive);
bool WildcardMatch(const char8_t*  pString, const char8_t*  pPattern, bool bCaseSensitive);

These functions match source strings to wildcard patterns like those used in file specifications. '*' in the pattern means match zero or more consecutive source characters. '?' in the pattern means match exactly one source character. Multiple * and ? characters may be used. Two consecutive * characters are treated as if they were one. Here are some examples:

Source  Pattern  Result
abcde   *e       true
abcde   *f       false
abcde   ???de    true
abcde   ????g    false
abcde   *c??     true
abcde   *e??     false
abcde   *????    true
abcde   bcdef    false
abcde   *?????   true

Example usage:

bool result;
result = WildcardMatch("Hello world", "hello*", false);        // result becomes true
result = WildcardMatch("Hello world", "hello?", false);        // result becomes false
result = WildcardMatch("Hello world", "Hello??orld", true);    // result becomes true
result = WildcardMatch("Hello world", "*Hello*world*", true);  // result becomes true
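
The matching rules above can be sketched as a small recursive function. This is an illustrative stand-in written for this document, not the EATextUtil implementation; the name WildcardMatchSketch is hypothetical, and it works on plain char strings only.

```cpp
#include <cassert>
#include <cctype>

// Illustrative stand-in, not the EATextUtil implementation.
// '*' matches zero or more characters; '?' matches exactly one character.
bool WildcardMatchSketch(const char* pString, const char* pPattern, bool bCaseSensitive)
{
    assert(pString && pPattern);

    if(*pPattern == '*')
    {
        while(*pPattern == '*')      // Consecutive '*' characters act as one.
            ++pPattern;

        if(*pPattern == '\0')
            return true;             // A trailing '*' matches the rest of the string.

        // Try to match the rest of the pattern at every remaining position.
        for(; *pString; ++pString)
        {
            if(WildcardMatchSketch(pString, pPattern, bCaseSensitive))
                return true;
        }
        return false;
    }

    if(*pString == '\0')
        return (*pPattern == '\0');  // Both exhausted together means a match.

    if(*pPattern == '\0')
        return false;                // Pattern ended but source text remains.

    if(*pPattern != '?')
    {
        const int s = static_cast<unsigned char>(*pString);
        const int p = static_cast<unsigned char>(*pPattern);
        const bool bEqual = bCaseSensitive ? (s == p) : (std::tolower(s) == std::tolower(p));

        if(!bEqual)
            return false;
    }

    return WildcardMatchSketch(pString + 1, pPattern + 1, bCaseSensitive);
}
```

The recursion is the simplest way to show the semantics; a production matcher would normally avoid the repeated rescans that a pattern with many '*' characters can cause here.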

ParseDelimitedText (iterative version)

bool ParseDelimitedText(const char16_t* pText, const char16_t* pTextEnd, char16_t cDelimiter, 
                        const char16_t*& pToken, const char16_t*& pTokenEnd, const char16_t** ppNewText);

bool ParseDelimitedText(const char8_t* pText, const char8_t* pTextEnd, char8_t cDelimiter, 
                        const char8_t*& pToken, const char8_t*& pTokenEnd, const char8_t** ppNewText);

Given a line of text such as this:

342.5, "This is a string", test, "This is a string, with a comma"

ParseDelimitedText parses it into separate fields:

342.5
This is a string
test
This is a string, with a comma 

ParseDelimitedText lets you specify the delimiter at run time. The delimiter can be any char (e.g. space, tab, semicolon) except the quote char itself, which is reserved for grouping. For UTF8-encoded text, make sure the delimiter is an ASCII char (a value less than 128) so that it cannot collide with a byte of a multi-byte UTF8 sequence. See the source code comments for more details.

The iterative versions declared above take the text as a pointer range (pText, pTextEnd). Where a version instead takes a pointer to text and a text length (nTextLength), the length is counted in chars. For ASCII, MBCS, and UTF8, this is the number of bytes. For UTF16 (Unicode) it is the number of 16-bit characters, each of which occupies two bytes. The input nTextLength can be -1 (kLengthNull) to indicate that the string is null-terminated.
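
To make the iterative calling pattern concrete, here is a simplified sketch with the same parameter shape as ParseDelimitedText. It is a stand-in written for this document, not the library code: the names NextFieldSketch and SplitFieldsSketch are hypothetical, it assumes a non-space delimiter such as a comma, and the real function's handling of whitespace and quotes may differ.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for illustration; not the EATextUtil implementation.
// Finds the next field in [pText, pTextEnd) and reports where to resume.
bool NextFieldSketch(const char* pText, const char* pTextEnd, char cDelimiter,
                     const char*& pToken, const char*& pTokenEnd, const char** ppNewText)
{
    assert(pText && pTextEnd && (cDelimiter != '"'));

    while((pText < pTextEnd) && (*pText == ' '))    // Skip leading spaces.
        ++pText;

    if(pText >= pTextEnd)
        return false;                               // No more fields.

    const char* p;

    if(*pText == '"')                               // Quoted field: quotes group the text.
    {
        pToken = ++pText;
        for(p = pText; (p < pTextEnd) && (*p != '"'); ++p) {}
        pTokenEnd = p;

        while((p < pTextEnd) && (*p != cDelimiter)) // Move on to the next delimiter.
            ++p;
    }
    else                                            // Plain field: runs to the delimiter.
    {
        pToken = pText;
        for(p = pText; (p < pTextEnd) && (*p != cDelimiter); ++p) {}
        pTokenEnd = p;

        while((pTokenEnd > pToken) && (pTokenEnd[-1] == ' ')) // Trim trailing spaces.
            --pTokenEnd;
    }

    if(p < pTextEnd)
        ++p;                                        // Step past the delimiter.

    if(ppNewText)
        *ppNewText = p;

    return true;
}

// Calls NextFieldSketch repeatedly and collects the fields as strings.
std::vector<std::string> SplitFieldsSketch(const std::string& line, char cDelimiter)
{
    std::vector<std::string> fields;
    const char* pText     = line.data();
    const char* pTextEnd  = pText + line.size();
    const char* pToken    = nullptr;
    const char* pTokenEnd = nullptr;

    while(NextFieldSketch(pText, pTextEnd, cDelimiter, pToken, pTokenEnd, &pText))
        fields.push_back(std::string(pToken, pTokenEnd));

    return fields;
}
```

Applied to the example line above with ',' as the delimiter, SplitFieldsSketch yields the four fields listed. Note that the token is returned as a pointer range into the original text, so nothing is copied until the caller chooses to do so.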

Example behaviour for string array version (which you can extrapolate to the iterative version):

Input string                                       MaxResults  Delimiter  Return value  Output array size  Output array value
""                                                 -1          ' '        0             0                  ""
"000 111"                                          -1          ' '        2             2                  "000"  "111"
"000 111   222   333 444 \"555 555\" 666"          -1          ' '        7             7                  "000"  "111"  "222"  "333"  "444"  "555 555"  "666"
"     000 111 222         333                "     -1          ' '        4             4                  "000"  "111"  "222"  "333"
"     000 111 222         333                "     -1          ' '        2             2                  "000"  "111"
""                                                 -1          ','        0             0                  ""
"000,111"                                          -1          ','        2             2                  "000"  "111"
"000,  111 , 222   333 ,444 \"555,  555  \" 666"   -1          ','        4             4                  "000"  "111"  "222   333"  "444 \"555,  555  \" 666"
"  ,, 000 ,111, 222,         333          ,     "  -1          ','        6             6                  ""   ""   "000"   "111"   "222"   "333"
"  ,, 000 ,111, 222,         333          ,     "  2           ','        0             0                  ""   ""

Convert binary ASCII

void ConvertBinaryDataToASCIIArray(const void* pBinaryData, size_t nBinaryDataLength, char16_t* pASCIIArray);
void ConvertBinaryDataToASCIIArray(const void* pBinaryData, size_t nBinaryDataLength, char8_t*  pASCIIArray);
 
bool ConvertASCIIArrayToBinaryData(const char8_t*  pASCIIArray, size_t nASCIIArrayLength, void* pBinaryData);
bool ConvertASCIIArrayToBinaryData(const char16_t* pASCIIArray, size_t nASCIIArrayLength, void* pBinaryData);

The first two functions convert binary data into an ASCII hexadecimal encoding that can later be converted back to binary. You might want to do this if you need to embed binary data in a text file (e.g. an .ini file) and need a way to encode the binary data as text.

The second two functions take an ASCII string of hexadecimal text and convert it to binary data. This is the reverse of the ConvertBinaryDataToASCIIArray functions. If an invalid hexadecimal character is encountered, it is replaced with a '0' character. These functions return true if the input was entirely valid hexadecimal data.

Example usage:

// Binary to ASCII.
const uint8_t data[4] = { 0x12, 0x34, 0x56, 0x78 };
char8_t ascii[9];    // Two chars per byte, plus room for a terminating null in case one is written.
 
ConvertBinaryDataToASCIIArray(data, sizeof(data), ascii);    // ascii becomes "12345678"

// ASCII to binary.
const char16_t ascii16[] = u"12345678";
uint8_t data2[4];
 
ConvertASCIIArrayToBinaryData(ascii16, 8, data2);    // data2 becomes { 0x12, 0x34, 0x56, 0x78 }
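
The round trip described above can be sketched as follows. This is an illustrative hex codec, not the EATextUtil source: the names BinaryToHexSketch and HexToBinarySketch are hypothetical, and the lower-case letter digits, the terminating null written by the encoder, and the even-length requirement on the decoder are assumptions of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative hex encoder; not the EATextUtil implementation.
// Writes two chars per byte plus a terminating null (an assumption of this sketch),
// so pASCIIArray needs room for (nBinaryDataLength * 2) + 1 chars.
void BinaryToHexSketch(const void* pBinaryData, std::size_t nBinaryDataLength, char* pASCIIArray)
{
    assert(pBinaryData && pASCIIArray);

    static const char kDigits[] = "0123456789abcdef"; // Letter case is an assumption.
    const std::uint8_t* pIn = static_cast<const std::uint8_t*>(pBinaryData);

    for(std::size_t i = 0; i < nBinaryDataLength; ++i)
    {
        *pASCIIArray++ = kDigits[pIn[i] >> 4];    // High nibble first.
        *pASCIIArray++ = kDigits[pIn[i] & 0x0f];  // Then low nibble.
    }
    *pASCIIArray = '\0';
}

// Illustrative hex decoder. An invalid digit is treated as '0' and makes the
// function return false, matching the behaviour described in the text above.
bool HexToBinarySketch(const char* pASCIIArray, std::size_t nASCIIArrayLength, void* pBinaryData)
{
    assert(pASCIIArray && pBinaryData);
    assert((nASCIIArrayLength % 2) == 0);         // Sketch assumption: whole bytes only.

    std::uint8_t* pOut = static_cast<std::uint8_t*>(pBinaryData);
    bool bValid = true;

    for(std::size_t i = 0; i < nASCIIArrayLength; i += 2)
    {
        unsigned value = 0;

        for(std::size_t j = 0; j < 2; ++j)
        {
            const char c = pASCIIArray[i + j];
            unsigned nibble = 0;

            if((c >= '0') && (c <= '9'))
                nibble = static_cast<unsigned>(c - '0');
            else if((c >= 'a') && (c <= 'f'))
                nibble = 10u + static_cast<unsigned>(c - 'a');
            else if((c >= 'A') && (c <= 'F'))
                nibble = 10u + static_cast<unsigned>(c - 'A');
            else
                bValid = false;                   // Invalid digit: counts as '0'.

            value = (value << 4) | nibble;
        }

        *pOut++ = static_cast<std::uint8_t>(value);
    }

    return bValid;
}
```

Each byte becomes exactly two text chars, so the encoded form is twice the size of the binary data; that fixed ratio is what makes the in-place, caller-supplied buffer approach practical.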

GetTextLine

const char16_t* GetTextLine(const char16_t* pText, const char16_t* pTextEnd, const char16_t** ppNewText);
const char8_t*  GetTextLine(const char8_t*  pText, const char8_t*  pTextEnd, const char8_t**  ppNewText);

Given a block of text, this function reads a line of text and moves to the beginning of the next line. The return value is the end of the current line, before any newline characters. If ppNewText is supplied, it receives the start of the next line, which will often differ from the return value because the next line starts after the newline characters. The length of the current line is the return value minus pText.

These functions are useful for reading lines of text from a text file via an iterative method, which is perhaps the most flexible way of doing this.

Example usage:

char        buffer[256];           // Assume this has been filled with text.
const char* pLineNext = buffer;
const char* pLine;

do {
    pLine = pLineNext;
    const char* pLineEnd = GetTextLine(pLine, buffer + 256, &pLineNext);
    // Use the line [pLine, pLineEnd) here.
} while(pLineNext != (buffer + 256));
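
For readers who want to see the pointer arithmetic in isolation, here is a sketch of a function with the same shape as GetTextLine, plus a loop that collects the lines. It is a stand-in written for this document, not the library code: the names GetTextLineSketch and CollectLinesSketch are hypothetical, and the set of line endings it recognizes ("\n", "\r", and "\r\n") is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in with the same shape as GetTextLine; an illustrative assumption,
// not the EATextUtil implementation. Returns the end of the current line.
const char* GetTextLineSketch(const char* pText, const char* pTextEnd, const char** ppNewText)
{
    assert(pText && pTextEnd && (pText <= pTextEnd));

    const char* pLineEnd = pText;

    while((pLineEnd < pTextEnd) && (*pLineEnd != '\r') && (*pLineEnd != '\n'))
        ++pLineEnd;                          // Stop before any newline characters.

    if(ppNewText)
    {
        const char* pNext = pLineEnd;

        if(pNext < pTextEnd)
        {
            const char c = *pNext++;         // Consume the first newline character.

            if((c == '\r') && (pNext < pTextEnd) && (*pNext == '\n'))
                ++pNext;                     // Treat "\r\n" as a single line ending.
        }

        *ppNewText = pNext;                  // Start of the next line.
    }

    return pLineEnd;
}

// Iterates over a block of text and copies each line into a vector.
std::vector<std::string> CollectLinesSketch(const std::string& text)
{
    std::vector<std::string> lines;
    const char*       pLine = text.data();
    const char* const pEnd  = pLine + text.size();

    while(pLine < pEnd)
    {
        const char* pNext    = nullptr;
        const char* pLineEnd = GetTextLineSketch(pLine, pEnd, &pNext);

        lines.push_back(std::string(pLine, pLineEnd)); // Length is pLineEnd - pLine.
        pLine = pNext;
    }

    return lines;
}
```

The loop never modifies the text, so it can run directly over a read-only file mapping; the copy into std::string is only there to make the result easy to inspect.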

SplitTokenDelimited

bool SplitTokenDelimited(const char16_t* pSource, size_t nSourceLength, char16_t cDelimiter, 
                         char16_t* pToken, size_t nTokenLength, const char16_t** ppNewSource = NULL);


bool SplitTokenDelimited(const char8_t* pSource, size_t nSourceLength, char8_t cDelimiter, 
                         char8_t* pToken, size_t nTokenLength, const char8_t** ppNewSource = NULL);

SplitTokenDelimited returns tokens that are delimited by a single character; repetitions of that character result in empty tokens being returned. This is most commonly useful when you want to parse a string of text delimited by commas or spaces. It returns true whenever it extracts a token, which is whenever the source is non-empty, and false when the source is empty. Note that the extracted token itself may be empty. If the input pToken is non-null, the text before the delimiter is copied to it.

Example behaviour (delimiter is a comma):

Source     Token   New source  Return value
"a,b"      "a"     "b"         true
" a , b "  " a "   " b "       true
"ab,b"     "ab"    "b"         true
",a,b"     ""      "a,b"       true
",b"       ""      "b"         true
",,b"      ""      ",b"        true
",a,"      ""      "a,"        true
"a,"       "a"     ""          true
","        ""      ""          true
", "       ""      " "         true
"a"        "a"     ""          true
" "        " "     ""          true
""         ""      ""          false
NULL       ""      NULL        false

Example usage:

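const char8_t* pString = "a, b, c, d";
char8_t pToken[16];

// printf's %s expects 8-bit text, so this example uses the char8_t version.
while(SplitTokenDelimited(pString, kLengthNull, ',', pToken, 16, &pString))
    printf("%s\n", pToken);    // Spaces are not trimmed; see " a , b " in the table above.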
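
The behaviour in the table can be reproduced with a short stand-in, shown here as a sketch rather than as the library code. The names SplitTokenDelimitedSketch, SplitAllDelimitedSketch, and kLengthNullSketch are hypothetical, and the sketch works on plain char strings only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for kLengthNull: the source is null-terminated.
const std::size_t kLengthNullSketch = static_cast<std::size_t>(-1);

// Illustrative stand-in that follows the table above; not the EATextUtil source.
bool SplitTokenDelimitedSketch(const char* pSource, std::size_t nSourceLength, char cDelimiter,
                               char* pToken, std::size_t nTokenLength,
                               const char** ppNewSource = nullptr)
{
    assert(cDelimiter != '\0');

    if(pToken && nTokenLength)
        *pToken = '\0';
    if(ppNewSource)
        *ppNewSource = pSource;

    if(!pSource || !nSourceLength || !*pSource)
        return false;                              // NULL or empty source: no token.

    std::size_t i = 0;

    for(; (i < nSourceLength) && pSource[i] && (pSource[i] != cDelimiter); ++i)
    {
        if(pToken && ((i + 1) < nTokenLength))     // Copy, leaving room for the null.
        {
            pToken[i]     = pSource[i];
            pToken[i + 1] = '\0';
        }
    }

    if((i < nSourceLength) && (pSource[i] == cDelimiter))
        ++i;                                       // Step past exactly one delimiter.

    if(ppNewSource)
        *ppNewSource = pSource + i;

    return true;
}

// Splits a whole null-terminated string and collects the tokens.
std::vector<std::string> SplitAllDelimitedSketch(const char* pSource, char cDelimiter)
{
    std::vector<std::string> tokens;
    char token[64];

    while(SplitTokenDelimitedSketch(pSource, kLengthNullSketch, cDelimiter,
                                    token, sizeof(token), &pSource))
        tokens.push_back(token);

    return tokens;
}
```

Because only one delimiter is consumed per call, a source such as "a,,b" produces three tokens, the middle one empty. A trailing delimiter does not produce a final empty token, since the remaining source is then empty and the function returns false.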

SplitTokenSeparated

bool SplitTokenSeparated(const char16_t* pSource, size_t nSourceLength, char16_t cDelimiter, 
                         char16_t* pToken, size_t nTokenLength, const char16_t** ppNewSource = NULL);


bool SplitTokenSeparated(const char8_t* pSource, size_t nSourceLength, char8_t cDelimiter, 
                         char8_t* pToken, size_t nTokenLength, const char8_t** ppNewSource = NULL);

SplitTokenSeparated returns tokens that are separated by one or more instances of a character. Returns true whenever it extracts a token.

Example behaviour (delimiter is a space char):

Source   Token  New source  Return value
"a"      "a"    ""          true
"a b"    "a"    "b"         true
"a  b"   "a"    "b"         true
" a b"   "a"    "b"         true
" a b "  "a"    "b "        true
" a "    "a"    ""          true
" a  "   "a"    ""          true
""       ""     ""          false
" "      ""     ""          false
NULL     ""     NULL        false
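
The difference from SplitTokenDelimited is that runs of the separator are skipped on both sides of the token. The following sketch reproduces the table above under that reading; it is a stand-in written for this document, not the library code, and the names SplitTokenSeparatedSketch and SplitAllSeparatedSketch are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Illustrative stand-in that follows the table above; not the EATextUtil source.
// Pass static_cast<std::size_t>(-1) as nSourceLength for a null-terminated source.
bool SplitTokenSeparatedSketch(const char* pSource, std::size_t nSourceLength, char cDelimiter,
                               char* pToken, std::size_t nTokenLength,
                               const char** ppNewSource = nullptr)
{
    assert(cDelimiter != '\0');

    if(pToken && nTokenLength)
        *pToken = '\0';
    if(ppNewSource)
        *ppNewSource = pSource;
    if(!pSource)
        return false;

    // True while index n is inside the source and not at a terminating null.
    auto inRange = [&](std::size_t n) { return (n < nSourceLength) && (pSource[n] != '\0'); };

    std::size_t i = 0;
    while(inRange(i) && (pSource[i] == cDelimiter)) ++i;   // Skip leading separators.
    const std::size_t nBegin = i;
    while(inRange(i) && (pSource[i] != cDelimiter)) ++i;   // Scan the token.
    const std::size_t nEnd = i;
    while(inRange(i) && (pSource[i] == cDelimiter)) ++i;   // Skip trailing separators.

    if(ppNewSource)
        *ppNewSource = pSource + i;

    if(pToken && nTokenLength)
    {
        std::size_t n = nEnd - nBegin;
        if(n > (nTokenLength - 1))
            n = nTokenLength - 1;                          // Truncate to fit the buffer.
        std::memcpy(pToken, pSource + nBegin, n);
        pToken[n] = '\0';
    }

    return nEnd > nBegin;                                  // False if no token was found.
}

// Splits a whole null-terminated string and collects the tokens.
std::vector<std::string> SplitAllSeparatedSketch(const char* pSource, char cDelimiter)
{
    std::vector<std::string> tokens;
    char token[64];

    while(SplitTokenSeparatedSketch(pSource, static_cast<std::size_t>(-1), cDelimiter,
                                    token, sizeof(token), &pSource))
        tokens.push_back(token);

    return tokens;
}
```

With this behaviour, a string such as "  one two   three " splits into exactly three tokens and never yields an empty one, which is usually what you want for whitespace-separated words.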

BoyerMooreSearch

/// BoyerMooreSearch
///
/// patternBuffer1 is a user-supplied buffer and must be at least as long as the search pattern.
/// patternBuffer2 is a user-supplied buffer and must be at least as long as the search pattern.
/// alphabetBuffer is a user-supplied buffer and must be at least as long as the highest character value 
/// used in the searched string and search pattern.
///
int BoyerMooreSearch(const char8_t* pPattern, int nPatternLength, const char8_t* pSearchString, int nSearchStringLength, 
                     int* pPatternBuffer1, int* pPatternBuffer2, int* pAlphabetBuffer, int nAlphabetBufferSize);

Boyer-Moore is a very fast string search compared to most others, including those in the STL. However, the search string needs to be at least 100 chars long and the search pattern at least 3 chars long for the speed advantage to show, because Boyer-Moore has a startup precalculation that costs some cycles. This precalculation is proportional to the size of your search pattern and the size of the alphabet in use. Thus, doing Boyer-Moore searches over the entire Unicode alphabet incurs a fairly expensive precalculation cost.
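
To show where the precalculation cost comes from, here is a sketch of Boyer-Moore-Horspool, a simplified relative of Boyer-Moore that uses only a per-character shift table. It is not the EATextUtil implementation: the name HorspoolSearchSketch is hypothetical, it needs only the alphabet buffer rather than the two pattern buffers, and returning -1 when nothing is found is a convention of this sketch.

```cpp
#include <cassert>

// Boyer-Moore-Horspool: uses only a per-character shift table.
// Illustrative; not the EATextUtil implementation.
// Returns the index of the first match, or -1 if there is none (sketch convention).
int HorspoolSearchSketch(const char* pPattern, int nPatternLength,
                         const char* pSearchString, int nSearchStringLength,
                         int* pAlphabetBuffer, int nAlphabetBufferSize)
{
    assert(pPattern && pSearchString && pAlphabetBuffer);
    assert(nPatternLength > 0);

    // Precalculation: one pass over the alphabet, one pass over the pattern.
    // This is the startup cost that grows with alphabet and pattern size.
    for(int i = 0; i < nAlphabetBufferSize; ++i)
        pAlphabetBuffer[i] = nPatternLength;        // Default shift: the whole pattern.

    for(int i = 0; i < nPatternLength - 1; ++i)
    {
        const int c = static_cast<unsigned char>(pPattern[i]);
        assert(c < nAlphabetBufferSize);
        pAlphabetBuffer[c] = nPatternLength - 1 - i; // Distance from the pattern's end.
    }

    // Search: compare right to left, then shift by the table entry for the
    // text character aligned with the last pattern position.
    int pos = 0;

    while(pos <= nSearchStringLength - nPatternLength)
    {
        int j = nPatternLength - 1;

        while((j >= 0) && (pSearchString[pos + j] == pPattern[j]))
            --j;

        if(j < 0)
            return pos;                              // Every character matched.

        const int last = static_cast<unsigned char>(pSearchString[pos + nPatternLength - 1]);
        assert(last < nAlphabetBufferSize);
        pos += pAlphabetBuffer[last];
    }

    return -1;
}
```

For 8-bit text a 256-entry alphabet buffer is enough, so the table setup is cheap. A table covering every 16-bit character value would need 65536 entries to be initialized before each new pattern, which is the cost the paragraph above warns about.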