Introduction

This article presents the tokenizing and splitting functionality of a simple C++ library called the String Toolkit. Tokenization, in the context of string processing, is the method by which a sequence of elements is broken up, or fragmented, into sub-sequences called tokens. The indices in the original sequence that determine such breaks are known as delimiters.

There are two types of delimiters: normal or thin delimiters, which are one element long, and thick delimiters, which are two or more elements long. Even though tokenization is primarily used in conjunction with strings, any sequence of types that can be iterated in a linear fashion can be tokenized; examples include a list of integers, a vector of person classes or a map of strings.
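As a point of reference, the core idea can be sketched with nothing but the STL. The helper below (split_thin, a name invented for this article) breaks a string on a single thin delimiter and deliberately keeps the empty tokens produced by consecutive delimiters:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split 'input' on a single one-character ("thin") delimiter, keeping the
// empty tokens that consecutive delimiters produce.
std::vector<std::string> split_thin(const std::string& input, char delimiter)
{
   std::vector<std::string> tokens;
   std::string::size_type start = 0;
   for (;;)
   {
      const std::string::size_type pos = input.find(delimiter, start);
      if (std::string::npos == pos)
      {
         tokens.push_back(input.substr(start));
         return tokens;
      }
      tokens.push_back(input.substr(start, pos - start));
      start = pos + 1;
   }
}
```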

Another Tokenizer?

To date there have been many attempts to define a "standard" tokenizer implementation in C++. Of them all, the most likely candidate is the implementation in the Boost library. Regardless, proposed implementations should to some extent consider one or more of the following areas: overall usage patterns, construction, generality (or is it genericity these days?) of design, and performance-efficiency.

1. Over-all usage patterns

This requirement is concerned with how easy it is to instantiate the tokenizer and integrate it into existing processing patterns, which more often than not requires integration with C++ STL algorithms and containers. A tokenizer by definition would be classed as a producer, so the question becomes: how easy is it for others to consume its output? Another issue is the consistency of the definition of a token when one has consecutive delimiters but is not compressing them - can or should there be such a thing as an empty token? What do preceding and trailing delimiters mean? And when should they be included as part of the token?

2. Constructions

In the context of string tokenizing, the majority of implementations return the token as a new instance of a string. This process requires a string to be created on the heap, populated by the sub-string in question from the original sequence, then returned back (some of this may be alleviated by Return Value Optimization, RVO). In the case of iterators this is essentially another copy to the caller. Furthermore, two kinds of input can make this situation worse: a large sequence made up of many very short tokens, or a large sequence made up of a number of very large tokens. The solution is not to return the token as a string but rather as a range, and to allow the caller to decide how they wish to inspect and further manipulate the token.

StrTk Token Iterators - Copyright Arash Partow

This minor change in interface design provides a great deal of flexibility and performance gain.
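To make the difference concrete, the sketch below (names invented for this article; not StrTk code) returns tokens as iterator ranges over the original buffer instead of freshly allocated strings, so the caller materializes a std::string only if and when one is needed:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string::const_iterator, std::string::const_iterator> range_t;

// Tokenize into iterator ranges over the original buffer - no per-token
// string allocations are performed.
std::vector<range_t> split_ranges(const std::string& input, char delimiter)
{
   std::vector<range_t> tokens;
   std::string::const_iterator token_begin = input.begin();
   for (std::string::const_iterator itr = input.begin(); itr != input.end(); ++itr)
   {
      if (*itr == delimiter)
      {
         tokens.push_back(range_t(token_begin, itr));
         token_begin = itr + 1;
      }
   }
   tokens.push_back(range_t(token_begin, input.end()));
   return tokens;
}
```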

3. Generality(Genericity) of design

Most tokenizer implementations concern themselves only with strings, which for the most part is OK, because most things that need tokenizing are strings. However, there will be times when one has a sequence of types other than strings that needs to be tokenized, hence a tokenizer should be designed in such a way as to enable such uses. Moreover, it becomes clear that the most basic of tokenizing algorithms are invariant to the type of the delimiter.

4. Performance and Efficiency

Applications that tokenize strings range from low-frequency inputs such as user input or parsing of simple configuration files, to more complex situations such as tokenizing HTML pages in a web crawler/indexer, to parsing large market-data streams in FIX format. Performance is generally important and can usually be helped along by good usage patterns that encourage reuse of instances, minimal preprocessing and user-supplied predicates for the nastier areas of the process. It should be noted that everything in the following article can be done purely using the STL - that said, C++'s ability to let one skin the proverbial cat in numerous ways gives rise to novel solutions that are for the most part of no practical use other than to showcase one's abilities in using the STL.

Getting started

The String Toolkit Library (StrTk) provides two common tokenization concepts, a split function and a token iterator. Both these concepts require the user to provide a delimiter predicate and an iterator range over which the process will be carried out.

The tokenization process can be further parametrized by switching between "compressed delimiters" and "no compressed delimiters" modes. In compressed mode consecutive delimiters are compressed down to one and treated as such. Below are two tables depicting the expected tokens from various types of input. The tables represent the no compressed and compressed tokenization processes respectively. The delimiter in this instance is the pipe symbol | and <> denotes an empty token.

No Compressed Delimiters

Input       Token List
a           a
a|b         a,b
a||b        a,<>,b
|a          <>,a
a|          a,<>
|a||b       <>,a,<>,b
||a||b||    <>,<>,a,<>,b,<>,<>
|           <>,<>
||          <>,<>,<>
|||         <>,<>,<>,<>

Compressed Delimiters

Input       Token List
a           a
a||b        a,b
|a||b       <>,a,b
||a||b||    <>,a,b,<>
|           <>,<>
||          <>,<>
|||         <>,<>

Delimiters

Two forms of delimiters are supported: the single delimiter predicate and the multiple delimiter predicate, otherwise known as SDP and MDP respectively. Essentially an SDP is one where only one type can break or fragment the sequence, whereas with an MDP there is more than one unique type that can break the sequence. It is possible to represent an SDP using an MDP; however, from a performance point of view having separate predicates is far more efficient. Additionally, for strings based on char or unsigned char (8-bit versions) there is an MDP that has a look-up complexity of O(1), making it greatly more efficient than the basic MDP.
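The look-up table idea behind the O(1) predicate can be sketched as follows (an illustrative stand-in, not StrTk's actual implementation): construction marks each delimiter in a 256-entry table, after which every evaluation is a single array access regardless of how many delimiters were supplied.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Sketch of an O(1) char delimiter predicate: construction fills a
// 256-entry table, evaluation is one array access.
class char_delimiter_predicate
{
public:
   explicit char_delimiter_predicate(const std::string& delimiters)
   {
      std::memset(table_, 0, sizeof(table_));
      for (std::size_t i = 0; i < delimiters.size(); ++i)
         table_[static_cast<unsigned char>(delimiters[i])] = true;
   }

   bool operator()(char c) const
   {
      return table_[static_cast<unsigned char>(c)];
   }

private:
   bool table_[256];
};
```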

Single Delimiter Predicate

Instantiation requires specialization of type and construction requires an instance of the type.

strtk::single_delimiter_predicate<typename T>(const T& t)

strtk::single_delimiter_predicate<std::string::value_type> predicate('|');

Multiple Delimiter Predicate

Instantiation requires specialization of type and construction requires a sequence of potential delimiters through a range described by iterators.

strtk::multiple_delimiter_predicate<typename T>(Iterator begin, Iterator end);

std::string str_delimiters = " ,.;:<>'[]{}()_?/'`~!@#$%^&*|-_\"=+";
strtk::multiple_delimiter_predicate<char> mdp1(str_delimiters.begin(),str_delimiters.end());

unsigned int uint_delimiters[5] = {1,10,101,1010,10101};
strtk::multiple_delimiter_predicate<unsigned int> mdp2(uint_delimiters,uint_delimiters + 5);

As previously mentioned, tokenization of data need not be limited to strings comprised of chars; it can also be extended to other PODs or complex types. In the above example a predicate for tokenizing a sequence of unsigned ints is defined.

Multiple Char Delimiter Predicate

Instantiation requires an iterator range based on either unsigned char or char. This delimiter predicate is more efficient than the simple MDP, as it has a predicate evaluation of O(1) due to the use of a lookup table, as opposed to O(n) where n is the number of delimiters. The performance advantage grows as more delimiters are used.

strtk::multiple_char_delimiter_predicate(const std::string&)
strtk::multiple_char_delimiter_predicate(const std::string::const_iterator begin,const std::string::const_iterator end)

strtk::multiple_char_delimiter_predicate predicate(" .;?");

The delimiter concept can be extended to the point where the predicate itself acts as a state machine, transitioning from state to state based on conditions and rules related to the current symbol being processed. This simple extension can lead to some very interesting parsing capabilities.
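As a hypothetical illustration of such a stateful predicate (not part of StrTk), the functor below treats a comma as a delimiter only while outside double quotes, paired with a minimal driver that applies it:

```cpp
#include <cassert>
#include <string>
#include <vector>

// A predicate-as-state-machine: a comma delimits only outside quotes.
class quoted_comma_predicate
{
public:
   quoted_comma_predicate() : in_quotes_(false) {}

   bool operator()(char c)
   {
      if ('"' == c) in_quotes_ = !in_quotes_;
      return (!in_quotes_ && (',' == c));
   }

private:
   bool in_quotes_;
};

// Minimal driver: cut 'input' wherever the predicate fires.
std::vector<std::string> split_with(const std::string& input, quoted_comma_predicate p)
{
   std::vector<std::string> tokens;
   std::string token;
   for (std::size_t i = 0; i < input.size(); ++i)
   {
      if (p(input[i]))
      {
         tokens.push_back(token);
         token.clear();
      }
      else
         token += input[i];
   }
   tokens.push_back(token);
   return tokens;
}
```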

Split

This is a function that performs tokenization over an entire sequence in one go. strtk::split takes a sequence, through a range of iterators or in the case of a string through a reference to its instance, a delimiter predicate, and an output iterator, otherwise known as a type sink. It populates the output iterator with the tokens it extracts. The tokens in this case are std::pairs of iterators into the sequence.

StrTk Split Routine - Copyright Arash Partow

Split can be used in a "simple - no frills" manner by simply passing the necessary parameters:

std::string str = "abc|123|xyz|789";
strtk::std_string::token_list_type token_list;
strtk::split(" |.;?",str,std::back_inserter(token_list));

strtk::split can also be used in a more explicit manner, whereby the exact type of delimiter predicate can be specified by the user:

std::string str = "abc|123|xyz|789";
strtk::std_string::token_list_type token_list;

strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
strtk::split(predicate,str,std::back_inserter(token_list));

strtk::split provides an additional usage option that allows the user to specify whether they would like to compress the delimiters and whether or not they would like to include the delimiters as part of the token range. This enum parameter is called strtk::split_options and has the following values:

Split Option                                  Definition
strtk::split_options::default_mode            Default options
strtk::split_options::compress_delimiters     Consecutive delimiters are treated as one
strtk::split_options::include_1st_delimiter   The first delimiter is included in the resulting token range
strtk::split_options::include_delimiters      All delimiters are included in the resulting token range

The simple example below demonstrates a split over a string, given a predicate, where the provided split options indicate that consecutive delimiters will be treated as one and that all delimiters encountered after each token will be included in that token, up until the next token begins.

strtk::split(predicate,
             str,
             std::back_inserter(token_list),
             strtk::split_options::compress_delimiters | strtk::split_options::include_delimiters);

Another way of using split may be to use the strtk::multiple_char_delimiter_predicate as follows:

std::string str = "abc?123;xyz.789";
strtk::std_string::token_list_type token_list;
strtk::multiple_char_delimiter_predicate predicate(" .;?");
strtk::split(predicate,str,std::back_inserter(token_list));

The contents of the token_list can be printed out as follows:

strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
   std::cout << (*itr) << "\t";
   ++itr;
}

Split N-Tokens

A natural extension of strtk::split is strtk::split_n. This function provides the ability to tokenize a sequence until a specific number of tokens have been encountered or until there are no more tokens left. The return value of strtk::split_n is the number of tokens encountered.

std::string data = "token1?token2,token3;token4,token5";
strtk::std_string::token_list_type token_list;
const std::size_t token_count = 4;
const std::string delimiters (" ,.;?");
strtk::split_n(delimiters,
               data,
               token_count,
               std::back_inserter(token_list));
strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
   std::cout << "[" << (*itr) << "]\t";
   ++itr;
}
std::cout << std::endl;

Offset Splitter

Another simple variant is the strtk::offset_splitter. This form of split takes a series of offsets and an iterator range or string, and determines the tokens based on those offsets. The function can be set to perform a single pass of the offsets or to rotate through them until the range has been completely traversed. The example below demonstrates how a string representing a date and time can be tokenized into its constituent parts (year, month, day, hour, minute, second, millisecond).

std::string time_data = "20000101091011345"; //2000-01-01 9:10:11sec 345ms
const std::size_t offset_list_size = 7;
const int offset_list[offset_list_size] = {
                                            4, // year
                                            2, // month
                                            2, // day
                                            2, // hour
                                            2, // minute
                                            2, // second
                                            3  // ms
                                          };
const strtk::offset_predicate<offset_list_size> predicate(offset_list);
strtk::std_string::token_list_type token_list;
strtk::offset_splitter(time_data,predicate,std::back_inserter(token_list));

strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
   std::cout << "[" << (*itr) << "] ";
   ++itr;
}
std::cout << std::endl;
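Reduced to its essence, offset-based splitting simply walks the offset list and carves fixed-width fields out of the input; a minimal STL-only sketch (names invented for this article):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Carve fixed-width fields out of 'input' according to the offset list.
std::vector<std::string> split_offsets(const std::string& input,
                                       const std::size_t* offsets,
                                       std::size_t count)
{
   std::vector<std::string> fields;
   std::size_t pos = 0;
   for (std::size_t i = 0; (i < count) && (pos < input.size()); ++i)
   {
      fields.push_back(input.substr(pos, offsets[i]));
      pos += offsets[i];
   }
   return fields;
}
```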

Split Regex

Another form of splitter is based on the concept of using regular expressions as the delimiter predicate. Below is a simple example of splitting out tokens wrapped in round brackets.

std::string str = "(12)(345)(6789)(0ijkx)(yz)";
std::list<std::string> token_list;
strtk::split_regex("\\(.*?\\)",
                   str,
                   std::back_inserter(token_list),
                   strtk::regex_match_mode::match_1);

std::list<std::string>::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
   std::cout << "[" << (*itr) << "]\t";
   ++itr;
}
std::cout << std::endl;

Note: The parameter regex_match_mode represents the capture of the marked sub-expression in the current match. By default it is strtk::regex_match_mode::match_all, which in this case would provide the entire match including the bounding pattern, e.g. Token3 would be (0ijkx). However, in the above example we are only interested in the sub-expression between the round brackets, hence strtk::regex_match_mode::match_1 is used, resulting in Token3 being 0ijkx.
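The two match modes can be illustrated in a self-contained way with C++11's std::regex (StrTk itself relies on an external regex engine): submatch 0 corresponds to match_all and submatch 1 to match_1. Note that the pattern here marks the bracket interior with an explicit capture group:

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Collect either the whole match (submatch 0) or the first marked
// sub-expression (submatch 1) of every occurrence of 'pattern'.
std::vector<std::string> regex_split(const std::string& input,
                                     const std::string& pattern,
                                     int submatch)
{
   const std::regex expression(pattern);
   std::vector<std::string> tokens;
   std::sregex_iterator itr(input.begin(), input.end(), expression);
   const std::sregex_iterator end;
   for (; itr != end; ++itr)
      tokens.push_back((*itr)[submatch].str());
   return tokens;
}
```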

The following examples demonstrate the use of strtk::split_regex and strtk::split_regex_n routines in extracting specific types of data - in this case the POD types int and double.

int main()
{
   {
      // Extract ints from data string
      std::string data = "a 1^bc,0023| def?gh(4567ijk)-89 10l,m$n-op+123r@st+3u v*w2y56yz+";
      std::deque<int> int_list;
      strtk::split_regex("([+-]?([\\d]+))",
                         data,
                         strtk::range_to_type_back_inserter(int_list),
                         strtk::regex_match_mode::match_1);
   }

   {
      // Extract doubles from data string
      std::string data = "ab$c1.1?d-2.2ef#ghi+3.3%(123.456)!&*-7.89E+12@^=";
      std::deque<double> double_list;
      strtk::split_regex(strtk::ieee754_expression,
                         data,
                         strtk::range_to_type_back_inserter(double_list),
                         strtk::regex_match_mode::match_1);
   }

   {
      // Extract the first 3 ints from data string
      std::string data = "a 1^bc,0023| def?gh(4567ijk)-89 10l,m$n-op+123r@st+3u v*w2y56yz+";
      std::deque<int> int_list;
      strtk::split_regex_n("([+-]?([\\d]+))",
                           data,
                           3,
                           strtk::range_to_type_back_inserter(int_list),
                           strtk::regex_match_mode::match_1);
   }

   {
      // Extract the first 4 doubles from data string
      std::string data = "ab$c1.1?d-2.2ef#ghi+3.3%(123.456)!&*-7.89E+12@^=";
      std::deque<double> double_list;
      strtk::split_regex_n(strtk::ieee754_expression,
                           data,
                           4,
                           strtk::range_to_type_back_inserter(double_list),
                           strtk::regex_match_mode::match_1);
   }

   return 0;
}

The following table describes the various regex patterns built into StrTk which can be used seamlessly with the strtk::split_regex and strtk::split_regex_n routines.

Regex                       Definition
strtk::uri_expression       URI, URL address e.g.: http://www.example.com, domain.example.net/index.html
strtk::email_expression     E-mail address e.g.: some.one@example.com
strtk::ip_expression        IPv4 address e.g.: 192.168.0.1, 127.0.0.1
strtk::ieee754_expression   Floating point value e.g.: 1.1, 1.234e-123, -1.0001E+10, 0.1234

Tokenizer

The tokenizer is specialized on a sequence iterator and predicate. It is constructed with a range of iterators for the sequence and a reference to the desired predicate. Definitions exist for std::string tokenizers. The tokenizer provides an iteration pattern as a means for accessing the tokens in the sequence.

const unsigned int data_size = 12;
unsigned int data[data_size] = {1,2,3,0,4,5,6,0,7,8,0,9};
strtk::single_delimiter_predicate<unsigned int> predicate(0);
typedef strtk::tokenizer<unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer_type;
tokenizer_type tokenizer(data,data + data_size,predicate);

Similar to that of strtk::split, strtk::tokenizer provides tokenizing options that are passed in during construction. Below is a table depicting said options:

Tokenize Option                                  Definition
strtk::tokenize_options::default_mode            Default options
strtk::tokenize_options::compress_delimiters     Consecutive delimiters are treated as one
strtk::tokenize_options::include_1st_delimiter   The first delimiter is included in the resulting token range
strtk::tokenize_options::include_delimiters      All delimiters are included in the resulting token range
typedef strtk::tokenizer<unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer_type;
tokenizer_type tokenizer(data, data + data_size,
                         predicate,
                         strtk::tokenize_options::compress_delimiters |
                         strtk::tokenize_options::include_1st_delimiter);

Iteration over the tokens of a strtk::tokenizer is performed as follows:

tokenizer_type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
   std::copy((*itr).first,(*itr).second,std::ostream_iterator<unsigned int>(std::cout," "));
   std::cout << std::endl;
   ++itr;
}

A typical std::string can be tokenized in the following manner:

std::string str = "abc|123|xyz|789";
strtk::std_string::tokenizer<>::type tokenizer(str,"|");
strtk::std_string::tokenizer<>::type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
   std::cout << "[" << (*itr) << "]\t";
   ++itr;
}
std::cout << std::endl;

Another common situation may be tokenizing a sequence of strings, such as the following:

const std::string str_list[] = { "abc" , "delimiter" , "ijk" , "delimiter" ,
                                 "mno" , "delimiter" , "rst" , "uvw" ,
                                 "delimiter" , "xyz" };

const std::size_t str_list_size = sizeof(str_list) / sizeof(std::string);

strtk::range_adapter<std::string> range(str_list,str_list_size);
strtk::single_delimiter_predicate<std::string> predicate("delimiter");
typedef strtk::tokenizer<std::string*,strtk::single_delimiter_predicate<std::string> > tokenizer_type;
tokenizer_type tokenizer(range.begin(),range.end(),predicate);
tokenizer_type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
   std::copy((*itr).first,(*itr).second,std::ostream_iterator<std::string>(std::cout," "));
   std::cout << std::endl;
   ++itr;
}

The Parse Routine

Until now the routines mentioned have worked specifically with tokens, or in other words, ranges of characters. The responsibility of managing the tokens and converting them to user-specified types was handled manually via "range to type" oriented back inserters and converters. This can be a bit cumbersome, and as such StrTk provides a series of helper routines called strtk::parse. Parse takes a std::string representing a tuple of delimited values as input data, a delimiter set, and a series of references to variables that are to be populated with the values from the parsed tokens. The following diagram demonstrates the flow of data and tokens and the corresponding relationships and conversions between each of the parameters.

StrTk Parse Routine - Copyright Arash Partow

Note: strtk::parse returns a boolean value of true upon successful parsing and false for all other results. Situations that cause strtk::parse to fail include:

  • Insufficient number of tokens for the given number of variables
  • Conversion failure from token range to variable type
  • Empty or null token(s)
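The essence of such a parse routine can be sketched with string streams; the hypothetical two-variable version below splits off one token per variable, converts each, and reports overall success (strtk::parse generalizes this over an arbitrary argument list):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Split 'data' at the first delimiter and convert each half into the
// corresponding variable, returning false on missing tokens or on
// conversion failure.
template <typename T1, typename T2>
bool parse_two(const std::string& data, char delimiter, T1& t1, T2& t2)
{
   const std::string::size_type pos = data.find(delimiter);
   if (std::string::npos == pos)
      return false; // insufficient number of tokens
   std::istringstream s1(data.substr(0, pos));
   std::istringstream s2(data.substr(pos + 1));
   return (s1 >> t1) && (s2 >> t2); // conversion failure yields false
}
```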

Some Simple Parse Examples

strtk::parse can take an arbitrary number of variable references. The code below demonstrates the basic usage of strtk::parse with various numbers of parameters.

std::string data = "abcde,-1234|567.890#1.1f";
std::string delimiters = ",|#";

std::string var0;
int var1;
double var2;
float var3;

strtk::parse(data,delimiters,var0);

strtk::parse(data,delimiters,var0,var1);

strtk::parse(data,delimiters,var0,var1,var2);

strtk::parse(data,delimiters,var0,var1,var2,var3);

The following examples demonstrate parsing of PODs such as int and double into STL compatible sequences (std::vector, std::deque, std::list, std::set, std::queue, std::stack and std::priority_queue).

// Insert into std::vector
std::string int_string = "1 +2 -3 4 +5 -6 7 +8 -9 10 +11 -12 13 +14 -15";
std::vector<int> int_vector;
strtk::parse(int_string," ",int_vector);

// Insert into std::deque
std::string double_string = "-123.456,789.012,-345.678,901.234,+567.890";
std::deque<double> double_deque;
strtk::parse(double_string,",",double_deque);

// Insert into std::list
std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz";
std::list<std::string> string_list;
strtk::parse(data_string,",",string_list);

// Insert into std::set
std::string set_string = "a|bc#def|ghij#klmno|pqrstu#vwxyz";
std::set<std::string> string_set;
strtk::parse(set_string,"#|",string_set);

// Insert into std::queue
std::string queue_string = "value1,value2,value3,value4,value5";
std::queue<std::string> string_queue;
strtk::parse(queue_string,",|",string_queue);

// Insert into std::stack
std::string stack_string = "value1|value2,value3|value4,value5";
std::stack<std::string> string_stack;
strtk::parse(stack_string,",|",string_stack);

// Insert into std::priority_queue
std::string priority_queue_string = "value1|value2,value3#value4,value5";
std::priority_queue<std::string> string_priority_queue;
strtk::parse(priority_queue_string,",|#",string_priority_queue);

Similar to what is described above, the following demonstrates parsing of up to "N" elements into an STL compatible sequence.

// Insert N elements into std::vector
std::string int_string = "100,-200,+300,400,-500,+600,700,-800,+900";
std::vector<int> int_vector;
strtk::parse_n(int_string,",",5,int_vector);

// Insert N elements into std::deque
std::string double_string = "123.456,+789.012,345.678,-901.234,567.890";
std::deque<double> double_deque;
strtk::parse_n(double_string,",",3,double_deque);

// Insert N elements into std::list
std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz";
std::list<std::string> string_list;
strtk::parse_n(data_string,",",6,string_list);

// Insert N elements into std::set
std::string set_string = "a|bc#def|ghij#klmno|pqrstu#vwxyz";
std::set<std::string> string_set;
strtk::parse_n(set_string,"#|",6,string_set);

// Insert N elements into std::queue
std::string queue_string = "value0,value1,value2,value3,value4,value5";
std::queue<std::string> string_queue;
strtk::parse_n(queue_string,",|",4,string_queue);

// Insert N elements into std::stack
std::string stack_string = "value0|value1|value2,value3|value4,value5";
std::stack<std::string> string_stack;
strtk::parse_n(stack_string,",|",4,string_stack);

// Insert N elements into std::priority_queue
std::string priority_queue_string = "value0#value1|value2,value3#value4,value5";
std::priority_queue<std::string> string_priority_queue;
strtk::parse_n(priority_queue_string,",|#",4,string_priority_queue);

Note: The return value of the routine strtk::parse_n indicates how many elements were parsed and placed into the specified sequence.

A Practical Example

Let's assume you have been given an English text file to process, with the intention of extracting a lexicon from the file.

One solution would be to break the problem down to a line-by-line tokenization problem. In this case you would define a function object, such as the following, which takes the container in which you plan on storing your tokens (words) and a predicate, and inserts the tokens as strings into the container.

template<typename Container, typename Predicate>
struct parse_line
{
public:

   parse_line(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate)
   {}

   inline void operator() (const std::string& str)
   {
      strtk::split(predicate_,
                   str,
                   strtk::range_to_type_back_inserter(container_),
                   strtk::split_options::compress_delimiters);
   }

private:

   Container& container_;
   const Predicate& predicate_;
};

Putting it all together - opening the file and reading it line by line, each time invoking parse_line - results in the following:

template<typename Container>
void parse_text(const std::string& file_name, Container& c)
{
   static const std::string delimiters = " ,.;:<>'[]{}()_?/"
                                         "`~!@#$%^&*|-_\"=+\t\r\n\0"
                                         "0123456789";
   strtk::multiple_char_delimiter_predicate predicate(delimiters);
   strtk::for_each_line(file_name,
                        parse_line<Container,strtk::multiple_char_delimiter_predicate>(c,predicate));
}

int main()
{
   std::string text_file_name = "text.txt";
   std::deque<std::string> word_list;
   parse_text(text_file_name,word_list);
   std::cout << "Token Count: " << word_list.size() << std::endl;
   return 0;
}

Now, coming back to the original problem, that being the construction of a lexicon: in this case the set of "words" should only contain words of interest. For the sake of simplicity, let's define words of interest as being anything other than the following prepositions: as, at, but, by, for, in, like, next, of, on, opposite, out, past, to, up and via. This type of list is commonly known as a stop word list. In this example the stop-word list definition will be as follows:

const std::string stop_word_list [] =
                  {
                     "as", "at", "but", "by", "for",
                     "in", "like", "next", "of", "on",
                     "opposite", "out", "past", "to",
                     "up", "via", ""
                  };

const std::size_t stop_word_list_size = sizeof(stop_word_list) / sizeof(std::string);

Some minor updates to the parse_line processor include using the filter_on_match predicate to determine whether the currently processed token is a preposition, and invoking the range_to_type back inserter to convert the tokens from their range-iterator representation to a type compatible with the user-defined container. For the new implementation to provide unique words of interest, the simplest change is to replace the deque used as the container for the word_list with a 1-to-1 associative container such as a set. The following is the improved version of the parse_line processor:

template<typename Container, typename Predicate>
struct parse_line
{
public:

   parse_line(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate),
     tmp_(" "),
     tokenizer_(tmp_,predicate_,true),
     filter_(stop_word_list,stop_word_list + stop_word_list_size,
             strtk::range_to_string_back_inserter_iterator<Container>(container_),
             true,false)
   {}

   inline void operator() (const std::string& s)
   {
      const filter_type& filter = filter_;
      strtk::for_each_token(s,tokenizer_,filter);
   }

private:

   typedef strtk::range_to_string_back_inserter_iterator<Container> inserter_type;
   typedef strtk::filter_on_match<inserter_type> filter_type;

   Container& container_;
   const Predicate& predicate_;
   std::string tmp_;
   typename strtk::std_string_tokenizer<Predicate>::type tokenizer_;
   filter_type filter_;
};

The above-described predicate can be greatly simplified by using binders and various lambda expressions.
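The filtering step can also be viewed in isolation; the sketch below (a hypothetical stand-in for strtk::filter_on_match) keeps a token only if it is absent from the stop-word set:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Keep only the tokens that do not appear in the stop-word set.
std::vector<std::string> filter_stop_words(const std::vector<std::string>& tokens,
                                           const std::set<std::string>& stop_words)
{
   std::vector<std::string> kept;
   for (std::size_t i = 0; i < tokens.size(); ++i)
   {
      if (stop_words.end() == stop_words.find(tokens[i]))
         kept.push_back(tokens[i]);
   }
   return kept;
}
```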

Another Example

When performing serialization or deserialization of an object instance, such as a class or struct, a simple approach one could take would be to convert each of the members into a string representation and, from those strings, construct a larger string, delimiting each member with a special character guaranteed not to exist in any of the string representations.

In this example we will assume that there exists a struct which represents the properties of a person, a person struct if you will:

struct person
{
   unsigned int id;
   std::string name;
   unsigned int age;
   double height;
   float weight;
};
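The serialization direction described at the start of this section can be sketched with a string stream; each member is written out separated by a pipe assumed not to occur in any field (the struct is restated here so the snippet stands alone):

```cpp
#include <cassert>
#include <sstream>
#include <string>

struct person
{
   unsigned int id;
   std::string name;
   unsigned int age;
   double height;
   float weight;
};

// Serialize a person into a pipe-delimited tuple; assumes '|' does not
// occur in the name field.
std::string to_tuple(const person& p)
{
   std::ostringstream stream;
   stream << p.id << '|' << p.name << '|' << p.age
          << '|' << p.height << '|' << p.weight;
   return stream.str();
}
```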

The process of populating a person struct would entail having an instance of a person and the necessary data string. The following is an example of how this would be done using the strtk::parse function.

Person Tuple Format

Token0           Delimiter0   Token1   Delimiter1   Token2   Delimiter2   Token3      Delimiter3   Token4
Unique ID(hex)   |            Name     |            Age      |            Height(m)   |            Weight(kg)
std::string data = "0xFA37ED12|Rumpelstiltskin|397|1.31|58.7";
person p;
strtk::hex_to_number_sink<unsigned int> hex_sink(p.id); // register id with the hex sink
strtk::parse(data,"|",hex_sink,p.name,p.age,p.height,p.weight);

Batch processing of a text file comprised of one person tuple per line is somewhat similar to the previous example. A function object is defined that takes a container specialized on the person struct, and within which the strtk::parse function is invoked on each line. This function object is then instantiated and, coupled with the text file name, fed to the strtk::for_each_line processor.

template<typename Container>
struct parse_person_tuple
{
public:

   parse_person_tuple(Container& container)
   : container_(container),
     hex_sink(p_.id)
   {}

   inline void operator() (const std::string& s)
   {
      if (strtk::parse(s,"|",hex_sink,p_.name,p_.age,p_.height,p_.weight))
         container_.push_back(p_);
      else
         std::cerr << "Failed to parse: " << s << std::endl;
   }

private:

   Container& container_;
   person p_;
   strtk::hex_to_number_sink<unsigned int> hex_sink;
};

Bringing the above pieces together to process a file results in the following:

int main()
{
   std::string text_file_name = "person_records.txt";
   std::deque<person> person_list;
   strtk::for_each_line(text_file_name,
                        parse_person_tuple<std::deque<person> >(person_list));
   return 0;
}

To make things easier one could adapt a struct (made up entirely of PODs) to the parser. This makes the usage syntax a little easier to follow. An example of this adaptation is as follows:

struct type
{
   std::string s;
   double d;
   int i;
   char c;
   bool b;
};

strtk_parse_begin(type)
 strtk_parse_type(s)
 strtk_parse_type(d)
 strtk_parse_type(i)
 strtk_parse_type(c)
 strtk_parse_type(b)
strtk_parse_end()

int main()
{
   type t;
   std::string s = "abcdefghijklmnop|123.456|987654321|A|1";
   strtk::parse(s,"|",t);
   return 0;
}

Another example, similar to the previous, would be parsing a text file of 3D coordinates into a sequence. This can be done easily and cleanly using lambdas and StrTk as follows:

struct point
{
   double x,y,z;
};

int main()
{
   std::string point_data = "point_data.txt";
   std::deque<point> points;
   point p;
   strtk::for_each_line(point_data,
                        [&points,&p](const std::string& str)
                        {
                           if (strtk::parse(str,",",p.x,p.y,p.z))
                              points.push_back(p);
                        });
   return 0;
}

Simple Date-Time Parser

Assume the datetime struct defined below, and a string representation of a combined date and time in the form YYYY-MM-DD HH:MM:SS.MS, e.g. 2005-06-26 15:57:03.678:

struct datetime
{
   unsigned int year;
   unsigned int month;
   unsigned int day;
   unsigned int hour;
   unsigned int minute;
   unsigned int second;
   unsigned int msecond;
};

The following assumes an input of date-time values separated by a pipe. To facilitate parsing of a date-time by the strtk::parse routine into an STL compatible sequence, an implementation of string_to_type_converter_impl specific to the datetime type is required. The following demonstrates how such a routine can be generated and used within the strtk::parse context:

strtk_string_to_type_begin(datetime)
   static const std::string delimiters ("-:. ");
   return strtk::parse(begin, end, delimiters,
                       t.year, t.month, t.day,
                       t.hour, t.minute, t.second, t.msecond);
strtk_string_to_type_end()

A simple example of using strtk::parse in conjunction with the newly integrated datetime variable mixed with variables of other types is as follows:

int main()
{
   const std::string data = "abc 123 xyz,2000-01-10 03:01:16.123|+98765.43210";

   std::string var0;
   datetime var1;
   double var2;

   strtk::parse(data,",|",var0,var1,var2);

   return 0;
}

Bringing the above pieces together, we can then parse a sequence of date-time strings delimited by a pipe "|" into a deque of type datetime:

int main()
{
   static const std::string data = "2000-01-10 03:01:16.123|2001-02-22 05:12:24.234|"
                                   "2002-03-13 07:23:32.345|2003-04-24 09:34:47.456|"
                                   "2004-05-15 11:46:51.767|2005-06-26 15:57:03.678|"
                                   "2006-07-17 17:45:31.561|2007-08-26 19:06:02.809|"
                                   "2008-09-18 21:16:23.267|2009-10-26 23:12:03.798|"
                                   "2010-11-23 13:47:11.963|2011-12-26 15:35:08.168";

   std::deque<datetime> datetime_list;
   strtk::parse(data,"|",datetime_list);

   for (std::size_t i = 0; i < datetime_list.size(); ++i)
   {
      datetime& dt = datetime_list[i];
      std::cout << dt.year << "-"
                << strtk::text::right_align(2,'0',  dt.month) << "-"
                << strtk::text::right_align(2,'0',    dt.day) << " "
                << strtk::text::right_align(2,'0',   dt.hour) << ":"
                << strtk::text::right_align(2,'0', dt.minute) << ":"
                << strtk::text::right_align(2,'0', dt.second) << "."
                << strtk::text::right_align(3,'0',dt.msecond) << std::endl;
   }
   return 0;
}

Parsing Sub-Lists

So far the demonstrated capabilities of the strtk::parse function have been based on passing a series of parameters that are populated in a linear fashion as the parser processes the tokens it encounters. That said, some formats have their own sub-structures; a simple example would be a list of values (such as integers) that needs to be loaded into a deque or stack. StrTk provides a series of sink facilities that consume a range and an STL container, and which can be forwarded on to strtk::parse.

In the following example, the data string is comprised of 3 separate lists delimited by a pipe "|": an integer list, a string list and a double list. Each list is to be parsed into an STL container of the appropriate type - in this case a vector, a deque and a list. StrTk provides the ability to instantiate a sink for the specific container type that is compatible with strtk::parse.

int main()
{
   std::string data = "1,+2,-3,4|abc,ijk,rst,xyz|123.456,+234.567,-345.678,456.789,567.890";

   //define containers
   std::vector<int> int_vector;
   std::deque<std::string> string_deque;
   std::list<double> double_list;
   std::set<int> int_set;
   std::queue<std::string> string_queue;
   std::stack<double> double_stack;
   std::priority_queue<int> int_priority_queue;

   //define sinks
   strtk::vector_sink<int>::type         vec_sink(",");
   strtk::deque_sink<std::string>::type  deq_sink(",");
   strtk::list_sink<double>::type        lst_sink(",");
   strtk::set_sink<int>::type            set_sink(",");
   strtk::queue_sink<std::string>::type  que_sink(",");
   strtk::stack_sink<double>::type       stk_sink(",");
   strtk::priority_queue_sink<int>::type prq_sink(",");


   strtk::parse(data,"|",vec_sink(  int_vector),
                         deq_sink(string_deque),
                         lst_sink( double_list));

   strtk::parse(data,"|",set_sink(           int_set),
                         que_sink(      string_queue),
                         stk_sink(      double_stack),
                         prq_sink(int_priority_queue));

   return 0;
}

If only a certain number of elements in the list are required, for example only the first 3, then the element count on the sink can be set appropriately. The above example could be modified as follows:

int main()
{
   std::string data = "1,+2,-3,4|string0|abc,ijk,rst,xyz|string1|123.456,+234.567,-345.678,456.789,567.890";

   std::vector<int> int_vector;
   std::deque<std::string> string_deque;
   std::list<double> double_list;

   strtk::vector_sink<int>::type        vec_sink(",");
   strtk::deque_sink<std::string>::type deq_sink(",");
   strtk::list_sink<double>::type       lst_sink(",");

   std::string string_0;
   std::string string_1;

   strtk::parse(data,"|",vec_sink(  int_vector).count(2),  // consume first 2 values
                         string_0,
                         deq_sink(string_deque).count(3),  // consume first 3 values
                         string_1,
                         lst_sink( double_list).count(4)); // consume first 4 values
   return 0;
}

Note: If there aren't enough elements in a particular list, then parsing of that list fails and subsequently the whole strtk::parse call will fail as well.

Extending The Date-Time Parser Example

Building upon the previous datetime example, we are presented with a tuple of data that represents an astronomical event. The event defines a name, a location and a series of date-times in UTC at which the event was observed. In order to construct the necessary sink(s) that will be used for parsing the required type into a container, the macro strtk_register_userdef_type_sink is invoked with the specified type. The following is a definition of the struct one might construct:

struct event_information
{
   std::size_t id;
   std::string name;
   std::string location;
   std::deque<datetime> observation_datetimes;
};

strtk_register_userdef_type_sink(datetime)

Bringing the above together with a call to strtk::parse results in the following code which parses the event data tuple into the allocated event_information instance.

int main()
{
   std::string event_data = "172493|Lunar Impact|Mare Tranquillitatis|"
                            "2010-01-19 00:28:45.357,2010-02-18 00:57:07.109,"
                            "2010-03-20 01:15:11.261,2010-04-21 01:07:27.972";

   strtk::deque_sink<datetime>::type deq_sink(",");

   event_information evt_info;
   strtk::parse(event_data,"|",evt_info.id,
                               evt_info.name,
                               evt_info.location,
                               deq_sink(evt_info.observation_datetimes));
   return 0;
}

Ignoring Tokens During Parsing

There may be scenarios where, given a delimited tuple of data, one or more of the tokens need to be ignored or skipped. StrTk provides a mechanism called strtk::ignore_token that allows the parser to consume specific tokens easily without affecting overall performance. Below is an example of how strtk::ignore_token can be used in conjunction with strtk::parse to skip the 2nd and 4th tokens in the tuple:

int main()
{
   static const std::string data = "+123,ignore0,123.456,ignore1,abcdef,ignore2";

   int i;
   double d;
   std::string s;

   strtk::ignore_token ignore;
   strtk::parse(data,",",i,ignore,d,ignore,s);
   std::cout << "i=" << i << " d=" << d << " s=" << s << std::endl;

   return 0;
}

Simple 3D Mesh File Format Parser

Let's assume there is a file format for expressing a mesh. The format consists of a section that defines the unique vertices in the mesh, and another section that defines the triangles in the mesh as tuples of three indices indicating the vertices used. Each section is preceded by an integer that denotes the number of subsequent elements in that section. An example of such a file is the following:

5
+1.0,+1.0,+1.0
-1.0,+1.0,-1.0
-1.0,-1.0,+1.0
+1.0,-1.0,-1.0
+0.0,+0.0,+0.0
4
0,1,4
1,2,4
2,3,4
3,1,4

Code to parse such a file format is as follows:

struct point
{
   double x,y,z;
};

struct triangle
{
   std::size_t i0,i1,i2;
};

int main()
{
   std::string mesh_file = "mesh.txt";
   std::ifstream stream(mesh_file.c_str());
   std::string s;
   // Process points section
   std::deque<point> points;
   point p;
   std::size_t point_count = 0;
   strtk::parse_line(stream," ",point_count);
   strtk::for_each_line_n(stream,
                          point_count,
                          [&points,&p](const std::string& line)
                          {
                             if (strtk::parse(line,",",p.x,p.y,p.z))
                                points.push_back(p);
                          });

   // Process triangles section
   std::deque<triangle> triangles;
   triangle t;
   std::size_t triangle_count = 0;
   strtk::parse_line(stream," ",triangle_count);
   strtk::for_each_line_n(stream,
                          triangle_count,
                          [&triangles,&t](const std::string& line)
                          {
                             if (strtk::parse(line,",",t.i0,t.i1,t.i2))
                                triangles.push_back(t);
                          });
   return 0;
}

Pick A Random Line From A Text File

A random line of text is to be selected from a user provided text file comprised of lines of varying length, such that the probability of the line being selected is 1/N, where N is the number of lines in the text file. There are many trivial solutions to this problem. However, if one were to further constrain the problem by indicating that the file is very large (TBs) and that the system upon which the solution is to run is very limited memory-wise, most if not all trivial solutions, such as storing indexes of all line offsets or reading the entire file into memory, will be eliminated.

However, there exists a very elegant solution to this problem with O(n) time and O(1) space complexity, that involves scanning the entire file once, line by line, and at every ith line choosing whether or not to replace the final result line with the current line by sampling a uniform distribution over the range [0,1) and performing the replace if and only if the random value is less than 1/i.

StrTk Line Probability Diagram - Copyright Arash Partow

The logic behind this solution revolves around the fact that the probability of selecting the ith line will be 1/i, and as such the total probability of selecting any of the previous i-1 lines will be 1 - (1/i) or (i - 1)/i. Because there are (i - 1) lines before the ith line, we divide the previous sum of probabilities by (i - 1), resulting in a selection probability of 1/i for any one of the lines up to and including the ith line. If the ith line were to be the last line of the text file, this then results in each of the lines having a selection probability of 1/N - simple and sweet, and so too is the following implementation of said solution:

int main(int argc, char* argv[])
{
   if (argc < 2) return 1;
   std::string file_name = argv[1];
   std::string line;
   std::size_t i = 0;
   strtk::uniform_real_rng rng(static_cast<std::size_t>(::time(0)));

   strtk::for_each_line(file_name,
                        [&i,&line,&rng](const std::string& s)
                        { if (rng() < (1.0 / ++i)) line = s; }
                       );

   std::cout << line << std::endl;
   return 0;
}

Note: TAOCP Volume II section 3.4.2 has an in-depth discussion of this problem and other similar problems relating to random distributions. One should also note that the above example has an inefficiency: upon every replacement, an actual string is copied. Normally all one needs to do is maintain a file offset to the beginning of the selected line; not doing so causes slow-downs due to continuous memory allocations/reallocations, made all the worse when long lines are encountered.

Token Grid

StrTk provides a means to easily parse and consume 2D grids of tokens in an efficient and simple manner. A grid is simply defined as a series of rows comprised of tokens, otherwise known as Delimiter Separated Values (DSV). The ith token of a row is grouped with every ith token of all other rows to produce a column. Tokens can be processed as either rows or columns.

An example of a simple token grid, where each numeric value is deemed to be a token:

1.1 2.2 3.3 4.4 5.5
1.1 2.2 3.3 4.4 5.5
1.1 2.2 3.3 4.4 5.5
1.1 2.2 3.3 4.4 5.5
1.1 2.2 3.3 4.4 5.5

A token grid can be passed in either via a file or a string buffer. The delimiters to be used for parsing the columns and rows can also be provided; if they are not, standard common defaults will be used.

The following demonstrates how each cell in the grid can be accessed and cast to a specific type.

std::string data = "1,2,3,4,5,6\n"
                   "7,8,9,0,1,2\n"
                   "3,4,5,6,7,8\n"
                   "9,0,1,2,3,4\n"
                   "5,6,7,8,9,0\n";

strtk::token_grid grid(data,data.size(),",");

for (std::size_t r = 0; r < grid.row_count(); ++r)
{
   strtk::token_grid::row_type row = grid.row(r);
   for (std::size_t c = 0; c < row.size(); ++c)
   {
      std::cout << grid.get<int>(r,c) << "\t";
   }
   std::cout << std::endl;
}

The following example demonstrates how averages over rows and columns of a token grid can be computed:

std::string data = "1.1,1.1,1.1,1.1,1.1,1.1\n"
                   "2.2,2.2,2.2,2.2,2.2,2.2\n"
                   "3.3,3.3,3.3,3.3,3.3,3.3\n"
                   "4.4,4.4,4.4,4.4,4.4,4.4\n"
                   "5.5,5.5,5.5,5.5,5.5,5.5\n"
                   "6.6,6.6,6.6,6.6,6.6,6.6\n";

strtk::token_grid grid(data,data.size(),",");

std::vector<double> avg_c(grid.row(0).size(),0.0);
std::vector<double> avg_r(grid.row_count(),0.0);
std::vector<double> tmp(grid.row(0).size(),0.0);

for (std::size_t i = 0; i < grid.row_count(); ++i)
{
   grid.row(i).parse<double>(tmp.begin());
   std::transform(avg_c.begin(),avg_c.end(),tmp.begin(),avg_c.begin(),std::plus<double>());
   avg_r[i] = std::accumulate(tmp.begin(),tmp.end(),0.0) / tmp.size();
}

for (std::size_t i = 0; i < avg_c.size(); ++i) avg_c[i] /= grid.row_count();

std::cout << "Column Averages:\t";
std::copy(avg_c.begin(),avg_c.end(),std::ostream_iterator<double>(std::cout,"\t"));
std::cout << "\n";

std::cout << "Row Averages:\t";
std::copy(avg_r.begin(),avg_r.end(),std::ostream_iterator<double>(std::cout,"\t"));
std::cout << "\n";


Processing Of Comma Separated Values Data

The original intent of the token grid was to support fast and efficient processing of simple data tuples, such as comma separated values (CSV) formats et al. The following example demonstrates a simple summation of traded floor volume and average daily volume based on NASDAQ OHLC (Open-High-Low-Close) data.

                          //Date,Symbol,Open,Close,High,Low,Volume
std::string market_data = "20090701,GOOG,424.2000,418.9900,426.4000,418.1500,2310768\n"
                          "20090701,MSFT,24.0500,24.0400,24.3000,23.9600,54915127\n"
                          "20090702,GOOG,415.4100,408.4900,415.4100,406.8100,2517630\n"
                          "20090702,MSFT,23.7600,23.3700,24.0400,23.2100,65427699\n"
                          "20090703,GOOG,408.4900,408.4900,408.4900,408.4900,0\n"
                          "20090703,MSFT,23.3700,23.3700,23.3700,23.3700,0\n"
                          "20090706,GOOG,406.5000,409.6100,410.6400,401.6600,2262557\n"
                          "20090706,MSFT,23.2100,23.2000,23.2800,22.8700,49207638\n"
                          "20090707,GOOG,408.2400,396.6300,409.1900,395.9801,3260307\n"
                          "20090707,MSFT,23.0800,22.5300,23.1400,22.4600,52842412\n"
                          "20090708,GOOG,400.0000,402.4900,406.0000,398.0600,3441854\n"
                          "20090708,MSFT,22.3100,22.5600,22.6900,2200000,73023306\n"
                          "20090709,GOOG,406.1200,410.3900,414.4500,405.8000,3275816\n"
                          "20090709,MSFT,22.6500,22.4400,22.8100,22.3700,46981174\n"
                          "20090710,GOOG,409.5700,414.4000,417.3700,408.7000,2929559\n"
                          "20090710,MSFT,22.1900,22.3900,22.5400,22.1500,43238698\n";

strtk::token_grid grid(market_data,market_data.size(),",");

struct stock_info
{
   stock_info(const std::string& s = " ")
   : symbol(s),
     total_volume(0),
     day_count(0),
     average_daily_volume(0.0)
   {}

   std::string symbol;
   unsigned long long total_volume;
   std::size_t day_count;
   double average_daily_volume;
};

stock_info goog("GOOG");
stock_info msft("MSFT");

static const std::size_t volume_column = 6;
static const std::size_t symbol_column = 1;

goog.day_count = grid.accumulate_column(volume_column,
                                        [](const strtk::token_grid::row_type& row) -> bool
                                        {
                                           static const std::string google_symbol("GOOG");
                                           return google_symbol == row.get<std::string>(symbol_column);
                                        },
                                        goog.total_volume);

msft.day_count = grid.accumulate_column(volume_column,
                                        [](const strtk::token_grid::row_type& row) -> bool
                                        {
                                           static const std::string microsoft_symbol("MSFT");
                                           return microsoft_symbol == row.get<std::string>(symbol_column);
                                        },
                                        msft.total_volume);

goog.average_daily_volume = (1.0 * goog.total_volume) / goog.day_count;
msft.average_daily_volume = (1.0 * msft.total_volume) / msft.day_count;

std::cout << "[GOOG] Total Volume: " << goog.total_volume << std::endl;
std::cout << "[MSFT] Total Volume: " << msft.total_volume << std::endl;

std::cout << "[GOOG] ADV: " << goog.average_daily_volume << std::endl;
std::cout << "[MSFT] ADV: " << msft.average_daily_volume << std::endl;

The strtk::token_grid is thread-safe so long as only read operations are in play. As such the above calls to accumulate_column et al. can all be safely and easily executed concurrently using threads, allowing for a far more efficient data processing methodology.

TIMTOWTDI

Playing devil's advocate: if only the specific values for computing the ADV are required and no further processing of the CSV data is needed, then the problem can be solved efficiently in a single pass of the data as follows:

std::string file_name = "market_data.csv";

std::unordered_map<std::string,stock_info> stock_map;

stock_map.insert(std::make_pair("GOOG",stock_info("GOOG")));
stock_map.insert(std::make_pair("MSFT",stock_info("MSFT")));

strtk::for_each_line(file_name,
                     [&stock_map](const std::string& line)
                     {
                        strtk::ignore_token ignore;
                        stock_info temp;
                        const bool result = strtk::parse(line,
                                                         ",",
                                                         ignore,
                                                         temp.symbol,
                                                         ignore,
                                                         ignore,
                                                         ignore,
                                                         ignore,
                                                         temp.total_volume);
                        if (!result) return;
                        auto iter = stock_map.find(temp.symbol);
                        if (stock_map.end() == iter) return;
                        (*iter).second.total_volume += temp.total_volume;
                        (*iter).second.day_count++;
                     });

auto itr = stock_map.begin();
auto end = stock_map.end();

while (end != itr)
{
   stock_info& stock = (*itr++).second;
   stock.average_daily_volume = (1.0 * stock.total_volume) / stock.day_count;
}

Sequential Partitions

A typical operation carried out upon time-series data is to group tuples into buckets (or bins) based upon the time index value - for example, grouping data into 2-minute buckets and then performing some kind of operation upon the grouped tuples, such as a summation or an average. This process is sometimes also called "discretization".

The strtk::token_grid class provides a method known as sequential_partition. The sequential_partition method requires a Transition Predicate, a Function and optionally a row-range. The Transition Predicate consumes a row and returns true only if the row is in the next partition. All subsequent consecutive rows, until the transition predicate next returns true, are said to be in the current partition. Prior to transitioning to a new partition, the function predicate is invoked and provided with the range of rows in the current partition.

The following example takes a simple time-series (time and value), partitions the tuples into time-buckets of period length 3, and then proceeds to compute the total sum of each group. The summarizer class below provides both the Transition Predicate and the Function.

StrTk Sequential Partition - Copyright Arash Partow

class summarizer
{
public:

   enum column_index
   {
      tick_time_column  = 0,
      tick_value_column = 1
   };

   summarizer(std::deque<double>& sum_value)
   : next_tick_time_(0),
     sum_value_(sum_value)
   {}

   // Transition Predicate
   bool operator()(const strtk::token_grid::row_type& row)
   {
      if (row.get<std::size_t>(tick_time_column) >= next_tick_time_)
      {
         next_tick_time_ = row.get<std::size_t>(tick_time_column) + 3;
         return true;
      }
      else
         return false;
   }

   // Function
   bool operator()(const strtk::token_grid& grid,
                   const strtk::token_grid::row_range_type& range)
   {
      double bucket_sum = 0.0;
      if (!grid.accumulate_column(tick_value_column,range,bucket_sum))
      {
         std::cout << "failed to accumulate!" << std::endl;
         return false;
      }
      else
         sum_value_.push_back(bucket_sum);
      return true;
   }

private:

   summarizer& operator=(const summarizer&);

   std::size_t next_tick_time_;
   std::deque<double>& sum_value_;
};

int main()
{
                      //time index, value
   std::string data = "10000,123.456\n"
                      "10001,612.345\n"
                      "10002,561.234\n"
                      "10003,456.123\n"
                      "10004,345.612\n"
                      "10005,234.561\n"
                      "10006,123.456\n";

   strtk::token_grid grid(data,data.size(),",");
   std::deque<double> sum_value;
   summarizer s(sum_value);
   grid.sequential_partition(s,s);
   for (std::size_t i = 0; i < sum_value.size(); ++i)
   {
      std::cout << "bucket[" << i << "] = " << sum_value[i] << std::endl;
   }
   return 0;
}

The expected output is as follows:

  bucket[0] = 1297.035
  bucket[1] = 1036.296
  bucket[2] = 123.456

Parsing CSV Files With Embedded Double-Quotes

One of the simple extensions to the CSV format is the concept of double quoted tokens. Such tokens may contain column or row delimiters. When such a scenario is encountered, all subsequent delimiters are ignored, and kept as part of the token, until the corresponding closing double quote is encountered. The StrTk token_grid supports the parsing of such tokens. This parsing mode can be easily activated via the token_grid option set. Below is an example of a token_grid loading a CSV data set representing various airports from around the world and their specific codes and locations, in which some of the cells are double-quoted:

ICAO,IATA,Airport,City,Country
AYGA,GKA,"Goroka Gatue",Goroka,Papua New Guinea
BGCO,GCO,"Nerlerit Inaat Constable Pynt","Nerlerit Inaat",Greenland
BZGD,ZGD,Godley,Auckland,New Zealand
CYQM,YQM,"Greater Moncton International",Moncton,Canada
EDRK,ZNV,"Koblenz Winningen",Koblenz,Germany
FAHU,AHU,"HMS Bastard Memorial",Kwazulu-Natal,South Africa
FQMP,MZB,"Mocimboa Da Praia","Mocimboa Da Praia",Mozambique
KINS,INS,"Indian Springs AF AUX",Indian Springs,USA
UHNN,HNN,Nikolaevsk,"Nikolaevsk Na Amure",Russia
WBKK,BKI,"Kota Kinabalu International",Kota Kinabalu,Malaysia
ZSJD,JDZ,"Jingdezhen Airport",Jingdezhen,China

The following is the StrTk code example using token_grid to parse the above CSV data set:

int main()
{
   std::string airport_data_file_name = "airport_data.csv";

   strtk::token_grid::options options;
   options.column_delimiters = "| ,";
   options.support_dquotes = true;

   strtk::token_grid airport_grid(airport_data_file_name,options);

   // for each row r, for each column c, print cell[r,c]
   for (std::size_t r = 0; r < airport_grid.row_count(); ++r)
   {
      strtk::token_grid::row_type row = airport_grid.row(r);
      for (std::size_t c = 0; c < row.size(); ++c)
      {
         std::cout << "[" << row.get<std::string>(c) << "] ";
      }
      std::cout << std::endl;
   }
   return 0;
}


Extending Delimiter Predicates

As previously mentioned, the concept of a delimiter-based predicate can lead to some very interesting solutions. A predicate as defined so far, with the exception of the offset predicate, has been a stateless entity. Adding the ability to maintain state based on what the predicate has encountered so far allows it to behave differently from the simple single and multiple delimiter predicates.

For this example, let's assume a typical command line parsing problem, consisting of double quotation mark groupings and escapable special characters that are dual use: either delimiters or data. Example input and output are as follows:

Inputs

Data:       abc;"123, mno xyz",i\,jk
Delimiters: <space> ; , .

Output Tokens

Token 0: abc
Token 1: 123, mno xyz
Token 2: i\,jk

In order to tokenize the above described string, one can create a composite predicate using a multiple char delimiter predicate and some simple state rules. The following is an example of such an extended predicate:

class extended_predicate
{
public:

   extended_predicate(const std::string& delimiters)
   : escape_(false),
     in_bracket_range_(false),
     mdp_(delimiters)
   {}

   inline bool operator()(const unsigned char c) const
   {
      if (escape_)
      {
         escape_ = false;
         return false;
      }
      else if ('\\' == c)
      {
         escape_ = true;
         return false;
      }
      else if ('"' == c)
      {
         in_bracket_range_ = !in_bracket_range_;
         return true;
      }
      else if (in_bracket_range_)
         return false;
      else
         return mdp_(c);
   }

   inline void reset()
   {
      escape_ = false;
      in_bracket_range_ = false;
   }

private:

   mutable bool escape_;
   mutable bool in_bracket_range_;
   mutable strtk::multiple_char_delimiter_predicate mdp_;
};

Usage of the newly defined extended predicate is as follows:

int main()
{
   std::string str = "abc;\"123, mno xyz\",i\\,jk";
   strtk::std_string::token_list_type token_list;
   strtk::split(extended_predicate(".,; "),
                str,
                std::back_inserter(token_list),
                strtk::split_options::compress_delimiters);
   return 0;
}

Performance Comparisons

The following are tables of results generated by running the strtk_tokenizer_cmp test. Currently it covers simple comparisons between Boost String Algorithms, Boost lexical_cast, The Standard Library, Spirit (Karma Qi) and StrTk in the following areas:

  • Tokenization
  • Splitting
  • Integer To String
  • String To Integer
  • String To Double

Scenario 0 - MSVC 2010 (64-bit, O2, Ot, GL and PGO)

Source  | Test              | Size     | Time(sec)   | Rate                   | % from Baseline
Boost   | Tokenizer         | 24000000 | 8.5857sec   | 2795359.4074 tks/sec   | 100.0%, 100.0%
StrTk   | Tokenizer         | 24000000 | 3.5019sec   | 6853393.1186 tks/sec   | 40.7%, 245.1%
Boost   | Split             | 9600000  | 5.5414sec   | 1732414.5137 tks/sec   | 100.0%, 100.0%
StrTk   | Split             | 9600000  | 0.8218sec   | 11681814.9167 tks/sec  | 14.8%, 674.3%
sprintf | Integer To String | 80000000 | 35.8128sec  | 2233840.0564 nums/sec  | 100.0%, 100.0%
Boost   | Integer To String | 80000000 | 19.3994sec  | 4123832.0477 nums/sec  | 54.1%, 184.6%
Karma   | Integer To String | 80000000 | 6.2528sec   | 12794349.6524 nums/sec | 17.4%, 572.7%
StrTk   | Integer To String | 80000000 | 1.5664sec   | 51071439.9822 nums/sec | 4.3%, 2286.2%
atoi    | String To Integer | 88500000 | 5.1802sec   | 17084370.4936 nums/sec | 100.0%, 100.0%
Boost   | String To Integer | 88500000 | 119.6261sec | 739805.3712 nums/sec   | 2309.2%, 4.3%
Qi      | String To Integer | 88500000 | 2.1951sec   | 40317238.6629 nums/sec | 42.3%, 235.9%
StrTk   | String To Integer | 88500000 | 1.8181sec   | 48677773.5466 nums/sec | 35.0%, 284.9%
atof    | String To Double  | 30650000 | 15.2306sec  | 2012396.7122 nums/sec  | 100.0%, 100.0%
Boost   | String To Double  | 30650000 | 52.9244sec  | 579127.8866 nums/sec   | 347.4%, 28.7%
Qi      | String To Double  | 30650000 | 2.8665sec   | 10692313.5853 nums/sec | 18.8%, 531.3%
StrTk   | String To Double  | 30650000 | 1.6069sec   | 19073975.7679 nums/sec | 10.5%, 947.8%

StrTk MSVC10 Result Chart - Copyright Arash Partow


Scenario 1 - MSVC 2010 (O2, Ot, GL and PGO)

Source  | Test              | Size     | Time(sec)   | Rate                   | % from Baseline
Boost   | Tokenizer         | 24000000 | 9.4715sec   | 2533910.4769 tks/sec   | 100.0%, 100.0%
StrTk   | Tokenizer         | 24000000 | 2.8889sec   | 8307786.9292 tks/sec   | 30.5%, 327.8%
Boost   | Split             | 9600000  | 7.2291sec   | 1327965.9706 tks/sec   | 100.0%, 100.0%
StrTk   | Split             | 9600000  | 1.1301sec   | 8494610.9664 tks/sec   | 15.6%, 639.6%
sprintf | Integer To String | 80000000 | 38.2576sec  | 2091088.8038 nums/sec  | 100.0%, 100.0%
Boost   | Integer To String | 80000000 | 28.9931sec  | 2759277.4769 nums/sec  | 75.7%, 131.9%
Karma   | Integer To String | 80000000 | 4.9173sec   | 16269254.0190 nums/sec | 12.8%, 778.0%
StrTk   | Integer To String | 80000000 | 1.8270sec   | 43786838.0279 nums/sec | 4.7%, 2093.9%
atoi    | String To Integer | 88500000 | 6.0076sec   | 14731435.8942 nums/sec | 100.0%, 100.0%
Boost   | String To Integer | 88500000 | 185.4955sec | 477100.6474 nums/sec   | 3087.0%, 3.2%
Qi      | String To Integer | 88500000 | 2.5060sec   | 35314785.8370 nums/sec | 41.7%, 239.7%
StrTk   | String To Integer | 88500000 | 2.2095sec   | 40054213.0736 nums/sec | 36.7%, 271.8%
atof    | String To Double  | 30650000 | 17.6435sec  | 1737179.9302 nums/sec  | 100.0%, 100.0%
Boost   | String To Double  | 30650000 | 78.6528sec  | 389687.3997 nums/sec   | 445.7%, 22.4%
Qi      | String To Double  | 30650000 | 3.8034sec   | 8058494.1994 nums/sec  | 21.5%, 463.8%
StrTk   | String To Double  | 30650000 | 2.0450sec   | 14987780.2310 nums/sec | 11.5%, 862.7%

StrTk MSVC10 Result Chart - Copyright Arash Partow


Scenario 2 - MSVC 2008 SP1 (O2, Ot, GL and PGO)

Source  | Test              | Size     | Time(sec)   | Rate                   | % from Baseline
Boost   | Tokenizer         | 24000000 | 9.6533sec   | 2486184.8282 tks/sec   | 100.0%, 100.0%
StrTk   | Tokenizer         | 24000000 | 3.4748sec   | 6906943.9529 tks/sec   | 35.9%, 277.8%
Boost   | Split             | 9600000  | 10.2600sec  | 935674.7490 tks/sec    | 100.0%, 100.0%
StrTk   | Split             | 9600000  | 1.3793sec   | 6959830.0652 tks/sec   | 13.4%, 743.8%
sprintf | Integer To String | 80000000 | 24.6427sec  | 3246397.8287 nums/sec  | 100.0%, 100.0%
Boost   | Integer To String | 80000000 | 27.5865sec  | 2899968.5753 nums/sec  | 111.9%, 89.3%
Karma   | Integer To String | 80000000 | 5.4864sec   | 14581630.6963 nums/sec | 22.2%, 449.1%
StrTk   | Integer To String | 80000000 | 2.4224sec   | 33025441.1256 nums/sec | 9.8%, 1017.2%
atoi    | String To Integer | 88500000 | 5.9297sec   | 14924814.8683 nums/sec | 100.0%, 100.0%
Boost   | String To Integer | 88500000 | 186.1372sec | 475455.6660 nums/sec   | 3139.0%, 3.1%
Qi      | String To Integer | 88500000 | 2.0874sec   | 42397446.1804 nums/sec | 35.2%, 284.0%
StrTk   | String To Integer | 88500000 | 2.0485sec   | 43202160.1371 nums/sec | 34.5%, 289.4%
atof    | String To Double  | 30650000 | 18.0458sec  | 1698455.0767 nums/sec  | 100.0%, 100.0%
Boost   | String To Double  | 30650000 | 77.4527sec  | 395725.4111 nums/sec   | 429.2%, 23.2%
Qi      | String To Double  | 30650000 | 3.9631sec   | 7733881.1294 nums/sec  | 21.9%, 455.3%
StrTk   | String To Double  | 30650000 | 2.0723sec   | 14790236.0804 nums/sec | 11.4%, 870.8%

StrTk MSVC9 Result Chart - Copyright Arash Partow


Scenario 3 - Intel C++ v11.1.060 IA-32 (O2, Ot, Qipo, QxHost and PGO)

Source  | Test              | Size     | Time (sec) | Rate                  | % from Baseline (time, rate)
Boost   | Tokenizer         | 24000000 |  10.0096 |  2397697.7836 tks/sec  |  100.0%,  100.0%
StrTk   | Tokenizer         | 24000000 |   3.1837 |  7538416.8541 tks/sec  |   31.8%,  314.4%
Boost   | Split             |  9600000 |   9.5450 |  1005760.0310 tks/sec  |  100.0%,  100.0%
StrTk   | Split             |  9600000 |   1.4292 |  6716893.1359 tks/sec  |   14.9%,  667.8%
sprintf | Integer To String | 80000000 |  23.8979 |  3347577.5824 nums/sec |  100.0%,  100.0%
Boost   | Integer To String | 80000000 |  27.5618 |  2902565.2045 nums/sec |  115.3%,   86.7%
Karma   | Integer To String | 80000000 |   4.6600 | 17167208.7654 nums/sec |   19.4%,  512.8%
StrTk   | Integer To String | 80000000 |   2.8450 | 28119857.2736 nums/sec |   11.9%,  840.0%
atoi    | String To Integer | 88500000 |   5.9386 | 14902610.8922 nums/sec |  100.0%,  100.0%
Boost   | String To Integer | 88500000 | 180.5856 |   490072.4001 nums/sec | 3040.8%,    3.2%
Qi      | String To Integer | 88500000 |   2.5273 | 35017073.8639 nums/sec |   42.5%,  234.9%
StrTk   | String To Integer | 88500000 |   1.8718 | 47281492.1287 nums/sec |   31.5%,  317.2%
atof    | String To Double  | 30650000 |  18.4357 |  1662538.0810 nums/sec |  100.0%,  100.0%
Boost   | String To Double  | 30650000 |  78.1543 |   392172.9598 nums/sec |  423.9%,   23.5%
Qi      | String To Double  | 30650000 |   2.8321 | 10822353.0510 nums/sec |   15.3%,  650.9%
StrTk   | String To Double  | 30650000 |   2.2930 | 13366541.5515 nums/sec |   12.4%,  803.9%

StrTk Intel C++ v11 Result Chart - Copyright Arash Partow


Scenario 4 - GCC 4.5 (O3, PGO)

Source  | Test              | Size     | Time (sec) | Rate                  | % from Baseline (time, rate)
Boost   | Tokenizer         | 24000000 |   9.2510 |  2594305.4347 tks/sec  | 100.0%, 100.0%
StrTk   | Tokenizer         | 24000000 |   3.9717 |  6042688.5734 tks/sec  |  42.9%, 232.9%
Boost   | Split             |  9600000 |   5.0640 |  1895728.2331 tks/sec  | 100.0%, 100.0%
StrTk   | Split             |  9600000 |   1.5411 |  6229231.8384 tks/sec  |  30.4%, 328.5%
sprintf | Integer To String | 80000000 |  14.7807 |  5412477.0993 nums/sec | 100.0%, 100.0%
Boost   | Integer To String | 80000000 |  19.1131 |  4185620.7707 nums/sec | 129.3%,  77.3%
Karma   | Integer To String | 80000000 |   6.4455 | 12411808.2841 nums/sec |  43.6%, 229.3%
StrTk   | Integer To String | 80000000 |   4.5174 | 17709364.5349 nums/sec |  30.5%, 327.1%
atoi    | String To Integer | 88500000 |   5.2139 | 16973721.6103 nums/sec | 100.0%, 100.0%
Boost   | String To Integer | 88500000 |  50.5326 |  1751344.8498 nums/sec | 969.1%,  10.3%
Qi      | String To Integer | 88500000 |   1.9694 | 44937612.8835 nums/sec |  37.7%, 264.7%
StrTk   | String To Integer | 88500000 |   1.9008 | 46558706.5833 nums/sec |  36.4%, 274.2%
atof    | String To Double  | 30650000 |   6.6975 |  4576328.3036 nums/sec | 100.0%, 100.0%
Boost   | String To Double  | 30650000 |  29.6375 |  1034162.2422 nums/sec | 442.5%,  22.5%
Qi      | String To Double  | 30650000 |   2.9852 | 10267435.7138 nums/sec |  44.5%, 224.3%
StrTk   | String To Double  | 30650000 |   1.5961 | 19202937.1409 nums/sec |  23.8%, 419.6%

StrTk GCC v4.5 Result Chart - Copyright Arash Partow


Scenario 5 - GCC 4.5 (O3, PGO) Intel Atom N450

Source  | Test              | Size     | Time (sec) | Rate                  | % from Baseline (time, rate)
Boost   | Tokenizer         | 24000000 |  29.1370 |   823695.4389 tks/sec  |  100.0%, 100.0%
StrTk   | Tokenizer         | 24000000 |  12.3607 |  1941644.0499 tks/sec  |   42.4%, 235.7%
Boost   | Split             |  9600000 |  16.5261 |   580899.9726 tks/sec  |  100.0%, 100.0%
StrTk   | Split             |  9600000 |   4.9102 |  1955110.2611 tks/sec  |   29.7%, 336.5%
sprintf | Integer To String | 80000000 |  50.3456 |  1589015.6118 nums/sec |  100.0%, 100.0%
Boost   | Integer To String | 80000000 |  91.1475 |   877698.1401 nums/sec |  181.0%,  55.2%
Karma   | Integer To String | 80000000 |  21.8904 |  3654568.8712 nums/sec |   43.4%, 229.9%
StrTk   | Integer To String | 80000000 |  12.1877 |  6564009.9274 nums/sec |   24.2%, 413.0%
atoi    | String To Integer | 88500000 |  17.6615 |  5010896.5768 nums/sec |  100.0%, 100.0%
Boost   | String To Integer | 88500000 | 191.9446 |   461070.5357 nums/sec | 1086.7%,   9.2%
Qi      | String To Integer | 88500000 |   6.2808 | 14090561.7119 nums/sec |   35.5%, 281.1%
StrTk   | String To Integer | 88500000 |   6.1552 | 14378086.8208 nums/sec |   34.8%, 286.9%
atof    | String To Double  | 30650000 |  21.4865 |  1426474.1027 nums/sec |  100.0%, 100.0%
Boost   | String To Double  | 30650000 | 139.8166 |   219215.7409 nums/sec |  650.7%,  15.3%
Qi      | String To Double  | 30650000 |  11.3916 |  2690567.9223 nums/sec |   53.0%, 188.6%
StrTk   | String To Double  | 30650000 |   6.4396 |  4759608.7027 nums/sec |   29.9%, 333.6%

StrTk GCC v4.5 Atom N450 Result Chart - Copyright Arash Partow

Scenario 6 - GCC 4.5 (O3, PGO) Intel Xeon

Source  | Test              | Size     | Time (sec) | Rate                  | % from Baseline (time, rate)
Boost   | Tokenizer         | 24000000 |   7.5657 |  3172216.8787 tks/sec  | 100.0%, 100.0%
StrTk   | Tokenizer         | 24000000 |   2.7379 |  8765832.8290 tks/sec  |  36.1%, 276.3%
Boost   | Split             |  9600000 |   3.0706 |  3126386.1126 tks/sec  | 100.0%, 100.0%
StrTk   | Split             |  9600000 |   1.1279 |  8511136.2899 tks/sec  |  36.7%, 272.2%
sprintf | Integer To String | 80000000 |  10.9012 |  7338642.9638 nums/sec | 100.0%, 100.0%
Boost   | Integer To String | 80000000 |  12.3317 |  6487328.7872 nums/sec | 113.1%,  88.3%
Karma   | Integer To String | 80000000 |   3.7202 | 21504260.6660 nums/sec |  34.1%, 293.0%
StrTk   | Integer To String | 80000000 |   2.5183 | 31768042.4612 nums/sec |  23.1%, 432.8%
atoi    | String To Integer | 88500000 |   4.0087 | 22077037.6357 nums/sec | 100.0%, 100.0%
Boost   | String To Integer | 88500000 |  30.3659 |  2914454.4393 nums/sec | 757.4%,  13.2%
Qi      | String To Integer | 88500000 |   1.7976 | 49231871.5454 nums/sec |  43.3%, 223.0%
StrTk   | String To Integer | 88500000 |   1.7384 | 50908881.7303 nums/sec |  43.3%, 230.5%
atof    | String To Double  | 30650000 |   5.2118 |  5880843.9328 nums/sec | 100.0%, 100.0%
Boost   | String To Double  | 30650000 |  21.5546 |  1421966.9538 nums/sec | 413.5%,  24.1%
Qi      | String To Double  | 30650000 |   3.2149 |  9533840.3118 nums/sec |  61.6%, 162.1%
StrTk   | String To Double  | 30650000 |   1.3929 | 22003661.2944 nums/sec |  26.7%, 374.1%

StrTk GCC v4.5 Intel Xeon E5540 Result Chart - Copyright Arash Partow

Note 1: The tests were compiled with optimisation flags chosen to produce the best possible results for the respective compilers and architectures. The tests were run natively (no virtualization was used) on an almost completely idle machine so as to reduce interference from background processes. The Boost version used was 1.45. For the Linux-based tests the standard libraries, including libc, were rebuilt using architecture-specific flags and optimizations. The following table maps the scenarios to their respective architectures:

Scenario | Architecture
1        | ThinkPad W510 (64-Bit Intel Quad Core Extreme i7-920XM 2.0GHz, 16GB RAM, Windows 7)
2-3      | ThinkPad x61 (32-Bit Intel Core 2 Duo 2.4GHz, 2GB RAM, Windows 7)
4        | ThinkPad x61 (32-Bit Intel Core 2 Duo 2.4GHz, 2GB RAM, Ubuntu 10.10)
5        | Acer Aspire One (32-Bit Intel Atom N450 1.6GHz, 1GB RAM, Ubuntu 10.10)
6        | HP ProLiant DL380 G6 (64-Bit Intel Xeon E5540 2.5GHz, 8GB RAM, Ubuntu 10.10)

Note 2: The percentages in the final column give the current row relative to the baseline row in total running time and in rate, respectively. For the first percentage lower is better; for the second, higher is better. The baseline used for a specific combination of tests is defined in the following table:

Test Combination                 | Baseline
Boost, StrTk                     | Boost
Boost, StdLib/STL, Spirit, StrTk | StdLib/STL
StdLib/STL, Spirit, StrTk        | StdLib/STL

Note 3: The test sizes are set such that no single run completes in under one second. This ensures that runs-per-second figures are measured rather than extrapolated. These sizes may need to be revisited once 3.5+GHz CPU speeds become more commonplace. Furthermore the charts represent the rate of operation over a one second interval - in short, the larger the rate the better.

Note 4: The binaries used for the above performance tests can be downloaded from here.

Note 5: It would be great to have comparisons for other architectures. If you can provide access to shell accounts with GCC 4.5+ or Clang/LLVM 2.0+ for any of the following architectures: UltraSPARC T2 Plus, SPARC64 VII or POWER6/7, please feel free to get in contact.

StrTk Library Dependency

Originally StrTk was compatible with C++ compilers ranging from GCC v2.95 to Visual Studio v6.0; however, due to upgrades it now requires at least GCC v4.4, Intel C++ Compiler v10 or Visual Studio v9.0 to compile easily. StrTk also makes use of the Boost library for its boost::lexical_cast routine (for types other than PODs) and for its TR1-compliant Random and Regex libraries. These dependencies are not crucial and can be removed simply by defining the preprocessor symbol strtk_no_tr1_or_boost. That said, Boost is an integral part of modern C++ programming, and having it around is as beneficial as having access to the STL, hence it is recommended that it be installed. For Visual Studio users, BoostPro provides a free and easy-to-use installer for the latest Boost libraries, which can be obtained from here. For Linux users, mainstream distributions such as Ubuntu and Red Hat (Fedora) provide easy installation of the Boost libraries via their respective package management systems. For more information please consult the readme.txt found in the StrTk distribution.
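For instance, the dependency-free configuration is selected by defining the symbol before the header is included. The sketch below assumes strtk.hpp, the single header shipped in the StrTk distribution:

```cpp
// Sketch: building StrTk without its Boost/TR1 dependencies.
// The symbol must be seen before the header is processed, either here
// or on the compiler command line (e.g. -Dstrtk_no_tr1_or_boost).
#define strtk_no_tr1_or_boost
#include "strtk.hpp"
```

With the symbol defined, the boost::lexical_cast fallback and the TR1 Random/Regex features are compiled out, at the cost of the functionality they provide.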

Compiler Support

The following is a listing of the various compilers with which StrTk can be built error- and warning-free.

  • GCC - versions 4.4+
  • Intel C++ Compiler - versions 10+
  • Clang/LLVM - versions 1.0+
  • MSVC - versions 9.0+
  • Comeau C/C++ - versions 4.3+

Note: Versions of the compilers prior to those denoted above "should" compile, however they may require a very lenient warning/error level to be set during compilation.

Conclusion

StrTk was designed with performance and efficiency as its primary principles, and as such some of the available interfaces may not be as user-friendly as they could be; hopefully the gains made in other areas will compensate for any perceived difficulties. Like most things, there is a trade-off between performance and usability among the tokenizers and parsing methods discussed above. The original aim was to provide an interface similar to that of the Boost Tokenizer and Split routines, while also offering developers simplifications that provide greater flexibility and efficiency in the long run. Tokenizing a string isn't the most fascinating problem one could tackle, but it does have its interesting points when there are a few TB of data to process; doing it properly could mean the difference between finishing a simple data processing job today or next month.

