Topic: Regex-class that works with STL


Author: bjorn@algonet.se (Bjorn Fahller)
Date: 1998/01/18
On Thu, 15 Jan 1998 18:48:22, mottl@miss.wu-wien.ac.at (Markus Mottl)
wrote:

Here's an implementation (I'm not saying it's perfect by any means,
but it works nicely in the STL vein.) It allows constructs like:


  RegExp re("^str[0-9]*\\..$", RegExp::extended);
  if (re.match("str1")) cout << "str1" << endl;

  RegExp re2("str\\[([0-9]+)\\][^\"]+\"([^\"]+)\".*", RegExp::extended);
  RegMatch m = re2.match("str[23534]43ksdjn5\"kalaskula\"sd");
  if (m) cout << "m[0]=" << m[0] << endl;
  cout << "-- iterator --" << endl;
  for (RegMatch::const_iterator i=m.begin(); i!= m.end();++i)
    cout << *i << endl;


Note that in the sources below namespace std is not used at all (my
compiler doesn't support namespaces.) It's also perfectly possible
that I'm using some non-standard behaviour of the string class, since
I'm using my own C++ exercise implementation.
   _
/Bjorn.


// regexp.hpp
#ifndef REGEXP_HPP
#define REGEXP_HPP

#include <regex.h>
#include <string>

class RegError
{
public:
  RegError(int c) : code(c) {};
  operator int(void) const { return code; };

protected:
private:
  int code;
};

class RegMatch;

class RegMatchIterator
{
public:
  typedef string const_reference_type;
  const_reference_type operator*() const;
  RegMatchIterator& operator++() { if (pos < size) ++pos; return *this; }
  RegMatchIterator operator++(int) {
    RegMatchIterator i(*this); if (pos < size) ++pos;  return i;
  }
  RegMatchIterator& operator--() { if (pos > 0) --pos; return *this;}
  RegMatchIterator operator--(int) {
    RegMatchIterator i(*this); if (pos > 0) --pos;  return i;
  }
  RegMatchIterator(const RegMatchIterator& m)
    : str(m.str),size(m.size),pmatch(m.pmatch),pos(m.pos) {};
  RegMatchIterator& operator=(const RegMatchIterator& m) {
    str = m.str; size=m.size; pmatch=m.pmatch; pos=m.pos;
    return *this;
  }
  int operator==(const RegMatchIterator& m) const {
    return m.str==str && m.size==size && m.pmatch==pmatch && m.pos==pos;
  }

protected:
private:
  RegMatchIterator(string s, size_t n, regmatch_t* p, size_t current)
    : str(s), size(n), pmatch(p), pos(current) {};
  string str;
  size_t size;
  regmatch_t* pmatch;
  size_t pos;
  friend class RegMatch;
};

class RegExp;
class RegMatch
{
public:
  typedef size_t size_type;
  typedef RegMatchIterator const_iterator;
  //  typedef reverse_iterator<const_iterator> const_reverse_iterator;
  ~RegMatch() { delete[] pmatch;}
  size_t size() const { return items;}
  operator void*() const { return items ? (void*)1 : 0; }
  string operator[](size_t pos) const;
  const_iterator begin() const;
  const_iterator end() const;
  //  const_reverse_iterator rbegin() const;
  //  const_reverse_iterator rend() const;
protected:
private:
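  // Deep copy: the destructor delete[]s pmatch, so every RegMatch must own
  // its own array (and its own copy of the matched string).
  RegMatch(const RegMatch& m)
    : str(m.str), items(m.items), pmatch(new regmatch_t[m.items]) {
    for (size_t i = 0; i < m.items; ++i) pmatch[i] = m.pmatch[i];
  }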
  RegMatch(const string& s, size_t n, regmatch_t* p)
    : str(s), items(n), pmatch(p) {};

  string str;
  size_t items;
  regmatch_t* pmatch;

  friend class RegExp;
};

class RegExp
{
public:
  typedef enum { notbol = REG_NOTBOL, noteol = REG_NOTEOL } EolType;
  typedef enum {
    extended = REG_EXTENDED,
    newline = REG_NEWLINE,
    ignore_case = REG_ICASE,
    nosub = REG_NOSUB
  } Category;
  RegExp(const char* str, Category c = Category(0)) throw (RegError);
  RegExp(const string& s, Category c = Category(0)) throw (RegError);

  virtual ~RegExp();

  RegMatch match(const char* str, EolType et = EolType(0)) const
    throw (RegError) {
      return match(string(str), et);
  }
  RegMatch match(const string& s, EolType et = EolType(0)) const
    throw (RegError);
private:
  regex_t reg;
  size_t subs;
};

#endif // REGEXP_HPP



// regexp.cpp
#include "regexp.hpp"

Regexp::Base::Base(Regexp::Base::Category c, const char* str)
  throw (Regexp::error)
{
  int i = regcomp(&reg,str,c);
  if (i) throw error(i);
}

Regexp::Base::~Base()
{
  regfree(&reg);
}

Regexp::Match Regexp::Base::match(const char* str, Regexp::EolType et)
const throw (error)
{
  size_t subs = 0;
  for (const char* p=str;*p;++p)
  {
    if (*p == '(') ++subs;
  }
  regmatch_t* pmatch = new regmatch_t[subs];
  int err = regexec(&reg,str,subs,pmatch,et);
  if (err) throw error(err);
  return Match(subs,pmatch);
}

Regexp::Match::~Match()
{
  delete[] pmatch;
}

Regexp::MatchIterator::operator*() const
{
}
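
A side note on sizing the pmatch array: the match function above counts
'(' characters, but POSIX regcomp() also records the number of
parenthesised subexpressions of the compiled pattern in the re_nsub
member of regex_t, and regexec() uses slot 0 for the whole match, so
re_nsub + 1 slots cover everything. The following is only a minimal
sketch under that assumption; the helper name "submatches" is
hypothetical and not part of the classes above.

```cpp
#include <cstddef>
#include <regex.h>     // POSIX regcomp / regexec / regfree
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from the post): returns the whole match followed
// by each parenthesised submatch, or an empty vector when nothing matches.
std::vector<std::string> submatches(const std::string& pattern,
                                    const std::string& text)
{
    regex_t reg;
    if (regcomp(&reg, pattern.c_str(), REG_EXTENDED) != 0)
        throw std::invalid_argument("bad regular expression");

    // Slot 0 holds the whole match, slots 1..re_nsub the subexpressions.
    std::vector<regmatch_t> pm(reg.re_nsub + 1);
    std::vector<std::string> out;
    if (regexec(&reg, text.c_str(), pm.size(), &pm[0], 0) == 0) {
        for (std::size_t i = 0; i < pm.size(); ++i) {
            if (pm[i].rm_so == -1) {
                out.push_back(std::string());   // subexpression did not take part
            } else {
                out.push_back(text.substr(pm[i].rm_so,
                                          pm[i].rm_eo - pm[i].rm_so));
            }
        }
    }
    regfree(&reg);
    return out;
}
```

Called as submatches("([a-z]+)=([0-9]+)", "key=42"), this yields three
strings: the whole match and the two parenthesised parts.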
---
[ comp.std.c++ is moderated.  To submit articles: try just posting with      ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu         ]
[ FAQ:      http://reality.sgi.com/employees/austern_mti/std-c++/faq.html    ]
[ Policy:   http://reality.sgi.com/employees/austern_mti/std-c++/policy.html ]
[ Comments? mailto:std-c++-request@ncar.ucar.edu                             ]





Author: mottl@miss.wu-wien.ac.at (Markus Mottl)
Date: 1998/01/13
I would like to know whether there is a free implementation of a
Regex-class for matching regular expressions in STL-objects such as
"string", "istream",...

Why is such an important class for matching input or other data missing
from the STL?  Practically every program has to read input and make
sure that it matches some sort of pattern. Adding a Regex-class
could greatly reduce line counts as well as errors.


An interesting capability would also be to match "regular" sequences
of objects.

Example: You have the following vector of integers: 31 71 71 11 15 15 15 1
Now you want to take out every sequence of equal numbers (more than one)
(in this case 71 71 and 15 15 15 - leaving 31 11 1 in the container)

Maybe the solution could work like this:
(better strategies very welcome ;-) )

---
// vector<int> my_vector already contains the numbers above...
RegexContainer<int> my_regex_container;

my_regex_container.push_back(".{2,}"); // . -> always match
                                       // {2,} -> match at least two times

// This takes all elements described by the regular expression in
// "my_regex_container" out of "my_vector" (starting-point is the first
// element)
my_regex_container.take_out_from(&my_vector[0]);
---

The member-function "push_back" of "RegexContainer" might also take
as argument a value of type "int" to match (in the example), or even a
function-object which operates on an element and returns a bool telling
whether the current element fulfils a certain condition (matching
attributes of an object rather than the object itself).

I know that not everything can be placed into the STL. Still, I miss
general concepts like regular expressions. (You, too???)
This method of pattern specification is valued in many areas, ranging
from theoretical computer science to operating systems (you can find it
nearly everywhere in UNIX), and it is even an integral part of a whole
programming language (PERL).

Regards,
Markus Mottl

--
*  Markus Mottl              |  University of Economics and       *
*  Department of Applied     |  Business Administration           *
*  Computer Science          |  Vienna, Austria                   *
*  mottl@miss.wu-wien.ac.at  |  http://miss.wu-wien.ac.at/~mottl  *





Author: mottl@miss.wu-wien.ac.at (Markus Mottl)
Date: 1998/01/15
In answer to Peter's mail (peter.milliken@gecms.com.au):

> Perhaps I am missing the point, but at least with regular expression
> capability, have you looked at Flex? It generates a C++ class and
> contains the support for easy definition of regular expressions etc.

You are right, Flex is a wonderful tool for specifying regular expressions
and I like using it for implementing large scanners.

Still, there are several problems with it:

- It cannot use regular expressions which are generated at runtime.
- It cannot read from (STL) strings (you'd have to use strstreams or
  the like as intermediate objects :-( ).
- You have to construct a whole scanner class for every different source
  of input, which adds quite some complexity to your code and is probably
  unnecessary if you just want to match a single pattern.
- Maybe you just want to know whether some input matches a pattern
  without having to manipulate the input (extract the matched string).
  e.g.:

    Regex my_regex = "(blabla)+";
    string my_string;
    some_stream >> my_string;
    if (my_string == my_regex) ... // operator== tries to match pattern...

It is not just a matter of matching input: you might also want to
substitute regular expressions in strings, for example. A Regex-class
in the STL would really come in handy...

         Regards,
            Markus

--
*  Markus Mottl              |  University of Economics and       *
*  Department of Applied     |  Business Administration           *
*  Computer Science          |  Vienna, Austria                   *
*  mottl@miss.wu-wien.ac.at  |  http://miss.wu-wien.ac.at/~mottl  *