Topic: Complexity of search and string::find


Author: musiphil@bawi.org (Seungbeom Kim)
Date: Mon, 18 Apr 2005 16:47:11 GMT
Raw View
Scott Meyers wrote:
>   - Am I correct that there are no complexity guarantees for basic_string
>     member functions, or did I overlook something?

I cannot see any, either, but I guess I don't know better than you do. :)

>   - Is there a good reason for Josuttis listing search's complexity as
>     linear?

I didn't see Josuttis about this, but basically O(n*m) is quadratic only
if m is on the order of n. If m remains to be a constant, it's linear with
respect to n, which can be valid depending on the circumstances, and in
fact it can be a more common circumstance. (The 'needle' doesn't grow as
the 'haystack' grows.. ;-))

>   - Is there a reason why search doesn't guarantee the complexity that my
>     reader suggests is possible?

I'm not an expert on string searching algorithms, but as far as I know,
O(n) algorithms require some 'setup' or 'preprocessing' time before the
actual matching occurs. Again, depending on the situation, this may or may
not be desirable. Especially, if both strings are short enough, then a
simple algorithm with O(n*m) is good enough and any sophisticated one that
requires additional preprocessing could be worse. Therefore I don't think
it would be wise to specify O(n) complexity guarantee for the
basic_string::find() function, though maybe O(n*m) would be reasonable.

Hope this helps,

--
Seungbeom Kim

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]





Author: fw@deneb.enyo.de (Florian Weimer)
Date: Tue, 19 Apr 2005 20:10:16 GMT
Raw View
* Pete Becker:

> Scott Meyers wrote:
>>   - Is there a reason why search doesn't guarantee the complexity that my
>>     reader suggests is possible?
>>
>
> Big-oh doesn't say all that needs to be said. Faster techiques (in the
> big-oh sense) typically require more setup, so may well be slower (in
> clock time) for smaller cases.

Fortunately, you can special-case the smaller cases (if you can detect
them efficiently) without affecting asymptotic performance.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]





Author: Usenet@aristeia.com (Scott Meyers)
Date: Fri, 15 Apr 2005 04:44:58 GMT
Raw View
I recently got this from a reader of Effective STL:

  the thing that bugs me most is that some algorithms simply aren't
  implemented in the optimal way.  Take basic_string::find() as an example.
  If the string we're searching in has length n and the substring has m,
  then the complexity of this method is O(n*m), while the optimal algorithm
  has complexity O(n).  If m is close to n, then we are comparing quadratic
  function against linear!  How does this conform to the complexity
  guarantees?

I replied as follows:

  Regarding complexity guarantees, STL algorithms have them.  string member
  functions do not.  As a result, the standard says nothing about the
  complexity of string::find.  However, string::find performs essentially
  the same operation as the STL search algorithm, and there is a complexity
  guarantee for it:

    At most (last1 - first1) * (last2 - first2) applications of the
    corresponding predicate.

  So given sequences of lengths n and m, the complexity guarantee is
  O(n*m), and, yes, if m's length is on the order of the length of n, the
  complexity bound is quadratic.  (Interestingly, Josuttis lists the
  complexity as linear.)

  I'll try to find out why the bound on search is quadratic instead of
  linear.  I know nothing about substring algorithms, but sometimes big-Oh
  guarantees are less important than the constants in front of them for
  typical inputs, and that may play a role here.

Questions:
  - Am I correct that there are no complexity guarantees for basic_string
    member functions, or did I overlook something?
  - Is there a good reason for Josuttis listing search's complexity as
    linear?
  - Is there a reason why search doesn't guarantee the complexity that my
    reader suggests is possible?

Thanks,

Scott

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]





Author: petebecker@acm.org (Pete Becker)
Date: Sat, 16 Apr 2005 02:09:24 GMT
Raw View
Scott Meyers wrote:
>   - Is there a reason why search doesn't guarantee the complexity that my
>     reader suggests is possible?
>

Big-oh doesn't say all that needs to be said. Faster techiques (in the
big-oh sense) typically require more setup, so may well be slower (in
clock time) for smaller cases.

--

Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]





Author: caj@cs.york.ac.uk (Christopher Jefferson)
Date: Sat, 16 Apr 2005 02:09:01 GMT
Raw View
Scott Meyers wrote:
> I recently got this from a reader of Effective STL:
>
>   the thing that bugs me most is that some algorithms simply aren't
>   implemented in the optimal way.  Take basic_string::find() as an example.
>   If the string we're searching in has length n and the substring has m,
>   then the complexity of this method is O(n*m), while the optimal algorithm
>   has complexity O(n).  If m is close to n, then we are comparing quadratic
>   function against linear!  How does this conform to the complexity
>   guarantees?
>
> I replied as follows:
>
>   Regarding complexity guarantees, STL algorithms have them.  string member
>   functions do not.  As a result, the standard says nothing about the
>   complexity of string::find.  However, string::find performs essentially
>   the same operation as the STL search algorithm, and there is a complexity
>   guarantee for it:
>
>     At most (last1 - first1) * (last2 - first2) applications of the
>     corresponding predicate.
>
>   So given sequences of lengths n and m, the complexity guarantee is
>   O(n*m), and, yes, if m's length is on the order of the length of n, the
>   complexity bound is quadratic.  (Interestingly, Josuttis lists the
>   complexity as linear.)
>
>   I'll try to find out why the bound on search is quadratic instead of
>   linear.  I know nothing about substring algorithms, but sometimes big-Oh
>   guarantees are less important than the constants in front of them for
>   typical inputs, and that may play a role here.
>
> Questions:
>   - Am I correct that there are no complexity guarantees for basic_string
>     member functions, or did I overlook something?
>   - Is there a good reason for Josuttis listing search's complexity as
>     linear?
>   - Is there a reason why search doesn't guarantee the complexity that my
>     reader suggests is possible?

I should point out that I'm coming to this as someone whose tried to
implement these functions, but has no great knowledge of why the
decisions were made as they are :)

There are string searching algorithms for searching for a string length
m in another string length m with alphabet size S which only require
O(n+m) time and constant extra space (such as Galil-Seiferas).

It is often faster however to allow O(m) extra memory, so something like
Knuth-Morris-Pratt, or on of the Boyer-Moore like algorithms which are
more memory efficent, like Reverse Factor

Boyer-Moore and it's alternatives are often considered "fastest", as
they take O(n/m) time in the average case, but they require allocating
and filling a buffer of size O(S), which a) takes a comparatively long
time if you are looking for small strings in a small buffer and b) is
clearly infeasable for unicode, and for an arbitary type.

Implementing anything other than stupid brute force for general
std::search is unfortunatly impossible, as the requirements on an
comparison functions are too weak (in particular, it looks to me like it
doesn't have to be an equivalence relation, and you don't have to even
be able to compare elements in either the sequence you are searching or
the sequence being searched for to themselves). I think (but I'm not
positive) that sufficent requirements are place on T for basic_string<T>
that you could use a more efficent algorithms. You definatly could
overload basic_string<char>, basic_string<w_char>, etc.

Most implementations of the C++ std library seem to just call
std::search for string::find, but there is not reason to not implement
std::search more efficently for some types (in particular the built in
simple types, like int, char, w_char, etc).

By conicidence, I have recently been trying to improve std::search (and
thereby string::find)'s performance by implementing one of these
algorithms. The main problem is that without the ability to allocate
O(m) extra memory, while you can get better theoretical complexity its
hard to get better average case performance (although of course it is
tricky to decide what average case you should optimise for).

Chris

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]





Author: wade@stoner.com
Date: Fri, 15 Apr 2005 21:15:07 CST
Raw View
Scott Meyers wrote:
> Questions:
>   - Am I correct that there are no complexity guarantees for
basic_string
>     member functions, or did I overlook something?

21.3/2: "... basic_string [is] a Sequence ..."

So its members need to meet sequence guarantees (however it is common
for implementations to fail this test).  However there is no sequence
guarantee for find(sub-sequence).

>   - Is there a good reason for Josuttis listing search's complexity
as
>     linear?

Josuttis says "linear(at most, numberOfElements*numberOfSearchElements
.)" which seems to be a strange definition of linear.

>   - Is there a reason why search doesn't guarantee the complexity
that my
>     reader suggests is possible?

I'd guess it is because the n*m algorithm is simple, and in practice is
rarely quadratic.

Possibly they did not want to outlaw other algorithms, which are often
much faster (approaching O(n/m)) but sometimes take more than linear
time.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]





Author: jfa1@cec.wustl.edu ("James Aguilar")
Date: Sat, 16 Apr 2005 02:13:38 GMT
Raw View
"Scott Meyers" <Usenet@aristeia.com> wrote in message
news:MPG.1cc796c4222174c69897d1@news.hevanet.com...
> Questions:
>  - Is there a good reason for Josuttis listing search's complexity as
>    linear?

I can address this one because of my biosequencing class this semester.
Yes, O(n) is the longest time needed to do the comparisons, and a fast O(n)
at that, using an optimized Knuth-Morris-Pratt algorithm.  Actually, the
guarantee is not just O(n), but real time -- that is, each character in the
text need only be examined once.  It's pretty amazing, really, and the proof
is not even very complicated.

See http://www-igm.univ-mlv.fr/~lecroq/string/node8.html, or any number of
other sites.

>  - Is there a reason why search doesn't guarantee the complexity that my
>    reader suggests is possible?

The algorithm I'm talking about requires O(m) preprocessing time and O(m)
extra memory, which probably takes longer than the search itself for the
average user searching strings shorted than, say, 100 chars.  Something
tells me that the average case user will not benefit from O(n) string
comparisons, since, in the average case, the search time will likely be
closer to linear anyway.

- James Aguilar


---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]





Author: kuyper@wizard.net
Date: Sat, 16 Apr 2005 12:44:32 CST
Raw View
Scott Meyers wrote:
.
>   Regarding complexity guarantees, STL algorithms have them.  string
member
>   functions do not.  As a result, the standard says nothing about the
>   complexity of string::find.  However, string::find performs
essentially
>   the same operation as the STL search algorithm, and there is a
complexity
>   guarantee for it:       At most (last1 - first1) * (last2 - first2)
applications of the
>     corresponding predicate.
>     So given sequences of lengths n and m, the complexity guarantee
is
>   O(n*m), and, yes, if m's length is on the order of the length of n,
the
>   complexity bound is quadratic.  (Interestingly, Josuttis lists the
>   complexity as linear.)
>
>   I'll try to find out why the bound on search is quadratic instead
of
>   linear.  I know nothing about substring algorithms, but sometimes
big-Oh
>   guarantees are less important than the constants in front of them
for
>   typical inputs, and that may play a role here.
>
> Questions:
>   - Am I correct that there are no complexity guarantees for
basic_string
>     member functions, or did I overlook something?


21.3p2 specifies that std::basic_string<> conforms to the requrirements
of Sequence and Reversible Container. It therefore inherits all of the
complexity requirements of those categories. Thus tables 65, 66, and 67
all apply. It's debateable whether table 68 applies; but if it does,
then any operation listed there that is supported by
std::basic_string<> should have constant time. That would apply to a[],
a.at(n), and push_back(). In addition, std::basic_string<>::swap() has
it's own specific requirement of constant time complexity.

None of those requirements mention find().

>   - Is there a good reason for Josuttis listing search's complexity
as
>     linear?


The key question is, "linear in what quantity, holding which other
quantities constant?". I haven't read Josutti, but I'd guess he's
saying that it's linear in n, holding m constant. I consider that a
reasonable approach, given typical uses where m << n.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]