Thread

Topic: Should std::sort be O(n log n) in C++0x?

Author: sjhoweATdialDOTpipexDOTcom@giganews.com ("Stephen Howe")
Date: Thu, 21 Dec 2006 15:45:50 GMT Raw View

>      Well, yes.  A standard sort slower than O(n log n) is embarassing.
> That problem was solved a long time ago.

Well yes. Heap Sort is guaranteed O(n log n) and that is known since 1964.
It is just that, on average Quick Sort is also O(n log n) but is faster
because the constant is smaller.
And I think Sedgewick proved, mathematically, that among the comparison
based sorts, on average, Quick Sort is best.
So it comes down to, "Can we get the average case performance of Quick Sort
and at the same time avoid its worse case performance of O(n * n)?". And
Musser's hybrid Introsort means we can.

And this means in turn that the C++ standard should tighten the complexity
for std::sort() and removed the wording "Approximately" and remove the
non-normative note at the bottom. And also "Approximately" should be removed
from partial_sort() and partial_sort()_copy().

>      Sorts can beat n Log n; see SyncSort.

???

> The theoretical limit is O(n log n) binary comparisons; sorts which look
at key distribution
> can do better. But adaptive bucket algorithms are too complicated
> for the C++ library sort...

Yes, bucket sorts, radix sorts, counting sorts. They can be O(n).
The problem is that they not general-purpose, they only work for particular
data types.
That means you could implement std::sort() for particular types using these
but not use these for the general-case.

Stephen Howe


---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html                      ]

Author: John Nagle <nagle@animats.com>
Date: Thu, 21 Dec 2006 10:58:16 CST Raw View

Daniel Kr   gler wrote:
> John Nagle schrieb:
>
>
>>     Sorts can beat n Log n; see SyncSort.  The theoretical limit is
>>O(n log n) binary comparisons; sorts which look at key distribution
>>can do better. But adaptive bucket algorithms are too complicated
>>for the C++ library sort, and only pay off for very large N.
>
>
> Could you provide a reference for SyncSort? I only find companies,
> products, but no such algorithm.

    Knuth, "Searching and Sorting", vol. 3 says a little about this.

    U.S. Patent #3,380,029.

    The basic idea is simple.  Suppose you were sorting a collection
which consisted of items with an integer key between 0 and N-1,
each value occuring exactly once.  You can then make an O(N) pass over the
data, putting each item in its final position on the first try.

    That concept can be extended to more complex and less well
distributed keys.  A function has to be generated which
takes in a key and produces a bin number into which the
item should go.   That function has to distribute a
roughly equally sized section of the keyspace into each bin,
based on the statistics of the data.

     This is really off topic, because it's not a big win for
in-memory sorts such as the C++ library offers.  It's a huge
win for sorts so big they have to write intermediate disk
files (and in the early days, tapes.)  But that, today, is
an unusual application.

    John Nagle
    Animats

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html                      ]

Author: "Chris Jefferson" <4zumanga@gmail.com>
Date: Tue, 19 Dec 2006 12:03:16 CST Raw View

I believe the C++ standard should change the complexity of std::sort to
O(n log n), like std::stable_sort currently is and I am trying to see
if there is any kind of agreement on this issue. There are 2 major
reasons to do this:

1) Remove the need to advise users to use stable_sort if they want O(n
log n) sorting.
2) All implementations I have access to already satisfy this
requirement.

The main reason there appears to have been flexability in the original
is quicksort. As I'm sure you know, sorting based around quicksort
usually provides the practically fastest sorting. Trivial
implementations of quicksort can provide O(n^2)

Trivial implementations of quicksort (where the pivot is always chosen
to be in one particular place) can produce very bad behaviour in some
fairly obvious cases (already sorted data for example). There is a
reasonably easy fix (choosing from 3 pivots) which makes it much better
behaved in almost all cases, but it still keeps O(n^2) behaviour.

There is a well-known, highly efficent and trivial to implement (in a
C++ implementation) fix for this problem, known as "introsort". The
basic idea of this algorithm is to notice when quicksort is performing
very poorly, usually by keeping count of the number of partition splits
and seeing when it reaches some threshold, and then in this case
switching sorting THE CURRENT PARTITION to another method (usually
make_heap / sort_heap in C++). This doesn't throw away ANY of the work
quicksort has done, but provides an "easy catch" of the poor
performance case.

I have tried removing the code which performs this "catch" in g++ and
been unable to find any performance difference at all except in the
case of very carefully constructed data where the make_heap/sort_heap
triggered, in which case the time decreased.

Short version of argument:

1) The only type of sort which is practical and not O(n log n) is
quicksort.
2) It is trivial and effectively free to add a "O(n^2) catch" to
quicksort to switch to another type of sorting in a degenerate case and
get O(n log n) behaviour. The code for this is already in all C++
implementations (sort_heap).
3) All C++ implementations already do this.
4) People are currently being advised, incorrectly, that if they want
O(n log n) behaviour they should use the slower stable_sort.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html                      ]

Author: sjhoweATdialDOTpipexDOTcom@giganews.com ("Stephen Howe")
Date: Wed, 20 Dec 2006 14:40:14 GMT Raw View

> I believe the C++ standard should change the complexity of std::sort to
> O(n log n)

I second this. Thanks to David Musser's work, there is no reason why the
complexity should have a worse-case bounds of ?(N * N). It should be
tightened up to be a guaranteed ?(N * log N). We now know we can always
write a fast std::sort() with no bad worse case.

> 4) People are currently being advised, incorrectly, that if they want
> O(n log n) behaviour they should use the slower stable_sort.

The advice is poor. In the event of limited memory stable_sort() can have a
average performance considerably worse than O(N * log N). Better to use
partial_sort().

Stephen Howe



---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html                      ]

Author: nagle@animats.com (John Nagle)
Date: Wed, 20 Dec 2006 16:51:56 GMT Raw View

Stephen Howe wrote:
>>I believe the C++ standard should change the complexity of std::sort to
>>O(n log n)

     Well, yes.  A standard sort slower than O(n log n) is embarassing.
That problem was solved a long time ago.

     Sorts can beat n Log n; see SyncSort.  The theoretical limit is
O(n log n) binary comparisons; sorts which look at key distribution
can do better. But adaptive bucket algorithms are too complicated
for the C++ library sort, and only pay off for very large N.

     Insist, though, on O(n log n) for the hard cases, like
a reversed sequential data set.

    John Nagle

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html                      ]

Author: Robert Mabee <rmabee@comcast.net>
Date: Wed, 20 Dec 2006 15:36:27 CST Raw View

John Nagle wrote:
>     Insist, though, on O(n log n) for the hard cases, like
> a reversed sequential data set.

QOI: if quicksort has a hardcoded position for the pivot it ought
to be the middle of the region instead of the first element.  This
has no effect on theoretical performance but already-sorted data
(forward or reversed) will then be best case instead of worst case;
this may be common as one way to insert a few items is to append
them and sort the whole.  Also, it's probably better (for maintenance
cost) to use the sort tool to reverse than to hand-code even a trivial
swap loop.

Downside: you'll need a clever program to generate worst-case data
for testing when repeated sorting is no longer evil.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html                      ]

Author: "Chris Jefferson" <4zumanga@gmail.com>
Date: Wed, 20 Dec 2006 16:09:23 CST Raw View

Robert Mabee wrote:
> John Nagle wrote:
> >     Insist, though, on O(n log n) for the hard cases, like
> > a reversed sequential data set.
>
> QOI: if quicksort has a hardcoded position for the pivot it ought
> to be the middle of the region instead of the first element.  This
> has no effect on theoretical performance but already-sorted data
> (forward or reversed) will then be best case instead of worst case;
> this may be common as one way to insert a few items is to append
> them and sort the whole.  Also, it's probably better (for maintenance
> cost) to use the sort tool to reverse than to hand-code even a trivial
> swap loop.

Every implementation I'm aware of is derived from SGI's original code,
which get the middle value from the first, middle and last element of
the array (you obviously can't take the average, as in general you only
have a comparison operator).

Chris

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html                      ]

Author: "=?iso-8859-1?q?Daniel_Kr=FCgler?=" <daniel.kruegler@googlemail.com>
Date: Wed, 20 Dec 2006 17:41:20 CST Raw View

John Nagle schrieb:

>      Sorts can beat n Log n; see SyncSort.  The theoretical limit is
> O(n log n) binary comparisons; sorts which look at key distribution
> can do better. But adaptive bucket algorithms are too complicated
> for the C++ library sort, and only pay off for very large N.

Could you provide a reference for SyncSort? I only find companies,
products, but no such algorithm.

Greetings from Bremen,

Daniel Kr   gler


---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html                      ]