Topic: Add natural_compare for user friendly string ordering


Author: jpakkane@gmail.com
Date: Fri, 22 Jul 2016 16:35:51 -0700 (PDT)
Raw View
------=_Part_1213_368682288.1469230551727
Content-Type: multipart/alternative;
 boundary="----=_Part_1214_1588490975.1469230551727"

------=_Part_1214_1588490975.1469230551727
Content-Type: text/plain; charset=UTF-8

Hello

Currently string ordering in C++ is done with lexicographic_compare. This
is also known as asciibetical sorting and causes strings to appear in an
order that seems unnatural to people. As an example sorting strings "2
two", "1 one", "10 ten" gives the following order:

1 one
10 ten
2 two

However most people would prefer the following order:

1 one
2 two
10 ten

This deficiency is often worked around by adding leading zeroes to strings
when they are generated. This is a lot of busywork and also fragile. Almost
without fail all "user facing" string sorting should be done with natural
order rather than asciibetical.

A detailed article of this issue can be found
here: https://blog.codinghorror.com/sorting-for-humans-natural-sort-order/

Adding a natural_order to the standard library in addition to
lexicographic_order would make it easy to sort strings naturally. It is
straightforward to implement. The input sequence can be split into two
parts: numbers and non-numbers. The algorithm is just like
lexicographic_compare except it needs to detect number sequences, convert
them to integers and do the comparison on this integer value.

An implementation of this has been placed
here: https://github.com/jpakkane/naturalsort

It has the following features:

- implemented using only input iterators
- includes an overload for objects that have .begin() and .end()
- linear time complexity
- constant space complexity
- assumes char but could be extended to any character type

The project also contains a bunch of tests as well as a simple demo app to
show how to do sorting with natural_order.

--
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/dcd97b5f-2b01-4b14-9b78-dbcb92d998a3%40isocpp.org.

------=_Part_1214_1588490975.1469230551727
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">Hello<div><br></div><div>Currently string ordering in C++ =
is done with lexicographic_compare. This is also known as asciibetical sort=
ing and causes strings to appear in an order that seems unnatural to people=
.. As an example sorting strings &quot;2 two&quot;, &quot;1 one&quot;, &quot=
;10 ten&quot; gives the following order:</div><div><br></div><div>1 one</di=
v><div>10 ten</div><div>2 two</div><div><br></div><div>However most people =
would prefer the following order:</div><div><br></div><div>1 one</div><div>=
2 two</div><div>10 ten</div><div><br></div><div>This deficiency is often wo=
rked around by adding leading zeroes to strings when they are generated. Th=
is is a lot of busywork and also fragile. Almost without fail all &quot;use=
r facing&quot; string sorting should be done with natural order rather than=
 asciibetical.</div><div><br></div><div>A detailed article of this issue ca=
n be found here:=C2=A0https://blog.codinghorror.com/sorting-for-humans-natu=
ral-sort-order/</div><div><br></div><div>Adding a natural_order to the stan=
dard library in addition to lexicographic_order would make it easy to sort =
strings naturally. It is straightforward to implement. The input sequence c=
an be split into two parts: numbers and non-numbers. The algorithm is just =
like lexicographic_compare except it needs to detect number sequences, conv=
ert them to integers and do the comparison on this integer value.=C2=A0</di=
v><div><br></div><div>An implementation of this has been placed here:=C2=A0=
https://github.com/jpakkane/naturalsort</div><div><br></div><div>It has the=
 following features:</div><div><br></div><div>- implemented using only inpu=
t iterators</div><div>- includes an overload for objects that have .begin()=
 and .end()</div><div>- linear time complexity</div><div>- constant space c=
omplexity</div><div>- assumes char but could be extended to any character t=
ype</div><div><br></div><div>The project also contains a bunch of tests as =
well as a simple demo app to show how to do sorting with natural_order.</di=
v><div><br></div></div>

<p></p>

-- <br />
You received this message because you are subscribed to the Google Groups &=
quot;ISO C++ Standard - Future Proposals&quot; group.<br />
To unsubscribe from this group and stop receiving emails from it, send an e=
mail to <a href=3D"mailto:std-proposals+unsubscribe@isocpp.org">std-proposa=
ls+unsubscribe@isocpp.org</a>.<br />
To post to this group, send email to <a href=3D"mailto:std-proposals@isocpp=
..org">std-proposals@isocpp.org</a>.<br />
To view this discussion on the web visit <a href=3D"https://groups.google.c=
om/a/isocpp.org/d/msgid/std-proposals/dcd97b5f-2b01-4b14-9b78-dbcb92d998a3%=
40isocpp.org?utm_medium=3Demail&utm_source=3Dfooter">https://groups.google.=
com/a/isocpp.org/d/msgid/std-proposals/dcd97b5f-2b01-4b14-9b78-dbcb92d998a3=
%40isocpp.org</a>.<br />

------=_Part_1214_1588490975.1469230551727--

------=_Part_1213_368682288.1469230551727--

.


Author: Nevin Liber <nevin@eviloverlord.com>
Date: Sat, 23 Jul 2016 00:25:24 -0500
Raw View
--001a113e4d6286e20d053846c9a0
Content-Type: text/plain; charset=UTF-8

On 22 July 2016 at 18:35, <jpakkane@gmail.com> wrote:

> 1 one
> 2 two
> 10 ten
>

I know what the total ordering for lexicographical_compare is.  What is the
total ordering for natural sort?  For instance, how should the following be
sorted:

0 0
0 1
1 0
1 1
0 10
10 0
10 10
1 10
10 1

If it is just the first column, I don't think it is all that useful.  And
if it isn't just the first column, I think you are going to have an
extremely hard time defining it.
--
 Nevin ":-)" Liber  <mailto:nevin@eviloverlord.com>  +1-847-691-1404

--
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/CAGg_6%2BOi9CNpjcX7psOtowgO-xyww%3DUpxuzhGec25fsO_o4Fpw%40mail.gmail.com.

--001a113e4d6286e20d053846c9a0
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">On 22 July 2016 at 18:35,  <span dir=3D"ltr">&lt;<a href=
=3D"mailto:jpakkane@gmail.com" target=3D"_blank">jpakkane@gmail.com</a>&gt;=
</span> wrote:<br><div class=3D"gmail_extra"><div class=3D"gmail_quote"><bl=
ockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #=
ccc solid;padding-left:1ex"><div dir=3D"ltr"><div>1 one</div><div>2 two</di=
v><div>10 ten</div></div></blockquote><div><br></div><div>I know what the t=
otal ordering for lexicographical_compare is.=C2=A0 What is the total order=
ing for natural sort?=C2=A0 For instance, how should the following be sorte=
d:</div><div><br></div><div>0 0</div><div>0 1</div><div>1 0</div><div>1 1</=
div><div>0 10</div><div>10 0</div><div>10 10</div><div>1 10</div><div>10 1<=
/div><div><br></div><div>If it is just the first column, I don&#39;t think =
it is all that useful.=C2=A0 And if it isn&#39;t just the first column, I t=
hink you are going to have an extremely hard time defining it.</div></div>-=
- <br><div data-smartmail=3D"gmail_signature"><div dir=3D"ltr"><div><div di=
r=3D"ltr"><div>=C2=A0Nevin &quot;:-)&quot; Liber=C2=A0 &lt;mailto:<a href=
=3D"mailto:nevin@eviloverlord.com" target=3D"_blank">nevin@eviloverlord.com=
</a>&gt; =C2=A0<a href=3D"tel:%2B1-847-691-1404" value=3D"+18476911404" tar=
get=3D"_blank">+1-847-691-1404</a></div></div></div></div></div>
</div></div>

<p></p>

-- <br />
You received this message because you are subscribed to the Google Groups &=
quot;ISO C++ Standard - Future Proposals&quot; group.<br />
To unsubscribe from this group and stop receiving emails from it, send an e=
mail to <a href=3D"mailto:std-proposals+unsubscribe@isocpp.org">std-proposa=
ls+unsubscribe@isocpp.org</a>.<br />
To post to this group, send email to <a href=3D"mailto:std-proposals@isocpp=
..org">std-proposals@isocpp.org</a>.<br />
To view this discussion on the web visit <a href=3D"https://groups.google.c=
om/a/isocpp.org/d/msgid/std-proposals/CAGg_6%2BOi9CNpjcX7psOtowgO-xyww%3DUp=
xuzhGec25fsO_o4Fpw%40mail.gmail.com?utm_medium=3Demail&utm_source=3Dfooter"=
>https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/CAGg_6%2BOi9C=
NpjcX7psOtowgO-xyww%3DUpxuzhGec25fsO_o4Fpw%40mail.gmail.com</a>.<br />

--001a113e4d6286e20d053846c9a0--

.


Author: Thiago Macieira <thiago@macieira.org>
Date: Fri, 22 Jul 2016 23:23:23 -0700
Raw View
On sexta-feira, 22 de julho de 2016 16:35:51 PDT jpakkane@gmail.com wrote:
> It has the following features:
>=20
> - implemented using only input iterators
> - includes an overload for objects that have .begin() and .end()
> - linear time complexity
> - constant space complexity
> - assumes char but could be extended to any character type

Doesn't seem to support other Unicode digits besides those in the US-ASCII=
=20
range. Support for other digits like =D9=A1=D9=A2=D9=A3  is probably necess=
ary.

--=20
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center

--=20
You received this message because you are subscribed to the Google Groups "=
ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an e=
mail to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
To view this discussion on the web visit https://groups.google.com/a/isocpp=
..org/d/msgid/std-proposals/1583416.qDurXR0l6W%40tjmaciei-mobl1.

.


Author: Jussi Pakkanen <jpakkane@gmail.com>
Date: Sat, 23 Jul 2016 12:18:23 +0300
Raw View
On Sat, Jul 23, 2016 at 8:25 AM, Nevin Liber <nevin@eviloverlord.com> wrote:

> I know what the total ordering for lexicographical_compare is.  What is the
> total ordering for natural sort?  For instance, how should the following be
> sorted:
>
> 0 0
> 0 1
> 1 0
> 1 1
> 0 10
> 10 0
> 10 10
> 1 10
> 10 1

In this case asciibetical is the same as lexicographic because '1'
sorts before '10'. But if we replace all instances of '1' with '2',
then the difference is more obvious. I updated the sample sorting demo
to do this and the output is the following:

ASCIIbetical sort:

0 0
0 10
0 2
10 0
10 10
10 2
2 0
2 10
2 2

Natural sort:

0 0
0 2
0 10
2 0
2 2
2 10
10 0
10 2
10 10

> If it is just the first column, I don't think it is all that useful.  And if
> it isn't just the first column, I think you are going to have an extremely
> hard time defining it.

It is not hard, in fact it is quite simple:

Do exactly what lexicographic_compare does but if you find a number
sequence in both strings, convert them into integers and compare those
values as with standard lexicograpic order. If they are equal proceed
with the comparison with the characters following the number
sequences.

--
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/CAAjYPQkkkyK0T8ZcTC-%3DwYHg4G_kxbpEHoxbpW3gY8E78tLqUg%40mail.gmail.com.

.


Author: Arthur O'Dwyer <arthur.j.odwyer@gmail.com>
Date: Sun, 24 Jul 2016 13:13:44 -0700 (PDT)
Raw View
------=_Part_3509_492854981.1469391224799
Content-Type: multipart/alternative;
 boundary="----=_Part_3510_467823970.1469391224799"

------=_Part_3510_467823970.1469391224799
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Friday, July 22, 2016 at 4:35:52 PM UTC-7, jpak...@gmail.com wrote:
>
> Hello
>
> Currently string ordering in C++ is done with lexicographic_compare. This=
=20
> is also known as asciibetical sorting and causes strings to appear in an=
=20
> order that seems unnatural to people. As an example sorting strings "2=20
> two", "1 one", "10 ten" gives the following order:
>
> 1 one
> 10 ten
> 2 two
>
> However most people would prefer the following order:
>
> 1 one
> 2 two
> 10 ten
>
> This deficiency is often worked around by adding leading zeroes to string=
s=20
> when they are generated. This is a lot of busywork and also fragile.
>

Well, that's not true, is it?  Anyone who wants alphanum/"natural" sorting=
=20
certainly knows where to get it. Adding leading zeroes to numerical fields=
=20
is often =E2=80=94 I would even go so far as to say generally =E2=80=94 don=
e for other=20
reasons than internal sorting:
- to get proper column alignment when printing
- to remove ambiguity
- to preserve the internal sorting order even when the output is handled by=
=20
an external program such as "ls"


Almost without fail all "user facing" string sorting should be done with=20
> natural order rather than asciibetical.
> A detailed article of this issue can be found here:=20
> https://blog.codinghorror.com/sorting-for-humans-natural-sort-order/
> An implementation of this has been placed here:=20
> https://github.com/jpakkane/naturalsort=20
> <https://www.google.com/url?q=3Dhttps%3A%2F%2Fgithub.com%2Fjpakkane%2Fnat=
uralsort&sa=3DD&sntz=3D1&usg=3DAFQjCNEuxPY0IMY80W0NTx67IW8FvHznHw>
>

That looks like a mostly correct(*) implementation of the algorithm Jeff=20
Atwood describes in that blog post. However, this is not the right forum=20
for getting feedback on your implementation of the algorithm; I suggest=20
codereview.stackexchange.com.

If you were actually to propose a utility function like this for=20
standardization, you'd have to explain why this relatively trivial handling=
=20
of ASCII digits is going to be useful to working programmers. As I said,=20
most programmers use padding and/or standard (strcmp-style) ASCIIbetical=20
sorting for a variety of reasons: it's standard, it's consistent across the=
=20
universe of computer programs (such as "ls"), it's easy to explain, it's=20
very efficient (in fact it has had a dedicated opcode in Intel processors=
=20
since at least as far back as the 8086), and it is easy to implement=20
without introducing bugs. (It also Does The Right Thing for Unicode=20
codepoints expressed in UTF-8 format and for calendar dates expressed in=20
ISO 8601 format, which are nice properties for an efficient algorithm to=20
have.)

For programmers who *do* care about sorting user data in "dictionary order"=
=20
=E2=80=94 for example, dictionary manufacturers =E2=80=94 you're going to h=
ave to solve=20
their other problems too, or else your sort is useless to them as well.=20
Some of their specific problems include:
- Unicode normalization
- case-insensitive sort (e.g. "10 hot-air balloons" should sort before "10=
=20
Luftballoons"
- locale-specific handling of accents (e.g. "r=C3=A9sum=C3=A9" should sort =
after=20
"resume" but before "resumes" in an English dictionary)
- locale-specific handling of digraphs (e.g. "llano" should sort after=20
"lograr" in a traditional Spanish dictionary)

Many of these problems are handled by the Unicode "collation order"=20
algorithms, which stand a better chance of making it into some future C++=
=20
standard library than ad-hoc ASCII algorithms.

HTH,
=E2=80=93Arthur

(* =E2=80=94 You don't handle overflow:=20
http://melpon.org/wandbox/permlink/2M30QounIh7Lzr5K )

--=20
You received this message because you are subscribed to the Google Groups "=
ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an e=
mail to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
To view this discussion on the web visit https://groups.google.com/a/isocpp=
..org/d/msgid/std-proposals/3ebb484d-f116-4e01-bc18-a8412c1d3926%40isocpp.or=
g.

------=_Part_3510_467823970.1469391224799
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">On Friday, July 22, 2016 at 4:35:52 PM UTC-7, jpak...@gmai=
l.com wrote:<blockquote class=3D"gmail_quote" style=3D"margin: 0;margin-lef=
t: 0.8ex;border-left: 1px #ccc solid;padding-left: 1ex;"><div dir=3D"ltr">H=
ello<div><br></div><div>Currently string ordering in C++ is done with lexic=
ographic_compare. This is also known as asciibetical sorting and causes str=
ings to appear in an order that seems unnatural to people. As an example so=
rting strings &quot;2 two&quot;, &quot;1 one&quot;, &quot;10 ten&quot; give=
s the following order:</div><div><br></div><div>1 one</div><div>10 ten</div=
><div>2 two</div><div><br></div><div>However most people would prefer the f=
ollowing order:</div><div><br></div><div>1 one</div><div>2 two</div><div>10=
 ten</div><div><br></div><div>This deficiency is often worked around by add=
ing leading zeroes to strings when they are generated. This is a lot of bus=
ywork and also fragile.</div></div></blockquote><div><br></div><div>Well, t=
hat&#39;s not true, is it? =C2=A0Anyone who wants alphanum/&quot;natural&qu=
ot; sorting certainly knows where to get it. Adding leading zeroes to numer=
ical fields is often =E2=80=94 I would even go so far as to say generally =
=E2=80=94 done for other reasons than internal sorting:</div><div>- to get =
proper column alignment when printing</div><div>- to remove ambiguity</div>=
<div>- to preserve the internal sorting order even when the output is handl=
ed by an external program such as &quot;ls&quot;</div><div><br></div><div><=
br></div><blockquote class=3D"gmail_quote" style=3D"margin: 0;margin-left: =
0.8ex;border-left: 1px #ccc solid;padding-left: 1ex;"><div dir=3D"ltr"><div=
> Almost without fail all &quot;user facing&quot; string sorting should be =
done with natural order rather than asciibetical.</div><div>A detailed arti=
cle of this issue can be found here:=C2=A0<a href=3D"https://blog.codinghor=
ror.com/sorting-for-humans-natural-sort-order/" target=3D"_blank" rel=3D"no=
follow" onmousedown=3D"this.href=3D&#39;https://www.google.com/url?q\x3dhtt=
ps%3A%2F%2Fblog.codinghorror.com%2Fsorting-for-humans-natural-sort-order%2F=
\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNGkoMUA1ymy4RGqIgETuBkj_U3B9Q&#39;;=
return true;" onclick=3D"this.href=3D&#39;https://www.google.com/url?q\x3dh=
ttps%3A%2F%2Fblog.codinghorror.com%2Fsorting-for-humans-natural-sort-order%=
2F\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNGkoMUA1ymy4RGqIgETuBkj_U3B9Q&#39=
;;return true;">https://blog.<wbr>codinghorror.com/sorting-for-<wbr>humans-=
natural-sort-order/</a><br></div><div>An implementation of this has been pl=
aced here:=C2=A0<a href=3D"https://www.google.com/url?q=3Dhttps%3A%2F%2Fgit=
hub.com%2Fjpakkane%2Fnaturalsort&amp;sa=3DD&amp;sntz=3D1&amp;usg=3DAFQjCNEu=
xPY0IMY80W0NTx67IW8FvHznHw" target=3D"_blank" rel=3D"nofollow" onmousedown=
=3D"this.href=3D&#39;https://www.google.com/url?q\x3dhttps%3A%2F%2Fgithub.c=
om%2Fjpakkane%2Fnaturalsort\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNEuxPY0I=
MY80W0NTx67IW8FvHznHw&#39;;return true;" onclick=3D"this.href=3D&#39;https:=
//www.google.com/url?q\x3dhttps%3A%2F%2Fgithub.com%2Fjpakkane%2Fnaturalsort=
\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNEuxPY0IMY80W0NTx67IW8FvHznHw&#39;;=
return true;">https://github.com/<wbr>jpakkane/naturalsort</a></div></div><=
/blockquote><div><br></div><div>That looks like a mostly correct(*) impleme=
ntation of the algorithm Jeff Atwood describes in that blog post. However, =
this is not the right forum for getting feedback on your implementation of =
the algorithm; I suggest codereview.stackexchange.com.</div><div><br></div>=
<div>If you were actually to propose a utility function like this for stand=
ardization, you&#39;d have to explain why this relatively trivial handling =
of ASCII digits is going to be useful to working programmers. As I said, mo=
st programmers use padding and/or standard (strcmp-style) ASCIIbetical sort=
ing for a variety of reasons: it&#39;s standard, it&#39;s consistent across=
 the universe of computer programs (such as &quot;ls&quot;), it&#39;s easy =
to explain, it&#39;s very efficient (in fact it has had a dedicated opcode =
in Intel processors since at least as far back as the 8086), and it is easy=
 to implement without introducing bugs. (It also Does The Right Thing for U=
nicode codepoints expressed in UTF-8 format and for calendar dates expresse=
d in ISO 8601 format, which are nice properties for an efficient algorithm =
to have.)</div><div><br></div><div>For programmers who <i>do</i> care about=
 sorting user data in &quot;dictionary order&quot; =E2=80=94 for example, d=
ictionary manufacturers =E2=80=94 you&#39;re going to have to solve their o=
ther problems too, or else your sort is useless to them as well. Some of th=
eir specific problems include:</div><div>- Unicode normalization</div><div>=
- case-insensitive sort (e.g. &quot;10 hot-air balloons&quot; should sort b=
efore &quot;10 Luftballoons&quot;</div><div>- locale-specific handling of a=
ccents (e.g. &quot;r=C3=A9sum=C3=A9&quot; should sort after &quot;resume&qu=
ot; but before &quot;resumes&quot; in an English dictionary)<br></div><div>=
- locale-specific handling of digraphs (e.g. &quot;llano&quot; should sort =
after &quot;lograr&quot; in a traditional Spanish dictionary)</div><div><br=
></div><div>Many of these problems are handled by the Unicode &quot;collati=
on order&quot; algorithms, which stand a better chance of making it into so=
me future C++ standard library than ad-hoc ASCII algorithms.</div><div><br>=
</div><div>HTH,</div><div>=E2=80=93Arthur</div><div><br></div><div>(* =E2=
=80=94 You don&#39;t handle overflow:=C2=A0<a href=3D"http://melpon.org/wan=
dbox/permlink/2M30QounIh7Lzr5K">http://melpon.org/wandbox/permlink/2M30Qoun=
Ih7Lzr5K</a>=C2=A0)</div></div>

<p></p>

-- <br />
You received this message because you are subscribed to the Google Groups &=
quot;ISO C++ Standard - Future Proposals&quot; group.<br />
To unsubscribe from this group and stop receiving emails from it, send an e=
mail to <a href=3D"mailto:std-proposals+unsubscribe@isocpp.org">std-proposa=
ls+unsubscribe@isocpp.org</a>.<br />
To post to this group, send email to <a href=3D"mailto:std-proposals@isocpp=
..org">std-proposals@isocpp.org</a>.<br />
To view this discussion on the web visit <a href=3D"https://groups.google.c=
om/a/isocpp.org/d/msgid/std-proposals/3ebb484d-f116-4e01-bc18-a8412c1d3926%=
40isocpp.org?utm_medium=3Demail&utm_source=3Dfooter">https://groups.google.=
com/a/isocpp.org/d/msgid/std-proposals/3ebb484d-f116-4e01-bc18-a8412c1d3926=
%40isocpp.org</a>.<br />

------=_Part_3510_467823970.1469391224799--

------=_Part_3509_492854981.1469391224799--

.


Author: Jussi Pakkanen <jpakkane@gmail.com>
Date: Mon, 25 Jul 2016 12:33:44 +0300
Raw View
On Sun, Jul 24, 2016 at 11:13 PM, Arthur O'Dwyer
<arthur.j.odwyer@gmail.com> wrote:

> That looks like a mostly correct(*) implementation of the algorithm Jeff
> Atwood describes in that blog post. However, this is not the right forum for
> getting feedback on your implementation of the algorithm; I suggest
> codereview.stackexchange.com.

The point of that code was _not_ to provide a fully ready
implementation. It was only meant as a simple example so the
discussion could be based on code rather than abstract algorithm
discussion.

> If you were actually to propose a utility function like this for
> standardization, you'd have to explain why this relatively trivial handling
> of ASCII digits is going to be useful to working programmers. As I said,
> most programmers use padding and/or standard (strcmp-style) ASCIIbetical
> sorting for a variety of reasons:

The reason for this is so simple it can be expressed with one single word:

Users.

Most data in the world is _not_ generated by programmers, but rather
by users. They will _not_ zero pad their file names, strings or what
have you but they _do_ expect things to be sorted "the right way"
which is to say in natural order rather than asciibetical. The world
is also full of data generators that do not do number padding. As an
example exporting any data from Libreoffice (and probably Excel too)
yields non-padded data. Joining two different data sets, one of which
pads to three digits and the other to four digits fails also. Fixing
every data generator in the world to do field padding is not feasible
or even possible.

Natural sorting is used by Windows explorer (AFAIK) and a bunch of
other apps. Apple uses natural sorting consistently throughout their
OS from Finder to iTunes. Their Objective C libraries even come with a
natural sort function so there is precedence in having natural sort in
a standard library. One possible definition of how natural sort should
be implemented would be to just duplicate what it does.

> it's standard, it's consistent across the
> universe of computer programs (such as "ls"), it's easy to explain, it's
> very efficient (in fact it has had a dedicated opcode in Intel processors
> since at least as far back as the 8086), and it is easy to implement without
> introducing bugs. (It also Does The Right Thing for Unicode codepoints
> expressed in UTF-8 format and for calendar dates expressed in ISO 8601
> format, which are nice properties for an efficient algorithm to have.)

All of this is true. There are many cases where asciibetical sorting
is the right thing to do. No-one is claiming that asciibetical order
should be abolished or not be the default. However there are also
cases where natural sorting is the right thing to do. As an example,
when displaying data to end users in a GUI application. It would be
nice if C++ stdlib were to provide both so app developers could choose
which one to use based on their actual use case.

> Many of these problems are handled by the Unicode "collation order"
> algorithms, which stand a better chance of making it into some future C++
> standard library than ad-hoc ASCII algorithms.

Getting natural sort into unicode collation algorithms would be great.
This is one of the reasons the sample code project was so simple. It
was only meant to illustrate a point so it can be thrown away and the
proper implementation can be put elsewhere.

--
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/CAAjYPQm7WDC4xkGgZj0Kw_CimJfnyNQg0phSgvo__r4bcd7a3Q%40mail.gmail.com.

.


Author: Matthew Woehlke <mwoehlke.floss@gmail.com>
Date: Mon, 25 Jul 2016 10:03:50 -0400
Raw View
On 2016-07-23 05:18, Jussi Pakkanen wrote:
> Natural sort:
>
> 0 0
> 0 2
> 0 10
> 2 0
> 2 2
> 2 10
> 10 0
> 10 2
> 10 10
>
> It is not hard, in fact it is quite simple:
>
> Do exactly what lexicographic_compare does but if you find a number
> sequence in both strings, convert them into integers and compare those
> values as with standard lexicograpic order. If they are equal proceed
> with the comparison with the characters following the number
> sequences.

Okay, what's the sort order for:

b0d3a0c
cad88d7
eeaafb1
f07a547
f633612
4ba8b68
9cff628
10c7c1a
69eef33
97fd258
329eb9f
07488f1
8295a7c
87688b6
8445797
8484991

....?

(Hint: the above was sorted with 'sort -n', which I believe matches what
you described above. It's also the *wrong* order for this data set.)


What about:

pi314
pic1
pic1-edit
pic1b
pic2
pic10
pic4fec
pic8672
picf34e
picante

....?

(Hint: the above order is "correct".)

Natural sorting is *hard*, maybe even impossible. The second example in
particular depends on the knowledge of which parts of the string are
"text" and which are "numeric", and what *types* of numbers.

p.s. why are you proposing a sort function and not a comparison function?


--
Matthew

--
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/nn5687%2495t%241%40ger.gmane.org.

.


Author: Jussi Pakkanen <jpakkane@gmail.com>
Date: Mon, 25 Jul 2016 17:35:27 +0300
Raw View
On Mon, Jul 25, 2016 at 5:03 PM, Matthew Woehlke
<mwoehlke.floss@gmail.com> wrote:

> (Hint: the above was sorted with 'sort -n', which I believe matches what
> you described above. It's also the *wrong* order for this data set.)

Natural order is meant for regular human beings. They only deal with
base 10. No hex. No octal. Nothing but plain vanilla digits 0-9 (plus
unicode versions of same for other languages).

But yes, there are many data sets where natural sorting does the wrong
thing and should not be used. But this is a question of mechanism vs
policy. Providing natural sort is a question of mechanism. Choosing
whether to use it or not is policy. The former is the job of a library
and the latter is the job of applications using said library.

> p.s. why are you proposing a sort function and not a comparison function?

A comparison function is exactly what I am proposing. The
implementation is in a function called natural_compare which works
just like lexicographic_compare. The sorting part is only added on to
demonstrate the result of using natural order to sort strings which is
this functionality's main use case. Granted, the git repo name
probably should have been natural_compare.

--
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/CAAjYPQmo-aTRkH9g3TX_RCZcNFZQO-%2B6i8s02Af-BM-uOAiGcg%40mail.gmail.com.

.


Author: Matthew Woehlke <mwoehlke.floss@gmail.com>
Date: Mon, 25 Jul 2016 12:51:42 -0400
Raw View
On 2016-07-25 10:35, Jussi Pakkanen wrote:
> On Mon, Jul 25, 2016 at 5:03 PM, Matthew Woehlke wrote:
>> (Hint: the above was sorted with 'sort -n', which I believe matches what
>> you described above. It's also the *wrong* order for this data set.)
>
> Natural order is meant for regular human beings. They only deal with
> base 10. No hex. No octal. Nothing but plain vanilla digits 0-9 (plus
> unicode versions of same for other languages).

I beg to differ. I collect pictures off the internet. Sometimes, the
file names (which I don't change) use hex. Trying to find a certain
image in a directory when I know the name is *hard* because my file
manager uses broken "natural" sorting.

Don't assume that just because you wouldn't *manually* assign something
like a SHA as a file name means that real users don't *have* files named
like that.

Even in base 10, your algorithm can give a wrong answer. Here's a
(slightly simplified) real life example from someone other than myself
(see https://forum.kde.org/viewtopic.php?f=66&t=95332 for the original):

  foo_20110401.jpg
  foo_201101011253.jpg

The above order is wrong (January comes before April), but is what your
algorithm will produce. (This one has some chance of a sufficiently
advanced algorithm being able to recognize "time stamps" and sort them
appropriately.)

> But yes, there are many data sets where natural sorting does the wrong
> thing and should not be used. But this is a question of mechanism vs
> policy. Providing natural sort is a question of mechanism.

But your proposed mechanism still falls apart on some real-life data
sets. Even if there is a policy as to which order to use, it won't be
right for all data sets (assuming that the user can change policy e.g.
on a per-directory basis, which is unlikely, if they can change it at
all). And it's not *that* far fetched that a user could have a *single*
file list containing a combination of items where your algorithm is
better for some and worse for others.

I guess it's better than nothing, but I'm still waiting for a sorting
function that can at least manage both decimal and hexadecimal numbers
with some degree of intelligence.

Getting the *truly* pathological cases right is probably impossible; the
second example in my previous message intentionally included names where
meta-information is required for correct sorting. (Specifically, is a
string like "pic1b" supposed to decompose as {"pi", 0xc1b}, {"pic",
0x1b} or {"pic", 1, "b"}? I don't think it's possible for the computer
to know the correct answer.)

>> p.s. why are you proposing a sort function and not a comparison function?
>
> A comparison function is exactly what I am proposing.

Okay, maybe I misunderstood that from the original post. (Or I'm reading
too much into the "sample implementation".) Thanks for clarifying.

--
Matthew

--
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/nn5g2u%24hgb%241%40ger.gmane.org.

.