Topic: operator>> for numbers: stream state after failed read
Author: "James Kanze" <james.kanze@gmail.com>
Date: Wed, 11 Apr 2007 10:12:58 CST
On Apr 10, 10:05 pm, "s...@roguewave.com" <s...@roguewave.com> wrote:
> On Mar 28, 9:02 am, "James Kanze" <james.ka...@gmail.com> wrote:
>
> > On Mar 28, 1:31 am, heples...@gmail.com wrote:
> [...]
> > I think that that's the way it is supposed to work. The
> > algorithm is described in some detail in §22.2.2.1.2, but in
> > general, if I understand it correctly, all characters that could
> > be part of a number are first accumulated. (The current draft
> > seems to have lost an important sentence here; the original
> > standard says that characters are accumulated as long as they
> > are "allowed as the next character of an input field of the
> > conversion specifier", but there's nothing at all in N2134
> > concerning when a character is accumulated.)
> That's correct. The algorithm you refer to isn't perfect (for
> example, given "1.0e-x" on input, Stage 2 says 'x' must be
> accumulated since it's in atoms, even though the grammar
> only allows decimal digits after the sign in the exponent).
> Robust implementations, including Rogue Wave and Apache
> stdcxx avoid extracting invalid characters and end processing
> when they encounter one. The algorithm really needs to be
> rewritten to fix this and some other minor problems such as
> http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-active.html#23
> http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-active.html#427
> or
> http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-active.html#459
> > Note that not all implementations actually behave this way,
> > however. Given "1.0e-x" and reading a double, Sun CC---both
> > with the Rogue Wave and the STLport--- and VC++ fail (in
> > accordance with the standard), with the next character read
> > being x, g++ succeeds (with the next character to be read also
> > x---whatever happened to the e-?).
> I suspect the fact that libstdc++ extracts the 'x' is just a bug
> and if you report it to them they'll quickly fix it.
The next character which is read is 'x'. If I read "1.0e-x"
into a double, the characters "1.0e-" are extracted from the
stream, the target value is set to 1.0, and no error bits are
set. The next character I read will be the 'x'.
I agree that this should be an error, by any interpretation. As
I understand it, the correct behavior would be to extract the
characters as g++ does, and then to set failbit. Alternatively,
of course, one might argue (not from the standard, but from what
one would like) that the correct behavior would be to extract
"1.0", without an error, but leaving 'e' as the next character
to be read. (To implement this, of course, would require that
streambuf support more than one character of look-ahead or push
back. Which in turn would break user-defined streambufs all
over the place.)
> > > A work-around presumably is to read to a string, place it into a
> > > stringstream and then extract from the latter.
> > That's the usual procedure anytime you have to deal with
> > variations in the format. In all but the simplest cases, in
> > fact, I'll use regular expressions to check the format up front;
> > transactional integrity is a lot easier if you don't do any
> > assignments before knowing that everything is correct.
> The numeric parsing code is pretty tricky. I wouldn't go there
> unless you have plenty of time and patience :)
I've got regular expressions which match different types of
legal numbers in my tool kit; I just have to use them. Once I
know that my input is a legal number, I use an istringstream to
actually parse it and do the conversions. I'm not about to
attack converting string to double (or vice versa) unless I
absolutely have to.
> In addition,
> by extracting fields as strings and then parsing them you
> lose the ability to put back what you don't need (such as
> the 'x' in your example).
That's not at all what I meant. I mean that I have, for
example, a line that is supposed to contain a double, followed
by a name. Rather than wonder what the library is going to do
if I give it something like "1.3Ethel", I'll write a regular
expression to match the format of the line, excluding such limit
cases (or if I want to handle them, say by not allowing
exponential representation of floating point, trapping the
subfields in the regular expression). Only once I'm sure that
the format is OK will I pass it off to an istringstream to do
the actual parsing.
This also allows all sorts of additional restrictions which I
can't easily do with an istream.
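
A minimal sketch of the validate-first approach described above, assuming C++11's std::regex (which postdates this thread; the tool kit mentioned above is not shown anywhere in it). The names Record and parse_line and the exact pattern are hypothetical, chosen only to illustrate a line holding a double followed by a name, with exponential notation deliberately disallowed.

```cpp
#include <regex>
#include <sstream>
#include <string>

// Hypothetical record type: a double followed by a name.
struct Record {
    double value;
    std::string name;
};

// Returns true and fills `out` only if the whole line matches the expected
// format. Nothing is assigned to `out` unless every check has succeeded.
bool parse_line(const std::string& line, Record& out)
{
    // Fixed-point number (no exponent allowed), whitespace, then a name.
    static const std::regex format(
        R"(\s*([+-]?[0-9]+(\.[0-9]*)?)\s+([A-Za-z_][A-Za-z0-9_]*)\s*)");

    std::smatch m;
    if (!std::regex_match(line, m, format))
        return false;

    // The format is known to be good; let istringstream do the conversion.
    std::istringstream in(m[1].str());
    double value = 0;
    if (!(in >> value))
        return false;

    out.value = value;
    out.name = m[3].str();
    return true;
}
```

With this pattern "1.3 Ethel" is accepted, while "1.3Ethel" is rejected before any numeric extraction is attempted, so the question of what the library does with the "E" never arises.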
> FWIW, I took the liberty to modify your test case slightly to
> give more insight into what's going on. Given "1.0e-x" on
> input the expected output is 0 (--F), x (\x78).
> #include <cstdio>
> #include <iostream>
>
> int main()
> {
>     double x = 0;
>     std::cin >> x;
>     // Capture the state bits before clearing them so peek() can work.
>     const std::ios::iostate state = std::cin.rdstate ();
>     std::cin.clear ();
>     // Look at, but do not extract, the next character in the stream.
>     const int next = std::cin.peek ();
>     std::printf ("%g (%c%c%c), %c (\\x%02x)\n", x,
>                  state & std::cin.badbit ? 'B' : '-',
>                  state & std::cin.eofbit ? 'E' : '-',
>                  state & std::cin.failbit ? 'F' : '-',
>                  next, next);
> }
And g++ (4.1.0) gives: 1 (---), x (\x78). Which looks like an
error to me.
I'm actually fairly flexible with regards to what should be
required. What I insist on is:
1. it is fully, 100% specified, so that all implementations do
the same thing, and
2. there are no cases where characters are extracted, and then
not used, without an error being recognized.
All in all, what you describe seems the most reasonable.
--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html ]
Author: "sebor@roguewave.com" <sebor@roguewave.com>
Date: Tue, 10 Apr 2007 14:05:59 CST
On Mar 28, 9:02 am, "James Kanze" <james.ka...@gmail.com> wrote:
> On Mar 28, 1:31 am, heples...@gmail.com wrote:
[...]
> I think that that's the way it is supposed to work. The
> algorithm is described in some detail in §22.2.2.1.2, but in
> general, if I understand it correctly, all characters that could
> be part of a number are first accumulated. (The current draft
> seems to have lost an important sentence here; the original
> standard says that characters are accumulated as long as they
> are "allowed as the next character of an input field of the
> conversion specifier", but there's nothing at all in N2134
> concerning when a character is accumulated.)
That's correct. The algorithm you refer to isn't perfect (for
example, given "1.0e-x" on input, Stage 2 says 'x' must be
accumulated since it's in atoms, even though the grammar
only allows decimal digits after the sign in the exponent).
Robust implementations, including Rogue Wave and Apache
stdcxx avoid extracting invalid characters and end processing
when they encounter one. The algorithm really needs to be
rewritten to fix this and some other minor problems such as
http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-active.html#23
http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-active.html#427
or
http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-active.html#459
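
To make the 'x' problem concrete, here is a simplified, non-normative sketch of what Stage 2 does if it is read literally, with no check against the floating-point grammar. The function name accumulate_literally is hypothetical, the atom string is an assumption based on the standard's wording, and thousands separators and locale-specific decimal points are ignored.

```cpp
#include <string>

// Simplified sketch of a literal reading of Stage 2: a character is
// accumulated if it is the decimal point or appears in the atom set.
// ASSUMPTION: the atom set below; grouping and locales are ignored.
std::string accumulate_literally(const std::string& input)
{
    static const std::string atoms = "0123456789abcdefxABCDEFX+-";
    std::string field;
    for (char c : input) {
        if (c != '.' && atoms.find(c) == std::string::npos)
            break;          // first character outside the set ends the field
        field += c;
    }
    return field;
}
```

Under this reading, accumulate_literally("1.0e-x") returns the whole of "1.0e-x", including the 'x' that the grammar for a double can never accept. An implementation that also checks each character against the grammar would stop before the 'x'.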
>
> Note that not all implementations actually behave this way,
> however. Given "1.0e-x" and reading a double, Sun CC---both
> with the Rogue Wave and the STLport--- and VC++ fail (in
> accordance with the standard), with the next character read
> being x, g++ succeeds (with the next character to be read also
> x---whatever happened to the e-?).
I suspect the fact that libstdc++ extracts the 'x' is just a bug
and if you report it to them they'll quickly fix it.
>
> > A work-around presumably is to read to a string, place it into a
> > stringstream and then extract from the latter.
>
> That's the usual procedure anytime you have to deal with
> variations in the format. In all but the simplest cases, in
> fact, I'll use regular expressions to check the format up front;
> transactional integrity is a lot easier if you don't do any
> assignments before knowing that everything is correct.
The numeric parsing code is pretty tricky. I wouldn't go there
unless you have plenty of time and patience :) In addition,
by extracting fields as strings and then parsing them you
lose the ability to put back what you don't need (such as
the 'x' in your example).
FWIW, I took the liberty to modify your test case slightly to
give more insight into what's going on. Given "1.0e-x" on
input the expected output is 0 (--F), x (\x78).
#include <cstdio>
#include <iostream>

int main()
{
    double x = 0;
    std::cin >> x;
    // Capture the state bits before clearing them so peek() can work.
    const std::ios::iostate state = std::cin.rdstate ();
    std::cin.clear ();
    // Look at, but do not extract, the next character in the stream.
    const int next = std::cin.peek ();
    std::printf ("%g (%c%c%c), %c (\\x%02x)\n", x,
                 state & std::cin.badbit ? 'B' : '-',
                 state & std::cin.eofbit ? 'E' : '-',
                 state & std::cin.failbit ? 'F' : '-',
                 next, next);
}
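
For repeatable comparisons across implementations, the same check can be run against an in-memory stream instead of std::cin. The following is only a sketch: the Probe struct and the probe function are hypothetical names, and what it reports for inputs such as "1.0e-x" depends on the library, which is the point of this thread.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper type: the outcome of one attempted extraction.
struct Probe {
    double value;           // what operator>> left in the target
    bool bad, eof, fail;    // state bits right after the read
    int next;               // next character in the stream, or traits eof
};

Probe probe(const std::string& input)
{
    std::istringstream in(input);
    Probe p = {};
    in >> p.value;
    const std::ios::iostate state = in.rdstate();
    p.bad  = (state & std::ios::badbit)  != 0;
    p.eof  = (state & std::ios::eofbit)  != 0;
    p.fail = (state & std::ios::failbit) != 0;
    in.clear();             // so that peek() works even after a failure
    p.next = in.peek();     // look at, but do not extract, the next character
    return p;
}
```

Calling probe("1.0e-x") and inspecting fail and next shows directly whether a given library reports the error and which character it leaves in the stream.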
---
Author: "James Kanze" <james.kanze@gmail.com>
Date: Wed, 28 Mar 2007 09:02:53 CST
On Mar 28, 1:31 am, heples...@gmail.com wrote:
> I posted this a week ago on comp.lang.c++, but did not get a
> response---I hope to be more successful here.
> Summary:
> Does the C++ standard require
> std::basic_istream<>::operator>>(double&) to leave the input stream
> untouched in case of a read failure?
How could it possibly do that? You've got to try to read in
order to have the failure, and once you've gotten the failure,
you've changed the state.
> Details:
> I noticed an unexpected behavior of operator>>() for numbers (double,
> int) when reading from cin. I would like to ask for expert
> clarification on whether I am misunderstanding the rules of the game,
> or whether my library implementation has a bug. I tested this on g++
> 4.1.1 under Linux, g++ 3.4.5 MinGW and cxx 7.1 under Tru64 Unix
> (behavior there is slightly different than described below). I checked
> Josuttis "The C++ Standard Library" and the C++ standard ch 22.2.2 and
> 27.6.1, but haven't been able to get anything useful out of them.
> The problem is as follows: I would like to read sequences like "1 2
> +" by first trying to read into a double, and if that fails try to
> read into a char, see sample program below (the program reads only a
> single number/symbol). This works fine as long as any non-number token
> is not a symbol that could be the first symbol in a number, i.e. the
> plus sign, the minus sign or the decimal point. If the symbol is one
> of those three, the program simply hangs. If I change the locale to,
> e.g., Norwegian, it will hang on the decimal comma instead of the
> decimal point.
> My interpretation is that the operator reads +, -, or ., then tries to
> read the next digit, which it does not find, and then raises the
> failbit and returns WITHOUT putting +, -, or . back into the input
> stream. Should this/must this be so?
I think that that's the way it is supposed to work. The
algorithm is described in some detail in §22.2.2.1.2, but in
general, if I understand it correctly, all characters that could
be part of a number are first accumulated. (The current draft
seems to have lost an important sentence here; the original
standard says that characters are accumulated as long as they
are "allowed as the next character of an input field of the
conversion specifier", but there's nothing at all in N2134
concerning when a character is accumulated.)
Note that not all implementations actually behave this way,
however. Given "1.0e-x" and reading a double, Sun CC---both
with the Rogue Wave and the STLport--- and VC++ fail (in
accordance with the standard), with the next character read
being x, g++ succeeds (with the next character to be read also
x---whatever happened to the e-?).
> A work-around presumably is to read to a string, place it into a
> stringstream and then extract from the latter.
That's the usual procedure anytime you have to deal with
variations in the format. In all but the simplest cases, in
fact, I'll use regular expressions to check the format up front;
transactional integrity is a lot easier if you don't do any
assignments before knowing that everything is correct.
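
A minimal sketch of the read-to-a-string work-around for whitespace-separated input such as "1 2 +". The function names token_to_double and classify are hypothetical. A token counts as a number only if the conversion succeeds and consumes the entire token, so a lone "+", "-" or "." is never half-eaten by a failed numeric read.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Tries to interpret a whole token as a double. Succeeds only if the
// conversion works and uses up every character of the token.
bool token_to_double(const std::string& token, double& out)
{
    std::istringstream in(token);
    double value = 0;
    if (!(in >> value))
        return false;
    if (in.peek() != std::char_traits<char>::eof())
        return false;       // trailing characters: not a plain number
    out = value;            // assign only once everything has checked out
    return true;
}

// Reads whitespace-separated tokens and classifies each one.
void classify(std::istream& input, std::ostream& output)
{
    std::string token;
    while (input >> token) {
        double x = 0;
        if (token_to_double(token, x))
            output << "double: " << x << '\n';
        else
            output << "symbol: " << token << '\n';
    }
}
```

Calling classify(std::cin, std::cout) processes standard input this way. Note that this relies on whitespace between tokens: input like "1+2" arrives as a single token and would be reported as a symbol.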
--
James Kanze (GABI Software) mailto:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
---
Author: heplesser@gmail.com
Date: Tue, 27 Mar 2007 17:31:22 CST
I posted this a week ago on comp.lang.c++, but did not get a
response---I hope to be more successful here.
Summary:
Does the C++ standard require
std::basic_istream<>::operator>>(double&) to leave the input stream
untouched in case of a read failure?
Details:
I noticed an unexpected behavior of operator>>() for numbers (double,
int) when reading from cin. I would like to ask for expert
clarification on whether I am misunderstanding the rules of the game,
or whether my library implementation has a bug. I tested this on g++
4.1.1 under Linux, g++ 3.4.5 MinGW and cxx 7.1 under Tru64 Unix
(behavior there is slightly different than described below). I checked
Josuttis "The C++ Standard Library" and the C++ standard ch 22.2.2 and
27.6.1, but haven't been able to get anything useful out of them.
The problem is as follows: I would like to read sequences like "1 2
+" by first trying to read into a double, and if that fails try to
read into a char, see sample program below (the program reads only a
single number/symbol). This works fine as long as any non-number token
is not a symbol that could be the first symbol in a number, i.e. the
plus sign, the minus sign or the decimal point. If the symbol is one
of those three, the program simply hangs. If I change the locale to,
e.g., Norwegian, it will hang on the decimal comma instead of the
decimal point.
My interpretation is that the operator reads +, -, or ., then tries to
read the next digit, which it does not find, and then raises the
failbit and returns WITHOUT putting +, -, or . back into the input
stream. Should this/must this be so?
A work-around presumably is to read to a string, place it into a
stringstream and then extract from the latter.
I'd appreciate expert advice on this.
Hans
/*
  Illustrate problems with symbols that can
  be prefixes to numbers (+,-,. if no locale is set).

  If a number is entered on the keyboard, the program prints
      OK --- double: -1.3
  If any character but + - . is entered, the program prints
      FAIL --- double
      OK --- char: a
  If + - or . are entered, the program hangs after
      FAIL --- double
*/
#include <iostream>
using namespace std;

int main()
{
    double x = 0;
    char c = 0;

    if ( cin >> x )
        cerr << "OK --- double: " << x << endl;
    else
    {
        cerr << "FAIL --- double" << endl;
        // Reset the error state so that the next extraction is attempted.
        cin.clear();
        if ( cin >> c )
            cerr << "OK --- char: " << c << endl;
        else
            cerr << "FAIL --- char" << endl;
    }
    return 0;
}