Topic: Can std::string handle unsigned char?


Author: "Peter Olcott" <NoSpam@SeeScreen.com>
Date: Tue, 5 Jan 2010 17:19:50 CST
I want to use a std::string (or equivalent) to store UTF-8
characters. Can it always do this?



--
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@netlab.cs.rpi.edu]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html                      ]





Author: "Leigh Johnston" <leigh@i42.co.uk>
Date: Wed, 6 Jan 2010 12:47:10 CST
"Peter Olcott" <NoSpam@SeeScreen.com> wrote in message
news:wr2dnVFb14zHA9_WnZ2dnUVZ_tednZ2d@giganews.com...
> I want to use a std::string (or equivalent) to store UTF-8
> characters can it always do this?
>

std::string is fine for storing UTF-8, you do not need unsigned char to do
this.
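A minimal sketch of that claim, assuming an 8-bit char as on VC++ and g++ (the helper names `byte_at` and `e_acute_utf8` are made up for illustration): the two UTF-8 bytes of U+00E9 are stored in a std::string and read back as unsigned values.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: read one stored code unit back as an unsigned value.
// Converting char to unsigned char is always well defined (modulo 2^CHAR_BIT).
inline unsigned int byte_at(const std::string& s, std::size_t i)
{
    assert(i < s.size());
    return static_cast<unsigned char>(s[i]);
}

// U+00E9 (e with acute accent) encoded as UTF-8: 0xC3 0xA9.
inline std::string e_acute_utf8()
{
    return std::string("\xC3\xA9");
}
```

With a signed 8-bit char the stored elements are negative, which is why the read goes through unsigned char before any comparison.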

/Leigh







Author: Clebson Derivan <cderivan@gmail.com>
Date: Wed, 6 Jan 2010 13:02:31 CST
On 5 jan, 21:19, "Peter Olcott" <NoS...@SeeScreen.com> wrote:
> I want to use a std::string (or equivalent) to store UTF-8
> characters can it always do this?
>

Use std::vector instead; you can access the data through &v[0].
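A sketch of what this suggestion might look like. `fill_buffer` is a hypothetical stand-in for any C-style API that writes raw bytes; note that &v[0] is only valid when the vector is non-empty.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical C-style producer: writes up to n bytes of UTF-8 into out.
inline std::size_t fill_buffer(unsigned char* out, std::size_t n)
{
    static const unsigned char data[] = { 0xC3, 0xA9 };   // U+00E9
    const std::size_t count = n < sizeof data ? n : sizeof data;
    std::memcpy(out, data, count);
    return count;
}

inline std::vector<unsigned char> read_utf8_bytes()
{
    std::vector<unsigned char> v(16);   // non-empty, so &v[0] is valid
    assert(!v.empty());
    const std::size_t used = fill_buffer(&v[0], v.size());
    v.resize(used);                     // keep only the bytes written
    return v;
}
```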







Author: James Kanze <james.kanze@gmail.com>
Date: Wed, 6 Jan 2010 21:54:27 CST
On Jan 6, 6:47 pm, "Leigh Johnston" <le...@i42.co.uk> wrote:
> "Peter Olcott" <NoS...@SeeScreen.com> wrote in message

> news:wr2dnVFb14zHA9_WnZ2dnUVZ_tednZ2d@giganews.com...

> > I want to use a std::string (or equivalent) to store UTF-8
> > characters can it always do this?

> std::string is fine for storing UTF-8, you do not need
> unsigned char to do this.

In practice.  Formally, the result of assigning a value which
is not representable in the target type is unspecified.  And if
char is signed, and 8 bits, things like 0xC3 aren't
representable.

--
James Kanze






Author: "Leigh Johnston" <leigh@i42.co.uk>
Date: Wed, 6 Jan 2010 21:54:18 CST
>
> use std::vector instead, you can access &v[0] to access the data.
>

If he must have a string of unsigned char then why not use
std::basic_string<unsigned char> instead of std::vector?  However I don't
think this is necessary and std::string is just fine.  When converting from
UTF-8 to some other encoding (e.g. UTF-16 and std::wstring) you can always
cast characters to unsigned as required.
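As an illustration of casting to unsigned during conversion, here is a minimal sketch that decodes one two-byte UTF-8 sequence into a code point. The function name `decode_two_byte` is hypothetical, and the sketch does no validation of the input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode a two-byte UTF-8 sequence starting at s[i] into a code point.
// Each char is cast to unsigned char first, so that a lead byte such as
// 0xC3 is seen as 195 and not as a negative value.
// Sketch only: the lead and continuation bytes are not validated.
inline char32_t decode_two_byte(const std::string& s, std::size_t i)
{
    assert(i + 1 < s.size());
    const unsigned char lead  = static_cast<unsigned char>(s[i]);
    const unsigned char trail = static_cast<unsigned char>(s[i + 1]);
    return static_cast<char32_t>(((lead & 0x1Fu) << 6) | (trail & 0x3Fu));
}
```

For the bytes 0xC3 0xA9 this yields U+00E9; without the casts, a signed char would be sign-extended before the masking and shifting.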

/Leigh







Author: Nick Hounsome <nick.hounsome@googlemail.com>
Date: Thu, 7 Jan 2010 13:43:53 CST
On 5 Jan, 23:19, "Peter Olcott" <NoS...@SeeScreen.com> wrote:
> I want to use a std::string (or equivalent) to store UTF-8
> characters can it always do this?
>

From the Wikipedia article on C++0x (although I'm pretty sure that I read
it in the official stuff as well):

For the purpose of enhancing support for Unicode in C++ compilers, the
definition of the type char has been modified to be both at least the
size necessary to store an eight-bit coding of UTF-8 and large enough
to contain any member of the compiler's basic execution character set.
It was previously defined as only the latter.

There are three Unicode encodings that C++0x will support: UTF-8,
UTF-16, and UTF-32. In addition to the previously noted changes to the
definition of char, C++0x will add two new character types: char16_t
and char32_t. Each of these is designed to store UTF-16 and UTF-32
respectively.
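A small sketch of the two new character types, assuming a compiler that implements the C++0x (C++11) string literal prefixes; the function names are made up for illustration.

```cpp
#include <string>

// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs a
// surrogate pair (two char16_t code units) while UTF-32 needs one char32_t.
inline std::u16string grin_utf16() { return u"\U0001F600"; }
inline std::u32string grin_utf32() { return U"\U0001F600"; }
```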

Note that size() is always the number of char or char16_t code units rather
than the number of printing characters (glyphs). This pretty much has to
be so unless you have a much-restricted interface or a very slow size()
function (that might fail!).
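To make the difference concrete, here is a sketch that counts code points by walking the whole string, assuming the input is valid UTF-8. The name `count_code_points` is hypothetical, and code points are still not the same thing as glyphs.

```cpp
#include <cstddef>
#include <string>

// Count code points in (assumed valid) UTF-8 by skipping continuation
// bytes, which all have the bit pattern 10xxxxxx.
inline std::size_t count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For "caf\xC3\xA9", size() reports 5 code units while the loop counts 4 code points, at the cost of a linear scan.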

So I would stick to using std::string rather than vector.

Another issue that is beyond the scope of any string class, and yet one you
might want to consider, is when, whether and how to output a BOM (byte order
mark) for UTF-16/UTF-32.







Author: Mathias Gaunard <loufoque@gmail.com>
Date: Thu, 7 Jan 2010 13:43:03 CST
On Jan 7, 3:54 am, James Kanze <james.ka...@gmail.com> wrote:
> On Jan 6, 6:47 pm, "Leigh Johnston" <le...@i42.co.uk> wrote:
>
> > "Peter Olcott" <NoS...@SeeScreen.com> wrote in message
> >news:wr2dnVFb14zHA9_WnZ2dnUVZ_tednZ2d@giganews.com...
> > > I want to use a std::string (or equivalent) to store UTF-8
> > > characters can it always do this?
> > std::string is fine for storing UTF-8, you do not need
> > unsigned char to do this.
>
> In practice.  Formally, the results of assigning a value which
> is not representable in the target type is unspecified.

The value obtained by assigning 0xC3 to a signed 8-bit integer is
perfectly specified.
What isn't is how it is represented.







Author: "Leigh Johnston" <leigh@i42.co.uk>
Date: Thu, 7 Jan 2010 13:50:57 CST

"James Kanze" <james.kanze@gmail.com> wrote in message
news:a4b66b5a-1997-4591-9a9e-921f91863e26@v25g2000yqk.googlegroups.com...

> On Jan 6, 6:47 pm, "Leigh Johnston" <le...@i42.co.uk> wrote:
>
>> "Peter Olcott" <NoS...@SeeScreen.com> wrote in message
>> news:wr2dnVFb14zHA9_WnZ2dnUVZ_tednZ2d@giganews.com...
>>
>>> I want to use a std::string (or equivalent) to store UTF-8
>>> characters can it always do this?
>>
>> std::string is fine for storing UTF-8, you do not need
>> unsigned char to do this.
>
> In practice.  Formally, the results of assigning a value which
> is not representable in the target type is unspecified.  And if
> char is signed, and 8 bits, things like 0xC3 aren't
> representable.
>
If by "unspecified" you mean "Implementation Defined" then yes, however I
take issue with your definition of representable:

#include <cassert>

int main()
{
       unsigned char ch1 = 0xC3;
       char ch2 = ch1;           // the unsigned char -> char conversion under discussion
       unsigned char ch3 = ch2;
       // The following is fine in VC++ and g++, where char is signed yet
       // can represent 0xC3 because there are sufficient bits.
       assert(ch3 == 0xC3);
       // The following is also fine.
       assert((ch2 & 0xFF) == 0xC3);
}

/Leigh






Author: "Johannes Schaub (litb)" <schaub-johannes@web.de>
Date: Thu, 7 Jan 2010 16:34:48 CST
Leigh Johnston wrote:

>
>
> "James Kanze" <james.kanze@gmail.com> wrote in message
> news:a4b66b5a-1997-4591-9a9e-921f91863e26@v25g2000yqk.googlegroups.com...
>
>> On Jan 6, 6:47 pm, "Leigh Johnston" <le...@i42.co.uk> wrote:
>>
>>> "Peter Olcott" <NoS...@SeeScreen.com> wrote in message
>>> news:wr2dnVFb14zHA9_WnZ2dnUVZ_tednZ2d@giganews.com...
>>>
>>>> I want to use a std::string (or equivalent) to store UTF-8
>>>> characters can it always do this?
>>>
>>> std::string is fine for storing UTF-8, you do not need
>>> unsigned char to do this.
>>
>> In practice.  Formally, the results of assigning a value which
>> is not representable in the target type is unspecified.  And if
>> char is signed, and 8 bits, things like 0xC3 aren't
>> representable.
>>
> If by "unspecified" you mean "Implementation Defined" then yes, however I
> take issue with your definition of representable:
>
> int main()
> {
>        unsigned char ch1 = 0xC3;
>        char ch2 = ch1;
>        unsigned char ch3 = ch2;
> // following is fine in VC++ and g++ where char is signed yet can
> represent 0xC3 because there are sufficient bits.
>        assert(ch3 == 0xC3);

I don't think your example demonstrates that "char" can represent 0xC3
there.

It cannot represent the *value* 0xC3. What you have there is storing an
implementation defined value into ch2, and *that* value can be represented
(probably they will just take the bitpattern of 0xC3 from ch1 and store that
one into ch2). If they do that, then on a two's complement representation
when assigning back to ch3, of course the value won't have changed from ch1
to ch3, but this won't say anything with regard to whether 0xC3 is
representable by "char" on that platform.

For example, -1 cannot be represented in "unsigned char"; if you assign it
to one, you store the value UCHAR_MAX instead, which can be
represented.
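A minimal sketch of that conversion (the function name is made up for illustration). Unlike the signed case, conversion to an unsigned type is fully specified by the standard.

```cpp
#include <climits>

// Conversion to an unsigned type is defined modulo 2^N, so -1 always
// becomes the largest value of the type, never -1 itself.
inline unsigned char minus_one_as_uchar()
{
    return static_cast<unsigned char>(-1);
}
```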

> // following is also fine
>       assert((ch2 & 0xFF) == 0xC3);

Same issue: the operation promotes "ch2" to an int, and that integer will
probably be negative on these platforms. Masking off all but the low 8
bits yields a value that's the same as 0xC3. But again, this says nothing
about whether "char" can represent 0xC3.








Author: "Johannes Schaub (litb)" <schaub-johannes@web.de>
Date: Thu, 7 Jan 2010 16:33:43 CST
Mathias Gaunard wrote:

> On Jan 7, 3:54 am, James Kanze <james.ka...@gmail.com> wrote:
>> On Jan 6, 6:47 pm, "Leigh Johnston" <le...@i42.co.uk> wrote:
>>
>> > "Peter Olcott" <NoS...@SeeScreen.com> wrote in message
>> >news:wr2dnVFb14zHA9_WnZ2dnUVZ_tednZ2d@giganews.com...
>> > > I want to use a std::string (or equivalent) to store UTF-8
>> > > characters can it always do this?
>> > std::string is fine for storing UTF-8, you do not need
>> > unsigned char to do this.
>>
>> In practice.  Formally, the results of assigning a value which
>> is not representable in the target type is unspecified.
>
> The value obtained by assigning 0xC3 to a signed 8-bit integer is
> perfectly specified.
> What isn't is how it is represented.
>
4.7 Integral conversions: "If the destination type is signed, the value is
unchanged if it can be represented in the destination type (and bit-field
width); otherwise, the value is implementation-defined."







Author: Pete Becker <pete@versatilecoding.com>
Date: Thu, 7 Jan 2010 16:34:29 CST
Leigh Johnston wrote:
>
>
> "James Kanze" <james.kanze@gmail.com> wrote in message
> news:a4b66b5a-1997-4591-9a9e-921f91863e26@v25g2000yqk.googlegroups.com...
>
>> On Jan 6, 6:47 pm, "Leigh Johnston" <le...@i42.co.uk> wrote:
>>
>>> "Peter Olcott" <NoS...@SeeScreen.com> wrote in message
>>> news:wr2dnVFb14zHA9_WnZ2dnUVZ_tednZ2d@giganews.com...
>>>
>>>> I want to use a std::string (or equivalent) to store UTF-8
>>>> characters can it always do this?
>>>
>>> std::string is fine for storing UTF-8, you do not need
>>> unsigned char to do this.
>>
>> In practice.  Formally, the results of assigning a value which
>> is not representable in the target type is unspecified.  And if
>> char is signed, and 8 bits, things like 0xC3 aren't
>> representable.
>>
> If by "unspecified" you mean "Implementation Defined" then yes,

"Unspecified" means that the standard does not tell you which of the
various reasonable alternatives (usually listed in the specification)
will happen. "Implementation defined" means that a conforming
implementation must document what it does. Showing that code does what
you expect it to do does not tell you that the behavior is
implementation defined, nor that it is unspecified. You determine that
by reading the standard.

--
    Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of
"The Standard C++ Library Extensions: a Tutorial and Reference"
(www.petebecker.com/tr1book)






Author: Mathias Gaunard <loufoque@gmail.com>
Date: Fri, 8 Jan 2010 11:21:07 CST
On 7 jan, 19:43, Nick Hounsome <nick.houns...@googlemail.com> wrote:
> On 5 Jan, 23:19, "Peter Olcott" <NoS...@SeeScreen.com> wrote:
>
> > I want to use a std::string (or equivalent) to store UTF-8
> > characters can it always do this?
>
> > From wikipedia article on C++0X (although I'm pretty sure that I read
> > it in the official stuff as well):
>
> For the purpose of enhancing support for Unicode in C++ compilers, the
> definition of the type char has been modified to be both at least the
> size necessary to store an eight-bit coding of UTF-8 and large enough
> to contain any member of the compiler's basic execution character set.
> It was previously defined as only the latter.

AFAIK, C89, C99, C++98 and C++03 all already mandate that CHAR_BIT is
at least 8, so it's just redundancy.
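That guarantee can be checked at compile time. A sketch, assuming a compiler with static_assert and constexpr (C++0x); the function name is hypothetical.

```cpp
#include <climits>
#include <limits>

// Both checks hold on every conforming implementation, so they never fire.
static_assert(CHAR_BIT >= 8, "char must have at least 8 bits");
static_assert(std::numeric_limits<unsigned char>::max() >= 0xFF,
              "unsigned char must hold every 8-bit UTF-8 code unit");

// Number of bits in a char on this implementation.
constexpr int bits_per_char() { return CHAR_BIT; }
```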






Author: "Leigh Johnston" <leigh@i42.co.uk>
Date: Fri, 8 Jan 2010 11:21:30 CST
I am sorry for not qualifying my original response with
"implementation-defined behaviour"; as usual I rush my usenet postings.

However, consider:

unsigned char unsignedChar = foo();
signed char signedChar = unsignedChar;

1) The compiler cannot know in advance what value unsignedChar has so must
always generate the same code (probably optimized to a simple single store
instruction).
2) Stripping off the high bit of unsignedChar makes even less sense than
simply converting its value to a signed (negative) value.
3) There should be a 1 to 1 mapping of any unsigned value with the high bit
set to some signed (negative) value (assuming two's complement of course)
which in effect means that signed char *can* represent any unsigned value
albeit sometimes as a negative number.

Although actual behaviour is "implementation-defined" I doubt there are many
sane implementations that do not follow the above and it is perhaps wise to
consider the behaviour of your target implementation in the real world over
and above some hypothetical implementation in fantasy land. :)  Mixing
std::string and std::basic_string<unsigned char> in the same code base can
be a PITA.

If in doubt always check your compiler's documentation.

/Leigh







Author: Scott Meyers <NeverRead@aristeia.com>
Date: Fri, 8 Jan 2010 13:06:34 CST
Nick Hounsome wrote:

> From wikipedia article on C++0X (although I'm pretty sure that I read
> it in the official stuff as well):
>
> There are three Unicode encodings that C++0x will support: UTF-8,
> UTF-16, and UTF-32.
>

I believe this is a bit misleading, as C++0x seems to offer a funny mixture
of support for UTF-8, UTF-16, UCS-2 (which is a subset of UTF-16), and
UTF-32 (which is essentially identical to UCS-4). A lot depends on what you
mean by "support".

I suggest you consult these threads for discussions related to this topic:

http://tinyurl.com/yeafdzv
http://tinyurl.com/y8pzsd9

Scott






Author: Kaz Kylheku <kkylheku@gmail.com>
Date: Fri, 8 Jan 2010 13:03:56 CST
On 2010-01-07, Leigh Johnston <leigh@i42.co.uk> wrote:
>
>
> "James Kanze" <james.kanze@gmail.com> wrote in message
> news:a4b66b5a-1997-4591-9a9e-921f91863e26@v25g2000yqk.googlegroups.com...
>
>> On Jan 6, 6:47 pm, "Leigh Johnston" <le...@i42.co.uk> wrote:
>>
>>> "Peter Olcott" <NoS...@SeeScreen.com> wrote in message
>>> news:wr2dnVFb14zHA9_WnZ2dnUVZ_tednZ2d@giganews.com...
>>>
>>>> I want to use a std::string (or equivalent) to store UTF-8
>>>> characters can it always do this?
>>>
>>> std::string is fine for storing UTF-8, you do not need
>>> unsigned char to do this.
>>
>> In practice.  Formally, the results of assigning a value which
>> is not representable in the target type is unspecified.  And if
>> char is signed, and 8 bits, things like 0xC3 aren't
>> representable.
>>
> If by "unspecified" you mean "Implementation Defined" then yes, however I
> take issue with your definition of representable:
>
> int main()
> {
>        unsigned char ch1 = 0xC3;
>        char ch2 = ch1;
>        unsigned char ch3 = ch2;
> // following is fine in VC++ and g++ where char is signed yet can represent

In g++, char is signed by default, but this can be controlled by
a command line option.

Note that this newsgroup is comp.STD.c++.

``Proof by several compilers''

isn't good enough. Of course, it's good enough in practical development
where portability is weighed against economics: schedules, budgets,
markets, software lifetimes.

> 0xC3 because there are sufficient bits.

If char is signed and 8 bits wide, the value 0xC3 is
not representable. The maximum value of the type is 127,
whereas 0xc3 is 195.  195 being greater than 127 is a significant
obstacle to representability.

Of course, we can /encode/ values outside of the range of char,
by sacrificing some other values.

If we don't need any negative values, then we can use the values
-CHAR_MAX through -1 to encode values beyond CHAR_MAX.

That's not a direct representation: then it's the char type plus an
additional convention we have imposed on it which is doing the
representing.

>        assert(ch3 == 0xC3);

This assertion proves nothing, other than that the
implementation-defined mapping of the 0xC3 value to the char type is
reversible by a conversion to unsigned char, on those two compilers.

This need not be the case. An implementation can treat out-of-range
numbers by clamping them, so that   char c = 195   produces a value of
127. Good idea or not, this is valid implementation-defined behavior.
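To see why clamping would break the round trip, here is a sketch that emulates such a conversion in ordinary code. The function names are made up, and this models a hypothetical implementation, not any compiler I know of.

```cpp
#include <climits>

// What a hypothetical saturating implementation would produce when an
// out-of-range int is converted to signed char.
inline signed char clamp_to_schar(int v)
{
    if (v > SCHAR_MAX) return SCHAR_MAX;
    if (v < SCHAR_MIN) return SCHAR_MIN;
    return static_cast<signed char>(v);
}

// After clamping, converting back to unsigned char no longer yields
// the original value for inputs above SCHAR_MAX.
inline unsigned char round_trip_clamped(int v)
{
    return static_cast<unsigned char>(clamp_to_schar(v));
}
```

With an 8-bit signed char, 0xC3 goes in and 127 comes back out, so the assertion in the quoted program would fail.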

Programs which avoid such a conversion are more portable (even if only
in a mathematical sense, rather than in the real world).
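One way to avoid the value conversion, sketched under the assumption that the bytes arrive as unsigned char: copy them into the string through a char pointer, so no out-of-range value is ever converted. The function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Append raw bytes to a std::string without converting any value from
// unsigned char to char: the bytes are copied as object representation.
inline void append_bytes(std::string& out, const unsigned char* bytes, std::size_t n)
{
    out.append(reinterpret_cast<const char*>(bytes), n);
}

inline std::string e_acute_from_bytes()
{
    const unsigned char raw[] = { 0xC3, 0xA9 };   // U+00E9 in UTF-8
    std::string s;
    append_bytes(s, raw, sizeof raw);
    return s;
}
```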

> // following is also fine
>        assert((ch2 & 0xFF) == 0xC3);

This additionally relies on two behaviors:

- conversions of an out-of-range value to an integer type are treated
 by bit truncation, whereby a mantissa bit in the original value which
 corresponds to the narrower type's sign bit is preserved into
 that sign bit.

- the two's complement representation is used for signed integers,
 thus obeying sign-extension when such values are converted to a wider
 signed type.

This means that the out-of-range value 0xC3 becomes a two's complement
value in the type char, which has the bit pattern 0xC3: there are 8
bits, which are simply preserved in the conversion; bit 7 becomes the
sign bit.

When this char object is evaluated, it produces a negative value of type
int (due to promotion) which is sign-extended; i.e. the least significant
8 bits of this int value continue to hold the value 0xC3, and the sign
bit is propagated through all the higher bits including the sign bit.
When we reduce this value with the bit operation & 255, we retain the
least significant 8 bits which hold the pattern 0xC3. That's the binary
value we obtain, since the 7th bit is not treated as the sign bit.
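The promotion and masking can be sketched as follows, assuming two's complement and using -61, the value that g++ and VC++ give a signed 8-bit char holding the bit pattern 0xC3. The function names are made up for illustration.

```cpp
// The promoted value itself, to show that it is negative before masking.
inline int promoted_value(signed char c)
{
    return c;
}

// Recover the low 8 bits of a (possibly negative) char value.
// c is promoted to int first; on a two's complement machine the
// promotion sign-extends, and the mask then discards the extended bits.
inline int low_eight_bits(signed char c)
{
    return c & 0xFF;
}
```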






Author: James Kanze <james.kanze@gmail.com>
Date: Sun, 10 Jan 2010 09:47:44 CST
On Jan 8, 5:21 pm, "Leigh Johnston" <le...@i42.co.uk> wrote:
> I am sorry for not qualifying my original response with
> "implementation-defined behaviour"; as usual I rush my usenet
> postings.

> However, consider:

> unsigned char unsignedChar = foo();
> signed char signedChar = unsignedChar;

> 1) The compiler cannot know in advance what value unsignedChar
> has so must always generate the same code (probably optimized
> to a simple single store instruction).

Certainly (except that it might not be a single store
instruction.)

> 2) Stripping off the high bit of unsignedChar makes even less
> sense than simply converting its value to a signed (negative)
> value.

And raising an implementation-defined signal makes more sense
than either.

> 3) There should be a 1 to 1 mapping of any unsigned value with
> the high bit set to some signed (negative) value (assuming
> two's complement of course) which in effect means that signed
> char *can* represent any unsigned value albeit sometimes as a
> negative number.

Why should there be a 1 to 1 mapping?  The standard certainly
doesn't require it, and there is hardware being sold today which
doesn't support it.

> Although actual behaviour is "implementation-defined" I doubt
> there are many sane implementations that do not follow the
> above

I would expect a robust implementation to raise the signal in
such cases.  It has definite runtime costs (which is why it
isn't widely done), but it's the most robust solution: what one
would want in a critical system, for example.

> and it is perhaps wise to consider the behaviour of your
> target implementation in the real world over and above some
> hypothetical implementation in fantasy land. :)  Mixing
> std::string and std::basic_string<unsigned char> in the same
> code base can be a PITA.

No one said that you should mix the two.  That does cause
problems.  (Starting with the fact that instantiating
std::basic_string< unsigned char > is undefined behavior.)

--
James Kanze






Author: James Kanze <james.kanze@gmail.com>
Date: Sun, 10 Jan 2010 09:48:29 CST
On Jan 8, 7:03 pm, Kaz Kylheku <kkylh...@gmail.com> wrote:
> On 2010-01-07, Leigh Johnston <le...@i42.co.uk> wrote:
> > "James Kanze" <james.ka...@gmail.com> wrote in message
> >news:a4b66b5a-1997-4591-9a9e-921f91863e26@v25g2000yqk.googlegroups.com...

> >> On Jan 6, 6:47 pm, "Leigh Johnston" <le...@i42.co.uk> wrote:

> >>> "Peter Olcott" <NoS...@SeeScreen.com> wrote in message

> >>news:wr2dnVFb14zHA9_WnZ2dnUVZ_tednZ2d@giganews.com...

      [...]
> > int main()
> > {
> >        unsigned char ch1 = 0xC3;
> >        char ch2 = ch1;
> >        unsigned char ch3 = ch2;
> > // following is fine in VC++ and g++ where char is signed yet can represent

      [...]
> >        assert(ch3 == 0xC3);

> This assertion proves nothing, other than that the
> implementation-defined mapping of the 0xC3 value to the char
> type is reversible by a conversion to unsigned char, on those
> two compilers.

> This need not be the case. An implementation can treat
> out-of-range numbers by clamping them, so that   char c = 195
> produces a value of 127. Good idea or not, this is valid
> implementation-defined behavior.

You don't have to go to that point (although a conforming
implementation could even raise an implementation defined signal
if the value overflows).  The simplest implementation of
converting between signed and unsigned is just to copy the bits.
A 1's complement or signed magnitude implementation doesn't have
this liberty when converting from signed to unsigned, however,
so it's quite possible that while the unsigned to signed does
what is wanted (here), the signed back to unsigned doesn't.  (I
know of at least two machines where this would be an issue.
Both have plain char unsigned, however.  Probably intentionally,
to avoid this issue.)

--
James Kanze






Author: "Johannes Schaub (litb)" <schaub-johannes@web.de>
Date: Sun, 10 Jan 2010 14:32:10 CST
James Kanze wrote:

> On Jan 8, 7:03 pm, Kaz Kylheku <kkylh...@gmail.com> wrote:
>> On 2010-01-07, Leigh Johnston <le...@i42.co.uk> wrote:
>> > "James Kanze" <james.ka...@gmail.com> wrote in message
>>
>news:a4b66b5a-1997-4591-9a9e-921f91863e26@v25g2000yqk.googlegroups.com...
>
>> >> On Jan 6, 6:47 pm, "Leigh Johnston" <le...@i42.co.uk> wrote:
>
>> >>> "Peter Olcott" <NoS...@SeeScreen.com> wrote in message
>
>> >>news:wr2dnVFb14zHA9_WnZ2dnUVZ_tednZ2d@giganews.com...
>
>       [...]
>> > int main()
>> > {
>> >        unsigned char ch1 = 0xC3;
>> >        char ch2 = ch1;
>> >        unsigned char ch3 = ch2;
>> > // following is fine in VC++ and g++ where char is signed yet can
>> > represent
>
>       [...]
>> >        assert(ch3 == 0xC3);
>
>> This assertion proves nothing, other than that the
>> implementation-defined mapping of the 0xC3 value to the char
>> type is reversible by a conversion to unsigned char, on those
>> two compilers.
>
>> This need not be the case. An implementation can treat
>> out-of-range numbers by clamping them, so that   char c = 195
>> produces a value of 127. Good idea or not, this is valid
>> implementation-defined behavior.
>
> You don't have to go to that point (although a conforming
> implementation could even raise an implementation defined signal
> if the value overflows).
>

Can it really raise? My understanding was that C allows it, while C++ doesn't.
C++ says "... otherwise, the value is implementation-defined." while C says
"either the result is implementation-defined or an implementation-defined
signal is raised.". To me the C++ wording reads as if anything other than
producing a value is not allowed. I suspect I'm reading it wrongly, but which
part? Is there some difference between C and C++ with regard to what "value"
means? I.e., can it include traps in C++ while in C it can't? C says "An
unspecified value cannot be a trap representation." and C++ says "For POD
types, the /value representation/ is a set of bits in the object
representation that determines a /value/, which is one discrete element of an
implementation-defined set of values". I can't find a clue about traps when
reading the C++ part.









Author: "Leigh Johnston" <leigh@i42.co.uk>
Date: Sun, 10 Jan 2010 14:31:08 CST
>> and it is perhaps wise to consider the behaviour of your
>> target implementation in the real world over and above some
>> hypothetical implementation in fantasy land. :)  Mixing
>> std::string and std::basic_string<unsigned char> in the same
>> code base can be a PITA.
>
> No one said that you should mix the two.  That does cause
> problems.  (Starting with the fact that instantiating
> std::basic_string< unsigned char > is undefined behavior.)
>

Why is instantiating std::basic_string<unsigned char> undefined behaviour?
I thought std::basic_string was a container like any other.  The
unspecialized char_traits should work fine for unsigned char I think after
taking a quick look at character traits requirements.  The standard mentions
char-like objects but I cannot find a mention of UB:

"The class template basic_string describes objects that can store a sequence
consisting of a varying number of arbitrary char-like objects with the first
element of the sequence at position zero. Such a sequence is also called a
"string" if the type of the char-like objects that it holds is clear from
context. In the rest of this Clause, the type of the char-like objects held
in a basic_string object is designated by charT."

/Leigh







Author: "Leigh Johnston" <leigh@i42.co.uk>
Date: Sun, 10 Jan 2010 14:31:47 CST
>> and it is perhaps wise to consider the behaviour of your
>> target implementation in the real world over and above some
>> hypothetical implementation in fantasy land. :)  Mixing
>> std::string and std::basic_string<unsigned char> in the same
>> code base can be a PITA.
>
> No one said that you should mix the two.  That does cause
> problems.  (Starting with the fact that instantiating
> std::basic_string< unsigned char > is undefined behavior.)
>

Yeah I guess it is implementation defined behaviour (not undefined
behaviour) as the standard only mandates the following for unspecialized
char_traits:

template<class charT> struct char_traits;

I know it is not proof of anything, but VC++, g++ and Comeau all seem to
provide a functioning unspecialized char_traits that is compatible with
unsigned char.

/Leigh







Author: Pete Becker <pete@versatilecoding.com>
Date: Mon, 11 Jan 2010 00:48:02 CST
Leigh Johnston wrote:
>
>>> and it is perhaps wise to consider the behaviour of your
>>> target implementation in the real world over and above some
>>> hypothetical implementation in fantasy land. :)  Mixing
>>> std::string and std::basic_string<unsigned char> in the same
>>> code base can be a PITA.
>>
>> No one said that you should mix the two.  That does cause
>> problems.  (Starting with the fact that instantiating
>> std::basic_string< unsigned char > is undefined behavior.)
>>
>
> Yeah I guess it is implementation defined behaviour (not undefined
> behaviour) as the standard only mandates the following for unspecialized
> char_traits:
>
> template<class charT> struct char_traits;
>

Unless there's a requirement that the implementation document its
behavior, the behavior is not implementation defined. That's a formal
term in the standard, and it's not equivalent to the informal
"implementation specific".

--
    Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of
"The Standard C++ Library Extensions: a Tutorial and Reference"
(www.petebecker.com/tr1book)
