Topic: Things with strings


Author: James Kanze <james-albert.kanze@vx.cit.alcatel.fr>
Date: 1997/06/28
Hans-Juergen Boehm <boehm@mti.mti.sgi.com> writes:

|>  2) operator[] and iterator dereferences return a real reference, not a
|>  proxy.  This is required by the draft standard.

Is there any chance of changing this?  For vector, et al., there is a
perfectly valid reason why you must have a real reference: you need to
support things like "v[ i ].f()".  This should be less of an issue for
string.  (The "character-like" type must be POD, and the cases where it
is a POD class must be rare.)
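
To make the proxy idea concrete, here is a minimal sketch of what a
proxy-returning operator[] could look like for a character type.  It is
not anything from the draft: ProxyString, CharProxy, shares_with() and
str() are hypothetical names, std::shared_ptr stands in for a hand-written
reference count, and it assumes a C++11 or later compiler.  Reads go
through a conversion to char and never unshare; only assignment forces a
private copy.  As noted above, the same trick would not support
"v[ i ].f()", since operator. cannot be overloaded.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical toy class, not the draft's basic_string: a copy-on-write
// string whose non-const operator[] returns a proxy instead of char&.
class ProxyString {
    std::shared_ptr<std::string> rep_;  // representation, possibly shared

    void unshare() {                    // take a private copy before a write
        if (rep_.use_count() > 1) rep_ = std::make_shared<std::string>(*rep_);
    }

public:
    class CharProxy {
        ProxyString& s_;
        std::size_t  i_;
    public:
        CharProxy(ProxyString& s, std::size_t i) : s_(s), i_(i) {}
        // Reading never copies and never hands out a raw reference.
        operator char() const { return (*s_.rep_)[i_]; }
        // Only a write forces a private copy.
        CharProxy& operator=(char c) {
            s_.unshare();
            (*s_.rep_)[i_] = c;
            return *this;
        }
    };

    explicit ProxyString(const char* p)
        : rep_(std::make_shared<std::string>(p)) {}
    CharProxy operator[](std::size_t i) { return CharProxy(*this, i); }
    bool shares_with(const ProxyString& o) const { return rep_ == o.rep_; }
    const std::string& str() const { return *rep_; }
};

bool proxy_demo() {
    ProxyString s("abb");
    ProxyString t = s;                     // t shares s's representation
    bool equal = (s[1] == s[2]);           // two reads through proxies
    bool still_shared = s.shares_with(t);  // the comparison copied nothing
    s[0] = 'x';                            // a write unshares s
    return equal && still_shared && !s.shares_with(t)
        && s.str() == "xbb" && t.str() == "abb";
}
```

Calling proxy_demo() from main() exercises both the read path and the
write path.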

--
James Kanze      home:     kanze@gabi-soft.fr        +33 (0)1 39 55 85 62
                 office:   kanze@vx.cit.alcatel.fr   +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
            -- Conseils en informatique industrielle --
---
[ comp.std.c++ is moderated.  To submit articles: try just posting with      ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu         ]
[ FAQ:      http://reality.sgi.com/employees/austern_mti/std-c++/faq.html    ]
[ Policy:   http://reality.sgi.com/employees/austern_mti/std-c++/policy.html ]
[ Comments? mailto:std-c++-request@ncar.ucar.edu                             ]





Author: Ross Smith <ross.smith@nz.eds.com>
Date: 1997/06/25
The web site devoted to SGI's implementation of the Standard Template
Library (http://www.sgi.com/Technology/STL/) includes a discussion of
string classes (under Design Documents). They claim that the string
class described in the draft standard is seriously flawed.

Specifically, "The current draft standard disallows the expression
s[1]==s[2] where s is a nonconstant string." (This page was added fairly
recently, so I think I'm safe in assuming they mean CD2.)

I think I see what the problem is: The description of string::operator[]
in CD2 (21.3.4 [lib.string.access]) states that "The reference returned
is invalid after any subsequent call to c_str(), data(), or any
non-const member function for the object." Since operator[] is itself a
non-const member function, the expression s[1]==s[2] invokes undefined
behaviour -- the two references can't both be valid.

So, some questions for anyone involved in this part of the standard:
First, is this a genuine problem, or have I (quite possibly) or SGI
(somewhat less likely) got it wrong? Second, if it's a genuine problem,
is anything being done about it?

One part of the SGI discussion has me wondering if I've misunderstood
the problem: They suggest using vector<char> as one possible interim
alternative to strings until the standard is fixed (the other
suggestions are their own "rope" class (which, among other differences,
doesn't have a non-const operator[]) and plain old C strings). But
wouldn't vector<char> suffer from exactly the same problem? What does
vector guarantee that string doesn't (or vice versa)?

CD2 just says (23.1.1 [lib.sequence.reqmts]) that vector::operator[]
returns a reference to the nth element, with no comment at all on how
long the reference is valid for.

--
Ross Smith ............................. <mailto:ross.smith@nz.eds.com>
Internet and New Media, EDS (New Zealand) Ltd., Wellington, New Zealand
     "I'm as interested as anybody else in all the things no decent
     person would be interested in."          -- Ashleigh Brilliant





Author: Jason Merrill <jason@cygnus.com>
Date: 1997/06/25
>>>>> Ross Smith <ross.smith@nz.eds.com> writes:

> Specifically, "The current draft standard disallows the expression
> s[1]==s[2] where s is a nonconstant string." (This page was added fairly
> recently, so I think I'm safe in assuming they mean CD2.)

> I think I see what the problem is: The description of string::operator[]
> in CD2 (21.3.4 [lib.string.access]) states that "The reference returned
> is invalid after any subsequent call to c_str(), data(), or any
> non-const member function for the object." Since operator[] is itself a
> non-const member function, the expression s[1]==s[2] invokes undefined
> behaviour -- the two references can't both be valid.

Yep.

> So, some questions for anyone involved in this part of the standard:
> First, is this a genuine problem, or have I (quite possibly) or SGI
> (somewhat less likely) got it wrong? Second, if it's a genuine problem,
> is anything being done about it?

It is a genuine theoretical problem.  For most situations, it's not a
problem.  You would only hit it (for a typical reference-counted
implementation) if you made a copy of the string in between
the references, possibly in another thread.

> One part of the SGI discussion has me wondering if I've misunderstood
> the problem: They suggest using vector<char> as one possible interim
> alternative to strings until the standard is fixed (the other
> suggestions are their own "rope" class (which, among other differences,
> doesn't have a non-const operator[]) and plain old C strings). But
> wouldn't vector<char> suffer from exactly the same problem? What does
> vector guarantee that string doesn't (or vice versa)?

vector does not allow sharing of representations between vectors, so
there's no reason for operator[] to invalidate references.
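
A small illustration of that point, using std::vector<char> as it behaves
in current standard libraries (the function name is made up for the
example, and C++11 initializer syntax is assumed): copying the vector
gives the copy its own storage, so a reference taken earlier keeps
referring to the original element, and later calls to the non-const
operator[] do not disturb it.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper name; it only exercises std::vector.
bool vector_copy_keeps_reference() {
    std::vector<char> v{'a', 'b', 'c'};
    char& r = v[1];            // reference into v's own storage
    std::vector<char> w = v;   // deep copy: w gets separate storage
    w[1] = 'x';                // writes only to w
    char& r2 = v[2];           // another non-const operator[] on v
    return r == 'b' && &r == &v[1] && r2 == 'c' && w[1] == 'x';
}
```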

> CD2 just says (23.1.1 [lib.sequence.reqmts]) that vector::operator[]
> returns a reference to the nth element, with no comment at all on how
> long the reference is valid for.

Methods that invalidate references into a vector say as much in their
definitions.  The STL containers are much better about defining lifetimes
of references than the string class.

Jason





Author: Hans-Juergen Boehm <boehm@mti.mti.sgi.com>
Date: 1997/06/26
(Background: I wrote most of the document in
http://www.sgi.com/Technology/STL/string_discussion.html.  Jason was
very instrumental in helping several of us, including me, understand
the issue.)

Jason Merrill wrote:
>
> >>>>> Ross Smith <ross.smith@nz.eds.com> writes:
>
> > Specifically, "The current draft standard disallows the expression
> > s[1]==s[2] where s is a nonconstant string." (This page was added fairly
> > recently, so I think I'm safe in assuming they mean CD2.)
>
> > I think I see what the problem is: The description of string::operator[]
> > in CD2 (21.3.4 [lib.string.access]) states that "The reference returned
> > is invalid after any subsequent call to c_str(), data(), or any
> > non-const member function for the object." Since operator[] is itself a
> > non-const member function, the expression s[1]==s[2] invokes undefined
> > behaviour -- the two references can't both be valid.
>
>
> It is a genuine theoretical problem.  For most situations, it's not a
> problem.  You would only hit it (for a typical reference-counted
> implementation) if you made a copy of the string in between
> the references, possibly in another thread.
>
I think it's much more than a theoretical problem.

Jason is right that in the single-threaded case you have to work fairly
hard to produce an example on which current implementations fail.  And
testing is likely (though of course not certain) to point out such a
failure.  Probably the most serious problem here is that it appears to
be quite messy to define the behavior of current implementations.  Thus
future implementations could start breaking things like s[0] == s[1]
while still conforming to the standard.  And if you do encounter such a
failure, there is little you can do about it without understanding the
string implementation.

Overall, the single-threaded situation strikes me as unacceptable in the
longer term.  But you might argue that it's not a showstopper for
getting the standard out on time.

In my opinion, the multithreaded situation is far more serious.  A
troublesome failure case is:

Thread 1 executes ...== s[1]...

Just after producing the reference, but before dereferencing it,
thread 2 executes

t = s;      // Adds a second reference.
c2 = s[j];  // Makes a new copy of s's data, since s[j] must produce
            // a modifiable reference.
t = "";     // Deletes the original character array associated with s,
            // which was now t's data.

s[1] in thread 1 is now a dangling reference.
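
The interleaving can be replayed deterministically in a single thread,
which may make it easier to follow.  The sketch below is only a toy:
CowString and watch() are hypothetical names, std::shared_ptr stands in
for the reference count of a real implementation, and a std::weak_ptr is
used to observe, without dereferencing anything, that the original
character array has been freed while the first reference still exists.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical toy reference-counted string, not a real library class.
class CowString {
    std::shared_ptr<std::string> rep_;
public:
    explicit CowString(const char* p)
        : rep_(std::make_shared<std::string>(p)) {}
    CowString& operator=(const char* p) {
        rep_ = std::make_shared<std::string>(p);
        return *this;
    }
    // Non-const access must hand out a modifiable reference,
    // so it takes a private copy first if the array is shared.
    char& operator[](std::size_t i) {
        if (rep_.use_count() > 1) rep_ = std::make_shared<std::string>(*rep_);
        return (*rep_)[i];
    }
    // Lets the demo observe the lifetime of the current array.
    std::weak_ptr<std::string> watch() const { return rep_; }
};

bool replay_leaves_dangling_reference() {
    CowString s("hello");
    CowString t("");

    // "Thread 1": produce the reference, but do not dereference it yet.
    char& r = s[1];
    std::weak_ptr<std::string> original = s.watch();
    (void)r;

    // "Thread 2" runs in the window before the dereference:
    t = s;            // adds a second reference to the array
    char c2 = s[0];   // s makes a private copy; t keeps the original array
    t = "";           // last owner gone: the original array is deleted
    (void)c2;

    // r still points into the deleted array.  Reading it would be
    // undefined behaviour, so the demo only checks that the array is gone.
    return original.expired();
}
```

In a real multithreaded program the steps marked "Thread 2" would have to
land in the short window between producing and using the reference, which
is why the failure would be so hard to reproduce.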

This scenario has the following characteristics:

1) It looks natural, not contrived, to me.  There are many similar
scenarios.

2) s is the only shared variable; it is not written by either thread.
Normal rules tell me no locking is needed.  It's very hard to understand
that locking might be necessary without understanding the
implementation.

3) I know of no set of rules that tells me how to avoid such a situation
or identify problematic code, short of protecting all string accesses
with a single lock.  You can probably do better.  But the fact that
representations are shared makes life complicated, and anything else
appears to me to require a proof.

4) The failure probability is very low.  On a uniprocessor, thread 1
needs to be preempted in what's probably a one instruction window.  It's
very unlikely such a problem will be caught during testing.  Instead you
would probably see sporadic and unreproducible failures in released
software.

Given this situation, I will argue that neither SGI nor anyone else
should release a reference-counted string class that conforms to CD2
into a multithreaded environment.  Multithreaded programming is hard
enough without library classes that encourage intermittent failures.  I
don't make decisions for SGI, so this doesn't imply anything about what
will actually happen.  But if anyone were to pay attention to me, that
would make it very difficult to release such a string class at all.  I
think most vendors think twice about introducing new thread-unsafe
libraries.

(Basic_string has other problems.  But this is probably the most serious
and concrete one.)

>
> vector does not allow sharing of representations between vectors, so
> there's no reason for operator[] to invalidate references.
>
The two crucial ingredients in many string implementations that combine
to cause the problem are:

1) Reference counting.  I don't believe this is required by CD2, but it
seems to be considered a selling point.  It potentially improves
performance, especially for long strings in single-threaded
applications.  (In my opinion, it's not sufficient for this case, since
concatenation is still slow.  Some implementations apparently use
general locks around the reference count operations, in which case it
often loses performance for multithreaded applications.)

2) operator[] and iterator dereferences return a real reference, not a
proxy.  This is required by the draft standard.

--
Hans-Juergen Boehm
boehm@mti.sgi.com





Author: James Kanze <james-albert.kanze@vx.cit.alcatel.fr>
Date: 1997/06/26
Ross Smith <ross.smith@nz.eds.com> writes:

|>  The web site devoted to SGI's implementation of the Standard Template
|>  Library (http://www.sgi.com/Technology/STL/) includes a discussion of
|>  string classes (under Design Documents). They claim that the string
|>  class described in the draft standard is seriously flawed.
|>
|>  Specifically, "The current draft standard disallows the expression
|>  s[1]==s[2] where s is a nonconstant string." (This page was added fairly
|>  recently, so I think I'm safe in assuming they mean CD2.)
|>
|>  I think I see what the problem is: The description of string::operator[]
|>  in CD2 (21.3.4 [lib.string.access]) states that "The reference returned
|>  is invalid after any subsequent call to c_str(), data(), or any
|>  non-const member function for the object." Since operator[] is itself a
|>  non-const member function, the expression s[1]==s[2] invokes undefined
|>  behaviour -- the two references can't both be valid.
|>
|>  So, some questions for anyone involved in this part of the standard:
|>  First, is this a genuine problem, or have I (quite possibly) or SGI
|>  (somewhat less likely) got it wrong? Second, if it's a genuine problem,
|>  is anything being done about it?

It's definitely a genuine problem, and SGI isn't the only one to have
noticed it.  It's one of the French comments on the CD, so the committee
is required to give an answer one way or another.  Some of the people at
SGI are active in the library committee, so I feel certain that they
will be looking at it as well.

Note too that similar problems exist with iterators, at least in most
implementations.  Consider:

    string              s1 , s2 ;
    string::iterator    i = s1.begin() ;
    s2 = s1 ;
    *i = 'a' ;

The usual solution involves making the iterator intelligent, a class
(with a pointer to the string object, as well as the position) rather
than just a pointer.  This solution is, however, currently forbidden for
references, which are required to be real references, and not a class
which acts like a reference.
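
As a rough sketch of the difference between the two kinds of iterator,
under the assumption of a toy copy-on-write class (CowString, Iter,
raw_begin() and str() are hypothetical names, std::shared_ptr replaces a
hand-written reference count, and non-empty strings are used so that the
write is meaningful): the pointer-style iterator writes straight into the
array that the copy now shares, while the class-style iterator looks up
its string at dereference time and lets it unshare first.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical toy copy-on-write string, not a real library class.
class CowString {
    std::shared_ptr<std::string> rep_;

    void unshare() {
        if (rep_.use_count() > 1) rep_ = std::make_shared<std::string>(*rep_);
    }

public:
    explicit CowString(const char* p)
        : rep_(std::make_shared<std::string>(p)) {}
    const std::string& str() const { return *rep_; }

    // Pointer-style iterator: just the address of the first character.
    char* raw_begin() { unshare(); return &(*rep_)[0]; }

    // Class-style iterator: remembers the string object and a position.
    class Iter {
        CowString*  s_;
        std::size_t pos_;
    public:
        Iter(CowString* s, std::size_t pos) : s_(s), pos_(pos) {}
        // The array is located, and unshared if needed, on each dereference.
        char& operator*() const { s_->unshare(); return (*s_->rep_)[pos_]; }
        Iter& operator++() { ++pos_; return *this; }
    };
    Iter begin() { return Iter(this, 0); }
};

bool iterator_demo() {
    // Pointer-style: the write goes through to the shared array.
    CowString a1("xyz");
    char* p = a1.raw_begin();
    CowString a2 = a1;        // a2 now shares a1's array
    *p = 'a';                 // silently changes a2 as well
    bool pointer_leaks = (a2.str() == "ayz");

    // Class-style: the same sequence leaves the copy untouched.
    CowString s1("xyz");
    CowString::Iter i = s1.begin();
    CowString s2 = s1;        // s2 now shares s1's array
    *i = 'a';                 // unshares s1, then writes
    bool smart_ok = (s1.str() == "ayz" && s2.str() == "xyz");

    return pointer_leaks && smart_ok;
}
```

Note that Iter::operator* still returns a real char&, so the reference it
hands out has the same lifetime problem once it is stored.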

|>  One part of the SGI discussion has me wondering if I've misunderstood
|>  the problem: They suggest using vector<char> as one possible interim
|>  alternative to strings until the standard is fixed (the other
|>  suggestions are their own "rope" class (which, among other differences,
|>  doesn't have a non-const operator[]) and plain old C strings). But
|>  wouldn't vector<char> suffer from exactly the same problem? What does
|>  vector guarantee that string doesn't (or vice versa)?

Guarantee, I don't know.  But in practice, vectors are not generally
meant to be copied, and the implementations usually use a deep copy.
The semantics of string have been defined (I think) in a way to allow
copy on write.  Or at least, that was what was attempted.

|>  CD2 just says (23.1.1 [lib.sequence.reqmts]) that vector::operator[]
|>  returns a reference to the nth element, with no comment at all on how
|>  long the reference is valid for.

I'm not sure as to the exact wording in the CD, but the intent with
vector is obvious (I think): references, like iterators, are valid as
long as the vector does not grow.  Whereas with strings, they are valid
only until the next non-const function: if copy-on-write is used, any
non-const function may trigger a new copy, with a resulting
reallocation, and invalidation of the iterators and references.
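
For the vector half of that comparison, a short sketch against
std::vector as specified in the eventual standard (the function name is
invented for the example): while the size stays within the reserved
capacity, the storage does not move and a reference stays good, even
across further non-const calls.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper name; it only exercises std::vector.
bool vector_reference_until_growth() {
    std::vector<char> v;
    v.reserve(4);                  // no reallocation until size exceeds capacity
    v.push_back('a');
    char& r = v[0];
    const char* before = v.data();

    v.push_back('b');              // still within capacity: storage stays put
    v[1] = 'c';                    // non-const operator[] never reallocates
    bool stable = (v.data() == before) && (&r == &v[0]) && (r == 'a');

    // Growing past the capacity reallocates; r must not be used after this.
    v.resize(v.capacity() + 1);
    return stable && v[0] == 'a';
}
```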

With regard to the temporary solution proposed by SGI: I'd just ignore
the problem in the case of comparison.  First, I feel certain that the
committee will make the comparison defined somehow.  And second, it will
work with any reasonable implementation.  Even in the case of copy on
write, only the first non-const function will in fact trigger the copy,
and the returned reference will be into the new copy.
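
A minimal sketch of that argument, assuming a toy copy-on-write class
(CountingCowString and its copies counter are hypothetical, and
std::shared_ptr stands in for the reference count): whichever operand of
the comparison is evaluated first makes the private copy, the other call
finds the array already unshared, and both references end up in the same
new array.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical toy copy-on-write string that counts its private copies.
class CountingCowString {
    std::shared_ptr<std::string> rep_;
public:
    int copies = 0;   // private copies made by this object

    explicit CountingCowString(const char* p)
        : rep_(std::make_shared<std::string>(p)) {}

    char& operator[](std::size_t i) {
        if (rep_.use_count() > 1) {   // shared: copy before handing out char&
            rep_ = std::make_shared<std::string>(*rep_);
            ++copies;
        }
        return (*rep_)[i];
    }
};

bool comparison_copies_once() {
    CountingCowString s("abb");
    CountingCowString t = s;            // s and t share one array
    bool equal = (s[1] == s[2]);        // first call copies, second does not
    bool adjacent = (&s[2] - &s[1] == 1);  // both refer into the same array
    return equal && adjacent && s.copies == 1 && t.copies == 0;
}
```

This covers only the single-threaded case with no copy of the string made
between the two calls.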

Do be aware of the problem, however, and avoid multiple accesses like
this in cases where the string might also be copied in the expression.
And avoid saving the reference for later.

--
James Kanze      home:     kanze@gabi-soft.fr        +33 (0)1 39 55 85 62
                 office:   kanze@vx.cit.alcatel.fr   +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
            -- Conseils en informatique industrielle --