Topic: Guarantees concerning references/pointers into string


Author: John Max Skaller <maxtal@suphys.physics.su.oz.au>
Date: 1996/01/28
Raw View
J. Kanze wrote:

>         a = "A lot of junk" ;
>         a.reserve( 100 ) ;
>         char*               p1 = &a[ 1 ] ;
>         //  b = a ;
>         a.insert( 6 , "---" ) ;
>         char*               p2 = &a[ 1 ] ;
>         cout << "Reserve test: " << (p1 == p2 ? "Passed" : "Failed") << endl ;
>
> As I interpret the standard, the above is required to work.

 I agree. It should work.


> The presense of
> `reserve' makes copy on write anything but trivial to implement
> correctly.  (The exact case above is actually easy to avoid, and I'm
> surprised that g++ has the error.  But try uncommenting the assignment
> in the above code, and it becomes considerably more difficult to avoid
> the problem.)

 I do NOT understand. There is a problem with
copy on write implementations, but it has nothing to do with
reserve(). Rather, the problem occurs because of aliasing:

 void f(string &x, string const &y) {
  char &ch = y[0];
  x = "fred";
 }

Is "ch" assured? Nope:

 string s = "Hi";
 f(s,s);


Consider now:

 void f(string &x, string const &y) {
  char &ch = y[0];
  cout << x[0] << ch;
 }

Here, ch might _still_ be invalidated even though there
is no write operation -- merely binding a non-const
lvalue into the string _potentially_ allows writing,
and the string class has no choice but to do the copy
immediately. C++ overloads on the constness of the object
and not whether the context is an lvalue or rvalue context.

> One further question occurs: when may reallocation (and the resulting
> invalidation of the pointers) occur when reserve has not been called?
> For example, is the following guaranteed to work as expected:
>
>     string              s ;
>     size_t              i , j ;
>     assert( i < s.size() , j < s.size() ) ;
>     s[ i ] = s[ j ] ;
>
> I would like for the last statement to be well defined, but according to
> my reading of the standard, it isn't.  Reallocation is allowed in the
> non-const version of operator[] (and must be, if copy on write is to be
> a legal implementation).

 "must be" doesn't follow -- indexing _could_ return
a proxy object rather than a reference, couldn't it?

> This operator is called twice in the
> expression, and the compiler could very easily (and legally) generate
> both calls before using either of the results.  But if the second call
> reallocates, the reference returned by the first call is invalidated.
> (I cannot actually conceive of an implementation in which the second
> call reallocates, but I cannot find anything in the draft to guarantee
> that it won't.)

A naive COW implementation in which the "first" evaluated index operator
is the RHS above and for which a const alias is used will
invoke the CONST indexing operator, while evaluating
the LHS invokes the NONCONST indexing operator and triggers
reallocation. For example:

 string s = "Hello";
 string const &cs = s;
 s[0] = cs[0];

is almost certain to fail in a naive COW implementation
if the RHS indexing happens to return a lvalue instead
of an rvalue.

Here's another case:

 s.insert(const_iterator, "x");

[No, there's nothing wrong with inserting at a const iterator.
It is the string object before the dot (.) that must be
non-const to support insertion, NOT the iterator, which merely
marks the place where the insertion is to occur.]
But calling the "insert" method might trigger a reallocation.

I would not be surprised at an implemention
which split the representation when ANY iterator
was got from the string -- const or not. The same
fix can be applied to indexing (requiring a mutable
pointer inside the string object).

Is there a rule for writing COW that makes it work???

Yes. IF any method returns something that
binds directly to the representation, it must
be split BEFORE the binding is done.

For example, if the indexing operator uses
a suitable proxy, there's no problem.
Similarly, if the iterators are pairs:

 (string*, index)

there's no problem.

My current string class MTLstring does the latter AND is a naive
(non-COW) implementation AND allows accesses
"off the end" -- it is highly robust, and VERY hard to break
without doing "obvious" things (like deleting the string).
It permits you to insert or erase in the string
and will NOT invalidate any iterators. In fact,
my iterators CANNOT be invalidated except by deleting
the string. They may, of course, magically point to
a new character after an insertion. [In fact,
my iterators can be anchored at EITHER end of the
string, so a "end()-1" iterator is NOT moved
by an insertion at end()-2, because it's anchored
to the RHS end of the string, not the LHS end.]

The price is inefficiency: read/write can
be made very efficient by grabbing a pointer
to the raw array, but duplicate copying
make the class unsuitable for som applications.

But I'm just not interested in chasing the kind of
horrible problems that fragile implementations have, in a class
I personally conceive as a formatting
and message passing utility.

A reasonable compromise for a COW implementation
to make users of the indexing operators
and STL iterators pay for using non-OO syntax:

 s.put(index, chr)

does not suffer from these problems, for example.

--
John Max Skaller               voice: 61-2-566-2189
81 Glebe Point Rd              fax:   61-2-660-0850
GLEBE NSW 2037                 web: http://www.maxtal.com.au/~skaller/
AUSTRALIA                      email: skaller@maxtal.com.au

[ comp.std.c++ is moderated.  Submission address: std-c++@ncar.ucar.edu.
  Contact address: std-c++-request@ncar.ucar.edu.  The moderation policy
  is summarized in http://dogbert.lbl.gov/~matt/std-c++/policy.html. ]






Author: kanze@gabi.gabi-soft.fr (J. Kanze)
Date: 1996/01/25
Raw View
I've just tried out the following code:

    #include    <string>
    #include    <iostream.h>

    int
    main()
    {
        string              a , b ;
        a = "A lot of junk" ;
        a.reserve( 100 ) ;
        char*               p1 = &a[ 1 ] ;
        //  b = a ;
        a.insert( 6 , "---" ) ;
        char*               p2 = &a[ 1 ] ;
        cout << "Reserve test: " << (p1 == p2 ? "Passed" : "Failed") << endl ;
        return 0 ;
    }

With the one `standard' string class I have access to, the result is
"Failed".

As I interpret the standard, the above is required to work.  The
`Notes' after the function `reserve' state that: ``It is guaranteed
that no reallocation takes place during the insertions that happen
after reserve() takes place till the time when the size of the string
reaches the size specified by reserver().''  From further comments
concerning reallocation, I would conclude that the results of
operator[] should be valid until a reallocation takes place.  Is this
the intended interpretation?

Just a guess, but I would imagine from the above, even without seeing
the g++ code, that it uses the widespread copy on write technique.  My
own implementation of basic_string< T > uses copy on write, and it was
the amount of effort required to avoid the above problem that suggested
trying it out on someone elses implementation.  The presense of
`reserve' makes copy on write anything but trivial to implement
correctly.  (The exact case above is actually easy to avoid, and I'm
surprised that g++ has the error.  But try uncommenting the assignment
in the above code, and it becomes considerably more difficult to avoid
the problem.)

In a similar veine, what functions are implied by the word `insertion'
in the above quote?  I chose `insert' for my test, because if it isn't
an insertion, I don't know what is.  What about replace?  remove?
assign?  non-const at?

One further question occurs: when may reallocation (and the resulting
invalidation of the pointers) occur when reserve has not been called?
For example, is the following guaranteed to work as expected:

    string              s ;
    size_t              i , j ;
    assert( i < s.size() , j < s.size() ) ;
    a[ i ] = a[ j ] ;

I would like for the last statement to be well defined, but according to
my reading of the standard, it isn't.  Reallocation is allowed in the
non-const version of operator[] (and must be, if copy on write is to be
a legal implementation).  This operator is called twice in the
expression, and the compiler could very easily (and legally) generate
both calls before using either of the results.  But if the second call
reallocates, the reference returned by the first call is invalidated.
(I cannot actually conceive of an implementation in which the second
call reallocates, but I cannot find anything in the draft to guarantee
that it won't.)
--
James Kanze           (+33) 88 14 49 00          email: kanze@gabi-soft.fr
GABI Software, Sarl., 8 rue des Francs Bourgeois, 67000 Strasbourg, France
Conseils, itudes et rialisations en logiciel orienti objet --
              -- A la recherche d'une activiti dans une region francophone
---
[ comp.std.c++ is moderated.  Submission address: std-c++@ncar.ucar.edu.
  Contact address: std-c++-request@ncar.ucar.edu.  The moderation policy
  is summarized in http://dogbert.lbl.gov/~matt/std-c++/policy.html. ]