Topic: C++ multithreading model


Author: terekhov@web.de (Alexander Terekhov)
Date: Sat, 5 Mar 2005 00:14:18 GMT
Michael Pryhodko wrote:
[...]
>
> ===================================== MODERATOR'S COMMENT: ...

Followup-To set to comp.programming.threads

[... IA-32 ...]

> 1. Reads can be carried out *speculatively* and in any order."
> .
>
> "In a multiple-processor system, the following ordering rules apply:
> Individual processors use the same ordering rules as in a
> single-processor system."
>
> To my understanding, according to these rules 'v=A' could be executed
> speculatively before 'u=B' on the P4, Xeon & P6 family of processors. Where
> am I wrong?

To enforce ordering (acquire semantics of loads), the IA-32 processor
detects when another processor writes to a location that has been
speculatively read by an unretired instruction. The ordering violation is
detected and the processor rolls back.

You aren't paying attention to the links. Once again, this time
with a quote:

http://www.well.com/~aleks/CompOsPlan9/0005.html

<quote author=an architect at Intel>

The PPro does speculative and out-of-order loads.  However,
it has a mechanism called the "memory order buffer" to ensure
that the above memory ordering model is not violated.  Load
and store instructions do not get retired until the processor
can prove there are no memory ordering violations in the actual
order of execution that was used.  Stores do not get sent to
memory until they are ready to be retired.  If the processor
detects a memory ordering violation, it discards all unretired
operations (including the offending memory operation) and
restarts execution at the oldest unretired instruction.

</quote>

I suggest you also read this paper

http://www.cs.wisc.edu/~cain/pubs/micro01_correct_vp.pdf

on value prediction. It's about dependent loads and how to
correctly implement msync::dd_hlb. ;-)

>
> Also it should be noticed that according to 7.2.4:
> "The memory type range registers (MTRRs) can be used to strengthen or
> **weaken** memory ordering for specific area of physical memory (see
> Section 10.11., "Memory Type Range Registers (MTRRs)")."
> not sure if it could break your statements however...

Yes, but it doesn't apply to ordinary memory.

regards,
alexander.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]





Author: "Michael Pryhodko" <mpryhodko@westpac.com.au>
Date: Fri, 4 Mar 2005 18:12:45 CST
Raw View
> IA32 stores have release semantics (sink-load/store mbar for
> preceding loads/stores in the program order).
>
> IA32 loads have acquire semantics (hoist-load/store mbar for
> subsequent loads/stores in the program order)

Because IA-64 emulates IA-32 instruction using rules described in
'6.3.4 Memory Ordering Interactions' it does not mean that IA-32
store/loads executed on IA-32 hardware have relase/acquire semantic.
Consider example:

A=0
B=0

P1      P2
A=1     u=B
B=1     v=A

if every read is 'acquire', and every store is 'release' (using
'acquire/release' definition of RC model) outcome (u,v) = (1,0) is
impossible. We have two possibilities to get this outcome:
1. writes get reordered. Impossible on IA-32 according to '7.2.2 Memory
ordering for P4, Xeon & P6' of Vol3 IA-32 Manual:
"Writes to memory are always carried out in program order"
.
"Writes by a single processor are observed in the same order by all
processor"
2. reads get reordered, i.e. v=A executed before 'u=B'. As far as I
understand it is possible according to the same chapter in IA-32
manual:
"In a single-processor system for memory regions defined as write-back
cacheable, the following ordering rules apply:
.
1. Reads can be carried out *speculatively* and in any order."
.

"In a multiple-processor system, the following ordering rules apply:
Individual processors use the same ordering rules as in a
single-processor system."

To my understanding according to these rules 'v=A' could be executed
speculatively before 'u=B' on P4, Xeon & P6 family of processors. Where
I am wrong?


Also it should be noticed that according to 7.2.4:
"The memory type range registers (MTRRs) can be used to strengthen or
**weaken** memory ordering for specific area of physical memory (see
Section 10.11., "Memory Type Range Registers (MTRRs)")."
not sure if it could break your statements however...


> But IA64 is also overly constrained, so to speak. Think of a
> read-write lock, for example.
[skip]

Sorry, Alexander, I am not a pthreads guru. I'll try to get through all
those bushes later when I'll get some free time. Meanwhile -- could you
describe in 'human words' why do you think IA-64 is constrained? and in
comparison with what?
Because in this way -- every memory model is constrained in comparison
with model that does not give any guarantee...


And by the way -- all these speculations have almost nothing to do with
'C++ MT model' subject and my idea about it. :)

Bye.
Sincerely yours, Michael.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]





Author: "Michael Pryhodko" <mpryhodko@westpac.com.au>
Date: Mon, 28 Feb 2005 11:06:00 CST
>>>IA-32 (in its current incarnation with acquire-loads, release-stores
>>>[for ordinary memory] and fully-fenced [acquire+release] LOCK-stuff)
>>>is overly constrained.
>>
>> Could you be a bit more specific? I can't find word 'acquire' in
>> IA-32 manual.
>
> It's implicit. It's stated explicitly in Itanium Architecture
> Software Developer's Manual, 6.3.4 Memory Ordering Interactions.
> "IA-32 instructions are mapped into the Itanium memory ordering
> model as follows...".

Well, according to my investigations:
1. IA-32 has nothing to do with 'acquire' and 'release' semantics
2. The Itanium architecture introduces acquire & release semantics, but a
quick glance over the manuals did not reveal any "over-constraint"

What exactly did you mean by your statement? "Itanium's
acquire-release is overly constrained in comparison with the
acquire-release of the RCpc model (as it is described in 'WRL-95-9.pdf')"?


By the way -- thank you for the reading suggestion. WRL-95-9.pdf turned out
to be a very enlightening paper which covers almost every aspect of
multiprocessor programming (except maybe simultaneous access to the
same memory location with differently sized modification operations,
i.e. 4-byte & 2-byte modifications).

I think that with the idea described in my "paper" I intuitively hit a very
nice (maybe not the best) spot between [easy&slow] and [complex&efficient].
Here are some arguments:
1. The C++ memory consistency model (C-model) should not give any
guarantees about any order. Consider Fig. 2.24 from WRL-95-9.pdf: in
order for the C-model to be portable you need to place it below on this
diagram and connect it with arrows to the RCpc, PSO, RMO and PowerPC
models. Now:
a) suppose we choose the C-model to be a "union" of these models. In this
case future models which are more relaxed than, or unrelated to, the
C-model will face problems due to incompatibilities with existing C++ MT
software.
b) if we choose the C-model to be the least strict of all possible models
we allow support of any future memory model. WRL-95-9 mentions algorithms
that could be implemented on some models without any means of fencing or
synchronisation. Our C-model will not support such programs due to its
"excessive relaxing", but this could be solved with something like this:

#pragma coderegion(begin, memmodel("PowerPC"))

// code that assumes PowerPC memory consistency model

#pragma coderegion(end, memmodel("PowerPC"))

When compiled on hardware with the RCpc memmodel, the compiler should
generate code that emulates whichever memmodel "closest" to RCpc could run
PowerPC code according to Fig. 2.24. This is the IBM-370 memmodel in our
case. If our hardware has no way to emulate IBM-370 it should fall back
to the next model it can -- SC in our case. It is obvious that any
hardware should be able to emulate the SC model.
Also, a high-level compiler warning should be generated if the resulting
code is too inefficient.

2. You could extend my idea with acquire/release semantics, but I
believe the gains are too low in comparison with the complexity (if you
consider the general developer). Personally, I think that the "right"
solution would be to adopt the programmer-centric model described in
WRL-95-9. My idea is similar to it in the following ways:
a) 'fences' are somewhat similar to PL marks
b) both approaches use the most relaxed memmodel -- i.e. they do not give
any guarantee about any order

It seems that my idea is simpler than PL and, combined with
'coderegions', could be reasonably powerful and flexible (especially if
you allow memmodel-specific constructs inside of regions, i.e.
'acquire/release' inside of an RCpc region, 'sync' inside of 'PowerPC'
regions).
Also I am fascinated by the idea of 'participation levels', which allows
some code reuse and speed gains if, for example, your app works only
on one processor in a multi-processor environment. :))


Bye.
Sincerely yours, Michael.






Author: terekhov@web.de (Alexander Terekhov)
Date: Tue, 1 Mar 2005 02:20:01 GMT
Michael Pryhodko wrote:
>
> >>>IA-32 (in its current incarnation with acquire-loads, release-stores
> >>>[for ordinary memory] and fully-fenced [acquire+release] LOCK-stuff)
> >>>is overly constrained.
> >>
> >> Could you be a bit more specific? I can't find word 'acquire' in
> >> IA-32 manual.
> >
> > It's implicit. It's stated explicitly in Itanium Architecture
> > Software Developer's Manual, 6.3.4 Memory Ordering Interactions.
> > "IA-32 instructions are mapped into the Itanium memory ordering
> > model as follows...".
>
> Well, according to my investigations:
> 1. IA-32 has nothing to do with 'acquire' and 'release' semantics

IA32 stores have release semantics (sink-load/store mbar for
preceding loads/stores in the program order).

IA32 loads have acquire semantics (hoist-load/store mbar for
subsequent loads/stores in the program order)

IA32 lock-stuff instructions have release and acquire semantics
(fully fenced).

It means that, for example, you still need a store-load fence
(sink-store + hoist-load) on IA32 for things like

<http://www.well.com/~aleks/CompOsPlan9/0005.html>
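
For concreteness, this is the classic "store buffering" case. Written in
today's C++ atomics notation (which postdates this thread, so take the
snippet purely as an illustrative sketch), every store below is a release
and every load an acquire, yet without the two full fences the outcome
r1 == r2 == 0 is still allowed:

     #include <atomic>
     #include <thread>
     #include <cstdio>

     std::atomic<int> x(0), y(0);
     int r1, r2;

     void p1() {
       x.store(1, std::memory_order_release);
       // store-load fence -- the part IA32's implicit acquire/release
       // does not give you; drop both fences and "0 0" becomes possible
       std::atomic_thread_fence(std::memory_order_seq_cst);
       r1 = y.load(std::memory_order_acquire);
     }

     void p2() {
       y.store(1, std::memory_order_release);
       std::atomic_thread_fence(std::memory_order_seq_cst);
       r2 = x.load(std::memory_order_acquire);
     }

     int main() {
       std::thread t1(p1), t2(p2);
       t1.join(); t2.join();
       std::printf("r1=%d r2=%d\n", r1, r2);  // never prints "r1=0 r2=0"
     }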

> 2. The Itanium architecture introduces acquire & release semantics, but a
> quick glance over the manuals did not reveal any "over-constraint"

First off, the comment was about IA32, not IA64. But IA64 is also
overly constrained, so to speak. Think of a read-write lock, for
example. You certainly don't need full acquire/release for read-only
locking. There's no need to constrain stores for a sensibly defined
"rdlock()". See

<http://www.google.de/groups?selm=41C301DF.5B50EDB7%40web.de>,

I mean:

---
[...] There is a happens-before inter-thread ordering with respect
to preceding (in program and inter-thread order) modifications of
memory locations in

[...]

  a thread calling pthread_rwlock_unlock() on a specified read-
  write lock releasing a write lock and read accesses to modified
  memory locations in another thread after that thread acquires a
  read lock using pthread_rwlock_rdlock() or
  pthread_rwlock_timedrdlock() or pthread_rwlock_tryrdlock() on
  the same read-write lock unlocked by the calling thread;

  a thread calling pthread_rwlock_unlock() on a specified read-
  write lock releasing a write lock and read/write accesses to
  modified memory locations in another thread after that thread
  acquires a write lock using pthread_rwlock_timedwrlock() or
  pthread_rwlock_trywrlock() or pthread_rwlock_wrlock() on
  the same read-write lock unlocked by the calling thread;
---

and

<http://www.opengroup.org/austin/mailarchives/ag/msg07704.html>

---
[...]

     #include <stdio.h>
     #include <pthread.h>

     pthread_rwlock_t rwlock;

     int var;

     void * thread(void *) {
       pthread_rwlock_rdlock(&rwlock);
       var = 0;
       //pthread_rwlock_unlock(&rwlock);
       return 0;
     }

     int main() {
       pthread_t tid;
       pthread_rwlock_init(&rwlock, 0);
       pthread_rwlock_wrlock(&rwlock);
       pthread_create(&tid, 0, thread, 0);
       var = 1;
       pthread_rwlock_unlock(&rwlock);
       pthread_join(tid, 0);
       printf("%d\n", var);
     }

Under the original wording there are no data races here, the
program is data race-free and shall print 0 (out-of-resource
conditions aside for a moment).

Under the tightened rules there is a data race here, the
program is data race-UNfree and may not only print 1, but
also "format the hard disk."
---

regards,
alexander.






Author: xxx@xxx.com ("SenderX")
Date: Wed, 2 Mar 2005 20:08:08 GMT
> Could you be a bit more specific? I can't find word 'acquire' in IA-32
> manual. Does 'acquire-load' = LFENCE, 'release-stores' = SFENCE,

Nope. lfence affects loads, and sfence affects stores; acquire & release
affect both loads "and" stores. Current x86 doesn't require explicit
barriers ( LOCK prefix aside ) unless your algorithm simply cannot cope with
the fact that a store followed by a load from "another location" can, and
will, be reordered... And, the CPU isn't the only thing that can reorder
something critical. Here are some more highly relevant links:


http://groups-beta.google.com/group/comp.programming.threads/msg/423df394a0370fa6
( C compiler can screw you over real good!! )


http://groups-beta.google.com/group/comp.programming.threads/msg/1d9d4e6b888609e4
( how can we even attempt to fix this? )


http://groups-beta.google.com/group/comp.programming.threads/msg/ba30ef0314a585fb
( snippets from a rough draft of my AppCore documentation )


http://groups-beta.google.com/group/comp.programming.threads/msg/8ae09f9e9bea21b9
( barriers are not needed, they're just there for docs. Alex points that out )


<< please read "all" links and threads! >>
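
To make the compiler-reordering point concrete, here is the GCC-flavoured
idiom in use today (the macro names are invented for illustration; only
the inline-asm constructs themselves are real): an empty asm with a
"memory" clobber stops the *compiler* from moving or caching accesses
across it, while MFENCE is what you need when the store-followed-by-load-
from-another-location reordering done by the CPU itself matters.

     /* compiler-only barrier: emits no instruction, but GCC must not   */
     /* reorder or cache memory accesses across it                      */
     #define COMPILER_BARRIER()   __asm__ __volatile__("" ::: "memory")

     /* full store-load fence on x86 (SSE2); also a compiler barrier    */
     #define STORE_LOAD_FENCE()   __asm__ __volatile__("mfence" ::: "memory")

     int data = 0;
     int flag = 0;

     void publish(void) {
       data = 42;
       COMPILER_BARRIER();  /* x86 keeps these two stores in order;     */
       flag = 1;            /* the compiler, however, does not          */
     }

     int consume(void) {    /* returns -1 if nothing was published yet  */
       if (flag) {
         COMPILER_BARRIER();  /* keep the compiler from hoisting 'data' */
         return data;
       }
       return -1;
     }

By today's rules this is formally still a data race, which is exactly the
"C compiler can screw you over" worry in the first link above.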



After reading all of that, you should know that assembly language is the
safest place you can possibly be wrt lock-free programming, IMHO. As of now,
unfortunately, using C/C++ for hard-core thread programming is considered
"highly" dangerous!

;(...


Even if a compiler could recognize Alex's nice atomic<> solution, I would
probably use externally assembled function libraries for all of my critical
synchronization primitives. Assembly language really eases my mind when it
comes down to putting a lock-free algorithm into production server code...

;)


--
http://appcore.home.comcast.net/
(portable lock-free data-structures)







Author: richtesi@informatik.tu-muenchen.de (Simon Richter)
Date: Thu, 3 Mar 2005 06:24:00 GMT
Hi,

"Michael Pryhodko" <mpryhodko@westpac.com.au> writes:

>> what I didn't find is a discussion of the problem that a compiler
>> usually cannot tell the execution environment and hence must always
>> assume worst case, which would be that there is another CPU in shared
>> memory, but with a separate cache that will have to be invalidated.

>Because this "paper" describes only idea of how C++ MT model could look
>like.

Yes, that's the point. Your proposal makes assumptions about the
implementation, so the programmer using an interface derived from it will
automatically do so as well. Differing implementations would need to
guess things that are not specified by the programmer, because the
interface did not ask her to specify them.

You may notice that the language specification basically talks about an
abstract machine that makes only very few guarantees (IMO still too many,
but that's a different matter), like sequence points and volatile
accesses. This model is capable of doing MT, although in a suboptimal way.

I can see the need for two specifications: One that changes the memory
model to be more fine-grained about such things and gives the compiler
enough hints to use platform-specific optimisations, and an API that a
programmer can use to describe what she intends to do. The library is
supposed to hand down the information gathered to the compiler, which can
use it to optimize the code.

I think the right way to go about this would be to write an API that wraps
MT for the programmer and preserves as much information as possible. If
your architecture does not make a distinction between several cases, e.g.
of how many components are synchronized, then still define two different
lock types and let a library implementor typedef them to each other.

An evil example could be an extension to the IA-64 ABI that will set a
certain predicate register on a context switch if all other threads from
the same application are running on the same CPU. A lock that is taken in a
fast path will definitely benefit from making the fence conditional in this
case, and this will possibly outweigh the performance hit taken from having
to take the CPU away from this thread, clear the predicate, do a fence and
give back the CPU in case another thread becomes executable and another CPU
is available.

>Also speaking by language proposed in 'WRL-95-9.pdf' this model is
>'system-centric' and compiler needs only to guarantee fence_X behavior,
>everything else is the same is it is now -- i.e. it could use any
>optimization technique. While proposed model will not be 100% efficient
>on every platform -- it still gives you portability, simplicity and
>efficiency. And even reusability -- by substituting necessary elements
>with your types you could use exisiting MT algorithms not only for
>structures shared by processors in memory, but also for structures
>shared by computers on network and so on.

Yes, but I'm not sure how much an application programmer should know about
the execution environment. Ideally, I think "nothing" would be best, as then
there can be no assumptions that may no longer hold true in another
environment. I'd even go as far as to say that any access to an object that
was last written in another thread is UB unless either the object is of a
type that defines methods for it to be sent to another thread and said
methods have been invoked, or another object that causes a barrier when
accessed (that would be a special kind of mutex that has the highest
penalty to take) has been accessed. The second option is basically a way to
make old applications work on commodity hardware, i.e. it can obviously not
be used to allow threads to run in differing memory layouts or on different
machines, but these applications were not meant to do that. All programs
using the "new" or fine-grained ways of manipulating non-local data can be
recompiled to work in different environments by simply exchanging the type
of the thread-safety container.

>:) dreams-dreams... As far as I know people doesn't have common opinion
>over model for shared-memory, not speaking about model that will
>encapsulate both shm and pipes.

Pipes can and will be easily emulated on systems where SHM is cheaper than
a pipe by creating a SHM ring buffer with a bunch of semaphores. The
important point is that this is to be done by the standard library so the
code can be exchanged.
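
For what it's worth, that construction is just the textbook bounded
buffer. A minimal sketch with POSIX semaphores (illustrative only, not
taken from any particular library):

     #include <pthread.h>
     #include <semaphore.h>

     #define RING_SIZE 64

     static char ring[RING_SIZE];
     static int head, tail;                /* producer / consumer cursors */
     static sem_t free_slots, used_slots;  /* counting semaphores         */
     static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

     void ring_init(void) {
       sem_init(&free_slots, 0, RING_SIZE);  /* every slot starts empty   */
       sem_init(&used_slots, 0, 0);
     }

     void ring_put(char c) {         /* the pipe's "write end"            */
       sem_wait(&free_slots);        /* block while the ring is full      */
       pthread_mutex_lock(&lock);
       ring[head] = c;
       head = (head + 1) % RING_SIZE;
       pthread_mutex_unlock(&lock);
       sem_post(&used_slots);        /* wake a blocked reader             */
     }

     char ring_get(void) {           /* the pipe's "read end"             */
       sem_wait(&used_slots);        /* block while the ring is empty     */
       pthread_mutex_lock(&lock);
       char c = ring[tail];
       tail = (tail + 1) % RING_SIZE;
       pthread_mutex_unlock(&lock);
       sem_post(&free_slots);        /* wake a blocked writer             */
       return c;
     }

The semaphore and mutex calls supply whatever fences the platform needs,
which is why the standard library is the right place to hide the choice
between this and a real pipe.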

   Simon






Author: "Michael Pryhodko" <mpryhodko@westpac.com.au>
Date: Thu, 3 Mar 2005 23:06:46 CST
===================================== MODERATOR'S COMMENT:

Two moderation notes:
(1) The FAQ provides information on what to do if an article doesn't appear.
(2) This discussion is at this point only tenuously related to C++, and has
been allowed as a detailed understanding of the semantics of parallel
hardware will be necessary in order to provide solid foundations for
concurrency support in any language standard.  Please do ensure that
followups are helpful in trying to consider the future of concurrency
support in C++.


===================================== END OF MODERATOR'S COMMENT
**** This is a repost of a message that should have appeared here 3 days
ago ****
**** What happened to this discussion group? A new policy? ****



> IA32 stores have release semantics (sink-load/store mbar for
> preceding loads/stores in the program order).
>
> IA32 loads have acquire semantics (hoist-load/store mbar for
> subsequent loads/stores in the program order)

The fact that IA-64 emulates IA-32 instructions using the rules described in
'6.3.4 Memory Ordering Interactions' does not mean that IA-32 stores/loads
executed on IA-32 hardware have release/acquire semantics.
Consider this example:

A=0
B=0

P1      P2
A=1     u=B
B=1     v=A

If every read is 'acquire' and every store is 'release' (using the
'acquire/release' definition of the RC model), the outcome (u,v) = (1,0) is
impossible. There are two ways to get this outcome:
1. Writes get reordered. Impossible on IA-32 according to '7.2.2 Memory
Ordering for P4, Xeon & P6' of Vol. 3 of the IA-32 manual:
"Writes to memory are always carried out in program order"
.
"Writes by a single processor are observed in the same order by all
processor"
2. Reads get reordered, i.e. 'v=A' is executed before 'u=B'. As far as I
understand, this is possible according to the same chapter of the IA-32
manual:
"In a single-processor system for memory regions defined as write-back
cacheable, the following ordering rules apply:
.
1. Reads can be carried out *speculatively* and in any order."
.

"In a multiple-processor system, the following ordering rules apply:
Individual processors use the same ordering rules as in a
single-processor system."

To my understanding, according to these rules 'v=A' could be executed
speculatively before 'u=B' on the P4, Xeon & P6 family of processors. Where
am I wrong?
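
For readers following along, the guarantee in question is the classic
message-passing pattern. In today's C++ atomics notation (used here purely
as an illustration, since it postdates this thread), the acquire/release
mapping Alexander describes rules the (1,0) outcome out like this:

     #include <atomic>
     #include <thread>
     #include <cstdio>

     std::atomic<int> A(0), B(0);
     int u, v;

     void p1() {                     // A=1; B=1, both release stores
       A.store(1, std::memory_order_release);
       B.store(1, std::memory_order_release);
     }

     void p2() {                     // u=B; v=A, both acquire loads
       u = B.load(std::memory_order_acquire);
       v = A.load(std::memory_order_acquire);
     }

     int main() {
       std::thread t1(p1), t2(p2);
       t1.join(); t2.join();
       // If u == 1, the acquire load of B synchronized with the release
       // store of B, so the earlier store A=1 must be visible: v == 1.
       // (u,v) can be (0,0), (0,1) or (1,1), but never (1,0).
       std::printf("u=%d v=%d\n", u, v);
     }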


Also, it should be noted that according to 7.2.4:
"The memory type range registers (MTRRs) can be used to strengthen or
**weaken** memory ordering for specific area of physical memory (see
Section 10.11., "Memory Type Range Registers (MTRRs)")."
I'm not sure whether it breaks your statements, however...


> But IA64 is also overly constrained, so to speak. Think of a
> read-write lock, for example.
[skip]

Sorry, Alexander, I am not a pthreads guru. I'll try to get through all
those bushes later when I get some free time. Meanwhile -- could you
describe in 'human words' why you think IA-64 is constrained? And in
comparison with what?
Because by that reasoning every memory model is constrained in comparison
with a model that does not give any guarantees...


And by the way -- all these speculations have almost nothing to do with
the 'C++ MT model' subject and my idea about it. :)

Bye.
Sincerely yours, Michael.






Author: terekhov@web.de (Alexander Terekhov)
Date: Thu, 17 Feb 2005 18:27:17 GMT
Michael Pryhodko wrote:
>
> > IA-32 (in its current incarnation with acquire-loads, release-stores
> > [for ordinary memory] and fully-fenced [acquire+release] LOCK-stuff)
> > is overly constrained.
>
> Could you be a bit more specific? I can't find word 'acquire' in IA-32
> manual.

It's implicit. It's stated explicitly in Itanium Architecture
Software Developer's Manual, 6.3.4 Memory Ordering Interactions.
"IA-32 instructions are mapped into the Itanium memory ordering
model as follows...".


>         Does 'acquire-load' = LFENCE, 'release-stores' = SFENCE,

No.

> 'fully-fenced' = MFENCE?

Sort of.

regards,
alexander.






Author: ben-public-nospam@decadentplace.org.uk (Ben Hutchings)
Date: Thu, 17 Feb 2005 18:29:15 GMT
Michael Pryhodko wrote:
>> IA-32 (in its current incarnation with acquire-loads, release-stores
>> [for ordinary memory] and fully-fenced [acquire+release] LOCK-stuff)
>> is overly constrained.
>
> Could you be a bit more specific? I can't find word 'acquire' in IA-32
> manual.

You're looking in the wrong place.  Try the IA64 manual.

> Does 'acquire-load' = LFENCE, 'release-stores' = SFENCE,
> 'fully-fenced' = MFENCE?

An "acquire" barrier prevents memory accesses that logically come
after it from being moved before it, and a "release" barrier prevents
memory accesses that logically come before it from being moved after
it.  The names come from their use in implementing mutexes.  Such a
barrier is only meaningful when combined with a particular memory
access; hence acquire-load = load with acquire barrier and release-
store = store with release barrier.

A full memory barrier prevents movement either way; it needn't be
combined with any memory access to be useful.
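
A tiny spinlock shows where the names come from; the sketch below uses
today's C++ atomics notation purely to illustrate the two one-way
barriers just described:

     #include <atomic>
     #include <thread>

     std::atomic<bool> locked(false);
     int counter = 0;

     void lock() {
       // acquire: accesses after lock() may not be hoisted above it
       while (locked.exchange(true, std::memory_order_acquire))
         ;  // spin
     }

     void unlock() {
       // release: accesses before unlock() may not sink below it
       locked.store(false, std::memory_order_release);
     }

     int main() {
       std::thread t([] { lock(); ++counter; unlock(); });
       lock(); ++counter; unlock();
       t.join();
       return counter == 2 ? 0 : 1;  // always 0: the lock orders the increments
     }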

> Also MFENCE is cheaper than LOCK stuff.

It should not be surprising that a full memory barrier that isn't
combined with memory access can be cheaper than one that is.

>> As for reading, start here:
>> http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-9.pdf
>
> Thanks. I will read it in next few days.
>
> Model I proposed is natural to IA-32. I was wondering:
> 1. could it be implemented efficiently on other platforms?

Not on all of them.  Full memory barriers are really not necessary that often,
and some architectures (such as IA64) provide cheaper, less
restrictive barriers.

The PowerPC architecture at least doesn't have CAS; it has two
separate instructions for load-and-watch and store-if-not-modified (I
forget what they are actually called).  These are a bit more powerful
than CAS but it may be inefficient to emulate CAS on top of them.

> 2. how much we will 'lose' in terms of possible efficiency if C++
> accepts this MT model?

Given that there's no standard means to do multithreading in C++,
and that threading APIs currently do not generally provide the
necessary primitives for lock-free algorithms, it could be a
step forward.  However there is an existing discussion group that
has brought together a number of experts to draft something like
this.  If you're interested in getting involved in that, mail
Andrei Alexandrescu who has organised it.

--
Ben Hutchings
I say we take off; nuke the site from orbit.  It's the only way to be sure.






Author: Simon Richter <richtesi@informatik.tu-muenchen.de>
Date: Thu, 17 Feb 2005 12:27:52 CST

Hi,

Michael Pryhodko schrieb:

> Also MFENCE is cheaper than LOCK stuff.

Because it serves a different purpose. I must admit that I just quickly
glanced over your paper during a compile run, but what I didn't find is
a discussion of the problem that a compiler usually cannot tell the
execution environment and hence must always assume the worst case, which
would be that there is another CPU in shared memory, but with a separate
cache that will have to be invalidated.

The present "model", using pthread_* functions and the like, basically
says that any writes to volatile data must have happened at the time of
the function call (as the function might perform volatile accesses), so
the accesses are at least scheduled in the CPU. The Thread management
function then performs some mutex access that will include a fence, so
the synchronisation with the other thread is the last thing done. The
other thread may now pick up the data, possibly fetching the page
containing the actual data if it happens to be running on another CPU.

For this to work, the CPUs must communicate cache invalidations properly
(which is something not all CPU architectures do), and since the write
access invalidating the page in the other CPUs could very well happen
just before the mutex access (and in fact often does), the fence is very
expensive as it requires synchronisation of the page caches with all
other CPUs right at this point before continuing execution.

IMO, a good thread model would take care of this bottleneck by providing
a dedicated data transfer service that could "send" and "receive"
objects in memory by doing whatever is necessary on that platform to
provide coherency (which can go as far as "emulate" cache invalidation
on platforms that don't support it in hardware, or copy data on
non-unified memory architectures or virtual parallel machines). The
fallback "volatile object + fence" can serve as a default implementation
or can be replaced by "volatile object" for single-CPU systems (but the
resulting executables will not be MP-safe then).
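
One hypothetical shape for such a send/receive service, sketched in
today's C++ (the name transfer_channel and its interface are invented
here for illustration): on a commodity SMP the mutex and condition
variable already supply the fences, while a more exotic platform would
also flush, invalidate or copy inside send() and receive().

     #include <condition_variable>
     #include <mutex>
     #include <queue>
     #include <utility>

     template <class T>
     class transfer_channel {
       std::mutex m;
       std::condition_variable cv;
       std::queue<T> q;
     public:
       void send(T obj) {              // publish the object to another thread
         std::lock_guard<std::mutex> lk(m);
         q.push(std::move(obj));
         cv.notify_one();
       }
       T receive() {                   // block until something has been sent
         std::unique_lock<std::mutex> lk(m);
         cv.wait(lk, [this] { return !q.empty(); });
         T obj = std::move(q.front());
         q.pop();
         return obj;
       }
     };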

Two additional issues I would also add to the discussion:

   - the compiler should know about the synchronisation semantics to
allow for better optimisation. On platforms with explicit invalidation
semantics (rather than bus snooping), it would definitely be a major
plus if the pages in question were invalidated only once, rather
than for each access (as long as there is no mutex access, the only
thing that is guaranteed is the order of the object writes, not that
they happen instantaneously, so we're within our rights not to invalidate
the page until the next mutex access); however, that should happen as
early as possible (i.e. after the last write). The compiler could, at
this point, decide to schedule some other operations that do not involve
volatile objects between the cache invalidation and the mutex access.

   - there are computer designs that use pipes rather than shared memory
for inter-component communication, allowing for the streaming of data. A
lot of data is, in fact, streamed. A thread model should provide both
pipe- and shared-memory-based communication, emulating pipes over shm
and vice versa if necessary.

     Simon








Author: "Michael Pryhodko" <mpryhodko@westpac.com.au>
Date: Wed, 23 Feb 2005 01:00:33 CST
>> Could you be a bit more specific? I can't find word 'acquire' in
>> IA-32 manual.
>
> You're looking in the wrong place.  Try the IA64 manual.

Thanks. I am a bit "outdated", it seems. :)


> Does 'acquire-load' = LFENCE, 'release-stores' = SFENCE,
> 'fully-fenced' = MFENCE?

[skip explanation]

Hmm... When I was working on my idea I was inclined to introduce two
types of fences -- one which prevents following accesses from passing
it, another which prevents preceding accesses. That is almost the same as
the acquire-release mechanism. But in the end I decided to stick to the
fully-fenced mechanism because it is much simpler and has almost the
same efficiency.


>> Model I proposed is natural to IA-32. I was wondering:
>> 1. could it be implemented efficiently on other platforms?
>
> Not all.  Full memory barriers are really not necessary that often,
> and some architectures (such as IA64) provide cheaper, less
> restrictive barriers.

Strictly speaking, most current MT software is lock-based...


> The PowerPC architecture at least doesn't have CAS; it has two
> separate instructions for load-and-watch and store-if-not-modified (I
> forget what they are actually called).  These are a bit more powerful
> than CAS but it may be inefficient to emulate CAS on top of them.

LL (load linked) / SC (store conditional) -- I knew about them. With them
you could implement a 'pop-one-element' operation on a singly-linked list;
with CAS it is impossible, imho.
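
The difficulty being alluded to is the ABA problem. A sketch of the naive
CAS-based pop (today's notation, purely illustrative) shows where it
bites, and why a store-conditional that fails after *any* intervening
store to the head does not:

     #include <atomic>

     struct Node { Node* next; };
     std::atomic<Node*> head(nullptr);

     Node* pop() {
       Node* old = head.load();
       // ABA hazard: between the load above and the CAS below, another
       // thread may pop 'old', pop/free its successor, and push 'old'
       // back. The CAS still sees the same pointer value and succeeds,
       // but the 'old->next' we installed is stale. An LL/SC pair fails
       // instead, because the reservation is lost on any write to 'head'.
       while (old && !head.compare_exchange_weak(old, old->next))
         ;
       return old;
     }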


> If you're interested in getting involved in that, mail Andrei
> Alexandrescu who has organised it.

I'll try.


Bye.
Sincerely yours, Michael.






Author: "Michael Pryhodko" <mpryhodko@westpac.com.au>
Date: Wed, 23 Feb 2005 10:00:36 CST
> what I didn't find is a discussion of the problem that a compiler
> usually cannot tell the execution environment and hence must always
> assume worst case, which would be that there is another CPU in shared
> memory, but with a separate cache that will have to be invalidated.

Because this "paper" describes only idea of how C++ MT model could look
like.
Also speaking by language proposed in 'WRL-95-9.pdf' this model is
'system-centric' and compiler needs only to guarantee fence_X behavior,
everything else is the same is it is now -- i.e. it could use any
optimization technique. While proposed model will not be 100% efficient
on every platform -- it still gives you portability, simplicity and
efficiency. And even reusability -- by substituting necessary elements
with your types you could use exisiting MT algorithms not only for
structures shared by processors in memory, but also for structures
shared by computers on network and so on.
I can not say much about 'programmer-centric' model. I didn't read all
of 'WRL-95-9.pdf' and it will take some time to digest this
information, however. :) At this time it looks too complex and I have
no idea how to put it into C++ painlessly.


> - there are computer designs that use pipes rather than shared memory
> for inter-component communication, allowing for the streaming of
> data. A lot of data is, in fact, streamed. A thread model should
> provide both pipe- and shared-memory-based communication, emulating
> pipes over shm and vice versa if necessary.

:) dreams-dreams... As far as I know people don't have a common opinion
on a model for shared memory, not to speak of a model that would
encapsulate both shm and pipes.

Bye.
Sincerely yours, Michael.






Author: "Michael Pryhodko" <mpryhodko@westpac.com.au>
Date: Tue, 15 Feb 2005 09:46:49 CST
Recently I've been reading a lot about MT issues in general, the Intel
IA-32 manual and C++-specific MT issues.
I came up with an idea of how a C++ MT model could look. I am definitely
not an article writer, it seems, but I tried. :)

Well, what do you think about this article:
http://home.iprimus.com.au/crusader7/prg/cpp_mt_model.htm

Bye.
Sincerely yours, Michael.






Author: terekhov@web.de (Alexander Terekhov)
Date: Tue, 15 Feb 2005 21:40:55 GMT
Michael Pryhodko wrote:
>
> Recently I've been reading a lot about MT issues in general, Intel
> IA-32 manual and C++ specific MT issues.

IA-32 (in its current incarnation with acquire-loads, release-stores
[for ordinary memory] and fully-fenced [acquire+release] LOCK-stuff)
is overly constrained. As for reading, start here:

http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-9.pdf

regards,
alexander.






Author: "Michael Pryhodko" <mpryhodko@westpac.com.au>
Date: Wed, 16 Feb 2005 18:40:26 CST
> IA-32 (in its current incarnation with acquire-loads, release-stores
> [for ordinary memory] and fully-fenced [acquire+release] LOCK-stuff)
> is overly constrained.

Could you be a bit more specific? I can't find word 'acquire' in IA-32
manual. Does 'acquire-load' = LFENCE, 'release-stores' = SFENCE,
'fully-fenced' = MFENCE?

Also MFENCE is cheaper than LOCK stuff.


> As for reading, start here:
> http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-9.pdf

Thanks. I will read it in next few days.

Model I proposed is natural to IA-32. I was wondering:
1. could it be implemented efficiently on other platforms?
2. how much we will 'lose' in terms of possible efficiency if C++
accepts this MT model?

Bye.
Sincerely yours, Michael.
