Topic: About introducing parallel constructs (comments)
Author: Máté Ferenc Nagy <nagymatef@gmail.com>
Date: Mon, 8 Apr 2013 06:36:09 -0700 (PDT)
I understand that standardizing something relatively new has its pitfalls.
Thus people say that it is wiser to wait and see what happens with all this
GPU hype, and after N years we'll come back to it; in the meantime other CPU
parallelization techniques will have matured enough to have earned their
place in the standard. This, IMHO, is shortsightedness. Not because it
doesn't do what I want. I understand that going ahead of practice means
taking a risk, and those having a say in the standard might not want to take
that risk; I can understand that. What I mean by 'shortsighted' is that
"waiting to see what happens" does not nearly emphasize enough what should
really be meant: "yes, we know that all this will have to have its place in
the standard roughly 3 to 5 years from now, but now is the time to think and
observe, as opposed to acting".
There is a difference between the two attitudes. The first believes that
something is going to happen, but because we cannot be absolutely sure that
it will, it does not plan for it and does not pollute either the core
language with keywords or the STL with things that will later prove to be a
pain once public demand for massive parallelism does arrive. The second, on
the other hand, believes that something is going to happen, just not exactly
in what manner, and tries to steer the wheel so that there will be room in
the standard when the time comes.
I could have started with the usual tale about the math co-processor, and
how you don't even think about it now. It is beyond question that near-future
APUs (let's call them that) will feature a multi-core latency-optimized unit
and a many-core throughput-optimized unit. If there is debate about this,
then there is really no point in continuing the discussion.
The kind of discussion such a topic would deserve would touch on the need to
introduce 'abstract' containers similar to concurrency::array that act as
handles to objects that might reside on a given device. Or can we safely
assume that memory will be coherent and pointer-compatible across all the
hardware we target in the near future, removing the need for such objects
and letting IGPs operate on any STL container? Or does the standard wish to
target dedicated GPUs, which definitely require such objects? These are
questions that need to be answered before any decision is made about
introducing parallelism into the language.
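
To make the first option concrete, here is a minimal sketch of what such a
handle type could look like (the name device_array and all of its members
are invented for illustration only; this is not an existing or proposed API):

    #include <cstddef>
    #include <vector>

    // Hypothetical handle, loosely modelled after concurrency::array: the
    // element storage may live on a discrete device, and host access goes
    // through an explicit copy.  Every name below is made up for this sketch.
    template <typename T>
    class device_array {
    public:
        explicit device_array(std::size_t n);             // allocate on the device
        device_array(const T* host_data, std::size_t n);  // copy host -> device

        void synchronize(std::vector<T>& host_out) const; // copy device -> host

        std::size_t size() const { return size_; }

    private:
        void*       device_ptr_ = nullptr; // opaque device allocation handle
        std::size_t size_       = 0;       // element count
    };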
I know it is hard to get ahead of practice, since it's not even obvious
whether dedicated GPUs will continue to exist 10 years from now, but if one
does not even try to think ahead, one will always be behind. And the way I
see it, the greatest setback for GPU/IGP acceleration is language support.
It's a pain to access the hardware. The language everywhere is some subset
of C/C++, but the programmer constantly has to jump through hoops just to
obtain a GPU context, not to mention a graphics-interop-capable context.
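
As a rough illustration of those hoops (a sketch only, with error handling
and platform selection trimmed), this is roughly what just getting a usable
GPU context looks like with the plain OpenCL C API; a graphics-interop
context needs additional platform-specific context properties on top of this:

    #include <CL/cl.h>
    #include <cstdio>

    int main() {
        cl_platform_id platform = nullptr;
        cl_device_id   device   = nullptr;
        cl_int         err      = CL_SUCCESS;

        // Pick the first platform and its first GPU device.
        clGetPlatformIDs(1, &platform, nullptr);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

        // Only now can a context be created; command queues, program
        // compilation and kernel setup still have to follow before any
        // actual work can be launched.
        cl_context context =
            clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
        if (err != CL_SUCCESS) {
            std::printf("failed to create a GPU context (%d)\n", err);
            return 1;
        }

        clReleaseContext(context);
    }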
I somehow envision that by the time things settle in this matter, modern
APIs will be experimenting with pulling the graphics pipeline (whatever state
it will be in by then) into this whole business, once again invalidating the
feat of having finally standardized at least the compute side of things.
Anyhow, thank you for your insight, and I hope to hear more from either you
or anyone else.
On Friday, April 5, 2013 at 11:26:13 UTC+2, Máté Ferenc Nagy wrote:
> This is the second time I'm writing this post, since the first one, which
> I wrote for roughly 1.5 hours, got lost. This time I'll try to keep it
> short; maybe that helps you guys read the post (as opposed to it being 3
> pages long, which nobody takes the time to read). Also, this is my first
> post here, so take it easy should I mess something up.
>
> I want to comment on proposals N3530 (OpenMP) and N3554 (parallel
> algorithms library).
>
> I believe that while both proposals are good in their own sense, they are
> flawed in different ways. They aim at something similar (parallelism, that
> is), but want to give it to the user in a completely different manner.
> Having parallel std::transform, std::reduce, ... algorithms surely would be
> great and must be considered. These would even make up for some of the
> OpenMP pragmas that can be leveraged for parallelism. The wording is
> completely different, but the result is somewhat the same. (I do understand
> that due to their different natures, namely that one wishes to be a
> low-level building block (N3530) while the other is a lot more limited,
> these two proposals are far from interchangeable; they merely have a great
> deal of overlap.)
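>
> For example (a sketch only; the std::par policy below follows the style of
> the N3554 draft and is not final or shipping syntax), the two proposals
> express roughly the same computation:
>
>     #include <algorithm>
>     #include <cmath>
>     #include <cstddef>
>     #include <vector>
>
>     void scale_omp(std::vector<float>& v) {
>         // N3530 direction: an OpenMP-style low-level building block.
>         const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(v.size());
>         #pragma omp parallel for
>         for (std::ptrdiff_t i = 0; i < n; ++i)
>             v[i] = std::sqrt(v[i]);
>     }
>
>     void scale_alg(std::vector<float>& v) {
>         // N3554 direction: the same result as a parallel algorithm.
>         std::transform(std::par, v.begin(), v.end(), v.begin(),
>                        [](float x) { return std::sqrt(x); });
>     }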
>
> One being a low-level building block that could even be used as a means
> for vendors to implement their parallel algorithms library, and the other
> being an abstraction of parallel methods operating on STL containers that
> leaves vendors the freedom to provide implementations with even GPU
> support, IMHO both seem to be missing the bigger picture. N3554 proves that
> a lot of investigation was put into the syntax and aims of other parallel
> libraries, and it even states that vendors are allowed to have their own
> implementations. This is all very nice, but obtaining a platform-specific
> execution_policy construct would require platform-specific initialization
> (which in portable code must be #ifdef-ed, which certainly doesn't 'look'
> like using a standard feature). Not to mention that, since nothing is said
> about data locality in a GPU-accelerated implementation, the only plausible
> behavior is that after every call to std::sort(my_policy, ...) the data is
> synced with host-side memory, which will result in data movement between
> every atomic operation a series of computations might consist of. (By
> atomic I mean having a workflow of std::transform, std::sort, std::reduce,
> ... to get the final result.)
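>
> A sketch of the kind of workflow meant above, written loosely against the
> N3554 draft interface (std::par stands in for whatever parallel policy the
> vendor provides; a GPU-backed policy would be obtained in some
> platform-specific way). Without a device-side container, each call may have
> to move the data to the accelerator and sync it back before the next one:
>
>     #include <algorithm>
>     #include <numeric>
>     #include <vector>
>
>     float pipeline(std::vector<float> v) {
>         std::transform(std::par, v.begin(), v.end(), v.begin(),
>                        [](float x) { return x * x; });   // host <-> device?
>         std::sort(std::par, v.begin(), v.end());          // host <-> device again?
>         return std::accumulate(v.begin(), v.end(), 0.0f); // back on the host
>     }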
>
> I believe we can all agree that GPU parallelism is just around the corner,
> or actually more like on our doorstep. All new IGPs support OpenCL 1.2, and
> the DX11.1 feature level is also becoming standard. In my first (and lost)
> post I talked a lot about OpenCL and why I believe it is a dead end despite
> its great potential and initial momentum (slow evolution, a C++ kernel
> language still being far off, NV sabotaging its own implementation, ...).
> There is an extremely elegant and powerful alternative: C++ AMP. It
> interfaces extremely well with C++: it uses STL-compatible containers, uses
> std::future for querying the results of 'kernels', standardizes the notion
> of devices and the means to query them, introduces the notion of indexers
> that can be used to iterate over a container (transposing a matrix stored
> as a 1-dimensional array has never been easier), introduces compute-unit
> native storage (__local, for those familiar with OpenCL), takes data
> locality into account, exposes direct memory-movement functions (std::copy,
> actually), places only the most minimal restrictions on amp-compatible
> classes and functions (those that are GPU hardware restrictions, although
> this could be a separate discussion), standardizes the workflow (but not
> the language!) of storing intermediate code, and on top of that it is an
> open specification. It feels as if it weren't even Microsoft. In some ways
> it is crude, and it could interface a lot better with the STL, but it has
> taken an extremely large step forward.
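>
> A minimal sketch of that style (names per Microsoft's C++ AMP open
> specification; requires an AMP-capable compiler): an array_view wraps host
> data, the lambda is restricted to the accelerator, and synchronize() copies
> the results back.
>
>     #include <amp.h>
>     #include <vector>
>
>     void double_all(std::vector<float>& host) {
>         using namespace concurrency;
>         // Wrap the host data; no copy happens until the kernel needs it.
>         array_view<float, 1> av(static_cast<int>(host.size()), host);
>         parallel_for_each(av.extent, [=](index<1> idx) restrict(amp) {
>             av[idx] *= 2.0f;
>         });
>         av.synchronize(); // copy the results back into the std::vector
>     }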
>
> Anyhow, to boil down to my main point: instead of giving just the
> low-level building blocks of CPU parallelism as keywords that the existing
> OpenMP implementations could provide (N3530), and instead of giving these
> constructs as functions that completely lack the ability to build efficient
> larger blocks out of (N3554), incorporating C++ AMP into the standard could
> kill two birds with one stone. Naturally, I do not mean dragging it in as
> it is, but...
>
> Supplemented with constructs beyond parallel_for_each(), e.g. a
> parallel_reduce() similar to what exists in OpenMP, plus restrict(amp,cpu)
> implementations of the STL algorithms, it would provide a completely
> portable and flexible way of writing parallel and massively parallel code
> for CPUs, GPUs, or whatever other compute back ends the compiler vendor
> supports (like FPGAs, which N3554 also mentions, but there too they must be
> supported by the compiler).
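>
> Purely as a hypothetical sketch of that direction (neither parallel_reduce
> nor this signature exists anywhere; the names are invented here, and the
> restrict clause requires a C++ AMP-capable compiler), it might look like
> this, with a vendor free to dispatch the template to OpenMP, C++ AMP, or an
> FPGA back end:
>
>     #include <vector>
>
>     // Trivial sequential fallback; a real implementation would dispatch to
>     // whatever back ends the vendor supports.
>     template <typename Iterator, typename T, typename BinaryOp>
>     T parallel_reduce(Iterator first, Iterator last, T init, BinaryOp op) {
>         for (; first != last; ++first)
>             init = op(init, *first);
>         return init;
>     }
>
>     float sum(const std::vector<float>& v) {
>         // The callable is usable on both the CPU and the accelerator.
>         return parallel_reduce(v.begin(), v.end(), 0.0f,
>                                [](float a, float b) restrict(amp, cpu) {
>                                    return a + b;
>                                });
>     }
>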
> I do not have the time, nor the expertise, to write a legitimate proposal;
> I am just a lowly physicist skilled in OpenCL with the perversion of
> knowing a lot about CPU and GPU architectures. But I certainly know that in
> 2013 neither of the two proposals as they stand solves the issue at hand,
> namely that there is virtually no portable way to write GPU code in the C++
> language (NV sabotaging its own OpenCL implementation), a language which
> now and in the near future most hardware (CPUs and GPUs alike) will support
> nearly fully. I think it would be time to take the initiative, and not
> standardize CPU-only parallel constructs, which, to be honest, should have
> been done at least 10 years ago, but definitely by the time C++11 hit
> release state. For C++14 or C++17 it would be time to also leverage the
> ever-evolving IGPs (with one naming convention), which will serve as the
> horsepower of future PCs and mobile equipment, and not to standardize
> things that will in the future conflict or overlap in naming or
> functionality with the long-run goal.
>
> Ideas?
>