Topic: Parallel & RT GC (was Real-Time GC (was Widespread C++...?))


Author: eachus@spectre.mitre.org (Robert I. Eachus)
Date: 23 Jan 1995 17:06:40 GMT
In article <19950122.094355.149517.NETNEWS@UICVM.UIC.EDU> dhanley@matisse.eecs.uic.edu (David Hanley) writes:

 >      Hmmm.  While I suppose this is possible, I haven't yet run
 > across this "large body of algorithms" in which the execution time
 > is reduced by an order of magnitude when memory blocks can be
 > zeroed out very quickly.  Even if the algorithm did have to zero
 > out the memory itself, I don't see how that could take it from,
 > say, n^2 to n^3.  Could you please point some of these out to me?

    Maybe I should have said "real-time" or "on-line" algorithms.  The
distinguishing characteristic seems to be that you have to choose the
solution method without knowing the size of (or all of) the data set
in advance.  Another way of looking at it is that you are solving a
problem, but the only time you care about is the time between
receiving the last piece of data and getting a result.  (Although in
practice, what you want is, at every stage of reading in data, to
have a solution based on all the data seen so far.)

    The seminal paper is "Real-time computation" by M. O. Rabin,
Israel J. Math. 1 (1963), 203-211 (I believe it has been reprinted).
See also Hartmanis and Stearns, "On the computational complexity of
algorithms," Trans. Amer. Math. Soc. 117 (1965), 285-306.

 >   Hmm.  Not sure what this has to do with GC, really.  In any
 > case, it would be pretty easy to design memory hardware that would
 > zero out pages really quickly (just don't refresh them for one cycle).
 > And the bzero() function on your C compiler could be set to use it.
 > But maybe I'm just confused.  Could you please explain if this is
 > wrong?

      Nothing confused in that; it is done in real-time applications
all the time.  A typical scenario is one where you build a hash table
using a hash function with a very low likelihood of collision.
Normally you would have to factor the time to clear the hash table, or
the time spent maintaining a structure just for clearing the table,
into the time required for the algorithm.  Poof!--if supported by the
hardware--is the most elegant solution.
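
      For concreteness, here is a minimal C++ sketch of one such
"structure maintained just for clearing the table": each slot carries a
frame stamp, so clearing the whole table is a single counter increment.
Every name in it is invented for illustration; it is not code from any
system mentioned in this thread.

   // Illustrative only: a fixed-size, open-addressing table whose clear()
   // is one increment.  A slot is live only if its stamp equals the current
   // frame.  Assumes the table never fills; the stamp array is zeroed once
   // at startup, and epoch starts at 1 so untouched slots read as empty.
   // (After 2^32 clears the stamps would have to be rezeroed once.)
   #include <cstdint>
   #include <cstddef>

   struct EpochTable {
       static const std::size_t N = 1024;      // table size, a power of two
       std::uint32_t epoch = 1;                 // current frame number
       std::uint32_t stamp[N] = {};             // per-slot frame stamps
       std::uint64_t key[N];                    // garbage until stamped
       std::uint64_t val[N];

       void clear() { ++epoch; }                // O(1): every slot becomes empty

       void insert(std::uint64_t k, std::uint64_t v) {
           std::size_t i = k & (N - 1);         // trivial hash, good enough here
           while (stamp[i] == epoch && key[i] != k)
               i = (i + 1) & (N - 1);           // linear probing
           stamp[i] = epoch; key[i] = k; val[i] = v;
       }

       bool find(std::uint64_t k, std::uint64_t &v) const {
           for (std::size_t i = k & (N - 1); stamp[i] == epoch;
                i = (i + 1) & (N - 1))
               if (key[i] == k) { v = val[i]; return true; }
           return false;
       }
   };

The per-access stamp checks are exactly the sort of constant overhead
the hardware "poof" makes unnecessary.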

      The only thing it has to do with garbage collection, AFAIK, is
that there is a nice technique where data is ping-ponged back and
forth between two heaps.  For example, you have a structure containing
all current aircraft tracks.  If the old data is reviewed every sweep
and, if still valid, moved to the other heap, all the garbage created
by building and maintaining the tracks can be made to disappear with
never a worry about fragmentation.
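
      A rough C++ sketch of that two-heap ping-pong; the Track type and
every name below are invented for illustration and are not from any of
the systems discussed here.

   // Illustrative only.  Each sweep, still-valid tracks are copied from the
   // live heap into the other one and the roles are swapped.  Everything not
   // copied forward -- stale tracks and any garbage built up while
   // maintaining them -- vanishes in one bulk reset, with no fragmentation.
   #include <vector>

   struct Track { double x, y, vx, vy; long last_seen; };

   class TrackStore {
       std::vector<Track> heap[2];   // the two heaps
       int live = 0;                 // which heap holds the current tracks
   public:
       void add(const Track &t) { heap[live].push_back(t); }

       void sweep(long now, long max_age) {
           int fresh = 1 - live;
           heap[fresh].clear();                     // bulk reset of old garbage
           for (const Track &t : heap[live])
               if (now - t.last_seen <= max_age)    // review: still valid?
                   heap[fresh].push_back(t);        // move it to the other heap
           live = fresh;
       }

       const std::vector<Track> &tracks() const { return heap[live]; }
   };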


--

     Robert I. Eachus

with Standard_Disclaimer;
use  Standard_Disclaimer;
function Message (Text: in Clever_Ideas) return Better_Ideas is...




Author: eachus@spectre.mitre.org (Robert I. Eachus)
Date: 20 Jan 1995 14:49:01 GMT
In article <3fmgbn$rns@watnews1.watson.ibm.com> ncohen@watson.ibm.com (Norman H. Cohen) writes:

 > This doesn't make sense, because there is a well known programming trick
 > (used more in analysis-of-algorithms than in real programs because of the
 > large constant factor) for marking a large section of memory as
 > uninitialized in constant time...

   Read barriers are a neat trick, but for real-time applications
replacing a boolean with a pointer and... is totally intolerable.  I
have used algorithms in this class in real-time systems, and although
we devoted a whole page to each bit-vector, the typical maximums are
on the order of a few hundred bits.  But the difference between, say,
5 clocks and 20 (clearing two words at a time) is crucial.

    So yes, you are correct that these algorithms can be replaced by
more complex versions without the special memory-clearing instruction,
but the large constant multiplier makes that approach useless.  To
give a real-world example: in one application, if all the time
available was assigned to the innermost loop of an N^2 lg2 N
algorithm, you had 25 clocks per iteration--and a cache miss could
take most of that.  Fortunately I found an algorithm which met the
theoretical complexity limit, and where all the inner loop did was
set a bit-vector of length N to false.  In that case limiting the size
of the bit-vector (so I could keep it in a register pair and avoid
write-throughs) was sufficient.

--

     Robert I. Eachus

with Standard_Disclaimer;
use  Standard_Disclaimer;
function Message (Text: in Clever_Ideas) return Better_Ideas is...




Author: dhanley@matisse.eecs.uic.edu (David Hanley)
Date: Sun, 22 Jan 1995 02:56:58 +0000
Robert I. Eachus (eachus@spectre.mitre.org) wrote:
:     There is another major reason for putting the custom hardware in
: the memory system.  There is a fairly large class of problems which
: require time proportional to n^k lg2 n on any von Neumann architecture,
: but can be solved in time n^(k-1) lg2 n on hardware which supports
: clearing large sections of memory to zero in constant time.  (Note
: that if k is one, you can get orders of magnitude improvement.)

        Hmmm.  While I suppose this is possible, I haven't yet run
across this "large body of algorithms" in which the execution time is
reduced by an order of magnitude when memory blocks can be zeroed out
very quickly.  Even if the algorithm did have to zero out the memory
itself, I don't see how that could take it from, say, n^2 to n^3.
Could you please point some of these out to me?

:      Since many memory systems can support this happy property, all you
: need to do to get the faster algorithms is to bypass the OS, or more
: politely, have an OS call which clears a section of memory.

        Hmm.  Not sure what this has to do with GC, really.  In any
case, it would be pretty easy to design memory hardware that would
zero out pages really quickly (just don't refresh them for one cycle).
And the bzero() function on your C compiler could be set to use it.
But maybe I'm just confused.  Could you please explain if this is
wrong?

--
------------------------------------------------------------------------------
| David James Hanley  -- dhanley@lac.eecs.uic.edu  -- C++, OOD, martial arts |
| Laboratory for advanced computing      |      My employer barely KNOWS me. |
------------------------------------------------------------------------------
"There are trivial truths & there are great truths. The opposite of a
trivial truth is plainly false."
"The opposite of a great truth is also true."

-Niels Bohr




Author: beckwb@ois.com (R. William Beckwith)
Date: 16 Jan 1995 21:45:33 -0500
: Kelvin Nilsen/Dept. of Computer Science/Iowa State University/Ames, IA  50011
:  (515) 294-2259   kelvin@cs.iastate.edu  uunet!cs.iastate.edu!kelvin

Thanks for the informative followup, Kelvin.  If the chip you're working on
has or will have an Ada 95 compiler, please let me know.  I'll try to
plug in a nice hardware-assisted GC'tor.

... Bill

-----------------------------------------------------
e-mail: Bill.Beckwith@ois.com       |    Team Ada
Objective Interface Systems, Inc.   | dist, full O-O
1895 Preston White Drive, Suite 250 | multithreading
Reston, VA  22091-5448  U.S.A.      |    built in




Author: mbr@research.nj.nec.com (Mark Reinhold)
Date: 17 Jan 95 10:37:55
In article <hbaker-1301951307490001@192.0.2.1> hbaker@netcom.com writes:

> If people are interested in these issues, and haven't yet read the following
> paper, they are in for a treat:

> Reinhold, M.B.  "Cache Performance of Garbage-Collected Programs".  PLDI'94,
> ACM Sigplan Notices 29, 6 (June 1994), 206-217.  (It _may_ be on the net.
> Reinhold works for NEC in New Jersey, I believe.)

It is indeed on the net.  See http://www.neci.nj.nec.com/homepages/mbr.html.
Comments welcome.




Author: eachus@spectre.mitre.org (Robert I. Eachus)
Date: 17 Jan 1995 16:29:13 GMT
In article <kelvin.790009178@kickapoo> kelvin@cs.iastate.edu (Kelvin Nilsen) writes:

   > By placing the custom hardware within the memory subsystem,
   > we are able to provide the desired benefits without most of the
   > costs of developing special-purpose CPUs.

    There is another major reason for putting the custom hardware in
the memory system.  There is a fairly large class of problems which
require time proportional to n^k lg2 n on any von Neumann architecture,
but can be solved in time n^(k-1) lg2 n on hardware which supports
clearing large sections of memory to zero in constant time.  (Note
that if k is one, you can get orders of magnitude improvement.)

     Since many memory systems can support this happy property, all you
need to do to get the faster algorithms is to bypass the OS, or more
politely, have an OS call which clears a section of memory.
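
     One present-day way to get something close to this from the OS,
offered only as an illustration (it is not necessarily what is meant
above, and the kernel's work is deferred and proportional to the pages
dropped rather than truly constant-time): POSIX mmap() hands back
zero-filled pages without the CPU writing each word up front, and on
Linux madvise(MADV_DONTNEED) on a private anonymous mapping makes the
range read as zeroes again.

   // Illustration only; Linux semantics assumed for MADV_DONTNEED.
   #include <sys/mman.h>
   #include <cassert>
   #include <cstddef>

   int main() {
       const std::size_t len = 1 << 20;                  // a 1 MiB arena
       void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
       assert(p != MAP_FAILED);
       char *a = static_cast<char *>(p);

       a[12345] = 42;                                    // dirty one page

       // "Clear" the arena: drop the dirty pages; the next touch of any
       // page gets a fresh zero-filled one from the kernel.
       madvise(p, len, MADV_DONTNEED);
       assert(a[12345] == 0);

       munmap(p, len);
       return 0;
   }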


--

     Robert I. Eachus

with Standard_Disclaimer;
use  Standard_Disclaimer;
function Message (Text: in Clever_Ideas) return Better_Ideas is...




Author: hbaker@netcom.com (Henry Baker)
Date: Wed, 18 Jan 1995 15:27:50 GMT
In article <EACHUS.95Jan17112913@spectre.mitre.org>,
eachus@spectre.mitre.org (Robert I. Eachus) wrote:

> In article <kelvin.790009178@kickapoo> kelvin@cs.iastate.edu (Kelvin Nilsen) writes:
>
>    > By placing the custom hardware within the memory subsystem,
>    > we are able to provide the desired benefits without most of the
>    > costs of developing special-purpose CPUs.
>
>     There is another major reason for putting the custom hardware in
> the memory system.  There is a fairly large class of problems which
> require time proportional to n^k lg2 n on any von Neumann architecture,
> but can be solved in time n^(k-1) lg2 n on hardware which supports
> clearing large sections of memory to zero in constant time.  (Note
> that if k is one, you can get orders of magnitude improvement.)

Simple example, please?




Author: bischoff@oracorp.com (Kurt Bischoff)
Date: Thu, 19 Jan 1995 15:57:33 GMT
In article <3ffb0d$qbp@gamma.ois.com> beckwb@ois.com (R. William Beckwith) writes:
>: Kelvin Nilsen/Dept. of Computer Science/Iowa State University/Ames, IA  50011
>:  (515) 294-2259   kelvin@cs.iastate.edu  uunet!cs.iastate.edu!kelvin
>
>Thanks for the informative followup, Kelvin.  If the chip you're working on
>has or will have an Ada 95 compiler, please let me know.  I'll try to
>plug in a nice hardware-assisted GC'tor.
>

I think Kelvin was making the point that GC support should be kept out of
the CPU--it should be in a separate memory module.  Then compiler
implementation is independent of hardware-assisted GC implementation.
Kelvin wrote:

:  3. Since the memory architecture is portable between CPU architectures,
:     we can separate issues of instruction-set compatibility from issues
:     of real-time garbage collection support.

-------------
Kurt Bischoff, Odyssey Research, 301A Dates Drive, Ithaca, NY,  14850
(607) 277-2020
bischoff@oracorp.com





Author: hbaker@netcom.com (Henry Baker)
Date: Fri, 20 Jan 1995 02:20:30 GMT
In article <3fmgbn$rns@watnews1.watson.ibm.com>, ncohen@watson.ibm.com wrote:

> In article <EACHUS.95Jan17112913@spectre.mitre.org>,
> eachus@spectre.mitre.org (Robert I. Eachus) writes:
>
> |>     There is another major reason for putting the custom hardware in
> |> the memory system.  There is a fairly large class of problems which
> |> require time proportional to n^k lg2 n on any von Neumann architecture,
> |> but can be solved in time n^(k-1) lg2 n on hardware which supports
> |> clearing large sections of memory to zero in constant time.  (Note
> |> that if k is one, you can get orders of magnitude improvement.)
>
> This doesn't make sense, because there is a well known programming trick
> (used more in analysis-of-algorithms than in real programs because of the
> large constant factor) for marking a large section of memory as
> uninitialized in constant time.  (The effect of initializing a large
> section of memory to zero is achieved by checking on each read whether
> the location read from has been initialized, and using the value zero if
> it has not been.)

This technique is known as a 'read barrier' in GC circles.

See:

ftp://ftp.netcom.com/pub/hb/hbaker/RealTimeGC.html (or .ps.Z)

for the 1978 CACM reference.  (The read-barrier technique is discussed twice in
this paper -- once for the basic GC algorithm itself, and once for 'moving'
and/or copying large arrays incrementally.)




Author: ncohen@watson.ibm.com (Norman H. Cohen)
Date: 19 Jan 1995 19:59:51 GMT
In article <EACHUS.95Jan17112913@spectre.mitre.org>,
eachus@spectre.mitre.org (Robert I. Eachus) writes:

|>     There is another major reason for putting the custom hardware in
|> the memory system.  There is a fairly large class of problems which
|> require time proportional to n^k lg2 n on any von Neumann architecture,
|> but can be solved in time n^(k-1) lg2 n on hardware which supports
|> clearing large sections of memory to zero in constant time.  (Note
|> that if k is one, you can get orders of magnitude improvement.)

This doesn't make sense, because there is a well known programming trick
(used more in analysis-of-algorithms than in real programs because of the
large constant factor) for marking a large section of memory as
uninitialized in constant time.  (The effect of initializing a large
section of memory to zero is achieved by checking on each read whether
the location read from has been initialized, and using the value zero if
it has not been.)

The trick works as follows: View the section of memory as an array of N
initializable units numbered 0 to N-1.  Allocate two arrays of integers,
Stack and Stack_Index, indexed from 0 to N-1.  Initially, both arrays
contain garbage.  There is a stack pointer, Top, pointing to the
highest-indexed element of Stack that does NOT contain garbage (so that
initially it equals -1).  When initializable unit I is initialized, Top
is incremented, the new value of Top is placed in Stack_Index(I), and I
is placed in Stack(Top).  Then the index of every initialized unit is
found exactly once in Stack(0 .. Top), and if unit I has been
initialized, Stack_Index(I) contains the index within Stack(0 .. Top)
containing I.  The test for whether initializable unit I is initialized
is:

   if Stack_Index(I) in 0 .. Top then
      if Stack(Stack_Index(I)) = I then
         return True;
         -- Stack_Index(I) contains a valid index into the stack, and the
         --   corresponding index in the stack indicates that unit I has
         --   been initialized.
      else
         return False;
         -- The garbage in Stack_Index(I) happened to point into the
         --   initialized part of Stack, but the value in Stack reveals
         --   that unit I was not the initializable unit for which this
         --   component of Stack was initialized.
      end if;
   else
      return False;
      -- Garbage in Stack_Index(I) is not even a valid stack index.
   end if;

or, more succinctly,

   return Stack_Index(I) in 0 .. Top and then Stack(Stack_Index(I))=I;
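
For readers who would rather see the trick in C++ than Ada, here is a
sketch (mine, not Cohen's); note that "re-zeroing" the whole array again
costs only a reset of Top.  The trick deliberately reads indeterminate
values, which is formally undefined behavior in C++, so treat this as an
illustration of the idea rather than production code.

   // Illustrative sketch of constant-time "zeroing" via the stack trick above.
   #include <cassert>
   #include <cstddef>

   class LazyZeroArray {
       std::size_t  n;
       long        *value;        // logical array; a slot is garbage until written
       std::size_t *stack;        // indices written so far, in order of first write
       std::size_t *stack_index;  // stack_index[i] = where i sits in stack, if written
       std::ptrdiff_t top;        // top of stack; -1 means "everything reads as zero"

       bool written(std::size_t i) const {
           std::size_t si = stack_index[i];                 // may be garbage
           return si < (std::size_t)(top + 1) && stack[si] == i;
       }
   public:
       explicit LazyZeroArray(std::size_t size)
           : n(size), value(new long[size]), stack(new std::size_t[size]),
             stack_index(new std::size_t[size]), top(-1) {} // arrays never pre-cleared
       ~LazyZeroArray() { delete[] value; delete[] stack; delete[] stack_index; }

       void clear() { top = -1; }       // constant-time "set all n elements to zero"

       long get(std::size_t i) const {
           assert(i < n);
           return written(i) ? value[i] : 0;
       }

       void set(std::size_t i, long v) {
           assert(i < n);
           if (!written(i)) {           // first write: record it on the stack
               ++top;
               stack[(std::size_t)top] = i;
               stack_index[i] = (std::size_t)top;
           }
           value[i] = v;
       }
   };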

--
Norman H. Cohen    ncohen@watson.ibm.com




Author: hbaker@netcom.com (Henry Baker)
Date: Wed, 11 Jan 1995 01:44:48 GMT
In article <3eua1r$4ea@gnat.cs.nyu.edu>, dewar@cs.nyu.edu (Robert Dewar) wrote:

> "Is there any work-in-progress by the large chip manufacturers to
> design GC into their next-generation CPU architectures?  It seems
> like the next logical step."
>
> I sure hope not, I hope we have seen the end of this kind of incorrect
> CISC thinking. At most what you want is some very specific hardware
> assist instructions that are consistent with RISC instruction design
> philosophy

Actually, the performance of GC these days is more hindered by cache
and VM designs than instruction sets.  In particular, GC needs
"write-allocate with subblock placement", such as is found on the
DEC MIPS machines.  I believe that Alphas also have write-allocate,
but I'm not completely sure.  The Pentium apparently does _not_ do
write-allocate, which makes any kind of initialization of untouched
memory pretty much of a disaster.  Ditto for VM implementations --
people keep talking about 'log-based backing stores', but the major
thing that is required isn't so much a log, as the ability to blind
write to a page without having to read it first.

PLDI'94 and LFP'94 had some good papers on cache issues in GC.




Author: beckwb@ois.com (R. William Beckwith)
Date: 13 Jan 1995 08:30:55 -0500
Robert Dewar (dewar@cs.nyu.edu) wrote:
: "Is there any work-in-progress by the large chip manufacturers to
: design GC into their next-generation CPU architectures?  It seems
: like the next logical step."

: I sure hope not, I hope we have seen the end of this kind of incorrect
: CISC thinking. At most what you want is some very specific hardware
: assist instructions that are consistent with RISC instruction design
: philosophy (for an example of this kind of thinking look at the trapped
: add in the SPARC design, certainly we do NOT want to put LISP type checking
: into the hardware, but this simple RISC consistent instruction is quite a
: help for certain approaches to dynamic type checking in LISP interpreters.)

I'm not looking for a complex solution.  I'm looking for the simplest
design changes in instruction set, cache, and register files to
accommodate transparent memory management.

Henry Baker comments that the primary issues are cache management and VM.
The fact that GC experts like Henry and Paul Wilson quickly digress to
chip level issues leads me to believe that a concerted effort on the
part of the CPU designer could lead to a much more elegant and efficient
solution to memory management in compiled languages.

It seems to me that GC is more than just a process level decision.  There
are three layers of participation in the effort:

    1. CPU
    2. O/S kernel
    3. system libraries

This is all in hopes that the application does not have to participate
in the memory management exercise.

Why not design the CPU with the GC'tor in mind instead of designing the
GC'tor with the CPU in mind?

It just seems silly that hardware designers aren't thinking more about
relieving us programmers of having to think about what memory we are
currently using.

... Bill

--------------------------------------------------------------------------
e-mail: Bill.Beckwith@ois.com         FAX:  703-264-1721 |    Team Ada
Objective Interface Systems, Inc.   Voice:  703-264-1900 | dist, full O-O
1895 Preston White Drive, Suite 250                      | multithreading
Reston, VA  22091-5448  U.S.A.                           |    built in




Author: kelvin@cs.iastate.edu (Kelvin Nilsen)
Date: 13 Jan 95 14:59:38 GMT
beckwb@ois.com (R. William Beckwith) writes:

>: "Is there any work-in-progress by the large chip manufacturers to
>: design GC into their next-generation CPU architectures?  It seems
>: like the next logical step."

>I'm not looking for a complex solution.  I'm looking for the simplest
>design changes in instruction set, cache, and register files to
>accommodate transparent memory management.

Paul Wilson mentioned my proposed hardware support for GC.  In case
anyone is interested in more information on this, I suggest the following
five references:

  Schmidt, Nilsen.  Performance of a Hardware-Assisted Real-Time Garbage
  Collector.  Sixth ASPLOS.  Oct. 1994.

  Nilsen, Schmidt.  Cost-Effective Object-Space Management for Hardware-
  Assisted Real-Time Garbage Collection.  ACM LOPLAS.  Dec. 1992.

  Nilsen.  Reliable Real-Time Garbage Collection of C++.  USENIX Computing
  Systems.  Fall 1994.

  Nilsen, Schmidt.  A High-Performance Hardware-Assisted Real-Time
  Garbage Collection System.  Journal of Programming Languages.  Jan. 1994.

  Nilsen.  Cost-Effective Hardware-Assisted Real-Time Garbage Collection.
  1994 ACM PLDI Workshop on Languages, Compilers and Tool Support for
  Real-Time Systems (ftp.cs.iastate.edu:/pub/kelvin/lcts-rts.PS.Z)
  (this paper provides an overview of the hardware prototype that is currently
  under development.  there are a number of recent innovations that we have
  introduced to reduce hardware costs and improve system flexibility)

>Why not design the CPU with the GC'tor in mind instead of designing the
>GC'tor with the CPU in mind?

>It just seems silly that hardware designers aren't thinking more about
>relieving us programmers of having to think about what memory we are
>currently using.

Our design proposes to place all of the special circuitry within a custom
memory module.  The design is portable between CPU architectures, and allows
memory to be cached.  The economic benefits are:

  1. Nobody (these days) would even think of designing such specialized
     support into a CPU.  The benefits have not been proven.  The costs
     are not well understood.  The investment required to deliver a
     price/performance competitive CPU in today's marketplace is too
     huge.  By placing the custom hardware within the memory subsystem,
     we are able to provide the desired benefits without most of the
     costs of developing special-purpose CPUs.

  2. The hardware can operate at memory speeds rather than at CPU speeds.
     This simplifies the implementation.

  3. Since the memory architecture is portable between CPU architectures,
     we can separate issues of instruction-set compatibility from issues
     of real-time garbage collection support.  (It would be a real bummer
     if our custom-CPU died in the marketplace not because hardware-assisted
     GC was a bad idea, but because the instruction set simply wasn't
     "popular".)

Let me add a little bit of clarification on this topic:

  1. Our work has been motivated by the need to support tight worst-case
     time bounds on dynamic memory management routines while supporting
     predictable memory utilization.  Given these requirements, hardware
     support seemed like the "right thing to do."

  2. If you do not need hard-real-time performance guarantees, then it
     is much more difficult to justify the costs of custom hardware.  There
     are many good software garbage collection techniques that perform
     well enough in the absence of custom hardware.

  3. On the other hand, if the proposed custom hardware is manufactured in
     large enough quantities, then it becomes commodity hardware and is
     much more affordable.  This may raise the ante.  Programmers will
     have higher expectations of their programming environments, and
     garbage collection activities will be a more dominant part of a
     "typical" system workload.  (We are trying to find some high-volume
     users for our system in order to bring the per-unit cost down.
     Potential buyers include the video game, virtual reality, and
     television-top multimedia box developers.)

  4. Lots of "good ideas" never go anywhere.  To turn a good idea into
     a useful capability requires huge amounts of technology development,
     salesmanship, risk taking, faith (blinders?), and a little bit of
     luck.  We have a "chip manufacturer" partner who is currently helping
     to bring this idea to the marketplace, but the partner is not one of
     the big names and they are extremely careful to control their risks
     and limit their investments.  A big part of "computer architecture"
     research seems to consist of simply learning how to "play your cards."



Kelvin Nilsen/Dept. of Computer Science/Iowa State University/Ames, IA  50011
 (515) 294-2259   kelvin@cs.iastate.edu  uunet!cs.iastate.edu!kelvin





Author: hbaker@netcom.com (Henry Baker)
Date: Fri, 13 Jan 1995 21:04:42 GMT
In article <3f5vaf$r07@gamma.ois.com>, beckwb@ois.com (R. William
Beckwith) wrote:

> Henry Baker comments that the primary issues are cache management and VM.
> The fact that GC experts like Henry and Paul Wilson quickly digress to
> chip level issues leads me to believe that a concerted effort on the
> part of the CPU designer could lead to a much more elegant and efficient
> solution to memory management in compiled languages.

If people are interested in these issues, and haven't yet read the following
paper, they are in for a treat:

Reinhold, M.B.  "Cache Performance of Garbage-Collected Programs".  PLDI'94,
ACM Sigplan Notices 29, 6 (June 1994), 206-217.  (It _may_ be on the
net.  Reinhold works for NEC in New Jersey, I believe.)




Author: wilson@cs.utexas.edu (Paul Wilson)
Date: 2 Jan 1995 12:03:44 -0600
In article <19941230.201628.350635.NETNEWS@UICVM.UIC.EDU>,
David Hanley <dhanley@matisse.eecs.uic.edu> wrote:
>
>        I had an idea that, if practical, would go a long way to
>justifying GC for me.  If this idea is not new (and I'm guessing it
>isn't), could someone please comment about it to me?
>
>        The idea is to use an extra processor to do garbage collection.
>This would relieve the first processor of almost all memory management
>tasks.  GC'd programs might therefore run even more quickly than manually
>managed ones.  This might also be a good way to utilize extra processors
>on multiple-processor machines.  Multiprocessor machines will be the
>norm in a couple more years, IMHO.
>
> [...]
>        Comments?
>

One problem with parallelizing GC's is that the major cost of an
incremental or concurrent GC is often in the coordination between
the GC itself and the running application program.  Parallelizing
the GC's tracing of reachable data only gets you so far, and then
you're up against the coordination costs, which are harder to
eliminate and are likely to be higher anyway, especially in an
incremental generational GC.  (Generational techniques decrease
tracing costs at the price of increased coordination costs.)

In the real-time context, this is especially important, because the
coordination costs are likely to be the most unpredictable---whether
a program operation is expensive or cheap depends not only on
what the program itself is up to, but which data the collector has
traced so far and which data it hasn't.

This is one of the reasons we chose not to use a copying strategy
in our stock-hardware real-time GC.  With a copy collector, you have
to coordinate changes by the collector (moving things around) with
what the program is doing, as well as coordination changes by the
application with what the GC is doing.  It's easier to put reasonable
bounds on coordination costs if only the application is modifying
the data.  (This is discussed in the paper pub/garbage/GC93/wilson.ps
on cs.utexas.edu, and more generally in the survey paper
pub/garbage/bigsurv.ps.)

As far as parallel GC goes, in some circumstances you may be better
off parallelizing the GC with itself, but not with the application
program, i.e., you can run the app in parallel for a while, then
stop and GC using all processors for GC work.

Unfortunately, in a parallel system it's much harder to provide
real-time guarantees, because you can't guarantee that a parallel
GC will find enough parallelism to keep up with a parallel application.
In bad cases, the application may construct linear lists that take
a long time for the GC to traverse, and whose traversal can't be
parallelized in any reasonably efficient way.  A more likely problem
is the construction of data structures that could in principle
be traversed in parallel, but which interact poorly with the GC's
task-spawning heuristics.

--
| Paul R. Wilson, Comp. Sci. Dept., U of Texas @ Austin (wilson@cs.utexas.edu)
| Recent papers on garbage collection, memory hierarchies, persistence and
| Scheme interpreters and compilers available via ftp from cs.utexas.edu
| (128.83.139.9), in pub/garbage or http://www.cs.utexas.edu/oops/)