Topic: Looking for a standard library compliant solution...


Author: James.Kanze@dresdner-bank.com
Date: 1999/09/21
In article <37E0121C.94B9DF3F@ix.netcom.com>,
  "Paul D. DeRocco" <pderocco@ix.netcom.com> wrote:
> Erik Funkenbusch wrote:

> > I should have been a bit more precise about my needs.  I want to
> > read the file into memory to gain the fastest possible parsing
> > speed.  I've already implemented a "parse from disk" approach and
> > found it to be orders of magnitude slower than a file in memory or
> > a memory-mapped file.

> > Here's the real problem.  I am creating a text file viewer which has
> > to do many things upon "opening" a file.  The largest problem is
> > mapping all the newlines in the file so that I can get an accurate
> > representation of how many lines the file contains (so that I can
> > both accurately set the scrollbar bounds, as well as being able to
> > randomly scroll line-by-line in the file).  The problem with speed
> > is that the user is unable to scroll in the file until the scroll
> > bounds are known.  This means the user has to wait for the parsing
> > to be finished before they can do anything with the file.

> > On a small file, the disk based parse is acceptable.  On a large
> > file (in excess of 32MB) this takes more than a minute.  This is
> > unacceptable to my client.

> The only reason it takes more time to read and process a file one
> character at a time, as compared to reading the whole file and
> processing it, is the larger number of calls to the OS. If, however,
> your character reading routine reads through a large enough buffer,
> this difference evaporates.

> If you've tried it and it's too slow, then it would seem your library
> uses a buffer that's too small. Unfortunately, there's no
> standard-conforming way of controlling this (although pubsetbuf may do
> what you want in some implementations) for an existing filebuf.

> You can, however, invent a forwarding streambuf of your own design,
> that has, say, 64K of buffer space, and translates underflow() into a
> 64K read on the actual filebuf. This should speed up the file I/O to
> an acceptable level, without requiring huge memory allocations for
> huge files.

This is one case where a forwarding streambuf will not help.  The only
way it can read more than one character at a time from the actual
filebuf is with sgetn.  This function simply calls xsgetn, whose default
implementation loops calling sbumpc.
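
The standard describes the default as behaving "as if by repeated calls
to sbumpc()"; roughly:

    // A paraphrase of the library's default xsgetn -- one sbumpc call
    // per character, which is why a forwarding buffer gains little.
    streamsize xsgetn(char* s, streamsize n)
    {
        streamsize i = 0;
        while (i < n) {
            int c = sbumpc();
            if (c == EOF)
                break;
            s[i++] = char(c);
        }
        return i;
    }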

Generally, a good implementation *should* use optimally sized buffers.
Still, optimally sized may depend on what you are doing.  At any rate,
if you are designing a streambuf for a very specific application, you
can pretty much ignore a lot of the functions: no need to implement
seeking if you are not going to use it (and seeking can complicate
buffer management significantly).  A unidirectional (only read, or only
write) custom filebuf which doesn't support seeking is really very
simple.  On the other hand, it is, by definition, OS dependent -- there
is no way of reading a file in standard C++ other than through a filebuf
(or a FILE*), and this is precisely what you are trying to implement.
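
For example, the core of such a read-only, non-seeking buffer is a
single virtual function -- a sketch, assuming POSIX read() (the class
name and buffer size are illustrative):

    #include <streambuf>
    #include <unistd.h>    // POSIX read()

    class RawInBuf : public std::streambuf {
    public:
        explicit RawInBuf(int fd) : fd_(fd) {}
    protected:
        virtual int_type underflow()
        {
            // One large OS read each time the get area is exhausted.
            ssize_t n = ::read(fd_, buf_, sizeof buf_);
            if (n <= 0) return traits_type::eof();
            setg(buf_, buf_, buf_ + n);
            return traits_type::to_int_type(*gptr());
        }
    private:
        int fd_;
        char buf_[65536];
    };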

--
James Kanze                   mailto: James.Kanze@dresdner-bank.com
Conseils en informatique orientée objet/
                  Beratung in objekt orientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(069)63198627







Author: fbp@stlport.org
Date: 1999/09/17
In article <cXxD3.1801$S5.196606@ptah.visi.com>,
  "Erik Funkenbusch" <erikf@visi.com> wrote:

> Again, I'm trying to stick with a standard library compliant solution if
> possible.  I have already implemented non-standard library compliant
> solutions which have acceptable speed, but I find them to be less elegant
> than standard library conforming solutions (as well as more difficult to
> maintain).  I'm beginning to think that I won't be able to do this using
> the standard library classes.

If you are using the standard classes, you do not really have control
over low-level implementation details, which seem to be very important
to you, right?  For example, SGI STL does memory-mapped I/O for fstream,
and you might get reasonable speed even with the rdbuf() example below,
but if you want to port, you'll have to deal with not-so-smart fstream
implementations -- that is going to be hard.
Just do it yourself.


> Again, I'm not really interested in discussing my reasons for doing it
> this way unless you can provide an extremely efficient method of doing
> what I need.

Exactly: just memory-map the whole file -- EXTREMELY efficient.  Even if
you're on Windows, you have corresponding primitives.
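
For illustration, a minimal sketch of the POSIX variant (error handling
kept minimal; CreateFileMapping/MapViewOfFile would be the Windows
counterparts):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    // Map a whole file read-only; returns 0 on failure.
    const char* mapWholeFile(const char* path, size_t& length)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return 0;
        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return 0; }
        length = st.st_size;
        void* p = mmap(0, length, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);    // the mapping itself keeps the pages alive
        return p == MAP_FAILED ? 0 : static_cast<const char*>(p);
    }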

-Boris.






Author: "Paul D. DeRocco" <pderocco@ix.netcom.com>
Date: 1999/09/16
Erik Funkenbusch wrote:
>
> I should have been a bit more precise about my needs.  I want to read the
> file into memory to gain the fastest possible parsing speed.  I've
> already implemented a "parse from disk" approach and found it to be
> orders of magnitude slower than a file in memory or a memory-mapped file.
>
> Here's the real problem.  I am creating a text file viewer which has to
> do many things upon "opening" a file.  The largest problem is mapping all
> the newlines in the file so that I can get an accurate representation of
> how many lines the file contains (so that I can both accurately set the
> scrollbar bounds, as well as being able to randomly scroll line-by-line
> in the file).  The problem with speed is that the user is unable to
> scroll in the file until the scroll bounds are known.  This means the
> user has to wait for the parsing to be finished before they can do
> anything with the file.
>
> On a small file, the disk based parse is acceptable.  On a large file (in
> excess of 32MB) this takes more than a minute.  This is unacceptable to
> my client.

The only reason it takes more time to read and process a file one character
at a time, as compared to reading the whole file and processing it, is the
larger number of calls to the OS. If, however, your character reading
routine reads through a large enough buffer, this difference evaporates.

If you've tried it and it's too slow, then it would seem your library uses
a buffer that's too small. Unfortunately, there's no standard-conforming
way of controlling this (although pubsetbuf may do what you want in some
implementations) for an existing filebuf.

You can, however, invent a forwarding streambuf of your own design, that
has, say, 64K of buffer space, and translates underflow() into a 64K read
on the actual filebuf. This should speed up the file I/O to an acceptable
level, without requiring huge memory allocations for huge files.
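
A minimal sketch of such a forwarding buffer (read-only, no seeking; the
name is illustrative -- and see the reply elsewhere in this thread about
the default xsgetn):

    #include <streambuf>

    class BufferedInBuf : public std::streambuf {
    public:
        explicit BufferedInBuf(std::streambuf* src) : src_(src) {}
    protected:
        virtual int_type underflow()
        {
            // Translate each refill into one 64K sgetn on the real filebuf.
            std::streamsize n = src_->sgetn(buf_, sizeof buf_);
            if (n <= 0) return traits_type::eof();
            setg(buf_, buf_, buf_ + n);
            return traits_type::to_int_type(*gptr());
        }
    private:
        std::streambuf* src_;
        char buf_[65536];
    };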

--

Ciao,                       Paul D. DeRocco
Paul                        mailto:pderocco@ix.netcom.com





Author: "Erik Funkenbusch" <erikf@visi.com>
Date: 1999/09/13
Thanks for the input (pardon the pun ;)

Comments inline...

Sam Lindley <sam@redsnapper.net> wrote in message
news:7rc8lo$nd1$1@uranium.btinternet.com...
>
> Yes. I've had exactly the same problem. Here's a solution (it works for
> any input stream, not just an ifstream):
>
> string loadToString(istream& ist)
> {
>     ostringstream ost;
>     ist >> ost.rdbuf();    // or equivalently: ost << ist.rdbuf();
>     return ost.str();
> }

This method seems to be incredibly inefficient.  When using it on a 32 MB
file I ended up having to kill the process after 10 minutes, and it
appeared to be nowhere near done.  Watching the memory allocations, it
appeared to be doing incredibly slow reallocations all over the place.  It
did work OK on a small file, though.

The problem here seems to be the single-character read methods.  It uses
fgetc to get individual characters, reallocating as necessary (but it does
not appear to use the geometric buffer growth that other standard
containers, such as vector, use).

I can't seem to find any way to preallocate the storage for the
stringstream's buffer.

I realize that 32 MB is a large amount of memory to work with in a string.
I was merely hoping to be able to use the standard library to manage things
if possible.
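
For what it's worth, one standard-conforming workaround is to size a
string up front from the stream length and read into it directly -- a
sketch, assuming a seekable file opened in binary mode (the 1998 standard
does not formally guarantee contiguous string storage, though practical
implementations provide it):

    #include <fstream>
    #include <string>
    using namespace std;

    string loadBySize(const char* path)
    {
        ifstream in(path, ios::in | ios::binary);
        in.seekg(0, ios::end);
        string s(static_cast<string::size_type>(in.tellg()), '\0');
        in.seekg(0, ios::beg);
        if (!s.empty())
            in.read(&s[0], static_cast<streamsize>(s.size()));
        return s;    // error checking omitted for brevity
    }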

Sam Lindley <sam@redsnapper.net> wrote in message
news:7regmd$fk1$1@neptunium.btinternet.com...
> I've also just realised, there's a much more elegant method for loading
> directly into a string. (Now I see what istreambuf_iterators are for.)
> Unfortunately, some compilers don't seem to support the necessary
> features, though.
>
> Suppose ist is an already initialised istream (an ifstream, for example).
> Then:
>
>     string s((istreambuf_iterator<char>(ist)),
>              istreambuf_iterator<char>());    // extra parentheses avoid
>                                               // a function declaration
>
> should work, as should:
>
>     string s;
>     copy(istreambuf_iterator<char>(ist), istreambuf_iterator<char>(),
>          back_inserter(s));
>
> However, the former requires template member support, and the latter
> requires the string class to have a push_back member. I tried both in
> MSVC++6, and neither worked at first (due to outdated libraries). It was
> pretty straightforward to edit the standard headers to define a push_back
> method, which worked well.

I haven't been able to get this to work yet, as I'm somewhat reluctant to
modify the files that come with the compiler for maintenance reasons (plus
I'd have to modify the file on every developer's machine as well).





Author: "Andrew R. Thomas-Cramer" <artc@prism-cs.com>
Date: 1999/09/14

Erik Funkenbusch wrote in message <9X%B3.838$S5.81385@ptah.visi.com>...
>
>I'm looking for a good solution for what seems to me to be a common problem.
>
>I want to read a text file into memory and parse it.
>

Pardon the silly question, but why do you want to read it into memory in
one large chunk, rather than reading tokens directly from the file as
needed during parsing?

[ moderator's note: The original question called for a standard-
  conforming solution, which is why it was approved for this newsgroup.
  Let's try to keep replies on-topic for comp.std.c++. See the FAQ. -sdc ]









Author: "Ed Brey" <brey@afd.mke.etn.com>
Date: 1999/09/14
Erik Funkenbusch <erikf@visi.com> wrote in message
news:9X%B3.838$S5.81385@ptah.visi.com...
>
> I want to read a text file into memory and parse it.
>
> I would like to do this using std::string for the text buffer rather than
> a char array and some form of std::*stream function for file processing.
> [...]
> It just seems to me that this would be a common enough case that there
> would exist a fairly simple, elegant solution which is also highly
> efficient.

On the contrary, grammars are generally designed to avoid this case.  When a
compiler parses a C++ program, for instance, it can always just read through
the program sequentially, never having to revisit earlier in the file or
jump ahead.  All information it needs to retain from the program, the
compiler keeps in its own internal format.

I don't know what the nature of your text file is, but I think it is
unlikely that the entire text file really needs to reside in memory at once.
A good tokenizer can read the file piecemeal and provide the parser an
easy-to-use interface without using a lot of memory or requiring the delay
of having to read in the whole file before starting to get any work done.
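
For instance, the simplest piecemeal tokenizer is just extraction through
the stream's own buffer (a sketch, assuming whitespace-delimited tokens;
the filename is illustrative):

    ifstream in("input.txt");
    string tok;
    while (in >> tok) {
        // hand each token to the parser as it arrives
    }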

OTOH, there are applications, such as for databases, where random access to
the entire file is useful.  However, in these cases, usually it is better to
work with the file directly on disk, rather than making a copy into memory.
With the file on disk, mapping fstream into a string doesn't make sense,
because many of string's operations would be way too expensive.  One may
question, however, the absence of memory-mapped I/O from the standard.











Author: "Erik Funkenbusch" <erikf@visi.com>
Date: 1999/09/14
Ed Brey <brey@afd.mke.etn.com> wrote in message
news:7rm0ge$i852@interserv.etn.com...
>
> Erik Funkenbusch <erikf@visi.com> wrote in message
> news:9X%B3.838$S5.81385@ptah.visi.com...
> >
> > I want to read a text file into memory and parse it.
> >
> > I would like to do this using std::string for the text buffer rather than a
> > char array and some form of std::*stream function for file processing.
> > [...]
> > It just seems to me that this would be a common enough case that there
> > would exist a fairly simple, elegant solution which is also highly
> > efficient.
>
> On the contrary, grammars are generally designed to avoid this case.  When a
> compiler parses a C++ program, for instance, it can always just read through
> the program sequentially, never having to revisit earlier in the file or
> jump ahead.  All information it needs to retain from the program, the
> compiler keeps in its own internal format.

Interesting.  Yes, this is true.

I should have been a bit more precise about my needs.  I want to read the
file into memory to gain the fastest possible parsing speed.  I've already
implemented a "parse from disk" approach and found it to be orders of
magnitude slower than a file in memory or a memory-mapped file.

Here's the real problem.  I am creating a text file viewer which has to do
many things upon "opening" a file.  The largest problem is mapping all the
newlines in the file so that I can get an accurate representation of how
many lines the file contains (so that I can both accurately set the
scrollbar bounds, as well as being able to randomly scroll line-by-line in
the file).  The problem with speed is that the user is unable to scroll in
the file until the scroll bounds are known.  This means the user has to wait
for the parsing to be finished before they can do anything with the file.

On a small file, the disk based parse is acceptable.  On a large file (in
excess of 32MB) this takes more than a minute.  This is unacceptable to my
client.

Again, I'm trying to stick with a standard library compliant solution if
possible.  I have already implemented non-standard library compliant
solutions which have acceptable speed, but I find them to be less elegant
than standard library conforming solutions (as well as more difficult to
maintain).  I'm beginning to think that I won't be able to do this using the
standard library classes.

> I don't know what the nature of your text file is, but I think it is
> unlikely that the entire text file really needs to reside in memory at once.

Speed is the reason.  If you can suggest another technique that approaches
the speed of in-memory parsing while using the standard library, I'll
investigate it.

> A good tokenizer can read the file piecemeal and provide the parser an
> easy-to-use interface without using a lot of memory or requiring the delay
> of having to read in the whole file before starting to get any work done.

I agree.  If speed were not an issue, I would use this approach (and already
have, in fact).

> OTOH, there are applications, such as for databases, where random access to
> the entire file is useful.  However, in these cases, usually it is better to
> work with the file directly on disk, rather than making a copy into memory.
> With the file on disk, mapping fstream into a string doesn't make sense,
> because many of string's operations would be way too expensive.  One may
> question, however, the absence of memory-mapped I/O from the standard.

While I do have to do random access in the file, it's not for parsing;
it's simply for displaying.  The parsing is extremely simple: I'm just
storing character positions in a vector so that I can randomly access the
lines of data later.
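
For reference, building that index is a single pass -- a sketch, assuming
the file is already in memory as a string (names are illustrative):

    #include <string>
    #include <vector>
    using namespace std;

    // Record the starting offset of every line for later random access.
    void indexLines(const string& text, vector<string::size_type>& starts)
    {
        starts.clear();
        starts.push_back(0);                 // the first line starts at 0
        string::size_type pos = 0;
        while ((pos = text.find('\n', pos)) != string::npos)
            starts.push_back(++pos);         // next line begins after '\n'
        // starts.size() bounds the scrollbar; starts[i] locates line i
    }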

Again, I'm not really interested in discussing my reasons for doing it this
way unless you can provide an extremely efficient method of doing what I
need.








Author: "Sam Lindley" <sam@redsnapper.net>
Date: 1999/09/12
Oops... an accidental drag and drop garbled my last paragraph; it should
have read:

"Another possibility that we might consider, is to do the parsing directly
from the input stream. The only real advantage here is a reduction in the
memory requirements. I recently wrote a recursive macro parser which did
exactly this. Unfortunately, due to the extra house-keeping functions
performed by standard iostreams (I tried it with an istringstream, so it
wasn't due to innefficent file-buffering), it was an order of magnitude
slower than a version which loaded the whole thing into a string first."

I've also just realised that there's a much more elegant method for loading
directly into a string. (Now I see what istreambuf_iterators are for.)
Unfortunately, some compilers don't seem to support the necessary features,
though.

Suppose ist is an already initialised istream (an ifstream, for example).
Then:

    string s((istreambuf_iterator<char>(ist)),
             istreambuf_iterator<char>());    // extra parentheses avoid
                                              // a function declaration

should work, as should:

    string s;
    copy(istreambuf_iterator<char>(ist), istreambuf_iterator<char>(),
         back_inserter(s));

However, the former requires template member support, and the latter
requires the string class to have a push_back member. I tried both in
MSVC++6, and neither worked at first (due to outdated libraries). It was
pretty straightforward to edit the standard headers to define a push_back
method, which worked well.

Sam Lindley





Author: "Erik Funkenbusch" <erikf@visi.com>
Date: 1999/09/10
I'm looking for a good solution for what seems to me to be a common problem.

I want to read a text file into memory and parse it.

I would like to do this using std::string for the text buffer rather than a
char array and some form of std::*stream function for file processing.

I own probably half a dozen books on the standard library, and they all
contain very basic examples of streaming files.  They all contain examples
of reading a single line into a char array.

It seems from my reading that my only choices are:

1:  Allocate a large enough char array on the heap (this could be
megabytes), use the get member function of fstream or the like to read the
text in, then copy all that data into a std::string.

2:   Allocate a small buffer and read in the text up to the size of the
buffer, then concatenate it on the end of the string and repeat until end of
file is reached.

Option 1 is incredibly messy and requires massive allocations and copying.
Option 2 is neater, but requires lots of smaller reads and copying and
reallocation of buffers (slow).
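
For concreteness, option 2 is roughly this (a sketch; the buffer size is
arbitrary):

    #include <istream>
    #include <string>
    using namespace std;

    // Read the whole stream, one fixed-size chunk at a time.
    string loadPiecemeal(istream& in)
    {
        char buf[4096];
        string s;
        while (in) {
            in.read(buf, sizeof buf);
            s.append(buf, static_cast<string::size_type>(in.gcount()));
        }
        return s;
    }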

It just seems to me that this would be a common enough case that there would
exist a fairly simple, elegant solution which is also highly efficient.

I could use any number of other solutions, but in the end, it still requires
lots of conversions and copying to get everything into a std::string.  I'm
trying to avoid all the intermediate steps to make the code cleaner.  I
mean, isn't that what the standard library was created for?

There is no need to worry about small memory models or lack of physical
memory.  The code will run on turnkey systems with adequate memory for
whatever solution is required.  I just want to both minimize resources, and
simplify code while staying as compliant with the standard library as
possible.

Any suggestions?










Author: "Sam Lindley" <sam@redsnapper.net>
Date: 1999/09/11
Erik Funkenbusch <erikf@visi.com> wrote in message
news:9X%B3.838$S5.81385@ptah.visi.com...
>
> I'm looking for a good solution for what seems to me to be a common
> problem.
>
> I want to read a text file into memory and parse it.
>
> I would like to do this using std::string for the text buffer rather than a
> char array and some form of std::*stream function for file processing.
>
> I own probably half a dozen books on the standard library, and they all
> contain very basic examples of streaming files.  They all contain examples
> of reading a single line into a char array.
>
> It seems from my reading that my only choices are:
>
> 1:  Allocate a large enough char array on the heap (this could be
> megabytes), use the get member function of fstream or the like to read the
> text in, then copy all that data into a std::string.
>
> 2:   Allocate a small buffer and read in the text up to the size of the
> buffer, then concatenate it on the end of the string and repeat until end of
> file is reached.
>
> Option 1 is incredibly messy and requires massive allocations and copying.
> Option 2 is neater, but requires lots of smaller reads and copying and
> reallocation of buffers (slow).
>
> It just seems to me that this would be a common enough case that there would
> exist a fairly simple, elegant solution which is also highly efficient.
>
> I could use any number of other solutions, but in the end, it still requires
> lots of conversions and copying to get everything into a std::string.  I'm
> trying to avoid all the intermediate steps to make the code cleaner.  I
> mean, isn't that what the standard library was created for?
>
> There is no need to worry about small memory models or lack of physical
> memory.  The code will run on turnkey systems with adequate memory for
> whatever solution is required.  I just want to both minimize resources, and
> simplify code while staying as compliant with the standard library as
> possible.
>
> Any suggestions?

Yes. I've had exactly the same problem. Here's a solution (it works for
any input stream, not just an ifstream):

string loadToString(istream& ist)
{
    ostringstream ost;
    ist >> ost.rdbuf();    // or equivalently: ost << ist.rdbuf();
    return ost.str();
}

This method requires an 'unnecessary' copy from the ostringstream to the
string, but this is likely to be insignificant compared to the time taken to
load from a file-system and the time taken to do the parsing. (We could also
use a string reference instead of a return value, but reference-counting
string implementations mean this extra string copy isn't usually a problem.)
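
Typical usage (the filename is illustrative):

    ifstream file("input.txt");
    string text = loadToString(file);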

Another possibility that we might consider is to do the parsing directly
from the input stream. The only real advantage here is a reduction in the
memory requirements. I recently wrote a recursive macro parser which did
exactly this. Unfortunately, due to the extra house-keeping functions
performed by standard iostreams (I tried it with an istringstream, so it
wasn't due to inefficient file-buffering), it was an order of magnitude
slower than a version which loaded the whole thing into a string first.

Sam Lindley

PS:
WARNING: many compilers have problems with the new standard streams library;
particularly stringstreams

Some of the ones I've tried:

gcc 2.95: doesn't support stringstreams, but libstdc++ has some support; it
isn't finished yet, though.
MSVC++ 6.0 has pretty good support (at least it seems to be stable if not
fully standards compliant).
Metrowerks CodeWarrior 5 has good support.
SunPRO C++ 5.0 claims to support stringstreams, but has nasty bugs. (If a
single write of greater than 128 characters is made to an ostringstream,
then the str() method returns an empty string, for instance - I keep
reporting this bug to Sun, but they don't seem to take much notice).




[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://reality.sgi.com/austern_mti/std-c++/faq.html              ]