Topic: c++advio - portable handling and _compression_ of binary data
Author: oleg@ponder.csci.unt.edu (Kiselyov Oleg)
Date: 1995/06/27
c++advio is a set of classes (C++ streams) for various kinds of
advanced i/o on binary streams of arithmetic data, namely:
- variable-bit coding of sequences of integers (including arithmetic
compression),
- a trick of sharing a stream buffer (a "file") among several streams,
- handling of extended file names (shell pipes),
- an explicit endian specification in dealing with integer streams.
Everything is tested on many platforms (UNIXes + Mac) and works. You
can see that for yourself: the test code is included, too.
To start with, the package defines BitIO streams that have methods for
a simple variable-length coding of short integers. This is useful when
one needs to read/write a collection of short integers where many
integers are rather small in value; still, big values can crop up at
times, so one can't limit the size of the code to anything less than 16
bits. The code is a variation of a start-stop code described in
Appendix A, "Variable-length representations of the integers" of the
"Text Compression" book by T.Bell, J.Cleary and I.Witten. The
differences are the support for both negative and positive numbers
plus some optimization based on the fact that all numbers are no
larger than 2^15-1 and an assumption that most of them are smaller
than 512 (in absolute value).
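To make the idea concrete, here is a minimal sketch of a two-length code for short integers. It is not the package's actual bit layout: the layout below (one sign bit, one selector bit, then 9 or 15 magnitude bits) and the names BitWriter, BitReader, put_short and get_short are assumptions made for illustration only, and the bits are kept in memory rather than in a stream.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-memory bit sink (the real BitIO classes work on streams).
class BitWriter {
    std::vector<bool> bits_;
public:
    void put_bit(bool b) { bits_.push_back(b); }

    // Append the n low bits of value, most significant bit first.
    void put_bits(unsigned value, int n) {
        for (int i = n - 1; i >= 0; --i) put_bit(((value >> i) & 1u) != 0);
    }

    // Assumed layout: sign bit, selector bit, then 9 or 15 magnitude bits.
    void put_short(int v) {
        assert(v >= -32767 && v <= 32767);
        const unsigned mag = static_cast<unsigned>(v < 0 ? -v : v);
        const bool is_large = mag >= 512;
        put_bit(v < 0);
        put_bit(is_large);
        put_bits(mag, is_large ? 15 : 9);
    }

    const std::vector<bool>& bits() const { return bits_; }
};

// Hypothetical reader that undoes put_short().
class BitReader {
    const std::vector<bool>& bits_;
    std::size_t pos_;
public:
    explicit BitReader(const std::vector<bool>& bits) : bits_(bits), pos_(0) {}

    bool get_bit() {
        assert(pos_ < bits_.size());
        return bits_[pos_++];
    }

    unsigned get_bits(int n) {
        unsigned v = 0;
        for (int i = 0; i < n; ++i) v = (v << 1) | (get_bit() ? 1u : 0u);
        return v;
    }

    int get_short() {
        const bool negative = get_bit();
        const bool is_large = get_bit();
        const int mag = static_cast<int>(get_bits(is_large ? 15 : 9));
        return negative ? -mag : mag;
    }
};
```

Under this assumed layout a value below 512 in absolute value costs 11 bits and any other value costs 17, so the code pays off only when small values dominate.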
The present package provides a clean C++ implementation of Bell,
Cleary and Witten's arithmetic compression code, with a _clear_
separation between a model and the coder. ArithmCodingIn /
ArithmCodingOut act as i/o streams that encode signed short integers
you put() to, and decode them when you get() them. The
ArithmCodingIn/Out object needs a "plug-in" object of a class
Input_Data_Model when the stream is created. The Input_Data_Model
object is responsible for providing the codec with probabilities of
symbol occurrences. Input_Data_Model may also modify itself to adapt
to the input stream.
The current version of the package provides two Input_Data_Model
plug-ins, both performing adaptive "modeling" of a stream of
integers. The first plug-in uses a simple 0-order adaptive prediction
(like the model given in Bell, Cleary and Witten's book). The other
needs a histogram to _sketch_ the initial distribution, and is somewhat
more sophisticated in updating the model. I use it to compress wavelet
decompositions of images.
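The separation between a model and the coder can be pictured with the sketch below. The interface is hypothetical: the real Input_Data_Model class is not shown in this post, so DataModel, FreqRange, Order0Model and all their member names are assumptions. The sketch only shows the kind of questions a coder asks of a 0-order adaptive model (cumulative frequency ranges) and how the model adapts after each symbol.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A symbol occupies the cumulative-frequency interval [low, high) out of total.
struct FreqRange { unsigned low, high, total; };

// Hypothetical plug-in interface: everything a coder needs from a model.
class DataModel {
public:
    virtual ~DataModel() {}
    virtual FreqRange range_of(int symbol) const = 0;    // used when encoding
    virtual int symbol_at(unsigned cum_freq) const = 0;  // used when decoding
    virtual unsigned total() const = 0;
    virtual void update(int symbol) = 0;                 // adapt to the input
};

// 0-order adaptive model over symbols lo..hi; every count starts at 1
// so that no symbol ever has zero probability.
class Order0Model : public DataModel {
    int lo_;
    std::vector<unsigned> counts_;
    unsigned total_;

    std::size_t index_of(int symbol) const {
        assert(symbol >= lo_);
        const std::size_t i = static_cast<std::size_t>(symbol - lo_);
        assert(i < counts_.size());
        return i;
    }
public:
    Order0Model(int lo, int hi)
        : lo_(lo),
          counts_(static_cast<std::size_t>(hi - lo + 1), 1u),
          total_(static_cast<unsigned>(hi - lo + 1)) {}

    FreqRange range_of(int symbol) const override {
        const std::size_t idx = index_of(symbol);
        unsigned low = 0;
        for (std::size_t i = 0; i < idx; ++i) low += counts_[i];
        return FreqRange{low, low + counts_[idx], total_};
    }

    int symbol_at(unsigned cum_freq) const override {
        assert(cum_freq < total_);
        unsigned high = 0;
        for (std::size_t i = 0; i < counts_.size(); ++i) {
            high += counts_[i];
            if (cum_freq < high) return lo_ + static_cast<int>(i);
        }
        return lo_ + static_cast<int>(counts_.size()) - 1;  // not reached
    }

    unsigned total() const override { return total_; }

    void update(int symbol) override {
        ++counts_[index_of(symbol)];
        ++total_;
    }
};
```

A practical model would also rescale its counts before the total outgrows the coder's precision; that step is left out here to keep the sketch short.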
It goes without saying that one needs to be careful when dealing with
binary data; I mean, when one writes, say, integers, in their binary
form into a file and wants to be able to read them later (probably on
a different platform). The package makes it easier to take this care:
EndianOut stream("/tmp/aa");
stream.set_littlendian();
stream.write_long(1);
Here 1 is written as a 4-byte integer with the least significant byte
first, NO MATTER which computer (computer architecture) the code is
running on. Using explicit endian specification (like above) is the
only way to ensure portability of binary files containing arithmetic
data.
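For readers wondering how a byte order can be fixed regardless of the host, the usual technique is to emit the bytes one at a time with shifts instead of copying the integer's memory image. The sketch below illustrates that technique on standard streams; the function names write_u32_little, write_u32_big and read_u32_little are made up for this example and are not the EndianOut/EndianIn interface.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Write a 32-bit value least significant byte first, on any host.
void write_u32_little(std::ostream& out, std::uint32_t v) {
    for (int shift = 0; shift <= 24; shift += 8)
        out.put(static_cast<char>((v >> shift) & 0xFFu));
}

// Write a 32-bit value most significant byte first, on any host.
void write_u32_big(std::ostream& out, std::uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8)
        out.put(static_cast<char>((v >> shift) & 0xFFu));
}

// Read back a value written by write_u32_little().
std::uint32_t read_u32_little(std::istream& in) {
    std::uint32_t v = 0;
    for (int shift = 0; shift <= 24; shift += 8) {
        const int byte = in.get();
        assert(byte != std::char_traits<char>::eof());
        v |= static_cast<std::uint32_t>(byte & 0xFF) << shift;
    }
    return v;
}
```

Because the shifts operate on the value rather than on its storage, the same bytes come out on a big-endian and on a little-endian machine.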
One funny thing about EndianIn/Out streams defined in c++advio is that
they can share the same i/o buffer. This is useful when one needs to
read/write a waffle-pie-like file consisting of various variable-bit
encoded data interspersed with headers. For example, a file may begin
with a header (telling the total number of data items and
normalization factors) followed by some variable-bit encoding of
items, followed by another header, followed by an arithmetic
compressed stream of data, etc. Just like a pie made of waffle crusts
separating different fillings. As with the pie, it feels great to
savor the whole combination. In less gastronomic terms, it's neat to
take a whole bite through, and swallow and digest it as it is, rather
than create a mess on a plate trying to separate the crusts and the
fillings. With the c++advio package, it's possible to open a file and
read it through only once, attaching/detaching different "streams" as
we go to interpret different layers. All of these streams
collectively operate on the same file and the same file pointer (and the
same buffer). The situation is similar to sharing an open file (and a
file pointer) among parent and child (forked) processes.
Note that merely opening a stream on a dup()-ed file handle, or
sync()-ing streams, doesn't cut it entirely. See the comments in the
source file endian_io.cc for more discussion. The bottom line is, this
package implements stream sharing in a safe and portable way: it works
on a Mac just as well as on different flavors of UNIX.
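The effect of several streams working on one buffer can be illustrated in standard C++, where more than one std::istream may be constructed over the same std::streambuf. This is only a sketch of the idea, not the mechanism the package uses (that is in endian_io.cc and is not reproduced here); Layers and read_layers are hypothetical names, and a whitespace-separated text "header" stands in for a binary one.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Two "layers" read from one underlying buffer in a single pass.
struct Layers {
    std::string header;
    std::string body;
};

Layers read_layers(std::streambuf* shared) {
    Layers result;

    // First stream object interprets the header layer.
    std::istream header_stream(shared);
    header_stream >> result.header;

    // Second stream object picks up exactly where the first one stopped,
    // because the read position lives in the shared buffer.
    std::istream body_stream(shared);
    body_stream >> result.body;

    return result;
}
```

Neither stream object owns the buffer, so destroying one of them leaves the read position intact for the other.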
The package adds support for "extended" file names with pipes in them.
That is, the name of a file to open may now be specified as
"| command" or "command |", i.e. as a pipe. For example,
EndianIn istream;
istream.open("gunzip < /tmp/aa.gz |");
EndianOut stream("| compress > /tmp/aa.Z");
image.write_pgm("| xv -"); // a trick to display an image!
The <command> is launched in a subprocess through '/bin/sh' with its
standard input/output hooked, through pipe(), to the file being
opened. This extension is implemented on the lowest possible level,
right before the request to open a file goes to OS (through the system
call open(2)). A function sys_open() (in the source file sys_open.cc)
acts as a "patch": that is, if you call sys_open() instead of open()
to open a file, you get all the open() functionality plus the extended
file names.
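The first step of such a "patch" is deciding whether a name is a plain file or a pipe. The sketch below shows one way that classification could be done; it is not the code from sys_open.cc, and NameKind, ParsedName and parse_extended_name are hypothetical names. Launching the command (through pipe() and /bin/sh, as described above) is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class NameKind { PlainFile, ReadFromCommand, WriteToCommand };

struct ParsedName {
    NameKind kind;
    std::string text;  // the file name, or the command without the '|'
};

// Strip leading and trailing blanks and tabs.
static std::string trim(const std::string& s) {
    const char* ws = " \t";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// "| command" : data written to the "file" is fed to the command's input.
// "command |" : data read from the "file" comes from the command's output.
ParsedName parse_extended_name(const std::string& name) {
    const std::string t = trim(name);
    if (!t.empty() && t.front() == '|')
        return ParsedName{NameKind::WriteToCommand, trim(t.substr(1))};
    if (!t.empty() && t.back() == '|')
        return ParsedName{NameKind::ReadFromCommand,
                          trim(t.substr(0, t.size() - 1))};
    return ParsedName{NameKind::PlainFile, name};
}
```

Doing this check in one low-level function means every class that opens files through it gets pipe support without any change of its own.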
A README file has more examples and discussion. A few verification
code files check to see that all the functions have compiled and run
well. The test code can also serve as an example of how the package's
classes/functions can be used.
The code is written in a portable way. In fact, it compiles and works
as it is under different flavors of UNIX (using gcc 2.6.3) as well as
on a Mac (using CodeWarrior 6.0), without too many #ifdef's. The code
is fairly well commented.
This package is used in image processing/image compression software
(which I'm going to submit later). However, the package is
self-contained, coherent and (I do hope!) has some value by
itself. I'm committed to maintaining and upgrading the code, and I'll
really appreciate any comments or questions. Please mail them
to me at oleg@ponder.csci.unt.edu or oleg@unt.edu
The code has been posted to comp.sources.misc and info-mac
(info-mac:/dev/lib/advanced-io-cpp.hqx) and is also available from
ftp://replicant.csci.unt.edu/pub/oleg/c++advio.shar
ftp://replicant.csci.unt.edu/pub/oleg/c++advio.cpt.hqx (Mac version)
The Mac version is identical to the UNIX version, but includes CW projects
and a compiled library (for a PowerMac).
http://replicant.csci.unt.edu/~oleg/ftp/
tells what else is available from that FTP site.