Exploring Haskell - ByteString (Strict)
ByteString is one of the core GHC libraries (emphasis on GHC because
it is not part of the Haskell 2010
Language Report and I am not sure if it is compatible with other
compilers like Hugs). The ByteString library is probably a dependency
(directly or indirectly) for most Hackage libraries.
ByteStrings are highly efficient string types. A
ByteString is a sequence of bytes (8-bit characters). There
are strict (single large array) and lazy (call-by-need, good for
streaming) varieties. For using the ByteString library, the
Haddock documentation is good to get started. We will briefly look at
some of the source code for strict ByteStrings to get a
general understanding of the implementation. While this level of
understanding is not necessary for every day use of
ByteStrings, it should be useful for having a deeper
understanding of the GHC compiler.
Data.ByteString
Here is the definition of strict ByteString from Data.ByteString
in version 0.10.8.1.
data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8) -- payload
{-# UNPACK #-} !Int -- offset
{-# UNPACK #-} !Int -- lengthUNPACK Pragma
ByteString has a constructor PS that takes
three values: a payload, an offset and a length. The first think I
noticed were the UNPACK
pragmas. This pragma unpacks the contents of a constructor field by
removing one level of indirection, much like newtype. If
passed to a non-strict function, the compiler will rebox these types.
However, the -O compiler flag will try to eliminate as many
reboxings as possible.
Strict Bang
Next, each of these types are prefixed with a bang (exclamation
point) !, making them all strict. Strict means that these
values are all call-by-value as opposed to call-by-need (value). This
helps eliminate thunks and can improve performance in some cases.
Data.Int and Integer
The offset and length types are Ints (Data.Int).
Int is a fixed-precision integer type with guaranteed range
of at least -2^29 to 2^29 - 1. This is
different from Integer (Prelude),
which has an arbitrary range, but probably has worse performance because
it has no bounds.
Word8
The last type is ForeignPtr Word8. The
ByteString payload is a pointer to a single
Word8 in C. This is similar to an array in C. We need a
pointer to a value of a certain type, current location in the array and
the total length of an array. We will start with Word8
because it is less complicated then ForeignPtr. Here is the
definition from GHC.Word.
data {-# CTYPE "HsWord8" #-} Word8 = W8# Word#CTYPE Pragma
Word8 is an 8-bit unsigned integer type. There are a few
things here that may be confusing if you have never looked at the core
Haskell or GHC libraries. The CTYPE pragma associates
Word8 with a C type HsWord8 (more details here).
We know somewhere in the GHC language code there is a C type name
HsWord8.
Boxed and Unboxed Types
Next, we see a hash sign prefixed to the end of the constructor
W8# and the value Word#. In Haskell, the hash
sign # as a suffix to a name is an Unboxed
type. Unboxed means there is no pointer or heap allocation and the
type is a raw machine or primitive type. Most types in GHC are boxed,
the values of that type are represented by a pointer to a heap object.
Unboxed types cannot be defined in Haskell. They are built into the
language and compiler. Unboxed types are also lifted, meaning they
cannot have the value bottom _. W8# is a
unboxed (primitive) constructor and Word# is an unboxed
(primitive) type.
Many unboxed types in Haskell are simple bit-patterns such as
Int#, Float#, and Double#, but
there are also more complicated boxed types like Array#, a
pointer to a heap-allocated object because the value would be too large
to store in a register.
ForeignPtr
Finally, we will explore the definition of ForeignPtr,
defined in GHC.ForeignPtr.
data ForeignPtr a = ForeignPtr Addr# ForeignPtrContents
data ForeignPtrContents
= PlainForeignPtr !(IORef Finalizers)
| MallocPtr (MutableByteArray# RealWorld) !(IORef Finalizers)
| PlainPtr (MutableByteArray# RealWorld)
data Finalizers
= NoFinalizers
| CFinalizers (Weak# ())
| HaskellFinalizers [IO ()]What is a finalizer?
A finalizer is a routine that is invoked when the Haskell storage manager detects that - within the Haskell heap and stack - there are no more references left that are pointing to the ForeignPtr. Typically, the finalizer will, then, invoke routines in the foreign language that free the resources bound by the foreign object.
GHC.Prim
There are a variety of unboxed and boxed types
Addr#is also an unboxed type that points to something outside the garbarge-collected heap (GHC.Prim).MutableByteArray#is used for FFI data allocation (GHC.Prim and memory allocation details).RealWorldis a boxed type used as a type parameter by unboxed types (GHC.Prim).State#is unboxed type for concurrency (GHC.Prim).Weak#is an unboxed weak pointer (GHC.Prim).
IORef
IORef is a container for a mutable value (GHC.IORef).
newtype IORef a = IORef (STRef RealWorld a)STRef
We have seem RealWorld before, which is a type used by a
state thread. STRef is mutable variable in a state thread
s that contains a value of type a (definition).
data STRef s a = STRef (MutVar# s a)MultVar
Finally, MultVar# is also defined in GHC.Prim.
It is a single item mutable array.