Exploring Haskell - ByteString (Strict)
ByteString is one of the core GHC libraries (emphasis on GHC because it is not part of the Haskell 2010 Language Report and I am not sure if it is compatible with other compilers like Hugs). The ByteString library is probably a dependency (directly or indirectly) for most Hackage libraries.
ByteStrings are highly efficient string types. A
ByteString is a sequence of bytes (8-bit characters). There are strict (single large array) and lazy (call-by-need, good for streaming) varieties. For using the
ByteString library, the Haddock documentation is good to get started. We will briefly look at some of the source code for strict
ByteStrings to get a general understanding of the implementation. While this level of understanding is not necessary for every day use of
ByteStrings, it should be useful for having a deeper understanding of the GHC compiler.
Here is the definition of strict
ByteString from Data.ByteString in version
ByteString has a constructor
PS that takes three values: a payload, an offset and a length. The first think I noticed were the UNPACK pragmas. This pragma unpacks the contents of a constructor field by removing one level of indirection, much like
newtype. If passed to a non-strict function, the compiler will rebox these types. However, the
-O compiler flag will try to eliminate as many reboxings as possible.
Next, each of these types are prefixed with a bang (exclamation point)
!, making them all strict. Strict means that these values are all call-by-value as opposed to call-by-need (value). This helps eliminate thunks and can improve performance in some cases.
Data.Int and Integer
The offset and length types are
Int is a fixed-precision integer type with guaranteed range of at least
2^29 - 1. This is different from
Integer (Prelude), which has an arbitrary range, but probably has worse performance because it has no bounds.
The last type is
ForeignPtr Word8. The
ByteString payload is a pointer to a single
Word8 in C. This is similar to an array in C. We need a pointer to a value of a certain type, current location in the array and the total length of an array. We will start with
Word8 because it is less complicated then
ForeignPtr. Here is the definition from GHC.Word.
Word8 is an 8-bit unsigned integer type. There are a few things here that may be confusing if you have never looked at the core Haskell or GHC libraries. The
CTYPE pragma associates
Word8 with a C type
HsWord8 (more details here). We know somewhere in the GHC language code there is a C type name
Boxed and Unboxed Types
Next, we see a hash sign prefixed to the end of the constructor
W8# and the value
Word#. In Haskell, the hash sign
# as a suffix to a name is an Unboxed type. Unboxed means there is no pointer or heap allocation and the type is a raw machine or primitive type. Most types in GHC are boxed, the values of that type are represented by a pointer to a heap object. Unboxed types cannot be defined in Haskell. They are built into the language and compiler. Unboxed types are also lifted, meaning they cannot have the value bottom
W8# is a unboxed (primitive) constructor and
Word# is an unboxed (primitive) type.
Many unboxed types in Haskell are simple bit-patterns such as
Double#, but there are also more complicated boxed types like
Array#, a pointer to a heap-allocated object because the value would be too large to store in a register.
Finally, we will explore the definition of
ForeignPtr, defined in GHC.ForeignPtr.
data ForeignPtr a = ForeignPtr Addr# ForeignPtrContents data ForeignPtrContents = PlainForeignPtr !(IORef Finalizers) | MallocPtr (MutableByteArray# RealWorld) !(IORef Finalizers) | PlainPtr (MutableByteArray# RealWorld) data Finalizers = NoFinalizers | CFinalizers (Weak# ()) | HaskellFinalizers [IO ()]
What is a finalizer?
A finalizer is a routine that is invoked when the Haskell storage manager detects that - within the Haskell heap and stack - there are no more references left that are pointing to the ForeignPtr. Typically, the finalizer will, then, invoke routines in the foreign language that free the resources bound by the foreign object.
There are a variety of unboxed and boxed types
Addr#is also an unboxed type that points to something outside the garbarge-collected heap (GHC.Prim).
RealWorldis a boxed type used as a type parameter by unboxed types (GHC.Prim).
State#is unboxed type for concurrency (GHC.Prim).
Weak#is an unboxed weak pointer (GHC.Prim).
IORef is a container for a mutable value (GHC.IORef).
We have seem
RealWorld before, which is a type used by a state thread.
STRef is mutable variable in a state thread
s that contains a value of type
MultVar# is also defined in GHC.Prim. It is a single item mutable array.