Exploring Haskell - ByteString (Strict)

2017-06-08
haskell

ByteString is one of the core GHC libraries (emphasis on GHC because it is not part of the Haskell 2010 Language Report and I am not sure if it is compatible with other compilers like Hugs). The ByteString library is probably a dependency (directly or indirectly) for most Hackage libraries. ByteStrings are highly efficient string types. A ByteString is a sequence of bytes (8-bit characters). There are strict (single large array) and lazy (call-by-need, good for streaming) varieties. For using the ByteString library, the Haddock documentation is good to get started. We will briefly look at some of the source code for strict ByteStrings to get a general understanding of the implementation. While this level of understanding is not necessary for every day use of ByteStrings, it should be useful for having a deeper understanding of the GHC compiler.

Data.ByteString

Here is the definition of strict ByteString from Data.ByteString in version 0.10.8.1.

UNPACK Pragma

ByteString has a constructor PS that takes three values: a payload, an offset and a length. The first think I noticed were the UNPACK pragmas. This pragma unpacks the contents of a constructor field by removing one level of indirection, much like newtype. If passed to a non-strict function, the compiler will rebox these types. However, the -O compiler flag will try to eliminate as many reboxings as possible.

Strict Bang

Next, each of these types are prefixed with a bang (exclamation point) !, making them all strict. Strict means that these values are all call-by-value as opposed to call-by-need (value). This helps eliminate thunks and can improve performance in some cases.

Data.Int and Integer

The offset and length types are Ints (Data.Int). Int is a fixed-precision integer type with guaranteed range of at least -2^29 to 2^29 - 1. This is different from Integer (Prelude), which has an arbitrary range, but probably has worse performance because it has no bounds.

Word8

The last type is ForeignPtr Word8. The ByteString payload is a pointer to a single Word8 in C. This is similar to an array in C. We need a pointer to a value of a certain type, current location in the array and the total length of an array. We will start with Word8 because it is less complicated then ForeignPtr. Here is the definition from GHC.Word.

CTYPE Pragma

Word8 is an 8-bit unsigned integer type. There are a few things here that may be confusing if you have never looked at the core Haskell or GHC libraries. The CTYPE pragma associates Word8 with a C type HsWord8 (more details here). We know somewhere in the GHC language code there is a C type name HsWord8.

Boxed and Unboxed Types

Next, we see a hash sign prefixed to the end of the constructor W8# and the value Word#. In Haskell, the hash sign # as a suffix to a name is an Unboxed type. Unboxed means there is no pointer or heap allocation and the type is a raw machine or primitive type. Most types in GHC are boxed, the values of that type are represented by a pointer to a heap object. Unboxed types cannot be defined in Haskell. They are built into the language and compiler. Unboxed types are also lifted, meaning they cannot have the value bottom _. W8# is a unboxed (primitive) constructor and Word# is an unboxed (primitive) type.

Many unboxed types in Haskell are simple bit-patterns such as Int#, Float#, and Double#, but there are also more complicated boxed types like Array#, a pointer to a heap-allocated object because the value would be too large to store in a register.

ForeignPtr

Finally, we will explore the definition of ForeignPtr, defined in GHC.ForeignPtr.

What is a finalizer?

A finalizer is a routine that is invoked when the Haskell storage manager detects that - within the Haskell heap and stack - there are no more references left that are pointing to the ForeignPtr. Typically, the finalizer will, then, invoke routines in the foreign language that free the resources bound by the foreign object.

GHC.Prim

There are a variety of unboxed and boxed types

  • Addr# is also an unboxed type that points to something outside the garbarge-collected heap (GHC.Prim).

  • MutableByteArray# is used for FFI data allocation (GHC.Prim and memory allocation details).

  • RealWorld is a boxed type used as a type parameter by unboxed types (GHC.Prim).

  • State# is unboxed type for concurrency (GHC.Prim).

  • Weak# is an unboxed weak pointer (GHC.Prim).

IORef

IORef is a container for a mutable value (GHC.IORef).

STRef

We have seem RealWorld before, which is a type used by a state thread. STRef is mutable variable in a state thread s that contains a value of type a (definition).

MultVar

Finally, MultVar# is also defined in GHC.Prim. It is a single item mutable array.

References