Rope package:core-text

If you're accustomed to working with text in almost any other programming language, you'd be aware that a "string" typically refers to an in-memory array of characters. Traditionally this was a single ASCII byte per character; more recently UTF-8 variable byte encodings which dramatically complicates finding offsets but which gives efficient support for the entire Unicode character space. In Haskell, the original text type, String, is implemented as a list of Char which, because a Haskell list is implemented as a linked-list of boxed values, is wildly inefficient at any kind of scale. In modern Haskell there are two primary ways to represent text. First is via the [rather poorly named] ByteString from the bytestring package (which is an array of bytes in pinned memory). The Data.ByteString.Char8 submodule gives you ways to manipulate those arrays as if they were ASCII characters. Confusingly there are both strict (Data.ByteString) and lazy (Data.ByteString.Lazy) variants which are often hard to tell the difference between when reading function signatures or haddock documentation. The performance problem an immutable array backed data type runs into is that appending a character (that is, ASCII byte) or concatonating a string (that is, another array of ASCII bytes) is very expensive and requires allocating a new larger array and copying the whole thing into it. This led to the development of "builders" which amortize this reallocation cost over time, but it can be cumbersome to switch between Builder, the lazy ByteString that results, and then having to inevitably convert to a strict ByteString because that's what the next function in your sequence requires. The second way is through the opaque Text type of Data.Text from the text package, which is well tuned and high-performing but suffers from the same design; it is likewise backed by arrays. (Historically, the storage backing Text objects was encoded in UTF-16, meaning every time you wanted to work with unicode characters that came in from anywhere else and which inevitably were UTF-8 encoded they had to be converted to UTF-16 and copied into a further new array! Fortunately Haskell has recently adopted a UTF-8 backed Text type, reducing this overhead. The challenge of appending pinned allocations remains, however.) In this package we introduce Rope, a text type backed by the 2-3 FingerTree data structure from the fingertree package. This is not an uncommon solution in many languages as finger trees support exceptionally efficient appending to either end and good performance inserting anywhere else (you often find them as the backing data type underneath text editors for this reason). Rather than Char the pieces of the rope are ShortText from the text-short package, which are UTF-8 encoded and in normal memory managed by the Haskell runtime. Conversion from other Haskell text types is not O(1) (UTF-8 validity must be checked, or UTF-16 decoded, or...), but in our benchmarking the performance has been comparable to the established types and you may find the resultant interface for combining chunks is comparable to using a Builder, without being forced to use a Builder. Rope is used as the text type throughout this library. If you use the functions within this package (rather than converting to other text types) operations are quite efficient. When you do need to convert to another type you can use fromRope or intoRope from the Textual typeclass. Note that we haven't tried to cover the entire gamut of operations or customary convenience functions you would find in the other libraries; so far Rope is concentrated on aiding interoperation, being good at appending (lots of) small pieces, and then efficiently taking the resultant text object out to a file handle, be that the terminal console, a file, or a network socket.