CMU15-445 B+Tree Index
Last updated on a year ago
B+ Tree Index
Storage Models & Compression
Database Workload
- On-line Transaction Processing: Simple queries that only read/modify a small part of database, sometimes alse needing to read a entire tuple from database
- On-line Analytical Processing: Complex queries that read/modify a large portion of the database, usually needing to scan multiple tuple
- Hybird Transaction + Analytical Processing: The combination of OLTP and OLAP
Database Storage Models
DBMS store tuple in different ways that is batter for eithor OLTP or OLAP. The basic storage models are as follow:
- row storage(n-ary storage model, NSM): Simply store tuple with the entire attribute one after the other, which is ideal for OLTP.
- colume storage(decomposition storage model, DSM): DBMS contiguously store a single attribute(column) for all tuple, which is ideal for OLAP.
Since OLTP tends to read entry tuple and do a lot of inserts, NSM is ideal. On the other hand, let's say we want to extract some attribute from multiple tuple to analysis, NSM would cause expensive overhead.
So in this scenario, DBMS must use DSM to store tuple.
Therefore, we can extract some useful attribute of the tuple instead read all attribute of the tuple.
Tuple Extract
In cloumn storage, how can we extract a tuple from table. We have two approachs to handle it:
First, we can simply use offset to extract tuple. We can store same tuples' attributes in same offset. When we want to extract specific tuple, we can simply use offset to extract it.
Second, we can additional store tuple id to tuple attribute in cloumn. When extract one tuple, we can scan each column and match specific tuple id.
Compression
As an example of musql innoDB compression, which shown in the following figure.
Mysql default page size is 16KB, so we compress it and round to the power of 2. In the front of the compressed page , called mod log, is used to store update log. Let's say we want to update some tuple in compressed page, instead of decompressing it, we can add update log into mod log. When we want to evict certain compressed page, we would evict this page as well as mod log.
Let's say we want to read certain tuple without decompressing it first, this is ideal.
As the shown of the firgre, when we want to specify where clause, in this example, is 'andy', if we can replace 'andy' to another something, in this scenario, we can select in compressed data without decompress it.
The rest of this chapter, we would introduce some compression algorithm.
Run-Length Compression
We can compress data as a sequence of triplets, all of which contain the value of this triplet, the offset of this triplet and the number of value which triplet represents.
Further, if we can sort the original data before compression, we can get higher compression ratio.
Bit-Packing Compression
Bit-Packing is very intuitive, we can store the useful data instead of storing entire data.
Mostly Compression
This is varient of the bit-packing compression. In original data, if exist a specail data, which is large than any other, we can replace this value with a specail marker. If we scan for this marker, we would find this value in another table.
Bitmap Compression
We can store a separate bitmap for special attribute, where its offset is corresponding to the tuple.
It looks like this compression has high comression retio, but, to
make matter worse, let's say we have 1e5 tuple, and we use this encoding
for int32
, the comparition as following:
- Decompression: \(32\ bits\times 10^5\)
- Compression: \(INT32_{MAX}\ bits\times 10^5\)
Delta Compression
For time data or another with same attribute, for each value, we can store its difference with the first value, as the following figure show:
Incremental Compression
In common prefix table, the first entry is null, for every another
enrties, we consider what frefix portion of the front string I have. As
an example, robbed
has the prefix rob
in the
front string rob
, robbing
has the frefix
robb
in the front string robbed
, the rest of
string follow this patten.
Dirtionary Compression
This compression method is suitable for our ideal compression, we can
read compressed data without compress it. As an example of the follow
figure show, we replace 'andy' with integer number 30
This compression is also supprot range query, as the shown of following figure:
Memory Management
Storage model and Compression answer the question about how DBMS represent database in file on disk, and memory management answer the question about how DBMS move and memory data front and back from disk.
Initially, buffer pool is empty. If the above application, as the
figure shown, is execution engine, wants to fetch
page 2
. Buffer pool first need to read directory
page from disk, then read the required page.
Buffer Pool Manager
Page table maps page id
to frame id
as well
as tracks each page id
corresponding the
frame id
(maybe this page
would not store in
physical memory).
Because of the page table is shared resource, so we need to use
latch
to protect it from accessing other thread when
certain thread is accessing.
In database,
lock
used to pretect the content of database(e.g. tuple, table),latch
used to pretect the interal data structure(e.g. hashtable, regions of memory).
At this points, we need to clarify two concepts:
page table
and page directory
.
page table
: used to mappage id
toframe id
. It would be stored in memory without storing in diskpage directory
: used to track the location of every pages in database file. It must be record in disk, which allows DBMS can find certain page when restarts
Buffer Pool Optimizations
Multiple Buffer Pools
DBMS can maintain mulitple buffer pools in each scenario:
per-database buffer pool and per-page type buffer pool(page size vary
from 1K
to 16K
).
Partitioning memory across multiple pool can reduse latch contention and improve locality. We have two approach to implement it.
Every record ID has extending attribute, called object ID, and a mapping from object id to specific buffer pool can be maintainded. Another approach is hashing, which hashs page id to select a buffer pool.
Pre-fetching
At first, BPM would fetch page 0
, then when the query
cursor points to page 1
, BPM would fetch
page 1
. The query cursor may continue downword, so the BPM
would prefetching page 2
and page 3
, as the
following shown:
Therefore, when query cursor move to page 2
, BPM would
not fetch this page
Scan sharing
When Q1
cursor points to page 3
, the
content of buffer pool is page 3, 1, 2
and Q2
comes
In scan sharing, cursor of Q2
would jump to cursor of
Q1
instead of scanning from the beginning. Then,
Q1
and Q2
would scan downword together.
When the part of query of Q2
finished, the cursor of
Q2
would jump to the beginning and scan those page that
have not scanned.
Page Cache
DBMS can fetch certain page from file system
, at the
between of file system
and disk
, OS would
support a cache for those page, which can reduce the I/O form disk. But,
because of redundent copys of page, different eviction policies, DBMS
can bypass this cache layer by using O_DIRECT
.
Buffer Replacement Policy
Laest Recently Used
LRU would maintain a timestamp of each access page, and it would evict the oldest timestamp.
Clock
The Clock policy is approximation of LRU without timestamp. This
policy would organize page as circle buffer with 'clock hand'. Upon
swapping, clock hand would chick the page bit
is set to
'1'. If this bit is '1', then set to '0'; otherwise, evict this page.
The detail process is as follow:
If the ref bit is zero, which suggest that this page has not been accessed before clock hand pointed to it this time.
LRU-K
The porblem of this two approachs is that, the cacahe would susceptible for sequential flood. Let's say a scenario, BPM has record thress pages, then a flood of page fetched by using BPM, those flood page would only read once. After the flood, we want to read the three pages from last time, but now, those three pages no longer in BPM. The sequential flood would pollute the cache in BPM.
The varient of LRU policy, called LRU-K, can handle this situation. LRU-K would track the number of access time is greater and equal than K, for those access times less than K, they would be evicted firstly.
Hash Tables
Hash table implement an unordered associative array that maps keys to value. It used hash function to compute an offset into this array for a given key.
Space complexity: \(O(n)\)
time complexity: * Average: \(O(1)\) * Worse: \(O(n)\)
Designing a hash table requires consider two factor:
- Hash Function: how to map a large key space to smaller value domain
- Hashing Scheme: how to handle collision after Hashing
The basic design of hash table is as shown in the following figure:
Static Hash Scheme
Linear Probe Hashing
It uses a circular buffer of array slots and hash function maps keys to slot. For insert, firstly hashing a given key to a slot, if the slot is free, then the key-value pair is insertted into it; otherwise, adjacent slot are searched until a free slot is found. For lookup, firstly hashing a given key to slot, and linear search adjacent slot, if encounters an free slot, it means this value is not present in hash table.
The question of this approach is when search encounters a null slot which is deleted previously, it would not find the specific value that is mapped to the given key.
Two solutions is shownd as the following figure:
This approach would result in expensive cost, so it's not common used.
If the key is not unique, there are two approachs to handle it:
Robin Hood Hashing
This is varient of Linear Probe Hashing, it records the distence between current position and optimal position. Everytime we insert a key-value into it, it seeks to reduce the distence of each key from their optimal position.
Cuckoo Hashing
This approach maintain multiple hash table with different hash function. When we insert, it check computes hash function to check every table for finding a free slot. If no table has a free slot, it would randomly chose old enrty to evict and rehash the old entry. In the rare cases, we may trap into dead loop, we can rebuild the hash table by using larger table.
Dynamic Hashing Scheme
Chained Hashing
We can maintain a list of bucket for each slot in hash table. The concept of bucket is a container that can store one or more key-value pair.
When the bucket is full, we can allocate a new bucket to store new key-value pair, and add it into the tail of the slot of corresponding key.
If the new bucket is full again, this operation would perform over and over again. Obviously, if buckets gorw infinitely, this approach would degenrate into link list.
Extendible Hashing
Because of the result of hash function is an interge number, so in bucket Pointers, we can chose specific bits to map to one or more buckets. If the bucket is full and can not insert a new key-value pair, we can double the bucket pointer and reshuffle the content of all buckets.
In order to implement the map of bucket pointer and bucket, we set global bit and local bit. The former is used to detemine which slots in bucket pointe correspond to which bucket, as the shown of the following figure:
At this time, the bucket is full, so we need to double the bucket pointer. Then rehash the content of the all buckets
Linear Hashing
Instead of imediately double bucket pointer, Linear Hashing would increase bucket pointer gradually. At some time, if bucket is full, it would add a new bucket into the tail of the corresponding key slot and add a new bucket pointer. It would generate a new hash function and split pointer would downword. The new bucket pointer is the duplication of the old bucket pointer, so they would use different hash function.
As the above figure shows, when we add \(17\) into hash table, the bucket which
slot 1
maps is full, and the split pointer points
to slot 0
. So we would add a new bucket into
slot 1
and add a new bucket pointer. The old
bucket use hash function 1
, the new bucket use
hash function 2
Index
In order to gain faster random access, we can use an index structure. An ordered index stores the value of search key in sorted order and associates the record which search key corresponding to. Besides, index would store seperatedly in another file.
Clustering index & Non-clustering index
Clustering index would change the physical order and when I create it, the contain of this relation would be sorted in specific cloumn. Clustering index can create only one.
Non-clustering index would not change the order of original relation. It is created one or more cloumn to enhance the retrieval speed of specific queries.
Dense index & Sparse index
Dense index would creates index entries for every search key value, sparse key would creates index entries for some of the search key value, the details of them as the following figure shows: