The Design of Hash Table in Lua 5.4
Last updated on 9 minutes ago
- Design of Table
- Operation for Table
Design of Table
1 |
|
- Memory occupation
It should be noted that Node is defined as a union which encompasses both the key and the corresponding value for key-value pairs. Additionally, Lua reorders the fields inside Node with the aim of optimizing memory usage.
Take a 64-bit system as an example. In this context, the size of Node
calculates to be 3 * 8 = 24
.
1 |
|
In contrast, if we define the Node
structure as
follows:
1 |
|
In a 64-bits system, it would occupy 4 * 8 = 32
bits.
- The design of the
next
field
All hash nodes are stored in an array (with the type being
Node
and constituting the hash part of the table). However,
different nodes that have the same hash value will be chained together,
and this chaining is achieved through the next
field.
We don't need to be concerned about the overflow issues of the
int
type. This is because the capacity of the hash part
will not exceed the maximum value that an int
can
represent. Hence, using int
for addressing purposes is
adequate for this context.
1 |
|
For this definition, there are several key points to be pay attention:
array
andnode
field are consecutive, without another fields gap between them. Thanks to this design, we can iterate over both the array part and the hash part (represented by the node field) of the table using the same index, as is done in theluaH_next
function.lsizenode
represents the ceiling of the base-2 logarithm of the size of the node array, that is,ceil(log2(node size))
.- Moreover, the capacity (the space allocated by the allocator) of the hash part is invariably a power of 2, and the same applies to the array part.
- The capacity of hash part can be calculated by
2^(lsizenode)
alimit
does not represent the actual size(or real size) of array part. We can regard it as a hint, which helps us quickly determine whether an integer index exists in the array part or not.- The capacity of the array part is the smallest power of 2 that is
greater than alimit, which is calculated by
luaH_realasize
. - If we explicitly call the setrealasize macro, alimit will represent the real size of the array part. It should be noted that this occurs only when we call the setlimittosize function.
- In other cases, alimit may be set to other values to represent a "false positive" boundary of the array part (merely serving as a hint).
- The capacity of the array part is the smallest power of 2 that is
greater than alimit, which is calculated by
Operation for Table
For basic operations of tables, our focus will be on their semantics, specifically the implications of input and output as well as their internal implementation.
Auxiliary function
Finding main position of key
Given a key, we try to find its main position, which refers to the location where we directly apply the hash function to calculate the corresponding index.
- If the key is an integer, we directly hash its value.
- If it is a floating-point number, we use a custom hash function for floating-point values.
- For boolean values, a key of this type will always be mapped to either an index of 0 or 1 within the hash part of the table.
- For strings, we utilize their hash values for addressing purposes. In particular, for long strings, we calculate their hash values in a lazy manner.
- For light userdata, light C functions, and other types, we hash their pointer addresses.
1 |
|
The contrast version of the last function, given a hash node, tries to find the main position of its key.
It should be noted that the reason we introduce these two functions is that in case a collision occurs (i.e., when two different keys are mapped to the same hash slot), we have to move one of them to another position (in Lua, typically to a higher address space). So, we must write a function to obtain the main position of a key.
1 |
|
Get free position
Starting from t->node
, this function iterates through
the nodes one by one to find the first node with an empty key. It should
be noted that this function does not check whether the capacity is
sufficient or not. Hence, this additional check must be carried out
prior to calling this function.
1 |
|
Array part
The following two functions should be considered together. The
arrayindex
function merely checks whether an integer key is
out of bounds (i.e., its index is greater than the largest array
capacity). If it is out of bounds, the function returns zero; otherwise,
it converts the integer to an unsigned int
.
The second function attempts to find the integer index of a given
key. This function should explicitly specify the array capacity. Let's
explore the details. The findindex function is only called within
luaH_next
, and we can regard it as an auxiliary function
for luaH_next
.
Since the integer keys in Lua start from 1, while the
t->array
indexing starts from 0
, we should
check key - 1
instead of key to determine whether it is out
of bounds. Due to this offset, if we pass the last integer key of the
array part to findindex, it will not actually retrieve that key.
For example, if in the Lua layer the table is defined as
t = {[1] = 1, [2] = 2, [3] = 3, ["key"] = "val"}
and we
push 3
onto the stack as the starting key for
luaH_next
, it will not iterate over key 3
but
rather the key "key". This is because findindex
will return
3
in this scenario. Although 3
is less than
asize
, t->array[3]
is empty (only
t->array[0:2]
have values), so it proceeds to iterate
over the hash part.
In the hash part, we should maintain this property. If we simply
return i + asize
, it will iterate over the input key, which
does not meet our expectations. Instead, we should return
(i + 1) + asize
.
1 |
|
Rehash
The following series of functions implement the rehash and resize functionality of tables. If we read the source code from top to bottom in its original order, it's extremely difficult to understand the implementation of rehash. Therefore, we rearrange the order for better comprehension.
Let's first take a look at the higher-level function, which can assist in understanding the semantics of each auxiliary function (at the lower level). Thanks to the annotations after each line, we can roughly understand the functionality of each function.
The numusearray
function is used to count the number of
keys in the array part, while the numusehash
function is
used to count the number of integer keys in the hash part. The variable
na
represents the number of keys in the array part, and the
array nums
with a size of MAXABITS + 1
serves
as a buffer to distribute each integer key into intervals with a size
that is a power of 2 and count how many keys exist in each interval. For
example, if the keys are dense, say in the range [1, 100]
,
the content of nums
should be as follows:
1 |
|
Based on the implication of this array, the countint
function is used to distribute (by applying the ceiling of
log2
) an integer key to the proper index of the
nums
array. It should be noted that the return values of
both numusearray
and numusehash
represent the
number of keys (regardless of their types) in the array and hash parts
respectively. However, internally, numusehash
only
accumulates integer keys and stores the result in nums
and
na
. Now that we understand the meaning of each variable,
where na
is the number of keys in the entire table and
nums
is the distribution of each integer key, we pass these
two variables to the computesizes
function to calculate the
new size of the array part. This new size should ensure
that half of the integer keys go into the new array part. Note that we
pass na
by reference to computesizes
, so it
will assign the number of integer keys that should go into the array
part. After obtaining the new size of the array part, we pass it along
with the new size of the hash part (total keys minus all integer keys
going to the array part) to resize the whole table.
1 |
|
This function calculates the ceiling of log2
of a key
and adds the value to the corresponding index in the nums
array.
1 |
|
The following two functions calculate the number of keys in the array
part and hash part respectively. For numusehash
, it simply
iterates through all the hash nodes in reverse order and calls
countint
to accumulatively count the integer keys. For
numusearray
, it manually distributes each integer key into
the nums
array.
1 |
|
Now that we understand the meaning of the two input arguments, where
nums
represents the distribution of integer keys and
pna
represents the number of integer keys, this function
should return the new size (referred to as the optimal size) of the
array part, and this return value should possess several attributes:
- The optimal size should be a power of 2.
- Half of the optimal size should be less than the total number of integer keys, and the optimal size should be greater than or equal to the number of total integer keys.
The internal logic of this function is relatively straightforward. It
accumulates the values in nums
by index i
and
checks whether the current accumulated value a
is greater
than 2^i
. If so, it assigns optimal
to the
value 2^i
and na
to the current accumulated
value a
, indicating how many integer keys should go to the
new array part.
1 |
|
The following four functions are relatively straightforward, and we'll briefly look at them.
The setnodevector
function resets the hash part of the
table according to the input size
(where size
represents the number of elements rather than bytes). If
size
is zero, it doesn't actually allocate space but
instead assigns t->node
to the global static dummy node.
This means that if many tables are empty, the t->node
field of these tables will have the same value. If we happen to load
ltable.c
multiple times (where the global static dummy node
is defined), it might cause some bugs. For each hash node, we simply set
its key and value to empty. For t->lastfree
, it will be
assigned to the next position of the last node (to find a free node,
refer to the getfreepos
function).
The reinsert
function simply iterates through all
non-empty nodes from ot
and inserts them into
t
by calling luaH_set
.
The exchangehashpart
function exchanges the hash fields
(specifically, t->node
, t->lsizenode
,
and t->lastfree
), without performing a deep copy.
The getfreepos
function attempts to find the first empty
node starting from the last node of the entire hash node array. If all
nodes are not empty, it returns NULL
, indicating that there
is no free space and rehashing is required.
1 |
|
The following function is used to resize the entire hash table.
According to the explanation in the Rehash
section, we already understand the meaning of the two input arguments.
newasize
represents the new size of the array part, and
nhsize
represents the new size of the hash part. Pay
attention to the definition of size here; it refers to the number of
elements rather than bytes. At the same time, the values of
newasize
and nhsize
might not initially be
powers of 2, but the capacities of these parts must eventually be powers
of 2.
This function declares a table on the stack and only
initializes its hash part (setting all elements to empty), intending to
use it as a temporary hash part of the table. The process of resizing is
as follows: - First, check whether the array part of the table will
shrink or not. - If it will shrink, insert elements that are out of
bounds into the hash part of the temporary table. - Here, a question
might arise as to why we pretend the array has a new (smaller) size. -
This is because we should shrink t->alimit
in case
original keys existing in the array part might be reinserted into the
array part (while we expect these keys to go into the hash part). -
Otherwise, we simply reallocate the array part. If the reallocation
fails, we release the hash part of the temporary table and raise an
error with the array already resized. - Second, we set the key and value
to empty in the newly allocated space, exchange the hash part with the
larger one (exchange the hash part of the current table with that of the
temporary table), and insert keys that previously existed in the array
part into the hash part.
1 |
|
The following function is just a wrapper around
luaH_resize
. The input argument nasize
represents the new array size. We pass it along with the size of the
hash part to the luaH_resize
function, meaning that our
intention is only to resize the size of the array part.
1 |
|
In general, when dealing with these functions related to table
rehashing and resizing, it's crucial to understand how each function
plays a role in maintaining and modifying the internal structure of the
table. The operations are designed with careful consideration of memory
management and the efficient organization of key-value pairs within the
table. For instance, the way luaH_resize
handles different
scenarios of array shrinking or expanding, along with the proper
handling of hash part exchanges and element reinsertions, showcases the
complexity and precision required in this aspect of the Lua table
implementation.
Moreover, functions like setnodevector
,
reinsert
, and exchangehashpart
work in concert
to support the overall resizing process. setnodevector
takes care of initializing or reusing the hash part based on the
specified size, ensuring proper setup of nodes. reinsert
is
responsible for moving elements from one table's hash part to another,
while exchangehashpart
enables the seamless swapping of
hash-related fields between different tables when needed.
All these functions together form a comprehensive mechanism for handling the dynamic nature of Lua tables, allowing them to adapt to changes in the number of elements and maintain optimal performance in terms of memory usage and access speed.
Insert operation
As indicated in the annotations, this function aims to insert a new key into the table following these steps: - Check whether the key's main position is free: - If it's free, then the insertion is completed. - Otherwise, check if the colliding key is in its main position: - If so, the new key will be placed in an empty position (without moving the colliding key). - Otherwise, move the colliding key to an empty position and insert the new key into its main position.
Miscellaneous points: - If the new key is a floating-point number, an
attempt will be made to convert it to an integer. This is done by
applying the floor function and checking whether the result is equal to
the original floating-point value. - mp
represents the main
position of the new key (if not empty, it's the colliding key), and
othern
represents the main position of the colliding key
(which could be mp
). - Regarding why looping with the
condition othern + gnext(othern)!= mp
will eventually find
the previous key of mp
: - Since the colliding key was
inserted before the new key, its main position must precede that of the
new key. Therefore, if we iterate from the colliding key's main
position, we can eventually reach the current colliding position. -
After finding the previous key of the colliding key, we insert the new
key after it (by copying mp
to f
), and then
correct the next
fields of f
and
mp
(gnext(f)
should be set to the distance
between mp
and f
, and gnext(mp)
should be set to zero as it becomes the last node in this chain).
1 |
|
Get operation
Main get function & Generic get function
All get functions are designed to return the value corresponding to the input key.
Main get function
1 |
|
Generic get function
This function simply checks whether the key in a slot is identical to the input key, starting from the main position of the input key.
In Lua, the gnext
macro is used to chain each hash slot.
If a slot has an empty next
field (with a value of zero),
it means that it's the last slot in the hash table.
1 |
|
Next function
This function simply iterates over all elements starting from the
index of the key obtained by findindex
. We've already
explained the inner workings of findindex
, which returns
the next index after the key's index. The implementation of
luaH_next
is quite intuitive as a result.
1 |
|
Auxiliary get function
Get integer from hash table
The key point of this function is to check whether an integer key lies within the range of the array part.
In the simplest situation, if the key is in the range
[1, t->alimit]
, we can directly retrieve it from the
array part (note that alimit
may not represent the real
size of the array part, but in this scenario, we don't need to concern
ourselves with that).
If not, since alimit
may not be the real size of the
array part, we perform the following check:
- Objectives: Check whether
key - 1
is within the capacity of the array part.- Firstly, we know that the capacity of the array part is the smallest
power of 2 that is greater than
alimit
, which means2^p < alimit <= 2^(p + 1)
. - Secondly, we know that
key - 1
is now greater than or equal toalimit
. - Therefore, we should check whether
key - 1
is less than2^(p + 1)
or not.
- Firstly, we know that the capacity of the array part is the smallest
power of 2 that is greater than
We can clear the p
-th bit of key - 1
, which
we'll call res
.
If key - 1
is greater than or equal to
2^(p + 1)
, res
will have some bits higher than
p
, so it must be greater than or equal to
alimit
(alimit <= 2 ^ (p + 1)
). If
key - 1
is less than 2^(p + 1)
,
res
will definitely be less than 2^p
(since
the p
-th bit is cleared), so it must be less than
alimit
(alimit > 2^p
).
Clearing the p
-th bit of key - 1
can be
achieved by applying the operation
(key - 1) & ~(alimit - 1)
. Since
2^p < alimit <= 2 ^ (p + 1)
, we have
2^p <= alimit - 1 < 2 ^ (p + 1)
. Flipping
alimit - 1
will set all bits higher than the
p
-th bit to one and the p
-th bit to zero, and
other bits can be ignored. Therefore, performing a binary AND operation
between key - 1
and ~(alimit - 1)
will clear
the p
-th bit of key - 1
.
If this key does not exist in the array part, we directly pass its
value to the hash and perform probing in the hash part. The probing
process is similar to that of the getgeneric
function,
where we find its main position and iterate from the low address to the
high address.
1 |
|
Get short string from hash table
Since strings always exist in the hash part, its getter function is quite intuitive. It involves calculating the main position of the input key and then iterating through the nodes one by one.
1 |
|
Get long string from hash table
For long strings, we directly call the generic get function.
It should be noted that the way to get short strings is similar to that of getting long strings. Regardless of the string's length, we both leverage its hash value to address it within the hash part. The difference lies in the fact that the hash value for short strings is pre-calculated, while that for long strings is lazily calculated.
1 |
|
Set function
The slot
field is a pointer to the value corresponding
to the key
. Its value should be obtained through the
previous getter function, and then we can directly use the
setobj
macro to set its value.
If isabstkey(slot)
evaluates to true
, it
indicates that this key does not exist in the table. In such a case, we
should insert a new key into it.
1 |
|
The following two functions are wrappers around
luaH_finishset
. The first one is a generic function used to
set a key of any type into the table, and the second one is a specific
function designed to set an integer key into the table.
1 |
|
Finding a boundary
We need to clarify several basic facts:
What's the real size for the array part
- We already know that the capacity of the array part of a table is
always the smallest power of 2 that is greater than
t->alimit
. This implies thatt->alimit
is always less than or equal to the capacity. - Therefore, if
t->alimit
is equal to the capacity of the array part, we consider the currentt->alimit
to be the real size of the array part, meaning thatisrealasize
will returntrue
. - Besides, there is a macro
limitequalsasize
used to check whetheralimit
represents the real size. The condition is relatively simple: it just checks whether the 7th bit of the table's flag is set to zero and whethert->alimit
is a power of 2. - The
luaH_realasize
function is used to return the real size of the array part. Its calculation method involves finding the smallest power of 2 that is greater thant->alimit
.
The implication of
ispow2realasize
- First, when the macro
isrealasize
returnstrue
, it means thatt->alimit
is equal to the capacity of the array, and vice versa. - If
isrealasize
returnsfalse
, it indicates that the real size of the array part is a power of 2. This is because the real size of the array is equivalent to its capacity, and the latter is always a power of 2. - If
isrealasize
returnstrue
, it means thatt->alimit
equals the real size of the array, which is also the capacity of the array. In this case, we then check whethert->alimit
is a power of 2. - It's important to note that
isrealasize
is always updated synchronously with whethert->alimit
is equal to the real size of the array part. - This macro is used to check whether the real size of the array is a power of 2 (although one might think it's somewhat redundant since the capacity of the array part is always allocated as a power of 2).
The following function, luaH_getn
, is used to find a
boundary of the table. That is, it finds an index i
such
that t[i]
is present but t[i + 1]
is absent
(here, i
represents the boundary). Also, it should be noted
that its return index is in Lua form (base 1), while its
internal processing index is in C form (base 0). According to
the annotation before the function, it's relatively easy to understand,
but we need to clarify the functionality of the binsearch
and hash_search
functions.
For the binsearch
function, it attempts to find a
boundary within the range from i
to j
.
Additionally, it has two properties: - For the element pointed to by
i
, it's present; for the element pointed to by
j
, it's absent. - So, internally, when we encounter an
element that is present, we update the left endpoint; and when we
encounter an element that is absent, we update the right endpoint.
For the hash_search
function, it tries to search for a
boundary in the hash part of the table. - The input argument
j
indicates that this element (which is j
, an
element in the table) is present (and j + 1
is also
present). We then try to find a boundary starting from this element
within the hash part. - This function keeps doubling j
until it encounters an absent element. After that, we perform a binary
search within the hash part between the two elements.
Note that luaH_getn
is used to obtain the boundary of
the table, and this boundary is always the maximum boundary if multiple
boundaries exist. For example, if we have a table
t = {1, 2, 3, [5] = 5, [7] = 7}
, according to the
definition of a boundary, the possible boundaries could be
3
, 5
, and 7
because
t[i]
is present while t[i + 1]
is absent. Due
to the existence of multiple boundaries, we always return the
largest (rightmost) boundary.
1 |
|
The key point of this form of binary search is that the left endpoint always satisfies the condition while the right endpoint always does not. Therefore, we initialize the left and right endpoints to be left closed and right open, and we eventually return the left endpoint since it always satisfies the condition. Similarly, if we want to return the right endpoint (i.e., the right endpoint always satisfies the condition), we should initialize the left and right endpoints to be left open and right closed.
The function binsearch
is actually the last part of
hash_search
, so we'll omit it for now.
1 |
|