HenchLua: Representing Values
Values
Lua supports the following standard types:
- Nil
- Booleans
- Numbers (doubles)
- Strings (that is, byte-strings)
- Tables
- UserData (arbitrary object references)
- Functions (both Lua functions and callable user code)
- Threads (coroutine state)
Similarly to .NET types, that list can be split into reference and value types. Nil, booleans, and numbers are true value types, while strings behave like value types due to their immutability. The rest are reference types. Now, since any variable may be of any type, and since that type may change dynamically, the backing storage for values needs to accept all of the above types.
Values in Standard Lua
In standard Lua this is done using a tagged union, which looks something like this (not a direct cut from the Lua source, I've evaluated some of the macros and typedefs and reformatted for clarity - this may also come out differently on other architectures):
struct lua_TValue
{
union
{
struct
{
union Value
{
GCObject *gc; /* collectable objects */
void *p; /* light userdata */
int b; /* booleans */
lua_CFunction f; /* light C functions */
} v__;
int tt__;
} i;
double d__;
} u;
};
Alright, so let's make sense of this. A Lua value is a union of a double (d__
, for storing number values) and a struct (i
) which is a combination of fields for storing all the other possible types of values (v__
) and a field (tt__
) which keeps track of the actual value.
The tt__
field's position in the i
struct and its values are all carefully chosen such that the useful number values and non-number types are distinguishable (if you try to read a non-number as a number you'll see some kind of NaN, and the Lua VM asserts on arithmetic operations that produce NaNs).
This makes lua_TValue
eight bytes long, which is a wonderfully efficient state of affairs.
Values in HenchLua
Unfortunately, there's no way to match the 8-byte value type in C# without boxing (which is counter to one of the primary design goals - being nice to the GC). So what can be done? A naive approach would be to create a struct with a field for each of the value types and a field of type object
for the reference types, along with yet another field to act as the equivalent of tt__
.
public enum ValueType
{
// ...
}
public struct Value
{
private object asRefType;
private double asNumber;
private bool asBool;
private ValueType type;
}
Unfortunately, that Value
type would be somewhere around twenty bytes long (alignment, padding - let's not even bring up x64). That's unacceptably large.
My first attempt to reduce Value
's size was the FieldOffset attribute, which is the obvious way to implement unions in C#. I didn't have much success with that approach. For one thing, the object
field cannot be overlaid over the other fields (just imagine the havoc it would play with the GC), so all I have to play with are asNumber
, asBool
, and type
. While that does indeed bring our struct size down to twelve bytes (which happens to be optimal), it's brittle since I can't actually put the type
field where I want it on all platforms - there's no way to dynamically compute the offset values to account for differences in endianness and alignment between architectures, and there goes the goal of working on all sorts of .NET runtimes.
So I took a step back and looked at the fields one at a time. The first thing that struck me was that asBool
could easily be replaced by reinterpreting asNumber
- zero is false, one is true. The annoying thing about that was that I'd constantly be loading and testing an 8-byte register when there's really only one bit's worth of data around.
But these tests would always come after testing type
. So the first change I made was to split ValueType.Bool
into ValueType.True
and ValueType.False
(naturally this could be hidden behind a public interface that only exposes a Bool
enumerant, to keep things simple for external code).
After that change, all that was left to eliminate was the type
field. I already knew that overlaying it over asNumber
wouldn't work, so all that left was somehow overlaying it over asRefType
. Sentinel object instances to the rescue:
public struct Value
{
internal object RefVal;
internal double NumVal;
internal static readonly object NumTypeTag = new object();
}
internal sealed class BoolBox
{
public readonly bool Value;
private BoolBox( bool value ) { Value = value; }
public static readonly BoolBox True = new BoolBox( true );
public static readonly BoolBox False = new BoolBox( false );
}
The final semantics are fairly straight-forward. RefVal
always carries the type information. For reference types, the already existing .NET type info is sufficient to identify the actual value. For true value types, I either use a preallocated and immutable boxed value (for booleans), or I use the sentinel value Value.NumTypeTag
(which tells us that the actual value is in the NumVal
field). And null
obviously means nil
.
Strings
Lua strings are byte arrays. That's unfortunate, because it means the standard System.String
type can't be used directly. So, first step in writing a custom type is gathering requirements. Lua strings are:
- Byte arrays - they can readily contain embedded zeroes
- Immutable
- Very often used as keys to a hashtable
- Sometimes used to hold large blocks of data
- Reference types
So our implementation needs to be compact and quick to compare. The naive approach follows:
class LString
{
private byte[] data;
//cache the hash code to keep things snappy
private int hashCode;
}
Unfortunately, due to how ubiquitous strings are in Lua, this type violates some of our fundamental requirements. First, it's actually two objects, and that wastes memory since each object has some overhead. Second, it adds a level of indirection: when we need the string data (say, to compare values) we first need to load the LString
object and only then can we read the data
object. That's up to two cache misses where there should only be one, which is relevant in a tight loop like the one at the heart of the GC's marking phase.
Fortunately, while strings are reference types, their immutability makes them behave like value types, which allows us to expose the public interface through a struct, with no loss of clarity, while internally handling the data as a byte array:
struct LString
{
//the first four bytes contain the hash of the remaining data
internal byte[] InternalData;
public bool IsNil { get { return InternalData == null; } }
public int Length
{
get
{
return InternalData != null ? InternalData.Length - 4 : 0;
}
}
// ...
}
When constructing a Value
which contains a string, we don't box the LString
struct, we just grab the InternalData
field directly. A RefVal
of type byte[]
is understood to mean string.
But what if the user gives us a byte[]
as user data? This is probably a rare case (relative to how ubiquitous strings are in Lua), so we handle it by allocating a small proxy object around the user data. This is hidden from the user, so the library interface stays simple.
A small aside: I had originally named the type String
, which was fine and worked well inside the HenchLua
namespace. However, outside that namespace, it was constantly conflicting with System.String
, and after the fourth or fifth time I wrote using LString = HenchLua.String;
I decided to just rename the type for my sanity's sake.
Callables
Callable Lua objects are represented using any of the following types:
public abstract class Function;
internal class Proto : Function;
internal class Closure : Function;
public abstract class UserFunction : Function;
public delegate int UserCallback( Thread thread );
That last one, the delegate, complicates things for us. In all the other cases, it would be enough to treat types derived from Function
specially. However, delegates are too useful to ignore (particularly since they can be cleanly constructed around lambdas, and anonymous and static methods).
I use a trick similar to the LString
struct to keep things clear. The Callable
struct wraps either a Function
or a UserCallback
in the public interface, while the raw object values are passed around internally without any boxed or proxy objects being in the way.
Table
And, finally, this brings me to Lua's core structured type: the table. I'm not going to go into too much detail on the underlying algorithms here, as Table
is a fairly direct port of Lua's implementation. If you're curious, either the Table
code or Lua's luaH_*
functions will tell you everything you need to know, though Table.cs
might be easier for someone new to Lua to understand since nothing is buried in macros.
Table mainly differs from the standard Lua implementation in its interface. Since tables are ordinary .NET objects, with no reliance on any sort of global state, there's no need to hide them behind Lua's cumbersome stack interface. Raw access is directly exposed as an indexer. (Non-raw access requires an execution context, in case metamethods need to be called, and must therefore be done through a Thread
object.)
The only interesting implementation detail is the way table members are stored. The thing with tables is that their storage is often rather sparse (especially in the hashtable part), and using full Value
types would be wasteful. Tables instead use the internal CompactValue
type, which works like Value
except that they don't have a separate NumVal
field. Instead, numbers are boxed.
One thing to note, however, is that they aren't boxed using the standard .NET boxing mechanism. The reason for this is that I wanted to reuse the boxes when possible, to keep allocation pressure low, and the standard boxes are impossible to efficiently mutate in C#.
This is the main exception to the "don't allocate where standard Lua wouldn't" rule. Lua can allocate memory when setting nil fields to non-nil values. HenchLua can, in addition, allocate when setting non-number fields to number values. However updating a number value won't allocate additional boxes. That compromise was made to save memory and to make the GC run a little faster when scanning a table's fields, and I think it's a net win since, in my experience, the type of a table element doesn't often change dynamically in common Lua code.