This blog post is about an issue I’ve run into several times but am still unsure how to solve in a nice way. Consider this more of an “organize-my-thoughts” post than a solution to the problem.

In many cases I end up in a situation where I want to pass a bunch of data from one subsystem to another. The data is organized into different types using structs or classes. When there is more than one type of data, the types may have different sizes, and in the most common use case, at least for me, such objects share a common base. Examples include different types of constraints, generated by collision detection and joints, being fed into the rigid body solver, or draw calls being fed into the graphics system. But let’s take a traditional event system as a simplified example:

struct Event
{
	unsigned char type;
	int flags;
};

struct SoundEvent : public Event
{
	Sound* sound;
	Vec3 worldPos;
	float volume;
};

struct CollideEvent : public Event
{
	Object* a;
	Object* b;
};

These events can be produced from various locations in the code and some data structure is needed to store them. At some point we need to traverse and dispatch them to listeners. In this case, one could imagine some direct callback approach, but it can get messy in a threaded environment, so let’s say we decide it’s a good idea to queue them up and dispatch them later.

There are multiple ways we could store these events in memory. We could, for instance, allocate each new event on the heap, store the pointers in an array, walk that array when we dispatch, and then delete the events. But assuming this happens every frame in a game loop, that could mean a lot of allocations per frame, and the events would be scattered all over memory, leading to a lot of cache misses.

void addEvent(Event* event)
{
	events.push_back(event);
}

void dispatch()
{
	for(int i=0; i<events.size(); i++)
	{
		dispatch(events[i]);
		delete events[i];
	}
	events.clear();
}

Another solution is to keep a separate array for each object type, so each subclass of Event would have its own container in the event system. This keeps all events of the same type nicely packed in memory, but it leads to more code. That is probably not a big problem if the number of different types is small, but once it gets bigger, the code becomes harder to maintain:

void addEvent(Event& event)
{
	if (event.type == SOUND)
		soundEvents.push_back((SoundEvent&)event);
	else if (event.type == COLLIDE)
		collideEvents.push_back((CollideEvent&)event);
}

void dispatch()
{
	for(int i=0; i<soundEvents.size(); i++)
		dispatch(soundEvents[i]);
	for(int i=0; i<collideEvents.size(); i++)
		dispatch(collideEvents[i]);
}

One could of course also imagine multiple addEvent methods that take different types, but the concept and the amount of code duplication are similar. This solution also has the drawback of not maintaining the order in which the events were issued, which may or may not be a problem.

If the order is important, we could keep a separate array of pointers into the other arrays and traverse it when dispatching, adding an extra level of indirection, but it’s a bit awkward and sort of counteracts the whole idea of packing the objects tightly in memory.
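A minimal sketch of that order-preserving indirection, assuming per-type containers like the ones above (the EventRef struct, the overloaded addEvent methods, and the simplified event types are illustrative, not from a real implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative event types; the real ones carry more data.
enum EventType { SOUND, COLLIDE };
struct SoundEvent   { float volume; };
struct CollideEvent { int a, b; };

// One tightly packed array per type...
std::vector<SoundEvent>   soundEvents;
std::vector<CollideEvent> collideEvents;

// ...plus one array recording issue order: which container
// each event went into, and where in that container.
struct EventRef { EventType type; size_t index; };
std::vector<EventRef> order;

void addEvent(const SoundEvent& e)
{
	order.push_back({SOUND, soundEvents.size()});
	soundEvents.push_back(e);
}

void addEvent(const CollideEvent& e)
{
	order.push_back({COLLIDE, collideEvents.size()});
	collideEvents.push_back(e);
}
```

Dispatch would then walk `order` and index into the right container, so issue order survives, at the cost of one extra indirection per event.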

A third option would be to group the different types into a single uber-type, for instance using a union of structs:

struct Event
{
	unsigned char type;
	int flags;
	union
	{
		struct 
		{
			Sound* sound;
			Vec3 worldPos;
			float volume;
		} sound;
		struct 
		{
			Object* a;
			Object* b;
		} collide;
	} data;
};

This would allow us to easily put them all in the same vector and traverse that linearly when dispatching. But this obviously has the downside of all events now being as large as the largest subtype, wasting a lot of memory.
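To make the waste concrete, here is a sketch of the same layout with the union variants pulled out as named structs so their sizes can be compared (Sound and Object are left opaque, and the size relationships assume a typical 64-bit platform):

```cpp
#include <cassert>
#include <cstddef>

struct Sound;  struct Object;   // opaque for this sketch
struct Vec3 { float x, y, z; };

// Named versions of the anonymous union members, so we can take their sizes.
struct SoundData   { Sound* sound; Vec3 worldPos; float volume; };
struct CollideData { Object* a; Object* b; };

struct Event
{
	unsigned char type;
	int flags;
	union { SoundData sound; CollideData collide; } data;
};
```

Every Event pays for the largest variant, so each collide event wastes `sizeof(SoundData) - sizeof(CollideData)` bytes.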

What we ideally want is actually really simple. We want to store differently sized objects in a tightly packed lump of memory that can later be traversed linearly. Think of it as a stream of data, very similar to how state is usually serialized to disk.

The problem is that C/C++ is ill-equipped for this access pattern, since built-in arrays and containers operate on one specific type. I believe the historical reason is that older CPUs couldn’t access certain types unless they were properly aligned in memory. Unaligned reads on Alpha or SPARC CPUs (and ARM CPUs up until ARMv6) have unpredictable results or simply crash. Struct/class members in C and C++ are automatically aligned (padded with unused bytes) to circumvent this issue, and the size of a struct itself is also adjusted so that the largest member stays aligned when instances are stacked on top of each other. This only holds as long as structs of the same type are stacked, so if we manually stack structs of different types on top of each other in memory, alignment goes out the window.

On modern 64-bit architectures, alignment is less of an issue, if one at all. The CPU will gladly perform unaligned reads and writes (it may sound simple, but there’s a good amount of complex circuitry to make it happen), though there can be a slight performance penalty in some specific cases. I have not done enough research on this, but it seems that unaligned reads and writes are only penalized when the access happens to straddle two cache lines. On the other hand, performance of a lot of software running on modern CPUs is limited by memory access, and reducing memory footprint will generally increase performance, so it’s a balancing act.
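At the language level, memcpy sidesteps the question entirely: it is defined for any alignment, so the program never dereferences an unaligned pointer, and for small fixed sizes compilers typically lower it to plain (unaligned) loads and stores on x86-64. A small sketch of that idea (the helper names are mine):

```cpp
#include <cassert>
#include <cstring>

// memcpy-based accessors: well-defined for any alignment, and usually
// compiled down to single unaligned load/store instructions on x86-64.
template <class T> void storeUnaligned(unsigned char* p, const T& value)
{
	memcpy(p, &value, sizeof(T));
}

template <class T> T loadUnaligned(const unsigned char* p)
{
	T value;
	memcpy(&value, p, sizeof(T));
	return value;
}
```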

So, if we want to put our events of different types in a contiguous lump of memory while still respecting the language’s alignment rules, it could be implemented with something like this:

template <class T> void addEvent(const T& event)
{
	memcpy(mBuffer+mOffset, &event, sizeof(T));
	mOffset += sizeof(T);
}

addEvent(mySoundEvent);
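The snippet above assumes some surrounding state (mBuffer, mOffset). A minimal sketch of what that container could look like, with the class name and the fixed capacity picked arbitrarily for illustration; a real implementation would grow the buffer or fail gracefully when full:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

class EventStream
{
public:
	template <class T> void addEvent(const T& event)
	{
		assert(mOffset + sizeof(T) <= sizeof(mBuffer)); // no growth, sketch only
		memcpy(mBuffer + mOffset, &event, sizeof(T));
		mOffset += sizeof(T);
	}

	size_t size() const { return mOffset; } // bytes written so far
	void clear() { mOffset = 0; }           // reuse the buffer next frame

private:
	unsigned char mBuffer[64 * 1024];
	size_t mOffset = 0;
};
```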

This solves the problem of lining up objects of different sizes in memory, but how do we get them back when it’s time to dispatch? Since the type is stored in the first member of the base type, we can peek at the first byte to determine the type and use a similar template method for retrieving the data.

unsigned char getNextType()
{
	return mBuffer[mOffset];
}

template <class T> void getNextEvent(T& event)
{
	memcpy(&event, mBuffer+mOffset, sizeof(T));
	mOffset += sizeof(T);
}

unsigned char t = getNextType();
switch(t)
{
	case SOUND:
	{
		SoundEvent sndEvent;
		getNextEvent(sndEvent);
		dispatch(sndEvent);
		break;
	}
	...
}

First copying the event into the memory stream and then copying it again to get it back somewhat counteracts the whole idea of being really efficient, but if we disregard alignment and rework the interface, we can read and write directly into the stream instead, with something like this:

template <class T> T& addEvent()
{
	mOffset += sizeof(T);
	return *reinterpret_cast<T*>(mBuffer+mOffset-sizeof(T));
}

unsigned char getNextType()
{
	return mBuffer[mOffset];
}

template <class T> const T& getNextEvent()
{
	mOffset += sizeof(T);
	return *reinterpret_cast<const T*>(mBuffer+mOffset-sizeof(T));
}

//Produce events
SoundEvent& sndEvent = addEvent<SoundEvent>();
sndEvent.volume = 0.5f;
...

//Dispatch events
unsigned char t = getNextType();
switch(t)
{
	case SOUND:
	dispatch(getNextEvent<SoundEvent>());
	break;
	...
}
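For completeness, here is the zero-copy variant assembled into one self-contained sketch. The class name, the buffer capacity and the hasNext() helper are my additions, and note that casting into a raw byte buffer like this technically violates standard C++ alignment and object-lifetime rules; that is exactly the “disregard alignment” trade-off discussed above, and it works in practice on x86-64:

```cpp
#include <cassert>
#include <cstddef>

enum : unsigned char { SOUND = 1, COLLIDE = 2 }; // example type tags

struct Event        { unsigned char type; int flags; };
struct SoundEvent   : Event { float volume; };
struct CollideEvent : Event { int a, b; };

class EventStream
{
public:
	// Reserve space in the stream and hand back a reference to write into.
	template <class T> T& addEvent()
	{
		T& slot = *reinterpret_cast<T*>(mBuffer + mWrite);
		mWrite += sizeof(T);
		return slot;
	}

	bool hasNext() const { return mRead < mWrite; }

	// Peek at the type tag, which sits in the first byte of every event.
	unsigned char getNextType() const { return mBuffer[mRead]; }

	template <class T> const T& getNextEvent()
	{
		const T& e = *reinterpret_cast<const T*>(mBuffer + mRead);
		mRead += sizeof(T);
		return e;
	}

private:
	unsigned char mBuffer[4096];
	size_t mWrite = 0, mRead = 0;
};
```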

This would allow us to place all objects of different types linearly in a tightly packed chunk of memory and then retrieve them with no copying, no wasted memory or other overhead. Would this be faster than other methods, or would the unaligned access counteract the smaller memory footprint? Only one way to find out…