Arabic Decoding for Games & Game Engines – Part 2: Decoding

Important

The upcoming parts assume that you already have a UI system that works for English at least. We will build on top of it what makes Arabic work.

My UI renderer is not very different from most UI systems, so I have UI objects like UIText, which holds the info passed to the renderer to render text on screen.

In my system, I load font files at the start of the game/app and then build an atlas from them. This atlas contains all the codepoints that I need.

Now that the theory part is behind us, I like to make the practical part more show & tell: we show what we need to do and what the result of it is. This way you can proceed quickly through the entire series.

A given string could be anything: English, Chinese, Arabic, others, or even a mix of many of those. In order to learn about a given string, we need to decode it into a sequence of “codepoints”, which are basically the Unicode values that the UTF-8 bytes of the string represent. When we open a font file to get character info from it, we need these codepoints, as each codepoint points to a glyph in the font file.

But the problem is that not all codepoints are equal! In a given string, the encoded size of a codepoint can differ per character: it could be a single byte (like common ASCII ABC.. & 123..), two bytes (like most of Arabic), three bytes (like some Japanese or Chinese), etc. Hence we need to decode the given string into a sequence of codepoints, each codepoint standing for one character.
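To make the varying sizes concrete, here is a minimal sketch that reads the size of a UTF-8 sequence from its first byte. Utf8SequenceLength is a hypothetical helper name, not part of my engine, and it only classifies the bit pattern of the lead byte; it does not validate the bytes that follow.

```cpp
#include <cstdint>

// Hypothetical helper (an assumption for illustration, not engine code):
// returns how many bytes the UTF-8 sequence starting with `lead` occupies,
// or 0 if `lead` is a continuation byte or cannot start a sequence.
// It only looks at the bit pattern, so a few invalid lead bytes
// (0xC0, 0xC1, 0xF5..0xF7) are still reported as 2 or 4.
inline int Utf8SequenceLength(uint8_t lead)
{
    if (lead < 0x80)           return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx: e.g. Arabic letters
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx: e.g. most Japanese/Chinese
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx: e.g. emoji
    return 0;                            // 10xxxxxx or 11111xxx
}
```

For example, the Arabic letter و is stored as the two bytes 0xD9 0x88, so the lead byte 0xD9 reports a length of 2.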

For general decoding (Arabic or not) I use the little old (not very) Flexible and Economical UTF-8 Decoder function by Bjoern Hoehrmann. Arabic makes no difference here; we still decode the same way.

C++
// Copyright (c) 2008-2009 Bjoern Hoehrmann <bjoern@hoehrmann.de>
// See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.

#define UTF8_ACCEPT 0
#define UTF8_REJECT 1

static const uint8_t utf8d[] = {
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 00..1f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 20..3f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 40..5f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 60..7f
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9, // 80..9f
  7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, // a0..bf
  8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // c0..df
  0xa,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x4,0x3,0x3, // e0..ef
  0xb,0x6,0x6,0x6,0x5,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8, // f0..ff
  0x0,0x1,0x2,0x3,0x5,0x8,0x7,0x1,0x1,0x1,0x4,0x6,0x1,0x1,0x1,0x1, // s0..s0
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1, // s1..s2
  1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1, // s3..s4
  1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1, // s5..s6
  1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
};

uint32_t inline
decode(uint32_t* state, uint32_t* codep, uint32_t byte) {
  uint32_t type = utf8d[byte];

  *codep = (*state != UTF8_ACCEPT) ?
    (byte & 0x3fu) | (*codep << 6) :
    (0xff >> type) & (byte);

  *state = utf8d[256 + *state*16 + type];
  return *state;
}

So whether you have a custom function or you’re using Bjoern’s, at this step we have nothing to do except inject the string into the text object and render the decoded codepoints the exact same way as if they were English codepoints, except that we point the text object to a different font file!
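If you go the custom function route, the decoder can be as small as the sketch below. Note that this is a simplified stand-in and not Bjoern’s DFA: DecodeUtf8Simple is a hypothetical name, it does not reject overlong encodings or surrogates, and any malformed byte is replaced by U+FFFD.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (an assumption for illustration): decodes a UTF-8 string
// into a list of codepoints. Simplified on purpose; malformed input becomes U+FFFD.
inline std::vector<uint32_t> DecodeUtf8Simple(const std::string& text)
{
    std::vector<uint32_t> codepoints;
    std::size_t i = 0;
    while (i < text.size())
    {
        const uint8_t lead = static_cast<uint8_t>(text[i]);
        uint32_t cp = 0;
        std::size_t continuation = 0; // number of bytes that follow the lead byte

        if (lead < 0x80)                { cp = lead;         continuation = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1Fu; continuation = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0Fu; continuation = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07u; continuation = 3; }
        else { codepoints.push_back(0xFFFD); ++i; continue; } // not a valid lead byte

        // The whole sequence must fit in the string and every following byte
        // must look like 10xxxxxx; each one contributes 6 more bits.
        bool ok = (i + continuation < text.size());
        for (std::size_t k = 1; ok && k <= continuation; ++k)
        {
            const uint8_t b = static_cast<uint8_t>(text[i + k]);
            if ((b & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (b & 0x3Fu);
        }

        if (!ok) { codepoints.push_back(0xFFFD); ++i; continue; }

        codepoints.push_back(cp);
        i += continuation + 1;
    }
    return codepoints;
}
```

Passing the single letter و (bytes 0xD9 0x88) gives back one codepoint, 0x648, which is the value we later look up in the font data.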

Considering you passed this string to a text object:

“و الشويه كلاااااام و كلام و الكلام العربى فى سطر لا جديد ههههه ٠١٢٣٤٥٦٧٨٩”

This string may have no meaning, but it includes connections, ligatures, repeated sequences and Arabic numbers. In short, it is a good test case for all the upcoming changes to the working English text system.

With basic decoding using Bjoern’s function, passing the codepoints to the function that draws quads (the exact same code English uses, except with an Arabic TTF font file), you would get something like this:

OK, we are starting to get something, but it is wrong. What are the issues we have right now?

First, the text is rendered left to right, and at the same time the characters are all in “isolated” form, which is wrong. And this is exactly what cheap Arabic support in some games looks like! Just decode like English and ship it to players!

In the drawing function, we generate vertices for each quad that will draw a codepoint (the codepoint could be a character or a diacritic). We need to modify this drawing function to generate the quad vertices from Right to Left instead of Left to Right.

Consider this to be my drawing code (part of it):

C++
...
//some code before that

//these are used to hold the advance through the loop to the next vertices
f32 _x, _y;

//loop through all characters of the given string
for (u32 c = 0; c < _strLength; ++c)
{
	...
	//some code to fetch glyphs from the font data

	//if there is no glyph in the font data for the given character, we return
	//the glyphs are read earlier via stb_truetype.h and stored in a data structure and baked into an atlas
	if (!_glyph)
	  return;

	//locals to hold the per vertex info that we will feed to the renderer
	f32 _minX, _maxX, _minY, _maxY, _uvminX, _uvmaxX, _uvminY, _uvmaxY;

	_minX = _x + _glyph->OffsetX;
	_minY = _y + _glyph->OffsetY;

	_maxX = _minX + _glyph->Width;
	_maxY = _minY + _glyph->Height;

	_uvminX = (f32)_glyph->X / Text->Font->AtlasSizeX;
	_uvmaxX = (f32)(_glyph->X + _glyph->Width) / Text->Font->AtlasSizeX;

	_uvminY = (f32)_glyph->Y / Text->Font->AtlasSizeY;
	_uvmaxY = (f32)(_glyph->Y + _glyph->Height) / Text->Font->AtlasSizeY;

	//following code to make the actual vertices from previous info, push them into the vertex buffer, generate indices, push into the index buffer, and proceed to render
	...
}

We need to modify that part of the text renderer to draw Arabic from Right to Left, but how?

I have a helper function in my engine to check if a given codepoint is an Arabic codepoint. It does nothing more special than checking if the given codepoint (a number) is in a specific predefined range.

C++
...
//Arabic Range
#define ARABIC_RANGE_START 0x621
#define ARABIC_RANGE_END 0x64A

static inline b8 IsCharacterInArabicCodepointRange(u32 Codepoint)
{
	return Codepoint >= ARABIC_RANGE_START && Codepoint <= ARABIC_RANGE_END;
}

While this is helpful, it is not very promising for UI rendering to check this every frame for every character on screen, regardless of whether we end up treating it as an Arabic character or not. Checking the Arabic range per character would happen very frequently, and hence I wanted to stay far away from that.

What I found to be a good choice was introducing a flag (enum) per text object, ETEXTDirection, to tell whether that text object is going to draw Right to Left or Left to Right (I’ve added a few more vertical directions, just in case I’ll be supporting ancient Egyptian hieroglyphs in the future 😏). This flag is cached: we set it only once, when we initialize the text, and we may set it again if the text direction changes intentionally (through the UI or when the string itself is fully replaced).

C++
...
enum class ETEXTDirection
{
	TEXT_DIRECTION_LEFT_TO_RIGHT,
	TEXT_DIRECTION_RIGHT_TO_LEFT,
	TEXT_DIRECTION_TOP_TO_BOTTOM,
	TEXT_DIRECTION_BOTTOM_TO_TOP,
	TEXT_DIRECTION_ANY,
};

In my text rendering code, I refresh a text object only when its matrix changes (moved, animated or such) or when it gets a different string value (text reset). Otherwise, the text object renders the same buffers as in the previous frame, with no updates to it. So this new Right to Left flag is well aligned with my design, which relies a lot on caching what we can cache and not processing needless changes.
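Setting that flag once per string could look like the sketch below. It is only an illustration under assumptions: DetectTextDirection is a hypothetical name, the enum is trimmed to two values, and the rule "any codepoint in the Arabic range means Right to Left" is a simplification of my own here, not a full bidirectional text algorithm.

```cpp
#include <cstdint>
#include <vector>

// Trimmed copy of the direction enum, only the two values this sketch needs.
enum class ETEXTDirection
{
    TEXT_DIRECTION_LEFT_TO_RIGHT,
    TEXT_DIRECTION_RIGHT_TO_LEFT,
};

// Same range check as the helper shown earlier (0x621..0x64A).
static inline bool IsCharacterInArabicCodepointRange(uint32_t Codepoint)
{
    return Codepoint >= 0x621 && Codepoint <= 0x64A;
}

// Hypothetical helper: scans the decoded codepoints ONCE (at text init or when
// the string is replaced) so the per-frame drawing loop never has to run the
// range check again. Assumption: one Arabic codepoint is enough to pick RTL.
inline ETEXTDirection DetectTextDirection(const std::vector<uint32_t>& Codepoints)
{
    for (uint32_t cp : Codepoints)
    {
        if (IsCharacterInArabicCodepointRange(cp))
            return ETEXTDirection::TEXT_DIRECTION_RIGHT_TO_LEFT;
    }
    return ETEXTDirection::TEXT_DIRECTION_LEFT_TO_RIGHT;
}
```

The result would be stored on the text object and only recomputed when the string value changes, which keeps the range check out of the per-frame path.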

With such a flag, we can then modify the earlier part of the text drawing function to be more like this:

C++
...
//some code before that

//these are used to hold the advance through the loop to the next vertices
f32 _x, _y;

//loop through all characters of the given string
for (u32 c = 0; c < _strLength; ++c)
{
	...
	//some code to fetch glyphs from the font data

	//if there is no glyph in the font data for the given character, we return
	//the glyphs are read earlier via stb_truetype.h and stored in a data structure and baked into an atlas
	if (!_glyph)
	  return;

	//locals to hold the per vertex info that we will feed to the renderer
	f32 _minX, _maxX, _minY, _maxY, _uvminX, _uvmaxX, _uvminY, _uvmaxY;
  
	//We draw vertices at the cached direction of the text object
	if (Text->TextDirection == ETEXTDirection::TEXT_DIRECTION_RIGHT_TO_LEFT)
	{
		_minX = _x + _glyph->OffsetX;
		_minY = _y + _glyph->OffsetY;

		_maxX = _minX + _glyph->Width;
		_maxY = _minY + _glyph->Height;

		_uvminX = (f32)_glyph->X / Text->Font->AtlasSizeX;
		_uvmaxX = (f32)(_glyph->X + _glyph->Width) / Text->Font->AtlasSizeX;

		_uvminY = (f32)_glyph->Y / Text->Font->AtlasSizeY;
		_uvmaxY = (f32)(_glyph->Y + _glyph->Height) / Text->Font->AtlasSizeY;
	}
	else
	{
		_minX = _x + _glyph->OffsetX;
		_minY = _y + _glyph->OffsetY;

		_maxX = _minX + _glyph->Width;
		_maxY = _minY + _glyph->Height;

		_uvminX = (f32)_glyph->X / Text->Font->AtlasSizeX;
		_uvmaxX = (f32)(_glyph->X + _glyph->Width) / Text->Font->AtlasSizeX;

		_uvminY = (f32)_glyph->Y / Text->Font->AtlasSizeY;
		_uvmaxY = (f32)(_glyph->Y + _glyph->Height) / Text->Font->AtlasSizeY;
	}
	

	//following code to make the actual vertices from previous info, push them into the vertex buffer, generate indices, push into the index buffer, and proceed to render
	...
}

For now the if and the else both produce the same result, as they share the same duplicated code as a base, but we will make a few modifications to make it render Right to Left.

First we will modify _minX to not start from the left (0 + some offset); instead, we start from the other side of the screen, which is the width of the framebuffer (window) plus some offset. My renderer has functions that return the framebuffer dimensions; we will use the width one, MRIGetFramebufferWidth(), so the code changes as follows:

C++
...

for (u32 c = 0; c < _strLength; ++c)
{
	...
  
	if (Text->TextDirection == ETEXTDirection::TEXT_DIRECTION_RIGHT_TO_LEFT)
	{
		_minX = (MRIGetFramebufferWidth() - _x) + _glyph->OffsetX; //Start from Right of the screen
		_minY = _y + _glyph->OffsetY;

		_maxX = _minX + _glyph->Width;
		_maxY = _minY + _glyph->Height;

		_uvminX = (f32)_glyph->X / Text->Font->AtlasSizeX;
		_uvmaxX = (f32)(_glyph->X + _glyph->Width) / Text->Font->AtlasSizeX;

		_uvminY = (f32)_glyph->Y / Text->Font->AtlasSizeY;
		_uvmaxY = (f32)(_glyph->Y + _glyph->Height) / Text->Font->AtlasSizeY;
	}
	else
	...
}

With that little change, the text renders like this:

You can tell that we’re now drawing Right to Left, because the text starts from the right and the numbers are now at the left. But every character now looks wrong! Previously the order was wrong and started from the wrong direction, but at least the characters themselves looked correct, if isolated. Now they look both wrong and isolated.

But you know, this is not a problem at all! We have modified the positions of the vertices and where we start drawing them, but because they use the same UVs as before, each glyph renders kind of “flipped”. We can simply fix that by altering the UVs too, swapping _uvmaxX and _uvminX!

C++
...

for (u32 c = 0; c < _strLength; ++c)
{
	...
  
	if (Text->TextDirection == ETEXTDirection::TEXT_DIRECTION_RIGHT_TO_LEFT)
	{
		_minX = (MRIGetFramebufferWidth() - _x) + _glyph->OffsetX; //Start from Right of the screen
		_minY = _y + _glyph->OffsetY;

		_maxX = _minX + _glyph->Width;
		_maxY = _minY + _glyph->Height;

		_uvmaxX = (f32)_glyph->X / Text->Font->AtlasSizeX; //Right to left, use min UV x as max
		_uvminX = (f32)(_glyph->X + _glyph->Width) / Text->Font->AtlasSizeX; //Right to left, use max UV x as min

		_uvminY = (f32)_glyph->Y / Text->Font->AtlasSizeY;
		_uvmaxY = (f32)(_glyph->Y + _glyph->Height) / Text->Font->AtlasSizeY;
	}
	else
	...
}

Now if we render with that little UV change, we will see something like this:

Now the order of the letters is correct, they render Right to Left, and the UVs sample from the font atlas correctly, all by making very simple modifications to an already existing Left to Right (English text) renderer. But the problem now is that the characters are not connected to each other. Yes, it is readable and in the correct direction, but the text is still not 100% acceptable to an Arabic reader!!
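To put the two changes side by side outside of my engine, here is a small self-contained sketch of the same idea. GlyphInfo, GlyphQuad and BuildGlyphQuad are hypothetical names made up for this example, the framebuffer width is passed in as a parameter instead of calling MRIGetFramebufferWidth(), and the enum is trimmed to two values; the math itself simply mirrors the snippets above.

```cpp
#include <cstdint>

enum class ETEXTDirection
{
    TEXT_DIRECTION_LEFT_TO_RIGHT,
    TEXT_DIRECTION_RIGHT_TO_LEFT,
};

// Hypothetical stand-in for the per-glyph data baked into the atlas.
struct GlyphInfo
{
    float    OffsetX, OffsetY;   // placement offset relative to the pen position
    uint32_t X, Y, Width, Height; // glyph rectangle inside the atlas, in pixels
};

// Hypothetical result type: quad extents plus UV extents for one glyph.
struct GlyphQuad
{
    float MinX, MaxX, MinY, MaxY;
    float UvMinX, UvMaxX, UvMinY, UvMaxY;
};

// Builds one glyph quad. For Right to Left, X starts from the right edge of the
// framebuffer and the horizontal UVs are swapped, as in the snippets above.
inline GlyphQuad BuildGlyphQuad(ETEXTDirection Direction, float PenX, float PenY,
                                const GlyphInfo& Glyph, float FramebufferWidth,
                                float AtlasSizeX, float AtlasSizeY)
{
    const bool rtl = (Direction == ETEXTDirection::TEXT_DIRECTION_RIGHT_TO_LEFT);

    const float uLeft  = static_cast<float>(Glyph.X) / AtlasSizeX;
    const float uRight = static_cast<float>(Glyph.X + Glyph.Width) / AtlasSizeX;

    GlyphQuad q{};
    q.MinX = (rtl ? FramebufferWidth - PenX : PenX) + Glyph.OffsetX;
    q.MinY = PenY + Glyph.OffsetY;
    q.MaxX = q.MinX + static_cast<float>(Glyph.Width);
    q.MaxY = q.MinY + static_cast<float>(Glyph.Height);

    q.UvMinX = rtl ? uRight : uLeft; // Right to left: max UV x is used as min
    q.UvMaxX = rtl ? uLeft : uRight; // Right to left: min UV x is used as max
    q.UvMinY = static_cast<float>(Glyph.Y) / AtlasSizeY;
    q.UvMaxY = static_cast<float>(Glyph.Y + Glyph.Height) / AtlasSizeY;
    return q;
}
```

Keeping both directions in one function like this is just one possible layout; the if/else from the snippets above does the same job and may be easier to extend once the vertical directions come into play.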

Let’s solve this in the next part of the series.

-m
