Arabic Decoding for Games & Game Engines – Part 1: Challenges

Introduction

It's funny when I take a moment to think about it: my first language, the one I was born speaking, is Arabic, and the language I speak more than 50% of the time outside working hours is Arabic too. The language I speak with my family members & close friends is Arabic, and the language I talk to myself in while thinking through problems or scratching on paper while self-studying is Arabic. It also happens that over the years, the language I auto-translated from inside my head while communicating with people in the countries I lived in (in English, French & Chinese) is Arabic as well. BUT, ironically, Arabic is the only language whose alphabet I never worked on decoding in a game, engine or any type of software! This could be because game companies either ignore Arabic localization altogether, maybe due to its complexity or market slice, or, worst case (which was the common case for a long time), port it poorly.

A few weeks ago, the Arab tech community woke up to very sad news: the passing of Mohammed Al-Sharekh, or as I & many others would call him, “The Father of Arabic Language Digitalization”, as he was the pioneer in bringing the Arabic language to digital devices in all forms. The one that matters a lot to me & many people of my generation was the Arabic language support for صخر MSX & Atari, where a lot of my childhood was spent. And yes, MSX used to have a totally different name back then in the Middle East, which was صخر.

Reading the news that day made me pause for a few minutes and think out loud to myself (still thinking in Arabic) that I always admired the effort that was put into bringing such a complicated language to computers & other screens. I know the language well, and I know how complicated it can get in terms of writing, especially when considering things like “Tashkeel”, or Arabic diacritics, which are complicated not only in writing but, in some cases, in reading & pronunciation too. I always had the benefit of learning all of this at school & practicing it when I was younger, and the complexity & beauty of the language got deciphered for me over the years, but I never tried to understand that complexity from the perspective of bytes & pixels! And hence, I decided to dive into the world of Arabic bytes & decoding, learn it by adding full Arabic support to my game engine (Delusion), and let the outcome of this entire work & effort be explained in an article (this one) as a reference for anyone unfamiliar with the topic who wants to support Arabic in their game/engine. At the same time, I do this work as a final thank-you letter to Mohammed Al-Sharekh for his part in shaping my childhood & inspiring me with Arabic content on the devices of the future, MSX & Atari.

What Makes Arabic Different?

Having properly working Arabic in a game is a problem we need to solve. Because of that, we first need to understand what makes Arabic different from most other languages, including the other Right to Left languages. When you understand the problem at hand well, you can evaluate it accurately, then you can truly understand the reasons behind it, and then you can properly solve it.

I believe that most of the games that suffer from broken Arabic support are games that never went through the simple problem-solving steps I mentioned above. Due to the lack of a correct understanding of what really differs in Arabic typography & glyphs, the problem of decoding Arabic correctly shows up & we end up with an ironically wrong (and funny) Arabic representation of the games’ localized texts.

In this first part of the series we will go through the challenges of Arabic as a language compared to other common languages. It is important to understand this layer of complexity so we can overcome it when drawing characters to the screen. But all in all, as you’ll see later, having proper Arabic is just a matter of a “few” extra lines of code to properly decode it!

Not an Arabic Crash Course!

Keep in mind the info below is not meant to teach you Arabic or make you a good reader😅. It is meant to show you how many details matter, how tiny some of them can be, and how much they can impact the final text readability.

1.Direction

Believe it or not, people who don’t interact with Arabs can be shocked to learn that Arabic is read & written from Right to Left, as most of the languages they interact with are Left to Right, and they believe that is the norm everywhere else! Arabic can’t by any means be written from Left to Right. It would not only be wrong, it would look weird, as Arabic is a language that counts a lot on the shape & the look of the characters, especially since, as you’ll see later, each character can have multiple shapes or forms.

Try giving a pen to an Arab and asking them to write an Arabic sentence from Left to Right… their brain will get stuck and won’t coordinate the hand movement!! I tried that when I was at school, and it never worked!

Unfortunately, many games display Arabic text not only with the wrong “forms” per character, but also from Left to Right. Thankfully this issue is very simple to solve, as all you need is to generate the quads from the right of the screen instead of the left. In other words, instead of starting to generate the quads from 0.0f on X and advancing forward, you start from the SCREEN_WIDTH value (let’s assume 1920 on a 1080p target) and advance the other way, towards 0.0f.
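A minimal sketch of that idea, written in Python purely for illustration (it is not the engine's code). The function name `layout_rtl` and the glyph advance values are made up for the example; only the horizontal extent of each quad is computed.

```python
from typing import List, Tuple

SCREEN_WIDTH = 1920.0  # assumed 1080p target, as in the text above


def layout_rtl(advances: List[float],
               right_edge: float = SCREEN_WIDTH) -> List[Tuple[float, float]]:
    """Return (x_left, x_right) per glyph quad, first glyph at the right edge."""
    quads = []
    pen_x = right_edge
    for advance in advances:
        # The quad occupies [pen_x - advance, pen_x], then the pen moves left.
        quads.append((pen_x - advance, pen_x))
        pen_x -= advance
    return quads


if __name__ == "__main__":
    # Three hypothetical glyph advances, in reading order.
    for quad in layout_rtl([20.0, 12.0, 16.0]):
        print(quad)
```

The first character in the string ends up rightmost, and every following quad is placed to its left, which mirrors what a Left to Right layout does from 0.0f.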

It is worth mentioning that there are some AAA companies that put good effort into making sure Arabic is well presented, companies like Ubisoft, Guerrilla, Quantic Dream (and many others among Sony’s 1st party), which is very, very appreciated by the Arab gaming community <3

Why is “Direction” a problem in Arabic Decoding?

By default, any UI system is going to draw quads on screen from Left to Right, and Arabic is the opposite of that, which means our UI system should support writing in the other direction too. Not only that, but we also need to make sure that reading/drawing the glyph pixels of a given character is done in the opposite direction.

We can still use Left to Right and draw the text characters backwards, but then we will still need to do the same hack for kerning, anchoring, etc. So, to be fair, it is the same wasted effort, but making it draw from Right to Left will be much more tidy & convenient.

Also, it is not only about the global direction per text object/block. Consider the cases where you need Arabic text that includes some English or other-language text in the middle: we are now talking about two different language directions in the same phrase, inside the same string and text object, which is a SUPER common use case among Arab users, especially in the gaming & tech space.
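To make the mixed-direction case concrete, here is a rough Python sketch that splits a string into direction runs using the bidirectional class that Unicode assigns to each character. It is a deliberate simplification, not the full Unicode Bidirectional Algorithm: neutral characters such as spaces simply stay with the run before them. The function names and the sample string are invented for the example.

```python
import unicodedata
from typing import List, Tuple


def strong_direction(ch: str) -> str:
    """Return 'rtl', 'ltr', or '' for characters with no strong direction."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):  # Hebrew-style and Arabic letters
        return "rtl"
    if bidi == "L":          # Latin and other left-to-right letters
        return "ltr"
    return ""                # spaces, digits, punctuation, marks


def split_direction_runs(text: str, base: str = "rtl") -> List[Tuple[str, str]]:
    """Split text into (direction, substring) runs, kept in memory order."""
    runs: List[Tuple[str, str]] = []
    current_dir = base
    current = ""
    for ch in text:
        # Characters without a strong direction inherit the current run.
        direction = strong_direction(ch) or current_dir
        if direction != current_dir and current:
            runs.append((current_dir, current))
            current = ""
        current_dir = direction
        current += ch
    if current:
        runs.append((current_dir, current))
    return runs


if __name__ == "__main__":
    # Arabic text with an English word in the middle.
    for direction, run in split_direction_runs("محرك Delusion جديد"):
        print(direction, repr(run))
```

Each run can then be laid out in its own direction while the runs themselves are ordered according to the base direction of the text object.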

2.Connections

English (French and others) can be written with connected/attached characters, but it is not mandatory & the default is to write it as a set of isolated characters. Other languages that use a different alphabet or character set, like Japanese or Chinese, are not attached either (at least by default, and as far as I know). Things like Hindi, Thai, etc. I’ve no idea about. But Arabic is the opposite of all that, as it MUST be written with connected/attached characters. Arabic was made to be connected: the characters of a single word must be connected together (the only breaks are after a handful of letters, such as ا, د, ر and و, which never connect to the letter that follows them).

Arabic can still look readable when not connected, but it is wrong. The only case I’ve noticed where Arabic is written & read that way is in kindergarten, so toddlers can first learn the basic shape of each character, how to form a tiny word, and how to read and write. Not long after, the toddler will start writing connected characters to form (still) simple words (like words of 3 or 4 characters). But all in all, any content presented to anyone, from kids to seniors, should have connected characters, or they would make fun of it. From books to newsletters to applications & games, Arabic needs to respect connections. In fact, good software like Microsoft Word, or even a text editor like Notepad, will never let you write two Arabic characters without a connection; it will always enforce the connection rules, which is great!

Why is “Connections” a problem in Arabic Decoding?

By default, decoding a given string will not include any character-to-character connection information, and hence the text would look wrong. It may still be readable, but it is totally wrong and disrespectful to the language and its rules.
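Since the decoded codepoints carry no connection info, it has to be derived. One possible way to sketch this in Python, and this is an assumption of the example rather than how any particular engine does it, is to ask the Unicode character database whether a letter has an "INITIAL FORM" (it can connect to the next letter) or a "FINAL FORM" (it can connect to the previous one). The helper names are made up.

```python
import unicodedata


def has_form(ch: str, form: str) -> bool:
    """True if Unicode defines e.g. 'ARABIC LETTER BEH INITIAL FORM' for ch."""
    name = unicodedata.name(ch, "")
    try:
        unicodedata.lookup(f"{name} {form} FORM")
        return True
    except KeyError:
        return False


def are_connected(first: str, second: str) -> bool:
    """Do two letters that are adjacent in memory order connect to each other?"""
    # The first letter must be able to reach forward and the second backward.
    return has_form(first, "INITIAL") and has_form(second, "FINAL")


if __name__ == "__main__":
    print(are_connected("ب", "ن"))  # BEH followed by NOON
    print(are_connected("د", "ق"))  # DAL followed by QAF
```

In the word بندقية, for instance, the letter د (DAL) has no initial form, so it never connects to the letter after it, which is why that word is drawn as two connected groups.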

3.Forms/Shapes

Each character can have up to 4 different shapes or forms (most of the characters do), based on its location in the word. These four are as follows:

Isolated: Usually used at the end of a word in some special cases. It has no connections.
Connected Initial: At the start of a word. It has 1 connection on its left, to connect with the next character in the word.
Connected Middle: In the middle of a word. It has 2 connections, one on each side, to connect with the previous & next characters in the word.
Connected Ending: At the end of a word. It has 1 connection on its right, to connect with the previous character in the word.

Now let’s take the character “Seen”, which is kind of the equivalent of “S” in English. The “Seen” changes shape based on its location in the word, and in the same order as the list above (isolated, initial, middle, ending, reading from right to left) it looks like:

س…………سـ…………ـسـ…………ـس
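These four shapes also exist as separate codepoints in Unicode's Arabic Presentation Forms blocks. The short Python sketch below looks them up by name for a given letter; the function name `presentation_forms` is invented for the example, and whether an engine should use these codepoints or the font's own shaping data is a separate decision.

```python
import unicodedata
from typing import Dict

FORMS = ("ISOLATED", "INITIAL", "MEDIAL", "FINAL")


def presentation_forms(ch: str) -> Dict[str, str]:
    """Map each available form name to its presentation-form character."""
    name = unicodedata.name(ch)  # e.g. 'ARABIC LETTER SEEN'
    forms = {}
    for form in FORMS:
        try:
            forms[form] = unicodedata.lookup(f"{name} {form} FORM")
        except KeyError:
            pass  # this letter has no such form (e.g. DAL has no INITIAL)
    return forms


if __name__ == "__main__":
    for form, glyph in presentation_forms("\u0633").items():  # SEEN
        print(f"{form:8} U+{ord(glyph):04X} {glyph}")
```

Letters that only have an isolated and a final form come back with just two entries, which is also a cheap way to spot the letters that never connect forward.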

Now, you may (or may not) have noticed that the first case (the Isolated) is what games mistakenly force, as it is the default case for any Arabic character when decoding a UTF-8 string into codepoints. So we can end up with a full word made of only Isolated forms of Arabic characters, instead of a mix that respects the position of each character within the word. Given the word “بندقية” for example, which means “a rifle”, games with poor Arabic support would show it as something like this (wrong first, then right):

Wrong (isolated): ب ن د ق ي ة
Right (connected): بندقية

Not only that, but adding the previously mentioned point, where a poor integration treats Arabic as Left to Right instead of Right to Left, we end up with a totally unreadable word. In the previous example, the wrong word is still kind of readable; it’s how a toddler would write it in school anyway, isolated characters from right to left. But poor Arabic support in a game would not only leave the characters isolated, it would also reverse their order, treating the first character as being on the left instead of the right. (Both versions below are wrong, but the first is more wrong, as it is Left to Right and Isolated, while the second is Isolated but readable & respects the order of the characters in the word.)

More wrong (isolated, reversed order): ة ي ق د ن ب
Wrong (isolated, correct order): ب ن د ق ي ة

Why is “Forms & Shapes” a problem in Arabic Decoding?

By default, decoding a given string will return the isolated shape of each character, regardless of the location of that character in the word, and regardless of whether it is part of a “Ligature” or not. Decoding UTF-8 will always give us the “raw” shape of the character, as if it were written in the void, which in turn gives us weird-looking text. So we need to make sure we set a proper shape for each character based on its location in the word and on its ligature state.
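As a rough illustration of "setting a proper shape per character", the Python sketch below picks a form for every letter of a word by looking at its neighbours. It is a simplified sketch under stated assumptions: it uses the Unicode presentation-form names as its source of truth, it ignores ligatures (such as Lam + Alef) and diacritics sitting between letters, and the names `form_of` and `shape_word` are invented for the example.

```python
import unicodedata
from typing import Optional


def form_of(ch: str, form: str) -> Optional[str]:
    """Return the presentation-form character, or None if there is none."""
    try:
        return unicodedata.lookup(f"{unicodedata.name(ch, '')} {form} FORM")
    except KeyError:
        return None


def shape_word(word: str) -> str:
    """Replace each letter with the form that matches its position."""
    shaped = []
    for i, ch in enumerate(word):
        prev_ch = word[i - 1] if i > 0 else ""
        next_ch = word[i + 1] if i + 1 < len(word) else ""
        # Connected to the previous letter only if that letter can reach
        # forward and this one can reach backward (and vice versa for next).
        joined_prev = (bool(prev_ch)
                       and form_of(prev_ch, "INITIAL") is not None
                       and form_of(ch, "FINAL") is not None)
        joined_next = (bool(next_ch)
                       and form_of(ch, "INITIAL") is not None
                       and form_of(next_ch, "FINAL") is not None)
        if joined_prev and joined_next:
            form = "MEDIAL"
        elif joined_prev:
            form = "FINAL"
        elif joined_next:
            form = "INITIAL"
        else:
            form = "ISOLATED"
        # Characters without presentation forms are passed through untouched.
        shaped.append(form_of(ch, form) or ch)
    return "".join(shaped)


if __name__ == "__main__":
    for ch in shape_word("بندقية"):
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

For the word بندقية this selects initial, middle and ending forms for the first three letters, then starts a new connected group after the د, since that letter does not connect forward.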

4.Points or Dots

Arabic loves the little “dots”, which are one of the cornerstones of the Arabic language as of today. There are a ton of characters that would look identical if not for the existence of one or more dots and the position of these dots. For example, take this:

(all examples here use the isolated & middle-connected forms of the characters, but this applies to all forms of a character)

ٮ , ـٮـ : No Sound

Isolated form, followed by middle-connected form, followed by the sound or English alternate

This shape has no sound, as it can be many things; in this form, as an isolated character, it has no meaning and is very vague. By adding between one and three dots, and depending on their position, it becomes a different character, like:

ب , ـبـ : B or P
ت , ـتـ : T
ث , ـثـ : TH
ن , ـنـ : N

Isolated form, followed by middle-connected form, followed by the sound or English alternate
The “N” case is a little special: as you can see, the character gets a little curve at its bottom in the isolated form

These 4 dot placements are the major ones, and they represent characters in the official alphabet. But there are also other variations that we don’t learn at school, and they exist for some reason… probably to represent different sounds from the previous cases. They are less common, but they exist, and they exist digitally as Unicode codepoints that we could decode!

ٿ………پ………ڀ

And this is not the only example. Pretty much more than 70% of the Arabic alphabet works that way, mostly with one dot that changes its position. Here is another example (you’ll see in a second why this is very interesting):

ح , ـحـ : H

Isolated form, followed by middle-connected form, followed by the sound or English alternate

This example differs a bit, as the “no-dots” form can actually be pronounced as “H” as an isolated character, but with dots it differs a lot…

ج , ـجـ : G
چ , ـچـ : J
خ , ـخـ : KH

Isolated form, followed by middle-connected form, followed by the sound or English alternate

And for that same character, there are still more forms that I don’t know or didn’t learn at school. They’re not major, but they also have a somewhat different pronunciation from the previous forms! The point is, they look different, they have an impact, and they have codepoints when we decode a string that includes them.

څ……….ڄ

Now comes the interesting part about the “dots” or “points”. While these little dots can change the entire meaning of a word, and they are important & part of the Arabic we speak, read & type today, the dots were never part of the original Arabic language in the times of the first Arabs. They were added to the language to make it easier for “strangers” or “non-Arabs” back in the day to learn to read. You can see old texts in museums that look like this:

You can see there is not a single dot on any character, and it is still readable if you are an Arab.

There is a very common type of image going around the internet, like the ones below, with no dots on any characters, where the text says something like “You don’t need dots to be able to read Arabic, and the dots were invented for non-natives, as Arabs can understand the characters from the context of the phrase……”. And it is true: I, or anyone who speaks/reads Arabic, can fully read this block of text in one go, without a single mistake, despite the fact that it should not be readable at all!

Anyway, as of today, dots are important & computers (UTF-8) take them into account, so if you type Arabic, you’ll always be typing characters with dots. So when we decode, we need to make sure that we get the exact correct character, especially when it comes to the different forms. If a character has 3 variations based on the dots, and 4 variations based on the forms, this makes a total of 12 variations for a single character!

Is “Points or Dots” a problem in Arabic Decoding?

No, it is not! Decoding a UTF-8 Arabic string will give us the correct codepoint (number) for the character, which holds the correct info about the dots (if they exist). This is not a problem, but it is something we need to check while decoding to make sure we are decoding correctly, as a single integer increment to a codepoint number can result in a totally different character that looks the same but has one more/less dot on it.
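A quick way to run that check is to decode the bytes and print each codepoint next to its official Unicode name. The Python sketch below does that; the function name `describe` is made up for the example.

```python
import unicodedata
from typing import List, Tuple


def describe(utf8_bytes: bytes) -> List[Tuple[str, str]]:
    """Decode UTF-8 bytes and pair every codepoint with its Unicode name."""
    text = utf8_bytes.decode("utf-8")
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
            for ch in text]


if __name__ == "__main__":
    # BEH (one dot below), TEH (two dots above), THEH (three dots above).
    for codepoint, name in describe("بتث".encode("utf-8")):
        print(codepoint, name)
```

TEH is U+062A and THEH is U+062B, so an off-by-one in the decoder silently swaps a two-dot letter for a three-dot one, and the names make that kind of slip easy to spot.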

5.Tashkeel (Diacritics)

Apart from the little dots, Arabic has something much more decorative, yet important. Diacritics are one more cornerstone of Arabic, but thankfully this one is not really essential in writing (handwriting or digital), as Arabs can read words without diacritics (based on context too), despite the fact that a single word can have a million (well, this is an exaggeration) meanings based on the number of diacritic marks & their locations. But again, context is king for Arabs. Besides, typing diacritics takes time; imagine typing with lots of diacritics in a chat! Nobody does!

Let’s take an example: a word without any dots or diacritics, and what the diacritics alone can make it mean:

حمام

no context, no exact meaning

حَمّام : Toilet
حَمَام : Secure
حَمام : Pigeon
حُمام : Fever
حِمامُ : Death

And that’s not all; you can see all the variations of that single word in all its states in here. And yet, this was diacritics only. Imagine that adding dots to the first character of the word (above or below it) would generate even more new words & meanings!

Some other languages do have diacritics, such as Chinese pinyin, but the importance of diacritics in Arabic comes from the fact that:
– First, there are more of them than in other languages. Where Chinese has about 4 marks/tones per character, Arabic has many, many of them that can be mixed & matched. For major pronunciation there are 4 too, but they work differently.
– Second, diacritics in other languages change the way you pronounce the character, but it remains the same character. For example, A still sounds like an ‘A’ after the tone or diacritic mark. In Arabic, however, you need to pronounce a (non-existent) O/U, E or AA after the character you are pronouncing. It can get even more complicated with the special diacritic marks added to these 3 major ones, which add an extra hidden N to that extra (already hidden) character, making it ON/UN, EN or AN, though this case usually comes on the last character of the word only. Anyway, for basic diacritics, we can take my name as an example: in Arabic it is just 4 characters, but in English… you can count!

م ح م د………محمد………مُحمَّد

From right to left: the isolated characters, the name as a word without any diacritics (correct forms & connected), and finally the name with diacritics.

Muhammad

The difference in the total character count comes from the sounds that the tiny little diacritics add to the sound of the character itself!

Why is “Tashkeel/Diacritics” a problem in Arabic Decoding?

While folks usually write Arabic digitally without diacritics, like in chats, posts or comments, if they do use them, decoding a given string will include the diacritics information as part of the string, as if a diacritic symbol were a “character” in that string, and hence it will look weird. But a diacritic symbol in Arabic is not an independent entity: it can’t be alone, and it can’t be “next” to a character. It is something that accompanies a major character from the alphabet, and it needs to be rendered above or below a character, or above or below another diacritic symbol. So when we get the codepoints from the given string, we need to make sure to treat diacritics as diacritics, not as individual characters, if they’re present. Otherwise, if they’re present and we don’t account for them, the text would be messed up & would have invalid glyphs rendered.
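One simple way to "treat diacritics as diacritics" is to group every mark with the letter it belongs to before generating any quads. The Python sketch below does this using the Unicode general category "Mn" (non-spacing mark), which the common Tashkeel marks such as fatha, damma, kasra, shadda and sukun fall under. The function name `group_clusters` is invented, and actually positioning the mark above or below the base glyph is left out.

```python
import unicodedata
from typing import List


def group_clusters(text: str) -> List[str]:
    """Attach every non-spacing mark to the base character before it."""
    clusters: List[str] = []
    for ch in text:
        if unicodedata.category(ch) == "Mn" and clusters:
            clusters[-1] += ch   # a diacritic rides on the previous base
        else:
            clusters.append(ch)  # a base character starts a new cluster
    return clusters


if __name__ == "__main__":
    # MEEM + DAMMA, HAH, MEEM + FATHA + SHADDA, DAL
    name = "\u0645\u064f\u062d\u0645\u064e\u0651\u062f"
    clusters = group_clusters(name)
    print(len(name), "codepoints,", len(clusters), "clusters")
    for cluster in clusters:
        print([unicodedata.name(c) for c in cluster])
```

The string holds 7 codepoints but only 4 clusters, so only the 4 base letters advance the pen, while the marks are drawn on top of (or under) the letter of their cluster.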

6.Contextual Words

Arabic is full of contextual words. We said earlier that similar words made up of the exact same characters in the exact same order can be differentiated by the Tashkeel/Diacritics. But there is another category of words that look the same (same characters, same order, same dots) and at the same time have the exact same Tashkeel/Diacritics (which means 100% the same sound or pronunciation), but have a totally different meaning based on context (again). Let’s take this example:

المَغْرِبُ……..المَغْرِبُ

Can you tell the difference between these two words?

Well, it is the exact same word, with the exact same characters and the exact same Tashkeel/Diacritics, which gives both the exact same pronunciation. So in that context there is no difference, and even I can’t tell a single difference here! But when put in a phrase like this…

صَلَّيْتُ المَغْرِبُ فِي بِلاد المَغْرِبُ

now they are waaaaay different words: the first one means a time of day (sunset time), while the second one means a location, or a country to be exact, which is Morocco!

Is “Contextual Words” a problem in Arabic Decoding?

No, it is not! This was to show more of the complexity of the language in terms of “readability”, and why, even with diacritics, it can still be a very contextual language for the reader.

These were pretty much all the challenges that we have to overcome in the next parts of the series! You’ll see later that decoding Arabic is just like decoding English, Chinese or Russian… with just a few small additions to the code to make sure it decodes, renders & reads correctly!

Arabic is Beautiful

I wanted to leave a note about Arabic in this part of the series, as it is the part where we talk generally about the language itself. Arabic is not only complicated in writing & grammar, it is also one of the most beautiful written languages in the world, if not the most beautiful! I’ve experienced different languages over the years due to the places where I lived & worked. It was mostly Roman letters, but I’ve also been very close to something like Chinese & its calligraphy for over a decade! But I have never seen anything with the beauty of Arabic calligraphy art! I’ve always enjoyed printing Arabic calligraphy and pinning it on my desk at the office; some are common sayings, some are wise words, some are encouraging phrases, etc. But in general, all of it looked like exceptional art, not just standard generic linear text!

Just google “Arabic Calligraphy”, and see what you feel about it… Here are some nice ones to my taste…

From Camels to Lions to Horses, Bulls & Elephants to birds, ending up with Fruits & flowers & everything else…
Arabic Calligraphy is an ancient craft that has its own beauty!

The magic is not only in the beauty of the line strokes & shapes, but also in the fact that to this day it is still a craft & a job, and many people prefer to do it 100% by hand… not in any digital form… Authentic!

Where to Start from?

Now this may sound odd. You always know where to start from, Google, right? But what exactly are you looking for on Google?

I’m by no means an expert in language support, but I wanted to leave here my two cents on the topic, as I’ve done this several times and for multiple languages in the past. My takeaway is to always look for the technical info from native speakers of the language you want to support, “EVEN IF” you don’t speak that language. But what does that mean?

Now, I wanted to integrate Arabic into my engine. I can read & write it already, and I know the grammar very well, but this is not helpful at all when we talk about reading an otf/ttf file, storing its info, decoding a given string/text to codepoints and manipulating them to draw them on screen somehow. Hence you need some technical guidance on the “non-general” aspects, which are the bytes of the language’s characters & how to treat them correctly. For that part, always try to learn from a native speaker of the language, not from an article written by a non-native speaker of the language you’re working on supporting.

You may now say, “well, it’s easy for you to say that, you speak Arabic already, so it is easy to learn from Arabic resources”, and you’re partially right, but that applies only to Arabic. Remember that I said I supported multiple languages in the past, and for each one I was looking for native/local knowledge from a native speaker’s blog or website. For example, when I worked on Chinese decoding in the past (I wrote the entire in-game chat system & its UI for CoD Heroes), and while I was surrounded by Chinese team members to help me where needed, I was watching resources on Youku & Bilibili (Chinese YouTube-like sites) with English translation (and sometimes the subtitles were in Chinese, and every minute I would pause and copy the text into Google Translate; it was tough and took lots of time, but it was worth it), because I knew that only a native Chinese speaker would give the best advice on the topic.

Back to Arabic. While looking for a good Arabic resource, it was not long before I found a great (and recent) podcast/livestream between the host of the podcast, Mohamed Elsherif, and one of the pioneers in the field of Arabic language digitalization, Usama Baioumy, who has specialized in that field since the 80’s. It was a good and very informative resource for me to learn how Arabic came to screens step by step, what the issues & challenges were & how they overcame them. Yes, maybe supporting Arabic in a game engine is not a problem now, given that every OS (and hardware) is now capable of UTF-8 & more, and it is a matter of rasterization once you’ve got the right data. But it is important to learn where we were in the past, where we are now & most importantly why Arabic is special & complex in digital form.

There is also the YouTube version of the stream, which may (or may not) have auto-generated translation. But nowadays, you can get one generated with many services!

Now that you understand some basics of Arabic’s complexity, and why every single pixel of the text on screen matters, in the next part we will see how to start decoding Arabic, which is no different than decoding any other language. The trick is what we do “after” decoding the text.

-m
