ChatGPT and Arabic Tokenization

28 March 2023

In recent months, internet users and tech enthusiasts have shared one obsession: ChatGPT. Screenshots of conversations with the tool have flooded social media platforms, praising its ability to answer questions and converse in a natural, human-like manner.

The hype is deserved, but we are here to talk about one aspect that poses a particularly challenging problem: Arabic tokenization.

What Is ChatGPT?

Who would be better at answering this question than the AI prodigy itself? We asked ChatGPT to introduce itself, and here is the outcome:

"I am an artificial intelligence language model developed by OpenAI. I am designed to process and understand natural language, and I have been trained on a massive amount of text data to assist with a wide range of natural language processing tasks.

Some of the tasks I can perform include answering questions, generating texts, and providing language translation services. My goal is to make it easier for people to interact with and use natural language technology in their daily lives."

This is a simple example of how ChatGPT works: users ask questions, and it answers.


This AI-powered chatbot is taking the world by storm and is impressing people with its capabilities; it can compose poems, write essays and product descriptions, provide content ideas, and much more.

ChatGPT can converse in many languages other than English, including French, Spanish, German, and Arabic. But as with other AI chatbots, Arabic remains a hurdle.

Yes, ChatGPT can understand Arabic inquiries and is able to translate. The results are promising, especially when using Modern Standard Arabic (Fus-ha). Still, in our experiments we encountered unnatural-sounding results and literal translations that failed to convey the message of the original text. Why?

What Is Tokenization in AI Technology?

Tokenization is a crucial element of AI technology. It breaks down a string of text or speech into identifiable units called tokens, such as words, parts of words, or individual characters.

It helps AI systems better comprehend complex tasks such as translation and natural language generation by enabling them to accurately identify subject-verb agreement, meaning, and syntax.

It also shortens the sequences a model has to handle, so AI systems can analyse data faster and with less memory than if they processed raw text one character at a time.

Tokenization is a critical tool in advancing the capabilities of AI technology and has been highly beneficial in numerous applications.
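To make the idea concrete, here is a minimal sketch of word-level tokenization in Python. It is only an illustration: the function name simple_tokenize is our own, and models such as ChatGPT rely on learned subword vocabularies rather than a rule like this one.

```python
import re

def simple_tokenize(text):
    # \w+ grabs runs of letters and digits (Arabic letters included);
    # [^\w\s] grabs each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Users ask questions, and it answers."))
# ['Users', 'ask', 'questions', ',', 'and', 'it', 'answers', '.']

# "The boy went to school" in undiacritized Arabic: four word tokens
print(len(simple_tokenize("ذهب الولد إلى المدرسة")))
# 4
```

A rule this simple works on clean, undiacritized text, which is exactly why the challenges described in the next section matter.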

What Are the Challenges of Arabic Tokenization?

Tokenization is a critical step in building language models such as ChatGPT, which are expected to generate natural language text that is grammatically correct and semantically meaningful.


The challenges of Arabic tokenization lie in the complexity of the language itself, which we can summarise in the following points:


Diacritical Marks

Arabic diacritics (small marks written above or below the letters) make the tokenization of written text more difficult. They include fatḥah, ḍammah, and kasrah, which signify short vowel sounds and must be considered when tokenizing Arabic writing.

A change in these diacritics can alter the meaning or grammatical function of a word.
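As a small illustration, the sketch below compares two words that share the same letters and differ only in their diacritics. The helper strip_diacritics is a hypothetical name of ours, and removing diacritics is just one possible normalization choice; we are not claiming that ChatGPT does this.

```python
import unicodedata

def strip_diacritics(text):
    # Drop combining marks (Unicode category "Mn"), which covers
    # fathah, dammah, kasrah and the other Arabic vowel signs.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, "he wrote"
kutub = "\u0643\u064F\u062A\u064F\u0628"         # kutub, "books"

print(kataba == kutub)                                      # False
print(strip_diacritics(kataba) == strip_diacritics(kutub))  # True
```

With the marks in place the two words are distinct strings; without them they collapse into the same three letters, and only context can tell them apart.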

A Highly Inflected Language

Inflection is the modification of words to indicate different grammatical forms, such as tense, number, gender, and case. Arabic has a complex inflection system that affects nouns, verbs and adjectives.


Therefore, tokenizing Arabic text requires understanding its morphological structure and considering the various inflections that words can have.
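The sketch below hints at the scale of the problem. It lists six inflected forms of the verb "to write", all built on the root k-t-b. A word-level tokenizer would treat each as an unrelated vocabulary item. The shares_root check is a deliberately crude stand-in of our own for real morphological analysis, which needs far more than letter matching.

```python
def shares_root(word, root):
    # True if the root consonants appear, in order, inside the word.
    letters = iter(word)
    return all(ch in letters for ch in root)

ROOT = "كتب"  # k-t-b
forms = {
    "كتب": "he wrote",
    "كتبت": "she wrote",
    "كتبوا": "they wrote",
    "يكتب": "he writes",
    "تكتب": "she writes",
    "يكتبون": "they write",
}

print(len(forms))                                   # 6 distinct surface forms
print(all(shares_root(w, ROOT) for w in forms))     # True
```

A letter-matching rule like this would also produce false matches on unrelated words, which is why proper Arabic tokenizers lean on morphological knowledge.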

Dialectal Variations

Tokenization of Arabic texts poses a further challenge because of the variety of regional dialects. All dialects stem from the same root language and use the same basic writing system. Yet differences in vocabulary, syntax and even pronunciation can render them quite distinct from one another.


Thus, Arabic tokenization requires developers and researchers to have an advanced level of linguistic knowledge in order to segment each text entry accurately and assign it the relevant features. Without this understanding, tokenization results may be inaccurate or incomplete.

AI and Humans Joining Forces

The AI vs humans debate has been circulating on the internet for years now, with many imagining dystopian movies coming true. But what if we put fiction aside and think of AI language models as human power "amplifiers"?


Despite their impressive capabilities, ChatGPT and other AI language models still have some limitations and may not always be able to accurately capture the full meaning and nuances of human language. They have difficulties understanding cultural contexts and processing complex languages like Arabic.


This is where human expertise can come in. By working with AI language models, humans can provide a critical layer of context and understanding that can enhance the accuracy and reliability of the model's output.


AI language models will provide a content base that humans can enhance by:

  • Identifying and correcting errors
  • Providing additional context that the model may not be able to capture on its own
  • Adding an engaging human touch to appeal to readers

Combining AI language models and human expertise can be a powerful tool for achieving faster and more accurate results.


At e-Arabization, we combine machine translation with our in-house team’s editing expertise to deliver high-quality results that will empower your business.

Book a consultation today to learn more about our machine translation services and how we customise them to meet your business requirements.