Photo by Mojahid Mottakin on Unsplash
As the world becomes increasingly interconnected, the demand for localized and translated digital technology continues to rise. Catering to diverse target audiences requires organisations to adapt digital products and services to be more accessible and culturally appropriate for different groups of people. Large language models (LLMs) – machine learning models that use deep learning techniques for language processing tasks – have the potential to play a significant role here. In this article, we will discuss the benefits and challenges of applying LLMs for localization and translation in the digital realm.
Localization and translation are critical aspects of developing any kind of digital technology, especially within the free and open-source software community. By localizing and translating software, developers ensure that their products can be used and understood by people from different linguistic and cultural backgrounds. Web-based translation tools are essential in facilitating collaboration among translators and project maintainers, streamlining the translation process, tracking progress, and maintaining consistency across different languages. Web-based translation tools such as Crowdin and Weblate have already integrated machine translation services like Google Translate and DeepL to provide suggestions and accelerate the translation process. However, this application of machine translation has its limitations, such as producing inaccurate translations or struggling to capture idiomatic expressions and cultural nuances.
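To illustrate how such an integration typically works, here is a minimal sketch in Python using DeepL's official client library. The auth key and example strings are placeholders, and real integrations in tools like Weblate involve considerably more plumbing (caching, quality checks, per-project configuration); this only shows the basic request for a suggestion.

```python
# Minimal sketch: fetching a machine translation suggestion from DeepL.
# The auth key and example strings are placeholders, not real credentials.
import deepl

translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")

def suggest_translation(source_text: str, target_lang: str) -> str:
    """Return a raw machine translation suggestion for a human translator to review."""
    result = translator.translate_text(source_text, target_lang=target_lang)
    return result.text

# Example: suggest an Indonesian translation for a user interface string.
print(suggest_translation("Save your changes before closing.", "ID"))
```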
LLMs such as OpenAI's GPT-4, along with dedicated neural machine translation frameworks like Marian NMT, can potentially revolutionise the localization and translation of digital technology. These models utilise advanced techniques like the transformer architecture and attention mechanisms to process and generate more accurate and contextually relevant text. The attention mechanism allows the model to focus on different parts of the input text when generating translations, helping it capture context and maintain accuracy. As the models are exposed to diverse language patterns, idiomatic expressions, and cultural nuances during training, they develop a better understanding of the languages they work with. These models are already applied in chatbots, writing assistants, and content-generation tools.
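As a concrete illustration, the sketch below loads one of the publicly available Marian NMT models through the Hugging Face transformers library and translates a short string. The model name shown (an English-to-French model) is just one example of the many language pairs published by the Helsinki-NLP group.

```python
# Minimal sketch: translating text with a pretrained Marian NMT model
# via Hugging Face transformers. Requires: pip install transformers sentencepiece
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fr"  # English -> French; many other pairs exist
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

source = ["Open-source software belongs to everyone."]
inputs = tokenizer(source, return_tensors="pt", padding=True)

# The encoder-decoder attention lets the model weigh different source tokens
# while generating each target token.
generated = model.generate(**inputs)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```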
Software and app developers could benefit from integrating LLMs into translation tools, as this would reduce the time and effort required for manual editing while better capturing cultural nuances. The overall quality of translations in free and open-source software projects would likely improve, streamlining the localization process.
An alternative approach is the use of human-in-the-loop (HitL) systems. In this setup, LLMs would act as an aid to human translators, providing suggestions or initial translations that can be reviewed and refined by the translator. Having human translators work in tandem with a natural language processing system can help overcome some of the limitations and biases of LLMs, ensuring higher-quality and more accurate translations. HitL systems can also preserve the collaborative nature of the translation process, allowing translators to leverage the power of LLMs without sacrificing the open and inclusive spirit of the open-source community.
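A human-in-the-loop workflow can be sketched very simply: the model proposes, the translator decides. In the hypothetical snippet below, machine_translate is a stand-in for whichever LLM or machine translation service a project actually uses; the point is that nothing is stored without human review.

```python
# Minimal human-in-the-loop sketch. `machine_translate` is a placeholder for
# any LLM or NMT backend; a real system would call an external service here.
def machine_translate(source_text: str, target_lang: str) -> str:
    # Placeholder suggestion; replace with a real API call in practice.
    return f"[{target_lang}] {source_text}"

def review_string(source_text: str, target_lang: str) -> str:
    """Show the machine suggestion and let a human accept or rewrite it."""
    suggestion = machine_translate(source_text, target_lang)
    print(f"Source:     {source_text}")
    print(f"Suggestion: {suggestion}")
    edited = input("Press Enter to accept, or type a corrected translation: ")
    return edited.strip() or suggestion

if __name__ == "__main__":
    final = review_string("Your session has expired.", "id")
    print(f"Stored translation: {final}")
```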
So why not integrate LLMs in software or app localization and translation?
Despite the potential advantages, challenges and limitations accompany the use of LLMs in translation tools. For instance, these models can generate incorrect or nonsensical translations because of biases in the training data or lack of context. Since these models learn from vast amounts of text data, they can inadvertently adopt biases present in those texts, which could lead to translations that are prejudiced, offensive, or just factually incorrect. The lack of context can result in translations that are grammatically correct but semantically inaccurate, as the model may not fully understand the nuances of the source text or the specific domain knowledge required for accurate translation.
LLMs have shown promising results in localization and translation for well-resourced languages such as English, but their performance is wanting for lesser-resourced languages with limited available training data, resulting in suboptimal translations. Community-driven initiatives like EngageMedia’s digital security localization project play a vital role in addressing the limitations of machine translations. To ensure that the benefits of advanced translation technologies are extended to all language communities, it is crucial to invest in data collection and curation for these lesser-resourced languages. Collaborative efforts between the localization and translation industry, machine learning engineers, and native speakers can help improve the training data for these languages, leading to more accurate and culturally sensitive translations.
The costs of running LLMs can be prohibitive, and their use poses challenges for both software development and user experience. The computational resources needed to run these models can be extensive, and the large size of these models presents scalability issues. Integrating these models into existing translation systems may require developers to make substantial changes to software architecture. For smaller open-source projects with limited resources, investing in the necessary hardware or cloud infrastructure to run these models effectively may not be feasible. On the user end, the large size of these models increases the time it takes to complete translation tasks.
Another concern is that relying on a large language model might make localization and translation in digital technology development a less open and collaborative process—a core value within the open-source community. The development and training of LLMs are typically done by well-funded corporations or institutions, and their underlying algorithms and data sources are often not publicly available. As a result, the translation process might become more centralised and controlled by a few entities rather than being an open and collaborative effort driven by a diverse group of contributors.
Additionally, the closed nature of most advanced LLMs limits the ability of the open-source community to understand, adapt, or improve upon the translation process, especially to identify and correct errors or biases in the translations produced by the model. This, in turn, might lead to reduced quality and trust in the localization process itself.
The future of localization and translation of digital technology in the age of LLMs offers great opportunities, but not without some serious challenges. While these advanced models could improve translation accuracy and efficiency, their limitations and the potential reduction in openness and collaboration hinder the full realisation of these opportunities. As digital technology continues to advance, the localization and translation industry needs to strike a balance between harnessing the power of LLMs and maintaining the collaborative spirit that defines the open-source community.
Khairil Zhafri is EngageMedia’s Open and Secure Technology Specialist, working on programmatic planning, strategy, and network building to drive the implementation of the organisation’s digital rights and technology initiatives in the Asia-Pacific.