Large Language Models: India
This is a collection of articles archived for the excellence of their content.
Telugu
Swecha
As of 2025 March
Vidhatri Rao, March 13, 2025: The Indian Express
“How would you describe a cloud when it’s about to rain?”
For close to a year now, an army of volunteers in Telangana has been getting people to talk. And so, they talk — family stories, the humdrum of their daily lives, the day’s weather, fables, legends. Sometimes, they read — from newspaper reports, books. All of which is precious data, the kind that Swecha, a Hyderabad-based non-profit, hopes will some day add up to a Telugu Large Language Model (LLM).
As India attempts to build its own LLM, Swecha’s experience with documenting the language — in this case, Telugu — offers a window into what an Indian model will need to factor in.
Indian languages, rich in detail, are known to have numerous peculiarities — dialects change within a few kilometres, and there are variations depending on religion, communities, castes and so on.
Certain nuances of the language remain spoken, and are rarely documented or codified. For instance, when a cyclone approaches coastal Andhra Pradesh, people say “gaali gurralu padutunnayi (literal translation: the wind horses are singing).” Among Telugu-speaking farming communities, people describe ailments using agricultural metaphors. A popular example: “Naa oopiri panta kotala undi (My breath feels like the cutting of crops)”.
Existing LLMs such as ChatGPT, Gemini, or Grok, which are trained on large datasets, fail to capture these nuances. It’s here that Swecha and other research groups and startups that are working on creating India-specific LLMs hope to make a difference.
Swecha’s is one of the 67 proposals the India AI Mission has received since the government announced its intention to create an indigenous foundational AI model — one that would account for the peculiarities and the context in which languages are spoken in India.
The process of data collection for Swecha started in 2024, led entirely by volunteers from engineering colleges in Telangana.
In collaboration with the government of Telangana, the International Institute of Information Technology (IIIT) Hyderabad, and software companies Ozonetel and Tech Vedika, Swecha organised an “AI yatra” to familiarise the engineering students with the emerging technology and enrol them as volunteers to build the Telugu-language LLM.
A network of 40,000 volunteers emerged through this. “These volunteers collected voice and video samples of people across districts in Telangana and Andhra Pradesh. We asked people to talk about their occupations, way of life… We collected fables, local sayings… To identify dialects, we got people to read an old newspaper article. Collecting such digital-first data is not just important for tech going forward but also for cultural preservation,” explains Kiran Chandra Yarlagadda, who co-founded Swecha in 2001.
In 2005, Swecha, which is part of the Free Software Movement of India, a larger coalition of groups across the country looking to bridge the digital divide, created an operating system that enabled users to use a computer in Telugu — the first such for a regional language in India. In 2023, they created a Telugu Automatic Speech Recognition (ASR) system — which converts speech into text — with 1.5 million voice samples and 45,000 contributors.
In January 2024, Swecha released an AI storytelling project called Chandamama Kathalu, which digitised 40,000 Telugu stories from the monthly children’s magazine Chandamama.
Then, in January 2025, as the race for building AI models picked up, Yarlagadda and his team of volunteers, along with IIIT Hyderabad, started an initiative called Viswam AI to work on “AI Solutions for the Global South”.
Now, as they work on the LLM, Swecha is in the process of labelling data, that is, categorising disparate information in a way that Machine Learning (a type of AI) models can understand. The next step is to bring on board software developers who can build AI models that are ready for use.
Building the Telugu LLM has become a community activity for Swecha and Viswam AI. Over the past few months, they have been organising AI workshops for students, and Telugu indie singer Ram Miryala has agreed to perform free concerts across Telangana to encourage data collection for the LLM.
But building an LLM is no mean task. It requires extensive computational resources, specialised chips called Graphics Processing Units (GPUs), and large amounts of energy.
“Running an LLM on the scale of ChatGPT costs crores of rupees per day,” explains Yarlagadda. “But is that the way to do it? No, it is not, fundamentally. We have very different kinds of use cases (for AI in India). All our cultural connections and everything are very different. When we go forward on our journey to build our AI models, we have to think of efficient ways of doing it. The BigTech companies in America use brute force. But there are other ways of going about it.”
Yarlagadda goes on to talk about DeepSeek, the Chinese AI company that shook the world by building AI models at a fraction of the cost incurred by the tech giants in America, with similar performance on benchmarks in the field.
While Yarlagadda has high hopes for AI and where it will position itself, he says the technology needs to take everyone along. “AI will play the role of due diligence and intelligence as opposed to just automation,” he says.
But what are the use-cases of creating an LLM for an Indian language? Yarlagadda says the possibilities are endless. “The first thing is that it opens up a lot of things for Indian users who do not even know how to write. If the models are voice integrated, we will be able to communicate to anyone in another part of the world. That is the kind of a leveller it is going to be,” he says, adding that it could help with solving region-specific problems in health and agriculture.
He says that once we have this foundational model that understands regional nuances and dialects, anyone from a farmer in Srikakulam to a labourer in Nizamabad can use AI.