Our web crawler currently recognizes 107 languages and 25 scripts (like Latin, Hebrew, Han Chinese and Arabic). The list also includes some minority languages like Welsh or Catalan. Today we look at languages on the web: What are the dominant languages? How many websites make use of lesser-known languages? And how are specific languages used on certain geo TLDs?
What’s the primary language on the web?
About half of all websites globally are in English (51%), far more than in any other language. Chinese comes second at roughly 10% of websites, followed by German (7%), Spanish (4%) and Japanese (4%). But this is perhaps not that surprising, so let’s look at the other end: which languages occur least frequently on the web?
What are the least common languages on the web?
Figure 2 shows the 30 least common languages on the web. Yiddish, Pashto, Laothian, Amharic and Punjabi are among the rarest. Together, these five languages occur on nearly 4,600 websites. Despite the low overall share of these 30 languages, it’s nice to see that websites in less common languages, and even in languages at risk of extinction, still exist.
According to the United Nations Educational, Scientific and Cultural Organization (UNESCO), Frisian and Basque are classified as ‘vulnerable’, meaning most children still speak the language, but it may be restricted to certain domains (e.g., at home). Yiddish is classified as ‘definitely endangered’, indicating that children no longer learn the language as their mother tongue at home. Creating web pages in these languages may help preserve them and make them more accessible to younger generations. However, it may not always be easy to find these sites.
Do geoTLDs provide a home for less common languages?
In order to carve out a specific space on the web to promote a certain cultural or linguistic community, geoTLDs were introduced in 2012. While these primarily focus on geographical regions (other than countries), they more generally aim at providing domains for local digital identities; examples include .cymru or .wales for Wales or .cat for Catalonia. In Figure 3, we show the language distribution for a few select geoTLDs that could also be related to a specific language.
The geoTLDs .cymru and .cat are good examples of making use of the local language. The share of websites for these two geoTLDs written in the respective language is 61%. This is in contrast to .scot, where only a meagre 0.05% use (Scottish) Gaelic. We also included the .eus geoTLD for the Basque Country. The language distribution here is 45% Basque, 42% Spanish and 8% English; the remaining 5% are in other languages. Altogether this shows that the new geoTLDs can provide a good home for websites in local languages.
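For readers who want to reproduce this kind of breakdown on their own data, here is a minimal sketch of how a per-TLD language distribution could be computed. It is not the crawler's actual pipeline: the function name `language_shares`, the `(domain, language)` record format and the sample domains are all assumptions made up for illustration, and the sample is far too small to say anything about real shares.

```python
from collections import Counter, defaultdict


def language_shares(records):
    """Return {tld: {language: percent}} from (domain, language) pairs.

    Hypothetical helper: assumes each website has exactly one detected
    language and that the TLD is the last dot-separated label.
    """
    counts = defaultdict(Counter)
    for domain, language in records:
        tld = domain.rsplit(".", 1)[-1].lower()
        counts[tld][language] += 1

    shares = {}
    for tld, counter in counts.items():
        total = sum(counter.values())
        # most_common() orders languages from most to least frequent
        shares[tld] = {
            lang: round(100 * n / total, 2) for lang, n in counter.most_common()
        }
    return shares


# Made-up sample records, not real crawl data
sample = [
    ("adibidea.eus", "Basque"),
    ("example.eus", "Spanish"),
    ("denda.eus", "Basque"),
    ("shop.eus", "English"),
    ("siop.cymru", "Welsh"),
    ("example.cymru", "English"),
]

shares = language_shares(sample)
for tld, dist in shares.items():
    print(f".{tld}: {dist}")
```

Counting one language per website keeps the shares for each TLD summing to 100%, which matches how the .eus figures above add up (45% + 42% + 8% + 5%). Sites with mixed-language content would need a different counting rule.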
English is by far the most used language on the web. But the other end of the spectrum is just as interesting: for minority languages, the web can be a great outlet for preserving not only a unique language but also important cultural knowledge and practices. When a minority language is lost, it can also have serious consequences for the individuals speaking it. It can lead to a lack of access to educational and economic opportunities, as well as a sense of marginalization and exclusion from the wider society. Websites are an important contributor to protecting minority languages. Even better, websites hosted on dedicated geoTLDs can be found more easily and provide the perfect home for local languages.