Map of the world’s languages

Plots

Thanks to the ASJP database from the Max Planck Institute, I was able to follow the method of Müller and coworkers and make these plots using MDS (see Methods).

MDS plot: The 45 languages with the most native speakers, plus Danish, Finnish, Esperanto and Lojban, are projected onto a two-dimensional plane. Ideally, the closer two languages are in the plot, the closer they are in real life. However, the plot is a simplification of the distance matrix, so not all distances on the map hold. In the plot below, colors indicate how bad this simplification is for each point.

It is not possible to spread out the languages perfectly on a two-dimensional plane. The colors show how well each point fits its position. For example, Yoruba, Turkish and Korean are placed very close to each other on the plot, although they are not close according to the distance matrix. Therefore, these points are colored red.
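To make the idea of a per-point fit concrete, here is a minimal Python sketch. It assumes that the badness of a point is the sum of squared differences between its distances in the matrix and its distances on the plot. That is one plausible definition, not necessarily the one behind the colors above. The function name `point_misfit` and the three languages with their numbers are invented for illustration.

```python
import math


def point_misfit(target, points):
    """Per-point misfit: squared gaps between matrix and plotted distances.

    target -- symmetric matrix of the "true" pairwise distances
    points -- plotted coordinates, one (x, y) pair per language
    """
    n = len(points)
    return [
        sum(
            (target[i][j] - math.dist(points[i], points[j])) ** 2
            for j in range(n)
            if j != i
        )
        for i in range(n)
    ]


# Hypothetical example: A and C are plotted 2 apart, but the matrix says 5.
plotted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
matrix = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 1.0],
    [5.0, 1.0, 0.0],
]
misfit = point_misfit(matrix, plotted)  # large for A and C, zero for B
```

Points with a large value would be the ones colored red: they sit closer to (or further from) their neighbors than the matrix allows.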

Dendrogram: The distance matrix shows the distances between all pairs of languages (measured in LDN squared). The color scale goes from red for identical languages, through yellow, to white for very dissimilar languages. Using only this matrix, a method estimates the evolutionary history of the languages. This produces a phylogeny, which is shown to the left of and above the distance matrix.
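The method in question is neighbour joining (see Methods). As a rough Python sketch of its first step only: for every pair, neighbour joining computes a score Q from the pair's distance and each member's total distance to everything else, and joins the pair with the lowest Q. The function name `first_join`, the labels a to e and the toy distances are invented for illustration; a full implementation would repeat this step until the tree is complete.

```python
def first_join(names, dist):
    """Return the first pair neighbour joining would merge, with branch lengths.

    names -- labels of the items
    dist  -- symmetric distance matrix (needs at least 3 items)
    """
    n = len(names)
    totals = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Q is low when i and j are close to each other
            # but far from everything else.
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # Distances from i and j to the new internal node that joins them.
    branch_i = dist[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    branch_j = dist[i][j] - branch_i
    return names[i], names[j], branch_i, branch_j


# Toy distances between five made-up items.
labels = ["a", "b", "c", "d", "e"]
toy = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
joined = first_join(labels, toy)
```

Note that d and e have the smallest raw distance, yet a and b are joined first, because Q also accounts for how far each item is from all the others.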

Discussion

With the chosen distance measure, the differences between languages are generally big. Even though the distance matrix shows some obvious language groups, the languages are mostly distant from each other. In addition, the MDS plot has many points which do not fit well at their assigned position. There is simply not enough space to distribute the languages without putting them too close to or too far from each other. Languages need a higher-dimensional space to be portrayed better.

Esperanto vs Lojban

It is easiest to learn languages close to one’s native language (Isphording, I.E. and Otten, S., 2011), which is lucky for us Europeans, because we learn English relatively easily. Esperanto, which I wrote fondly of in my previous post, is an easy language made to give more people easy access to a common language. Esperanto is often accused of being European. People say that a candidate for a global second language should not belong to a specific language group. Esperanto is obviously a European language, but what is the alternative? When languages reside in such a high-dimensional space, a fusion of all languages would also be far away from all languages. This is illustrated by Lojban, which was calculated in 1987 based on the 6 languages Mandarin, English, Spanish, Hindi, Arabic and Russian. Lojban does get positioned in the middle of the MDS plot and is only just classified into the European language group by the phylogeny. Yet, the distances between Lojban and all other languages are large. Lojban is only in the middle of the MDS plot because no language wants it close. So, measured with this distance measure, Lojban is as difficult to learn for everybody as Esperanto is for non-Europeans.

Methods

The Swadesh list is a list of 100 basic human concepts such as I, who, mountain, hear and big. Translations of the words into different languages are used to measure the distance between those languages. After its creation, the list was first extended to 207 words to increase the statistical power, but then it was reduced to the 40 words which carry most of the statistical power. (It seems that one should use a better statistical model instead, but I also think that about many things.)

The Max Planck Institute made the ASJP database, which contains the Swadesh list for more than 7000 languages. All translations are written in the same phonetic alphabet, which makes systematic approaches possible. Müller and coworkers did the following:

The distance between two words is the normalized Levenshtein distance (LDND).
- The Levenshtein distance (LD) is the smallest number of operations that it takes to transform one word into the other.
  - An operation is either a substitution, removal or addition of one letter.
- The normalization consists of two steps.
  - Dividing the distance, LD, by the length of the longest word. The result is LDN.
  - Dividing LDN by the average LDN of other words from those two languages.
The distance between two languages is the average distance between the Swadesh list words for the two languages.
Pairwise distances between all languages make a phylogeny over all languages using the neighbour joining method.
- A phylogeny explains the evolutionary history between elements, as is done for human, monkey and mouse here:

I mostly recreate the procedure of the Müller group, but I make some other choices to obtain a ‘map’ of the languages:

The distance between two words is the normalized Levenshtein distance (~~LDND~~ LDN).
- The Levenshtein distance (LD) is the smallest number of operations that it takes to transform one word into the other.
  - An operation is either a substitution, removal or addition of one letter.
- The normalization consists of ~~two steps~~ one step.
  - Dividing the distance, LD, by the length of the longest word. The result is LDN.
  - ~~Dividing LDN by the average LDN of other words from those two languages~~
The distance between two languages is the average distance between the Swadesh list words for the two languages.
Pairwise distances between ~~all~~ some languages make a phylogeny over ~~all~~ some languages using the neighbour joining method.
- A phylogeny explains the evolutionary history between elements, as is done for human, monkey and mouse here:

Pairwise distances between some languages make a ‘map’ of those languages using multidimensional scaling (MDS).
- Multidimensional scaling transforms a collection of pairwise distances into points in an n-dimensional space. Imagine that we knew all pairwise distances between cities in Denmark but not the actual positions of the cities. Then MDS (with n=2) would produce a good estimate of what the map of Danish cities would look like. MDS achieves this by minimizing the differences between the actual distances and the distances on the estimated map.
I plot the distance matrix.
- In the distance matrix, each row and each column corresponds to a language. An entry is the distance between the row language and the column language.
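The MDS step can be sketched in a few lines of Python. This is an illustration and not my notebook code (which is in R). It uses one standard way of minimizing the squared differences between the given distances and the map distances: the repeated update known as the Guttman transform, which never increases that error. All pairs are weighted equally here, so the near-zero weights mentioned below are not part of the sketch. The function names and the four "city" positions are invented.

```python
import math
import random


def pairwise(points):
    """All pairwise Euclidean distances between the points."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def stress(target, points):
    """Sum of squared differences between target and map distances."""
    current = pairwise(points)
    n = len(points)
    return sum(
        (target[i][j] - current[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


def mds(target, dims=2, iterations=500, seed=0):
    """Unweighted metric MDS from a random start via Guttman transforms."""
    rng = random.Random(seed)
    n = len(target)
    points = [[rng.uniform(-1, 1) for _ in range(dims)] for _ in range(n)]
    for _ in range(iterations):
        current = pairwise(points)
        updated = []
        for i in range(n):
            row = [0.0] * dims
            for j in range(n):
                if i == j or current[i][j] == 0:
                    continue
                # Move i away from j if they are too close on the map,
                # towards j if they are too far apart.
                ratio = target[i][j] / current[i][j]
                for k in range(dims):
                    row[k] += ratio * (points[i][k] - points[j][k])
            updated.append([value / n for value in row])
        points = updated
    return points


# Hypothetical cities at the corners of a 3 x 4 rectangle.
cities = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
distances = pairwise(cities)
estimated_map = mds(distances)
remaining_error = stress(distances, estimated_map)
```

The estimated map is only determined up to rotation, mirroring and shifting, since those operations leave all pairwise distances unchanged.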

I did not use LDND because I think it is strange and not really necessary. I only used a subset of the languages because I wanted to make a small plot. I included the 45 languages with the highest numbers of native speakers, and my love/love-to-hate languages Danish, Esperanto, Lojban and Finnish. In the MDS calculations these 4 languages were given a weight of almost 0 so that the plot would not be influenced by my hobbies.
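To make the difference between LDN and LDND concrete, here is a small Python sketch of both language distances as described in the lists above. It is an illustration and not my notebook code. The function names are invented, and the word lists use ordinary spelling for three concepts rather than the ASJP phonetic transcriptions of the 40 Swadesh words. For LDND, I assume that "other words" means all pairs of words with different meanings.

```python
def levenshtein(a, b):
    """Smallest number of substitutions, removals and additions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # remove char_a
                current[j - 1] + 1,                    # add char_b
                previous[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]


def ldn(a, b):
    """Levenshtein distance divided by the length of the longest word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def language_distance_ldn(words_a, words_b):
    """Average LDN over the concepts both word lists contain."""
    shared = [c for c in words_a if c in words_b]
    return sum(ldn(words_a[c], words_b[c]) for c in shared) / len(shared)


def language_distance_ldnd(words_a, words_b):
    """Average LDN divided by the average LDN of words with different meanings."""
    shared = [c for c in words_a if c in words_b]
    unrelated = [
        ldn(words_a[c1], words_b[c2])
        for c1 in shared
        for c2 in shared
        if c1 != c2
    ]
    baseline = sum(unrelated) / len(unrelated)
    return language_distance_ldn(words_a, words_b) / baseline


# Illustrative word lists in ordinary spelling, not ASJP transcriptions.
danish = {"I": "jeg", "you": "du", "water": "vand"}
german = {"I": "ich", "you": "du", "water": "wasser"}
plain = language_distance_ldn(danish, german)
corrected = language_distance_ldnd(danish, german)
```

The second normalization rescales the distance by how different two unrelated words from the same pair of languages typically are, for example because of similar sound inventories. Leaving it out, as I did, means such chance similarity is not corrected for.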

The data set can be downloaded from the ASJP webpage (to do exactly as I did, you should download the .zip file Dataset in CLDF [10.9MB]). My source code is on GitHub in the new fancy RStudio Notebook format.

4 thoughts on “Map of the world’s languages”

Emmanuel Tsukerman, April 18, 2017, 2:49 am

Nice. Do you have an interpretation or a historical explanation for why Persian looks like an outlier on the map?

1. svendvnielsen, April 18, 2017, 5:53 am
  
  In the chosen set of languages Persian is quite isolated. Compared to the other isolated languages, Persian is not very distant from the Indian and European languages, which allows it inside the empty area. I am not sure how significant it is, but it makes sense because Persian is spoken between Europe and India.
  
Suav, September 21, 2020, 10:48 pm

Would it be a lot of trouble to add a couple of drop down lists and a field giving the distance between two chosen languages?

1. svendvnielsen, September 22, 2020, 1:45 am
  
  For me, yes. But if you are interested in looking up specific distances, you are welcome to look at this raw list of computed distances: https://gist.github.com/svendvn/ef95c671696d9b8401a561adab974dcb (1.9 MB). You are also very welcome to make the drop down feature yourself using the list and I would be happy to add a link to it in the article 🙂
  