After my first blog post on mapping Twitter data and some Twitter maps for Luxembourg, I would like to present further maps of well-known multilingual regions in Europe and show how they emerge visually from the geocoded Twitter data I collected over the last few months.
Recall: Twitter is used by a certain segment of society, often with a technophile background. Only a fraction of all tweets is geocoded, i.e. sent by users who share their location; depending on the country, this is only 1 to 3 % of all tweets. Each tiny dot on the map represents one geocoded tweet, colored according to its language. The language is detected automatically, either on Twitter's servers or with the Python package "Chromium compact language detection". These methods are not error-free, but there is no other practical way to get a grip on data of this volume.
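To make the coloring step concrete, here is a minimal sketch of how geocoded tweets might be turned into colored map points. The tweet dictionaries, the field names `lang` and `coords`, and the palette are assumptions for illustration only; they are not the data format or the colors actually used for the maps in this post.

```python
# Hypothetical palette: language code -> dot color (illustrative only).
PALETTE = {"nl": "blue", "fr": "red", "en": "green", "tr": "magenta"}
DEFAULT_COLOR = "grey"  # used for any language without its own color


def tweets_to_points(tweets):
    """Return (lon, lat, color) tuples for the geocoded tweets only."""
    points = []
    for tweet in tweets:
        coords = tweet.get("coords")
        if coords is None:
            # Tweets without a location cannot be placed on the map.
            continue
        lon, lat = coords
        color = PALETTE.get(tweet.get("lang"), DEFAULT_COLOR)
        points.append((lon, lat, color))
    return points


# Made-up sample records, not real data.
sample = [
    {"lang": "fr", "coords": (4.35, 50.85)},
    {"lang": "nl", "coords": None},           # not geocoded, skipped
    {"lang": "lb", "coords": (6.13, 49.61)},  # no palette entry, falls back to grey
]
points = tweets_to_points(sample)
```

The resulting list of points could then be handed to any plotting library; the sketch deliberately stops before that step.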
In the following I shall concentrate on Belgium, Spain and the German-speaking countries.
1. Belgium
In total, 250,000 tweets are coded for Belgium, and the map shows a very colorful picture with 38 languages plotted. As expected, Flemish and French dominate, with 75,497 and 79,517 tweets respectively. This means that roughly 95,000 tweets are in a language other than Flemish or French. English is found quite often, and Turkish (magenta) is not rare either, especially in Flanders.
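The size of the "other languages" group can be checked with a few lines of Python. The three figures are taken from the paragraph above; the helper name `other_language_count` is purely illustrative.

```python
def other_language_count(total, *language_counts):
    """Tweets left over once the named languages are subtracted."""
    return total - sum(language_counts)


TOTAL_BELGIUM = 250_000  # rounded total reported in the text
FLEMISH = 75_497
FRENCH = 79_517

other = other_language_count(TOTAL_BELGIUM, FLEMISH, FRENCH)
share_other = other / TOTAL_BELGIUM
print(other, f"{share_other:.1%}")
```

With the rounded total of 250,000 this gives 94,986 tweets, i.e. about 38 % of the Belgian data, in languages other than Flemish and French.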
The most pertinent question for Belgium is, of course, the language border. The next map contains only the French and Flemish tweets, and the territorial division between Flanders and Wallonia is clearly discernible.
The actual structure of the so-called language border becomes less clear, though, when we draw separate maps for the two languages. We can see the Flemish region reaching southwards to a line south of Brussels, with scattered Flemish tweets all over Wallonia. Surprisingly, French-speaking Wallonia is far from clearly delineated. There is of course a larger center in the Brussels area (as there is for Flemish) and in the cities of Liège, Namur, Charleroi and Mons, but the Flemish region also contains many French tweets, e.g. along the coast (tourists?) and in the cities of Kortrijk, Gent and Antwerpen. Thus, the territoriality of the languages is far more blurred than one would expect.
This brings me finally to the officially bilingual city of Brussels. The Twitter data here shows a rather clear picture: French is by far the most used language, and Flemish is used in only approximately a sixth of the messages.
One has to bear in mind that this data represents only the language behavior of a certain social group in a certain communicative setting, which cannot be taken as the general language situation of the country. On the other hand, this data may call into question commonly held views on language distribution, e.g. regarding the clarity of the language border.
2. Spain
The situation in Spain shows at first an overwhelming prevalence of Spanish, but I was also interested in the presence of Catalan and Basque. When analysing the Twitter data I realized that Twitter's language detection algorithm does not recognize Catalan. So I had to redo the language detection using the Chromium compact language detection package, but it turned out that this algorithm is quite error-prone for short texts such as Twitter messages, with their maximum length of only 140 characters. To deal with these constraints, I used only longer tweets, i.e. those with more than 9 words, for the mapping process. This left only 2 million of the 5.3 million tweets from Spain for further analysis.

The split maps for the 5 most used languages give some insight into the Spanish regions. French and Portuguese are probably found here due to tourism and cross-border contacts. The Basque language comes up quite nicely (I will do another map for Basque later, including the region in France). Catalan, finally, is not restricted to Catalonia but is found scattered throughout the country. Compared to the number of Spanish tweets, though, Catalan is clearly in a minority position.
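The length filter described above ("more than 9 words") could look roughly like the following sketch. Counting words by splitting on whitespace is an assumption; the post does not say how words were counted, and with this approach URLs, hashtags and mentions each count as one word. The function names are illustrative.

```python
MIN_WORDS = 10  # "more than 9 words"


def is_long_enough(text, min_words=MIN_WORDS):
    """True if the tweet has at least `min_words` whitespace-separated tokens."""
    return len(text.split()) >= min_words


def filter_long_tweets(texts):
    """Keep only tweets long enough for more reliable language detection."""
    return [text for text in texts if is_long_enough(text)]


# Made-up sample tweets, not real data.
sample = [
    "Bon dia!",                                               # 2 words, dropped
    "Avui fa molt bon temps a Barcelona i anem a la platja",  # 12 words, kept
    "uno dos tres cuatro cinco seis siete ocho nueve",        # 9 words, dropped
]
kept = filter_long_tweets(sample)
```

Only the tweets that pass this filter would then be handed to the language detector, trading data volume for fewer misclassifications.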
Turning to Catalonia itself, the following map presents a more detailed view of the presence of Catalan. Taking Catalan and Spanish together, Catalan is used in approximately 20 % of the tweets. The close relatedness of the Romance languages creates problems for the automatic language detection algorithm, so the findings here have to be interpreted with great caution.
3. German-speaking countries
The following map for the German-speaking countries, i.e. Germany, Switzerland and Austria, also shows a quite colorful, multilingual picture. In general, German and Austrian Twitter users seem to be much more reluctant to share their location than, for example, the Dutch or the Spanish. Thus, the number of tweets is comparatively low for these countries (700,000 tweets in total).
Restricting the data to the most used languages besides German (and English) yields the following map. It nicely shows the French-speaking part of Switzerland and also the Italian-speaking area. For Austria one can see some Slovene tweets in the region around Klagenfurt. Several Dutch tweets (due to cross-border contacts?) show up close to the border with the Netherlands. Tweets in Turkish (yellow) can be found in numerous urban centers.
Lastly, zooming into the Berlin area reveals the presence of the minority languages Turkish and Russian.
The maps neatly illustrate the reality of multilingualism in the German-speaking countries.