Research on linguistic aspects of Twitter is still rather recent, yet it may become a promising new perspective on communication and language use in a highly globalised and mobile world. From a large-scale perspective, Mocanu et al., in their article ‘The Twitter of Babel: Mapping World Languages through Microblogging Platforms’, present several key findings on language use on Twitter worldwide. They also provide maps to illustrate the linguistic diversity in various regions. Bryden et al. analyse the actual language in tweets to identify user communities that share linguistic features. Finally, Brice Russ opens the research window towards dialectology by mapping lexical features in the U.S. with geocoded Twitter data.
One of the most appealing features of Twitter is the convenience of data collection: tweets are freely available, they can be captured in real time, they are delivered virtually to one’s desk, and they represent a specific kind of language use in a way that has not been available until now.
Interested in linguistic diversity, I have already presented my first attempts to map geocoded tweets in Europe according to the language used in the tweet. In the meantime, my collection of tweets has continued to grow and now amounts to approximately 33 million tweets for Europe. In addition to the geocodes, which are required to place the data on a map, I took the tweet’s language into account as it is defined by the ‘lang’ attribute. The tweet language is determined automatically by a language detection algorithm on the Twitter server.
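To make the data preparation a bit more concrete, here is a minimal sketch of how the relevant fields could be pulled out of a collected tweet. It assumes the tweets are stored as one JSON object per line in the classic Twitter API format, with `coordinates` as a GeoJSON point and `lang` and `user.lang` as language codes. The function names and the bounding box for Luxembourg are my own illustrative choices, and a bounding box is only a rough stand-in for the real border.

```python
import json

# Rough bounding box around Luxembourg (an approximation, not the exact border).
# Order: (min_lon, min_lat, max_lon, max_lat)
LUX_BBOX = (5.73, 49.44, 6.54, 50.19)


def extract_record(raw_line):
    """Return (lon, lat, tweet_lang, user_lang), or None if there is no point geocode."""
    tweet = json.loads(raw_line)
    coords = tweet.get("coordinates")
    if not coords or coords.get("type") != "Point":
        return None
    lon, lat = coords["coordinates"]  # GeoJSON order: longitude first, then latitude
    user_lang = (tweet.get("user") or {}).get("lang")
    return lon, lat, tweet.get("lang"), user_lang


def in_bbox(lon, lat, bbox=LUX_BBOX):
    """Check whether a point lies inside the (approximate) bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


# Invented example tweet, for illustration only.
sample = json.dumps({
    "text": "Bonjour Luxembourg",
    "lang": "fr",
    "coordinates": {"type": "Point", "coordinates": [6.13, 49.61]},
    "user": {"lang": "en"},
})

record = extract_record(sample)
if record is not None and in_bbox(record[0], record[1]):
    print(record)
```

Tweets without a point geocode are simply skipped, since they cannot be placed on a map.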
The question remains: How is multilingualism represented through Twitter? Where do we find regions of multilingualism?
In this blog post, I concentrate on Luxembourg only, a country with a high degree of societal multilingualism and a rather low degree of territorial multilingualism. The following map contains data for 8759 tweets within the territory of Luxembourg, most of them concentrated in the capital and the southern region. In total, 33 different languages are used, resulting in a complex map; the high degree of diversity is especially recognisable in the capital.
Splitting up the maps by language makes the picture a bit clearer. Most tweets, i.e. 3395, are actually sent in French, which confirms the status of French as the most used language in Luxembourg.
Although not an official language, English is the second most frequently used language, with a total of 2134 tweets. Geographically, this Twitter activity is mainly concentrated in the capital. Since several international companies and institutions with staff from all across Europe are based there, it can be presumed that these English tweets originate from these international networks.
Far fewer tweets are written in German (585) or Portuguese (527), the latter being the language of the largest immigrant group in Luxembourg.
In general the maps illustrate the high degree of linguistic diversity in Luxembourg, especially in the capital.
One might wonder about tweets in the Luxembourgish language. As a matter of fact, Luxembourgish is, at least in this data set, very rarely represented. One occasionally stumbles upon tweets like:
@andy_schleck ech well dech fir gesinn.... iooo xD mengn huelen n yougurt oda su eng tranche poulet xD haaaaassssse dat xD fille mej wei eng laich -.- j muss na 1 woch schafen goen op rodange ! -.-.-.-.- j well stiewen !!!! Haut op Besuch op der AG vun @RosaLetzebuerg Hun eng Sinusite xD
The automatic language detection algorithm on the Twitter server does not recognise Luxembourgish anyway. These tweets are therefore mostly (and wrongly) annotated as Dutch or German.
A further interesting aspect arises if one contrasts the language of a tweet with the language the user has specified in their user profile. The language indicated in the profile represents the user’s first language or the language she/he uses most, and especially in linguistically diverse contexts it need not correspond to the language of the tweet itself. The figures in the following table show that German and French are chosen as the ‘main’ language far more often than they are actually used: although the data set contains 1030 tweets with German as the user language, only 585 tweets are written in German. The difference is even more dramatic for French, which is the profile language for 4361 tweets, while only 3395 tweets in the corpus are written in French.
| |German|English|French|Portuguese|
|---|---|---|---|---|
|Language of tweet|585|2134|3395|527|
|Language chosen in user profile|1030|2179|4361|458|
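The gap between the two rows of the table can be worked out directly. The short sketch below uses only the figures given above and assumes the columns are German, English, French and Portuguese, in that order (the order in which the tweet counts are reported in the text). A positive value means the language is chosen in profiles more often than it is used in tweets.

```python
# Figures from the table; the column order is an assumption based on the text.
tweet_lang = {"German": 585, "English": 2134, "French": 3395, "Portuguese": 527}
profile_lang = {"German": 1030, "English": 2179, "French": 4361, "Portuguese": 458}

# Profile count minus tweet count, per language.
surplus = {lang: profile_lang[lang] - tweet_lang[lang] for lang in tweet_lang}

for lang, diff in surplus.items():
    print(f"{lang}: {diff:+d}")
# German: +445
# English: +45
# French: +966
# Portuguese: -69
```

Under this reading, German and French show the largest surplus on the profile side, English is almost balanced, and Portuguese is the only language used in tweets more often than it is chosen in profiles.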
The question arises which languages these German and French users then employ. Further investigation of the data set makes clear that these users mainly use English. On the other hand, quite a few users with English as their main language also write their tweets in French or, to a lesser extent, in German. Taken together, these observations illustrate the multilingual behaviour of these Twitter users.
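One way to carry out this kind of investigation is a simple cross-tabulation of profile language against tweet language. The sketch below is only an illustration of the idea: the function names are hypothetical and the handful of language pairs is invented, not taken from the actual corpus.

```python
from collections import Counter


def cross_tabulate(pairs):
    """Count how often each (profile language, tweet language) pair occurs."""
    return Counter(pairs)


def mismatches(table):
    """Keep only the pairs where profile language and tweet language differ."""
    return {pair: n for pair, n in table.items() if pair[0] != pair[1]}


# Invented toy data: (language in user profile, language of tweet).
pairs = [
    ("de", "en"), ("de", "de"),
    ("fr", "en"), ("fr", "fr"), ("fr", "en"),
    ("en", "fr"), ("en", "en"),
]

table = cross_tabulate(pairs)
for (profile, tweet), n in sorted(mismatches(table).items()):
    print(f"profile={profile} tweet={tweet}: {n}")
```

Run over the real data, the mismatching pairs are exactly the cases of interest here, for instance users with a French profile who tweet in English.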
This topic definitely deserves a closer look. I won’t go into further detail for now, but it should have become evident that the discrepancy between tweet language and the language chosen by the user is a key factor in analysing linguistic diversity on Twitter.