The University of Auckland

Mapping Languages and Demographics with Georeferenced Corpora

Version 2 2019-12-01, 23:22
Version 1 2019-09-18, 00:22
conference contribution
posted on 2019-09-18, 00:22 authored by Jonathan Dunn, Ben Adams
This paper evaluates large georeferenced corpora, taken from both web-crawled and social media sources, against ground-truth population and language-census datasets. The goal is to determine (i) which dataset best represents population demographics; (ii) in what parts of the world the datasets are most representative of actual populations; and (iii) how to weight the datasets to provide more accurate representations of underlying populations. The paper finds that the two datasets represent very different populations and that they correlate with actual populations with values of r = 0.60 (social media) and r = 0.49 (web-crawled). Further, Twitter data makes better predictions about the inventory of languages used in each country.
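
The kind of comparison described in the abstract can be illustrated with a minimal sketch. It assumes per-country corpus sizes and population counts are available as plain dictionaries; the function names (`pearson_r`, `shares`, `representation_weights`), the country labels, and all numbers are hypothetical and are not taken from the paper. The ratio-based weighting is one simple possibility, not necessarily the method the authors use.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def shares(counts):
    """Convert raw counts per country into proportions that sum to 1."""
    total = sum(counts.values())
    return {country: value / total for country, value in counts.items()}


def representation_weights(population, corpus):
    """Assumed weighting scheme: population share divided by corpus share.

    A weight above 1 means the country is under-represented in the corpus.
    Countries with no corpus data are skipped.
    """
    pop_share = shares(population)
    corpus_share = shares(corpus)
    return {
        country: pop_share[country] / corpus_share[country]
        for country in pop_share
        if corpus_share.get(country, 0) > 0
    }


# Hypothetical toy data: three made-up countries, not real figures.
population = {"A": 50, "B": 30, "C": 20}
corpus_words = {"A": 20, "B": 60, "C": 20}

countries = sorted(population)
r = pearson_r(
    [population[c] for c in countries],
    [corpus_words[c] for c in countries],
)
weights = representation_weights(population, corpus_words)
print(f"r = {r:.2f}")
print(weights)
```

In this sketch, a country whose weight is well above 1 contributes less text than its population would suggest, so its samples would be up-weighted when estimating population-level properties from the corpus.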

Publisher

University of Auckland

    GeoComputation 2019