In this paper, we analyze text mining in extremely noisy spatial datasets, such as when mapping social media posts to aspects of the physical world.
From a machine learning perspective, we analyze whether a large Twitter sample can be used to assign building functions to individual buildings. In a nutshell, we assign each tweet from our sample to the nearest building from OpenStreetMap, exploiting our high-performance implementation as described in (missing reference).
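The nearest-building assignment can be sketched as a simple nearest-neighbor query over building locations. This is only an illustrative sketch, not the high-performance implementation referenced above; the function name and the use of building centroids with a KD-tree are our own assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def assign_tweets_to_buildings(tweet_coords, building_centroids):
    """Assign each tweet to the nearest building centroid (illustrative sketch).

    tweet_coords: (n, 2) array of (lon, lat) pairs
    building_centroids: (m, 2) array of (lon, lat) pairs
    Returns an array of building indices, one per tweet.
    """
    tree = cKDTree(building_centroids)  # index all building centroids once
    _, nearest = tree.query(tweet_coords, k=1)  # nearest neighbor per tweet
    return nearest

# Hypothetical example coordinates (not from the actual dataset)
buildings = np.array([[13.400, 52.520], [13.410, 52.530]])
tweets = np.array([[13.401, 52.521], [13.409, 52.529]])
print(assign_tweets_to_buildings(tweets, buildings))  # → [0 1]
```

In practice one would project coordinates to a metric reference system before measuring distances; the plain lon/lat query here is kept only for brevity.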
The setting is extremely ill-posed for many reasons. The most pressing ones are:
Therefore, we expect that only a very small fraction of these messages is valuable. The question is how to find these few but powerful messages.
We successfully apply a technique based on information theory known as information-optimal abstaining. The paper is as reproducible as possible, including synthetic data generated by mixing English movie reviews with two strongly overlapping German corpora, Faust and Dr. Faustus.