Information-optimal Abstaining for Reliable Classification of Building Functions

In this paper, we analyze text mining in extremely noisy spatial datasets, such as when trying to map social media posts to aspects of the physical world.

From a machine learning perspective, we analyze whether a large Twitter sample can be used to assign building functions to individual buildings. In a nutshell, we assign each tweet from our sample to the nearest building from OpenStreetMap, exploiting our high-performance implementation as described in (missing reference).
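The assignment step can be illustrated with a minimal sketch. It assumes that each building is reduced to a single centroid coordinate and uses a brute-force search with the haversine distance; the function names and the tuple layout are hypothetical, and this is not the high-performance implementation referred to above.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def assign_to_nearest(tweets, buildings):
    """Map each tweet id to the id of the closest building.

    tweets:    list of (tweet_id, lat, lon)
    buildings: list of (building_id, lat, lon), one centroid per building
               (an assumption; real footprints are polygons)
    """
    result = {}
    for tweet_id, tlat, tlon in tweets:
        # Brute force: compare against every building centroid.
        nearest = min(buildings, key=lambda b: haversine_m(tlat, tlon, b[1], b[2]))
        result[tweet_id] = nearest[0]
    return result

if __name__ == "__main__":
    buildings = [("a", 48.00, 11.0), ("b", 48.10, 11.0)]
    tweets = [("t1", 48.01, 11.0), ("t2", 48.09, 11.0)]
    print(assign_to_nearest(tweets, buildings))
```

The brute-force loop costs one distance computation per tweet and building, so a large sample would need a spatial index instead.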

The setting is extremely ill-posed for many reasons. The most pressing ones are:

  • tweets are not necessarily geolocated where they originate (fake location, inaccurate location, etc.)
  • the content of tweets is rarely related to the surroundings of their origin
  • the labels are incomplete and overlap significantly: aside from residential and commercial buildings, there are mixed-use buildings, industrial buildings, and many more

Therefore, we expect that only a very small fraction of these messages is valuable. The question is how to find these few but powerful messages.

We successfully apply a technique based on information theory known as information-optimal abstaining. The paper is as preproducible as possible, including synthetic data generated by mixing movie reviews (English) with two strongly overlapping corpora, Faust and Dr. Faustus, in German.
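To give an intuition for abstaining, the following sketch shows one plausible reading of the idea for a binary, score-based classifier: the classifier abstains whenever the score falls into a band [lo, hi], and the band is chosen to maximize the empirical mutual information between the true label and the three-valued output. The criterion, the exhaustive search, and all names are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter

ABSTAIN = -1  # hypothetical marker for "no decision"

def decide(score, lo, hi):
    """Predict 0 below the band, 1 above it, and abstain inside [lo, hi]."""
    if score < lo:
        return 0
    if score > hi:
        return 1
    return ABSTAIN

def mutual_information(labels, outputs):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(labels, outputs))
    p_label = Counter(labels)
    p_output = Counter(outputs)
    mi = 0.0
    for (y, z), count in joint.items():
        mi += (count / n) * math.log2(count * n / (p_label[y] * p_output[z]))
    return mi

def best_band(scores, labels):
    """Exhaustively search abstention bands whose limits are observed scores."""
    candidates = sorted(set(scores))
    best = (-1.0, None, None)
    for i, lo in enumerate(candidates):
        for hi in candidates[i:]:
            outputs = [decide(s, lo, hi) for s in scores]
            mi = mutual_information(labels, outputs)
            if mi > best[0]:
                best = (mi, lo, hi)
    return best

if __name__ == "__main__":
    scores = [0.1, 0.2, 0.45, 0.55, 0.8, 0.9]
    labels = [0, 0, 1, 0, 1, 1]
    print(best_band(scores, labels))
```

In this toy example the two samples in the middle carry contradictory labels, so the selected band covers exactly those scores and the classifier decides only on the remaining, unambiguous samples.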


    © 2020 M. Werner