This streaming app consumes data from the `wikidata-geo` topic, normalizes it to a common schema, and analyzes it. The app decouples producing the stream from the normalizing and analyzing process.
## Normalizer
The Readable stream has two Writable streams subscribed: one for analyzing the data and building links, and one for normalizing the data to a common schema. The normalizer sends the data to the `wikidata-small` topic.
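The fan-out described above can be sketched with Node's stream API. This is a minimal sketch, not the app's actual code: the item shapes and both writables are stubs, and in production the source would be the consumer of the `wikidata-geo` topic.

```typescript
import { Readable, Writable } from "stream";

// Hypothetical raw items; in production these come from the wikidata-geo topic.
const source = Readable.from([
  { id: "Q64", labels: { en: "Berlin" } },
  { id: "Q90", labels: { en: "Paris" } },
]);

const normalized: string[] = [];
const analyzed: string[] = [];

// Writable #1: stands in for the normalizer (common schema → wikidata-small).
const normalizer = new Writable({
  objectMode: true,
  write(item, _enc, cb) {
    normalized.push(item.id);
    cb();
  },
});

// Writable #2: stands in for the analyzer (builds links → linker topic).
const analyzer = new Writable({
  objectMode: true,
  write(item, _enc, cb) {
    analyzed.push(item.id);
    cb();
  },
});

// Both writables subscribe to the same readable; each receives every item.
source.pipe(normalizer);
source.pipe(analyzer);
```

Because both writables are piped from the same readable, normalizing and analyzing stay decoupled from producing the stream.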
## Analyzer
The analyzer extracts specific wikidata properties and sends a concordance of links to the `linker` topic.
## Docker
To build the image, use the following command. The image will fetch data from a wikidata topic and stream the result back into Kafka. The container is based on Alpine Linux.
The wikidata library provides a set of stream transformers and utils for working with raw data from wikidata (dump and feed). Originally the normalizer was a pod running in Kubernetes that consumed and produced Kafka messages. During a refactoring we decided to use the transformers directly inside the dump-producer and the live-feed, saving the round trip to Kafka as well as disk storage and $$$.
## wikidata-normalizer-transformer
This transformer takes raw data from wikidata and normalizes it to the geolinker default format. It then prepares a message for Kafka.
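A normalizer of this kind can be sketched as an object-mode Transform. The target field names (`id`, `name`, `coordinates`) are assumptions for illustration, not the actual geolinker schema; `P625` is wikidata's real "coordinate location" property.

```typescript
import { Transform } from "stream";

// Minimal normalizer sketch: raw wikidata item in, flat message out.
// The output schema here is an assumption, not the real geolinker format.
const normalizerTransformer = new Transform({
  objectMode: true,
  transform(raw: any, _enc, cb) {
    // P625 = "coordinate location" in the raw wikidata claims structure.
    const coords = raw.claims?.P625?.[0]?.mainsnak?.datavalue?.value;
    cb(null, {
      id: raw.id,
      name: raw.labels?.en?.value,
      coordinates: coords ? [coords.longitude, coords.latitude] : null,
    });
  },
});
```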
## wikidata-analyzer-transformer
This transformer analyzes wikidata's raw data and extracts information about links between items. For example, we check for links to other interesting resources, extract those links, and prepare them as a message for the linker.
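One way such link extraction can look, assuming the analyzer inspects external-identifier claims (which properties it actually reads, and the message shape, are assumptions; `P1566` is wikidata's real GeoNames ID property):

```typescript
import { Transform } from "stream";

// Sketch: pull external identifiers out of the claims and emit a link
// message destined for the `linker` topic. Message shape is hypothetical.
const analyzerTransformer = new Transform({
  objectMode: true,
  transform(raw: any, _enc, cb) {
    const links: string[] = [];
    // P1566 = "GeoNames ID"; a real analyzer would check several such properties.
    const geonames = raw.claims?.P1566?.[0]?.mainsnak?.datavalue?.value;
    if (geonames) {
      links.push(`https://www.geonames.org/${geonames}`);
    }
    cb(null, { id: raw.id, links });
  },
});
```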
## wikidata-geofilter-transformer
This transformer tries to guess the type of a wikidata item. If the item belongs to a defined set of classes it forwards the message; otherwise it just dumps the message. We use it to filter the stream down to documents related to geography (e.g. locations, places, cities and so on).
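The forward-or-dump behavior maps naturally onto a Transform that only sometimes pushes a chunk. In this sketch the class set contains just `Q515` ("city") and the type is guessed from `P31` ("instance of") claims; the real filter's class set and matching logic are assumptions here.

```typescript
import { Transform } from "stream";

// Classes considered geographic; Q515 = "city". The real set is larger.
const GEO_CLASSES = new Set(["Q515"]);

const geoFilter = new Transform({
  objectMode: true,
  transform(raw: any, _enc, cb) {
    // P31 = "instance of": collect the class ids this item claims.
    const classes: string[] = (raw.claims?.P31 ?? []).map(
      (c: any) => c.mainsnak?.datavalue?.value?.id,
    );
    if (classes.some((id) => GEO_CLASSES.has(id))) {
      cb(null, raw); // forward the message
    } else {
      cb(); // dump the message: calling back without a chunk drops it
    }
  },
});
```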
## wikidata-utils
Simple utils that help with working on the raw wikidata format.
### timeToDate()
This function transforms the time value from wikidata into the date format used by the geolinker.
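Wikidata time values carry a leading sign, e.g. `"+1952-03-11T00:00:00Z"`. A minimal sketch of such a conversion, assuming the geolinker wants a plain `YYYY-MM-DD` string (the real `timeToDate()` may handle precision and calendar model differently):

```typescript
// Convert a raw wikidata time value ("+1952-03-11T00:00:00Z") into a
// plain date string. The target format is an assumption for this sketch.
function timeToDate(wikidataTime: string): string {
  const m = wikidataTime.match(/^([+-])(\d+)-(\d{2})-(\d{2})T/);
  if (!m) {
    throw new Error(`Unexpected wikidata time format: ${wikidataTime}`);
  }
  const [, sign, year, month, day] = m;
  // Keep the minus for BCE years, drop the redundant plus.
  return `${sign === "-" ? "-" : ""}${year}-${month}-${day}`;
}
```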
### WikidataProperties.getProperties(property: string, query: any = {brief: true})
This method queries the SPARQL endpoint of wikidata and extracts properties from the result. We use it to get all classes and subclasses of location.
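The subclass lookup boils down to a SPARQL query against `https://query.wikidata.org/sparql` using `P279` ("subclass of"). The exact query the method sends is an assumption; this sketch only builds the query string, with `Q17334923` ("location") as the example root class:

```typescript
// Build a SPARQL query that returns a class and all of its subclasses,
// following wdt:P279 ("subclass of") transitively. The query shape is a
// guess at what WikidataProperties.getProperties sends to the endpoint.
function subclassQuery(rootClass: string): string {
  return `SELECT ?item WHERE { ?item wdt:P279* wd:${rootClass} . }`;
}

// e.g. all classes and subclasses of "location" (Q17334923):
const locationQuery = subclassQuery("Q17334923");
```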
### WikidataProperties.init(props: IProperty[])
This method gets a list of all subclasses for a set of properties. For example, you can find all properties expressing the "end" of something.
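A sketch of what `init()` might do, with the endpoint lookup injected as a function so the example stays self-contained. The `IProperty` shape, the return type, and the stubbed subproperty ids are all assumptions; `P582` is wikidata's real "end time" property.

```typescript
// Hypothetical shape of the descriptors passed to init(); the real
// IProperty interface may differ.
interface IProperty {
  name: string;         // label for the group, e.g. "end"
  properties: string[]; // root property ids, e.g. ["P582"] ("end time")
}

// Resolve each root property to itself plus all of its subproperties.
// The real implementation would query the SPARQL endpoint; here the
// lookup is injected so the sketch is pure and testable.
function init(
  props: IProperty[],
  lookup: (id: string) => string[],
): Record<string, string[]> {
  const result: Record<string, string[]> = {};
  for (const p of props) {
    const ids: string[] = [];
    for (const id of p.properties) {
      ids.push(id, ...lookup(id));
    }
    result[p.name] = ids;
  }
  return result;
}
```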